cs.CL
[1] Slang Context-based Inference Enhancement via Greedy Search-Guided Chain-of-Thought Prompting
Jinghan Cao,Qingyang Ren,Xiangyun Chen,Xinjin Li,Haoxiang Gao,Yu Zhao
Main category: cs.CL
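TL;DR: This paper proposes a greedy search-guided chain-of-thought framework that improves the accuracy of small language models on slang interpretation, and finds that model size and temperature settings have limited impact on slang inference.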
Details
Motivation: Slang expressions are embedded in specific contextual, cultural, and linguistic frameworks; without domain-specific training data, large language models struggle to interpret slang accurately from lexical information alone.
Method: Proposes a greedy search-guided chain-of-thought framework and applies it to small language models, alongside an empirical study of how model size and temperature affect slang inference.
Result: Experiments show that the framework significantly improves the accuracy of slang meaning interpretation; model size and temperature settings have limited impact on inference accuracy, and Transformer models with more parameters do not achieve higher accuracy.
Conclusion: Slang understanding is highly context-dependent, and a structured reasoning prompting framework (such as greedy search plus chain-of-thought) can effectively strengthen small models' slang interpretation, offering a practical way to improve contextual understanding in language models.
Abstract: Slang interpretation has been a challenging downstream task for Large Language Models (LLMs) as the expressions are inherently embedded in contextual, cultural, and linguistic frameworks. In the absence of domain-specific training data, it is difficult for LLMs to accurately interpret slang meaning based on lexical information. This paper attempts to investigate the challenges of slang inference using large LLMs and presents a greedy search-guided chain-of-thought framework for slang interpretation. Through our experiments, we conclude that the model size and temperature settings have limited impact on inference accuracy. Transformer-based models with larger active parameters do not generate higher accuracy than smaller models. Based on the results of the above empirical study, we integrate greedy search algorithms with chain-of-thought prompting for small language models to build a framework that improves the accuracy of slang interpretation. The experimental results indicate that our proposed framework demonstrates improved accuracy in slang meaning interpretation. These findings contribute to the understanding of context dependency in language models and provide a practical solution for enhancing slang comprehension through a structured reasoning prompting framework.
[2] Steering at the Source: Style Modulation Heads for Robust Persona Control
Yoshihiro Izawa,Gouki Minegishi,Koshi Eguchi,Sosuke Hosokawa,Kenjiro Taura
Main category: cs.CL
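TL;DR: This paper proposes Style Modulation Heads, a fine-grained control method for large language models (LLMs): geometric analysis localizes just three key attention heads, enabling precise intervention on persona and style and substantially mitigating the coherency degradation caused by conventional residual-stream intervention.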
Details
Motivation: Conventional activation steering intervenes on the residual stream; although efficient, it degrades text coherency because it indiscriminately affects aggregated features and amplifies off-target noise.
Method: Using geometric analysis that combines layer-wise cosine similarity with head-wise contribution scores, the authors localize a sparse set of only three attention heads that independently govern persona and style (Style Modulation Heads) and apply targeted intervention to the outputs of these heads.
Result: Compared with residual-stream intervention, the method retains strong behavioral control while significantly mitigating coherency degradation, confirming that precise component-level localization improves the safety and precision of control.
Conclusion: LLM controllability can be achieved more safely and precisely by identifying and intervening on functionally specific, sparse internal components (such as particular attention heads), without fine-tuning or global intervention.
Abstract: Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While effectively controlling target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and more precise model control.
[3] Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems
Mohammad Parsa Hosseini,Ankit Shah,Saiyra Qureshi,Alex Huang,Connie Miao,Wei Wei
Main category: cs.CL
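TL;DR: This paper proposes REDEREF, a lightweight, training-free controller for multi-agent large language model (LLM) collaboration that significantly improves routing efficiency and system robustness through belief-guided delegation, reflection-driven re-routing, evidence-based selection, and memory-aware priors.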
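Illustrative sketch (not from the paper): the abstract for this entry describes belief-guided delegation via Thompson sampling. One common way to realize that idea is a Beta-Bernoulli bandit over agents, shown below. The class name `BetaBandit`, the agent names, and the `prior` argument (standing in for a memory-aware prior) are all hypothetical, and REDEREF's actual update rule may differ.

```python
import random


class BetaBandit:
    """Beta-Bernoulli Thompson sampling over agents (hypothetical helper)."""

    def __init__(self, agents, prior=(1.0, 1.0)):
        # One [alpha, beta] belief per agent; the prior could be seeded from past runs.
        self.beliefs = {agent: list(prior) for agent in agents}

    def choose(self, rng=random):
        # Draw a plausible success rate for each agent and delegate to the best draw.
        draws = {a: rng.betavariate(al, be) for a, (al, be) in self.beliefs.items()}
        return max(draws, key=draws.get)

    def update(self, agent, success):
        # A judge's verdict increments alpha on success and beta on failure.
        self.beliefs[agent][0 if success else 1] += 1.0


bandit = BetaBandit(["planner", "retriever"])
for _ in range(50):
    bandit.update("planner", True)      # planner keeps helping
    bandit.update("retriever", False)   # retriever keeps failing
picks = [bandit.choose(random.Random(i)) for i in range(100)]
```

After the updates above, almost every delegation goes to `planner`, while an agent with little history would still be tried occasionally because its belief remains wide.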
Details
Motivation: Practical deployment of multi-agent LLM systems is hindered by inefficient routing, noisy feedback, and high interaction cost.
Method: REDEREF comprises four core mechanisms: (i) belief-guided delegation via Thompson sampling; (ii) reflection-driven re-routing using a calibrated LLM or programmatic judge; (iii) evidence-based selection rather than output averaging; and (iv) memory-aware priors to reduce cold-start inefficiency.
Result: On multi-agent split-knowledge tasks, REDEREF reduces token usage by 28%, agent calls by 17%, and time-to-success by 19% compared with random recursive delegation, and it still adapts well when agents or judges degrade.
Conclusion: Simple, interpretable, training-free probabilistic control can significantly improve the efficiency and robustness of multi-agent LLM systems.
Abstract: Multi-agent large language model (LLM) systems enable complex, long-horizon reasoning by composing specialized agents, but practical deployment remains hindered by inefficient routing, noisy feedback, and high interaction cost. We introduce REDEREF, a lightweight and training-free controller for multi-agent LLM collaboration that improves routing efficiency during recursive delegation. REDEREF integrates (i) belief-guided delegation via Thompson sampling to prioritize agents with historically positive marginal contributions, (ii) reflection-driven re-routing using a calibrated LLM or programmatic judge, (iii) evidence-based selection rather than output averaging, and (iv) memory-aware priors to reduce cold-start inefficiency. Across multi-agent split-knowledge tasks, we show that while recursive retry alone saturates task success, belief-guided routing reduces token usage by 28%, agent calls by 17%, and time-to-success by 19% compared to random recursive delegation, and adapts gracefully under agent or judge degradation. These results demonstrate that simple, interpretable probabilistic control can meaningfully improve the efficiency and robustness of multi-agent LLM systems without training or fine-tuning.
[4] How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing
Javier Marín
Main category: cs.CL
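TL;DR: This paper introduces forced-completion probing and finds that, when a language model processes correct versus incorrect answers, its internal representations separate markedly across layers through rotation rather than rescaling; the model actively suppresses the correct answer instead of failing passively; and the phenomenon appears only in models at or above 1.6B parameters, suggesting a phase transition in factual processing capability.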
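Illustrative sketch (not from the paper): "rotation, not rescaling" can be read as two displacement vectors keeping nearly equal norms while the angle between them grows. The toy example below computes both quantities for a pair of made-up 2-D vectors; the paper's five geometric measurements are not specified in this entry, so the helper names and the choice of metrics here are assumptions.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


# Toy displacement vectors, invented for illustration only.
correct = [3.0, 4.0]
incorrect = [4.0, 3.0]

magnitude_ratio = norm(incorrect) / norm(correct)  # close to 1: no rescaling
separation = angle_degrees(correct, incorrect)     # above 0: the paths have rotated apart
```

A magnitude-only comparison would see no difference between these two vectors, which is the sense in which the abstract calls the effect invisible to magnitude-based methods.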
Details
Motivation: Existing work treats truthfulness as a static property of single-layer representations and lacks an understanding of how representations diverge across the full depth of the model for correct versus incorrect continuations.
Method: Introduces forced-completion probing: the same query is followed by a forced, known correct or incorrect single-token continuation, and five geometric measurements are tracked at every layer of four decoder-only models (1.5B-13B).
Result: (1) Correct and incorrect paths separate by angle (rotation) while magnitudes stay nearly unchanged; (2) the model actively suppresses the probability of the correct token; (3) both phenomena appear only in models with ≥1.6B parameters, suggesting a phase transition in factual processing capability.
Conclusion: Factual constraint processing has a specific geometric character, dominated by rotation and driven by active suppression, that single-layer probes or magnitude analyses cannot capture.
Abstract: When a language model is fed a wrong answer, what happens inside the network? Current understanding treats truthfulness as a static property of individual-layer representations: a direction to be probed, a feature to be extracted. Less is known about the dynamics: how internal representations diverge across the full depth of the network when the model processes correct versus incorrect continuations. We introduce forced-completion probing, a method that presents identical queries with known correct and incorrect single-token continuations and tracks five geometric measurements across every layer of four decoder-only models (1.5B-13B parameters). We report three findings. First, correct and incorrect paths diverge through rotation, not rescaling: displacement vectors maintain near-identical magnitudes while their angular separation increases, meaning factual selection is encoded in direction on an approximate hypersphere. Second, the model does not passively fail on incorrect input; it actively suppresses the correct answer, driving internal probability away from the right token. Third, both phenomena are entirely absent below a parameter threshold and emerge at 1.6B, suggesting a phase transition in factual processing capability. These results show that factual constraint processing has a specific geometric character (rotational, not scalar; active, not passive) that is invisible to methods based on single-layer probes or magnitude comparisons.
[5] Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation
Minsang Kim,Seung Jun Baek
Main category: cs.CL
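TL;DR: This paper proposes TSD-KD, a student-centric knowledge distillation framework that selectively distills key reasoning tokens, combines indirect distillation (preference re-ranking) with direct distillation (confidence-guided distribution matching), and adds entropy regularization, significantly improving small models on complex reasoning tasks and even surpassing the teacher model on some of them.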
Details
Motivation: Existing knowledge distillation methods require the student to mimic the teacher's distribution over the entire output, but a student with limited capacity is prone to distribution mismatch, especially on complex reasoning tasks.
Method: Proposes Token-Selective Dual Knowledge Distillation (TSD-KD): (1) indirect distillation, in which the student generates candidate responses and the teacher only re-ranks them by preference without enforcing distribution alignment; (2) direct distillation, in which distribution matching is applied only to key tokens where teacher and student confidence differ markedly; (3) entropy regularization to maintain the student's confidence.
Result: Achieves state-of-the-art results on 10 challenging reasoning benchmarks, improving accuracy over the baseline and the runner-up by up to 54.4% and 40.3%, respectively; in four cases the student outperforms its own teacher, by up to 20.3%.
Conclusion: TSD-KD is a student-centric distillation paradigm that supports the student's own reasoning and self-improvement through targeted, indirect feedback, easing the capability bottleneck of small models on complex reasoning tasks.
Abstract: Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain-of-Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re-ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self-improvement. The experiments show the state-of-the-art performance of TSD-KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner-up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD-KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at https://github.com/kmswin1/TSD-KD.
[6] Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets
Roben Delos Reyes,Timothy Douglas,Asanobu Kitamoto
Main category: cs.CL
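TL;DR: This paper proposes an agentic workflow for generating synthetic tweet datasets, addressing the difficulty of obtaining real tweet data, the high annotation cost, and the limited coverage that affect crisis informatics, and validates it on a post-earthquake damage assessment task.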
Details
Motivation: Changes in Twitter's data access policies have made real crisis tweets hard to obtain; existing annotated datasets are limited to specific past events, costly to annotate, and generalize poorly, which constrains the development and evaluation of AI systems for crisis informatics.
Method: Designs an iterative agentic workflow that generates synthetic tweets conditioned on prespecified target characteristics (such as location and damage level), evaluates them with predefined compliance checks, and uses structured feedback to refine subsequent iterations.
Result: In the post-earthquake damage assessment case study, the generated synthetic tweets accurately reflect their target location and damage-level labels; the synthetic datasets can be used to evaluate AI tasks such as geolocalization and damage level prediction, with performance close to that on real data.
Conclusion: The workflow offers a flexible, scalable new paradigm for generating synthetic social media data in crisis informatics, supporting systematic AI development and evaluation across events, societal contexts, and tasks.
Abstract: Twitter (now X) has become an important source of social media data for situational awareness during crises. Crisis informatics research has widely used tweets from Twitter to develop and evaluate artificial intelligence (AI) systems for various crisis-relevant tasks, such as extracting locations and estimating damage levels from tweets to support damage assessment. However, recent changes in Twitter's data access policies have made it increasingly difficult to curate real-world tweet datasets related to crises. Moreover, existing curated tweet datasets are limited to past crisis events in specific contexts and are costly to annotate at scale. These limitations constrain the development and evaluation of AI systems used in crisis informatics. To address these limitations, we introduce an agentic workflow for generating crisis-related synthetic tweet datasets. The workflow iteratively generates synthetic tweets conditioned on prespecified target characteristics, evaluates them using predefined compliance checks, and incorporates structured feedback to refine them in subsequent iterations. As a case study, we apply the workflow to generate synthetic tweet datasets relevant to post-earthquake damage assessment. We show that the workflow can generate synthetic tweets that capture their target labels for location and damage level. We further demonstrate that the resulting synthetic tweet datasets can be used to evaluate AI systems on damage assessment tasks like geolocalization and damage level prediction. Our results indicate that the workflow offers a flexible and scalable alternative to real-world tweet data curation, enabling the systematic generation of synthetic social media data across diverse crisis events, societal contexts, and crisis informatics applications.
[7] Widespread Gender and Pronoun Bias in Moral Judgments Across LLMs
Gustavo Lúcius Fernandes,Jeiverson C. V. M. Santos,Pedro O. S. Vaz-de-Melo
Main category: cs.CL
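TL;DR: This study systematically evaluates bias in large language models' (LLMs') moral judgments, specifically fairness classification, by controlling grammatical person, number, and gender markers at the sentence level. From 550 base sentences in the ETHICS dataset, it generates 14,850 semantically equivalent counterfactual sentences that differ only in grammatical or demographic markers and tests six major model families. It finds significant systematic bias: singular and third-person sentences are more often judged "fair" while second-person sentences are penalized, and non-binary subjects are favored most while male subjects are disfavored most. The authors attribute this to distributional and alignment biases in training and call for targeted fairness interventions in moral LLM applications.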
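Illustrative sketch (not from the paper): the abstract for this entry measures inter-group disparity with Statistical Parity Difference (SPD), which in its usual form is the gap between two groups' rates of the favorable outcome. The function names and the toy labels below are invented for illustration; the paper's grouping and aggregation details are not given in this entry.

```python
def positive_rate(labels):
    """Share of sentences judged 'fair' (label 1) within one group."""
    return sum(labels) / len(labels)


def statistical_parity_difference(group_a, group_b):
    """SPD = P(fair | group A) - P(fair | group B); 0 means parity."""
    return positive_rate(group_a) - positive_rate(group_b)


# Toy model judgments (1 = fair, 0 = unfair); invented for illustration only.
third_person = [1, 1, 1, 0]
second_person = [1, 0, 0, 0]

spd = statistical_parity_difference(third_person, second_person)
```

A positive value means the first group is judged "fair" more often than the second; with semantically equivalent sentences, any value far from zero points to the marker rather than the content.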
Details
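Motivation: LLMs are increasingly used to assess moral or ethical statements, but their judgments may reflect social and linguistic biases, so the sources of bias in their fairness judgments need to be examined systematically at a fine-grained (e.g., sentence) level.
Method: Starting from 550 balanced base sentences from the ETHICS dataset, 26 counterfactual variants are generated per sentence by systematically varying person (first/second/third), number (singular/plural), and gender markers (male/female/non-binary, etc.), for a total of 14,850 semantically equivalent sentences; binary fairness classification is evaluated on six model families (Grok, GPT, LLaMA, Gemma, DeepSeek, Mistral), and inter-group disparity is quantified with Statistical Parity Difference (SPD).
Result: Statistically significant biases are found: singular and third-person sentences are more often judged "fair", while second-person sentences are systematically penalized; gender markers have the strongest effect, with non-binary subjects consistently favored most and male subjects disfavored most; all model families show similar patterns.
Conclusion: LLM moral judgments exhibit systematic biases induced by grammatical form and demographic markers, likely rooted in training data distributions and alignment objectives; targeted fairness interventions and evaluations are therefore needed before deploying LLMs in high-stakes settings such as moral assessment.
Abstract: Large language models (LLMs) are increasingly used to assess moral or ethical statements, yet their judgments may reflect social and linguistic biases. This work presents a controlled, sentence-level study of how grammatical person, number, and gender markers influence LLM moral classifications of fairness. Starting from 550 balanced base sentences from the ETHICS dataset, we generated 26 counterfactual variants per item, systematically varying pronouns and demographic markers to yield 14,850 semantically equivalent sentences. We evaluated six model families (Grok, GPT, LLaMA, Gemma, DeepSeek, and Mistral), and measured fairness judgments and inter-group disparities using Statistical Parity Difference (SPD). Results show statistically significant biases: sentences written in the singular form and third person are more often judged as "fair", while those in the second person are penalized. Gender markers produce the strongest effects, with non-binary subjects consistently favored and male subjects disfavored. We conjecture that these patterns reflect distributional and alignment biases learned during training, emphasizing the need for targeted fairness interventions in moral LLM applications.
[8] Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities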
Yurui Zhu,Giovanni Colavizza,Matteo Romanello
Main category: cs.CL
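TL;DR: This paper presents a unified benchmark for reference extraction and parsing under realistic Social Sciences and Humanities (SSH) conditions, covering challenges such as multilinguality, footnote-embedded citations, and heterogeneous formats, and compares GROBID with several large language models (LLMs) on extraction, parsing, and end-to-end document parsing. Parsing remains the main bottleneck, and lightweight LoRA fine-tuning and hybrid deployment strategies substantially improve robustness.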
Details
Motivation: Existing evaluations of reference extraction and parsing mostly rely on clean, English, end-of-document bibliographies and fail to reflect the conditions common in the Social Sciences and Humanities (SSH): multilingual citations, references embedded in footnotes, frequent abbreviations, and heterogeneous formats. A more representative benchmark is needed.
Method: Builds on three complementary datasets (CEX, EXCITE, LinkedBooks) covering multiple languages, formats (end-of-document, footnote, mixed), and disciplines; defines three tasks of increasing difficulty (extraction, parsing, end-to-end document parsing); systematically evaluates GROBID and several recent LLMs (DeepSeek-V3.1, Mistral-Small, and others) under schema constraints; introduces LoRA fine-tuning and segmentation/pipelining strategies; and proposes a routing-based hybrid deployment that combines GROBID and LLMs.
Result: Reference extraction saturates at moderately capable models, while parsing and end-to-end parsing remain the main bottlenecks; LoRA fine-tuning yields consistent gains on SSH-heavy datasets; segmentation and pipelining substantially improve robustness; hybrid deployment balances efficiency and accuracy.
Conclusion: Reference processing for realistic SSH conditions needs to move beyond purely supervised or purely LLM-based approaches and adopt a hybrid approach that combines task adaptation, lightweight fine-tuning, and structured routing to cope with format noise, multilinguality, and heterogeneous conventions.
Abstract: Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and end-to-end document parsing -- under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL (4B-32B variants)). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains -- especially on SSH-heavy benchmarks -- and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.
[9] Privacy Preserving Topic-wise Sentiment Analysis of the Iran Israel USA Conflict Using Federated Transformer Models
Md Saiful Islam,Tanjim Taharat Aurpa,Sharad Hasan,Farzana Akter
Main category: cs.CL
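TL;DR: This study analyzes user comments on YouTube news channels and combines topic-wise sentiment analysis, deep learning, and federated learning to build a privacy-preserving framework for analyzing global public opinion, in response to the social media discussion triggered by the 2026 Iran Israel USA conflict.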
Details
Motivation: The escalation of the Iran Israel USA conflict in 2026 has triggered widespread discussion on social media worldwide, creating a need for a public sentiment analysis method that balances data privacy with analytical accuracy.
Method: Approximately 19,000 YouTube news comments are collected and preprocessed; sentiment is first labeled with VADER and then manually validated; LDA extracts conflict-related topics; several Transformer models (BERT, RoBERTa, and others) are fine-tuned and compared for sentiment classification; the best model is embedded in a federated learning framework, and SHAP is used for interpretability analysis.
Result: ELECTRA reaches 91.32% accuracy; accuracy remains at 89.59% in a two-client federated learning setup; SHAP successfully identifies the key words that influence sentiment judgments.
Conclusion: The privacy-preserving sentiment analysis framework based on Transformers and federated learning performs well on real social media data, combining high accuracy, strong privacy, and interpretability, and offers a new paradigm for monitoring global public opinion on geopolitical events.
Abstract: The recent escalation of the Iran Israel USA conflict in 2026 has triggered widespread global discussions across social media platforms. As people increasingly use these platforms for expressing opinions, analyzing public sentiment from these discussions can provide valuable insights into global public perception. This study aims to analyze global public sentiment regarding the Iran Israel USA conflict by mining user-generated comments from YouTube news channels. The work contributes to public opinion analysis by introducing a privacy-preserving framework that combines topic-wise sentiment analysis with modern deep learning techniques and Federated Learning. To achieve this, approximately 19,000 YouTube comments were collected from major international news channels and preprocessed to remove noise and normalize text. Sentiment labels were initially generated using the VADER sentiment analyzer and later validated through manual inspection to improve reliability. Latent Dirichlet Allocation (LDA) was applied to identify key discussion topics related to the conflict. Several transformer-based models, including BERT, RoBERTa, XLNet, DistilBERT, ModernBERT, and ELECTRA, were fine-tuned for sentiment classification. The best-performing model was further integrated into a federated learning environment to enable distributed training while preserving user data privacy. Additionally, Explainable Artificial Intelligence (XAI) techniques using SHAP were applied to interpret model predictions and identify influential words affecting sentiment classification. Experimental results demonstrate that transformer models perform effectively, and among them, ELECTRA achieved the best performance with 91.32% accuracy. The federated learning setup also maintained strong performance while preserving privacy, achieving 89.59% accuracy in a two-client configuration.
[10] Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
Hanwen Shen,Ting Ying,Jiajie Lu,Shanshan Wang
Main category: cs.CL
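TL;DR: This paper proposes CAP-TTA, a framework that adapts a large language model at test time with context-aware LoRA updates whenever a high-bias (out-of-distribution) prompt is detected, reducing toxic outputs, lowering update latency, and mitigating catastrophic forgetting.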
Details
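Motivation: Although debiased LLMs perform well on known bias patterns, they still tend to produce toxic outputs on unfamiliar bias prompts, which indicates a distribution shift that calls for on-the-fly adaptation.
Method: Proposes CAP-TTA, a test-time adaptation framework: high-bias prompts are identified via OOD detection, and when the bias-risk trigger exceeds a threshold, a precomputed diagonal preconditioner is used to perform lightweight, fast, and stable context-aware LoRA updates.
Result: Across multiple toxic-prompt settings and benchmarks, CAP-TTA significantly reduces bias (confirmed by human evaluation), has much lower update latency than AdamW/SGD, and substantially improves narrative fluency while maintaining state-of-the-art debiasing effectiveness, mitigating catastrophic forgetting.
Conclusion: CAP-TTA is an efficient, stable, low-overhead test-time adaptive debiasing method that balances debiasing performance, generation quality, and inference efficiency, offering a new path toward deploying safe and controllable LLMs.
Abstract: Although debiased LLMs perform well on known bias patterns, they often fail to generalize to unfamiliar bias prompts, producing toxic outputs. We first validate that such high-bias prompts constitute a distribution shift via OOD detection, and show static models degrade under this shift. To adapt on-the-fly, we propose CAP-TTA, a test-time adaptation framework that performs context-aware LoRA updates only when the bias-risk trigger exceeds a threshold, using a precomputed diagonal preconditioner for fast and stable updates. Across toxic-prompt settings and benchmarks, CAP-TTA reduces bias (confirmed by human evaluation) while achieving much lower update latency than AdamW/SGD; it also mitigates catastrophic forgetting by significantly improving narrative fluency over the SOTA debiasing baseline while maintaining comparable debiasing effectiveness.
[11] QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models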
Yao Wu,Kangping Yin,Liang Dong,Zhenxin Ma,Shuting Xu,Xuehai Wang,Yuxuan Jiang,Tingting Yu,Yunqing Hong,Jiayi Liu,Rianzhe Huang,Shuxin Zhao,Haiping Hu,Wen Shang,Jian Xu,Guanjun Jiang
Main category: cs.CL
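TL;DR: This paper presents QuarkMedBench, a benchmark for evaluating large language models in real-world medical scenarios. It covers more than 20,000 single-turn and several thousand multi-turn clinical, wellness, and professional queries, and introduces an automated scoring framework based on multi-model consensus and evidence retrieval, enabling objective evaluation that is highly reliable (91.8% agreement with experts) and can be updated dynamically.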
Details
Motivation: Existing medical LLM evaluations rely too heavily on multiple-choice questions and fail to reflect the unstructured, ambiguous, and long-tail complexity of real medical queries, so high exam scores are disconnected from low real-world response quality.
Method: Builds the QuarkMedBench dataset (20,821 single-turn and 3,853 multi-turn real medical queries) and designs an automated scoring framework that combines multi-model consensus with evidence retrieval to generate fine-grained scoring rubrics (220,617 in total), then uses hierarchical weighting and safety constraints to quantify accuracy, key-point coverage, and risk interception.
Result: The automatically generated rubrics reach 91.8% agreement with blind audits by clinical experts; baseline experiments reveal significant performance gaps among SOTA models in handling real clinical detail, exposing the limitations of conventional exam-based metrics.
Conclusion: QuarkMedBench offers a more ecologically valid, reproducible, and evolvable paradigm for evaluating medical LLMs, helping models move toward reliable real-world clinical use.
Abstract: While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate to high-quality responses for real-world medical queries. Current evaluations rely heavily on multiple-choice questions, failing to capture the unstructured, ambiguous, and long-tail complexities inherent in genuine user inquiries. To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment. We compiled a massive dataset spanning Clinical Care, Wellness Health, and Professional Inquiry, comprising 20,821 single-turn queries and 3,853 multi-turn sessions. To objectively evaluate open-ended answers, we propose an automated scoring framework that integrates multi-model consensus with evidence-based retrieval to dynamically generate 220,617 fine-grained scoring rubrics (~9.8 per query). During evaluation, hierarchical weighting and safety constraints structurally quantify medical accuracy, key-point coverage, and risk interception, effectively mitigating the high costs and subjectivity of human grading. Experimental results demonstrate that the generated rubrics achieve a 91.8% concordance rate with clinical expert blind audits, establishing highly dependable medical reliability. Crucially, baseline evaluations on this benchmark reveal significant performance disparities among state-of-the-art models when navigating real-world clinical nuances, highlighting the limitations of conventional exam-based metrics. Ultimately, QuarkMedBench establishes a rigorous, reproducible yardstick for measuring LLM performance on complex health issues, while its framework inherently supports dynamic knowledge updates to prevent benchmark obsolescence.
[12] Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models
Jon-Paul Cacioli
Main category: cs.CL
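TL;DR: This paper presents the first systematic evaluation of the mutual exclusivity (ME) bias in text-only language models trained on child-directed speech, finds that the models show repetition priming rather than ME, and argues that referential grounding may be a necessary condition for ME.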
Details
Motivation: Investigates whether text-only language models trained on child-directed speech can acquire the mutual exclusivity (ME) bias shown by human children, that is, the tendency to map novel words to novel referents.
Method: Operationalizes ME as referential suppression, runs a series of diagnostic pilots (a masked language model test, autoregressive model tests, and a context-dependence diagnostic), and, under a pre-registered design, trains 45 GPT-2-architecture models of varying size and number of training epochs and evaluates them on a standardized ME battery.
Result: All models show significant anti-ME repetition priming (85-100% of items, p < 2.4 x 10^-13); the effect weakens as language modelling improves but does not disappear; the context-dependence diagnostic replicates in all models; priming strengthens as the number of repetitions increases (significant in 8/9 cells).
Conclusion: Distributional learning on child-directed speech alone does not produce lexical mutual exclusivity and instead tends toward repetition-based reference tracking; referential grounding may be a necessary input condition for ME, which is an empirical claim rather than a nativist one.
Abstract: We present the first systematic evaluation of mutual exclusivity (ME) -- the bias to map novel words to novel referents -- in text-only language models trained on child-directed speech. We operationalise ME as referential suppression: when a familiar object is relabelled in a two-referent discourse context, ME predicts decreased probability of the labelled noun at a subsequent completion position. Three pilot findings motivate a pre-registered scale-sensitivity experiment: (1) a masked language model (BabyBERTa) is entirely insensitive to multi-sentence referential context; (2) autoregressive models show robust repetition priming -- the opposite of ME -- when familiar nouns are re-labelled; and (3) a novel context-dependence diagnostic reveals that apparent ME-like patterns with nonce tokens are fully explained by embedding similarity, not referential disambiguation. In the confirmatory experiment, we train 45 GPT-2-architecture models (2.9M, 8.9M, and 33.5M parameters; 5, 10, and 20 epochs on AO-CHILDES; 5 seeds each) and evaluate on a pre-registered ME battery. Anti-ME repetition priming is significant in all 9 cells (85-100% of items; all p < 2.4 x 10^-13). Priming attenuates with improved language modelling (Spearman rho = -0.533, p = 0.0002) but never crosses zero across a 3.8x perplexity range. The context-dependence diagnostic replicates in all 9 cells, and dose-response priming increases with repetitions in 8/9 cells (all trend p < 0.002). These findings indicate that distributional learning on child-directed speech produces repetition-based reference tracking rather than lexical exclusivity. We connect this to the grounded cognition literature and argue that referential grounding may be a necessary ingredient for ME -- an empirical claim about required input structure, not a nativist one.
[13] Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality
Taiqiang Wu,Yuxin Cheng,Chenchen Ding,Runming Yang,Xincheng Feng,Wenyong Zhou,Zhengwu Liu,Ngai Wong
Main category: cs.CL
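TL;DR: This paper studies how non-idealities in memristor-based analog compute-in-memory architectures affect the reasoning ability of large language models, evaluates three training-free strategies for improving robustness, and offers practical guidelines for different noise levels.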
Details
Motivation: Memristor-based compute-in-memory architectures offer high energy efficiency and computational density, but their intrinsic non-idealities reduce precision and hurt LLM reasoning, so systematic analysis and mitigation strategies are needed.
Method: First conducts a comprehensive empirical analysis of how typical memristor non-idealities affect LLM reasoning; then systematically evaluates three training-free strategies (thinking mode, in-context learning, and module redundancy) and summarizes patterns for improving robustness.
Result: Reasoning ability drops significantly as non-ideality increases, and the drop varies by benchmark; shallow-layer redundancy is the most effective; thinking mode works well under low noise but degrades under high noise; in-context learning shortens outputs at a slight cost in performance.
Conclusion: The study sheds new light on LLM reasoning under non-ideality and provides practical, retraining-free robustness strategies and design guidelines for deploying LLMs on memristor CIM architectures.
Abstract: Memristor-based analog compute-in-memory (CIM) architectures provide a promising substrate for the efficient deployment of Large Language Models (LLMs), owing to superior energy efficiency and computational density. However, these architectures suffer from precision issues caused by intrinsic non-idealities of memristors. In this paper, we first conduct a comprehensive investigation into the impact of such typical non-idealities on LLM reasoning. Empirical results indicate that reasoning capability decreases significantly but varies for distinct benchmarks. Subsequently, we systematically appraise three training-free strategies, including thinking mode, in-context learning, and module redundancy. We thus summarize valuable guidelines, i.e., shallow layer redundancy is particularly effective for improving robustness, thinking mode performs better under low noise levels but degrades at higher noise, and in-context learning reduces output length with a slight performance trade-off. Our findings offer new insights into LLM reasoning under non-ideality and practical strategies to improve robustness.
[14] Knowledge Distillation for Large Language Models
Alejandro Paredes La Torre,Barbara Flores,Diego Rodriguez
Main category: cs.CL
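TL;DR: This paper proposes a resource-efficient framework for compressing large language models that combines knowledge distillation with chain-of-thought guided reinforcement learning, validates it on several language and code datasets, and further improves efficiency through quantization.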
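Illustrative sketch (not from the paper): the abstract for this entry mentions post-training 4-bit weight quantization but does not say which scheme is used. The example below assumes the simplest variant, symmetric per-tensor round-to-nearest into the signed range [-8, 7]; real toolchains usually quantize per group or per channel. The function names and the toy weights are hypothetical.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to integers in [-8, 7]."""
    # Map the largest magnitude to 7; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point weights from 4-bit codes."""
    return [c * scale for c in codes]


# Toy weights, invented for illustration only.
weights = [0.7, -0.35, 0.0, 0.1]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
```

Each weight is stored as a 4-bit code plus one shared scale, and the reconstruction error per weight is bounded by half a quantization step.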
Details
Motivation: Addresses the difficulty of deploying large language models in resource-constrained settings by seeking a substantial reduction in model size while preserving performance.
Method: Uses Qwen 3B as the teacher and Qwen 0.5B as the student for knowledge distillation across English, Spanish, and code datasets; adds chain-of-thought annotated Codeforces data and optimizes with Group Relative Policy Optimization reinforcement learning; finally applies 4-bit weight quantization.
Result: The student retains 70%-91% of the teacher's capability on English tasks, up to 95% on Spanish, and up to 93.5% Rouge-L on code; CoT plus RL improves reasoning coherence and code correctness; 4-bit quantization further reduces memory use and latency.
Conclusion: Combining knowledge distillation with chain-of-thought guided reinforcement learning can produce compact, efficient language models suitable for deployment in resource-constrained environments.
Abstract: We propose a resource-efficient framework for compressing large language models through knowledge distillation, combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly-15k, Spanish Dolly-15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge-L in code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization using CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
[15] LiveWeb-IE: A Benchmark For Online Web Information Extraction
Seungbin Yang,Jihwan Kim,Jaemin Choi,Dongjin Kim,Soyoung Yang,ChaeHun Park,Jaegul Choo
Main category: cs.CL
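TL;DR: Introduces LiveWeb-IE, a benchmark for evaluating web information extraction (WIE) systems directly against live websites, and Visual Grounding Scraper (VGS), a multi-stage agentic framework that visually narrows down web page content to extract the desired information.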
Details
Abstract: Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems. In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information. Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS. We believe that this study lays the foundation for developing practical and robust WIE systems.
[16] Generate Then Correct: Single Shot Global Correction for Aspect Sentiment Quad Prediction
Shidong He,Haoyu Wang,Wenjie Luo
Main category: cs.CL
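TL;DR: This paper proposes Generate-then-Correct (G2C), which generates first and then applies a global correction to address the training-inference mismatch in aspect sentiment quad prediction (ASQP), significantly improving performance on the Rest15 and Rest16 datasets.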
Details
Motivation: Existing ASQP methods use a linearized template with autoregressive decoding, which leads to a training-inference mismatch (exposure bias): early errors propagate sequentially and are hard to repair.
Method: Proposes the two-stage G2C framework: a generator drafts the quads, and a corrector, trained on LLM-synthesized drafts containing typical errors, performs a single-shot, sequence-level global correction.
Result: On the Rest15 and Rest16 datasets, G2C outperforms several strong baseline models.
Conclusion: G2C effectively mitigates the error propagation caused by linearization and exposure bias in ASQP, supporting the generate-then-globally-correct paradigm.
Abstract: Aspect-based sentiment analysis (ABSA) extracts aspect-level sentiment signals from user-generated text, supports product analytics, experience monitoring, and public-opinion tracking, and is central to fine-grained opinion mining. A key challenge in ABSA is aspect sentiment quad prediction (ASQP), which requires identifying four elements: the aspect term, the aspect category, the opinion term, and the sentiment polarity. However, existing studies usually linearize the unordered quad set into a fixed-order template and decode it left-to-right. With teacher forcing training, the resulting training-inference mismatch (exposure bias) lets early prefix errors propagate to later elements. The linearization order determines which elements appear earlier in the prefix, so this propagation becomes order-sensitive and is hard to repair in a single pass. To address this, we propose a method, Generate-then-Correct (G2C): a generator drafts quads and a corrector performs a single-shot, sequence-level global correction trained on LLM-synthesized drafts with common error patterns. On the Rest15 and Rest16 datasets, G2C outperforms strong baseline models.
[17] Projection-Free Evolution Strategies for Continuous Prompt Search
Yu Cai,Canxi Huang,Xiaoyu He
Main category: cs.CL
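TL;DR: This paper proposes a projection-free continuous prompt search method that uses evolution strategies to optimize directly in the full prompt space, and introduces a confidence-based regularization mechanism to improve generalization in few-shot settings; it significantly outperforms existing baselines on seven tasks from the GLUE benchmark.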
Details
Motivation: Existing continuous prompt search methods based on random projection reduce dimensionality but fail to capture the intrinsic low-dimensional structure of the prompt space, and the effectiveness and rationale of the projection mechanism remain unclear.
Method: A projection-free evolution-strategy optimizer searches directly in the full prompt space, with an adaptation mechanism calibrated to the intrinsic dimension to improve efficiency; a confidence-based regularization mechanism strengthens the model's confidence in the target verbalizers.
Result: On seven natural language understanding tasks from the GLUE benchmark, the method significantly outperforms existing continuous prompt search baselines with no additional computational overhead.
Conclusion: The prompt space has an intrinsically low-dimensional structure, but random projection is not the best way to exploit it; projection-free evolutionary search combined with confidence regularization yields more efficient and more robust continuous prompt optimization.
Abstract: Continuous prompt search offers a computationally efficient alternative to conventional parameter tuning in natural language processing tasks. Nevertheless, its practical effectiveness can be significantly hindered by the black-box nature and the inherent high-dimensionality of the objective landscapes. Existing methods typically mitigate these challenges by restricting the search to a randomly projected low-dimensional subspace. However, the effectiveness and underlying motivation of the projection mechanism remain ambiguous. In this paper, we first empirically demonstrate that despite the prompt space possessing a low-dimensional structure, random projections fail to adequately capture this essential structure. Motivated by this finding, we propose a projection-free prompt search method based on evolutionary strategies. By directly optimizing in the full prompt space with an adaptation mechanism calibrated to the intrinsic dimension, our method achieves competitive search capabilities without additional computational overhead. Furthermore, to bridge the generalization gap in few-shot scenarios, we introduce a confidence-based regularization mechanism that systematically enhances the model's confidence in the target verbalizers. Experimental results on seven natural language understanding tasks from the GLUE benchmark demonstrate that our proposed approach significantly outperforms existing baselines.
[18] DeceptGuard: A Constitutional Oversight Framework For Detecting Deception in LLM Agents
Snehasis Mukhopadhyay
Main category: cs.CL
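TL;DR: This paper presents the DECEPTGUARD framework, which systematically compares black-box, chain-of-thought (CoT)-aware, and activation-probe monitoring, using DECEPTSYNTH for synthetic data generation and the realistic sandboxed DeceptArena benchmark for evaluation. CoT-aware and activation-probe monitors substantially outperform black-box monitors. The paper also characterizes a transparency-detectability trade-off and proposes HYBRID-CONSTITUTIONAL ensembles, which reach 0.934 pAUROC in detecting deception by LLM agents, well above the prior state of the art.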
Details
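Motivation: Reliable detection of deceptive behavior in LLM agents is a prerequisite for safe deployment in high-stakes settings; prior work relies only on black-box monitoring (observing tool calls and outputs) and ignores internal reasoning signals, leaving detection blind spots.
Method: DECEPTGUARD is a unified framework that compares three monitoring regimes: black-box, CoT-aware, and activation-probe. DECEPTSYNTH is a synthetic data generation pipeline covering 12 categories of deception. Monitors are optimized on synthetic trajectories and evaluated on the realistic sandboxed DeceptArena benchmark. HYBRID-CONSTITUTIONAL ensembles are proposed as a defense strategy.
Result: CoT-aware and activation-probe monitors improve mean pAUROC by +0.097 over black-box monitors, with the largest gains on subtle, long-horizon deception. As agents suppress overt behavioral signals, CoT becomes the primary detection surface, yet it is itself increasingly unreliable because of post-training faithfulness degradation. HYBRID-CONSTITUTIONAL ensembles reach 0.934 pAUROC on the held-out test set, a substantial advance over the prior state of the art.
Conclusion: Internal reasoning signals, especially CoT and hidden states, are crucial for deception detection, but the inherent transparency-detectability trade-off must be addressed; robust ensembles that fuse multiple signal sources are a key path toward trustworthy monitoring of LLM agents.
Abstract: Reliable detection of deceptive behavior in Large Language Model (LLM) agents is an essential prerequisite for safe deployment in high-stakes agentic contexts. Prior work on scheming detection has focused exclusively on black-box monitors that observe only externally visible tool calls and outputs, discarding potentially rich internal reasoning signals. We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace), and activation-probe monitors (additionally reading hidden-state representations from a frozen open-weights encoder). We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal, behavioral, and structural deception. Our monitors are optimized on 4,800 synthetic trajectories and evaluated on 9,200 held-out samples from DeceptArena, a benchmark of realistic sandboxed agent environments with execution-verified labels. Across all evaluation settings, CoT-aware and activation-probe monitors substantially outperform their black-box counterparts (mean pAUROC improvement of +0.097), with the largest gains on subtle, long-horizon deception that leaves minimal behavioral footprints.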
We empirically characterize a transparency-detectability trade-off: as agents learn to suppress overt behavioral signals, chain-of-thought becomes the primary detection surface but is itself increasingly unreliable due to post-training faithfulness degradation. We propose HYBRID-CONSTITUTIONAL ensembles as a robust defense-in-depth approach, achieving a pAUROC of 0.934 on the held-out test set, representing a substantial advance over the prior state of the art.
[19] GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages
Lawrence Adu Gyamfi,Paul Azunre,Stephen Edward Moore,Joel Budu,Akwasi Asare,Mich-Seth Owusu,Jonathan Ofori Asiamah
Main category: cs.CL
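TL;DR: GhanaNLP has built parallel corpora pairing five low-resource Ghanaian languages (Twi, Fante, Ewe, Ga, and Kusaal) with English, totaling 41,513 sentence pairs that were translated and annotated by human professionals and enriched with structured metadata. The corpora are intended to support machine translation, speech technologies, and language preservation, and are already deployed in the Khaya AI translation engine.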
Details
Motivation: Many indigenous Ghanaian languages severely lack high-quality, well-structured language data in digital spaces; closing this gap advances inclusive and accessible AI technology for African languages.
Method: Human professionals collected, translated, and aligned the data and added standard structural metadata; the corpora were evaluated and integrated into real-world AI systems such as the Khaya AI translation engine.
Result: Parallel corpora for five Ghanaian languages (41,513 sentence pairs) with good consistency and usability, already applied in a real-world AI translation system.
Conclusion: The work provides a reliable foundational resource for low-resource African languages, helping to democratize AI and supporting equitable language technology and language preservation.
Abstract: Low-resource languages present unique challenges for natural language processing due to the limited availability of digitized and well-structured linguistic data. To address this gap, the GhanaNLP initiative has developed and curated 41,513 parallel sentence pairs for the Twi, Fante, Ewe, Ga, and Kusaal languages, which are widely spoken across Ghana yet remain underrepresented in digital spaces. Each dataset consists of carefully aligned sentence pairs between a local language and English. The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability. These corpora are designed to support research, educational, and commercial applications, including machine translation, speech technologies, and language preservation. This paper documents the dataset creation methodology, structure, intended use cases, and evaluation, as well as their deployment in real-world applications such as the Khaya AI translation engine. Overall, this work contributes to broader efforts to democratize AI by enabling inclusive and accessible language technologies for African languages.
[20] PMIScore: An Unsupervised Approach to Quantify Dialogue Engagement
Yongkang Guo,Zhihuan Huang,Yuqing Kong
Main category: cs.CL
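TL;DR: This paper proposes PMIScore, an unsupervised method for quantifying dialogue engagement based on pointwise mutual information (PMI). It estimates PMI from LLM embeddings with a small trained neural network, sidestepping the difficulty of computing PMI directly, and is validated on synthetic and real-world datasets.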
Details
Motivation: Dialogue engagement is an important indicator of an effective conversation, but it is subjective and lacks a gold standard, which makes it hard to quantify reliably.
Method: PMIScore uses pointwise mutual information (PMI) to model the probability of generating a response conditioned on the conversation history. It learns PMI through a dual form of divergence, which involves constructing positive and negative dialogue pairs, extracting embeddings with large language models, and training a small neural network with a mutual information loss function.
Result: Experiments on synthetic and real-world datasets confirm PMIScore's effectiveness at PMI estimation and the reasonableness of PMI as an engagement metric.
Conclusion: PMIScore is an efficient, interpretable, unsupervised method for quantifying dialogue engagement, offering a new tool for benchmarking large models, improving human-computer interaction, and improving communication skills.
Abstract: High dialogue engagement is a crucial indicator of an effective conversation. A reliable measure of engagement could help benchmark large language models, enhance the effectiveness of human-computer interactions, or improve personal communication skills. However, quantifying engagement is challenging, since it is subjective and lacks a "gold standard". This paper proposes PMIScore, an efficient unsupervised approach to quantify dialogue engagement. It uses pointwise mutual information (PMI), which is the probability of generating a response conditioned on the conversation history. Thus, PMIScore offers a clear interpretation of engagement. As directly computing PMI is intractable due to the complexity of dialogues, PMIScore learns it through a dual form of divergence. The algorithm includes generating positive and negative dialogue pairs, extracting embeddings by large language models (LLMs), and training a small neural network using a mutual information loss function. We validated PMIScore on both synthetic and real-world datasets. Our results demonstrate the effectiveness of PMIScore in PMI estimation and the reasonableness of the PMI metric itself.
[21] APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution
Kun Chen,Qingchao Kong,Zhao Feifei,Wenji Mao
Main category: cs.CL
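TL;DR: This paper proposes the APEX-Searcher framework, which decouples retrieval into planning and execution stages, using reinforcement learning to optimize strategic planning and supervised fine-tuning to improve multi-hop sub-task execution, and significantly improves multi-hop RAG and task planning performance.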
Details
Motivation: Existing end-to-end multi-round retrieval methods suffer from ambiguous retrieval execution paths and sparse rewards, leading to inaccurate retrieval and degraded performance.
Method: A two-stage agentic framework: the first stage uses reinforcement learning with decomposition-specific rewards to optimize planning; the second stage applies supervised fine-tuning on high-quality multi-hop trajectories to strengthen iterative sub-task execution.
Result: Significant improvements in multi-hop RAG and task planning performance across multiple benchmarks.
Conclusion: By decoupling planning from execution and using a hybrid training strategy, APEX-Searcher mitigates the ambiguity and sparse-reward challenges of multi-hop retrieval and strengthens LLM search capabilities.
Abstract: Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in the end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, in this paper, we propose APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: it first employs RL with decomposition-specific rewards to optimize strategic planning; building on the sub-task decomposition, it then applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performance across multiple benchmarks.
[22] GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent
Yuri Kuratov,Matvey Kairov,Aydar Bulatov,Ivan Rodkin,Mikhail Burtsev
Main category: cs.CL
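TL;DR: This paper proposes GradMem, which uses test-time gradient optimization to compress a long context into a small set of prefix memory tokens, enabling question answering without access to the original context.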
Details
Motivation: To address the large memory overhead of the Transformer KV-cache in long-context applications, the paper explores a compressive memory mechanism that reads a context once and answers many queries from the stored state.
Method: With model weights frozen, GradMem runs per-sample test-time gradient descent on a small set of prefix memory tokens, minimizing a self-supervised context reconstruction loss, which yields a memory write with iterative error correction.
Result: On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size; additional gradient steps scale capacity more effectively than repeated forward writes; it attains competitive results on natural language tasks including bAbI and SQuAD variants.
Conclusion: GradMem validates gradient-based memory writing as an effective and transferable approach to compressive memory, offering a low-memory, flexible path for long-context processing.
Abstract: Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
[23] Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation
Petter Törnberg
Main category: cs.CL
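TL;DR: This paper shows that large language models (LLMs) systematically reproduce social stereotypes when used for automated text annotation, most visibly as racial bias triggered by name and dialect cues, which poses risks for research, governance, and decision-making.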
Details
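Motivation: As LLMs are widely used for automated text annotation in academic research, content moderation, and hiring, it is important to assess whether they carry and propagate social bias, especially systematic bias driven by identity cues such as names and dialect.
Method: Two large-scale experiments: (1) a names experiment testing how 19 LLMs rate texts containing names associated with different groups (Black, Asian, Arab, White) across 39 annotation tasks; (2) a matched dialect experiment comparing how the 19 LLMs rate the same sentence written in African American Vernacular English (AAVE) versus Standard American English (SAE) on dimensions such as professionalism, perceived education, toxicity, and anger. The experiments total more than 4 million annotation judgments.
Result: Nearly all models show consistent bias. Black-associated names are rated as more aggressive and more gossipy; Asian-associated names as more intelligent but less confident and less sociable; Arab-associated names higher on cognitive traits but lower on interpersonal ones; all minority-associated names as less self-disciplined. AAVE text is systematically rated as less professional, less indicative of an educated speaker, more toxic, and more angry. The one exception is name-based hireability, where fine-tuning appears to overcorrect in favor of minority-named applicants.
Conclusion: LLMs used as automated annotators can embed socially patterned biases into datasets and downstream measurements, threatening data reliability and algorithmic fairness; bias auditing and mitigation should be built into methodology and deployment practice.
Abstract: Large language models (LLMs) are increasingly used for automated text annotation in tasks ranging from academic research to content moderation and hiring. Across 19 LLMs and two experiments totaling more than 4 million annotation judgments, we show that subtle identity cues embedded in text systematically bias annotation outcomes in ways that mirror racial stereotypes. In a names-based experiment spanning 39 annotation tasks, texts containing names associated with Black individuals are rated as more aggressive by 18 of 19 models and more gossipy by 18 of 19. Asian names produce a bamboo-ceiling profile: 17 of 19 models rate individuals as more intelligent, while 18 of 19 rate them as less confident and less sociable. Arab names elicit cognitive elevation alongside interpersonal devaluation, and all four minority groups are consistently rated as less self-disciplined. In a matched dialect experiment, the same sentence is judged significantly less professional (all 19 models, mean gap -0.774), less indicative of an educated speaker (-0.688), more toxic (18/19), and more angry (19/19) when written in African American Vernacular English rather than Standard American English. A notable exception occurs for name-based hireability, where fine-tuning appears to overcorrect, systematically favoring minority-named applicants.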
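As a rough illustration of what a "mean gap" in a matched design measures, the sketch below averages the per-sentence difference between a model's rating of the AAVE version and its rating of the SAE version of the same sentence. The rating scale, the function name `mean_paired_gap`, and the three score pairs are hypothetical placeholders, not data or code from the paper, and the paper's exact aggregation may differ.

```python
from statistics import mean
from typing import Sequence


def mean_paired_gap(treated: Sequence[float], control: Sequence[float]) -> float:
    """Average of (treated - control) over matched items.

    Each position holds the two ratings of the SAME sentence, so the
    difference isolates the effect of the variant being rated.
    """
    if len(treated) != len(control):
        raise ValueError("a matched design needs one control score per treated score")
    if not treated:
        raise ValueError("need at least one matched pair")
    return mean(t - c for t, c in zip(treated, control))


# Hypothetical "professionalism" ratings for three matched sentences.
aave_scores = [3.0, 2.5, 3.5]  # ratings of the AAVE versions (made up)
sae_scores = [4.0, 3.0, 4.5]   # ratings of the SAE versions (made up)

gap = mean_paired_gap(aave_scores, sae_scores)
print(f"mean paired gap: {gap:.3f}")  # negative: AAVE versions rated lower
```

A negative value means the first variant received lower ratings on average; because every pair shares the same underlying sentence, content differences cancel out of the comparison.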
These findings suggest that using LLMs as automated annotators can embed socially patterned biases directly into the datasets and measurements that increasingly underpin research, governance, and decision-making.
[24] OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset
Wenbin Hu,Huihao Jing,Haochen Shi,Changxuan Fan,Haoran Li,Yangqiu Song
Main category: cs.CL
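TL;DR: This paper builds OmniCompliance-100K, a large rule-grounded compliance and safety dataset of real-world cases spanning 74 regulations and policies, with 12,985 rules and 106,009 cases, for evaluating the compliance capabilities of large language models across domains.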
Details
Motivation: Existing LLM safety datasets lack rule-grounded, real-world cases tied to actual regulations, which limits robust safeguards; a systematic safety benchmark built from a compliance perspective and authoritative multi-domain sources is needed.
Method: A powerful web-searching agent collects rules and their corresponding real-world compliance cases from authoritative multi-domain sources (AI company policies, social media platform policies, financial, medical, and educational requirements, and human-rights protections) to construct OmniCompliance-100K, followed by benchmarking of LLM safety and compliance capabilities across model scales.
Result: The dataset covers 74 regulations and policies, 12,985 rules, and 106,009 cases; analysis confirms strong alignment between rules and their cases; benchmarking across model scales yields several findings about the safety and compliance capabilities of advanced LLMs.
Conclusion: OmniCompliance-100K fills a data gap in rule-grounded LLM safety evaluation and offers a scalable, multi-domain benchmark for compliance-oriented LLM safety research.
Abstract: Ensuring the safety and compliance of large language models (LLMs) is of paramount importance. However, existing LLM safety datasets often rely on ad-hoc taxonomies for data generation and suffer from a significant shortage of rule-grounded, real-world cases that are essential for robustly protecting LLMs. In this work, we address this critical gap by constructing a comprehensive safety dataset from a compliance perspective. Using a powerful web-searching agent, we collect a rule-grounded, real-world case dataset OmniCompliance-100K, sourced from multi-domain authoritative references. The dataset spans 74 regulations and policies across a wide range of domains, including security and privacy regulations, content safety and user data privacy policies from leading AI companies and social media platforms, financial security requirements, medical device risk management standards, educational integrity guidelines, and protections of fundamental human rights. In total, our dataset contains 12,985 distinct rules and 106,009 associated real-world compliance cases. Our analysis confirms a strong alignment between the rules and their corresponding cases. We further conduct extensive benchmarking experiments to evaluate the safety and compliance capabilities of advanced LLMs across different model scales. Our experiments reveal several interesting findings that have great potential to offer valuable insights for future LLM safety research.
[25] ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering
Hussein Jawad,Nicolas J-B Brunel
Main category: cs.CL
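TL;DR: This paper proposes ToolFlood, a retrieval-layer attack on tool-augmented LLM agents. By injecting a small number of carefully crafted malicious tools that exploit the geometry of the embedding space, the attack makes these tools cover many user queries at retrieval time and crowd out all benign tools, achieving an attack success rate of up to 95%.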
Details
Motivation: As LLM agent systems scale, the robustness of their embedding-based retrieval stage remains underexplored, even though this stage is critical to tool selection.
Method: ToolFlood uses a two-phase adversarial tool generation strategy: the first phase uses an LLM to iteratively generate diverse tool names and descriptions; the second phase runs an iterative greedy selection under a cosine-distance threshold that maximizes coverage of the remaining queries, stopping when all target queries are covered or the budget is reached.
Result: On standard benchmarks, ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench), and the paper provides a theoretical analysis of retrieval saturation.
Conclusion: ToolFlood exposes a serious security vulnerability in the retrieval layer of current tool-augmented LLM agents and underscores the urgency of making embedding-based retrieval more robust.
Abstract: Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning. As these systems scale, the robustness of this retrieval stage is underexplored, even though prior work has examined attacks on tool selection. This paper introduces ToolFlood, a retrieval-layer attack on tool-augmented LLM agents. Rather than altering which tool is chosen after retrieval, ToolFlood overwhelms retrieval itself by injecting a few attacker-controlled tools whose metadata is carefully placed by exploiting the geometry of embedding space. These tools semantically span many user queries, dominate the top-k results, and push all benign tools out of the agent's context. ToolFlood uses a two-phase adversarial tool generation strategy. It first samples subsets of target queries and uses an LLM to iteratively generate diverse tool names and descriptions. It then runs an iterative greedy selection that chooses tools maximizing coverage of remaining queries in embedding space under a cosine-distance threshold, stopping when all queries are covered or a budget is reached. We provide theoretical analysis of retrieval saturation and show on standard benchmarks that ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench). The code will be made publicly available at the following link: https://github.com/as1-prog/ToolFlood
[26] sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook
Ibrahim Ebrar Yurt,Fabian Karl,Tejaswi Choppa,Florian Matthes
Main category: cs.CL
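TL;DR: This paper investigates building a clinical question answering system over electronic health records (EHRs) on a single notebook, emphasizing privacy and the feasibility of local deployment. Participating in all four subtasks of the ArchEHR-QA 2026 shared task with fully local experiments, the authors find that properly configured smaller models can approach the performance of larger ones, indicating that local, privacy-preserving EHR QA is feasible.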
Details
Motivation: Large cloud-based models are hard to deploy in clinical environments because of privacy constraints and computational requirements; the paper explores privacy-preserving, deployable EHR question answering on commodity hardware such as a single notebook.
Method: The authors participate in all four subtasks of ArchEHR-QA 2026 and design and evaluate several question answering approaches that run locally on commodity hardware, with no external APIs or cloud infrastructure.
Result: The local systems are competitive on the shared task leaderboards: submissions perform above average in two subtasks, and properly configured smaller models can approach the performance of much larger systems.
Conclusion: Current models and commodity hardware are sufficient to support fully local, privacy-preserving EHR question answering systems.
Abstract: Clinical question answering over electronic health records (EHRs) can help clinicians and patients access relevant medical information more efficiently. However, many recent approaches rely on large cloud-based models, which are difficult to deploy in clinical environments due to privacy constraints and computational requirements. In this work, we investigate how far grounded EHR question answering can be pushed when restricted to a single notebook. We participate in all four subtasks of the ArchEHR-QA 2026 shared task and evaluate several approaches designed to run on commodity hardware. All experiments are conducted locally without external APIs or cloud infrastructure. Our results show that such systems can achieve competitive performance on the shared task leaderboards. In particular, our submissions perform above average in two subtasks, and we observe that smaller models can approach the performance of much larger systems when properly configured. These findings suggest that privacy-preserving EHR QA systems running fully locally are feasible with current models and commodity hardware. The source code is available at https://github.com/ibrahimey/ArchEHR-QA-2026.
[27] FLUX: Data Worth Training On
Gowtham,Sai Rupesh,Sanjay Kumar,Saravanan,Venkata Chaithanya
Main category: cs.CL
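TL;DR: FLUX is a data preprocessing pipeline for large language models that aims to achieve high data quality and high token retention at the same time, breaking the trade-off between the two in prior pipelines. Models trained on FLUX data outperform those trained on DCLM and FineWeb on benchmarks such as MMLU while substantially reducing training compute.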
Details
Motivation: Existing preprocessing pipelines cannot combine massive scale with high quality: they either filter aggressively and lose many tokens, or retain large volumes of noisy data.
Method: The FLUX preprocessing pipeline maximizes usable token retention while enforcing rigorous quality control; its extraction yield is validated on dumps such as CC-MAIN-2025-51.
Result: FLUX extracts 50B usable tokens from CC-MAIN-2025-51 (+25% over DCLM's 40B); a 3B model trained on 60B FLUX tokens reaches 32.14% on MMLU, above DCLM (31.98%) and FineWeb (29.88%); with only 39B tokens it matches the aggregate score of the DCLM-trained model, a 34.4% reduction in training compute; FLUX-Base yields 192B tokens, more than FineWeb's 170B.
Conclusion: FLUX shows that high token retention, strong quality control, and computational efficiency can be achieved together, setting a new state of the art in web-scale data preprocessing.
Abstract: Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to break this long-standing trade-off by maximizing token retention while enforcing rigorous quality control. Models trained on FLUX-curated data consistently outperform prior methods. A 3B-parameter model trained on 60B tokens with FLUX achieves 32.14% MMLU accuracy, surpassing the previous state-of-the-art pipeline DCLM (31.98%) and significantly outperforming FineWeb (29.88%). FLUX achieves the same aggregate score as a model trained on DCLM data using only 39B tokens, resulting in a 34.4% reduction in training compute. At the data level, FLUX extracts 50B usable tokens from a single dump (CC-MAIN-2025-51), compared to 40B from DCLM (+25% retention). FLUX-Base yields 192B tokens, exceeding FineWeb's 170B while still maintaining superior quality.
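The retention figures above are relative gains over a baseline token count, which can be checked with a few lines of arithmetic. The helper name `relative_gain` is a hypothetical label for this sketch; the token counts are the rounded values quoted in the abstract, so the results are only as precise as those roundings.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Fractional increase of `new` over `baseline`, e.g. 0.25 for +25%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Token counts in billions, as quoted in the abstract (rounded values).
flux_vs_dclm = relative_gain(50, 40)          # single dump CC-MAIN-2025-51
flux_base_vs_fineweb = relative_gain(192, 170)

print(f"FLUX vs DCLM: {flux_vs_dclm:+.1%}")                # +25.0%
print(f"FLUX-Base vs FineWeb: {flux_base_vs_fineweb:+.1%}")  # +12.9%
```

The first value reproduces the +25% retention stated in the abstract; the second is derived here from the quoted 192B and 170B figures and is not a number the abstract itself reports.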
Overall, FLUX establishes a new state of the art in web-scale data preprocessing by demonstrating that high retention, strong quality control, and computational efficiency can be achieved simultaneously, redefining the limits of scalable dataset construction for modern language models.
[28] Beyond Explicit Edges: Robust Reasoning over Noisy and Sparse Knowledge Graphs
Hang Gao,Dimitris N. Metaxas
Main category: cs.CL
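TL;DR: INSES is a dynamic graph reasoning framework that combines LLM-guided navigation with embedding-based similarity expansion to handle noisy, sparse, and incomplete knowledge graphs, and uses a lightweight router to balance efficiency against reasoning depth.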
Details
Motivation: Standard graph algorithms rely on static connectivity and explicit edges, and perform poorly on real-world knowledge graphs that are noisy, sparse, or incomplete.
Method: The INSES framework combines LLM-guided navigation (which prunes noise and steers exploration) with embedding-based similarity expansion (which recovers hidden links and bridges semantic gaps), and adds a lightweight router that dispatches queries between Naïve RAG and INSES.
Result: INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks; on the MINE benchmark it improves accuracy by 5%, 10%, and 27% on knowledge graphs constructed with KGGEN, GraphRAG, and OpenIE, respectively.
Conclusion: INSES improves the robustness and multi-hop reasoning of graph-augmented retrieval over complex, imperfect knowledge graphs while balancing efficiency and reasoning depth.
Abstract: GraphRAG is increasingly adopted for converting unstructured corpora into graph structures to enable multi-hop reasoning. However, standard graph algorithms rely heavily on static connectivity and explicit edges, often failing in real-world scenarios where knowledge graphs (KGs) are noisy, sparse, or incomplete. To address this limitation, we introduce INSES (Intelligent Navigation and Similarity Enhanced Search), a dynamic framework designed to reason beyond explicit edges. INSES couples LLM-guided navigation, which prunes noise and steers exploration, with embedding-based similarity expansion to recover hidden links and bridge semantic gaps. Recognizing the computational cost of graph reasoning, we complement INSES with a lightweight router that delegates simple queries to Naïve RAG and escalates complex cases to INSES, balancing efficiency with reasoning depth. INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks. Notably, on the MINE benchmark, it demonstrates superior robustness across KGs constructed by varying methods (KGGEN, GraphRAG, OpenIE), improving accuracy by 5%, 10%, and 27%, respectively.
[29] SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions
Konstantinos Thomas,Giorgos Filandrianos,Maria Lymperaiou,Chrysoula Zerva,Giorgos Stamou
Main category: cs.CL
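TL;DR: This paper presents SemEval-2026 Task 6 (CLARITY), an NLP shared task on question evasion in political discourse with two subtasks, clarity-level classification and fine-grained evasion-strategy classification, and analyzes the participating systems and the approaches that worked best.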
Details
Motivation: Political speakers often strategically evade questions while appearing responsive; the phenomenon matters for public discourse yet remains underexplored in NLP.
Method: A benchmark is built from U.S. presidential interviews following an expert-grounded taxonomy of response clarity and evasion strategies, with a two-level classification task; an international shared task was organized, attracting 124 registered teams, and the submitted runs were evaluated to analyze the effect of different approaches such as LLM prompting and hierarchical use of the taxonomy.
Result: Clarity classification reaches at most 0.89 macro-F1, well above the strongest baseline; evasion-strategy classification reaches only 0.68 macro-F1, matching the best baseline, which shows the latter is harder; approaches that combine the two subtasks and exploit the taxonomy's hierarchical structure work best.
Conclusion: CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis, highlights the difficulty of modeling strategic ambiguity in political language, and points to the value of combining taxonomy structure with large language models.
Abstract: Political speakers often avoid answering questions directly while maintaining the appearance of responsiveness. Despite its importance for public discourse, such strategic evasion remains underexplored in Natural Language Processing. We introduce SemEval-2026 Task 6, CLARITY, a shared task on political question evasion consisting of two subtasks: (i) clarity-level classification into Clear Reply, Ambivalent, and Clear Non-Reply, and (ii) evasion-level classification into nine fine-grained evasion strategies. The benchmark is constructed from U.S. presidential interviews and follows an expert-grounded taxonomy of response clarity and evasion. The task attracted 124 registered teams, who submitted 946 valid runs for clarity-level classification and 539 for evasion-level classification. Results show a substantial gap in difficulty between the two subtasks: the best system achieved 0.89 macro-F1 on clarity classification, surpassing the strongest baseline by a large margin, while the top evasion-level system reached 0.68 macro-F1, matching the best baseline. Overall, large language model prompting and hierarchical exploitation of the taxonomy emerged as the most effective strategies, with top systems consistently outperforming those that treated the two subtasks independently. CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.
[30] NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments
Rupak Raj Ghimire,Bipesh Subedi,Balaram Prasain,Prakash Poudyal,Praveen Acharya,Nischal Karki,Rupak Tiwari,Rishikesh Kumar Sharma,Jenny Poudel,Bal Krishna Bal
Main category: cs.CL
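TL;DR: This paper builds two Nepali-Tamang parallel corpora, NepTam20K (human-translated gold standard) and NepTam80K (synthetic), and runs machine translation experiments with several pretrained models, of which fine-tuned NLLB-200 performs best.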
Details
Motivation: South Asian languages, Tamang in particular, severely lack high-quality parallel corpora, which holds back machine translation and calls for dedicated resources.
Method: Nepali news and online text are scraped, pre-processed, semantically filtered, and balanced for tense and polarity, then translated into Tamang by native speakers and verified by an expert linguist, yielding two sentence-aligned parallel corpora; baseline translation experiments use mBART, M2M-100, NLLB-200, and a vanilla Transformer.
Result: Fine-tuned NLLB-200 achieves the highest sacreBLEU scores, 40.92 for Nepali-to-Tamang and 45.26 for Tamang-to-Nepali.
Conclusion: The work fills a data gap for Tamang machine translation with corpora that cover multiple domains, and shows the potential of large multilingual models for low-resource translation.
Abstract: Modern translation systems heavily rely on high-quality, large parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Among them, Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in the NepTam20K dataset), expert translation into Tamang by native speakers of the language, and verification by an expert Tamang linguist. The dataset covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication. To evaluate the dataset, baseline machine translation experiments were carried out using various multilingual pre-trained models: mBART, M2M-100, NLLB-200, and a vanilla Transformer model. Fine-tuning NLLB-200 achieved the highest sacreBLEU scores of 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).
[31] CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification
Menna Elgabry,Ali Hamdi,Khaled Shaban
Main category: cs.CL
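TL;DR: This paper proposes CMHL, a model that uses multi-task learning, psychologically grounded auxiliary supervision, and a contrastive contradiction loss to surpass much larger models on textual emotion classification with only 125M parameters, reaching new state-of-the-art performance and showing cross-domain generalization.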
Details
Motivation: The paper challenges the assumption that better performance requires larger language models or multi-model ensembles, and asks whether a smarter single-model architecture can do better.
Method: CMHL is a single-model architecture with three innovations: (1) multi-task learning that jointly predicts primary emotions, valence, and intensity; (2) psychologically grounded auxiliary supervision derived from Russell's circumplex model; (3) a contrastive contradiction loss that penalizes mutually incompatible predictions (e.g., simultaneous high confidence in joy and anger).
Result: On the dair-ai Emotion dataset, CMHL reaches an F1 of 93.75%, surpassing LLMs with 56x more parameters and sLM ensembles; on the Reddit SWMH dataset it reaches an F1 of 72.50% and a recall of 73.30%, outperforming domain-specific models such as MentalBERT.
Conclusion: Architectural design, rather than parameter count, drives progress in textual emotion classification; a single model that embeds psychological priors and explicit consistency constraints can outperform large models and ensembles while remaining efficient, interpretable, and clinically relevant.
Abstract: Textual Emotion Classification (TEC) is one of the most difficult NLP tasks. State-of-the-art approaches rely on large language models (LLMs) and multi-model ensembles. In this study, we challenge the assumption that larger scale or more complex models are necessary for improved performance. To improve logical consistency, we introduce CMHL, a novel single-model architecture that explicitly models the logical structure of emotions through three key innovations: (1) multi-task learning that jointly predicts primary emotions, valence, and intensity, (2) psychologically-grounded auxiliary supervision derived from Russell's circumplex model, and (3) a novel contrastive contradiction loss that enforces emotional consistency by penalizing mutually incompatible predictions (e.g., simultaneous high confidence in joy and anger). With just 125M parameters, our model outperforms 56x larger LLMs and sLM ensembles with a new state-of-the-art F1 score of 93.75% compared to 86.13%-93.2% on the dair-ai Emotion dataset. We further show cross-domain generalization on the Reddit Suicide Watch and Mental Health Collection dataset (SWMH), outperforming domain-specific models like MentalBERT and MentalRoBERTa with an F1 score of 72.50% compared to 68.16%-72.16% and a recall of 73.30% compared to 67.05%-70.89%, which translates to enhanced sensitivity for detecting mental health distress. Our work establishes that architectural intelligence (not parameter count) drives progress in TEC.
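To make the idea of penalizing mutually incompatible predictions concrete, here is a minimal sketch of one possible penalty: the sum, over pairs of emotions declared incompatible, of the product of their predicted probabilities. This is an assumed form chosen for illustration, not the loss defined in the paper; the function name `contradiction_penalty`, the pair list, and the probabilities are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def contradiction_penalty(
    probs: Dict[str, float],
    incompatible: Iterable[Tuple[str, str]],
) -> float:
    """Sum of p(a) * p(b) over emotion pairs that should not co-occur.

    The product is large only when BOTH emotions are predicted with high
    confidence, so consistent predictions are barely penalized.
    """
    return sum(probs[a] * probs[b] for a, b in incompatible)


# Hypothetical incompatibility list and independent (sigmoid-style) outputs.
pairs = [("joy", "anger"), ("joy", "sadness")]
consistent = {"joy": 0.9, "anger": 0.05, "sadness": 0.05}
contradictory = {"joy": 0.9, "anger": 0.8, "sadness": 0.05}

low = contradiction_penalty(consistent, pairs)      # 0.9*0.05 + 0.9*0.05
high = contradiction_penalty(contradictory, pairs)  # 0.9*0.8 + 0.9*0.05
print(f"consistent: {low:.3f}, contradictory: {high:.3f}")
```

In a training setup, a term like this would typically be added to the classification loss with a weighting coefficient; how CMHL actually weights or formulates its contrastive term is not stated in the abstract.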
By embedding psychological priors and explicit consistency constraints, a well-designed single model can outperform both massive LLMs and complex ensembles, offering an efficient, interpretable, and clinically relevant paradigm for affective computing.
[32] OasisSimp: An Open-source Asian-English Sentence Simplification Dataset
Hannah Liu,Muxin Tian,Iqra Ali,Haonan Gao,Qiaoyiwen Wu,Blair Yang,Uthayasanker Thayasivam,En-Shiun Annie Lee,Pakawat Nakwijit,Surangika Ranathunga,Ravi Shekhar
Main category: cs.CL
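TL;DR: This paper introduces OasisSimp, a multilingual sentence simplification dataset covering English, Sinhala, Tamil, Pashto, and Thai that fills a data gap for low-resource languages, and uses it to evaluate eight open-weight multilingual LLMs, revealing their performance bottlenecks on low-resource languages.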
Details
Motivation: Sentence simplification research is limited by the scarcity of high-quality annotated data for mid-resource and low-resource languages, so a benchmark dataset covering more languages is needed.
Method: Trained annotators followed detailed guidelines to simplify sentences in five languages while preserving meaning, fluency, and grammatical correctness, producing the OasisSimp dataset; eight open-weight multilingual large language models are evaluated on it.
Result: The evaluation shows substantial performance disparities between high-resource and low-resource languages, exposing the limitations of current models on low-resource simplification.
Conclusion: OasisSimp provides both a multilingual resource and a challenging benchmark for sentence simplification, and is the first such dataset for Thai, Pashto, and Tamil; it can support future research in low-resource sentence simplification.
Abstract: Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce the OasisSimp dataset, a multilingual dataset for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset and observe substantial performance disparities between high-resource and low-resource languages, highlighting the simplification challenges in multilingual settings. The OasisSimp dataset thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM-based simplification methods and paving the way for future research in low-resource sentence simplification. The dataset is available at https://OasisSimpDataset.github.io/.
[33] The GELATO Dataset for Legislative NER
Matthew Flynn,Timothy Obiso,Sam Newman
Main category: cs.CL
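TL;DR: This paper presents GELATO, a dataset for two-level named entity recognition on U.S. legislative text. Experiments combine fine-tuned Transformer models with prompt-optimized LLMs and support the RoBERTa-plus-LLM combination for legislative NER.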
Details
Motivation: Existing NER methods adapt poorly to the complex structure and specialized terminology of U.S. legislative text, which calls for a purpose-built, high-quality annotated dataset and models suited to the domain.
Method: Build the GELATO dataset from House and Senate bills of the 118th U.S. Congress and design a two-level NER ontology. Fine-tuned BERT and RoBERTa models handle first-level prediction; LLMs with optimized prompts handle second-level prediction.
Result: RoBERTa performs strongly on first-level prediction, while BERT is comparatively weak. LLMs work as second-level predictors, and the overall pipeline appears well suited to legislative NER.
Conclusion: The GELATO dataset and the two-stage RoBERTa-plus-LLM approach provide a basis for information extraction from legislative text and for related downstream tasks.
Abstract: This paper introduces GELATO (Government, Executive, Legislative, and Treaty Ontology), a dataset of U.S. House and Senate bills from the 118th Congress annotated using a novel two-level named entity recognition ontology designed for U.S. legislative texts. We fine-tune transformer-based models (BERT, RoBERTa) of different architectures and sizes on this dataset for first-level prediction. We then use LLMs with optimized prompts to complete the second level prediction. The strong performance of RoBERTa and relatively weak performance of BERT models, as well as the application of LLMs as second-level predictors, support future research in legislative NER or downstream tasks using these model combinations as extraction tools.
[34] MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos
Arushi Goel,Sreyan Ghosh,Vatsal Agarwal,Nishit Anand,Kaousheik Jayakumar,Lasha Koroshinadze,Yao Xu,Katie Lyons,James Case,Karan Sapra,Kevin J. Shih,Siddharth Gururani,Abhinav Shrivastava,Ramani Duraiswami,Dinesh Manocha,Andrew Tao,Bryan Catanzaro,Mohammad Shoeybi,Wei Ping
Main category: cs.CL
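TL;DR: This paper presents MMOU, a benchmark for evaluating how well multimodal LLMs jointly reason over visual, audio, and textual signals in long, complex videos. It reveals substantial performance gaps and systematic failure modes in current models.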
Details
Motivation: Existing MLLMs perform well when visual and audio understanding are evaluated separately, but their ability to reason jointly over multimodal signals in long, complex videos has not been studied systematically.
Method: Build the MMOU benchmark, with 15,000 questions and 9,038 diverse web-collected videos covering 13 skill categories that require integrating evidence across modalities and time, and evaluate more than 20 state-of-the-art multimodal models on it.
Result: The best closed-source model reaches only 64.2% accuracy and the strongest open-source model only 46.8%. The analysis shows that models frequently fail to apply basic skills in long videos and exhibit systematic error patterns.
Conclusion: Current multimodal models still face major challenges in long-form omni-modal understanding. MMOU offers an evaluation tool and points to directions for improvement.
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9038 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. Our results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.
[35] Selective Fine-Tuning of GPT Architectures for Parameter-Efficient Clinical Text Classification
Fariba Afrin Irany,Sampson Akwafuo
Main category: cs.CL
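TL;DR: This paper proposes a parameter-efficient selective fine-tuning framework for GPT-2 that updates only the final Transformer block, the final layer normalization module, and a lightweight classification head. On MIMIC-IV-Note radiology reports it reaches about 91% accuracy while updating fewer than 6% of the parameters, a favorable balance between accuracy and compute relative to head-only training and full-model fine-tuning.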
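As a rough sanity check on the "fewer than 6%" figure, the sketch below counts parameters when only the last Transformer block, the final LayerNorm, and a linear head are trainable. The entry does not say which GPT-2 size or how many labels were used, so the 124M-parameter GPT-2 configuration and the 14-label head are assumptions, as are all function and variable names.

```python
# Parameter-count sketch for selective fine-tuning of GPT-2.
# Assumptions (not stated in the abstract): GPT-2 small (12 layers, d_model=768,
# vocabulary 50257, 1024 positions) and a linear head over 14 labels.

D_MODEL, N_LAYERS, VOCAB, N_POS, N_LABELS = 768, 12, 50257, 1024, 14


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(d: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * d


def transformer_block(d: int) -> int:
    """Two LayerNorms, self-attention, and a feed-forward network."""
    attention = linear(d, 3 * d) + linear(d, d)  # fused QKV + output projection
    mlp = linear(d, 4 * d) + linear(4 * d, d)    # feed-forward with 4x expansion
    return 2 * layer_norm(d) + attention + mlp


embeddings = VOCAB * D_MODEL + N_POS * D_MODEL   # token + position embeddings
block = transformer_block(D_MODEL)
backbone = embeddings + N_LAYERS * block + layer_norm(D_MODEL)
head = linear(D_MODEL, N_LABELS)

# Only the last block, the final LayerNorm, and the head receive gradients.
trainable = block + layer_norm(D_MODEL) + head
fraction = trainable / (backbone + head)

print(f"backbone parameters:  {backbone:,}")
print(f"trainable parameters: {trainable:,} ({fraction:.2%})")
```

Under these assumptions the trainable share comes out a little under 6%, which is consistent with the reported figure; a larger GPT-2 variant would give a smaller share because one block is a smaller part of a deeper model.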
Details
Motivation: Clinical text is highly specialized, labeled data is scarce, and full fine-tuning of large models is computationally expensive, so efficient adaptation strategies are needed.
Method: Freeze most GPT-2 parameters and fine-tune only the final Transformer block, the final LayerNorm, and a lightweight classification head.
Result: About 91% classification accuracy on 50,000 radiology reports while updating fewer than 6% of the parameters. Compared with head-only training and full fine-tuning, the method offers a favorable balance between predictive performance and computational efficiency.
Conclusion: Selective fine-tuning is an efficient and scalable framework for clinical text classification that balances performance against computational cost.
Abstract: The rapid expansion of electronic health record (EHR) systems has generated large volumes of unstructured clinical narratives that contain valuable information for disease identification, patient cohort discovery, and clinical decision support. Extracting structured knowledge from these free-text documents remains challenging because clinical language is highly specialized, labeled datasets are limited, and full fine-tuning of large pretrained language models can require substantial computational resources. Efficient adaptation strategies are therefore essential for practical clinical natural language processing applications. This study proposes a parameter-efficient selective fine-tuning framework for adapting GPT-2 to clinical text classification tasks. Instead of updating the entire pretrained model, the majority of network parameters are frozen, and only the final Transformer block, the final layer normalization module, and a lightweight classification head are updated during training. This design substantially reduces the number of trainable parameters while preserving the contextual representation capabilities learned during pretraining. The proposed approach is evaluated using radiology reports from the MIMIC-IV-Note dataset with automatically derived CheXpert-style labels. Experiments on 50,000 radiology reports demonstrate that selective fine-tuning achieves approximately 91% classification accuracy while updating fewer than 6% of the model parameters. Comparative experiments with head-only training and full-model fine-tuning show that the proposed method provides a favorable balance between predictive performance and computational efficiency. These results indicate that selective fine-tuning offers an efficient and scalable framework for clinical text classification.
[36] Vavanagi: a Community-run Platform for Documentation of the Hula Language in Papua New Guinea
Bri Olewale,Raphael Merx,Ekaterina Vylomova
Main category: cs.CL
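TL;DR: Vavanagi is a community-led language technology platform for Hula, a language of Papua New Guinea. It supports crowdsourced English-Hula text translation and voice recording, with elder-led review and community-governed data infrastructure. The project sits at Level 5 (fully community-led) of the authors' involvement framework and has produced over 12,000 parallel sentence pairs covering 9,000 unique Hula words, showing one way language technology can support cultural heritage and connect generations.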
Details
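Motivation: Advance language technology for small languages such as Hula while avoiding the cultural distortion and community alienation that externally led projects can cause, with an emphasis on community-centered language preservation and digital empowerment.
Method: Build the community-run Vavanagi platform, which combines crowdsourced translation, voice recording, elder-led review, and community-governed data infrastructure. Propose a multi-level framework for measuring community involvement and place Vavanagi at the highest level (Level 5).
Result: 77 translators and 4 reviewers have produced over 12,000 English-Hula parallel sentence pairs covering about 9,000 unique Hula words. To the authors' knowledge, Vavanagi is the first community-led language technology initiative for a language of this size.
Conclusion: Vavanagi shows that a community-driven language technology project is feasible at this scale. It can connect village-based and urban members, bridge generations, and support cultural heritage on the community's own terms, and it offers a model that other small-language communities may be able to reuse.
Abstract: We present Vavanagi, a community-run platform for Hula (Vula'a), an Austronesian language of Papua New Guinea with approximately 10,000 speakers. Vavanagi supports crowdsourced English-Hula text translation and voice recording, with elder-led review and community-governed data infrastructure. To date, 77 translators and 4 reviewers have produced over 12k parallel sentence pairs covering 9k unique Hula words. We also propose a multi-level framework for measuring community involvement, from consultation to fully community-initiated and governed projects. We position Vavanagi at Level 5: initiative, design, implementation, and data governance all sit within the Hula community, making it, to our knowledge, the first community-led language technology initiative for a language of this size. Vavanagi shows how language technology can bridge village-based and urban members, connect generations, and support cultural heritage on the community's own terms.
[37] Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective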
Tianyi Zhang,David Traum
Main category: cs.CL
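TL;DR: This paper critiques the reliance on surface-level similarity metrics such as BLEU and ROUGE for evaluating open-domain and personalized dialogue systems. Using LAPDOG as a case study, human and LLM judges reveal problems with dialogue histories, persona consistency, and response coherence, pointing to the need for cognitively grounded evaluation.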
Details
Motivation: Dialogue evaluation relies heavily on surface lexical-overlap metrics such as BLEU and ROUGE, which do not capture deeper properties of dialogue such as coherence, consistency, and shared understanding.
Method: Take the retrieval-augmented personalized dialogue framework LAPDOG as a case study. Human and LLM judges examine the integrity of dialogue histories, the consistency between retrieved stories and persona, and the coherence of generated responses; their judgments are then compared with lexical similarity metrics.
Result: Human and LLM judgments align closely with each other but diverge from BLEU/ROUGE-style scores, indicating that these metrics do not reflect actual dialogue quality. The study identifies corrupted dialogue histories, contradictions between persona and retrieved content, and incoherent responses.
Conclusion: Dialogue evaluation should move toward frameworks grounded in cognitive science that better reflect natural human communication, rather than relying on surface text similarity.
Abstract: In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.
[38] QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis
Yutong Wu,Chenrui Cao,Pengwei Jin,Di Huang,Rui Zhang,Xishan Zhang,Zidong Du,Qi Guo,Xing Hu
Main category: cs.CL
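TL;DR: This paper proposes a data synthesis framework to improve natural-language-to-SystemVerilog-assertion (NL2SVA) generation. Open-source RTL code guides LLMs to generate realistic SVAs, and bidirectional translation is used to select semantically equivalent samples. The CodeV-SVA-14B model trained on this data matches or exceeds advanced LLMs such as GPT-5 and DeepSeek-R1 on the reported benchmarks.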
Details
Motivation: General-purpose LLMs perform poorly on NL2SVA, mainly because high-quality real-world SVA corpora are scarce and there is no reliable way to determine semantic equivalence between natural language and SVA.
Method: A data synthesis framework with three parts: (1) large-scale open-source RTL code guides LLMs to generate SVAs close to real-world use; (2) bidirectional translation between natural language and SVA serves as a data selection step for semantic equivalence; (3) the synthesized data is used to train the CodeV-SVA series of SVA generation models.
Result: CodeV-SVA-14B achieves Func.@1 of 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine, matching or exceeding advanced LLMs such as GPT-5 and DeepSeek-R1.
Conclusion: The framework mitigates data scarcity and the semantic-alignment problem for SVA, and supports the feasibility of training specialized models for specific hardware verification tasks.
Abstract: SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.
[39] Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring
Weixin Guan,Liang Li,Jiapeng Liu,Bing Li,Peng Fu,Chengyang Fang,Xiaoshuai Hao,Can Ma,Weiping Wang
Main category: cs.CL
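TL;DR: This paper proposes an early-exit method tightly coupled with the native reasoning process of large reasoning language models (LRLMs). It monitors path deviation signaled by high-entropy transition tokens to dynamically terminate redundant reasoning ("overthinking"), improving inference efficiency without degrading performance.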
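To make the monitoring idea concrete, here is a minimal sketch of a sliding-window monitor that tracks how often high-entropy transition tokens occur and signals an early exit when they become frequent. The entry does not define the path deviation index, so the window-share formula, the transition word list, the thresholds, and the `DeviationMonitor` class are all hypothetical.

```python
from collections import deque

# Assumed list of transition words; the paper's actual token set is not given here.
TRANSITION_WORDS = {"wait", "but", "alternatively", "however", "hmm"}


class DeviationMonitor:
    """Tracks the share of high-entropy transition tokens in a sliding window."""

    def __init__(self, window=50, entropy_threshold=2.0, exit_threshold=0.1):
        self.flags = deque(maxlen=window)      # one boolean per recent token
        self.entropy_threshold = entropy_threshold
        self.exit_threshold = exit_threshold

    def index(self):
        """Share of recent tokens that were high-entropy transition tokens."""
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

    def update(self, token, entropy):
        """Record one generated token; return True when reasoning should stop."""
        hit = token.lower() in TRANSITION_WORDS and entropy >= self.entropy_threshold
        self.flags.append(hit)
        window_full = len(self.flags) == self.flags.maxlen
        return window_full and self.index() >= self.exit_threshold


# Toy stream of (token, next-token entropy) pairs with made-up values.
monitor = DeviationMonitor(window=4, entropy_threshold=1.5, exit_threshold=0.5)
stream = [("so", 0.2), ("wait", 2.1), ("the", 0.1), ("but", 1.9), ("wait", 2.4)]
decisions = [monitor.update(token, entropy) for token, entropy in stream]
print(decisions)
```

Because the monitor only reads per-token entropies that decoding already produces, a design like this needs no proxy model and no switch to probe-answer generation, which matches the constraints the entry describes.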
Details
Motivation: Existing early-exit methods incur extra training overhead, limit inference throughput, or hurt performance through over-truncation. The authors observe that overthinking is often accompanied by deviation from the correct reasoning path and by high-entropy transition tokens, which motivates a new monitoring metric.
Method: An early-exit method based on a path deviation index, which uses the frequency of high-entropy transition tokens as its monitoring signal. It needs no additional proxy model and no frequent switching between reasoning and probe-answer generation, and is embedded directly in the native reasoning process.
Result: Across multiple benchmarks and LRLMs of different types and scales, the method delivers the largest performance improvement over vanilla CoT among the early-exit methods compared, while also improving inference efficiency.
Conclusion: The path deviation index is an effective, lightweight overthinking detector that fits naturally into the model's reasoning process, offering a new approach to efficient and robust reasoning.
Abstract: Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies are proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning. However, current early-exit methods either introduce extra training overhead by relying on proxy models or limit inference throughput due to the frequent content switching between reasoning and generating probing answers. Moreover, most early-exit methods harm LRLMs performance due to over-truncation. Our insight stems from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, which is frequently accompanied by high-entropy transition tokens. Given this, we propose an early-exit method deeply coupled with the native reasoning process, which leverages the path deviation index as a dedicated monitoring metric for the frequent occurrence of high-entropy transition tokens to dynamically detect and terminate overthinking trajectories. We conduct experiments across multiple benchmarks using LRLMs of different types and scales, and the results indicate that our method delivers the largest performance improvement over vanilla CoT compared to existing early-exit methods.
[40] Automatic Inter-document Multi-hop Scientific QA Generation
Seungmin Lee,Dongha Kim,Yuni Jeon,Junyoung Koh,Min Song
Main category: cs.CL
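TL;DR: This paper proposes AIM-SciQA, a framework for automatically generating multi-document, multi-hop scientific QA datasets (IM-SciQA), addressing the neglect of inter-document reasoning in existing single-document factoid QA generation. The framework extracts single-hop QAs with LLMs, builds cross-document relations through semantic alignment, and selectively uses citation information; validation indicates high factual consistency and good discriminative power.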
Details
Motivation: Existing automatic scientific question generation focuses on single-document factoid QA and overlooks the inter-document reasoning that is crucial for scientific understanding.
Method: AIM-SciQA uses LLMs with machine reading comprehension to extract single-hop QAs, builds cross-document relations via embedding-based semantic alignment, and selectively incorporates citation information. Applied to 8,211 PubMed Central papers, it yields the IM-SciQA dataset, which is further extended to a citation-guided variant, CIM-SciQA.
Result: The pipeline produces 411,409 single-hop and 13,672 multi-hop scientific QAs. Human and automatic validation indicate high factual consistency, experiments show that the dataset differentiates reasoning capabilities across retrieval and QA stages, and CIM-SciQA achieves performance comparable to the Oracle setting.
Conclusion: IM-SciQA is a realistic, interpretable benchmark for retrieval-augmented scientific reasoning. It supports the feasibility of multi-document, multi-hop QA generation, and CIM-SciQA reinforces the dataset's validity and generality.
Abstract: Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset's validity and generality.
[41] MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering
Shaowei Guan,Yu Zhai,Hin Chi Kwok,Jiawei Du,Xinyu Feng,Jing Li,Harry Qin,Vivian Hui
Main category: cs.CL
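TL;DR: This paper presents MedPriv-Bench, the first benchmark to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering, and shows a pervasive privacy-utility trade-off for current LLMs in medical RAG settings.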
Details
Motivation: Medical AI evaluation focuses heavily on accuracy and ignores contextual leakage, a newer privacy risk, even though regulations such as HIPAA and GDPR require strict protection of patient privacy. A dedicated benchmark that evaluates privacy and utility together is needed.
Method: A multi-agent, human-in-the-loop synthesis pipeline generates sensitive medical contexts and clinical questions. A pre-trained RoBERTa-NLI model serves as an automated judge to quantify data leakage and is checked against human expert assessments.
Result: The automated judge agrees with human experts 85.9% of the time on average. An evaluation of 9 representative LLMs shows a pervasive privacy-utility trade-off.
Conclusion: Medical AI systems should be validated for both safety and efficacy with domain-specific benchmarks such as MedPriv-Bench in order to meet the requirements of privacy-sensitive settings.
Abstract: Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.
[42] SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustered Merging
Shunlong Wu,Hai Lin,Shaoshen Chen,Tingwei Lu,Yongqin Zeng,Shaoxiong Zhan,Hai-Tao Zheng,Hong-Gee Kim
Main category: cs.CL
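TL;DR: This paper proposes SemantiCache, a KV cache compression framework built on the semantic hierarchy of language. Through semantic chunking, greedy seed-based clustering, and a proportional attention mechanism, it substantially speeds up inference and reduces memory while maintaining model performance.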
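The three steps named in this entry (delimiter-based chunking, greedy seed-based clustering, merging with proportional attention) can be sketched in plain Python as follows. This is an illustrative reading, not the authors' implementation: the delimiter set, the cosine threshold, mean-merging, and the log-size attention bias (an idea borrowed from token-merging work) are all assumptions.

```python
import math

DELIMITERS = {".", ",", ";", "!", "?", "\n"}   # assumed delimiter set


def split_chunks(tokens):
    """Split token indices into chunks, each ending at a delimiter token."""
    chunks, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if tok in DELIMITERS:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_seed_clusters(indices, keys, threshold=0.9):
    """Each token joins the first seed it resembles, else it becomes a new seed."""
    clusters = []                       # element 0 of each cluster is its seed
    for i in indices:
        for cluster in clusters:
            if cosine(keys[i], keys[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def merge(cluster, vectors):
    """Mean vector of a cluster, plus log(size) as an additive attention-logit
    bias so the merged entry counts like `size` tokens in the softmax."""
    dim = len(vectors[cluster[0]])
    mean = [sum(vectors[i][d] for i in cluster) / len(cluster) for d in range(dim)]
    return mean, math.log(len(cluster))


# Toy example with made-up 2-d key vectors.
tokens = ["The", "cat", "sat", ".", "It", "slept", "."]
keys = [[1, 0], [0.99, 0.1], [0, 1], [0, 1], [1, 0], [1, 0.05], [0, 1]]
chunks = split_chunks(tokens)
clusters = greedy_seed_clusters(chunks[0], keys)
merged = [merge(c, keys) for c in clusters]
print(chunks, clusters)
```

Clustering only within a chunk keeps merged entries inside one delimiter-bounded unit, which is the property the entry credits with avoiding semantic fragmentation.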
Details
Motivation: Existing KV cache compression methods operate on discrete tokens or non-semantic chunks, which tends to fragment semantics and causes irreversible information loss and performance degradation.
Method: SemantiCache (1) partitions the cache into semantically coherent chunks at natural semantic boundaries (delimiters); (2) within each chunk, groups tokens into semantic clusters with the Greedy Seed-Based Clustering (GSC) algorithm; (3) merges the clusters into semantic cores and applies a Proportional Attention mechanism to rebalance their attention contributions.
Result: Across multiple benchmarks and models, SemantiCache accelerates the decoding stage of inference by up to 2.61x and substantially reduces memory footprint, with performance comparable to the original model.
Conclusion: Semantically aligned compression mitigates the semantic fragmentation seen in earlier KV cache compression methods and offers a new approach to efficient LLM inference.
Abstract: Existing KV cache compression methods generally operate on discrete tokens or non-semantic chunks. However, such approaches often lead to semantic fragmentation, where linguistically coherent units are disrupted, causing irreversible information loss and degradation in model performance. To address this, we introduce SemantiCache, a novel compression framework that preserves semantic integrity by aligning the compression process with the semantic hierarchical nature of language. Specifically, we first partition the cache into semantically coherent chunks by delimiters, which are natural semantic boundaries. Within each chunk, we introduce a computationally efficient Greedy Seed-Based Clustering (GSC) algorithm to group tokens into semantic clusters. These clusters are further merged into semantic cores, enhanced by a Proportional Attention mechanism that rebalances the reduced attention contributions of the merged tokens. Extensive experiments across diverse benchmarks and models demonstrate that SemantiCache accelerates the decoding stage of inference by up to 2.61 times and substantially reduces memory footprint, while maintaining performance comparable to the original model.
[43] Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models
Yixuan Tang,Yi Yang
Main category: cs.CL
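TL;DR: This paper proposes Delta-Consistent Scoring (DCS), an annotation-free framework that uses LLM representations to jointly model absolute stance and the relative stance shift between consecutive meetings, producing temporally coherent monetary-policy stance scores. It outperforms the compared baselines on classification accuracy and yields economically meaningful scores.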
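One plausible form of the delta-consistency idea is a penalty on the gap between the change in absolute scores and the predicted relative shift for each pair of consecutive meetings. The entry does not give the functional form, so the squared-error loss and the name `delta_consistency_loss` below are assumptions for illustration only.

```python
def delta_consistency_loss(abs_scores, rel_shifts):
    """Mean squared gap between score differences and predicted shifts.

    abs_scores[t]     : absolute stance score of meeting t
    rel_shifts[t - 1] : predicted shift from meeting t-1 to meeting t
    """
    if len(rel_shifts) != len(abs_scores) - 1:
        raise ValueError("need one shift per consecutive pair of meetings")
    gaps = [
        (abs_scores[t] - abs_scores[t - 1]) - rel_shifts[t - 1]
        for t in range(1, len(abs_scores))
    ]
    return sum(g * g for g in gaps) / len(gaps)


# Made-up scores for three consecutive meetings (higher = more hawkish).
scores = [0.1, 0.4, 0.2]
consistent = delta_consistency_loss(scores, [0.3, -0.2])    # shifts match the differences
inconsistent = delta_consistency_loss(scores, [-0.3, 0.2])  # shifts point the wrong way
print(consistent, inconsistent)
```

The loss is near zero when the shift predictions agree with the differences in absolute scores and grows when they disagree, which is how consecutive meetings can act as a source of self-supervision without hawkish or dovish labels.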
Details
Motivation: Existing stance detection classifies each FOMC statement in isolation, ignoring that market reactions depend on relative shifts in stance. An automatic method is needed that captures this temporal relativity without manual annotation.
Method: DCS extracts text representations from a frozen LLM, learns an absolute stance score for each statement and a relative shift score between consecutive statements, and ties the two together with a delta-consistency objective, giving self-supervised temporal modeling.
Result: Across four LLM backbones, DCS outperforms supervised probes and LLM-as-judge baselines, with up to 71.1% accuracy on sentence-level classification. Meeting-level scores correlate strongly with inflation indicators and are significantly associated with Treasury yield movements.
Conclusion: LLM representations encode monetary-policy signals that can be recovered through relative temporal structure. DCS shows that a stance measure with both accuracy and economic interpretability can be built without manual labels.
Abstract: Federal Open Market Committee (FOMC) statements are a major source of monetary-policy information, and even subtle changes in their wording can move global financial markets. A central task is therefore to measure the hawkish-dovish stance conveyed in these texts. Existing approaches typically treat stance detection as a standard classification problem, labeling each statement in isolation. However, the interpretation of monetary-policy communication is inherently relative: market reactions depend not only on the tone of a statement, but also on how that tone shifts across meetings. We introduce Delta-Consistent Scoring (DCS), an annotation-free framework that maps frozen large language model (LLM) representations to continuous stance scores by jointly modeling absolute stance and relative inter-meeting shifts. Rather than relying on manual hawkish-dovish labels, DCS uses consecutive meetings as a source of self-supervision. It learns an absolute stance score for each statement and a relative shift score between consecutive statements. A delta-consistency objective encourages changes in absolute scores to align with the relative shifts. This allows DCS to recover a temporally coherent stance trajectory without manual labels. Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish-dovish classification. The resulting meeting-level scores are also economically meaningful: they correlate strongly with inflation indicators and are significantly associated with Treasury yield movements. Overall, the results suggest that LLM representations encode monetary-policy signals that can be recovered through relative temporal structure.
[44] Motivation in Large Language Models
Omer Nahum,Asael Sklar,Ariel Goldstein,Roi Reichart
Main category: cs.CL
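TL;DR: This paper examines whether large language models (LLMs) exhibit something like human motivation. Self-reported motivation levels relate systematically to behavior, task type, and external interventions, suggesting that motivation can serve as a unifying framework for understanding LLM behavior.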
Details
Motivation: Investigate whether LLMs exhibit a psychological construct resembling human motivation, and what role it plays in behavior, decisions, and task performance.
Method: Experimentally analyze the motivation levels LLMs self-report, how these reports relate to behavioral measures (choices, effort, performance), and how external factors affect them.
Result: LLMs show structured, consistent motivational patterns: self-reports align with behavior, vary across task types, and can be modulated by external manipulations.
Conclusion: Motivation is a coherent construct that systematically organizes LLM behavior, with dynamics that resemble those documented in human psychology.
Abstract: Motivation is a central driver of human behavior, shaping decisions, goals, and task performance. As large language models (LLMs) become increasingly aligned with human preferences, we ask whether they exhibit something akin to motivation. We examine whether LLMs "report" varying levels of motivation, how these reports relate to their behavior, and whether external factors can influence them. Our experiments reveal consistent and structured patterns that echo human psychology: self-reported motivation aligns with different behavioral signatures, varies across task types, and can be modulated by external manipulations. These findings demonstrate that motivation is a coherent organizing construct for LLM behavior, systematically linking reports, choices, effort, and performance, and revealing motivational dynamics that resemble those documented in human psychology. This perspective deepens our understanding of model behavior and its connection to human-inspired concepts.
[45] Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
Suvadeep Hajra,Palash Nandi,Tanmoy Chakraborty
Main category: cs.CL
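TL;DR: This paper proposes PDPS, a method that exposes safety failures in LLMs efficiently by sampling diverse responses in the output space. It matches the attack success of large-scale IID sampling at a fraction of the compute and improves jailbreak success rates when the response budget is limited.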
Details
Motivation: Safety tuning methods such as SFT and RLHF improve LLM robustness but often suppress rather than eliminate unsafe behavior, leaving critical failures hidden in the long tail. Mainstream red-teaming focuses on adversarial prompt search in the input space and overlooks systematic exploration of the output space.
Method: Progressive Diverse Population Sampling (PDPS) combines stochastic token-level sampling with diversity-aware selection: for a fixed safety-critical prompt, it explores a large pool of candidate responses and retains a compact, semantically diverse subset.
Result: Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling at only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search, and its responses contain more, and more varied, unsafe outputs.
Conclusion: Diverse response generation in the output space is an effective way to expose hidden safety failures in LLMs, and PDPS offers an efficient, lower-cost approach to red-teaming.
Abstract: Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.
[46] Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains
Andrew Katz
Main category: cs.CL
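TL;DR: This paper proposes an extended evaluation framework based on surprisal (negative log probability) that generalizes binary grammaticality judgments to ordinal-scale classification and scoring tasks. By analyzing the model's surprisal curve over rating positions and its entropy, it evaluates language models' knowledge and uncertainty more efficiently and interpretably across several domains (e.g., causal judgments, figurative language detection).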
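The two quantities this entry relies on can be computed directly from the log probabilities a model assigns to each rating option. The sketch below uses made-up log probabilities for a 1-5 scale; the function names and the choice to renormalize over the rating options before taking entropy are assumptions.

```python
import math


def surprisal_curve(logprobs: dict[str, float]) -> dict[str, float]:
    """Surprisal (in nats) of each rating option: -log p(option | prompt)."""
    return {option: -lp for option, lp in logprobs.items()}


def scale_entropy(logprobs: dict[str, float]) -> float:
    """Shannon entropy (nats) after renormalizing over the rating options only."""
    log_z = math.log(sum(math.exp(lp) for lp in logprobs.values()))
    probs = [math.exp(lp - log_z) for lp in logprobs.values()]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical log probabilities for ratings 1-5 on two items.
confident = {"1": -6.0, "2": -4.5, "3": -2.5, "4": -0.2, "5": -2.0}
ambiguous = {"1": -1.7, "2": -1.6, "3": -1.5, "4": -1.6, "5": -1.7}

curve = surprisal_curve(confident)
preferred = min(curve, key=curve.get)   # position of the curve's minimum
print(preferred, scale_entropy(confident), scale_entropy(ambiguous))
```

The minimum of the surprisal curve gives the model's preferred rating, and a flat curve shows up as entropy close to the maximum of log(5) for a five-point scale, which is the signal used to flag ambiguous items.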
Details
Motivation: The minimal-pairs paradigm has largely been confined to binary grammaticality judgments, while standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations, and discards information about model uncertainty.
Method: Instead of generating text, measure the surprisal (negative log probability) the model assigns to each position on an ordinal rating scale (e.g., 1-5 or 1-9), build the resulting surprisal curve, and use entropy to quantify the model's uncertainty.
Result: In four domains (social-ecological-technological systems classification, causal statement identification in binary and scaled form, figurative language detection, and deductive qualitative coding), surprisal curves show clear minima near the expected rating positions, and entropy tended to distinguish ambiguous items from easier ones.
Conclusion: Surprisal-based ordinal evaluation is an efficient, generation-free approach that quantifies uncertainty and applies to measuring language knowledge across domains.
Abstract: The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic "surprise" (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model's preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.
[47] BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation
Zhaoyi Li,Xu Zhang,Xiaojun Wan
Main category: cs.CL
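TL;DR: This paper proposes BiT-MCTS, a framework that uses a "climax-first, bidirectional expansion" strategy to generate long-form linear fiction with coherent structure and thematic depth.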
Details
Motivation: LLMs struggle to guarantee global structure and narrative diversity when generating long-form linear fiction from open-ended themes.
Method: Motivated by Freytag's Pyramid, the method extracts a core dramatic conflict from the theme and generates an explicit climax. It then uses bidirectional Monte Carlo Tree Search (MCTS) to expand the plot backward (rising action, exposition) and forward (falling action, resolution), and finally realizes the full narrative from the refined outline.
Result: On a Chinese theme corpus, BiT-MCTS improves narrative coherence, plot structure, and thematic depth over strong baselines and enables substantially longer, more coherent stories, according to both automatic metrics and human judgments.
Conclusion: BiT-MCTS offers a structured, controllable approach to theme-driven long-form narrative generation.
Abstract: Generating long-form linear fiction from open-ended themes remains a major challenge for large language models, which frequently fail to guarantee global structure and narrative diversity when using premise-based or linear outlining approaches. We present BiT-MCTS, a theme-driven framework that operationalizes a "climax-first, bidirectional expansion" strategy motivated by Freytag's Pyramid. Given a theme, our method extracts a core dramatic conflict and generates an explicit climax, then employs a bidirectional Monte Carlo Tree Search (MCTS) to expand the plot backward (rising action, exposition) and forward (falling action, resolution) to produce a structured outline. A final generation stage realizes a complete narrative from the refined outline. We construct a Chinese theme corpus for evaluation and conduct extensive experiments across three contemporary LLM backbones. Results show that BiT-MCTS improves narrative coherence, plot structure, and thematic depth relative to strong baselines, while enabling substantially longer, more coherent stories according to automatic metrics and human judgments.
[48] Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature
Yuanchi Ma,Kaize Shi,Hui He,Zhihua Zhang,Zhongxiang Lei,Ziliang Qiu,Renfen Hu,Jiamou Liu
Main category: cs.CL
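TL;DR: This paper proposes a theoretical framework combining Proppian narratology and narrative functions to analyze structural homogenization in stories generated by large language models (LLMs). Focusing on Chinese web literature, it extends Propp's theory to 34 modern narrative functions and builds a human-annotated corpus. Experiments indicate that LLMs fail to comprehend the meaning of narrative functions, which leads to rigid, uniform narrative logic.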
Details
Motivation: Address the structural homogenization, repetitive plots, and stereotypical resolutions that are common in LLM narrative generation.
Method: Extend Propp's narrative theory by defining 34 narrative functions suited to modern Chinese web literature, construct a human-annotated corpus, and analyze the narrative structure of LLM-generated text.
Result: The main cause of homogenization in LLM-generated text is that the models do not correctly comprehend the meaning of narrative functions and instead follow rigid generation paradigms.
Conclusion: Improving the narrative diversity and logical depth of LLMs requires modeling and intervening at the level of narrative functions rather than surface-level textual patterns.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in narrative generation. However, they often produce structurally homogenized stories, frequently following repetitive arrangements and combinations of plot events along with stereotypical resolutions. In this paper, we propose a novel theoretical framework for analysis by incorporating Proppian narratology and narrative functions. This framework is used to analyze the composition of narrative texts generated by LLMs to uncover their underlying narrative logic. Taking Chinese web literature as our research focus, we extend Propp's narrative theory, defining 34 narrative functions suited to modern web narrative structures. We further construct a human-annotated corpus to support the analysis of narrative structures within LLM-generated text. Experiments reveal that the primary reasons for the singular narrative logic and severe homogenization in generated texts are that current LLMs are unable to correctly comprehend the meanings of narrative functions and instead adhere to rigid narrative generation paradigms.
[49] Echoes Across Centuries: Phonetic Signatures of Persian Poets
Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar
Main category: cs.CL
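TL;DR: This study analyzes phonetic texture in Persian poetry using a large corpus and six phonetic metrics. Treating phonetic texture as a literary-historical phenomenon, it reveals systematic differences between poets and change over time.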
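Two of the six metrics, vowel ratio and phoneme entropy, are simple enough to sketch for a single line once it has been converted to phonemes. The entry does not give the exact definitions, so the formulas below, the romanized six-vowel inventory, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

# Assumed romanized notation for the six Persian vowels (illustrative only).
VOWELS = {"a", "e", "o", "â", "i", "u"}


def vowel_ratio(phonemes):
    """Share of phonemes in the line that are vowels."""
    return sum(p in VOWELS for p in phonemes) / len(phonemes)


def phoneme_entropy(phonemes):
    """Shannon entropy (bits) of the phoneme distribution within one line."""
    counts = Counter(phonemes)
    total = len(phonemes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy input: the single word "bârân" as a phoneme list, not a real verse line.
line = ["b", "â", "r", "â", "n"]
ratio = vowel_ratio(line)
entropy = phoneme_entropy(line)
print(ratio, entropy)
```

Per-line values like these would then enter the statistical models as outcomes, with meter, poetic form, and line length as controls, as the entry describes.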
Details
Motivation: Treat phonetic texture in Persian poetry as a literary-historical phenomenon in its own right rather than as a by-product of meter or a feature used only for classification.
Method: The corpus contains 31,988 poems and 1,116,306 lines (mesras) by 83 poets, restricted to five major classical meters. Each line is converted to a phoneme representation and analyzed with six phonetic metrics (hardness, sonority, sibilance, vowel ratio, phoneme entropy, and consonant-cluster ratio). Statistical models control for meter, poetic form, and line length, followed by a multidimensional stylistic map and a diachronic analysis.
Result: Meter and form explain much of the phonetic variation but do not eliminate systematic differences between poets. Recurrent profiles include high-sonority lyric styles, hardness-driven rhetorical or epic styles, sibilant mystical contours, and high-entropy complex textures. Phonetic distributions shift across centuries, reflecting changes in genre prominence, literary institutions, and performance contexts.
Conclusion: Persian poetic sound is conditioned variation within shared prosodic structures, neither purely individual style nor simple metrical residue. The study establishes a corpus-scale framework for phonetic analysis of Persian poetry and shows how computational phonetics can contribute to literary-historical interpretation.
Abstract: This study examines phonetic texture in Persian poetry as a literary-historical phenomenon rather than a by-product of meter or a feature used only for classification. The analysis draws on a large corpus of 1,116,306 mesras from 31,988 poems written by 83 poets, restricted to five major classical meters to enable controlled comparison. Each line is converted into a grapheme-to-phoneme representation and analyzed using six phonetic metrics: hardness, sonority, sibilance, vowel ratio, phoneme entropy, and consonant-cluster ratio. Statistical models estimate poet-level differences while controlling for meter, poetic form, and line length. The results show that although meter and form explain a substantial portion of phonetic variation, they do not eliminate systematic differences between poets. Persian poetic sound therefore appears as conditioned variation within shared prosodic structures rather than as either purely individual style or simple metrical residue. A multidimensional stylistic map reveals several recurrent phonetic profiles, including high-sonority lyric styles, hardness-driven rhetorical or epic styles, sibilant mystical contours, and high-entropy complex textures. Historical analysis indicates that phonetic distributions shift across centuries, reflecting changes in genre prominence, literary institutions, and performance contexts rather than abrupt stylistic breaks. The study establishes a corpus-scale framework for phonetic analysis in Persian poetry and demonstrates how computational phonetics can contribute to literary-historical interpretation while remaining attentive to the formal structures that shape Persian verse.
[50] PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark
Mohammad Javad Ranjbar Kalahroodi,Mohammad Amini,Parmis Bathayan,Heshaam Faili,Azadeh Shakery
Main category: cs.CL
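TL;DR: This paper introduces PARSA-Bench, the first benchmark for Persian language and cultural audio understanding, covering 16 tasks and more than 8,000 samples. It shows that current large audio-language models perform very poorly on culturally grounded tasks such as Persian poetic meter, traditional music, and code-switching, with meter (vazn) detection close to random chance.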
Details
Motivation: Existing audio understanding benchmarks do not cover Persian-specific classical poetry, traditional music, or pervasive code-switching, so there is no tool for evaluating culturally specific audio understanding. Method: Build PARSA-Bench with 16 tasks (10 of them new) and over 8,000 samples spanning speech understanding, paralinguistic analysis, and cultural audio understanding; systematically evaluate a range of audio-language models and compare them with text-only baselines. Result: Text-only baselines consistently outperform audio models; all models perform near random chance on meter (vazn) detection, exposing a fundamental weakness in Persian prosodic perception. Conclusion: PARSA-Bench reveals severe limitations of current large audio-language models in culturally specific audio understanding, especially Persian meter, and underscores the need for models that actually model audio features rather than relying on transcription alone. Abstract: Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench
[51] Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs
Auksarapak Kietkajornrit,Jad Tarifi,Nima Asgharbeygi
Main category: cs.CL
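TL;DR: This paper proposes a modular framework that explicitly separates planning, fact retrieval, and answer synthesis. A lightweight student planner learns structured decompositions, improving both the accuracy and the inference efficiency of fact-seeking question answering.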
Details
Motivation: LLM-based fact-seeking question answering remains unreliable when answers depend on up-to-date or conflicting information, and retrieval-augmented or tool-augmented models often rely on implicit planning, which leads to inefficient tool use. Method: A modular framework in which a lightweight student planner is trained in a teacher-student setup to generate structured decompositions made of abstract reasoning steps and searchable fact requests; the supervision signal contains only planning traces and fact requests, with no answers or evidence; at inference, the planner produces the plan and prompt-engineered modules perform retrieval and answer synthesis. Result: On the highly challenging SEAL-0 benchmark, the method improves both accuracy and latency over monolithic reasoning models and prompt-based tool-augmented frameworks. Conclusion: Explicitly learned planning structures are essential for building reliable fact-seeking LLMs. Abstract: Fact-seeking question answering with large language models (LLMs) remains unreliable when answers depend on up-to-date or conflicting information. Although retrieval-augmented and tool-using LLMs reduce hallucinations, they often rely on implicit planning, leading to inefficient tool usage. We propose a modular framework that explicitly separates planning from factual retrieval and answer synthesis. A lightweight student planner is trained via a teacher-student framework to generate structured decompositions consisting of abstract reasoning steps and searchable fact requests. The supervision signals contain only planning traces and fact requests, without providing factual answers or retrieved evidence. At inference, the planner produces plans, while prompt-engineered modules perform retrieval and response synthesis. We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs. Results show that supervised planning improves both accuracy and latency compared to monolithic reasoning models and prompt-based tool-augmented frameworks, demonstrating that explicitly learned planning structures are essential for reliable fact-seeking LLMs.
[52] An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs
Qian Zhu,Xinnan Guo,Jingjing Huo,Jun Li,Pan Liu,Wenyan Yang,Wanqing Xu,Xuan Lin
Main category: cs.CL
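TL;DR: This paper presents INS-S1, a family of insurance-domain LLMs. Verifiable data synthesis and a progressive SFT-RL curriculum yield a highly compliant, low-hallucination vertical-domain alignment that preserves both domain performance and general capability.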
Details
Motivation: High-stakes vertical domains such as insurance require strict adherence to complex regulations and business logic with zero tolerance for hallucinations, but existing methods either trade general intelligence for domain expertise or lean heavily on RAG without intrinsic reasoning. Method: An end-to-end alignment paradigm with (1) a verifiable data synthesis system that builds hierarchical datasets for actuarial reasoning and compliance, and (2) a progressive supervised fine-tuning and reinforcement learning (SFT-RL) curriculum that combines dynamic data annealing with Verified Reasoning (RLVR) and AI Feedback (RLAIF), tuning data ratios and reward signals to enforce domain behavior and prevent catastrophic forgetting; the authors also release INSEva, a large-scale insurance benchmark (39k+ samples). Result: INS-S1 reaches SOTA on insurance tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro, while keeping top-tier general capabilities and lowering the hallucination rate to a record 0.6% (HHEM). Conclusion: Rigorous domain specialization does not have to come at the expense of general intelligence; a high-quality alignment paradigm can improve both together. Abstract: Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.
[53] AI Can Learn Scientific Taste
Jingqi Tong,Mingzhe Li,Hangcheng Li,Yongzhuo Yang,Yurong Mou,Weijie Ma,Zhiheng Xi,Hongji Chen,Xiaoran Liu,Qinyuan Cheng,Ming Zhang,Qiguang Chen,Weifeng Ge,Qipeng Guo,Tianlei Ying,Tianxiang Sun,Yining Zheng,Xinchi Chen,Jun Zhao,Ning Ding,Xuanjing Huang,Yugang Jiang,Xipeng Qiu
Main category: cs.CL
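TL;DR: This paper proposes Reinforcement Learning from Community Feedback (RLCF), a new paradigm intended to give AI a scientist-like "scientific taste", meaning the ability to judge and propose high-impact research ideas. A Scientific Judge model is trained for preference modeling and then used as the reward signal to train a Scientific Thinker policy model. Experiments show strong results across time, across fields, and on peer-review preference, which the authors describe as a key step toward human-level AI scientists.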
Details
Motivation: Existing work on AI scientists mostly focuses on improving execution ability, while "scientific taste" (the ability to judge and propose high-impact research ideas) has not been systematically explored. Method: The RLCF training paradigm: 1) train a Scientific Judge for preference modeling on 700K field- and time-matched pairs of high- and low-citation papers; 2) use that model as the reward function and train a Scientific Thinker policy model with reinforcement learning to generate high-potential research ideas. Result: Scientific Judge outperforms SOTA LLMs such as GPT-5.2 and Gemini 3 Pro on multiple metrics and generalizes to future years, unseen fields, and peer-review preference; Scientific Thinker generates research ideas with higher potential impact. Conclusion: AI can learn scientific taste, which is a key advance toward human-level AI scientists. Abstract: Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year tests, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
[54] Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows
Aditya Sharan,Sriram Hebbale,Dhruv Kumar
Main category: cs.CL
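TL;DR: This paper proposes the Infinite Problem Generator (IPG), an agentic framework built on a Formula-as-Code paradigm that generates verifiable, high-fidelity physics problem datasets such as ClassicalMechanicsV1. It also finds a strong linear correlation between formula count and verification code length, which gives a proxy-free way to quantify difficulty.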
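The Formula-as-Code idea summarized above can be illustrated with a small sketch: each formula is an executable function, a problem's solution is a composition of those functions, and the answer is computed rather than generated as text. The function names and the sample problem are hypothetical and are not taken from the IPG pipeline or the ClassicalMechanicsV1 dataset. The `r_squared` helper shows how a formula-count versus code-length relationship could be checked with a least-squares line fit.

```python
def final_velocity(u, a, t):
    """Kinematics: v = u + a * t."""
    return u + a * t

def kinetic_energy(m, v):
    """Kinetic energy: KE = 0.5 * m * v**2."""
    return 0.5 * m * v ** 2

def solve_problem(m, u, a, t):
    """Two-formula problem: energy of a mass after uniform acceleration."""
    v = final_velocity(u, a, t)
    return kinetic_energy(m, v)

def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

# A 2 kg mass starts at rest and accelerates at 3 m/s^2 for 4 s.
answer_joules = solve_problem(m=2.0, u=0.0, a=3.0, t=4.0)
```

Because the solution is a program, a generated problem statement can be accepted only if executing its solver reproduces the stated answer.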
Details
Motivation: Large language models are limited on complex reasoning tasks by the scarcity of high-quality, verifiable training data; in fields like physics, conventional text augmentation tends to introduce hallucinations, and static benchmarks lack reasoning traces, so they cannot support fine-grained supervised fine-tuning. Method: The Infinite Problem Generator (IPG) encodes physics formulas as executable Python programs to ensure mathematical consistency; starting from expert seed problems, it generates new problems through symbolic composition and programmatic solving; it defines and validates a "Complexity Blueprint" (formula count vs. verification code length) as a difficulty measure. Result: The ClassicalMechanicsV1 dataset is released (1,335 problems derived from 165 expert seeds), covering 102 formulas with an average of 3.05 formulas per problem; formula count and verification code length are strongly linearly correlated ($R^2 \approx 0.95$); the full IPG pipeline, the dataset, and the evaluation report are open-sourced. Conclusion: IPG offers a controllable, verifiable, and scalable data synthesis paradigm for reasoning-intensive domains; the Complexity Blueprint establishes code complexity as a precise, proxy-free measure of problem difficulty and supports difficulty-controlled curriculum learning. Abstract: Training large language models for complex reasoning is bottlenecked by the scarcity of verifiable, high-quality data. In domains like physics, standard text augmentation often introduces hallucinations, while static benchmarks lack the reasoning traces required for fine-tuning. We introduce the Infinite Problem Generator (IPG), an agentic framework that synthesizes physics problems with guaranteed solvability through a Formula-as-Code paradigm. Unlike probabilistic text generation, IPG constructs solutions as executable Python programs, enforcing strict mathematical consistency. As a proof-of-concept, we release ClassicalMechanicsV1, a high-fidelity corpus of 1,335 classical mechanics problems expanded from 165 expert seeds. The corpus demonstrates high structural diversity, spanning 102 unique physical formulas with an average complexity of 3.05 formulas per problem. Furthermore, we identify a Complexity Blueprint, demonstrating a strong linear correlation ($R^2 \approx 0.95$) between formula count and verification code length. This relationship establishes code complexity as a precise, proxy-free metric for problem difficulty, enabling controllable curriculum generation. We release the full IPG pipeline, the ClassicalMechanicsV1 dataset, and our evaluation report to support reproducible research in reasoning-intensive domains.
[55] MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection
Arkadiusz Modzelewski,Witold Sosnowski,Eleni Papadopulos,Elisa Sartori,Tiziano Labruna,Giovanni Da San Martino,Adam Wierzbicki
Main category: cs.CL
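TL;DR: This paper presents MALINT, the first English disinformation dataset annotated for malicious intent, and uses it to benchmark a range of language models on intent classification. It further proposes intent-augmented reasoning, inspired by psychological inoculation theory, which folds malicious-intent analysis into LLM reasoning and improves zero-shot disinformation detection.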
Details
Motivation: Existing English disinformation datasets and studies rarely address the intent behind disinformation, even though intent is key to understanding and countering malicious spread. Method: Build the MALINT dataset with expert annotators; benchmark 12 language models on binary and multilabel intent classification; propose intent-based inoculation, which integrates intent analysis into LLM reasoning to weaken the persuasive force of disinformation. Result: Intent-augmented reasoning improves zero-shot detection across six disinformation datasets, five LLMs, and seven languages; the MALINT dataset is released. Conclusion: Malicious intent is a key dimension of disinformation detection that can be modeled; adding intent analysis improves detection performance and opens a path toward more robust and interpretable counter-disinformation systems. Abstract: The intentional creation and spread of disinformation poses a significant threat to public discourse. However, existing English datasets and research rarely address the intentionality behind the disinformation. This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent. We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks. Moreover, inspired by inoculation theory from psychology and communication studies, we investigate whether incorporating knowledge of malicious intent can improve disinformation detection. To this end, we propose intent-based inoculation, an intent-augmented reasoning for LLMs that integrates intent analysis to mitigate the persuasive impact of disinformation. Analysis on six disinformation datasets, five LLMs, and seven languages shows that intent-augmented reasoning improves zero-shot disinformation detection. To support research in intent-aware disinformation detection, we release the MALINT dataset with annotations from each annotation step.
[56] Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models
Deepon Halder,Angira Mukherjee
Main category: cs.CL
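TL;DR: This paper introduces Multilingual TinyStories, a large synthetic corpus of children's stories in 17 Indian languages, designed for training and evaluating small language models (SLMs) and built with a hybrid generation and filtering pipeline.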
Details
Motivation: Low-resource languages lack high-quality, coherent, and domain-appropriate training corpora, which holds back the development of robust language models. Method: A hybrid curation pipeline that uses the Sarvam-M language model and a new combinatorial prompt engineering framework for native-script generation, combined with the Google Translate API for large-scale cross-lingual expansion, followed by strict programmatic filtering to select high-quality stories. Result: The Multilingual TinyStories dataset contains 132,942 stories and over 93.9 million tokens across 17 Indian languages, all written in native scripts. Conclusion: The dataset is a foundational resource for training small language models in Indic languages, for multilingual modeling, and for transfer learning. Abstract: The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
[57] Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes
Deepon Halder,Raj Dabre
Main category: cs.CL
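TL;DR: This paper proposes Top-b, an adaptive decoding strategy whose dynamic bandwidth coefficient tracks the instantaneous Shannon entropy of the model's distribution, lowering generation entropy and decoding variance while keeping reasoning accuracy.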
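As a rough illustration of entropy-coupled truncation, the sketch below keeps every token whose probability lies within a relative band below the top token, and widens that band as the normalized Shannon entropy of the distribution grows. The specific coupling used here (threshold = p_max * (1 - H / log V)) and the name `top_b_filter` are assumptions made for illustration; the abstract does not state the paper's actual bandwidth formula.

```python
import math

def top_b_filter(probs):
    """Keep tokens inside an entropy-dependent relative band, then renormalize.

    ASSUMPTION: the band threshold is p_max * (1 - normalized_entropy).
    This coupling is a stand-in, not the formula from the Top-b paper.
    """
    vocab_size = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    normalized = entropy / math.log(vocab_size) if vocab_size > 1 else 0.0
    threshold = max(probs) * (1.0 - normalized)
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Low entropy (confident step): the band collapses onto the top token.
peaked = top_b_filter([0.97, 0.01, 0.01, 0.01])
# High entropy (open-ended step): the band admits every candidate.
flat = top_b_filter([0.25, 0.25, 0.25, 0.25])
```

Unlike a fixed k or p, the size of the candidate set here changes from step to step with the shape of the distribution, which is the behavior the entry contrasts with static Top-k and Top-p rules.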
Details
Motivation: Static truncation strategies such as Top-k and Top-p cannot adapt to the changing information density of natural language, so they struggle to serve both high-entropy creative generation and low-entropy logical reasoning. Method: Model generation as a trajectory through a relative probability manifold and propose Top-b (Adaptive Relative Band Sampling), in which the candidate set is governed by a dynamic bandwidth coefficient strictly coupled to the Shannon entropy of the current distribution; a theoretical analysis shows that Top-b acts as a variance-minimizing operator on the tail distribution. Result: On GPQA and GSM8K, Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy. Conclusion: Top-b approximates a self-regulating control system for autoregressive generation and offers a dynamic, information-theoretically motivated way to model decoding. Abstract: Probabilistic language generators are theoretically modeled as discrete stochastic processes, yet standard decoding strategies (Top-k, Top-p) impose static truncation rules that fail to accommodate the dynamic information density of natural language. This misalignment often forces a suboptimal trade-off: static bounds are either too restrictive for high-entropy creative generation or too permissive for low-entropy logical reasoning. In this work, we formalize the generation process as a trajectory through a relative probability manifold. We introduce Top-b (Adaptive Relative Band Sampling), a decoding strategy that regulates the candidate set via a dynamic bandwidth coefficient coupled strictly to the instantaneous Shannon entropy of the model's distribution. We provide a theoretical framework demonstrating that Top-b acts as a variance-minimizing operator on the tail distribution. Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating control system for autoregressive generation.
[58] Parameter-Efficient Quality Estimation via Frozen Recursive Models
Umar Abubacar,Roman Bauer,Diptesh Kanojia
Main category: cs.CL
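TL;DR: This paper studies whether Tiny Recursive Models (TRM) suit quality estimation (QE) for low-resource languages and finds that their recursive mechanism transfers poorly. Representation quality matters more than architecture: freezing pretrained embeddings (such as XLM-R) cuts trainable parameters by 37x without hurting performance, and on some languages the small model outperforms a much larger one.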
Details
Motivation: Test whether the recursive mechanism of Tiny Recursive Models (TRM) transfers to quality estimation (QE) for low-resource languages, and look for a more efficient, lightweight approach to QE. Method: A three-phase study on 8 low-resource language pairs that compares external iteration, internal recursion, embedding strategies (frozen vs. fine-tuned), and architectures (TRM vs. standard Transformer), and evaluates the trade-off between parameter efficiency and performance. Result: TRM's recursive mechanism does not suit QE; TRM-QE with frozen XLM-R embeddings reaches a Spearman correlation of 0.370, on par with the fine-tuned variant (0.369) and above a standard Transformer of the same depth (0.336); on Hindi and Tamil it outperforms MonoTransQuest (560M parameters) with 1/80 of the trainable parameters. Conclusion: For low-resource QE, recursive architectural design matters less than high-quality representations; frozen pretrained embeddings combined with weight sharing give high parameter efficiency with strong performance, which is practical for resource-constrained settings. Abstract: Tiny Recursive Models (TRM) achieve strong results on reasoning tasks through iterative refinement of a shared network. We investigate whether these recursive mechanisms transfer to Quality Estimation (QE) for low-resource languages using a three-phase methodology. Experiments on 8 language pairs from a low-resource QE dataset reveal three findings. First, TRM's recursive mechanisms do not transfer to QE. External iteration hurts performance, and internal recursion offers only narrow benefits. Second, representation quality dominates architectural choices. Third, frozen pretrained embeddings match fine-tuned performance while reducing trainable parameters by 37$\times$ (7M vs 262M). TRM-QE with frozen XLM-R embeddings achieves a Spearman's correlation of 0.370, matching fine-tuned variants (0.369) and outperforming an equivalent-depth standard transformer (0.336). On Hindi and Tamil, frozen TRM-QE outperforms MonoTransQuest (560M parameters) with 80$\times$ fewer trainable parameters, suggesting that weight sharing combined with frozen embeddings enables parameter efficiency for QE. The code is publicly available at https://github.com/surrey-nlp/TRMQE.
[59] $PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought
Shubhashis Roy Dipta,Daniel Bis,Kun Zhou,Lichao Wang,Benjamin Z. Yao,Chenlei Guo,Ruhi Sarikaya
Main category: cs.CL
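TL;DR: This paper proposes a multi-stage alignment method that lets an LLM recall and apply the relevant business rules at inference time without placing every policy in context. It also designs a Jaccard-based PolicyRecall reward and a hallucination penalty to improve GRPO training.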
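A Jaccard-based recall reward with a hallucination penalty could look like the sketch below. The function name, the policy identifiers, the definition of a hallucinated policy (one that is not in the known catalog), and the penalty weight are all assumptions for illustration; the abstract only states that the reward is based on the Jaccard score and is paired with a Hallucination Penalty.

```python
def policy_recall_reward(recalled, gold, catalog, penalty=0.5):
    """Jaccard overlap between recalled and gold policies, minus a penalty.

    ASSUMPTIONS: a recalled policy counts as hallucinated when it is absent
    from the catalog of real policies, and each one costs `penalty`.
    """
    recalled, gold = set(recalled), set(gold)
    union = recalled | gold
    jaccard = len(recalled & gold) / len(union) if union else 1.0
    hallucinated = recalled - set(catalog)
    return jaccard - penalty * len(hallucinated)

# Hypothetical policy identifiers.
catalog = {"refund_30d", "id_check", "no_cash"}
partial = policy_recall_reward({"refund_30d", "id_check"},
                               {"refund_30d", "no_cash"}, catalog)
invented = policy_recall_reward({"refund_30d", "made_up_rule"},
                                {"refund_30d"}, catalog)
```

In a GRPO-style setup, a scalar like this would be computed per sampled reasoning trace and compared within the group of samples for the same query.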
Details
Motivation: LLMs do well on tool-use tasks but struggle to follow complex, business-specific rules; placing every business policy in context causes high latency, wasted compute, and the "needle-in-the-haystack" problem, all of which hurt performance. Method: A multi-stage alignment method that teaches the model to recall and apply the relevant business policies on its own during chain-of-thought reasoning, together with a Jaccard-based PolicyRecall reward and a hallucination penalty for GRPO training. Result: The best model beats the baseline by 16 points and comparable in-context baselines of similar size by 3 points, while using 40% fewer words. Conclusion: The method improves adherence to business rules while balancing efficiency and accuracy, offering a practical solution for industry-facing conversational assistants. Abstract: Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the "needle-in-the-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
[60] Seamless Deception: Larger Language Models Are Better Knowledge Concealers
Dhananjay Ashok,Ruth-Ann Armstrong,Jonathan May
Main category: cs.CL
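TL;DR: This paper studies how to detect whether language models (LMs) are actively hiding harmful knowledge they have acquired. Classifiers that detect gradient-based or prompt-based concealment work on small models but fail on large models (over 70B parameters) and generalize poorly across architectures and topics, exposing a key limitation of black-box auditing.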
Details
Motivation: Language models may acquire harmful knowledge and deliberately feign ignorance under audit, so methods are needed to identify active concealment. Method: Train classifiers to detect concealment in language models, compare gradient-based and prompt-based concealment, and evaluate generalization across model sizes, architectures, and knowledge topics. Result: On small models the classifiers outperform human evaluators, but they do not reliably generalize to new architectures or new topics, and their performance degrades to chance on models with more than 70 billion parameters. Conclusion: Black-box auditing alone cannot reliably identify concealment in large models; more robust detection methods are needed. Abstract: Language Models (LMs) may acquire harmful knowledge, and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when a LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures and topics of hidden knowledge. Most concerningly, the identifiable traces associated with concealment become fainter as the models increase in scale, with the classifiers achieving no better than random performance on any model exceeding 70 billion parameters. Our results expose a key limitation in black-box-only auditing of LMs and highlight the need to develop robust methods to detect models that are actively hiding the knowledge they contain.
[61] Computational Analysis of Semantic Connections Between Herman Melville Reading and Writing
Nudrat Habib,Elisa Barney Smith,Steven Olsen Smith
Main category: cs.CL
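TL;DR: This study uses computational semantic similarity analysis to examine how Herman Melville's reading influenced his writing. It uses BERTScore to compare passages from his works with texts from his known library, treating precision, recall, and F1 as indicators of semantic alignment; the method recovers expert-confirmed similarities and surfaces new leads for qualitative study.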
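The segmentation and scoring steps in this entry can be sketched in a few lines. BERTScore computes precision, recall, and F1 by greedily matching token embeddings with cosine similarity; the sketch below reproduces that matching on toy two-dimensional vectors, which stand in for the contextual embeddings a real BERT model would supply. The function names are illustrative, and optional refinements of BERTScore such as idf weighting are left out.

```python
import math

def five_grams(tokens):
    """Split a token list into non-overlapping 5-grams, dropping any remainder."""
    return [tokens[i:i + 5] for i in range(0, len(tokens) - 4, 5)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(candidate, reference):
    """BERTScore-style precision, recall, and F1 over token embeddings."""
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy embeddings (assumption): two candidate tokens, one reference token.
p, r, f = greedy_match_scores([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
segments = five_grams(list(range(12)))
```

Reading the three scores side by side, rather than thresholding one of them, matches the entry's choice to treat them as indicators of possible alignment.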
Details
Motivation: Explore how Melville's reading shaped his literary writing and provide quantifiable computational support for the study of literary influence. Method: Using records of Melville's library, segment his works and the library texts at sentence level and 5-gram level, compute semantic similarity with BERTScore, and interpret precision, recall, and F1 as indicators of semantic alignment rather than applying hard thresholds to decide text reuse. Result: The method reproduces expert-identified instances of similarity and finds several new passages of potential influence that merit close reading by humanities scholars. Conclusion: Semantic similarity analysis is a useful and complementary computational aid for source and influence studies in literature. Abstract: This study investigates the potential influence of Herman Melville's reading on his own writings through computational semantic similarity analysis. Using documented records of books known to have been owned or read by Melville, we compare selected passages from his works with texts from his library. The methodology involves segmenting texts at both sentence level and non-overlapping 5-gram level, followed by similarity computation using BERTScore. Rather than applying fixed thresholds to determine reuse, we interpret precision, recall, and F1 scores as indicators of possible semantic alignment that may suggest literary influence. Experimental results demonstrate that the approach successfully captures expert-identified instances of similarity and highlights additional passages warranting further qualitative examination. The findings suggest that semantic similarity methods provide a useful computational framework for supporting source and influence studies in literary scholarship.
[62] Towards Next-Generation LLM Training: From the Data-Centric Perspective
Hao Liang,Zhengyang Zhao,Zhaoyang Han,Meiyi Qiang,Xiaochen Ma,Bohan Zeng,Qifeng Cai,Zhiyu Li,Linpeng Tang,Weinan E,Wentao Zhang
Main category: cs.CL
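TL;DR: This paper proposes two complementary research directions: an agent-based automatic data preparation system that supports automated workflow construction and scalable data management, and a unified data-model interaction training system that dynamically selects, mixes, and reweights data during training to improve data efficiency and model performance.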
Details
Motivation: LLM training is bottlenecked by inefficient data preparation and use: datasets are built with ad hoc scripts and there is no mature, agent-based automatic data preparation system; during training, data is usually consumed as a whole, without systematic selection, mixture optimization, or reweighting. Method: Two complementary proposals: 1) a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management; 2) a unified data-model interaction training system that supports dynamic data selection, mixing, and reweighting throughout training. Result: A structured research framework for LLM data engineering that identifies automated data preparation and dynamic data utilization as the two key technical paths and outlines open challenges and future directions. Conclusion: Automated, intelligent data preparation and dynamic, adaptive data utilization are key to more efficient and effective LLM training, and they call for an end-to-end system in which data and model work together. Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks and domains, with data playing a central role in enabling these advances. Despite this success, the preparation and effective utilization of the massive datasets required for LLM training remain major bottlenecks. In current practice, LLM training data is often constructed using ad hoc scripts, and there is still a lack of mature, agent-based data preparation systems that can automatically construct robust and reusable data workflows, thereby freeing data scientists from repetitive and error-prone engineering efforts. Moreover, once collected, datasets are often consumed largely in their entirety during training, without systematic mechanisms for data selection, mixture optimization, or reweighting. To address these limitations, we advocate two complementary research directions. First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management. Second, we argue for a unified data-model interaction training system in which data is dynamically selected, mixed, and reweighted throughout the training process, enabling more efficient, adaptive, and performance-aware data utilization. Finally, we discuss the remaining challenges and outline promising directions for future research and system development.
[63] Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning
Xinran Zhang
Main category: cs.CL
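TL;DR: This study examines how different supervision formats affect LoRA safety fine-tuning and finds that a non-identity supervision format (D) performs best across several models, challenging the identity-framing hypothesis.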
Details
Motivation: Test whether the way safety supervision is written matters more than its explicit identity content, and whether the identity-framing hypothesis holds for low-data LoRA fine-tuning. Method: Build four supervision formats from the same core safety rules (constitutional A, creed-style identity framing B, creed-style with a worldview-maintenance tail C, and a matched non-identity format D); run low-data LoRA safety fine-tuning on Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B; evaluate with HarmBench (dual-judge evaluation plus manual adjudication) and MMLU/ARC-Challenge. Result: The non-identity condition D achieves the highest refusal rate on all three model families (Llama 74.4%, Gemma 76.9%, Qwen 74.1%), well above creed-style B and the other conditions; capability tests show no performance loss. Conclusion: Explicit creed-style identity language is not necessary for the strongest safety gains; the design of the supervision format is more decisive than the identity content itself. Abstract: How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest group on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma, but remains substantially below D, yielding an overall descriptive ordering of $D > B > C \geq A > baseline$. This provides a bounded empirical challenge to a strong version of the identity-framing hypothesis: explicit creed-style identity language is not necessary for the strongest gains observed here. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.
[64] Learning Constituent Headedness
Zeyao Qi,Yige Chen,KyungTae Lim,Haihua Pan,Jungyeul Park
Main category: cs.CL
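TL;DR: This paper models constituent headedness as an explicit representational layer, learning head prediction under supervision from jointly aligned constituency and dependency annotations. On English and Chinese data it clearly outperforms traditional rule-based head derivation and improves both the fidelity of constituency-to-dependency conversion and transfer across resources and languages.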
Details
Motivation: Constituent heads are widely used in syntactic analysis, but constituency treebanks usually do not annotate them explicitly, and mainstream pipelines derive them with hand-written percolation rules, leaving no unified, learnable model. Method: Cast head identification as a supervised learning task whose training targets come from aligned constituency and dependency trees: the head of each constituent is defined as the head of the corresponding dependency span; models are trained on aligned English and Chinese corpora. Result: The models reach near-ceiling intrinsic accuracy and substantially outperform Collins-style rule-based percolation; predicted heads used for head-driven binarization give parsing accuracy comparable to the baseline; they improve the fidelity of deterministic constituency-to-dependency conversion and transfer across corpora and languages through simple label mapping. Conclusion: Treating headedness as an explicit, learnable representational layer is feasible and effective, improving performance as well as interpretability, consistency, and generalization. Abstract: Headedness is widely used as an organizing device in syntactic analysis, yet constituency treebanks rarely encode it explicitly and most processing pipelines recover it procedurally via percolation rules. We treat this notion of constituent headedness as an explicit representational layer and learn it as a supervised prediction task over aligned constituency and dependency annotations, inducing supervision by defining each constituent head as the dependency span head. On aligned English and Chinese data, the resulting models achieve near-ceiling intrinsic accuracy and substantially outperform Collins-style rule-based percolation. Predicted heads yield comparable parsing accuracy under head-driven binarization, consistent with the induced binary training targets being largely equivalent across head choices, while increasing the fidelity of deterministic constituency-to-dependency conversion and transferring across resources and languages under simple label-mapping interfaces.
[65] Towards Privacy-Preserving Machine Translation at the Inference Stage: A New Task and Benchmark
Wei Shao,Lemao Liu,Yinqiao Li,Guoping Huang,Shuming Shi,Linqi Song
Main category: cs.CL
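TL;DR: This paper proposes a new task, Privacy-Preserving Machine Translation (PPMT), which aims to protect private information in text, especially named entities, during translation model inference. It builds three benchmark test datasets, designs matching evaluation metrics, and proposes a series of benchmark methods.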
Details
Motivation: Online translation services send user text to the cloud and can leak sensitive information, yet privacy protection at the inference stage is largely unexplored in machine translation, with no clearly defined task, evaluation datasets, metrics, or benchmark methods. Method: Define the PPMT task, focused on protecting the privacy of named entities in text; build three benchmark test datasets, design dedicated evaluation metrics, and propose a series of benchmark methods. Result: A PPMT task framework comprising the definition, datasets, evaluation metrics, and initial benchmark methods, which gives privacy research in machine translation a new perspective and a foundation. Conclusion: The work fills a gap in inference-stage privacy protection for machine translation and provides a systematic starting point for further study. Abstract: Current online translation services require sending user text to cloud servers, posing a risk of privacy leakage when the text contains sensitive information. This risk hinders the application of online translation services in privacy-sensitive scenarios. One way to mitigate this risk for online translation services is introducing privacy protection mechanisms targeting the inference stage of translation models. However, compared to subfields of NLP like text classification and summarization, the machine translation research community has limited exploration of privacy protection during the inference stage. There is no clearly defined privacy protection task for the inference stage, no dedicated evaluation datasets and metrics, and no reference benchmark methods. The absence of these elements has seriously constrained researchers' in-depth exploration of this direction. To bridge this gap, this paper proposes a novel "Privacy-Preserving Machine Translation" (PPMT) task, aiming to protect the private information in text during the model inference stage. For this task, we constructed three benchmark test datasets, designed corresponding evaluation metrics, and proposed a series of benchmark methods as a starting point for this task. The definition of privacy is complex and diverse. Considering that named entities often contain a large amount of personal privacy and commercial secrets, we focus on protecting only the privacy of named entities in the text. We expect this research work will provide a new perspective and a solid foundation for the privacy protection problem in machine translation.
[66] Vietnamese Automatic Speech Recognition: A Revisit
Thi Vu,Linh The Nguyen,Dat Quoc Nguyen
Main category: cs.CL
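TL;DR: This paper proposes a general data aggregation and preprocessing pipeline for building high-quality ASR datasets from diverse and possibly noisy open-source data, and applies it to Vietnamese to build a unified, high-quality 500-hour dataset.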
Details
Motivation: Low-resource languages lack large-scale, high-quality, consistently annotated ASR datasets, which holds back the development of robust models. Method: A new, generalizable data aggregation and preprocessing pipeline with rigorous data cleaning, controls for diversity and balance, and extraction and integration of key features such as word-level timestamps. Result: A unified, high-quality 500-hour Vietnamese ASR dataset with word-level timestamps that supports training and evaluating SOTA models. Conclusion: The pipeline is general and can be extended to other low-resource languages, providing a reliable data foundation for improving their ASR performance. Abstract: Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large-scale, high-quality datasets. For low-resource languages, existing open-source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high-quality ASR datasets from diverse, potentially noisy, open-source sources. Our pipeline incorporates rigorous processing steps to ensure data diversity, balance, and the inclusion of crucial features like word-level timestamps. We demonstrate the effectiveness of our methodology by applying it to Vietnamese, resulting in a unified, high-quality 500-hour dataset that provides a foundation for training and evaluating state-of-the-art Vietnamese ASR systems. Our project page is available at https://github.com/qualcomm-ai-research/PhoASR.
[67] Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA
Renhao Pei,Siyao Peng,Verena Blaschke,Robert Litschko,Barbara Plank
Main category: cs.CL
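TL;DR: This paper builds a new question-answering dataset to assess how well large language models (LLMs) cover knowledge found only in local Wikipedia editions (such as Cantonese and Bavarian) and how reliable they are on it. LLMs perform poorly without context, while adding local Wikipedia lead sections or translations substantially improves performance, which highlights gaps in the cultural inclusivity and regional knowledge coverage of LLMs.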
Details
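Motivation: It is unknown how well current LLMs cover knowledge in low-resource local language varieties (such as Cantonese and Bavarian) or how reliable they are, especially under information asymmetry, where a fact appears in the local Wikipedia but not in the standard-language one. Method: Manually build a challenge QA dataset from local Wikipedia covering Cantonese/Mandarin and German/Bavarian; run ablations that provide context such as local Wikipedia lead sections and translations; analyze performance differences using topical and geographic annotations and stratified evaluation. Result: Relying on parametric knowledge alone, LLMs can hardly answer questions about information found only in local Wikipedia; adding local lead sections substantially improves accuracy; translating into a high-resource language improves it further; local Wikipedia editions are shown to carry both regional and global knowledge. Conclusion: Current LLMs have serious gaps in local-language knowledge coverage, and local knowledge sources such as local Wikipedia editions should be systematically incorporated into training and retrieval to improve cultural inclusivity and knowledge equity. Abstract: Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. Especially for local language varieties, there are large asymmetries, e.g., information in local Wikipedia that is absent from the standard variant. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on a local Wikipedia page but absent from its higher-resource counterpart, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical and geographic annotations, together with stratified evaluations, reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about inclusivity and cultural coverage of LLMs.
[68] The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments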
Elmira Salari,Maria Claudia Nunes Delfino,Hazem Amamou,José Victor de Souza,Shruti Kshirsagar,Alan Davoust,Anderson Avila
Main category: cs.CL
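TL;DR: This paper studies how retrieved ideological texts affect the outputs of large language models (LLMs) in the Retrieval-Augmented Generation (RAG) setting. The authors build an ideological corpus from academic literature on contested COVID-19 treatments, use Lexical Multidimensional Analysis (LMDA) to identify ideological dimensions, and design two types of prompts to elicit LLM answers. The retrieved texts clearly shift the ideological leaning of LLM outputs, and an enhanced prompt that adds LMDA descriptions strengthens the effect.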
Details
Motivation: Although ideology in LLMs is attracting growing attention, how the ideology carried by retrieved texts affects model outputs under Retrieval-Augmented Generation (RAG) has not been systematically studied. Method: The authors build an ideological corpus of 1,117 academic articles on COVID-19 treatments, use Lexical Multidimensional Analysis (LMDA) to identify three ideological dimensions in it, and drive LLM answers with two types of contextual prompts (question plus ideological texts; question plus ideological texts plus LMDA descriptions). Cosine similarity measures how closely LLM responses align with the reference ideological texts at the lexical and semantic levels. Result: Answers generated from retrieved ideological texts lean clearly toward the ideology of those texts, and the enhanced prompt with LMDA descriptions increases this alignment further. Conclusion: External knowledge retrieved in a RAG system can carry and transmit ideological bias, and can even be exploited maliciously, so ideological discourse must be actively identified and managed within the RAG framework to reduce unintended bias and the risk of manipulation. Abstract: This paper studies the impact of retrieved ideological texts on the outputs of large language models (LLMs). While interest in understanding ideology in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideologically loaded texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify the ideologies within the corpus. LLMs are tasked with answering questions derived from three identified ideological dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideological texts; and the second contains the question, ideological texts, and LMDA descriptions. Ideological alignment between reference ideological texts and LLMs' responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that LLMs' responses based on retrieved ideological texts are more aligned with the ideology encountered in the external knowledge, with the enhanced prompt further influencing LLMs' outputs. 
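The lexical side of the cosine-similarity alignment measure mentioned above can be illustrated with a minimal sketch. It assumes a plain bag-of-words representation with whitespace tokenization; the abstract does not specify the paper's actual lexical or semantic representations, and the function names and example strings below are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased, whitespace-tokenized term counts (an illustrative choice)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical reference text and model response.
reference = bag_of_words("the treatment reduces symptoms in most patients")
response = bag_of_words("the treatment reduces symptoms")
score = cosine_similarity(reference, response)
```

A semantic variant would apply the same cosine formula to dense sentence embeddings instead of term counts; which embedding model to use is left open here.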
Our findings highlight the importance of identifying ideological discourses within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of malicious manipulation of such models.
[69] ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations
Hankun Kang,Xin Miao,Jianhao Chen,Jintao Wen,Mayi Xu,Weiyu Zhang,Wenpeng Lu,Tieyun Qian
Main category: cs.CL
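TL;DR: This paper proposes ContiGuard, the first framework for continual toxicity detection on text with time-evolving perturbations. It combines an LLM-powered semantic enriching strategy with a discriminability-driven feature learning strategy so that the detector keeps adapting to, and stays robust against, evolving perturbations.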
Details
Motivation: Traditional toxicity detectors are static and cannot keep up with the evasive perturbations that malicious users continually evolve. Perturbations also introduce noise that harms semantic comprehension and the learning of critical features, which makes continual learning harder. Method: ContiGuard combines (1) an LLM-powered semantic enriching strategy that dynamically injects meaning and toxicity clues extracted by an LLM to improve comprehension, and (2) a discriminability-driven feature learning strategy that strengthens discriminative features and suppresses non-critical ones to build a robust classification boundary. Result: ContiGuard clearly outperforms baseline methods on continual toxicity detection, showing stronger semantic comprehension, robustness to perturbations, and long-term continual-learning stability. Conclusion: ContiGuard is the first effective continual-learning solution for toxicity detection under evolving perturbations, and it confirms that semantic enrichment and discriminative feature learning are key to a detector's ability to adapt. Abstract: Toxicity detection mitigates the dissemination of toxic content (e.g., hateful comments, posts, and messages within online social actions) to safeguard a healthy online social environment. However, malicious users persistently develop evasive perturbations to disguise toxic content and evade detectors. Traditional detectors or methods are static over time and are inadequate in addressing these evolving evasion tactics. Thus, continual learning emerges as a logical approach to dynamically update detection ability against evolving perturbations. Nevertheless, disparities across perturbations hinder the detector's continual learning on perturbed text. More importantly, perturbation-induced noise distorts semantics, degrading comprehension, and impairs critical feature learning, making detection sensitive to perturbations. Together these amplify the challenge of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection) to enable the detector to continually update its capability and maintain sustained resilience against evolving perturbations. Specifically, to boost comprehension, we present an LLM-powered semantic enriching strategy, where we dynamically incorporate possible meanings and toxicity-related clues extracted by an LLM into the perturbed text. 
To mitigate non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy, where we strengthen discriminative features while suppressing the less-discriminative ones to shape a robust classification boundary for detection...
[70] Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks
Zijian Yu,Kejun Xiao,Huaipeng Zhao,Tao Luo,Xiaoyi Zeng
Main category: cs.CL
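TL;DR: This paper proposes Shopping Companion, a framework that handles memory retrieval and shopping assistance jointly and introduces a dual-reward reinforcement learning strategy, substantially improving performance on e-commerce shopping tasks that require awareness of long-term preferences.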
Details
Motivation: LLM agents for e-commerce shopping face two challenges: there is no benchmark for long-term preference-aware shopping tasks, and end-to-end optimization is limited because preference identification and shopping assistance are treated as separate components. Method: The authors build a new long-term memory benchmark covering two shopping tasks over 1.2 million real-world products; propose Shopping Companion, a unified framework that jointly handles memory retrieval and shopping assistance and supports user intervention; and design a dual-reward reinforcement learning strategy with tool-wise rewards to cope with the sparse, discontinuous rewards of multi-turn interaction. Result: Even state-of-the-art models such as GPT-5 stay below a 70% success rate on the benchmark. The proposed lightweight LLM, trained with Shopping Companion, consistently beats strong baselines in both preference capture and task performance. Conclusion: A unified framework that models long-term preferences together with shopping decisions, plus a matching training strategy, improves the practicality and robustness of LLMs in complex e-commerce scenarios. Abstract: In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.
[71] Developing an English-Efik Corpus and Machine Translation System for Digitization Inclusion
Offiong Bassey Edet,Mbuotidem Sunday Awak,Emmanuel Oyo-Ita,Benjamin Okon Nyong,Ita Etim Bassey
Main category: cs.CL
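TL;DR: This study evaluates the mT5 and NLLB-200 models on English-Efik translation, a low-resource language pair, using a community-curated parallel corpus of 13,865 sentence pairs. NLLB-200 clearly outperforms mT5, which shows that practical translation systems for small languages are feasible and underlines the importance of inclusive data practices and culturally grounded evaluation.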
Details
Motivation: Low-resource languages such as Efik are largely absent from NLP. Some African languages have seen progress, but many indigenous languages remain overlooked and need technical support to protect linguistic diversity and cultural heritage. Method: Using a small, community-curated English-Efik parallel corpus (13,865 sentence pairs), the authors fine-tune two mainstream multilingual neural machine translation models, mT5 and NLLB-200, and evaluate translation in both directions with BLEU and chrF. Result: NLLB-200 reaches BLEU 26.64 (English to Efik) and 31.21 (Efik to English), with chrF 51.04 and 47.92 respectively, clearly ahead of mT5, which indicates better fluency and semantic fidelity in the low-resource setting. Conclusion: Even with a small community corpus, advanced multilingual models such as NLLB-200 can support low-resource translation effectively. The study calls for more inclusive data building and culturally sensitive evaluation to advance fair, sustainable NLP. Abstract: Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English-Efik translation, leveraging a small-scale, community-curated parallel corpus of 13,865 sentence pairs. We fine-tuned both the mT5 multilingual model and the NLLB-200 model on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English-Efik and 31.21 for Efik-English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.
[72] Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models
Han Zhang,Jiamin Su,Li liu
Main category: cs.CL
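TL;DR: This paper proposes Decision-Level Ordinal Modeling (DLOM), a framework that treats automated essay scoring (AES) as an explicit ordinal decision problem and so avoids the implicit scoring of conventional autoregressive generation. It adds a gated fusion module for multimodal AES (DLOM-GF) and a distance-aware regularizer for text-only AES (DLOM-DA). Experiments show gains over baselines on both multimodal and text-only datasets.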
Details
Motivation: Existing LLM-based AES methods model scoring as autoregressive token generation, so the decision is implicit and sensitive to how much each modality matters in multimodal input. They also lack explicit modeling of the ordinal structure of scores. Method: DLOM reuses the language model head to output logits directly over predefined score tokens, making scoring an explicit ordinal decision. DLOM-GF adds a gated fusion module for the multimodal setting, and DLOM-DA adds distance-aware regularization for the text-only setting. Result: On the multimodal EssayJudge dataset, DLOM beats a generation-based supervised fine-tuning baseline, and DLOM-GF improves further when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective and DLOM-DA outperforms strong baselines. Conclusion: Explicitly modeling the ordinal score space is more robust, interpretable, and adaptable than implicit generation, and DLOM offers a unified paradigm for unimodal and multimodal AES. Abstract: Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
[73] LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy
Jon-Paul Cacioli
Main category: cs.CL
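TL;DR: This paper analyzes the calibration of large language models (LLMs) through Signal Detection Theory (SDT). Conventional calibration metrics such as ECE conflate sensitivity with response bias, whereas SDT separates them. The study finds that temperature changes the generated answer itself as well as the confidence, so it cannot be treated simply as the criterion shift seen in human psychophysics. All models show unequal-variance evidence distributions, with instruction-tuned models more asymmetric than the base model. The parametric SDT analysis exposes differences between models that calibration metrics cannot distinguish.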
Details
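Motivation: Conventional calibration metrics such as Expected Calibration Error conflate a model's ability to discriminate (sensitivity) with its response tendency (bias). A finer-grained theoretical framework, such as Signal Detection Theory, is needed to separate the two and to understand how LLMs model uncertainty. Method: In a pre-registered design, three LLMs act as observers across 168,000 factual discrimination trials. The full parametric SDT framework is applied (unequal-variance model fitting, criterion estimation, and z-ROC analysis), and the study tests whether temperature is equivalent to the criterion shift induced by payoff manipulations in human psychophysics. Result: Temperature both raises sensitivity (AUC) and shifts the decision criterion, so it cannot be treated as a pure criterion adjustment. All models show unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models (0.52-0.63) more asymmetric than the base model (0.77-0.87) and human recognition memory (~0.80). The SDT decomposition shows that models with different sensitivity-bias combinations are indistinguishable under conventional calibration metrics. Conclusion: The full parametric SDT framework offers diagnostic power beyond conventional calibration metrics, reveals the dual effect of temperature on generation and confidence, and points to the need for evaluation paradigms adapted to the generative nature of LLMs. Abstract: Large language models (LLMs) are evaluated for calibration using metrics such as Expected Calibration Error that conflate two distinct components: the model's ability to discriminate correct from incorrect answers (sensitivity) and its tendency toward confident or cautious responding (bias). Signal Detection Theory (SDT) decomposes these components. While SDT-derived metrics such as AUROC are increasingly used, the full parametric framework - unequal-variance model fitting, criterion estimation, z-ROC analysis - has not been applied to LLMs as signal detectors. In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics. Critically, this analogy may break down because temperature changes the generated answer itself, not only the confidence assigned to it. Our results confirm the breakdown with temperature simultaneously increasing sensitivity (AUC) and shifting criterion. All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models showing more extreme asymmetry (0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80). 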
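The quantities named above (sensitivity, criterion, z-ROC slope) have standard textbook definitions, which the following minimal sketch computes. It is not the paper's fitting procedure: it assumes the simple equal-variance formulas for d' and the criterion, estimates the z-ROC slope by ordinary least squares, and uses hypothetical function names.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf  # probit transform: probability -> z-score


def dprime_and_criterion(hit_rate: float, fa_rate: float) -> tuple[float, float]:
    """Equal-variance SDT estimates from one (hit, false-alarm) pair."""
    z_hit, z_fa = _z(hit_rate), _z(fa_rate)
    d_prime = z_hit - z_fa             # sensitivity
    criterion = -0.5 * (z_hit + z_fa)  # bias; 0 means a neutral criterion
    return d_prime, criterion


def zroc_slope(points: list[tuple[float, float]]) -> float:
    """Least-squares slope of z(hit) on z(false alarm).

    points holds (false_alarm_rate, hit_rate) pairs, one per confidence
    threshold. A slope of 1 matches the equal-variance model; a slope
    below 1 indicates a wider signal than noise distribution.
    """
    xs = [_z(fa) for fa, _ in points]
    ys = [_z(hit) for _, hit in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical observer: 80% hits, 20% false alarms.
d_prime, criterion = dprime_and_criterion(0.8, 0.2)
```

Rates of exactly 0 or 1 have no finite z-score, so a real analysis would need a correction for extreme proportions before calling these functions.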
The SDT decomposition revealed that models occupying distinct positions in sensitivity-bias space could not be distinguished by calibration metrics alone, demonstrating that the full parametric framework provides diagnostic information unavailable from existing metrics.
[74] ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation
Yuzhe Shang,Pengzhi Gao,Yazheng Yang,Jiayao Ma,Wei Liu,Jian Luan,Jingsong Su
Main category: cs.CL
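TL;DR: This paper proposes ExPosST, a framework that uses explicit position allocation to resolve the positional mismatch of decoder-only LLMs in simultaneous machine translation. It balances inference efficiency, positional consistency, and model compatibility, and adds a policy-consistent fine-tuning method to align training with inference.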
Details
Motivation: Applying decoder-only LLMs to simultaneous machine translation creates a positional mismatch that makes it hard to have both inference efficiency and positional consistency. Existing methods depend on specific positional encodings or prompt designs and so lack generality, efficiency, and compatibility. Method: ExPosST is a general framework that (1) reserves fixed positional slots for incoming source tokens, enabling efficient decoding with a KV cache, and (2) introduces a policy-consistent fine-tuning strategy that aligns training with decoding behavior at inference time. Result: Experiments on multiple language pairs show that ExPosST supports simultaneous translation under a variety of policies while improving positional consistency and inference efficiency. Conclusion: ExPosST is a general, efficient, and broadly compatible framework for simultaneous machine translation that eases the trade-off caused by positional mismatch through explicit position allocation and policy-consistent fine-tuning. Abstract: Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.
[75] Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI
Jinhu Qi,Yifan Li,Minghao Zhao,Wentao Zhang,Zijian Zhang,Yaoman Li,Irwin King
Main category: cs.CL
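TL;DR: This paper proposes the Holographic Agent Assessment Framework (HAAF), a systematic evaluation framework that targets the distribution of real-world scenarios and covers multiple risk dimensions, addressing the fragmentation and lack of representativeness in current AI agent evaluation.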
Details
Motivation: Current AI agent evaluation is fragmented. It tests isolated capabilities (such as coding, hallucination resistance, jailbreak resistance, and tool use) and offers no representative assessment of an agent's overall trustworthiness in real socio-technical scenarios. Method: HAAF has four components: static cognitive and policy analysis, interactive sandbox simulation, social-ethical alignment assessment, and a distribution-aware representative sampling engine. A "Trustworthy Optimization Factory" drives iterative red-team and blue-team evaluation and hardening. Result: The result is an evaluation paradigm over a multidimensional scenario manifold spanning task types, tool interfaces, interaction dynamics, social contexts, and risk levels, with particular attention to rare but high-consequence tail risks that conventional benchmarks overlook. Conclusion: HAAF moves AI agent evaluation from isolated benchmark tests toward systematic trustworthiness verification over a representative scenario distribution, offering a new paradigm for safe and reliable deployment. Abstract: As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased authority poses greater risks of system misuse and operational failures. However, current evaluation practices remain fragmented, measuring isolated capabilities such as coding, hallucination, jailbreak resistance, or tool use in narrowly defined settings. We argue that the central limitation is not merely insufficient coverage of evaluation dimensions, but the lack of a principled notion of representativeness: an agent's trustworthiness should be assessed over a representative socio-technical scenario distribution rather than a collection of disconnected benchmark instances. To this end, we propose the Holographic Agent Assessment Framework (HAAF), a systematic evaluation paradigm that characterizes agent trustworthiness over a scenario manifold spanning task types, tool interfaces, interaction dynamics, social contexts, and risk levels. The framework integrates four complementary components: (i) static cognitive and policy analysis, (ii) interactive sandbox simulation, (iii) social-ethical alignment assessment, and (iv) a distribution-aware representative sampling engine that jointly optimizes coverage and risk sensitivity -- particularly for rare but high-consequence tail risks that conventional benchmarks systematically overlook. These components are connected through an iterative Trustworthy Optimization Factory. 
Through cycles of red-team probing and blue-team hardening, this paradigm progressively narrows the vulnerabilities to meet deployment standards, shifting agent evaluation from benchmark islands toward representative, real-world trustworthiness. Code and data for the illustrative instantiation are available at https://github.com/TonyQJH/haaf-pilot.
[76] OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora
Jeffrey Flynt
Main category: cs.CL
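TL;DR: This paper presents OrgForge, an open-source multi-agent simulation framework that generates synthetic organizational data with strict temporal structure and cross-document consistency, for evaluating retrieval-augmented generation (RAG) systems.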
Details
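Motivation: Real datasets such as Enron carry legal risk and demographic skew and lack structured ground truth, while purely LLM-generated synthetic data tends to hallucinate self-contradictory facts and so cannot support reliable evaluation. Method: OrgForge is a multi-agent framework in which a deterministic Python engine (the SimEvent bus) works alongside constrained LLMs. The LLMs generate only surface prose, constrained by validated proposals, and an actor-local clock keeps timestamps causally correct. Three graph-dynamic subsystems (stress propagation via betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) model organizational behavior independently of any LLM. The framework also includes causal chain tracking, a hybrid reciprocal-rank-fusion (RRF) recurrence detector, and an email routing engine with probabilistic drops. Result: Over an N-day simulation, OrgForge produces interleaved Slack messages, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared immutable event log, and supports building cross-document evidence graphs and identifying failure patterns. Conclusion: OrgForge provides a legally safe, controllable, temporally consistent, and cross-artifact verifiable source of synthetic data for RAG evaluation, and its modular design and MIT license make it easy for the community to extend and apply. Abstract: Evaluating retrieval-augmented generation (RAG) pipelines requires corpora where ground truth is knowable, temporally structured, and cross-artifact; these are properties that real-world datasets rarely provide cleanly. Existing resources such as the Enron corpus carry legal ambiguity, demographic skew, and no structured ground truth. Purely LLM-generated synthetic data solves the legal problem but introduces a subtler one: the generating model cannot be prevented from hallucinating facts that contradict themselves across documents. We present OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains a SimEvent ground truth bus; large language models generate only surface prose, constrained by validated proposals. An actor-local clock enforces causal timestamp correctness across all artifact types, eliminating the class of timeline inconsistencies that arise when timestamps are sampled independently per document. We formalize three graph-dynamic subsystems (stress propagation via betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) that govern organizational behavior independently of any LLM. Running a configurable N-day simulation, OrgForge produces interleaved Slack threads, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared, immutable event log. 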
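To make the "Dijkstra escalation routing" idea concrete, here is a minimal sketch of Dijkstra's shortest-path algorithm over a small organization graph. It is not OrgForge's code: the graph, the role names, the edge costs, and the function name `escalation_path` are all hypothetical, and the abstract does not say how OrgForge derives its edge weights.

```python
import heapq


def escalation_path(graph, source, target):
    """Dijkstra's shortest path over non-negative edge costs.

    graph maps each node to a dict {neighbor: cost}. Returns
    (total_cost, path), or (float("inf"), []) if target is unreachable.
    """
    best = {source: 0.0}
    previous = {}
    frontier = [(0.0, source)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(frontier, (new_cost, neighbor))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


# Hypothetical org graph: a lower cost stands for a closer working relationship.
org = {
    "engineer": {"tech_lead": 1.0, "manager": 4.0},
    "tech_lead": {"manager": 1.5, "director": 5.0},
    "manager": {"director": 1.0},
    "director": {},
}
cost, path = escalation_path(org, "engineer", "director")
```

In this toy graph the cheapest escalation route goes through every level of the hierarchy rather than skipping directly to the manager, because the direct edge is more expensive than the two shorter hops.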
We additionally describe a causal chain tracking subsystem that accumulates cross-artifact evidence graphs per incident, a hybrid reciprocal-rank-fusion recurrence detector for identifying repeated failure classes, and an inbound/outbound email engine that routes vendor alerts, customer complaints, and HR correspondence through gated causal chains with probabilistic drop simulation. OrgForge is available under the MIT license.
[77] Pretraining and Benchmarking Modern Encoders for Latvian
Arturs Znotins
Main category: cs.CL
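TL;DR: For Latvian, a low-resource language, this paper pretrains several dedicated encoder models based on the RoBERTa, DeBERTaV3, and ModernBERT architectures (including long-context variants) and validates them on a range of Latvian benchmarks. The lv-deberta-base model performs best, surpassing existing multilingual and monolingual baselines, and all models and evaluation resources are released.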
Details
Motivation: Low-resource languages such as Latvian are underrepresented in pretraining corpora and lack high-quality monolingual encoders, which limits their NLP development. Method: The authors pretrain a series of Latvian-specific encoder-only models based on the RoBERTa, DeBERTaV3, and ModernBERT architectures, including long-context variants, and evaluate them systematically on multiple Latvian diagnostic and linguistic benchmarks. Result: lv-deberta-base (111M parameters) performs best overall, ahead of larger multilingual baselines and earlier Latvian-specific models. Conclusion: Building high-performing, efficient dedicated encoders for Latvian is feasible and effective, and the released models and resources will support Latvian NLP research and applications. Abstract: Encoder-only transformers remain essential for practical NLP tasks. While recent advances in multilingual models have improved cross-lingual capabilities, low-resource languages such as Latvian remain underrepresented in pretraining corpora, and few monolingual Latvian encoders currently exist. We address this gap by pretraining a suite of Latvian-specific encoders based on RoBERTa, DeBERTaV3, and ModernBERT architectures, including long-context variants, and evaluating them across a diverse set of Latvian diagnostic and linguistic benchmarks. Our models are competitive with existing monolingual and multilingual encoders while benefiting from recent architectural and efficiency advances. Our best model, lv-deberta-base (111M parameters), achieves the strongest overall performance, outperforming larger multilingual baselines and prior Latvian-specific encoders. We release all pretrained models and evaluation resources to support further research and practical applications in Latvian NLP.
[78] Attention Residuals
Kimi Team,Guangyu Chen,Yu Zhang,Jianlin Su,Weixin Xu,Siyuan Pan,Yaoyu Wang,Yucheng Wang,Guanduo Chen,Bohong Yin,Yutian Chen,Junjie Yan,Ming Wei,Y. Zhang,Fanqing Meng,Chao Hong,Xiaotong Xie,Shaowei Liu,Enzhe Lu,Yunpeng Tai,Yanru Chen,Xin Men,Haiqing Guo,Y. Charles,Haoyu Lu,Lin Sui,Jinguo Zhu,Zaida Zhou,Weiran He,Weixiao Huang,Xinran Xu,Yuzhi Wang,Guokun Lai,Yulun Du,Yuxin Wu,Zhilin Yang,Xinyu Zhou
Main category: cs.CL
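TL;DR: This paper proposes Attention Residuals (AttnRes), which replaces the fixed-weight accumulation of standard residual connections with learnable, input-dependent softmax attention, easing the uncontrolled hidden-state growth and dilution of per-layer contributions seen with PreNorm. A Block AttnRes variant reduces memory overhead, and the approach is validated on the Kimi Linear architecture (48B total / 3B activated parameters), where it improves downstream performance.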
Details
Motivation: Standard residual connections with PreNorm accumulate layer outputs with fixed unit weights. Hidden-state magnitude therefore grows without control in deeper layers and each layer's contribution is diluted, which hurts training stability and expressive power. Method: Attention Residuals (AttnRes) aggregates the outputs of preceding layers dynamically with softmax attention. To cut overhead, Block AttnRes partitions layers into blocks and attends only over block-level representations. Cache-based pipeline communication and a two-phase computation strategy make training efficient. Result: The improvement is consistent across model sizes. On Kimi Linear (48B total / 3B activated parameters), pretrained on 1.4T tokens, AttnRes mitigates PreNorm dilution, yields more uniform output magnitudes and gradient distribution, and improves performance on all evaluated downstream tasks. Conclusion: AttnRes and its lightweight variant Block AttnRes are effective replacements for standard residual connections, sound in principle and practical to engineer, and they improve training stability and generalization in large models. Abstract: Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
[79] Interpretable Predictability-Based AI Text Detection: A Replication Study
Adam Skurla,Dominik Macko,Jakub Simko
Main category: cs.CL
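TL;DR: This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated text. By introducing newer multilingual models (such as Qwen, mGPT, and mDeBERTa-v3-base) and 26 document-level stylometric features, combined with SHAP analysis, it improves cross-lingual authorship attribution and stresses how much clear documentation matters for reproducibility and fair comparison.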
Details
Motivation: The original system is hard to reproduce because of differences in data splits, model availability, and implementation details. The study also aims to improve the performance and interpretability of machine-generated-text authorship attribution in multilingual settings. Method: After reproducing the original system, the authors replace GPT-2 with Qwen and mGPT for extracting probabilistic features, use mDeBERTa-v3-base for contextual representations, and apply one shared configuration to English and Spanish. They add 26 document-level stylometric features and use SHAP to analyze feature importance. Result: The added stylometric features improve performance on both subtasks in English and Spanish. The shared multilingual configuration matches or exceeds language-specific models, and SHAP analysis reveals the most influential features. Conclusion: Combining multilingual base models with hand-crafted stylometric features improves authorship attribution, a shared configuration improves generalization, and clear experimental documentation is essential for replication and fair evaluation. Abstract: This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated texts. First, we tried to reproduce the original results. Exact replication was not possible because of differences in data splits, model availability, and implementation details. Next, we tested newer multilingual language models and added 26 document-level stylometric features. We also applied SHAP analysis to examine which features influence the model's decisions. We replaced the original GPT-2 models with newer generative models such as Qwen and mGPT for computing probabilistic features. For contextual representations, we used mDeBERTa-v3-base and applied the same configuration to both English and Spanish. This allowed us to use one shared configuration for Subtask 1 and Subtask 2. Our experiments show that the additional stylometric features improve performance in both tasks and both languages. The multilingual configuration achieves results that are comparable to or better than language-specific models. The study also shows that clear documentation is important for reliable replication and fair comparison of systems.
[80] Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs
Disha Sheshanarayana,Rajat Subhra Pal,Manjira Sinha,Tirthankar Dasgupta
Main category: cs.CL
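TL;DR: AdaAnchor is an adaptive latent-space reasoning framework that decides how many iterations to run by monitoring the stability of its anchor vectors. It maintains or improves accuracy while substantially reducing the number of reasoning steps and generated tokens.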
Details
Motivation: Token-level CoT methods generate long intermediate reasoning chains and are therefore costly. Fixed-step latent reasoning methods need manual tuning and struggle to balance efficiency and accuracy across instances of different difficulty. Method: AdaAnchor (1) introduces learnable anchor vectors attached to the input and refines them through silent iterations, and (2) uses an adaptive halting mechanism based on the stability of anchor dynamics, allocating computation as needed under a shared maximum-step budget. Result: On three math word-problem benchmarks, accuracy improves by up to 5% over fixed-step latent reasoning while the average number of refinement steps drops by 48-60%. Compared with standard CoT, generated tokens drop by 92-93%. Conclusion: AdaAnchor offers a better accuracy-efficiency trade-off and is especially suited to applications that are sensitive to output length or limited in compute. Abstract: Token-level Chain-of-Thought (CoT) prompting has become a standard way to elicit multi-step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent-space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum-step budget. Our empirical evaluation across three mathematical word-problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed-step latent refinement while reducing average latent refinement steps by 48-60% under the same maximum-step budget. 
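The adaptive halting idea (stop refining once the anchors stop changing, within a shared step budget) can be sketched in a few lines. This is a toy illustration, not AdaAnchor itself: the update rule stands in for a model's latent refinement, and the stability test (largest coordinate change below a tolerance), the names, and the default values are all assumptions.

```python
def refine_until_stable(anchors, refine_step, max_steps=16, tol=1e-3):
    """Apply refine_step repeatedly; stop once the largest coordinate change
    between consecutive iterations falls below tol, or at max_steps."""
    steps_used = 0
    for _ in range(max_steps):
        updated = refine_step(anchors)
        change = max(abs(u - a) for u, a in zip(updated, anchors))
        anchors = updated
        steps_used += 1
        if change < tol:
            break  # anchor dynamics have converged
    return anchors, steps_used


def make_toy_step(target, rate):
    """Toy stand-in for a latent update: move a fraction `rate` of the way
    toward a fixed target on every call. Purely illustrative."""
    def step(anchors):
        return [a + rate * (t - a) for a, t in zip(anchors, target)]
    return step


start = [0.0, 0.0]
easy = make_toy_step(target=[1.0, -1.0], rate=0.9)  # converges quickly
hard = make_toy_step(target=[1.0, -1.0], rate=0.3)  # converges slowly
_, easy_steps = refine_until_stable(start, easy)
_, hard_steps = refine_until_stable(start, hard)
```

With these toy dynamics the fast-converging instance halts well before the slow one, while both respect the same `max_steps` budget, which mirrors the allocation behavior the abstract describes.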
Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.
[81] Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization
Jihao Zhao,Shuaishuai Zu,Zhiyuan Ji,Chunlai Zhou,Biao Qin
Main category: cs.CL
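TL;DR: This paper proposes a method for building fine-grained evaluation criteria through multi-agent collaboration grounded in Grounded Theory, and designs the Memory-augmented Replay Policy Optimization (MRPO) algorithm, which supports training-free self-reflection as well as end-to-end optimization. It substantially improves the performance of small models on creative writing tasks.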
Details
Motivation: Creative writing is an open-ended generation task with no verifiable reference answers, so reward modeling and automatic evaluation are held back by high annotation cost, evaluative bias, and coarse feedback. Method: (1) A multi-agent collaborative workflow based on Grounded Theory performs dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable, reusable, fine-grained criteria. (2) The Memory-augmented Replay Policy Optimization (MRPO) algorithm lets a model self-reflect and improve iteratively against the dynamic criteria without additional training, and combines supervised fine-tuning with reinforcement learning to turn the criteria into reward signals for end-to-end optimization. Result: The automatically constructed criteria perform on par with human annotations. Writer-R1-4B models outperform baselines on multiple creative writing tasks and surpass some open-source models with 100B+ parameters. Conclusion: The method eases the difficulty of automatic evaluation in creative writing and offers a new paradigm for optimizing small models efficiently on open-ended generation. Abstract: As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.
[82] Bridging National and International Legal Data: Two Projects Based on the Japanese Legal Standard XML Schema for Comparative Law Studies
Makoto Nakamura
Main category: cs.CL
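TL;DR: This paper presents an integrated framework for computational comparative law. It converts the Japanese Legal Standard (JLS) XML schema to the Akoma Ntoso (AKN) standard, uses multilingual embeddings and semantic similarity to identify corresponding provisions across countries, and builds a system that visualizes cross-jurisdictional networks for exploratory comparative analysis.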
Details
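Motivation: To make Japanese legal texts interoperable with international legal databases (such as those based on LegalDocML) and to support automated comparison of provisions across jurisdictions. Method: The authors build a conversion pipeline from JLS to Akoma Ntoso (AKN), apply multilingual embedding models and semantic textual similarity techniques, build a prototype that combines FAISS retrieval with Cross-Encoder reranking, and generate and visualize cross-jurisdictional networks of corresponding provisions. Result: Japanese statutes become structurally interoperable with the international AKN standard, semantic correspondences between provisions in different national legal systems are identified, and a visual prototype supports exploratory comparative analysis. Conclusion: The framework effectively bridges the gap between structured legal data standards and semantic-level comparative analysis, giving computational comparative law an extensible, reusable methodological basis. Abstract: This paper presents an integrated framework for computational comparative law by connecting two consecutive research projects based on the Japanese Legal Standard (JLS) XML schema. The first project establishes structural interoperability by developing a conversion pipeline from JLS to the Akoma Ntoso (AKN) standard, enabling Japanese statutes to be integrated into international LegalDocML-based legislative databases. Building on this foundation, the second project applies multilingual embedding models and semantic textual similarity techniques to identify corresponding provisions across national legal systems. A prototype system combining multilingual embeddings, FAISS retrieval, and Cross-Encoder reranking generates candidate correspondences and visualizes them as cross-jurisdictional networks for exploratory comparative analysis.
[83] MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge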
Baochen Fu,Yuntao Du,Cheng Chang,Baihao Jin,Wenzhi Deng,Muhao Xu,Hongmei Yan,Weiye Song,Yi Wan
Main category: cs.CL
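TL;DR: This paper proposes MMKU-Bench, a comprehensive benchmark for evaluating multimodal knowledge updating that covers two scenarios, updated knowledge and unknown knowledge, and uses it to compare methods such as SFT, RLHF, and knowledge editing.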
Details
Motivation: Existing work focuses only on learning new knowledge and overlooks the need to update knowledge a model has already learned when the real world changes. It also lacks a systematic evaluation of cross-modal consistency. Method: The authors build MMKU-Bench, with over 25k knowledge instances and more than 49k images covering the updated-knowledge and unknown-knowledge scenarios, and evaluate SFT, RLHF, and knowledge editing on it. Result: SFT and RLHF are prone to catastrophic forgetting, while knowledge editing preserves general capabilities better but is limited in continual updating. Conclusion: MMKU-Bench provides a reliable, comprehensive benchmark for multimodal knowledge updating and advances the field. Abstract: As real-world knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly hard to keep consistent with real-world knowledge. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserves general capabilities but exhibits clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.
[84] Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike
Miriam Winkler,Verena Blaschke,Barbara Plank
Main category: cs.CL
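TL;DR: This paper presents two multilingual indirect question answering (IQA) datasets, InQA+ and GenIQA, for evaluating and training models that classify the polarity of indirect answers. Experiments show that IQA is highly challenging, that existing models perform poorly and overfit easily, and that data generated by GPT-4o-mini is of insufficient quality.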
Details
Motivation: Indirectness is common in everyday communication but under-studied in NLP, especially for low-resource languages; multilingual IQA datasets and a systematic evaluation of model capabilities are needed.
Method: Build two corpora covering English, Standard German, and Bavarian: InQA+, a small high-quality evaluation set, and GenIQA, a larger training set generated by GPT-4o-mini. Run experiment variations with mBERT, XLM-R, and mDeBERTa to analyse the influence of label ambiguity, label set, and dataset size.
Result: IQA performance is poor in every language, including English, with severe overfitting; performance is poor for high-resource (English, German) and low-resource (Bavarian) languages alike, and a large amount of training data is beneficial; GPT-4o-mini generates low-quality data and lacks sufficient pragmatic understanding.
Conclusion: IQA is a pragmatically hard task for which current multilingual pretrained models and LLM-generated data are both inadequate; higher-quality annotation and better pragmatic modeling are needed.
Abstract: Indirectness is a common feature of daily communication, yet is underexplored in NLP research for both low-resource as well as high-resource languages. Indirect Question Answering (IQA) aims at classifying the polarity of indirect answers. In this paper, we present two multilingual corpora for IQA of varying quality that both cover English, Standard German and Bavarian, a German dialect without standard orthography: InQA+, a small high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset that contains artificial data generated by GPT-4o-mini. We find that IQA is a pragmatically hard task that comes with various challenges, based on several experiment variations with multilingual transformer models (mBERT, XLM-R and mDeBERTa). We suggest and employ recommendations to tackle these challenges. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set and dataset size. We find that the IQA performance is poor in high- (English, German) and low-resource languages (Bavarian) and that it is beneficial to have a large amount of training data. Further, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of our tested languages.
[85] HindSight: Evaluating Research Idea Generation via Future Impact
Bo Jiang
Main category: cs.CL
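TL;DR: This paper proposes HindSight, a time-split framework for evaluating AI-generated research ideas. It matches generated ideas against real papers published later and scores them by citation count and venue acceptance, and it finds that LLM judgments diverge markedly from actual research impact.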
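The time-split idea summarized in this TL;DR can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Paper` record, the token-overlap matcher `jaccard`, the match threshold, and the impact formula (log of citations plus a bonus for venue acceptance). Only the temporal cutoff and the 30-month window come from the abstract.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Paper:
    # Hypothetical record of a real publication used as ground truth.
    title: str
    published: date
    citations: int
    accepted: bool  # accepted at a peer-reviewed venue


def months_after(cutoff: date, d: date) -> int:
    """Whole calendar months from cutoff to d (month granularity only)."""
    return (d.year - cutoff.year) * 12 + (d.month - cutoff.month)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real semantic matcher."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def hindsight_score(idea: str, papers: List[Paper], cutoff: date,
                    window_months: int = 30, threshold: float = 0.3) -> float:
    """Score an idea by the best-matching paper published after the cutoff."""
    best = 0.0
    for p in papers:
        m = months_after(cutoff, p.published)
        if not 0 < m <= window_months:
            continue  # outside the evaluation window (cutoff month excluded)
        if jaccard(idea, p.title) < threshold:
            continue  # not close enough to count as a match
        # Assumed impact formula: damped citations plus an acceptance bonus.
        impact = math.log1p(p.citations) + (1.0 if p.accepted else 0.0)
        best = max(best, impact)
    return best
```

An idea that matches no paper inside the window scores 0.0 under this sketch, which mirrors the intuition that novel-sounding ideas that never materialize earn no credit.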
Details
Motivation: Existing ways of evaluating AI-generated research ideas, such as LLM judges or human panels, are subjective and disconnected from real research impact; an objective, verifiable evaluation paradigm is needed.
Method: Propose the HindSight time-split evaluation framework: with a temporal cutoff T, restrict the generation system to literature published before T, then match its outputs against real papers published in the 30 months after T and score them by citation count and venue acceptance.
Result: Across 10 AI/ML topics, HindSight clearly separates retrieval-augmented from vanilla generation (a 2.5x score difference, p<0.001), whereas an LLM judge does not detect the difference (p=0.584); HindSight scores are negatively correlated with LLM-judged novelty (ρ=-0.29, p<0.01).
Conclusion: HindSight offers a more objective and more predictive way to evaluate research ideas, and it shows that LLM judges tend to overvalue ideas that sound novel but have no real impact.
Abstract: Evaluating AI-generated research ideas typically relies on LLM judges or human panels -- both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff $T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p = 0.584$), while HindSight shows the retrieval-augmented system produces 2.5$\times$ higher-scoring ideas ($p < 0.001$). Moreover, HindSight scores are negatively correlated with LLM-judged novelty ($\rho = -0.29$, $p < 0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
[86] The Hrunting of AI: Where and How to Improve English Dialectal Fairness
Wei Li,Adrian de Wynter
Main category: cs.CL
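TL;DR: This paper studies the limited performance of large language models (LLMs) on English dialects (Yorkshire, Geordie, Cornish, and African-American Vernacular English). It finds that low agreement among human annotators directly affects the reliability of LLMs as judges, and in turn the feasibility of improving the models. Fine-tuning does little to mitigate the problem, but some LLMs can generate high-quality data, which offers a path forward when data is scarce.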
Details
Motivation: LLMs underperform on English dialects and are hard to improve because data is scarce; this paper examines how data quality and availability affect the feasibility of improving LLMs in dialectal settings.
Method: Evaluate four English dialects (Yorkshire, Geordie, Cornish, and African-American Vernacular English) plus West Frisian as a control; analyse the relationship between human-human agreement, LLM-human agreement, and metrics such as accuracy; examine how fine-tuning affects the agreement pattern.
Result: Human-human agreement directly determines LLM-as-a-judge performance; LLM-human agreement follows the same pattern as human-human agreement; fine-tuning does not eliminate the pattern and may amplify it; some LLMs can generate high-quality dialectal data, which enables scaling.
Conclusion: In dialectal settings, low populations lead to low human agreement, which threatens the reliability of LLM evaluation and improvement; data must be evaluated carefully to ensure fair and inclusive improvement, and new tools are needed to handle this agreement pattern when data is scarce.
Abstract: It is known that large language models (LLMs) underperform in English dialects, and that improving them is difficult due to data scarcity. In this work we investigate how quality and availability impact the feasibility of improving LLMs in this context. For this, we evaluate three rarely-studied English dialects (Yorkshire, Geordie, and Cornish), plus African-American Vernacular English, and West Frisian as control. We find that human-human agreement when determining LLM generation quality directly impacts LLM-as-a-judge performance. That is, LLM-human agreement mimics the human-human agreement pattern, and so do metrics such as accuracy. This is an issue because LLM-human agreement measures an LLM's alignment with the human consensus; it therefore raises questions about the feasibility of improving LLM performance in locales where low populations induce low agreement. We also note that fine-tuning does not eradicate, and might amplify, this pattern in English dialects. But we also find encouraging signals, such as some LLMs' ability to generate high-quality data, thus enabling scalability. We argue that data must be carefully evaluated to ensure fair and inclusive LLM improvement; and, in the presence of scarcity, new tools are needed to handle the pattern found.
[87] Efficient Document Parsing via Parallel Token Prediction
Lei Li,Ze Zhao,Meng Li,Zhongwang Lun,Yi Yuan,Xingjing Lu,Zheng Wei,Jiang Bian,Zang Li
Main category: cs.CL
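TL;DR: This paper proposes Parallel-Token Prediction (PTP), a parallel decoding method that accelerates vision-language models (VLMs) on document parsing. It speeds up decoding by 1.6x-2.2x without sacrificing performance, while reducing hallucinations and improving generalization.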
Details
Motivation: In document parsing, vision-language models (VLMs) are constrained by autoregressive (AR) decoding, which makes inference slow; an efficient parallel decoding scheme is needed.
Method: Propose Parallel-Token Prediction (PTP): insert learnable tokens into the input sequence and design matching training objectives so the model can predict multiple tokens in parallel; also build a pipeline for generating large-scale, high-quality document parsing training data.
Result: On OmniDocBench and olmOCR-bench, decoding is 1.6x-2.2x faster, hallucinations are reduced, and generalization is strong.
Conclusion: PTP is a pluggable, model-agnostic, simple yet effective parallel decoding method that improves the efficiency and robustness of VLMs on document parsing.
Abstract: Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
[88] Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation
Xinyue Ma,Pol Pastells,Mireia Farrús,Mariona Taulé
Main category: cs.CL
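TL;DR: This paper builds a bidirectional, multi-domain dataset of Chinese and English passive sentences for evaluating how machine translation systems handle the passive voice. Models tend to preserve the voice of the source text, and their voice choices are closer to human translators in English-to-Chinese translation; commercial NMT models score higher on metrics, while LLMs are better at using diverse alternative translations.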
Details
Motivation: Passive sentences are constructed and distributed very differently in Chinese and English, and existing MT evaluation lacks a dedicated dataset and analysis for this phenomenon, so passive translation quality needs specific attention.
Method: Automatically extract passive sentences from five Chinese-English parallel corpora and annotate them with structure labels according to the human translations, building a bidirectional multi-domain dataset of 73,965 sentence pairs plus a manually verified test set; evaluate two state-of-the-art open-source MT systems and four commercial models.
Result: Models are influenced more by the voice of the source text than by the source language's general voice usage, and tend to keep the passive voice; voice consistency with human translators is higher in English-to-Chinese than in Chinese-to-English translation; commercial NMT models score higher on automatic metrics, while LLMs produce more diverse alternative translations.
Conclusion: The passive voice is a key challenge in Chinese-English MT and calls for modeling that accounts for context and frequency; a dedicated dataset can expose biases in model behavior and support finer-grained MT evaluation and improvement.
Abstract: Machine Translation (MT) evaluation has gone beyond metrics, towards more specific linguistic phenomena. Regarding English-Chinese language pairs, passive sentences are constructed and distributed differently due to language variation, and thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to human translation, and a test set with manually verified annotation. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are more influenced by the voice of the source text rather than the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. Datasets and annotation script will be shared upon request.
[89] Practicing with Language Models Cultivates Human Empathic Communication
Aakriti Kumar,Nalin Poungpeth,Diyi Yang,Bruce Lambert,Matthew Groh
Main category: cs.CL
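TL;DR: This study builds Lend an Ear, an experimental conversation platform, derives a taxonomy of empathic expressions from large-scale human-LLM conversation data, and shows that a personalized feedback intervention based on a large language model (LLM) significantly improves participants' empathic expression. It also identifies a "silent empathy effect": people feel empathy but struggle to express it.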
Details
Motivation: Although LLMs are often judged more empathic than humans in blinded evaluations, recipients feel less heard and validated when they know a response comes from AI. This points to shortcomings in people's own empathic expression and calls for rigorous methods to identify, assess, and improve real interpersonal empathic communication.
Method: Build the Lend an Ear platform and collect 2,904 text conversations (33,938 messages) between 968 participants and an LLM role-playing someone in distress; from these, derive a data-driven taxonomy of idiomatic empathic expressions in naturalistic dialogue; then run a pre-registered randomized controlled experiment comparing personalized LLM feedback, non-personalized video feedback, and a no-intervention control on how well participants' expression patterns align with normative ones.
Result: (1) A taxonomy of empathic expressions built from large-scale naturalistic dialogue; (2) personalized LLM feedback significantly increases the alignment of participants' empathic expression with normative patterns; (3) a "silent empathy effect": people feel empathy but systematically fail to express it; (4) participants reliably identify responses that meet normative criteria and rate them as more empathic.
Conclusion: Empathic expression is a communication skill that can be measured, trained, and improved; real-time personalized feedback from an LLM is an efficient, scalable way to cultivate it, providing an empirical basis for psychology and for human-AI collaborative education.
Abstract: Empathy is central to human connection, yet people often struggle to express it effectively. In blinded evaluations, large language models (LLMs) generate responses that are often judged more empathic than human-written ones. Yet when a response is attributed to AI, recipients feel less heard and validated than when comparable responses are attributed to a human. To probe and address this gap in empathic communication skill, we built Lend an Ear, an experimental conversation platform in which participants are asked to offer empathic support to an LLM role-playing personal and workplace troubles. From 33,938 messages spanning 2,904 text-based conversations between 968 participants and their LLM conversational partners, we derive a data-driven taxonomy of idiomatic empathic expressions in naturalistic dialogue. Based on a pre-registered randomized experiment, we present evidence that a brief LLM coaching intervention offering personalized feedback on how to effectively communicate empathy significantly boosts alignment of participants' communication patterns with normative empathic communication patterns relative to both a control group and a group that received video-based but non-personalized feedback. Moreover, we find evidence for a silent empathy effect: people feel empathy but systematically fail to express it. Nonetheless, participants reliably identify responses aligned with normative empathic communication criteria as more expressive of empathy. Together, these results advance the scientific understanding of how empathy is expressed and valued and demonstrate a scalable, AI-based intervention for scaffolding and cultivating it.
[90] From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding
Xu Zhang,Wenxin Ma,Chenxu Wu,Rongsheng Wang,Kun Zhang,S. Kevin Zhou
Main category: cs.CL
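TL;DR: This paper proposes Code-Centric Learning, a framework that shifts supervision from full clinical documents to short evidence spans, addressing three challenges LLMs face in ICD coding: poor generalization, weak interpretability, and high training cost.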
Details
Motivation: Existing ICD coding datasets have limited coverage and lack supporting evidence, and long documents are computationally expensive to process, all of which limit the effectiveness of fine-tuning LLMs.
Method: Propose a code-centric learning framework, consisting of a mixed training strategy and code-centric data expansion, that supervises on short evidence spans rather than full documents.
Result: Substantially outperforms strong baselines under the same LLM backbone; small-scale LLMs reach performance comparable to much larger proprietary models; accuracy on unseen ICD codes improves while interpretability is preserved.
Conclusion: Code-Centric Learning is an efficient, interpretable approach to ICD coding that generalizes well and shows potential for fully automated ICD coding.
Abstract: ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods demonstrate stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model's ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs' ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.
[91] Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies
Giuseppe Samo,Paola Merlo
Main category: cs.CL
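TL;DR: This paper builds paradigm-based verb alternation datasets for four languages. Using the Blackbird Language Matrices (BLM) task, it evaluates how systematically large language models capture cross-sentence paradigmatic patterns (such as change of state, object drop, and Hebrew binyanim), and it introduces several template types and linguistically informed data augmentation strategies.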
Details
Motivation: Existing work mostly targets sentence-internal phenomena, and it remains unclear how well LLMs model cross-sentence paradigmatic regularities such as verb alternations; interpretable, controlled diagnostic benchmarks are needed.
Method: Build paradigm-based verb alternation datasets covering English, German, Italian, and Hebrew; design the Blackbird Language Matrices (BLM) task, an RPM/ARC-like linguistic reasoning task; use three template types of increasing complexity together with linguistically informed augmentation over synthetic and natural data.
Result: Simple baseline results are reported for the four languages, showing that the datasets are diagnostically useful and reveal where models are strong or weak on cross-sentence paradigmatic knowledge.
Conclusion: The datasets and the BLM task provide a new tool for evaluating and improving LLMs' grasp of deep, systematic grammatical knowledge and for testing how well they generalize over linguistic structure.
Abstract: Large language models (LLMs) have shown remarkable performance across various sentence-based linguistic phenomena, yet their ability to capture cross-sentence paradigmatic patterns, such as verb alternations, remains underexplored. In this work, we present curated paradigm-based datasets for four languages, designed to probe systematic cross-sentence knowledge of verb alternations (change-of-state and object-drop constructions in English, German and Italian, and Hebrew binyanim). The datasets comprise thousands of Blackbird Language Matrices (BLM) problems. The BLM task -- an RPM/ARC-like task devised specifically for language -- is a controlled linguistic puzzle where models must select the sentence that completes a pattern according to syntactic and semantic rules. We introduce three types of templates varying in complexity and apply linguistically-informed data augmentation strategies across synthetic and natural data. We provide simple baseline performance results across English, Italian, German, and Hebrew that demonstrate the diagnostic usefulness of the datasets.
[92] CCTU: A Benchmark for Tool Use under Complex Constraints
Junjie Ye,Guoqiang Zhang,Wenjie Fu,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CL
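TL;DR: This paper proposes CCTU, a benchmark for evaluating how well large language models (LLMs) use tools under complex constraints. It covers 12 constraint categories and 200 challenging test cases and includes an executable constraint validation module. Experiments show that current LLMs complete fewer than 20% of tasks under strict constraints and have weak self-refinement ability.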
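Step-level constraint validation of the kind this TL;DR mentions might look roughly like the following Python sketch. It is not CCTU's actual module: the `Step` record, the three example constraints (one each for the resource, toolset, and response dimensions), and all limits are hypothetical and chosen only to show the checking pattern.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Step:
    tool: Optional[str]  # name of the tool called, or None for a final response
    text: str            # tool arguments or response text


# A constraint inspects the interaction so far and returns a violation
# message, or None when the latest step is compliant.
Constraint = Callable[[List[Step]], Optional[str]]


def max_tool_calls(limit: int) -> Constraint:
    """Resource dimension (assumed example): cap the number of tool calls."""
    def check(history: List[Step]) -> Optional[str]:
        used = sum(1 for s in history if s.tool is not None)
        if used > limit:
            return f"resource: {used} tool calls exceed the limit of {limit}"
        return None
    return check


def allowed_tools(names: Set[str]) -> Constraint:
    """Toolset dimension (assumed example): only listed tools may be called."""
    def check(history: List[Step]) -> Optional[str]:
        last = history[-1]
        if last.tool is not None and last.tool not in names:
            return f"toolset: tool '{last.tool}' is not permitted"
        return None
    return check


def max_response_words(limit: int) -> Constraint:
    """Response dimension (assumed example): cap the final answer length."""
    def check(history: List[Step]) -> Optional[str]:
        last = history[-1]
        if last.tool is None and len(last.text.split()) > limit:
            return f"response: answer is longer than {limit} words"
        return None
    return check


def validate_step(history: List[Step], constraints: List[Constraint]) -> List[str]:
    """Run every constraint after a new step; return all violation messages."""
    return [msg for c in constraints if (msg := c(history)) is not None]
```

Calling `validate_step` after each new step gives the environment a list of violation messages that could be fed back to the model, which is the kind of feedback loop the abstract describes for multi-turn interactions.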
Details
Motivation: There is no dedicated evaluation of LLM tool use under explicit constraints, which limits reliable deployment in complex real-world scenarios.
Method: Build the CCTU benchmark on a taxonomy of 12 constraint categories across four dimensions (resource, behavior, toolset, and response), with 200 test cases that have long prompts (over 4,700 tokens on average) and multiple constraints (seven types on average); develop an executable constraint validation module that supports multi-turn interaction; evaluate nine state-of-the-art models in thinking and non-thinking modes and analyse the causes of failure.
Result: When all constraints must be strictly satisfied, no model achieves a task completion rate above 20%; constraints are violated in over 50% of cases, mainly in the resource and response dimensions; self-refinement remains very limited even after detailed feedback on violations.
Conclusion: Current LLMs face a significant bottleneck in tool use under complex constraints, especially in understanding, following, and self-correcting against constraints; CCTU provides a standardized evaluation framework and open resources for future research.
Abstract: Solving problems through tool use under explicit constraints constitutes a highly challenging yet unavoidable scenario for large language models (LLMs), requiring capabilities such as function calling, instruction following, and self-refinement. However, progress has been hindered by the absence of dedicated evaluations. To address this, we introduce CCTU, a benchmark for evaluating LLM tool use under complex constraints. CCTU is grounded in a taxonomy of 12 constraint categories spanning four dimensions (i.e., resource, behavior, toolset, and response). The benchmark comprises 200 carefully curated and challenging test cases across diverse tool-use scenarios, each involving an average of seven constraint types and an average prompt length exceeding 4,700 tokens. To enable reliable evaluation, we develop an executable constraint validation module that performs step-level validation and enforces compliance during multi-turn interactions between models and their environments. We evaluate nine state-of-the-art LLMs in both thinking and non-thinking modes. Results indicate that when strict adherence to all constraints is required, no model achieves a task completion rate above 20%. Further analysis reveals that models violate constraints in over 50% of cases, particularly in the resource and response dimensions. Moreover, LLMs demonstrate limited capacity for self-refinement even after receiving detailed feedback on constraint violations, highlighting a critical bottleneck in the development of robust tool-use agents. To facilitate future research, we release the data and code.
[93] PYTHEN: A Flexible Framework for Legal Reasoning in Python
Ha-Thanh Nguyen,Ken Satoh
Main category: cs.CL
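TL;DR: PYTHEN is a new Python-based framework for defeasible legal reasoning. By building on Python's built-in any() and all() functions, it supports conjunctive and disjunctive conditions and a richer exception-handling mechanism, aiming to lower the barrier to formal legal reasoning and advance legal AI.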
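To make the any()/all() idea in this TL;DR concrete, here is a minimal Python sketch of a defeasible rule. It is not PYTHEN's actual syntax or API, which this listing does not show: the `Rule` class, the `fact` helper, and the contract example are hypothetical illustrations of how conjunctive conditions, disjunctive conditions, and exceptions could be combined.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Facts = Dict[str, bool]
Condition = Callable[[Facts], bool]


def fact(name: str) -> Condition:
    """Condition that holds when the named fact is present and true."""
    return lambda facts: facts.get(name, False)


@dataclass
class Rule:
    # Hypothetical rule shape, for illustration only.
    conclusion: str
    all_of: List[Condition] = field(default_factory=list)   # conjunctive (ALL)
    any_of: List[Condition] = field(default_factory=list)   # disjunctive (ANY)
    exceptions: List["Rule"] = field(default_factory=list)  # defeating rules

    def holds(self, facts: Facts) -> bool:
        conjunctive = all(c(facts) for c in self.all_of)
        # An empty ANY list imposes no requirement.
        disjunctive = any(c(facts) for c in self.any_of) if self.any_of else True
        if not (conjunctive and disjunctive):
            return False
        # Defeasibility: the conclusion is withdrawn if any exception applies.
        return not any(e.holds(facts) for e in self.exceptions)


# Toy example: a contract is valid given offer AND acceptance, in written OR
# oral form, unless the exception for a minor party applies.
minor_exception = Rule("contract_voidable", all_of=[fact("party_is_minor")])
contract_valid = Rule(
    "contract_valid",
    all_of=[fact("offer"), fact("acceptance")],
    any_of=[fact("written"), fact("oral")],
    exceptions=[minor_exception],
)
```

Because exceptions are themselves rules, they can carry their own conditions and exceptions, which is one simple way to get nested defeat without leaving plain Python.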
Details
Motivation: To address the limited flexibility, usability, and accessibility of traditional legal reasoning systems such as PROLEG, so that researchers, legal tech developers, and professionals without a logic programming background can carry out formal legal reasoning.
Method: Design and implement the PYTHEN framework, drawing on PROLEG and The Zen of Python; use Python's native any() and all() functions to model legal rules, conditions, and exceptions with both conjunctive (ALL) and disjunctive (ANY) logic; build an extensible architecture that supports autoformalization and legal AI applications.
Result: Developed the PYTHEN framework, with intuitive syntax, flexible condition expressions, and an enhanced exception-handling mechanism; carried out a comparative analysis with PROLEG; discussed its potential for autoformalization and next-generation legal AI systems.
Conclusion: PYTHEN bridges the gap between symbolic reasoning and Python's accessible ecosystem, offering a practical, extensible, low-barrier tool for legal AI and helping to democratize formal legal reasoning.
Abstract: This paper introduces PYTHEN, a novel Python-based framework for defeasible legal reasoning. PYTHEN is designed to model the inherently defeasible nature of legal argumentation, providing a flexible and intuitive syntax for representing legal rules, conditions, and exceptions. Inspired by PROLEG (PROlog-based LEGal reasoning support system) and guided by the philosophy of The Zen of Python, PYTHEN leverages Python's built-in any() and all() functions to offer enhanced flexibility by natively supporting both conjunctive (ALL) and disjunctive (ANY) conditions within a single rule, as well as a more expressive exception-handling mechanism. This paper details the architecture of PYTHEN, provides a comparative analysis with PROLEG, and discusses its potential applications in autoformalization and the development of next-generation legal AI systems. By bridging the gap between symbolic reasoning and the accessibility of Python, PYTHEN aims to democratize formal legal reasoning for young researchers, legal tech developers, and professionals without extensive logic programming expertise. We position PYTHEN as a practical bridge between the powerful symbolic reasoning capabilities of logic programming and the rich, ubiquitous ecosystem of Python, making formal legal reasoning accessible to a broader range of developers and legal professionals.
[94] Tagarela - A Portuguese speech dataset from podcasts
Frederico Santos de Oliveira,Lucas Rafael Stefanel Gris,Alef Iury Siqueira Ferreira,Augusto Seben da Rosa,Alexandre Costa Ferro Filho,Edresson Casanova,Christopher Dane Shulby,Rafael Teixeira Sousa,Diogo Fernandes Costa Silva,Anderson da Silva Soares,Arlindo Rodrigues Galvão Filho
Main category: cs.CL
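TL;DR: This paper introduces TAGARELA, a large-scale Portuguese podcast speech dataset (8,972 hours) for training ASR and TTS models, comparable in scale to the English GigaSpeech. Quality is ensured through audio pre-processing and a mixed ASR transcription strategy, and ASR/TTS models trained on the dataset are evaluated to validate it. The dataset is publicly released.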
Details
Motivation: Portuguese is under-resourced in speech processing: the lack of public, large-scale, high-quality datasets holds back ASR and TTS models.
Method: Build the TAGARELA dataset from 8,972 hours of Portuguese podcast audio; apply an audio pre-processing pipeline and transcribe with a mixed strategy that uses ASR models previously trained on high-fidelity transcriptions from proprietary APIs; train ASR and TTS models exclusively on the dataset and evaluate them.
Result: TAGARELA approaches GigaSpeech in scale and supports training strong Portuguese ASR and TTS models; experiments demonstrate its potential to make Portuguese speech technology more robust and natural.
Conclusion: TAGARELA fills the gap in large-scale Portuguese speech data, provides a key resource for advancing Portuguese ASR and TTS, and is publicly released to support community research.
Abstract: Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English's GigaSpeech (10kh), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is released publicly, available at https://freds0.github.io/TAGARELA/, to foster the development of robust speech technologies.
[95] DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models
Xueyu Zhou,Yangrong Hu,Jian Huang
Main category: cs.CL
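TL;DR: This paper proposes Dependency-Oriented Sampler (DOS), a training-free decoding strategy that uses transformer attention matrices to model inter-token dependencies, improving the performance and parallel decoding efficiency of masked diffusion language models (MDLMs) on code generation and mathematical reasoning.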
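One way to picture attention-guided selection of masked positions is the small Python sketch below. It is only an illustration of the general idea and not the DOS algorithm itself: the scoring rule (total attention a masked position pays to already-unmasked positions) and the function name `select_positions` are assumptions, since this listing does not give the paper's exact criterion.

```python
from typing import List


def select_positions(attention: List[List[float]],
                     is_masked: List[bool],
                     k: int) -> List[int]:
    """Pick up to k masked positions to decode next.

    attention[i][j] is the attention weight from position i to position j,
    e.g. averaged over heads and layers (an assumption of this sketch).
    A masked position scores higher the more of its attention falls on
    tokens that are already unmasked.
    """
    scores = {}
    for i, masked in enumerate(is_masked):
        if not masked:
            continue
        scores[i] = sum(w for j, w in enumerate(attention[i]) if not is_masked[j])
    # Highest score first; ties are broken by position for determinism.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:k]
```

Setting `k` above 1 corresponds to decoding several positions in the same step, which is where a dependency-aware ranking could plausibly combine with the parallel sampling methods the abstract mentions.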
Details
Motivation: Existing decoding strategies for pre-trained masked diffusion language models (MDLMs) rely mainly on token-level uncertainty and overlook sequence-level information and inter-token dependencies.
Method: Propose Dependency-Oriented Sampler (DOS), which needs no additional training: it approximates inter-token dependencies from the attention matrices of transformer blocks and gives more weight to information from unmasked tokens when updating masked positions.
Result: DOS consistently achieves better performance on code generation and mathematical reasoning, and it integrates seamlessly with existing parallel sampling methods, improving efficiency without lowering generation quality.
Conclusion: Using sequence-level dependency information to guide mask updates is an effective way to improve both decoding quality and efficiency for MDLMs, and DOS offers a new direction for training-free efficient decoding.
Abstract: Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.
[96] When Does Sparsity Mitigate the Curse of Depth in LLMs
Dilxat Muhtar,Xinyuan Song,Sebastian Pokutta,Max Zimmer,Nico Pelleriti,Thomas Hofmann,Shiwei Liu
Main category: cs.CL
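TL;DR: This paper shows that sparsity plays a key regulating role in how well large language models (LLMs) use their depth. Both implicit sparsity (e.g., from weight decay or long-context attention) and explicit sparsity (e.g., Grouped-Query Attention or MoE expert activation) curb the accumulated variance growth associated with Pre-Layer Normalization and improve the functional differentiation and utilization of deep blocks. The paper also distills a practical training recipe that yields a 4.6% accuracy improvement on downstream tasks.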
Details
Motivation: Address the "curse of depth" in LLMs, where deep blocks drift toward near-identity behavior because variance accumulates under Pre-LN, which reduces their representational capacity and leaves them under-utilized.
Method: Use controlled depth-scaling experiments and targeted layer-effectiveness interventions to analyse how two sources of sparsity (implicit: weight decay and long-context attention; explicit: key/value sharing in GQA and expert activation in MoE) affect output variance and the functional differentiation of layers.
Result: Sparsity consistently reduces output variance and promotes functional differentiation, markedly improving depth utilization; the resulting rule-of-thumb recipe for training depth-effective LLMs yields a 4.6% accuracy improvement on downstream tasks.
Conclusion: Sparsity, especially when it arises from standard design choices, is a key but overlooked mechanism for effective depth scaling in LLMs, with benefits for both efficiency and representation.
Abstract: Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
[97] A Closer Look into LLMs for Table Understanding
Jia Wang,Chuanyu Qin,Mingyu Zheng,Qingyi Si,Peize Li,Zheng Lin
Main category: cs.CL
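TL;DR: This paper presents an empirical study of 16 large language models (general, table-specialist, and Mixture-of-Experts models), analysing their table understanding mechanisms along four dimensions: attention dynamics, effective layer depth, expert activation, and input design. It reveals a three-phase attention pattern, a reliance on deeper layers, a division of labor among experts, and a complementary effect between prompting and fine-tuning.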
Details
Motivation: Although large language models (LLMs) perform well on table understanding, their internal mechanisms remain unclear; a systematic empirical analysis is needed to improve interpretability and guide future research.
Method: Run a multi-dimensional empirical analysis of 16 LLMs (general, table-specialist, and MoE architectures), examining how attention evolves, how much each layer contributes, how experts are activated in MoE models, and how input designs such as Chain-of-Thought prompting and table-tuning affect behavior.
Result: (1) LLMs follow a three-phase attention pattern (broad scanning, then localization, then amplification of key cells); (2) tabular tasks rely on deeper layers than math reasoning; (3) MoE models activate table-specific experts in middle layers, while early and late layers share general-purpose experts; (4) Chain-of-Thought prompting increases attention to the table, and table-tuning enhances this further.
Conclusion: The study sheds light on how LLMs process tabular data and offers empirical evidence and direction for improving interpretability, architecture design, and prompting and fine-tuning strategies.
Abstract: Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on four dimensions: the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern -- early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.
[98] Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models
Zehao Chen,Rong Pan
Main category: cs.CL
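TL;DR: This paper proposes Fusian, a framework that dynamically fuses LoRA adapters through trajectory collection and reinforcement learning to give continuous, fine-grained control over the intensity of personality traits in large language models.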
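The fusion step in this TL;DR, and the Dirichlet sampling named in the abstract, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Fusian's implementation: each adapter is represented as a flat list of weight deltas, the concentration parameters are given directly rather than produced by a trained policy network, and the function names are hypothetical.

```python
import random
from typing import List


def sample_dirichlet(alphas: List[float], rng: random.Random) -> List[float]:
    """Sample mixing weights from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def fuse_adapters(adapters: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum of frozen adapter deltas (each flattened to a list)."""
    size = len(adapters[0])
    return [sum(w * a[i] for w, a in zip(weights, adapters)) for i in range(size)]


# Usage: three checkpoints saved along one trait's SFT trajectory (toy values).
rng = random.Random(0)
checkpoints = [[0.0, 0.1], [0.2, 0.3], [0.5, 0.9]]
weights = sample_dirichlet([1.0, 2.0, 4.0], rng)  # stand-in for policy output
fused = fuse_adapters(checkpoints, weights)
```

Because Dirichlet samples are non-negative and sum to one, the fused adapter always lies inside the convex hull of the saved checkpoints, which is one reason such a parameterization suits interpolating between trait intensities.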
Details
Motivation: Existing methods treat personality traits as discrete categories and cannot precisely control trait intensity on a continuous spectrum.
Method: Fusian has two stages: (1) trajectory collection, which saves a sequence of LoRA adapters during SFT to map the continuous manifold of a trait; and (2) RL-based dynamic fusion, which trains a policy network to produce mixing weights and fuses multiple frozen adapters by sampling from a Dirichlet distribution.
Result: Experiments on Qwen3-14B show that Fusian aligns with target trait intensities significantly more precisely than baseline methods.
Conclusion: Fusian makes personality control in LLMs fine-grained and continuous, offering a new approach to controllable personality modeling.
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prompt engineering and standard Supervised Fine-Tuning (SFT), typically treat personality traits as discrete categories (e.g., "Extroverted" vs. "Introverted"), lacking the ability to precisely control the intensity of a trait on a continuous spectrum. In this paper, we introduce Fusian, a novel framework for fine-grained, continuous personality control in LLMs. Fusian operates in two stages: (1) Trajectory Collection, where we capture the dynamic evolution of personality adoption during SFT by saving a sequence of LoRA adapters, effectively mapping the continuous manifold of a trait; and (2) RL-based Dynamic Fusion, where we train a policy network using Reinforcement Learning to dynamically compute mixing weights for these frozen adapters. By sampling from a Dirichlet distribution parameterized by the policy network, Fusian fuses multiple adapters to align the model's output with a specific numerical target intensity. Experiments on the Qwen3-14B model demonstrate that Fusian achieves high precision in personality control, significantly outperforming baseline methods in aligning with user-specified trait intensities.
[99] SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
Pengfei Yue,Xingran Zhao,Juntao Chen,Peng Hou,Wang Longchao,Jianghang Lin,Shengchuan Zhang,Anxiang Zeng,Liujuan Cao
Main category: cs.CL
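TL;DR: This paper proposes SEA-Vision, a benchmark for evaluating document parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. It covers diverse document types and complex tasks and is built with an efficient, high-quality hybrid annotation pipeline.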
Details
Motivation: Existing benchmarks mostly focus on high-resource languages and do not reflect realistic multilingual environments; Southeast Asia's linguistic diversity, complex writing systems, and varied document types call for a dedicated benchmark.
Method: Build the SEA-Vision benchmark, with 15,234 document parsing pages (nine document types, annotated at three hierarchical levels) and 7,496 TEC-VQA question-answer pairs; propose a hybrid annotation pipeline that combines automated filtering and scoring, MLLM-assisted labeling, and lightweight native-speaker verification.
Result: Several leading multimodal models show pronounced performance degradation on low-resource Southeast Asian languages, revealing substantial remaining gaps in multilingual document and scene text understanding.
Conclusion: SEA-Vision fills the gap in evaluating multilingual document understanding for Southeast Asia and is expected to help drive progress in document and scene text understanding worldwide.
Abstract: Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
[100] CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
Taeyun Roh,Wonjune Jang,Junha Jung,Jaewoo Kang
Main category: cs.CL
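TL;DR: This paper proposes CLAG, a clustering-based agentic memory framework. An SLM-driven router groups memories into clusters and generates a profile for each, enabling localized evolution and two-stage retrieval, which improves the memory efficiency and question-answering robustness of small language model agents.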
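The two-stage retrieval mentioned in this TL;DR can be sketched in Python as follows. This is a minimal illustration, not CLAG's implementation: the `Cluster` record and the word-overlap scorer are hypothetical stand-ins for the SLM-driven router and for whatever relevance model the framework actually uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    profile: str                      # topic summary and tags for the cluster
    memories: List[str] = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Count shared lowercase words; a toy stand-in for a relevance model."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, clusters: List[Cluster],
             top_clusters: int = 1, top_memories: int = 2) -> List[str]:
    # Stage 1: keep only the clusters whose profiles best match the query,
    # which excludes distractors and shrinks the search space.
    kept = sorted(clusters, key=lambda c: overlap(query, c.profile),
                  reverse=True)[:top_clusters]
    # Stage 2: rank individual memories inside the surviving clusters.
    candidates = [m for c in kept for m in c.memories]
    return sorted(candidates, key=lambda m: overlap(query, m),
                  reverse=True)[:top_memories]
```

In this sketch a memory in a filtered-out cluster can never be returned, even if it shares words with the query, which is the distractor-exclusion behavior the abstract attributes to the first stage.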
Details
Motivation: Existing memory systems store experiences in a single global pool, which tends to dilute or corrupt knowledge; the effect is especially severe for small language models (SLMs), which are easily distracted by irrelevant context.
Method: CLAG uses an SLM-driven router to cluster incoming memories semantically and generates topic summaries and descriptive tags for each cluster, making each cluster a self-contained functional unit. Memory evolution is localized within clusters, and retrieval is two-stage: cluster profiles first filter the relevant clusters, then retrieval proceeds within them.
Result: On multiple QA datasets with three SLM backbones, CLAG consistently outperforms prior memory systems, improving answer quality and robustness while remaining lightweight and efficient.
Conclusion: Through structured clustering and localized management, CLAG reduces cross-topic interference and increases memory density, giving SLM agents a more reliable and scalable memory architecture.
Abstract: Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool, which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.
[101] Invisible failures in human-AI interactions
Christopher Potts,Moritz Sudhof
Main category: cs.CL
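TL;DR: Analyzing a large volume of human-AI interactions from the WildChat dataset, this paper finds that 78% of AI failures are "invisible failures" (something went wrong, but the user gave no overt indication of a problem) and identifies eight invisible-failure archetypes together with their co-occurrence patterns; 91% of failures involve interactional dynamics rather than insufficient model capability, and an estimated 94% of such failures would persist even with a more capable model.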
Details
Motivation: AI系统常无声失败,传统基于显式反馈的评估方法难以捕捉这类问题,亟需系统性识别和分类隐形失败以提升可靠性。 Method: 基于WildChat数据集进行大规模定量分析,识别隐形失败案例,归纳八种失败 archetype,并分析其共现模式与成因(交互性 vs. 能力性)。 Result: 发现78%的AI失败为隐形失败;归纳出八种可复现的隐形失败 archetype;91%失败属交互性,其中94%预计在更强模型中依然存在;archetype 在不同使用场景中表现出系统性与变异性。 Conclusion: 隐形失败的分类体系对产品开发、科学研究和政策制定中的AI可靠性监控具有关键价值。 Abstract: AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at https://github.com/bigspinai/bigspin-invisible-failure-archetypes[102] ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models
Duy Vu Minh Nguyen,Chinh Thanh Truong,Phuc Hoang Tran,Hung Tuan Le,Nguyen Van-Thanh Dat,Trung Hieu Pham,Kiet Van Nguyen
Main category: cs.CL
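TL;DR: This paper introduces ViX-Ray, a dataset of 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions, fine-tunes several vision-language models on it, and evaluates them on Vietnamese medical image understanding, revealing shortcomings in precision and hallucination.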
Details
Motivation: 现有视觉语言模型缺乏越南语医学数据训练,难以准确生成符合越南患者临床需求的诊断结果。 Method: 构建越南语胸部X光图像数据集ViX-Ray(5400张图像,含医生撰写的发现与印象标注),分析其语言特征,并在该数据集上微调五个开源VLM,与GPT-4V和Gemini对比评估。 Result: 微调后的模型在发现生成上部分符合临床真实,但在印象生成中普遍存在低精度与高幻觉现象;ViX-Ray被确立为越南语临床VLM评估的重要基准。 Conclusion: ViX-Ray填补了越南语医学视觉语言建模的数据空白,揭示了跨语言、跨领域VLM部署的关键挑战,为后续研究提供了新基准与改进方向。 Abstract: Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.[103] Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models
Xiyu Liu,Qingyi Si,Zhengxiao Liu,Chenxu Yang,Naibin Gu,Zheng Lin
Main category: cs.CL
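TL;DR: This paper proposes RoSE, which uses Isotropic Geometric Alignment and Hierarchical Knowledge Integration to address the generalization failure that activation drift causes in same-subject knowledge editing for large language models.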
Details
Motivation: 现有locate-then-edit知识编辑方法在相同主语场景下存在泛化失败:模型能回忆编辑后的知识,却无法按用户指令正确调用更新知识。 Method: 提出RoSE(Robust Same-subject Editing)方法,包含两个核心组件:(1) 各向同性几何对齐(Isotropic Geometric Alignment),最小化表征偏差;(2) 分层知识集成(Hierarchical Knowledge Integration),平滑优化曲面。同时指出原有正交梯度联合优化与协方差约束分别导致尖锐极小值与协方差陷阱。 Result: RoSE显著提升了模型在指令遵循任务中的表现,增强了大语言模型代理的交互式参数化记忆鲁棒性。 Conclusion: 知识编辑的泛化失败源于编辑后模型几何容差不足与激活漂移冲突;RoSE通过几何与优化双路径改进,实现了更鲁棒的相同主语知识更新。 Abstract: While locate-then-edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same-subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric root of this generalization collapse as a fundamental conflict where the inner activation drifts induced by prompt variations exceed the model's geometric tolerance for generalization after editing. We attribute this instability to a dual pathology: (1) The joint optimization with orthogonal gradients collapses solutions into sharp minima with narrow stability, and (2) the standard covariance constraint paradoxically acts as a Covariance Trap that amplifies input perturbations. To resolve this, we introduce RoSE (Robust Same-subject Editing), which employs Isotropic Geometric Alignment to minimize representational deviation and Hierarchical Knowledge Integration to smooth the optimization landscape. Extensive experiments demonstrate that RoSE significantly improves instruction-following capabilities, laying the foundation for robust interactive parametric memory of LLM agents.[104] SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction
David Števaňák,Marek Šuppa
Main category: cs.CL
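TL;DR: This paper builds SlovKE, a large-scale Slovak keyphrase extraction dataset, evaluates several unsupervised methods and an LLM-based method (KeyLLM) on it, shows that morphological variation makes exact matching difficult, and releases the data and code.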
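The gap between exact and partial matching can be made concrete with a small F1@k sketch. The digest does not say how partial matching is defined, so the prefix-based matcher `shares_prefix` below is purely an assumption for illustration, and the Slovak phrases are illustrative inflected forms rather than dataset samples.

```python
def f1_at_k(predicted, gold, k=6, match=None):
    """F1 over the top-k predictions; `match(p, g)` decides whether p hits g."""
    match = match or (lambda p, g: p == g)
    top = [p.lower() for p in predicted[:k]]
    gold = [g.lower() for g in gold]
    if not top or not gold:
        return 0.0
    hit_pred = sum(any(match(p, g) for g in gold) for p in top)
    hit_gold = sum(any(match(p, g) for p in top) for g in gold)
    precision = hit_pred / len(top)
    recall = hit_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def shares_prefix(p, g, n=5):
    """Assumed partial matcher: some word pair shares its first n characters."""
    return any(a[:n] == b[:n]
               for a in p.split() for b in g.split()
               if len(a) >= n and len(b) >= n)

gold = ["neurónové siete", "strojové učenie"]        # author-assigned forms
predicted = ["neurónových sietí", "strojové učenie"]  # first phrase is inflected

exact_f1 = f1_at_k(predicted, gold)
partial_f1 = f1_at_k(predicted, gold, match=shares_prefix)
```

Here the inflected phrase fails the exact comparison but passes the prefix comparison, so the exact score is half the partial score. That mirrors, at toy scale, the kind of mismatch the entry attributes to Slovak morphology.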
Details
Motivation: 斯洛伐克语等形态丰富、低资源语言的关键词提取研究受限于缺乏合适的评测数据集,亟需构建高质量、大规模的基准资源。 Method: 1)从斯洛伐克学位论文中央注册库中爬取并清洗227,432篇科学摘要,构建SlovKE数据集;2)在该数据集上评测YAKE、TextRank、KeyBERT(使用SlovakBERT嵌入)三种无监督方法及基于GPT-3.5-turbo的KeyLLM方法;3)结合精确匹配F1@6、部分匹配及人工评估(κ=0.61)进行综合分析。 Result: 无监督方法精确匹配F1@6最高仅11.6%,而部分匹配可达51.5%,凸显形态不一致问题;KeyLLM显著缩小精确–部分匹配差距,人工评估证实其能更好捕获作者意图的关键概念。 Conclusion: 形态错配是统计方法在斯洛伐克语关键词提取中的主要失败原因,该结论对其他屈折语具普适性;KeyLLM展现出更强的规范化能力;SlovKE数据集和评测代码已开源。 Abstract: Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases -- scraped and systematically cleaned from the Slovak Central Register of Theses -- representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6\% exact-match $F1@6$, with a large gap to partial matching (up to 51.5\%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact--partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents ($κ= 0.61$) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods -- a finding relevant to other inflected languages. The dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and evaluation code (https://github.com/NaiveNeuron/SlovKE) are publicly available.[105] Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation
Yanick Zengaffinen,Andreas Opedal,Donya Rooein,Kv Aditya Srivatsa,Shashank Sonkar,Mrinmaya Sachan
Main category: cs.CL
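TL;DR: This paper studies how large language models (LLMs) model common student misconceptions when generating multiple-choice distractors. Their reasoning aligns closely with educational best practice: solve the problem correctly, simulate several misconceptions, then select distractors. Errors arise mainly when solving the problem or selecting among candidates rather than when simulating misconceptions, and providing the correct solution in the prompt improves alignment with human-authored distractors by 8%.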
Details
Motivation: 建模学生合理误解对教育AI至关重要,而生成高质量、 plausible 的多项选择题干扰项是检验该能力的关键任务。 Method: 提出一个分析框架和分类法,系统考察SOTA大语言模型在干扰项生成中的推理策略,并与学习科学中的教育实践进行对比;同时分析失败模式,并通过提示工程(如显式提供正确答案)验证关键影响因素。 Result: LLMs的推理流程普遍符合教育最佳实践(先解题→模拟误解→筛选干扰项);失败主因是解题错误或候选筛选不当,而非误解模拟不足;显式提供正确答案使生成干扰项与人工编写的一致性提升8%。 Conclusion: LLMs具备建模学生错误推理的潜力,其过程具有可解释性与结构化特征;提升解题准确性和锚定正确答案是优化干扰项生成的关键路径。 Abstract: Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors.[106] Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
Aozhe Wang,Yuchen Yan,Nan Zhou,Zhengxi Lu,Weiming Lu,Jun Xiao,Yueting Zhuang,Yongliang Shen
Main category: cs.CL
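TL;DR: This paper proposes Code-A1, a framework that jointly optimizes a code generation model and a test generation model through adversarial co-evolution, avoiding the self-collusion and poorly targeted tests seen in self-play and yielding high-quality, targeted test generation and better code generation.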
Details
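Motivation: Existing reinforcement learning methods for code generation are limited by the scarcity of high-quality test suites, the limited coverage of existing datasets, and static rewards that cannot adapt as models improve; self-play methods unify code and test generation but face a dilemma between self-collusion under white-box access and overly generic tests under black-box restriction. Method: Proposes Code-A1, an adversarial co-evolution framework that decouples a Code LLM (rewarded for passing more tests) from a Test LLM (rewarded for exposing more defects) and gives them opposing objectives; introduces a Mistake Book mechanism for experience replay and a composite reward that balances test validity with adversarial difficulty; gives the Test LLM white-box access to candidate code so it can craft targeted adversarial tests. Result: Experiments on Qwen2.5-Coder models show that Code-A1 matches or exceeds the code generation performance of models trained on human-annotated tests while significantly improving test generation capability. Conclusion: Through architectural separation and adversarial objectives, Code-A1 avoids self-collusion while still exploiting white-box information, improving both test quality and code generation and offering a new paradigm for test-driven code generation. Abstract: Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.
[107] Mechanistic Origin of Moral Indifference in Language Models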
Lingyu Li,Yan Teng,Yingchun Wang
Main category: cs.CL
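TL;DR: This paper argues that large language models (LLMs) have an inherent "moral indifference" in their representations: they fail to distinguish opposed moral concepts and fine-grained typicality gradients. Building on Prototype Theory, the authors construct 251k moral vectors, use Sparse Autoencoders on Qwen3-8B to isolate mono-semantic moral features and reconstruct their topological relationships, and thereby achieve representation-level alignment that improves moral reasoning (75% pairwise win rate on the Flames benchmark); they call for a shift from post-hoc correction to proactive cultivation in alignment.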
Details
Motivation: 现有行为对齐方法忽视表面合规与内部未对齐表征之间的差距,且LLMs因将多元道德概念压缩为统一概率分布而天然趋向道德冷漠,导致长尾风险;需从表征层面识别并修复该根本缺陷。 Method: 1)基于Prototype Theory和Social-Chemistry-101构建251k道德向量;2)在23个模型上实证分析道德表征的 indifferent 现象;3)在Qwen3-8B上应用稀疏自编码器提取单义道德特征,并依据真值道德向量重构其拓扑结构。 Result: 发现所有测试模型(无论规模、架构或是否经对齐训练)均无法区分对立道德类别及典型性梯度;经表征重构后,在Flames对抗基准上达到75%成对胜率,显著提升道德推理与细粒度判断能力。 Conclusion: LLMs的道德冷漠是表征层面的系统性缺陷,仅靠行为对齐不足;需通过可解释的、基于认知理论的表征干预实现根本对齐;未来对齐应借鉴经验主义哲学,转向内生性、培育式路径。 Abstract: Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and targetedly reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.[108] Mixture-of-Depths Attention
Lianghui Zhu,Yuxin Fang,Bencheng Liao,Shijie Wang,Tianheng Cheng,Zilong Huang,Chen Chen,Lai Wei,Yutao Zeng,Ya Wang,Yi Lin,Yu Li,Xinggang Wang
Main category: cs.CL
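TL;DR: This paper proposes mixture-of-depths attention (MoDA), a mechanism that lets each attention head attend to key-value pairs from both the current layer and preceding layers, mitigating signal degradation in deep large language models and improving performance while remaining hardware-efficient.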
Details
Motivation: 随着大语言模型(LLMs)不断加深,浅层提取的有用特征在多次残差更新中逐渐被稀释,导致深层难以恢复这些信息,即信号退化问题。 Method: 提出混合深度注意力(MoDA),使每个注意力头可同时访问当前层及前序层的KV对;并设计了支持非连续内存访问的硬件高效实现算法。 Result: 在1.5B参数模型上,MoDA在10个验证基准上平均困惑度降低0.2,在10个下游任务上平均性能提升2.11%,仅增加3.7% FLOPs开销;且与post-norm组合效果优于pre-norm。 Conclusion: MoDA是一种有效缓解信号退化、支持模型深度扩展的新型注意力机制,具备实用性和可扩展性。 Abstract: Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .[109] OpenClaw-RL: Train Any Agent Simply by Talking
Yinjie Wang,Xuyang Chen,Xiaolong Jin,Mengdi Wang,Ling Yang
Main category: cs.CL
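TL;DR: OpenClaw-RL uses the next-state signals that arise in every interaction (user replies, tool outputs, GUI changes, and so on) as a live supervision source for online reinforcement learning; by combining evaluative signals (scalar rewards) with directive signals (hindsight-guided on-policy distillation) in an asynchronous design that needs no coordination, it lets agents keep improving during real use.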
Details
Motivation: 现有基于智能体的强化学习系统未充分利用交互中自然产生的‘下一状态信号’(如用户回复、工具输出、界面变化等)作为实时学习信号,导致训练效率低、泛化性差且难以在线持续优化。 Method: 提出 OpenClaw-RL 框架:1)统一建模多源交互(对话、终端、GUI、SWE、工具调用)为共享策略训练任务;2)从下一状态中分离提取两类信号——经 PRM 判定的评估性标量奖励,以及通过 Hindsight-Guided On-Policy Distillation (OPD) 提取的文本级指导性提示与 token-level 方向性优势监督;3)采用完全异步架构,实现推理、评判与训练三者并行无协调。 Result: 在个人智能体场景中,模型能从用户重查、纠正和显式反馈中自动学习;在通用智能体场景中,同一框架支持终端、GUI、软件工程及工具调用等多模态 RL 训练,并验证了过程奖励的有效性。 Conclusion: 下一状态信号是普适且高信息量的学习源;OpenClaw-RL 证明了统一利用其评估性与指导性成分可显著提升智能体在线适应能力与跨任务泛化性,为真正实用的 agentic RL 提供新范式。 Abstract: Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. 
Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
cs.CV [Back]
[110] KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
Henry Gagnier,Sophie Gagnier,Ashwin Kirubakaran
Main category: cs.CV
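TL;DR: This paper builds a synthetic OCR dataset of 7,219 images covering the Arabic, Cyrillic, and Latin scripts of Kazakh and evaluates three multimodal large language models (MLLMs) on OCR and language identification. Current MLLMs perform poorly on low-resource Abjad-based scripts such as Arabic, misidentify the language, and fall well behind traditional OCR, underscoring the need for more inclusive models and benchmarks.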
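Character error rate, the metric on which this entry compares traditional OCR with MLLMs, is conventionally the Levenshtein edit distance divided by the reference length. The sketch below uses that common definition; whether the paper normalizes or tokenizes text in exactly this way is not stated in the digest, so treat it as an assumption.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate relative to the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)

# Illustrative case: the Kazakh-specific letter is replaced by a look-alike.
reference = "қазақ"
hypothesis = "казак"
score = cer(reference, hypothesis)
```

Two of the five reference characters are substituted in this example, so the score is 0.4. A CER above 1.0 is possible when the hypothesis is much longer than the reference.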
Details
Motivation: Kazakh is written in the Arabic, Cyrillic, and Latin scripts, making it a distinctive low-resource language for OCR; no OCR benchmarks or image data exist for its Arabic and Latin scripts, so a dataset and an evaluation of current models are needed. Method: Builds a synthetic OCR dataset of 7,219 images covering all three scripts with font, color, and noise variations; evaluates Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct on a subset for OCR and language identification, and compares them with a classical OCR baseline. Result: All MLLMs fail at Latin and Arabic script OCR; Arabic-script Kazakh is often misidentified as Arabic, Farsi, or Kurdish; traditional OCR achieves lower character error rates than the MLLMs. Conclusion: Current MLLMs have significant gaps in processing low-resource Abjad-based scripts, and more inclusive models and benchmarks that support low-resource scripts and languages are needed. Abstract: Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and Latin scripts. We construct a synthetic OCR dataset of 7,219 images for all three scripts with font, color, and noise variations to imitate real OCR tasks. We evaluated three multimodal large language models (MLLMs) on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models are unsuccessful with Latin and Arabic script OCR, and fail to recognize the Arabic script as Kazakh text, misclassifying it as Arabic, Farsi, and Kurdish. We further compare MLLMs with a classical OCR baseline and find that traditional OCR achieves lower character error rates, which the MLLMs fail to match. These findings show significant gaps in current MLLM capabilities to process low-resource Abjad-based scripts and demonstrate the need for inclusive models and benchmarks supporting low-resource scripts and languages.
[111] Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field
Ozge Mercanoglu Sincan,Jian He Low,Sobhan Asasi,Richard Bowden
Main category: cs.CV
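TL;DR: This paper re-implements recent gloss-free SLT models and evaluates them under a unified setup, finding that many reported gains stem from implementation details and differences in evaluation settings rather than from methodological novelty.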
Details
Motivation: The sources of performance gains in SLT are unclear, making it hard to tell methodological novelty apart from implementation or evaluation differences; a fair, reproducible benchmark analysis is needed. Method: Re-implements several mainstream gloss-free SLT models in a unified codebase and standardizes preprocessing, video encoders, and training setups for a fair comparison. Result: Most of the gains reported in the literature shrink markedly under consistent conditions, showing that implementation details and evaluation setups strongly affect results. Conclusion: SLT research needs more attention to experimental consistency and reproducibility; a public codebase helps make the field more transparent. Abstract: Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available here (https://github.com/ozgemercanoglu/sltbaselines) to support transparency and reproducibility in SLT research.
[112] Safety-Guided Flow (SGF): A Unified Framework for Negative Guidance in Safe Generation
Mingyu Kim,Young-Heon Kim,Mijung Park
Main category: cs.CV
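TL;DR: This paper proposes a unified probabilistic framework based on an MMD potential that recasts Shielded Diffusion and Safe Denoiser as energy-based negative guidance against unsafe samples; a control barrier function analysis shows that negative guidance should be strong early in denoising and decay to zero later, balancing safety and generation quality.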
Details
Motivation: 现有扩散/流模型的安全机制分为两类:机器人规划中基于几何约束的控制屏障函数,以及数据驱动的负向引导;但后者缺乏对‘何时需要安全引导’的理论依据。 Method: 提出基于最大均值差异(MMD)势能的统一概率框架,将两类方法建模为能量型负向引导;结合控制屏障函数理论,推导出负向引导需在关键时间窗内强施加、并在窗外交替衰减至零。 Result: 在多个真实安全生成任务上验证了负向引导应集中在去噪早期阶段,才能实现既安全又高质量的生成。 Conclusion: 负向引导的有效性依赖于时序调控,而非恒定强度;本文框架为安全生成提供了理论支撑与实用指导。 Abstract: Safety mechanisms for diffusion and flow models have recently been developed along two distinct paths. In robot planning, control barrier functions are employed to guide generative trajectories away from obstacles at every denoising step by explicitly imposing geometric constraints. In parallel, recent data-driven, negative guidance approaches have been shown to suppress harmful content and promote diversity in generated samples. However, they rely on heuristics without clearly stating when safety guidance is actually necessary. In this paper, we first introduce a unified probabilistic framework using a Maximum Mean Discrepancy (MMD) potential for image generation tasks that recasts both Shielded Diffusion and Safe Denoiser as instances of our energy-based negative guidance against unsafe data samples. Furthermore, we leverage control-barrier functions analysis to justify the existence of a critical time window in which negative guidance must be strong; outside of this window, the guidance should decay to zero to ensure safe and high-quality generation. We evaluate our unified framework on several realistic safe generation scenarios, confirming that negative guidance should be applied in the early stages of the denoising process for successful safe generation.[113] Benchmarking Compact VLMs for Clip-Level Surveillance Anomaly Detection Under Weak Supervision
Kirill Borodin,Kirill Kondrashov,Nikita Vasiliev,Ksenia Gladkova,Inna Larina,Mikhail Gorodnichev,Grach Mkrtchian
Main category: cs.CV
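TL;DR: This paper studies compact vision-language models (VLMs) as anomaly detectors for CCTV safety monitoring, proposing a unified evaluation protocol that weighs detection accuracy against per-clip latency under weak supervision and using parameter-efficient fine-tuning to improve robustness and practicality.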
Details
Motivation: CCTV safety monitoring requires anomaly detectors that combine reliable clip-level accuracy with predictable per-clip latency under weak supervision. Method: Builds a unified evaluation protocol (covering preprocessing, prompt design, dataset splits, metrics, and runtime settings) and compares parameter-efficiently fine-tuned compact VLMs with training-free VLM pipelines and weakly supervised baselines. Result: Parameter-efficiently fine-tuned compact VLMs match or exceed established approaches on detection metrics (such as F1 and ROC-AUC) while keeping competitive per-clip latency and reducing sensitivity to prompt design. Conclusion: Parameter-efficient fine-tuning makes compact VLMs reliable, efficient, and robust clip-level anomaly detectors, giving a favorable accuracy-efficiency trade-off within a transparent and consistent experimental framework. Abstract: CCTV safety monitoring demands that anomaly detectors combine reliable clip-level accuracy with predictable per-clip latency despite weak supervision. This work investigates compact vision-language models (VLMs) as practical detectors for this regime. A unified evaluation protocol standardizes preprocessing, prompting, dataset splits, metrics, and runtime settings to compare parameter-efficiently adapted compact VLMs against training-free VLM pipelines and weakly supervised baselines. Evaluation spans accuracy, precision, recall, F1, ROC-AUC, and average per-clip latency to jointly quantify detection quality and efficiency. With parameter-efficient adaptation, compact VLMs achieve performance on par with, and in several cases exceeding, established approaches while retaining competitive per-clip latency. Adaptation further reduces prompt sensitivity, producing more consistent behavior across prompt regimes under the shared protocol. These results show that parameter-efficient fine-tuning enables compact VLMs to serve as dependable clip-level anomaly detectors, yielding a favorable accuracy-efficiency trade-off within a transparent and consistent experimental setup.
[114] Information-Theoretic Constraints for Continual Vision-Language-Action Alignment
Libang Zhao,Qixin Zeng,Hongyin Zhang,Donglin Wang
Main category: cs.CV
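TL;DR: This paper proposes Info-VLA, a framework that uses Replay Anchor Contrastive Learning and Cross-Modal Mutual Information Maximization to mitigate catastrophic forgetting in continual learning of vision-language-action models while preserving cross-modal information structure.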
Details
Motivation: VLA models suffer severe catastrophic forgetting when continually acquiring new skills in open-ended environments, and the root cause is the deterioration of cross-modal (vision, language, action) information structure. Method: Proposes Info-VLA with two complementary constraints: (1) Replay Anchor Contrastive Learning, which builds stable alignment anchors from a frozen teacher model; (2) Cross-Modal Mutual Information Maximization, which constrains the dependency structure between visual and language representations. Result: On the LIBERO benchmark, Info-VLA significantly outperforms existing methods in both task retention and adaptation. Conclusion: By jointly preserving historical alignment and cross-modal dependency information, Info-VLA achieves a better balance between stability and plasticity in continual learning. Abstract: When deployed in open-ended robotic environments, Vision--Language--Action (VLA) models need to continually acquire new skills, yet suffer from severe catastrophic forgetting. We observe that this degradation is related to the deterioration of cross-modal information structure, where dependencies among visual observations, language instructions, and actions progressively diffuse during continual adaptation. However, existing continual learning methods fail to preserve such cross-modal information dependencies. Thus, we propose Info-VLA, an information-preserving continual learning framework that maintains cross-modal information structure through two complementary constraints. Replay Anchor Contrastive Learning constructs stable alignment anchors from a frozen teacher model, preserving cross-modal alignment in the representation space. Cross-Modal Mutual Information Maximization further preserves dependency structure between visual and language representations through mutual information constraints. By jointly preserving historical alignment and cross-modal dependency information, Info-VLA balances stability and plasticity during continual learning. Furthermore, experiments on the LIBERO benchmark show that Info-VLA significantly outperforms existing methods in both task retention and adaptation.
[115] MultiSolSegment: Multi-channel segmentation of overlapping features in electroluminescence images of photovoltaic cells
Ojas Sanghi,Norman Jost,Benjamin G. Pierce,Emma Cooper,Isaiah H. Deane,Jennifer L. Braid
Main category: cs.CV
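TL;DR: This paper proposes a multi-channel U-Net architecture for pixel-level multi-label segmentation of electroluminescence (EL) images. It addresses the inability of existing methods to assign multiple labels to the same pixel, so overlapping photovoltaic module defects (such as cracks crossing busbars) can be identified accurately. The model reaches 98% accuracy and generalizes to unseen datasets, supporting automated defect quantification and lifetime prediction in large-scale PV systems.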
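The difference between single-label and multi-label pixel classification can be shown on one pixel. The sketch below assumes each output channel is passed through an independent sigmoid and thresholded at 0.5; the abstract only says the model outputs independent probability maps, so the activation, the threshold, and the logit values here are illustrative assumptions.

```python
from math import exp

# The four output channels named in the abstract.
CLASSES = ("crack", "busbar", "dark_area", "non_cell")

def sigmoid(x):
    """Map one channel's logit to an independent probability."""
    return 1.0 / (1.0 + exp(-x))

def multilabel_pixel(logits, threshold=0.5):
    """Every class whose own probability clears the threshold is kept."""
    return {name for name, z in zip(CLASSES, logits) if sigmoid(z) >= threshold}

def single_label_pixel(logits):
    """Conventional argmax decoding: exactly one class per pixel."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return CLASSES[best]

# Hypothetical logits for a pixel where a crack crosses a busbar.
pixel_logits = (2.0, 1.5, -3.0, -4.0)
labels_multi = multilabel_pixel(pixel_logits)
label_single = single_label_pixel(pixel_logits)
```

Argmax decoding reports only the crack, while per-channel thresholding keeps both the crack and the busbar, which is the overlapping case the entry highlights.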
Details
Motivation: Existing EL image analysis methods cannot assign multiple labels to the same pixel, which limits their ability to capture overlapping degradation features such as cracks crossing busbars. Method: Proposes a multi-channel U-Net architecture that outputs independent probability maps for cracks, busbars, dark areas, and non-cell regions, enabling pixel-level multi-label semantic segmentation. Result: The model reaches 98% accuracy and generalizes well to unseen datasets. Conclusion: The framework provides a scalable, extensible tool for automated PV module inspection, improving defect quantification and lifetime prediction in large-scale PV systems. Abstract: Electroluminescence (EL) imaging is widely used to detect defects in photovoltaic (PV) modules, and machine learning methods have been applied to enable large-scale analysis of EL images. However, existing methods cannot assign multiple labels to the same pixel, limiting their ability to capture overlapping degradation features. We present a multi-channel U-Net architecture for pixel-level multi-label segmentation of EL images. The model outputs independent probability maps for cracks, busbars, dark areas, and non-cell regions, enabling accurate co-classification of interacting features such as cracks crossing busbars. The model achieved an accuracy of 98% and has been shown to generalize to unseen datasets. This framework offers a scalable, extensible tool for automated PV module inspection, improving defect quantification and lifetime prediction in large-scale PV systems.
[116] Complementarity-Supervised Spectral-Band Routing for Multimodal Emotion Recognition
Zhexian Huang,Bo Zhao,Hui Ma,Zhishu Liu,Jie Zhang,Ruixin Zhang,Shouhong Ding,Zitong Yu
Main category: cs.CV
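TL;DR: This paper proposes Atsuko, a model that captures fine-grained complementary features through multi-scale frequency-band decomposition and expert collaboration, and combines a dual-path router with a Marginal Complementarity Module (MCM) for fine-grained cross-modal fusion, achieving superior performance on several emotion recognition benchmarks.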
Details
Motivation: 现有方法机械依赖单模态性能,忽略真实互补性;粗粒度融合难以满足情感任务所需的细粒度表征;异构模态间信息密度不一致阻碍跨模态特征挖掘。 Method: 提出Complementarity-Supervised Multi-Band Expert Network(Atsuko):1)对各模态特征进行正交高低中频带分解;2)设计双路径模态级路由器实现细粒度跨频带选择与跨模态融合;3)引入Marginal Complementarity Module(MCM),通过双模态对比量化各模态移除后的性能损失,生成互补性分布以软监督路由器聚焦于提供独特信息增益的模态。 Result: 在CMU-MOSI、CMU-MOSEI、CH-SIMS、CH-SIMSv2和MIntRec等多个基准上均取得优于现有方法的性能。 Conclusion: 细粒度频带分解与互补性驱动的软监督路由机制可有效提升多模态情感识别中跨模态协同建模能力,缓解主导模态带来的捷径学习问题。 Abstract: Multimodal emotion recognition fuses cues such as text, video, and audio to understand individual emotional states. Prior methods face two main limitations: mechanically relying on independent unimodal performance, thereby missing genuine complementary contributions, and coarse-grained fusion conflicting with the fine-grained representations required by emotion tasks. As inconsistent information density across heterogeneous modalities hinders inter-modal feature mining, we propose the Complementarity-Supervised Multi-Band Expert Network, named Atsuko, to model fine-grained complementary features via multi-scale band decomposition and expert collaboration. Specifically, we orthogonally decompose each modality's features into high, mid, and low-frequency components. Building upon this band-level routing, we design a modality-level router with a dual-path mechanism for fine-grained cross-band selection and cross-modal fusion. To mitigate shortcut learning from dominant modalities, we propose the Marginal Complementarity Module (MCM) to quantify performance loss when removing each modality via bi-modal comparison. The resulting complementarity distribution provides soft supervision, guiding the router to focus on modalities contributing unique information gains. Extensive experiments show our method achieves superior performance on the CMU-MOSI, CMU-MOSEI, CH-SIMS, CH-SIMSv2, and MIntRec benchmarks.[117] Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning
Zhenyu Zhang,Yixiong Zou,Yuhua Li,Ruixuan Li,Guangyao Chen
Main category: cs.CV
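TL;DR: This paper finds that in source-free cross-domain few-shot learning (SF-CDFSL) with vision-language models (VLMs), strengthening visual discriminability actually hurts performance. The root cause is that cross-entropy fine-tuning induces a visual-learning shortcut that weakens the crucial cross-modal alignment. The authors propose perturbing visual learning and progressively aligning the visual and textual modalities using semantic relationships, setting new state-of-the-art results across multiple datasets and backbones.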
Details
Motivation: 在VLM-based SF-CDFSL任务中,传统提升视觉判别力的做法反而抑制性能,亟需解释该反直觉现象并提出有效解决方案。 Method: 通过理论与实验分析揭示交叉熵损失中视觉学习与跨模态学习的竞争机制;提出两阶段方法:1)扰动视觉学习以抑制捷径;2)利用视觉-文本语义关系渐进式对齐图文模态。 Result: 在4个CDFSL数据集和11个FSL数据集、多种骨干(CLIP/SigLIP/PE-Core)上均取得SOTA结果。 Conclusion: 过度强调视觉判别力会阻碍VLM在SF-CDFSL中的跨模态对齐,应主动引导模型聚焦于图文语义一致性而非单模态判别性。 Abstract: Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current works in traditional visual models suggest that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that \textbf{strengthening visual-modal discriminability actually suppresses VLMs' performance}. In this paper, we aim to delve into this phenomenon for an interpretation and a solution. By both theoretical and experimental proofs, our study reveals that fine-tuning with the typical cross-entropy loss ($\mathcal{L}_{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted modality misalignment in SF-CDFSL. However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce $\mathcal{L}_{\mathrm{vlm}}$ without considering the cross-modal part, therefore hindering the cross-modal alignment and harming the performance. Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during the fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state-of-the-art results. 
Code is available at https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap.
[118] DDS-UDA: Dual-Domain Synergy for Unsupervised Domain Adaptation in Joint Segmentation of Optic Disc and Optic Cup
Yusong Xiao,Yuxuan Wu,Li Xiao,Gang Qu,Haiye Huo,Yu-Ping Wang
Main category: cs.CV
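TL;DR: This paper proposes DDS-UDA, a novel unsupervised domain adaptation framework for joint optic disc and optic cup segmentation in fundus images, targeting the performance drop that scarce annotations and domain shift cause in cross-institution deployment. The framework improves robustness and generalization through bi-directional cross-domain consistency regularization and frequency-driven intra-domain pseudo-label learning, and is validated on multi-domain datasets.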
Details
Motivation: CNNs perform well on single-center data, but clinical adoption is limited by the scarcity of large, high-quality annotated datasets and by domain shift when models are deployed across devices and protocols; existing UDA methods lack a unified treatment of cross-domain interference suppression and intra-domain generalization.
Method: DDS-UDA comprises two core modules: (1) a bi-directional cross-domain consistency regularization module guided by a coarse-to-fine dynamic mask generator, which exchanges feature-level semantic information while suppressing noise propagation; (2) a frequency-driven intra-domain pseudo-label learning module, which enhances intra-domain generalization by synthesizing spectral amplitude-mixed supervision signals. The whole framework uses a teacher-student architecture to disentangle domain-specific biases from domain-invariant representations.
Result: Evaluated on two multi-domain fundus image datasets, DDS-UDA significantly outperforms several existing UDA methods, improving the accuracy and robustness of joint optic disc and optic cup segmentation.
Conclusion: DDS-UDA offers a new approach to domain shift and annotation scarcity in medical image segmentation, with good potential for clinical translation.
Abstract: Convolutional neural networks (CNNs) have achieved exciting performance in joint segmentation of optic disc and optic cup on single-institution datasets. However, their clinical translation is hindered by two major challenges: limited availability of large-scale, high-quality annotations and performance degradation caused by domain shift during deployment across heterogeneous imaging protocols and acquisition platforms. While unsupervised domain adaptation (UDA) provides a way to mitigate these limitations, most existing approaches do not address cross-domain interference and intra-domain generalization within a unified framework. In this paper, we present the Dual-Domain Synergy UDA (DDS-UDA), a novel UDA framework that comprises two key modules. First, a bi-directional cross-domain consistency regularization module is enforced to mitigate cross-domain interference through feature-level semantic information exchange guided by a coarse-to-fine dynamic mask generator, suppressing noise propagation while preserving structural coherence. Second, a frequency-driven intra-domain pseudo label learning module is used to enhance intra-domain generalization by synthesizing spectral amplitude-mixed supervision signals, which ensures high-fidelity feature alignment across domains. Implemented within a teacher-student architecture, DDS-UDA disentangles domain-specific biases from domain-invariant feature-level representations, thereby achieving robust adaptation to heterogeneous imaging environments.
We conduct a comprehensive evaluation of our proposed method on two multi-domain fundus image datasets, demonstrating that it outperforms several existing UDA-based methods and thus provides an effective approach to optic disc and optic cup segmentation.
[119] Post Training Quantization for Efficient Dataset Condensation
Linh-Tam Tran,Sung-Ho Bae
Main category: cs.CV
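TL;DR: This paper is the first to explore post-training quantization for dataset condensation. It proposes a patch-based quantization method that, combined with quantization-aware clustering and a distribution-alignment module, significantly improves the representation quality of the condensed data and downstream model accuracy at extremely low bit-widths (e.g., 2-bit).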
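To make the patch-based idea concrete, here is a minimal sketch of why localized quantization can lose less information than a single global quantizer at 2 bits. It is an illustration under stated assumptions, not the paper's implementation: the uniform min-max quantizer, the 1-D toy signal, and the function names `quantize` and `patchwise_quantize` are all hypothetical choices made for this example.

```python
def quantize(values, bits=2):
    """Uniform min-max quantization; returns the dequantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    levels = (1 << bits) - 1          # 2 bits -> 3 steps between 4 levels
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def patchwise_quantize(values, patch_size, bits=2):
    """Quantize each patch with its own (lo, scale) pair."""
    out = []
    for start in range(0, len(values), patch_size):
        out.extend(quantize(values[start:start + patch_size], bits))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy 1-D "image": a dark patch followed by a bright patch (made-up values).
signal = [0.00, 0.02, 0.05, 0.03, 0.90, 0.96, 1.00, 0.92]
global_err = mse(signal, quantize(signal, bits=2))
patch_err = mse(signal, patchwise_quantize(signal, patch_size=4, bits=2))
print(global_err, patch_err)
```

In this toy case the per-patch quantizer has the smaller error because each patch spends its four levels on a narrow value range. The cost is one `(lo, scale)` pair per patch, which is the parameter overhead that the quantization-aware clustering described in the entry is meant to reduce by sharing parameters among similar patches.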
Details
Motivation: Existing dataset condensation methods overlook the potential of quantization to further reduce storage cost; at extremely low bit-widths in particular, conventional quantization severely degrades representation quality.
Method: A patch-based post-training quantization method is proposed, with quantization-aware clustering to reduce the overhead of quantization parameters and a distribution-aligned refinement module to compensate for quantization errors. The framework is plug-and-play and applies to synthetic images produced by various dataset condensation methods.
Result: On CIFAR-10/100, Tiny ImageNet, and ImageNet subsets, the method consistently outperforms prior work under the same storage constraints. Under extreme compression (IPC=1, 2-bit), the test accuracy of DM rises from 26.0% to 54.1%, nearly doubling.
Conclusion: Post-training quantization can effectively improve the storage efficiency and representational capacity of dataset condensation. The proposed method preserves reconstruction quality and downstream generalization at extremely low bit-widths, offering a new paradigm for efficient, lightweight dataset distillation.
Abstract: Dataset Condensation (DC) distills knowledge from large datasets into smaller ones, accelerating training and reducing storage requirements. However, despite notable progress, prior methods have largely overlooked the potential of quantization for further reducing storage costs. In this paper, we take the first step to explore post-training quantization in dataset condensation, demonstrating its effectiveness in reducing storage size while maintaining representation quality without requiring expensive training cost. However, we find that at extremely low bit-widths (e.g., 2-bit), conventional quantization leads to substantial degradation in representation quality, negatively impacting the networks trained on these data. To address this, we propose a novel patch-based post-training quantization approach that ensures localized quantization with minimal loss of information. To reduce the overhead of quantization parameters, especially for small patch sizes, we employ quantization-aware clustering to identify similar patches and subsequently aggregate them for efficient quantization. Furthermore, we introduce a refinement module to align the distribution between original images and their dequantized counterparts, compensating for quantization errors. Our method is a plug-and-play framework that can be applied to synthetic images generated by various DC methods. Extensive experiments across diverse benchmarks including CIFAR-10/100, Tiny ImageNet, and ImageNet subsets demonstrate that our method consistently outperforms prior works under the same storage constraints.
Notably, our method nearly doubles the test accuracy of existing methods at extreme compression regimes (e.g., 26.0% → 54.1% for DM at IPC=1), while operating directly on 2-bit images without additional distillation.
[120] MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval
Fengbin Zhu,Zijing Cai,Yuzhe Wang,Pengyang Shao,Wenjie Wang,Fuli Feng,Richang Hong,Tat-Seng Chua
Main category: cs.CV
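TL;DR: This paper proposes the X-VisEmb paradigm and the MURE framework, which improve the accuracy and efficiency of visual document retrieval through multi-resolution sampling, cross-granularity feature fusion, and adaptive representation distillation.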
Details
Motivation: Existing visual document retrieval (VDR) models struggle to balance effectiveness (preserving fine-grained information) and efficiency (limiting the number of visual tokens) on high-resolution documents, which leads to heavy indexing overhead and high retrieval latency.
Method: The X-VisEmb paradigm comprises multi-resolution sampling and encoding, cross-granularity feature fusion, and adaptive representation distillation. Built on it, the MURE framework uses a vision-language model (VLM) as a hierarchical multi-resolution encoder, introduces resolution-level Matryoshka representation learning (RMRL) for feature fusion, and compresses visual tokens with a semantic-aware hierarchical clustering mechanism.
Result: On two widely used VDR benchmarks, MURE consistently outperforms strong baselines, and it significantly outperforms ColPali while using only 50% of ColPali's visual token budget.
Conclusion: By jointly optimizing multi-scale representation and efficient token compression, MURE maintains or improves retrieval performance while substantially reducing compute and storage overhead, offering a new paradigm for efficient VDR.
Abstract: Visual Document Retrieval (VDR) requires representations that capture both fine-grained visual details and global document structure to ensure retrieval efficacy while maintaining computational efficiency. Existing VDR models struggle to balance effectiveness and efficiency when processing high-resolution documents: they often either lose fine-grained information or generate an excessive number of visual tokens, resulting in significant indexing overhead and high retrieval latency. In this work, we rethink the visual encoding mechanism and propose a new X-VisEmb paradigm that progresses from multi-resolution sampling and encoding, through cross-granularity feature fusion, to adaptive representation distillation. A preliminary study validates its feasibility and effectiveness in capturing complementary visual cues at varying scales. Building on the insights, we develop MURE, a novel framework that employs VLMs as a hierarchical multi-resolution encoder, integrates resolution-level Matryoshka representation learning (RMRL) for effective feature fusion, and applies a semantic-aware hierarchical clustering mechanism for visual token compression. Experiments on two widely used VDR benchmarks show that our MURE framework consistently beats strong baselines. Furthermore, it significantly outperforms ColPali with only 50% of its visual token budget.
[121] Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts
Xi Chen,Maojun Zhang,Yu Liu,Shen Yan
Main category: cs.CV
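TL;DR: This paper proposes SpectralMoE, a parameter-efficient fine-tuning framework based on Mixture-of-Experts (MoE) that addresses cross-domain spectral shift in domain generalization semantic segmentation (DGSS) for spectral remote sensing. A dual-gated MoE structure performs modality-specific local refinement of visual and depth features, and cross-attention fuses the structural cues, significantly improving robustness and accuracy on unseen domains.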
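The dual-gated top-k routing mentioned above can be sketched as follows. This is a minimal illustration of generic top-k MoE routing with one gate per modality, assuming softmax renormalization over the selected experts; the toy scaling "experts", the gate logits, and the names `top_k_route` and `moe_refine` are hypothetical and are not taken from SpectralMoE.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])  # renormalize over the top-k only
    return dict(zip(top, weights))


def moe_refine(feature, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts for one token."""
    routing = top_k_route(gate_logits, k)
    out = [0.0] * len(feature)
    for idx, w in routing.items():
        expert_out = experts[idx](feature)
        out = [o + w * v for o, v in zip(out, expert_out)]
    return out, routing


# Toy experts: each one just rescales its input (stand-ins for small adapter networks).
experts = [lambda f, s=s: [s * v for v in f] for s in (0.5, 1.0, 1.5, 2.0)]

# "Dual-gated": the visual token and the depth token are routed by separate gate logits.
visual_out, visual_routing = moe_refine([1.0, 2.0], [0.1, 2.0, -1.0, 1.5], experts, k=2)
depth_out, depth_routing = moe_refine([0.5, 0.5], [1.2, -0.3, 0.8, 0.0], experts, k=2)
print(visual_routing, depth_routing)
```

Because each token picks its own experts, two spatial locations with different land cover can receive different refinements, which is the contrast with a single global adjustment that the entry draws. In a real model the gate logits would come from learned gating layers rather than fixed lists.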
Details
Motivation: Domain generalization semantic segmentation in spectral remote sensing is severely limited by spectral shifts across acquisition conditions. Existing global, homogeneous parameter-efficient fine-tuning methods cannot cope with the spatial heterogeneity of land cover and tend to cause semantic confusion.
Method: SpectralMoE uses a dual-gated Mixture-of-Experts (MoE) structure that routes the foundation model's visual features and the depth features estimated from RGB bands to top-k experts for separate local refinement. A cross-attention mechanism then fuses the refined structural cues into the visual stream, mitigating the semantic ambiguity caused by spectral variation.
Result: New state-of-the-art performance on multiple DGSS benchmarks covering hyperspectral, multispectral, and RGB remote sensing imagery.
Conclusion: Fine-grained, spatially adaptive feature refinement is more effective than global fine-tuning; an MoE architecture that combines depth guidance with modality decoupling is a key route to more robust DGSS.
Abstract: Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel PEFT framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations.
Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.
[122] AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification
Hamza Mooraj,George Pantazopoulos,Alessandro Suglia
Main category: cs.CV
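TL;DR: This paper systematically compares CNNs, contrastive vision-language models (VLMs), and generative VLMs on fine-grained crop disease classification. It introduces a new benchmark, AgriPath-LF16, and finds that the architectures behave very differently on laboratory versus field images, so the deployment scenario should drive model selection.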
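The two metrics named in this entry, macro-F1 and Parse Success Rate (PSR), can be sketched in a few lines. The macro-F1 computation below is the standard unweighted mean of per-class F1. The PSR part is an assumption for illustration: the entry does not define how an output is parsed, so the exact-match parser `parse_label`, the label set, and the choice to count unparsed outputs as wrong predictions are all hypothetical.

```python
def parse_label(text, valid_labels):
    """Hypothetical parser: accept a free-text output only if it is exactly a known label."""
    cleaned = text.strip().lower()
    return cleaned if cleaned in valid_labels else None


def parse_success_rate(outputs, valid_labels):
    """Fraction of generated outputs that parse to a valid label, plus the parsed labels."""
    parsed = [parse_label(o, valid_labels) for o in outputs]
    return sum(p is not None for p in parsed) / len(outputs), parsed


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; an unparsed prediction (None) matches no class."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(labels)


# Made-up example: four generative-VLM outputs for four images.
labels = ["blight", "healthy", "rust"]
truth = ["rust", "healthy", "blight", "blight"]
outputs = ["Rust", "healthy ", "I think it is blight", "blight"]

psr, parsed = parse_success_rate(outputs, set(labels))
score = macro_f1(truth, parsed, labels)
print(psr, score)
```

Reporting PSR next to macro-F1 separates two failure sources for generative models: outputs that name the wrong disease, and outputs that never resolve to a class at all. A CNN or contrastive VLM always emits a class index, so only the generative family needs the second number.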
Details
Motivation: Existing evaluations of crop disease detection are often limited to a single architecture family or to laboratory data, and lack a systematic analysis of model robustness under realistic, varied acquisition conditions (e.g., laboratory vs. field).
Method: The AgriPath-LF16 benchmark is built with 111k images (16 crops, 41 diseases, with laboratory and field images explicitly separated). Three model families (CNNs, contrastive VLMs, generative VLMs) are trained and evaluated under a unified protocol using macro-F1 and Parse Success Rate (PSR), across full, lab-only, and field-only training regimes.
Result: CNNs achieve the highest accuracy on laboratory images but transfer poorly across domains; contrastive VLMs are parameter-efficient with robust cross-domain performance; generative VLMs are the most robust to distribution shift but introduce new failure modes from free-text generation.
Conclusion: Architecture choice should be based on the actual deployment scenario (e.g., whether domain shift is expected) rather than on aggregate accuracy alone; each paradigm has its own range of applicability, and accuracy, robustness, and reliability must be traded off.
Abstract: Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.
[123] Int3DNet: Scene-Motion Cross Attention Network for 3D Intention Prediction in Mixed Reality
Taewook Ha,Woojin Cho,Dooyoung Kim,Woontack Woo
Main category: cs.CV
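TL;DR: Int3DNet is a scene-aware network that predicts 3D intention areas directly from scene geometry and head-hand motion cues, without explicit object recognition, improving the responsiveness and robustness of human-computer interaction in mixed reality.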
Details
Motivation: In mixed reality, accurately predicting user intention reduces interaction delay and makes the system more proactive and the experience smoother, but existing methods depend on explicit object perception and fall short in generalization and robustness.
Method: Int3DNet fuses sparse motion cues (head and hands) with scene point clouds through cross attention and predicts the user's intention area in 3D space end to end.
Result: On the MoGaze and CIRCLE datasets, performance stays consistent over time horizons of up to 1500 ms, exceeds the baselines, and carries over to unseen scenes; a visual question answering (VQA) demonstration based on intention areas illustrates practical usability.
Conclusion: Int3DNet achieves robust, scene-adaptive 3D intention prediction without object detection, providing a reliable basis for proactive human-computer interaction in MR.
Abstract: We propose Int3DNet, a scene-aware network that predicts 3D intention areas directly from scene geometry and head-hand motion cues, enabling robust human intention prediction without explicit object-level perception. In Mixed Reality (MR), intention prediction is critical as it enables the system to anticipate user actions and respond proactively, reducing interaction delays and ensuring seamless user experiences. Our method employs a cross attention fusion of sparse motion cues and scene point clouds, offering a novel approach that directly interprets the user's spatial intention within the scene. We evaluated Int3DNet on MoGaze and CIRCLE datasets, which are public datasets for full-body human-scene interactions, showing consistent performance across time horizons of up to 1500 ms and outperforming the baselines, even in diverse and unseen scenes. Moreover, we demonstrate the usability of the proposed method through a demonstration of efficient visual question answering (VQA) based on intention areas. Int3DNet provides reliable 3D intention areas derived from head-hand motion and scene geometry, thus enabling seamless interaction between humans and MR systems through proactive processing of intention areas.
[124] Bi-CamoDiffusion: A Boundary-informed Diffusion Approach for Camouflaged Object Detection
Patricia L. Suarez,Leo Thomas Ramos,Angel D. Sappa
Main category: cs.CV
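TL;DR: Bi-CamoDiffusion is an improved version of CamoDiffusion that uses a parameter-free edge-prior injection mechanism to sharpen boundaries and improve structural accuracy in camouflaged object detection, outperforming existing methods on several benchmarks.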
Details
Motivation: Camouflaged object detection suffers from blurred boundaries, structural ambiguity, and false positives; the goal is a more precise separation of object and background.
Method: Edge priors are injected into early-stage embeddings through a parameter-free mechanism, and a unified optimization objective jointly balances spatial accuracy, structural constraints, and uncertainty supervision.
Result: On the CAMO, COD10K, and NC4K benchmarks, the model surpasses the baseline and existing state-of-the-art methods on $S_m$, $F_\beta^{w}$, $E_m$, and $MAE$, with clearly better segmentation of thin structures and protrusions and sharper boundaries.
Conclusion: Bi-CamoDiffusion strengthens boundary modeling and overall robustness in camouflaged object detection, offering a better diffusion-based paradigm for the task.
Abstract: Bi-CamoDiffusion is introduced, an evolution of the CamoDiffusion framework for camouflaged object detection. It integrates edge priors into early-stage embeddings via a parameter-free injection process, which enhances boundary sharpness and prevents structural ambiguity. This is governed by a unified optimization objective that balances spatial accuracy, structural constraints, and uncertainty supervision, allowing the model to capture both the object's global context and its intricate boundary transitions. Evaluations across the CAMO, COD10K, and NC4K benchmarks show that Bi-CamoDiffusion surpasses the baseline, delivering sharper delineation of thin structures and protrusions while also minimizing false positives. Also, our model consistently outperforms existing state-of-the-art methods across all evaluated metrics, including $S_m$, $F_\beta^{w}$, $E_m$, and $MAE$, demonstrating a more precise object-background separation and sharper boundary recovery.
[125] Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution
Hua Liu,Yanbin Wei,Fei Xing,Tyler Derr,Haoyu Han,Yu Zhang
Main category: cs.CV
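TL;DR: This paper proposes Graph2Video, a framework that models the temporal neighborhood of a target link in a dynamic graph as a "graph video" and uses the inductive biases of video foundation models to capture both fine-grained local variations and long-range temporal dynamics, improving link prediction.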
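One way to picture a "graph video" is as a stack of small adjacency matrices, one per time slice, built from the interactions around the target link. The sketch below is only an illustration of that framing under assumptions of our own: the one-hop neighborhood rule, the equal-count time slicing, and the function name `graph_video` are hypothetical, and the actual frame construction used by Graph2Video may differ.

```python
def graph_video(events, target, num_frames):
    """Build temporally ordered adjacency-matrix frames around a target link.

    events: list of (src, dst, time) interactions.
    target: (u, v) node pair whose link is to be predicted.
    Returns (nodes, frames), where frames[f][i][j] counts interactions
    between nodes[i] and nodes[j] that fall in time slice f.
    """
    u, v = target
    # Keep only interactions that touch either endpoint, in time order.
    local = sorted((e for e in events if {e[0], e[1]} & {u, v}), key=lambda e: e[2])
    nodes = sorted({n for s, d, _ in local for n in (s, d)} | {u, v})
    index = {n: i for i, n in enumerate(nodes)}
    size = -(-len(local) // num_frames) if local else 1  # ceiling division
    frames = []
    for f in range(num_frames):
        adj = [[0] * len(nodes) for _ in nodes]
        for s, d, _ in local[f * size:(f + 1) * size]:
            adj[index[s]][index[d]] += 1
            adj[index[d]][index[s]] += 1
        frames.append(adj)
    return nodes, frames


# Made-up interaction log; ("c", "d", 3) does not touch the target pair and is dropped.
events = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("a", "c", 4), ("b", "d", 5)]
nodes, frames = graph_video(events, target=("a", "b"), num_frames=2)
for t, frame in enumerate(frames):
    print(t, frame)
```

Stacking the frames gives a tensor of shape (frames, nodes, nodes), which is the kind of ordered, image-like input a video model consumes. Keeping the frames in time order is what preserves the interaction-order information that the entry says earlier models tend to overlook.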
Details
Motivation: Existing dynamic graph models for link prediction struggle to capture subtle variations in temporal interaction order, long-range dependencies, and pair-specific relational dynamics.
Method: Graph2Video treats the temporal neighborhood of a target link as a sequence of ordered "graph frames", stacks them into a "graph video", and uses a video foundation model to extract a link-level embedding. The embedding serves as a lightweight, plug-and-play memory unit that integrates into existing dynamic graph encoders.
Result: Experiments on multiple benchmark datasets show that Graph2Video outperforms state-of-the-art baselines in most cases.
Conclusion: Borrowing spatio-temporal modeling techniques from computer vision, such as video understanding, is an effective and promising direction for dynamic graph learning.
Abstract: Dynamic graphs are common in real-world systems such as social media, recommender systems, and traffic networks. Existing dynamic graph models for link prediction often fall short in capturing the complexity of temporal evolution. They tend to overlook fine-grained variations in temporal interaction order, struggle with dependencies that span long time horizons, and offer limited capability to model pair-specific relational dynamics. To address these challenges, we propose Graph2Video, a video-inspired framework that views the temporal neighborhood of a target link as a sequence of "graph frames". By stacking temporally ordered subgraph frames into a "graph video", Graph2Video leverages the inductive biases of video foundation models to capture both fine-grained local variations and long-range temporal dynamics. It generates a link-level embedding that serves as a lightweight and plug-and-play link-centric memory unit. This embedding integrates seamlessly into existing dynamic graph encoders, effectively addressing the limitations of prior approaches. Extensive experiments on benchmark datasets show that Graph2Video outperforms state-of-the-art baselines on the link prediction task in most cases. The results highlight the potential of borrowing spatio-temporal modeling techniques from computer vision as a promising and effective approach for advancing dynamic graph learning.
[126] BrainCast: A Spatio-Temporal Forecasting Model for Whole-Brain fMRI Time Series Prediction
Yunlong Gao,Jinbo Yang,Li Xiao,Haiye Huo,Yang Ji,Hao Wang,Aiying Zhang,Yu-Ping Wang
Main category: cs.CV
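TL;DR: This paper proposes BrainCast, a framework for forecasting whole-brain fMRI time series that improves data quality and downstream analysis without increasing scan time.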
Details
Motivation: Short clinical fMRI scans yield low data quality and limited statistical power, so there is a need to extend informative fMRI time series without acquiring additional data.
Method: BrainCast formulates fMRI forecasting as a multivariate time series prediction task with three core modules: a Spatial Interaction Awareness module (embeds each ROI time series as a token to model inter-ROI dependencies), a Temporal Feature Refinement module (enhances the low- and high-energy temporal components of the fMRI signal at the ROI level to characterize neural dynamics), and a Spatio-temporal Pattern Alignment module (combines spatial and temporal representations into whole-brain features).
Result: On resting-state and task fMRI data from the HCP, BrainCast outperforms state-of-the-art time series forecasting methods, and the extended fMRI series improve cognitive ability prediction.
Conclusion: BrainCast provides an effective time series extrapolation tool for fMRI studies with restricted scan durations, with clear clinical and neuroscientific value.
Abstract: Functional magnetic resonance imaging (fMRI) enables noninvasive investigation of brain function, while short clinical scan durations, arising from human and non-human factors, usually lead to reduced data quality and limited statistical power for neuroimaging research. In this paper, we propose BrainCast, a novel spatio-temporal forecasting framework specifically tailored for whole-brain fMRI time series forecasting, to extend informative fMRI time series without additional data acquisition. It formulates fMRI time series forecasting as a multivariate time series prediction task and jointly models temporal dynamics within regions of interest (ROIs) and spatial interactions across ROIs. Specifically, BrainCast integrates a Spatial Interaction Awareness module to characterize inter-ROI dependencies via embedding every ROI time series as a token, a Temporal Feature Refinement module to capture intrinsic neural dynamics within each ROI by enhancing both low- and high-energy temporal components of fMRI time series at the ROI level, and a Spatio-temporal Pattern Alignment module to combine spatial and temporal representations for producing informative whole-brain features. Experimental results on resting-state and task fMRI datasets from the Human Connectome Project demonstrate the superiority of BrainCast over state-of-the-art time series forecasting baselines.
Moreover, fMRI time series extended by BrainCast improve downstream cognitive ability prediction, highlighting the clinical and neuroscientific impact brought by whole-brain fMRI time series forecasting in scenarios with restricted scan durations.
[127] IAML: Illumination-Aware Mirror Loss for Progressive Learning in Low-Light Image Enhancement Auto-encoders
Farida Mohsen,Tala Zaim,Ali Al-Zawqari,Ali Safa,Samir Belhaouari
Main category: cs.CV
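TL;DR: This paper proposes a new training method based on a teacher-student auto-encoder and progressive learning, introducing the Illumination-Aware Mirror Loss (IAML) to distill multi-scale clean-image features. It substantially improves low-light image enhancement and reaches state-of-the-art results on several metrics.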
Details
Motivation: Existing low-light image enhancement methods fall short in feature alignment and in modeling illumination variation, and need a more effective feature distillation mechanism.
Method: A teacher-student auto-encoder architecture with progressive multi-scale feature distillation is built, and the Illumination-Aware Mirror Loss (IAML) is designed to align the features of each student decoder layer with clean features from the teacher side.
Result: On three popular low-light datasets, SSIM, PSNR, and LPIPS all reach state-of-the-art levels; ablation studies confirm that IAML clearly improves reconstruction accuracy.
Conclusion: The IAML-driven teacher-student distillation framework effectively improves low-light image enhancement and offers a new approach to learning illumination-robust features.
Abstract: This letter presents a novel training approach and loss function for learning low-light image enhancement auto-encoders. Our approach revolves around the use of a teacher-student auto-encoder setup coupled to a progressive learning approach where multi-scale information from clean image decoder feature maps is distilled into each layer of the student decoder in a mirrored fashion using a newly-proposed loss function termed Illumination-Aware Mirror Loss (IAML). IAML helps align the feature maps within the student decoder network with clean feature maps originating from the teacher side while taking into account the effect of lighting variations within the input images. Extensive benchmarking of our proposed approach on three popular low-light image enhancement datasets demonstrates that our model achieves state-of-the-art performance in terms of average SSIM, PSNR and LPIPS reconstruction accuracy metrics. Finally, ablation studies are performed to clearly demonstrate the effect of IAML on the image reconstruction accuracy.
[128] FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach
Ning Liao,Xiaoxing Wang,Xiaohan Qin,Junchi Yan
Main category: cs.CV
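TL;DR: This paper proposes FineRMoE, an MoE architecture with fine-grained expert design in both the intermediate and output dimensions. With bi-level sparse forward computation, a specialized routing mechanism, and a generalized upcycling training method, it substantially improves parameter efficiency and inference speed across multiple benchmarks.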
Details
Motivation: In existing fine-grained MoE models, performance stops improving once the granularity of the intermediate dimension exceeds the optimal threshold, so single-dimension fine-grained design hits a bottleneck.
Method: FineRMoE extends fine-grained design to both the intermediate and output dimensions, introduces a bi-level sparse forward computation paradigm with a specialized routing mechanism, and uses a generalized upcycling method to build FineRMoE at low cost.
Result: Across ten standard benchmarks it outperforms the strongest baseline, with 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput.
Conclusion: FineRMoE overcomes the single-dimension granularity limit; multi-dimensional fine-grained design combined with an efficient training strategy substantially improves MoE performance and inference efficiency.
Abstract: As revealed by the scaling law of fine-grained MoE, model performance ceases to be improved once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern the activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method to build FineRMoE in a cost-effective manner. Extensive experiments demonstrate the superior performance achieved by FineRMoE across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.
[129] WaveComm: Lightweight Communication for Collaborative Perception via Wavelet Feature Distillation
Erdemt Bao,Jin Yang
Main category: cs.CV
TL;DR: WaveComm是一种基于小波变换的通信框架,通过传输低频特征并重建高频细节,在大幅降低通信开销的同时保持多智能体协同感知性能。
Details
Motivation: In multi-agent collaborative perception systems, information exchange incurs heavy communication overhead that limits scalability and real-time performance; in bandwidth-constrained settings in particular, this degrades performance and reliability.
Method: WaveComm decomposes feature maps with the Discrete Wavelet Transform (DWT) and transmits only the compact low-frequency components. A lightweight generator at the receiver reconstructs the omitted high-frequency details, and a Multi-Scale Distillation (MSD) loss optimizes reconstruction quality at the pixel, structural, semantic, and distributional levels.
Result: On LiDAR-based and camera-based perception tasks on the OPV2V and DAIR-V2X datasets, state-of-the-art performance is maintained when the communication volume is reduced to 86.3% and 87.0% of the original, respectively. Compared with existing methods, WaveComm achieves competitive gains in both communication efficiency and perception accuracy, and ablation studies validate each core component.
Conclusion: WaveComm effectively balances communication efficiency and perception performance, and is practical for bandwidth-constrained multi-agent collaborative perception systems.
Abstract: In multi-agent collaborative sensing systems, substantial communication overhead from information exchange significantly limits scalability and real-time performance, especially in bandwidth-constrained environments. This often results in degraded performance and reduced reliability. To address this challenge, we propose WaveComm, a wavelet-based communication framework that drastically reduces transmission loads while preserving sensing performance in low-bandwidth scenarios. The core innovation of WaveComm lies in decomposing feature maps using Discrete Wavelet Transform (DWT), transmitting only compact low-frequency components to minimize communication overhead. High-frequency details are omitted, and their effects are reconstructed at the receiver side using a lightweight generator. A Multi-Scale Distillation (MSD) Loss is employed to optimize the reconstruction quality across pixel, structural, semantic, and distributional levels. Experiments on the OPV2V and DAIR-V2X datasets for LiDAR-based and camera-based perception tasks demonstrate that WaveComm maintains state-of-the-art performance even when the communication volume is reduced to 86.3% and 87.0% of the original, respectively. Compared to existing approaches, WaveComm achieves competitive improvements in both communication efficiency and perception accuracy. Ablation studies further validate the effectiveness of its key components.
[130] Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
Zhongxing Xu,Zhonghua Wang,Zhe Qian,Dachuan Shi,Feilong Tang,Ming Hu,Shiyan Su,Xiaocheng Zou,Wei Feng,Dwarikanath Mahapatra,Yifan Peng,Mingquan Lin,Zongyuan Ge
Main category: cs.CV
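TL;DR: This paper proposes LEAD, a decoding strategy that mitigates hallucinations in multimodal large reasoning models (MLRMs) by exploiting the semantic information in token probability distributions, using entropy-aware reasoning mode switching and prior-guided visual anchor injection.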
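The entropy-aware switching rule can be sketched as a small function: compute the entropy of the next-token distribution, and feed back either a probability-weighted mix of token embeddings (high entropy) or the embedding of the single most likely token (low entropy). This is a minimal illustration, assuming a fixed entropy threshold; the threshold value, the toy embedding table, and the name `next_input_embedding` are hypothetical, and the visual anchor injection step is not modeled here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def next_input_embedding(probs, embedding_table, threshold):
    """Choose the embedding fed to the next decoding step.

    High entropy: probability-weighted continuous embedding ("latent" mode).
    Low entropy: embedding of the argmax token ("discrete" mode).
    """
    if entropy(probs) > threshold:
        dim = len(embedding_table[0])
        mixed = [sum(p * emb[d] for p, emb in zip(probs, embedding_table))
                 for d in range(dim)]
        return mixed, "latent"
    best = max(range(len(probs)), key=lambda i: probs[i])
    return list(embedding_table[best]), "discrete"


# Toy vocabulary of three tokens with 2-D embeddings (made-up values).
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

confident = [0.98, 0.01, 0.01]   # low entropy: one candidate dominates
uncertain = [0.40, 0.40, 0.20]   # high entropy: several plausible candidates

emb_c, mode_c = next_input_embedding(confident, table, threshold=0.5)
emb_u, mode_u = next_input_embedding(uncertain, table, threshold=0.5)
print(mode_c, emb_c, mode_u, emb_u)
```

The weighted mix keeps several candidate meanings alive at uncertain steps instead of committing to one token, which matches the intuition the entry gives for why transition words in high-entropy states are a natural place to stay in latent mode.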
Details
Motivation: Transition words (e.g., because, however) are observed to be closely associated with hallucinations and often occur in high-entropy states, while existing methods rely on discrete textual inputs and underuse dense contextual cues.
Method: Latent Entropy-Aware Decoding (LEAD) is built around entropy-aware reasoning mode switching: probability-weighted continuous embeddings are used under high entropy, and the model returns to discrete token embeddings as entropy decreases. A prior-guided visual anchor injection mechanism is also introduced.
Result: Across multiple MLRMs and benchmarks, LEAD effectively mitigates hallucinations and improves the reliability of visual question answering.
Conclusion: Building rich semantic representations from token probability distributions and switching reasoning modes dynamically strengthens in-context reasoning and reduces hallucinations triggered by transition words.
Abstract: Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information.
Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
[131] Multimodal Deep Learning for Dynamic and Static Neuroimaging: Integrating MRI and fMRI for Alzheimer Disease Analysis
Anima Kujur,Zahra Monfared
Main category: cs.CV
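TL;DR: This paper proposes a multimodal deep learning framework that fuses MRI and fMRI for multi-class classification of Alzheimer's disease (AD), mild cognitive impairment (MCI), and normal cognition. A 3D CNN extracts structural features from MRI, an LSTM models temporal features from fMRI, and feature fusion enables joint spatial-temporal learning. On a small paired dataset (29 subjects), data augmentation helps the multimodal model but not a large single-modality MRI dataset, underscoring that augmentation strategies should be designed around dataset size and modality.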
Details
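Motivation: Accurate early classification of Alzheimer's disease needs both structural and functional brain imaging, but most existing methods are single-modality and generalize poorly with small samples, calling for augmentation and fusion strategies suited to multimodal, small-sample settings.
Method: A multimodal deep learning framework is built: the MRI branch uses a 3D CNN to extract spatial structural features, and the fMRI branch uses a recurrent network (LSTM) to model temporal dynamics. The two feature streams are fused and trained jointly. The framework is evaluated on 29 paired MRI-fMRI subjects with and without data augmentation, and compared against results on a large single-modality MRI dataset.
Result: Data augmentation substantially improves the classification stability and generalization of the multimodal 3DCNN-LSTM model, but brings no improvement on the large single-modality MRI dataset, confirming that the benefit of augmentation depends heavily on dataset size and modality combination.
Conclusion: A multimodal fusion framework with targeted data augmentation can improve small-sample neuroimaging-based AD classification; augmentation strategies should be tailored to dataset size and modality rather than carried over directly from single-modality experience.
Abstract: Magnetic Resonance Imaging (MRI) provides detailed structural information, while functional MRI (fMRI) captures temporal brain activity. In this work, we present a multimodal deep learning framework that integrates MRI and fMRI for multi-class classification of Alzheimer Disease (AD), Mild Cognitive Impairment, and Normal Cognitive State. Structural features are extracted from MRI using 3D convolutional neural networks, while temporal features are learned from fMRI sequences using recurrent architectures. These representations are fused to enable joint spatial-temporal learning. Experiments were conducted on a small paired MRI-fMRI dataset (29 subjects), both with and without data augmentation. Results show that data augmentation substantially improves classification stability and generalization, particularly for the multimodal 3DCNN-LSTM model. In contrast, augmentation was found to be ineffective for a large-scale single-modality MRI dataset. These findings highlight the importance of dataset size and modality when designing augmentation strategies for neuroimaging-based AD classification.
[132] Real-Time Monocular Scene Analysis for UAV in Outdoor Environments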
Yara AlaaEldin
Main category: cs.CV
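TL;DR: This thesis proposes Co-SemDepth, a joint deep-learning architecture that predicts depth and semantic maps from a monocular camera on UAVs in low-altitude unstructured environments. To ease the shortage of real annotated data, it builds a new synthetic dataset, TopAir, systematically analyzes synthetic-to-real generalization, compares Cycle-GAN and diffusion models for style transfer, and finally extends the method to marine scenes (the MidSea dataset) to test its generalization.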
Details
Motivation: UAVs in low-altitude unstructured environments lack large-scale, high-quality real annotated data, which limits depth estimation and semantic segmentation. Models trained on synthetic data also suffer from domain shift when moved to the real domain, so the factors that affect generalization need systematic analysis and cross-domain adaptation needs to improve.
Method: The joint deep-learning model Co-SemDepth is proposed; the synthetic datasets TopAir and MidSea are built; synthetic-to-real generalization experiments compare Co-SemDepth with TaskPrompter; Cycle-GAN and diffusion models are used for image style transfer to narrow the domain gap.
Result: Co-SemDepth generalizes better from synthetic to real data in depth estimation, while TaskPrompter generalizes better in semantic segmentation. Diffusion models outperform Cycle-GAN for synthetic-to-real style transfer. In marine scenes, Co-SemDepth performs well on real data from SMD but needs further improvement on the MIT dataset.
Conclusion: Joint modeling and purpose-built synthetic data can improve the generalization of UAV visual perception models; diffusion models are the better tool for cross-domain style transfer; domain-specific data and fine-tuning (e.g., for marine scenes) remain key to robustness in real settings.
Abstract: In this thesis, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture, named Co-SemDepth, that can perform the two tasks accurately and rapidly, and validate its effectiveness on a variety of datasets. The training of neural networks requires an abundance of annotated data, and in the UAV field, the availability of such data is limited. We introduce a new synthetic dataset in this thesis, TopAir, that contains images captured with a nadir view in outdoor environments at different altitudes, helping to fill the gap. While using synthetic data for the training is convenient, it raises issues when shifting to the real domain for testing. We conduct an extensive analytical study to assess the effect of several factors on the synthetic-to-real generalization. Co-SemDepth and TaskPrompter models are used for comparison in this study. The results reveal a superior generalization performance for Co-SemDepth in depth estimation and for TaskPrompter in semantic segmentation. Also, our analysis allows us to determine which training datasets lead to a better generalization. Moreover, to help attenuate the gap between the synthetic and real domains, image style transfer techniques are explored on aerial images to convert from the synthetic to the realistic style. Cycle-GAN and Diffusion models are employed.
The results reveal that diffusion models are better in the synthetic-to-real style transfer. In the end, we focus on the marine domain and address its challenges. Co-SemDepth is trained on a collected synthetic marine dataset, called MidSea, and tested on both synthetic and real data. The results reveal good generalization performance of Co-SemDepth when tested on real data from the SMD dataset while further enhancement is needed on the MIT dataset.
[133] Disentangling Prompt Dependence to Evaluate Segmentation Reliability in Gynecological MRI
Elodie Germani,Krystel Nyangoh-Timoh,Pierre Jannin,John S H Baxter
Main category: cs.CV
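TL;DR: This paper proposes a new framework for measuring prompt dependence that disentangles prompt ambiguity (inter-user variability) from local sensitivity (interaction imprecision), in order to assess the robustness of promptable segmentation models (such as SAM) in safety-critical settings. Its usefulness and interpretability are validated on female pelvic MRI data.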
Details
Motivation: Promptable segmentation models generalize zero-shot, but their robustness to variation in user prompts (prompt dependence) has not been studied systematically; safety-critical settings with large inter-user variability need an interpretable evaluation framework.
Method: The paper introduces the first formulation of prompt dependence that explicitly disentangles prompt ambiguity (differences between users' prompts) from local sensitivity (small perturbations of the prompt interaction), and analyzes it on two female pelvic MRI datasets (uterus and bladder segmentation).
Result: Both prompt dependence metrics show a strong negative correlation with segmentation performance; their mutual correlation is low, which supports the disentangled design; and they are meaningful indicators of prompt-related failure modes.
Conclusion: The framework provides an interpretable, quantitative tool for assessing the robustness of promptable segmentation models, helping improve their reliability and trustworthiness in clinical and other safety-critical settings.
Abstract: Promptable segmentation models (e.g., the Segment Anything Models) enable generalizable, zero-shot segmentation across diverse domains. Although predictions are deterministic for a fixed image-prompt pair, the robustness of these models to variations in user prompts, referred to as prompt dependence, remains underexplored. In safety-critical workflows with substantial inter-user variability, interpretable and informative frameworks are needed to evaluate prompt dependence. In this work, we assess the reliability of promptable segmentation by analyzing and measuring its sensitivity to prompt variability. We introduce the first formulation of prompt dependence that explicitly disentangles prompt ambiguity (inter-user variability) from local sensitivity (interaction imprecision), offering an interpretable view of segmentation robustness. Experiments on two female pelvic MRI datasets for uterus and bladder segmentation reveal a strong negative correlation between both metrics and segmentation performance, highlighting the value of our framework for assessing robustness. The two metrics have low mutual correlation, supporting the disentangled design of our formulation, and provide meaningful indicators of prompt-related failure modes.
[134] GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
Jiajin Liu,Dongzhe Fan,Chuanhao Ji,Daochen Zha,Qiaoyu Tan
Main category: cs.CV
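TL;DR: This paper presents the GraphVLM benchmark, which systematically evaluates vision-language models (VLMs) for multimodal graph learning (MMGL), explores three paradigms in which the VLM acts as encoder, aligner, or predictor, and finds that VLM-as-Predictor performs best.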
Details
Motivation: The ability of existing VLMs to reason over structured data (such as relational graphs) has not been fully explored, yet multimodal information in real-world applications (social networks, recommendation systems, scientific discovery) is inherently structured, so this gap needs to be filled.
Method: Build GraphVLM, a systematic benchmark covering six datasets from multiple domains, and propose three paradigms for integrating VLMs with graph reasoning: VLM-as-Encoder (multimodal feature fusion to enhance GNNs), VLM-as-Aligner (cross-modal alignment to support LLM-based structured reasoning), and VLM-as-Predictor (using the VLM directly as the backbone for multimodal graph learning).
Result: Experiments show that VLMs improve MMGL performance in all three roles, with VLM-as-Predictor yielding the most substantial and consistent gains.
Conclusion: VLMs have great potential as a new foundation for multimodal graph learning, and the VLM-as-Predictor paradigm is the most promising.
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision-language models as a new foundation for multimodal graph learning. The benchmark code is publicly available at https://github.com/oamyjin/GraphVLM.
[135] Agentic LLM Workflow for MR Spectroscopy Volume-of-Interest Placements in Brain Tumors
Sangyoon Lee,Francesca Branzoli,Małgorzata Marjańska,Patrick Bolan
Main category: cs.CV
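TL;DR: This paper proposes an agentic large language model (LLM) workflow for optimizing volume-of-interest (VOI) placement in magnetic resonance spectroscopy (MRS): multi-objective vision transformers generate diverse candidate VOIs, and an LLM selects the best one according to clinical preferences, improving tumor coverage and necrosis avoidance and reducing inter-operator variability.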
Details
Motivation: VOI placement in conventional MRS suffers from high inter-operator variability, especially for heterogeneous tumors; a single deterministic model struggles to accommodate differing clinical preferences and individual anatomy.
Method: Build an agentic LLM workflow: several vision transformer models trained with different objective preferences first generate diverse candidate VOIs, and the LLM then makes a preference-driven selection based on quantitative metrics (such as tumor coverage and necrosis avoidance).
Result: On 110 clinical brain tumor cases, the workflow improves solid tumor coverage and necrosis avoidance under specific user preferences compared with general-purpose expert placements.
Conclusion: The method offers a strategy for personalized VOI placement that adapts flexibly to different clinical objectives without retraining models, improving the clinical applicability and robustness of MRS.
Abstract: Magnetic resonance spectroscopy (MRS) provides clinically valuable metabolic characterization of brain tumors, but its utility depends on accurate placement of the spectroscopy volume-of-interest (VOI). However, VOI placement typically has a broad operating window: for a given tumor there are multiple possible VOIs that would lead to high-quality MRS measurements. Thus, a VOI placement can be tuned for clinician preference, case-specific anatomy, and clinical priorities, which leads to high inter-operator variability, especially for heterogeneous tumors. We propose an agentic large language model (LLM) workflow that decomposes VOI placement into generation of diverse candidate VOIs, from which the LLM selects an optimal one based on quantitative metrics. Candidate VOIs are generated by vision transformer-based placement models trained with different objective function preferences, which allows selection from acceptable alternatives rather than a single deterministic placement. On 110 clinical brain tumor cases, the agentic workflow achieves improved solid tumor coverage and necrosis avoidance depending on the user preferences compared to the general-purpose expert placements. Overall, the proposed workflow provides a strategy to adapt VOI placement to different clinical objectives without retraining task-specific models.
[136] Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection
Ali Zia,Usman Ali,Muhammad Umer Ramzan,Hamza Abid,Abdul Rehman,Wei Xiang
Main category: cs.CV
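TL;DR: This paper proposes MM-VAD, a training-free video anomaly detection framework that maps scene descriptions into hyperbolic space and combines test-time adaptive prompt optimization with a covariance-aware Mahalanobis refinement to improve the stability and interpretability of anomaly detection.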
Details
Motivation: Existing training-free video anomaly detection methods rely on static prompts and geometry-agnostic feature fusion, which reduces inference to shallow similarity matching over Euclidean embeddings and yields unstable, poorly interpretable predictions, particularly in complex or hierarchical scenes.
Method: MM-VAD projects caption-derived scene representations into hyperbolic space to preserve hierarchical structure and uses a frozen large language model for adaptive question-answering anomaly assessment; at test time it optimizes a lightweight learnable prompt with an unsupervised confidence-sparsity objective; and it adds a covariance-aware Mahalanobis refinement to stabilize cross-modal alignment.
Result: On the four benchmarks XD-Violence, UCF-Crime, ShanghaiTech, and UCSD Ped2, it reaches AUCs of 90.03%, 83.24%, 96.95%, and 98.81%, respectively, significantly outperforming prior training-free methods.
Conclusion: Geometry-aware representation and adaptive semantic calibration provide a principled and effective new paradigm for training-free video anomaly detection that outperforms conventional static Euclidean matching.
Abstract: Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training-free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively.
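As a rough illustration of what a covariance-aware Mahalanobis score computes, the sketch below measures how far a feature vector lies from a reference distribution while accounting for correlations between features. It is not MM-VAD's implementation: the function names, the shrinkage term, and the pure-Python linear solver are assumptions made for this example, and the abstract does not say which features or statistics the refinement actually uses.

```python
import math


def mean_and_covariance(samples, shrinkage=1e-3):
    """Return the sample mean and a ridge-regularized covariance matrix.

    `shrinkage` (an assumption of this sketch) is added to the diagonal so the
    matrix stays invertible when few reference samples are available.
    Requires at least two samples.
    """
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for s in samples:
        diff = [s[j] - mean[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    for i in range(d):
        cov[i][i] += shrinkage
    return mean, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(x, mean, cov):
    """Mahalanobis distance sqrt((x - mean)^T cov^{-1} (x - mean))."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    weighted = solve(cov, diff)  # cov^{-1} @ diff without forming the inverse
    quad = sum(d * w for d, w in zip(diff, weighted))
    return math.sqrt(max(quad, 0.0))
```

With an identity covariance the score reduces to the Euclidean distance; with correlated or unequally scaled features, directions of high variance are down-weighted, which is the general reason a covariance-aware score can be steadier than plain Euclidean matching.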
Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.
[137] InfiniteDance: Scalable 3D Dance Generation Towards in-the-wild Generalization
Ronghui Li,Zhongyuan Hu,Li Siyao,Youliang Zhang,Haozhe Xie,Mingyuan Zhang,Jie Guo,Xiu Li,Ziwei Liu
Main category: cs.CV
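TL;DR: This paper proposes a scalable approach to 3D dance generation: it builds a large, high-quality dance dataset (100.69 hours) and designs the ChoreoLLaMA model (combining RAG with a slow/fast-cadence MoE module), markedly improving generalization to unseen music and physical plausibility.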
Details
Motivation: Existing 3D dance generation methods generalize poorly in the wild; given unseen music they often produce unstructured or physically implausible motion, mainly because paired music-to-dance data is limited and model capacity is insufficient.
Method: (1) Build a fully automated pipeline that reconstructs 3D dance from monocular video, introducing a Foot Restoration Diffusion Model (FRDM) to ensure physical plausibility and motion expressiveness; (2) propose ChoreoLLaMA, a LLaMA-based model that integrates a retrieval-augmented generation (RAG) module and a slow/fast-cadence Mixture-of-Experts (MoE) module to improve robustness to unfamiliar music and rhythmic adaptability.
Result: In experiments across multiple dance genres, the method surpasses existing methods in both qualitative and quantitative evaluations; a 100.69-hour, high-quality multimodal 3D dance dataset is built; code, models, and data will be released.
Conclusion: By scaling both data and model, this work advances scalable 3D dance generation for real-world settings and offers a new paradigm for music-driven dance generation.
Abstract: Although existing 3D dance generation methods perform well in controlled scenarios, they often struggle to generalize in the wild. When conditioned on unseen music, existing methods often produce unstructured or physically implausible dance, largely due to limited music-to-dance data and restricted model capacity. This work aims to push the frontier of generalizable 3D dance generation by scaling up both data and model design. (1) On the data side, we develop a fully automated pipeline that reconstructs high-fidelity 3D dance motions from monocular videos. To eliminate the physical artifacts prevalent in existing reconstruction methods, we introduce a Foot Restoration Diffusion Model (FRDM) guided by foot-contact and geometric constraints that enforce physical plausibility while preserving kinematic smoothness and expressiveness, resulting in a diverse, high-quality multimodal 3D dance dataset totaling 100.69 hours. (2) On model design, we propose Choreographic LLaMA (ChoreoLLaMA), a scalable LLaMA-based architecture. To enhance robustness under unfamiliar music conditions, we integrate a retrieval-augmented generation (RAG) module that injects reference dance as a prompt. Additionally, we design a slow/fast-cadence Mixture-of-Experts (MoE) module that enables ChoreoLLaMA to smoothly adapt motion rhythms across varying music tempos. Extensive experiments across diverse dance genres show that our approach surpasses existing methods in both qualitative and quantitative evaluations, marking a step toward scalable, real-world 3D dance generation.
Code, models, and data will be released.
[138] A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans
Maximo Rodriguez-Herrero,Dante D. Sanchez-Gallegos,Marco Antonio Núñez-Gaona,Heriberto Aguirre-Meneses,Luis Alberto Villalvazo Gutiérrez,Mario Ibrahin Gutiérrez Velasco,J. L. Gonzalez-Compean,Jesus Carretero
Main category: cs.CV
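TL;DR: This paper proposes a CNN-based machine learning and visualization framework for automated diagnosis of osteosarcoma; through a pipeline of CT scan preprocessing, detection, postprocessing, and visualization, it achieves 94.8% AUC and 94.6% specificity on data from 12 patients.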
Details
Motivation: Osteosarcoma is a common primary bone cancer, and early detection is crucial for preventing bone metastasis, so fast and accurate automated diagnosis is needed to assist physicians during prognosis.
Method: Build an end-to-end diagnostic framework comprising CT scan preprocessing (data augmentation, region-of-interest identification), CNN-based classification, and postprocessing (3D bone model visualization that highlights the lesion).
Result: Evaluated on data from 12 patients, the framework reaches an AUC of 94.8% and a specificity of 94.6%.
Conclusion: The framework shows high accuracy and practicality for automated osteosarcoma diagnosis and has potential as a clinical aid.
Abstract: Osteosarcoma is the most common primary bone cancer, mainly affecting the youngest and oldest populations. Its detection at early stages is crucial to reduce the probability of developing bone metastasis. In this context, accurate and fast diagnosis is essential to help physicians during the prognosis process. The research goal is to automate the diagnosis of osteosarcoma through a pipeline that includes the preprocessing, detection, postprocessing, and visualization of computed tomography (CT) scans. Thus, this paper presents a machine learning and visualization framework for classifying CT scans using different convolutional neural network (CNN) models. Preprocessing includes data augmentation and identification of the region of interest in scans. Postprocessing includes data visualization to render a 3D bone model that highlights the affected area. An evaluation on 12 patients revealed the effectiveness of our framework, obtaining an area under the curve (AUC) of 94.8% and a specificity of 94.6%.
[139] Deep Learning for BioImaging: What Are We Learning?
Ivan Svatko,Maxime Sanchez,Ihab Bendidi,Gilles Cottrell,Auguste Genovesio
Main category: cs.CV
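TL;DR: This paper systematically studies representation learning for microscopy images and finds that current state-of-the-art methods perform comparably to simple baselines (such as untrained models and structural representations of cells and tissue) on cell culture and tissue imaging data, and fail to consistently learn biologically meaningful high-level features; commonly used evaluation metrics do not adequately reflect representation quality, so more diagnostic benchmarks are needed to advance the field.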
Details
Motivation: It remains unclear what representation learning actually learns in microscopy imaging, so a systematic evaluation is needed of what existing methods really learn at key biological scales (cell culture and tissue imaging).
Method: Construct a set of simple yet revealing baselines (including untrained models and simple representations based on cell and tissue structure), compare them systematically on curated microscopy image benchmarks, and analyze the validity of commonly used evaluation metrics.
Result: State-of-the-art methods perform comparably to the simple baselines; they fail to consistently acquire biologically meaningful high-level semantic features; common benchmark metrics often mask model deficiencies; and detailed comparisons help pinpoint model strengths and weaknesses.
Conclusion: Progress in microscopy image representation learning depends not only on stronger models but also on diagnostic benchmarks that truly measure what is learned.
Abstract: Representation learning has driven major advances in natural image analysis by enabling models to acquire high-level semantic features. In microscopy imaging, however, it remains unclear what current representation learning methods actually learn. In this work, we conduct a systematic study of representation learning for the two most widely used and broadly available microscopy data types, representing critical scales in biology: cell culture and tissue imaging. To this end, we introduce a set of simple yet revealing baselines on curated benchmarks, including untrained models and simple structural representations of cellular tissue. Our results show that, surprisingly, state-of-the-art methods perform comparably to these baselines. We further show that, in contrast to natural images, existing models fail to consistently acquire high-level, biologically meaningful features. Moreover, we demonstrate that commonly used benchmark metrics are insufficient to assess representation quality and often mask this limitation. In addition, we investigate how detailed comparisons with these benchmarks provide ways to interpret the strengths and weaknesses of models for further improvements. Together, our results suggest that progress in microscopy image representation learning requires not only stronger models, but also more diagnostic benchmarks that measure what is actually learned.
[140] DINOv3 with Test-Time Calibration for Automated Carotid Intima-Media Thickness Measurement on CUBS v1
Zhenpeng Zhang,Jinwei Lu,Yurui Dong,Bo Yuan
Main category: cs.CV
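TL;DR: This paper proposes a framework based on the DINOv3 vision foundation model for carotid intima-media complex (IMC) segmentation and carotid intima-media thickness (CIMT) measurement; on the CUBS v1 dataset it achieves clinically relevant accuracy (~0.1 mm), demonstrating the feasibility of foundation models for interpretable, calibration-aware quantification of ultrasound biomarkers.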
Details
Motivation: Robust, transferable deep models that jointly address segmentation and CIMT measurement are lacking, particularly in the era of vision foundation models; inspired by recent progress in adapting DINOv3 to medical segmentation and test-time optimization, the paper explores its potential for automated CIMT quantification.
Method: Build an end-to-end DINOv3-based pipeline: first predict the IMC region at a fixed resolution; then extract the upper and lower boundaries column by column; correct the scaling error using the per-image calibration factor provided by CUBS; report CIMT in physical units (micrometers); and calibrate the threshold at test time to optimize the measurement.
Result: Across three patient-level test splits, mean Dice is 0.7739 ± 0.0037 and IoU is 0.6384 ± 0.0044; the mean absolute CIMT error is 181.16 ± 11.57 μm and the mean Pearson correlation is 0.480 ± 0.259; on a held-out validation subset (n=28), test-time threshold calibration reduces the mean absolute error from 141.0 μm to 101.1 μm and shrinks the systematic bias toward zero.
Conclusion: The DINOv3-based framework can measure CIMT within a clinically acceptable error range (~0.1 mm), showing that vision foundation models are promising for quantitative medical ultrasound analysis that requires calibration awareness and interpretability.
Abstract: Carotid intima-media thickness (CIMT) measured from B-mode ultrasound is an established vascular biomarker for atherosclerosis and cardiovascular risk stratification. Although a wide range of computerized methods have been proposed for carotid boundary delineation and CIMT estimation, robust and transferable deep models that jointly address segmentation and measurement remain underexplored, particularly in the era of vision foundation models. Motivated by recent advances in adapting DINOv3 to medical segmentation and exploiting DINOv3 in test-time optimization pipelines, we investigate a DINOv3-based framework for carotid intima-media complex segmentation and subsequent CIMT measurement on the Carotid Ultrasound Boundary Study (CUBS) v1 dataset. Our pipeline predicts the intima-media band at a fixed image resolution, extracts upper and lower boundaries column-wise, corrects for image resizing using the per-image calibration factor provided by CUBS, and reports CIMT in physical units. Across three patient-level test splits, our method achieved a mean test Dice of 0.7739 ± 0.0037 and IoU of 0.6384 ± 0.0044. The mean CIMT absolute error was 181.16 ± 11.57 μm, with a mean Pearson correlation of 0.480 ± 0.259. In a held-out validation subset (n=28), test-time threshold calibration reduced the mean absolute CIMT error from 141.0 μm at the default threshold to 101.1 μm at the measurement-optimized threshold, while simultaneously reducing systematic bias toward zero.
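The measurement step described above (column-wise boundary extraction followed by conversion to physical units) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the helper names, the inclusive pixel-counting convention for thickness, and the assumption that the calibration factor is expressed in millimeters per original-image pixel are all hypothetical.

```python
def column_thicknesses(mask):
    """Band thickness in pixels for every column that contains the band.

    `mask` is a binary segmentation given as a list of rows (0 or 1 values).
    Thickness is counted from the first to the last foreground row, inclusive.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    thicknesses = []
    for col in range(n_cols):
        rows = [r for r in range(n_rows) if mask[r][col]]
        if rows:  # skip columns where nothing was segmented
            upper, lower = rows[0], rows[-1]
            thicknesses.append(lower - upper + 1)
    return thicknesses


def cimt_micrometers(mask, mm_per_original_pixel, original_height):
    """Mean band thickness converted from resized-image pixels to micrometers.

    `mm_per_original_pixel` stands in for a per-image calibration factor
    (assumed here to be millimeters per pixel of the original image), and
    `original_height` undoes the vertical resizing applied before inference.
    """
    thicknesses = column_thicknesses(mask)
    if not thicknesses:
        raise ValueError("mask contains no foreground pixels")
    mean_px = sum(thicknesses) / len(thicknesses)
    resize_scale = original_height / len(mask)  # original pixels per resized pixel
    return mean_px * resize_scale * mm_per_original_pixel * 1000.0
```

In this toy setup, a 4-row mask resized from an 8-row image with a factor of 0.06 mm per pixel turns a mean thickness of 2.5 resized pixels into about 300 μm. Because the thickness is a vertical pixel count, only the vertical resize ratio enters the conversion.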
Relative to the error ranges reported in the original CUBS benchmark for classical computerized methods, these results place a DINOv3-based approach within the clinically relevant ~0.1 mm measurement regime. Together, our findings support the feasibility of using vision foundation models for interpretable, calibration-aware CIMT measurement.
[141] Taming Vision Priors for Data Efficient mmWave Channel Modeling
Zhenlin An,Longfei Shangguan,John Kaewell,Philip Pietraski,Jelena Senic,Camillo Gentile,Nada Golmie,Kyle Jamieson
Main category: cs.CV
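TL;DR: VisRFTwin is a digital-twin framework that combines vision-derived semantic priors with differentiable ray tracing, substantially reducing the need for millimeter-wave channel measurements and improving propagation modeling accuracy.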
Details
Motivation: Millimeter-wave propagation modeling is essential for AR and autonomous systems, but existing differentiable ray tracing methods depend on extensive channel measurements or hand-tuned scene material models, which makes deployment difficult.
Method: A frozen vision-language model extracts semantic embeddings from multi-view images, which are converted into initial values of surface permittivity and conductivity; these initialize a Sionna-based differentiable ray tracer, which calibrates the material parameters by gradient descent using only a few dozen sparse channel soundings; after calibration, the mapping between vision features and material parameters is retained, enabling fast transfer to new scenes.
Result: Across three real-world scenarios (offices, urban canyons, and dynamic public spaces), VisRFTwin cuts channel measurement needs by up to 10× and achieves a 59% lower median delay spread error than purely data-driven deep learning methods.
Conclusion: VisRFTwin delivers accurate mmWave propagation modeling with low measurement overhead and strong generalization, providing a deployable, physics-guided modeling solution for real-time AR and autonomous systems.
Abstract: Accurately modeling millimeter-wave (mmWave) propagation is essential for real-time AR and autonomous systems. Differentiable ray tracing offers a physics-grounded solution but still faces deployment challenges due to its over-reliance on exhaustive channel measurements or brittle, hand-tuned scene models for material properties. We present VisRFTwin, a scalable and data-efficient digital-twin framework that integrates vision-derived material priors with differentiable ray tracing. Multi-view images from commodity cameras are processed by a frozen Vision-Language Model to extract dense semantic embeddings, which are translated into initial estimates of permittivity and conductivity for scene surfaces. These priors initialize a Sionna-based differentiable ray tracer, which rapidly calibrates material parameters via gradient descent with only a few dozen sparse channel soundings. Once calibrated, the association between vision features and material parameters is retained, enabling fast transfer to new scenarios without repeated calibration. Evaluations across three real-world scenarios, including office interiors, urban canyons, and dynamic public spaces, show that VisRFTwin reduces channel measurement needs by up to 10× while achieving a 59% lower median delay spread error than pure data-driven deep learning methods.
[142] VisualLeakBench: Auditing the Fragility of Large Vision-Language Models against PII Leakage and Social Engineering
Youting Wang,Yuan Tang,Yitian Qian,Chen Zhao
Main category: cs.CV
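TL;DR: This paper presents VisualLeakBench, an evaluation suite for testing the robustness of large vision-language models (LVLMs) against OCR injection and contextual leakage of personally identifiable information (PII); experiments show that models differ markedly across the two attack types and that the effect of a defensive system prompt depends heavily on the type of image template.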
Details
Motivation: Existing LVLM safety evaluations focus mostly on explicit harmful content and overlook the subtler, privacy-critical semantic visual attacks seen in deployment (such as OCR injection and contextual PII leakage); a robustness benchmark for realistic scenarios is missing.
Method: Build VisualLeakBench, comprising 1,000 synthetic adversarial images (covering 8 PII types) and 50 real-world screenshots; quantify OCR ASR and PII ASR for four frontier models (GPT-5.2, Claude 4, Gemini-3 Flash, and Grok-4); and test the effectiveness of a defensive system prompt.
Result: Claude 4 has the lowest OCR ASR (14.2%) but the highest PII ASR (74.4%), showing a comply-then-warn pattern; Grok-4 has the lowest PII ASR (20.4%); the defensive prompt cuts Claude 4's PII leakage from 74.4% to 2.2%, has no effect on Gemini-3 Flash on synthetic data, but eliminates its leakage on real screenshots (50% → 0%).
Conclusion: LVLMs show serious robustness deficiencies under privacy-sensitive visual attacks; the effect of defenses is template-sensitive, which underscores the need for multi-dimensional safety evaluation on both synthetic and real data; the authors release their data and code to support deployment-level VLM safety evaluation.
Abstract: As Large Vision-Language Models (LVLMs) are increasingly deployed in agent-integrated workflows and other deployment-relevant settings, their robustness against semantic visual attacks remains under-evaluated: alignment is typically tested on explicit harmful content rather than privacy-critical multimodal scenarios. We introduce VisualLeakBench, an evaluation suite to audit LVLMs against OCR Injection and Contextual PII Leakage using 1,000 synthetically generated adversarial images with 8 PII types, validated on 50 in-the-wild (IRL) real-world screenshots spanning diverse visual contexts. We evaluate four frontier systems (GPT-5.2, Claude 4, Gemini-3 Flash, Grok-4) with Wilson 95% confidence intervals. Claude 4 achieves the lowest OCR ASR (14.2%) but the highest PII ASR (74.4%), exhibiting a comply-then-warn pattern, where verbatim data disclosure precedes any safety-oriented language. Grok-4 achieves the lowest PII ASR (20.4%). A defensive system prompt eliminates PII leakage for two models, reduces Claude 4's leakage from 74.4% to 2.2%, but has no effect on Gemini-3 Flash on synthetic data. Strikingly, IRL validation reveals Gemini-3 Flash does respond to mitigation on real-world images (50% to 0%), indicating that mitigation robustness is template-sensitive rather than uniformly absent.
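The Wilson 95% confidence intervals mentioned above can be computed with the standard Wilson score formula. The sketch below is a generic implementation, not the benchmark's released code; the function name is an assumption, and the counts in the usage note are hypothetical rather than taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    `z` is the standard normal quantile; 1.96 corresponds to roughly 95%
    coverage. Returns (lower, upper), clamped to the range [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, with hypothetical counts, `wilson_interval(25, 50)` gives an interval centered on 0.5, while `wilson_interval(0, 50)` gives a lower bound of essentially 0 and an upper bound of roughly 0.07. An observed rate of 0% on a sample of 50 therefore still leaves room for a true rate of several percent, which is why interval estimates matter for small validation sets.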
We release our dataset and code for reproducible robustness and safety evaluation of deployment-relevant vision-language systems.
[143] Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers
Yuntao Shou,Xiangyong Cao,Qian Zhao,Deyu Meng
Main category: cs.CV
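TL;DR: This paper proposes IC-DiT, a controllable pathology image synthesis method that builds finely annotated data with a multi-agent LVLM framework and fuses spatial layouts, textual descriptions, and visual embeddings in a diffusion transformer, achieving high-fidelity, strongly spatially controllable, and diagnostically consistent generation.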
Details
Motivation: Existing text-guided diffusion models lack fine-grained control over spatial layout, tissue morphology, and semantic detail in pathology images, and there is no large-scale dataset pairing patch-level spatial layouts with detailed diagnostic descriptions.
Method: Build a scalable multi-agent LVLM annotation framework to efficiently generate clinically aligned, fine-grained supervision; on this data, propose the In-Context Diffusion Transformer (IC-DiT), which fuses spatial layouts, textual descriptions, and visual embeddings using hierarchical multimodal attention.
Result: Experiments on five histopathology datasets show that IC-DiT outperforms existing methods in image fidelity, spatial controllability, and diagnostic consistency; the generated images also improve downstream tasks such as cancer classification and survival analysis.
Conclusion: IC-DiT offers a new paradigm for controllable pathology image synthesis; its data construction framework and model design together move medical image generation toward clinical usability.
Abstract: Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods.
In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.
[144] Cylindrical Mechanical Projector for Omnidirectional Fringe Projection Profilometry
Mincheol Choi,Gaeun Kim,Jae-Sang Hyun
Main category: cs.CV
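TL;DR: This paper proposes a new omnidirectional 3D reconstruction method based on a cylindrical mechanical projector: a rotational stage and a cylindrical pattern generator with ON/OFF slots project multi-frequency phase-shifted fringes in all directions, and, combined with a multi-wavelength unwrapping algorithm and a quasi-calibration technique, the system achieves high-accuracy reconstruction with a single camera, with an expanded depth uncertainty of 0.215 mm.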
Details
Motivation: Existing digital fringe projection has inherent limitations, such as unidirectional projection and a restricted light spectrum, and cannot meet the growing demand for omnidirectional 360-degree 3D reconstruction (e.g., for the metaverse and 3D telecommunication).
Method: Design a cylindrical mechanical projection system with a rotational stage and a cylindrical pattern generator with ON/OFF slots at two distinct intervals, enabling omnidirectional projection of multi-frequency phase-shifted fringes; combine a multi-wavelength unwrapping algorithm with a quasi-calibration technique to perform 3D reconstruction with a single camera.
Result: Experiments evaluate the system's repeatability, reproducibility, and measurement uncertainty; the expanded uncertainty of the reconstructed depth is 0.215 mm, confirming high accuracy and practical feasibility.
Conclusion: The method overcomes the unidirectionality and spectral limits of conventional fringe projection and achieves high-accuracy, omnidirectional, single-camera 3D reconstruction suitable for industrial and emerging applications.
Abstract: The demand for 360-degree 3D reconstruction has significantly increased in recent years across various domains such as the metaverse and 3D telecommunication. Accordingly, the importance of precise and wide-area 3D sensing technology has become emphasized. While the digital fringe projection method has been widely used due to its high accuracy and implementation flexibility, it suffers from fundamental limitations such as unidirectional projection and a restricted available light spectrum. To address these issues, this paper proposes a novel 3D reconstruction method based on a cylindrical mechanical projector. The proposed method consists of a rotational stage and a cylindrical pattern generator with ON/OFF slots at two distinct intervals, enabling omnidirectional projection of multi-frequency phase-shifted fringe patterns. By applying a multi-wavelength unwrapping algorithm and a quasi-calibration technique, the system achieves high-accuracy 3D reconstruction using only a single camera. Experimental results, supported by repeatability and reproducibility analyses together with a measurement uncertainty evaluation, confirm reliable measurement performance and practical feasibility for omnidirectional 3D reconstruction. The expanded uncertainty of the reconstructed depth was evaluated as 0.215 mm.
[145] VeloEdit: Training-Free Consistent and Continuous Instruction-Based Image Editing via Velocity Field Decomposition
Zongqing Li,Zhihui Liu,Yujie Xie,Shansiyuan Wu,Hongshen Lv,Songzhi Su
Main category: cs.CV
TL;DR: 本文提出VeloEdit,一种无需训练的图像编辑方法,通过动态识别编辑区域并利用速度场调控,在保持非编辑区域一致性的同时实现连续可控的编辑强度。
Details
Motivation: Existing instruction-based image editing methods built on flow matching preserve non-edited regions poorly and lack fine-grained control over edit strength.
Method: VeloEdit dynamically partitions editing regions by quantifying the discrepancy between the velocity field that preserves source content and the one that drives the target edit; in preservation regions it replaces the editing velocity with the source-restoring velocity to strengthen consistency, and in target regions it interpolates velocities to adjust edit strength continuously.
Result: Experiments on Flux.1 Kontext and Qwen-Image-Edit show that VeloEdit markedly improves visual consistency and editing continuity with minimal computational overhead.
Conclusion: VeloEdit is an efficient, training-free image editing method that operates directly on velocity fields and addresses key shortcomings of existing methods in consistency and controllability.
Abstract: Instruction-based image editing aims to modify source content according to textual instructions. However, existing methods built upon flow matching often struggle to maintain consistency in non-edited regions due to denoising-induced reconstruction errors that cause drift in preserved content. Moreover, they typically lack fine-grained control over edit strength. To address these limitations, we propose VeloEdit, a training-free method that enables highly consistent and continuously controllable editing. VeloEdit dynamically identifies editing regions by quantifying the discrepancy between the velocity fields responsible for preserving source content and those driving the desired edits. Based on this partition, we enforce consistency in preservation regions by substituting the editing velocity with the source-restoring velocity, while enabling continuous modulation of edit intensity in target regions via velocity interpolation. Unlike prior works that rely on complex attention manipulation or auxiliary trainable modules, VeloEdit operates directly on the velocity fields. Extensive experiments on Flux.1 Kontext and Qwen-Image-Edit demonstrate that VeloEdit improves visual consistency and editing continuity with negligible additional computational cost. Code is available at https://github.com/xmulzq/VeloEdit.
[146] High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding
Ji Woo Hong,Hee Suk Yoon,Gwanhyeong Koo,Eunseop Yoon,SooHwan Eom,Qi Dai,Chong Luo,Chang D. Yoo
Main category: cs.CV
TL;DR: 本文提出了一种基于扩散模型的解码框架,通过仅训练一个扩散解码器来提升预训练视觉语言模型(VLM)生成图像的视觉保真度,无需修改原始VLM,且仅需ImageNet-1K数据短时训练。
Details
Motivation: Existing large-scale vision-language models are constrained by discrete image tokenization, which limits the visual fidelity of generated images, while moving directly to continuous representations incurs high retraining cost.
Method: Propose a diffusion decoding framework comprising Logit-to-Code Distributional Mapping (which maps the VLM's output logits to continuous, distribution-weighted code vectors with uncertainty features), Logit Calibration (which aligns training-time and inference-time logits), and a Distribution-Conditioned Diffusion Decoder (which generates high-fidelity images from these representations).
Result: With only short training on ImageNet-1K, the method markedly improves visual fidelity for both VQ-VAE reconstructions and VLM text-to-image generations, without changing the original VLM.
Conclusion: The method breaks the discrete tokenization bottleneck at very low cost, achieving high-quality image generation while keeping the VLM intact.
Abstract: Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by the discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLMs to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby preserving the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM's image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an effective conditioning signal for diffusion decoding. A lightweight Logit Calibration aligns training-time proxy logits from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap. Conditioned on these representations, the Distribution-Conditioned Diffusion Decoder generates high-fidelity images. Achieved solely through short training on ImageNet-1K, our method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.
[147] WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics
Yuhong Dai,Yanlin Lai,Mitt Huang,Hangyu Guo,Dingming Li,Hongbo Peng,Haodong Li,Yingxiu Zhao,Haoran Lyu,Zheng Ge,Xiangyu Zhang,Daxin Jiang
Main category: cs.CV
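TL;DR: This paper presents WebVR, a benchmark for evaluating how faithfully multimodal large language models (MLLMs) can recreate webpages from demonstration videos, filling a gap in research on video-driven webpage generation.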
Details
Motivation: Existing web-generation benchmarks rely only on text prompts or static screenshots, whereas videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for webpage recreation; yet video-conditioned webpage generation has not been systematically explored and has no dedicated benchmark.
Method: Build the WebVR benchmark of 175 diverse webpages, all produced by a controlled synthesis pipeline (not web crawling), and design a fine-grained, human-aligned visual rubric that evaluates generated webpages along multiple dimensions; run experiments on 19 models and release the dataset, evaluation toolkit, and baseline results.
Result: Experiments reveal substantial gaps in current models' ability to recreate fine-grained style and motion quality; the rubric-based automatic evaluation reaches 96% agreement with human preferences.
Conclusion: WebVR provides the first systematic benchmark and evaluation framework for video-to-webpage generation, supporting future research in this direction.
Abstract: Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while the rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.
[148] Comparative Analysis of Deep Learning Architectures for Multi-Disease Classification of Single-Label Chest X-rays
Ali M. Bahram,Saman Muhammad Omer,Hardi M. Mohammed
Main category: cs.CV
TL;DR: 本研究系统比较了七种深度学习架构在胸部X光多病分类任务中的性能,结果表明所有模型均能达到90%以上的准确率,其中ConvNeXt-Tiny精度最高,MobileNetV2参数效率最优,且各模型对结核和新冠识别近乎完美,验证了轻量高效AI辅助诊断的可行性。
Details
Motivation: Address the limits on chest X-ray diagnostic accuracy caused by the global shortage of radiologists and by inter-observer variability.
Method: On a dataset of 18,080 chest X-ray images in five classes, partitioned at the patient level, train and evaluate seven models under identical conditions (ConvNeXt-Tiny, DenseNet121, DenseNet201, ResNet50, ViT-B/16, EfficientNetV2-M, and MobileNetV2) using ImageNet-pretrained weights, standardized preprocessing, and consistent hyperparameters, with Grad-CAM for interpretability analysis.
Result: All models exceed 90% test accuracy; ConvNeXt-Tiny reaches 92.31% accuracy and 95.70% AUROC; MobileNetV2 achieves 90.42% accuracy and 94.10% AUROC with 3.5M parameters and trains in just 48 minutes; AUROC for tuberculosis and COVID-19 is at least 99.97%; Grad-CAM shows clinically consistent attention patterns.
Conclusion: High-accuracy multi-disease chest X-ray classification does not require excessive computational resources, and the lightweight, efficient models validated here are valuable for AI-assisted diagnosis in both resource-rich and resource-constrained healthcare settings.
Abstract: Chest X-ray imaging remains the primary diagnostic tool for pulmonary and cardiac disorders worldwide, yet its accuracy is hampered by radiologist shortages and inter-observer variability. This study presents a systematic comparative evaluation of seven deep learning architectures for multi-class chest disease classification: ConvNeXt-Tiny, DenseNet121, DenseNet201, ResNet50, ViT-B/16, EfficientNetV2-M, and MobileNetV2. A balanced dataset of 18,080 chest X-ray images spanning five disease categories (Cardiomegaly, COVID-19, Normal, Pneumonia, and Tuberculosis) was constructed from three public repositories and partitioned at the patient level to prevent data leakage. All models were trained under identical conditions using ImageNet-pretrained weights, standardized preprocessing, and consistent hyperparameters. All seven architectures exceeded 90% test accuracy. ConvNeXt-Tiny achieved the highest performance (92.31% accuracy, 95.70% AUROC), while MobileNetV2 emerged as the most parameter-efficient model (3.5M parameters, 90.42% accuracy, 94.10% AUROC), completing training in 48 minutes. Tuberculosis and COVID-19 classification was near-perfect (AUROC >= 99.97%) across all architectures, while Normal, Cardiomegaly, and Pneumonia presented greater challenges due to overlapping radiographic features. Grad-CAM visualizations confirmed clinically consistent attention patterns across disease categories.
These findings demonstrate that high-accuracy multi-disease chest X-ray classification is achievable without excessive computational resources, with important implications for AI-assisted diagnosis in both resource-rich and resource-constrained healthcare settings.
[149] Colony Grounded SAM2: Zero-shot detection and segmentation of bacterial colonies using foundation models
Daan Korporaal,Patrick de Kruijf,Ralph H. G. M. Litjens,Bas H. M. van der Velden
Main category: cs.CV
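TL;DR: This paper proposes Colony Grounded SAM2, a zero-shot inference pipeline that detects and segments bacterial colonies in agar-plate images without additional training. It combines Grounding DINO and Segment Anything Model 2, fine-tuned to the microbiological domain, and achieves high-accuracy detection and segmentation.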
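Segmentation quality in this entry is reported as a Dice@detection score. The entry does not define that variant, so the sketch below shows only the standard Dice coefficient between two binary masks, 2|A∩B| / (|A| + |B|). The function name `dice_score`, the nested-list mask format, and the convention for two empty masks are assumptions made for illustration, not the authors' evaluation code.

```python
def dice_score(pred, target):
    """Standard Dice coefficient between two binary masks.

    Masks are nested lists of 0/1 values with identical shape
    (an illustrative format, not the paper's data layout).
    """
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0
    if pred_sum + target_sum == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / (pred_sum + target_sum)


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    reference = [[1, 0], [0, 0]]
    print(dice_score(predicted, reference))
```

How predicted colonies are matched to reference colonies before the score is taken is left open by the entry and is not modeled here.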
Details
Motivation: Detecting and classifying bacterial colonies in agar-plate images is essential to microbiology research but is held back by the lack of labeled datasets.
Method: Proposes Colony Grounded SAM2, a zero-shot inference pipeline built on the pretrained foundation models Grounding DINO and Segment Anything Model 2, fine-tuned to the microbiological domain.
Result: On out-of-distribution datasets, the pipeline reaches a mean Average Precision (mAP) of 93.1% and a Dice@detection score of 0.85, showing excellent detection and segmentation capability; the entire pipeline and model weights are shared open access.
Conclusion: The method detects and segments bacterial colonies robustly across settings without additional training, providing an effective tool for annotation and classification tasks in microbiology.
Abstract: The detection and classification of bacterial colonies in images of agar-plates is important in microbiology, but is hindered by the lack of labeled datasets. Therefore, we propose Colony Grounded SAM2, a zero-shot inference pipeline to detect and segment bacterial colonies in multiple settings without any further training. By utilizing the pre-trained foundation models Grounding DINO and Segment Anything Model 2, fine-tuned to the microbiological domain, we developed a model that is robust to data changes. Results showed a mean Average Precision of 93.1% and a Dice@detection score of 0.85, showing excellent detection and segmentation capabilities on out-of-distribution datasets. The entire pipeline with model weights is shared open access to aid with annotation and classification purposes in microbiology.
[150] Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models
Sihan Cao,Jianwei Zhang,Pengcheng Zheng,Jiaxin Yan,Caiyan Qin,Yalan Ye,Wei Dong,Peng Wang,Yang Yang,Chaoning Zhang
Main category: cs.CV
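TL;DR: This paper proposes TPRL, a reinforcement learning framework for visual token pruning that achieves adaptive pruning through language-guided sequential optimization, substantially reducing computational cost with almost no loss in accuracy.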
Details
Motivation: Large Vision-Language Models (LVLMs) are expensive at inference time. Existing pruning methods struggle to model multi-step dependent decisions and rely on hand-designed rules, so they lack adaptive optimization for complex reasoning trajectories.
Method: Visual token pruning is formulated as a sequential decision process with explicit state transitions; a self-supervised autoencoder compresses visual tokens into a compact state representation; the policy is first initialized by learning from demonstrations and then fine-tuned with PPO to jointly optimize task accuracy and computational efficiency.
Result: Up to 66.7% of visual tokens can be removed and inference FLOPs drop by up to 54.2%, with an average accuracy drop of only 0.7%.
Conclusion: TPRL delivers efficient, adaptive visual token pruning that largely preserves performance, offering a new paradigm for lightweight LVLMs.
Abstract: Large Vision-Language Models (LVLMs) incur substantial inference costs due to the processing of a vast number of visual tokens. Existing methods typically struggle to model progressive visual token reduction as a multi-step decision process with sequential dependencies and often rely on hand-engineered scoring rules that lack adaptive optimization for complex reasoning trajectories. To overcome these limitations, we propose TPRL, a reinforcement learning framework that learns adaptive pruning trajectories through language-guided sequential optimization tied directly to end-task performance. We formulate visual token pruning as a sequential decision process with explicit state transitions and employ a self-supervised autoencoder to compress visual tokens into a compact state representation for efficient policy learning. The pruning policy is initialized through learning from demonstrations and subsequently fine-tuned using Proximal Policy Optimization (PPO) to jointly optimize task accuracy and computational efficiency. Our experimental results demonstrate that TPRL removes up to 66.7% of visual tokens and achieves up to a 54.2% reduction in FLOPs during inference while maintaining a near-lossless average accuracy drop of only 0.7%. Code is released at https://github.com/MagicVicCoder/TPRL.
[151] COT-FM: Cluster-wise Optimal Transport Flow Matching
Chiensheng Chiang,Kuan-Hsun Tu,Jia-Wei Liao,Cheng-Fu Chou,Tsung-Wei Ke
Main category: cs.CV
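TL;DR: COT-FM is a general framework that improves the Flow Matching generation process: by reshaping the probability path it straightens the vector field, which accelerates sampling and improves generation quality without modifying the model architecture.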
Details
Motivation: Flow Matching (FM) models often produce curved generation trajectories because of random or batchwise couplings, which increases discretization error and lowers sample quality.
Method: COT-FM follows a divide-and-conquer strategy: it clusters the target samples and, for each cluster, reverses pretrained FM models to obtain a dedicated source distribution, improving local transport and yielding straighter vector fields.
Result: Across 2D datasets, image generation benchmarks, and robotic manipulation tasks, COT-FM consistently accelerates sampling and improves generation quality, and as a plug-and-play method it leaves the model architecture unchanged.
Conclusion: COT-FM is an efficient, general, architecture-agnostic framework for improving Flow Matching that reliably improves generation performance.
Abstract: We introduce COT-FM, a general framework that reshapes the probability path in Flow Matching (FM) to achieve faster and more reliable generation. FM models often produce curved trajectories due to random or batchwise couplings, which increase discretization error and reduce sample quality. COT-FM fixes this by clustering target samples and assigning each cluster a dedicated source distribution obtained by reversing pretrained FM models. This divide-and-conquer strategy yields more accurate local transport and significantly straighter vector fields, all without changing the model architecture. As a plug-and-play approach, COT-FM consistently accelerates sampling and improves generation quality across 2D datasets, image generation benchmarks, and robotic manipulation tasks.
[152] SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation
Jan Kociszewski,Hubert Jastrzębski,Tymoteusz Stępkowski,Filip Manijak,Krzysztof Rojek,Franziska Boenisch,Adam Dziedzic
Main category: cs.CV
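TL;DR: SERUM is a simple yet effective watermarking method for diffusion-model images: it adds a unique watermark noise to the initial noise and trains a lightweight detector, achieving strong robustness, high efficiency, and negligible quality loss.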
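Detection performance for this entry is reported as the true positive rate (TPR) at a 1% false positive rate (FPR). As a minimal sketch of how such a figure is commonly computed from detector scores: pick the threshold that lets at most 1% of clean images through, then count how many watermarked images exceed it. The function name `tpr_at_fpr` and the score lists are hypothetical; this is not SERUM's evaluation code.

```python
def tpr_at_fpr(pos_scores, neg_scores, target_fpr=0.01):
    """TPR at a fixed FPR, from raw detector scores (higher = more likely watermarked).

    pos_scores: scores for watermarked images (hypothetical detector outputs).
    neg_scores: scores for clean images.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of clean images allowed to be flagged at the target FPR (rounded down).
    allowed_false_positives = int(target_fpr * len(neg_sorted))
    # Flag only scores strictly above the next-highest clean score, so that
    # at most `allowed_false_positives` clean images are flagged.
    threshold = neg_sorted[allowed_false_positives]
    true_positives = sum(1 for s in pos_scores if s > threshold)
    return true_positives / len(pos_scores)


if __name__ == "__main__":
    clean = [i / 100 for i in range(100)]        # 100 clean-image scores
    watermarked = [0.985, 0.5, 0.995, 0.99]      # 4 watermarked-image scores
    print(tpr_at_fpr(watermarked, clean))        # threshold is 0.98 here
```

With only 100 clean scores, a 1% FPR allows a single false positive, so estimates at this operating point are only meaningful with many negatives.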
Details
Motivation: Existing watermarking methods for diffusion-model images have limited robustness, high computational overhead, and poor support for individualized multi-user watermarks.
Method: A unique watermark noise is embedded into the initial noise of diffusion generation, and a lightweight detector is trained to identify the watermark; a decoupled architecture supports independent watermark embedding for multiple users.
Result: SERUM achieves the highest true positive rate at a 1% false positive rate in most scenarios, is robust to image augmentations and watermark removal attacks, injects and detects quickly, has low detector training overhead, and has negligible impact on image quality.
Conclusion: SERUM provides a practical, efficient, and scalable solution for marking and detecting images generated by diffusion models, reliably distinguishing generated from natural images.
Abstract: We propose SERUM: an intriguingly simple yet highly effective method for marking images generated by diffusion models (DMs). We only add a unique watermark noise to the initial diffusion generation noise and train a lightweight detector to identify watermarked images, simplifying and unifying the strengths of prior approaches. SERUM provides robustness against any image augmentations or watermark removal attacks and is extremely efficient, all while maintaining negligible impact on image quality. In contrast to prior approaches, which are often only resilient to limited perturbations and incur significant training, injection, and detection costs, our SERUM achieves remarkable performance, with the highest true positive rate (TPR) at a 1% false positive rate (FPR) in most scenarios, along with fast injection and detection and low detector training overhead. Its decoupled architecture also seamlessly supports multiple users by embedding individualized watermarks with little interference between the marks. Overall, our method provides a practical solution to mark outputs from DMs and to reliably distinguish generated from natural images.
[153] TennisExpert: Towards Expert-Level Analytical Sports Video Understanding
Zhaoyu Liu,Xi Weng,Lianyu Hu,Zhe Hou,Kan Jiang,Jin Song Dong,Yang Liu
Main category: cs.CV
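TL;DR: This paper introduces TennisVL, a large-scale tennis benchmark dataset, and TennisExpert, a multimodal understanding framework, to address two key challenges in automatic tennis understanding: the lack of finely annotated data and the lack of efficient real-time models.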
Details
Motivation: Automatic tennis understanding faces two key challenges: the lack of large-scale datasets with fine-grained annotations and expert-level commentary, and the difficulty of building multimodal systems that are both accurate and efficient enough for real-time deployment.
Method: Builds the TennisVL benchmark, comprising 200+ professional matches (471.9 hours) and 40,000+ rally-level clips; proposes the TennisExpert framework, which integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B, parses elements such as scores, shot sequences, ball bounces, and player locations, and models short- and long-term temporal context through hierarchical memory modules.
Result: TennisExpert consistently outperforms strong proprietary baselines such as GPT-5, Gemini, and Claude, and is better at capturing tactical context and match dynamics.
Conclusion: TennisVL and TennisExpert provide a data foundation and technical paradigm for professional tennis analysis, automated coaching, and real-time commentary, advancing multimodal research in sports understanding.
Abstract: Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics.
[154] Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
Daxiang Dong,Mingming Zheng,Dong Xu,Chunhua Luo,Bairong Zhuang,Yuxuan Li,Ruoyun He,Haoran Wang,Wenyu Zhang,Wenbo Wang,Yicheng Wang,Xue Xiong,Ayong Zheng,Xiaoying Zuo,Ziwei Ou,Jingnan Gu,Quanhao Guo,Jianmin Wu,Dawei Yin,Dou Shen
Main category: cs.CV
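TL;DR: Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding, supporting direct image-to-Markdown conversion and a range of prompt-driven tasks. It proposes a 'Layout-as-Thought' mechanism to recover layout grounding and leads models of comparable scale on several benchmarks.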
Details
Motivation: Address the loss of explicit layout analysis in end-to-end OCR while improving accuracy on complex layouts and supporting versatile document understanding.
Method: Builds Qianfan-OCR, a 4B-parameter end-to-end VLM; introduces an optional 'Layout-as-Thought' thinking phase, triggered by special think tokens, that generates structured layout representations such as bounding boxes, element types, and reading order.
Result: Ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8); is competitive with general VLMs of comparable scale on OCRBench, CCOCR, DocVQA, and ChartQA; and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B.
Conclusion: Qianfan-OCR shows that an end-to-end architecture combined with an explicit layout reasoning mechanism is effective, offering a new paradigm for high-performance, versatile document understanding; the model is publicly accessible via the Baidu Qianfan platform.
Abstract: We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
[155] FlowAD: Ego-Scene Interactive Modeling for Autonomous Driving
Mingzhe Guo,Yixiang Yang,Chuanrong Han,Rufeng Zhang,Shirui Li,Ji Wan,Zhipeng Zhang
Main category: cs.CV
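TL;DR: This paper proposes a new ego-scene interactive modeling paradigm (FlowAD) that models ego-centric scene flow to capture explicitly how ego motion affects observations, improving perception, planning, and understanding in autonomous driving.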
Details
Motivation: Existing environment modeling methods for autonomous driving often overlook the feedback of ego motion on the observation, which leads to an incomplete understanding of the driving process and limits planning performance.
Method: Proposes the FlowAD framework: (1) an ego-guided scene partition, based on the ego vehicle's forward direction and steering velocity, constructs basic flow units; (2) spatial and temporal flow predictions over the flow units model the spatial displacement and temporal variation of scene flow; (3) a task-aware enhancement module uses the learned spatio-temporal flow dynamics to support object-level and region-level downstream tasks. A new metric, FCP (Frames before Correct Planning), is introduced to assess scene understanding.
Result: On nuScenes, FlowAD reduces the collision rate by 19% relative to SparseDrive and improves FCP by 1.39 frames (60%); on Bench2Drive it achieves a driving score of 51.77; the results confirm its generality and effectiveness across perception, end-to-end planning, and VLM analysis.
Conclusion: Ego-scene interactive modeling is key to improving overall autonomous driving performance. By introducing a learnable, physically meaningful scene flow representation, FlowAD strengthens scene understanding and task generalization without relying on simulated data.
Abstract: Effective environment modeling is the foundation for autonomous driving, underpinning tasks from perception to planning. However, current paradigms often inadequately consider the feedback of ego motion to the observation, which leads to an incomplete understanding of the driving process and consequently limits the planning capability. To address this issue, we introduce a novel ego-scene interactive modeling paradigm. Inspired by human recognition, the paradigm represents ego-scene interaction as the scene flow relative to the ego-vehicle. This conceptualization allows for modeling ego-motion feedback within a feature learning pattern, advantageously utilizing existing log-replay datasets rather than relying on scenario simulations. We specifically propose FlowAD, a general flow-based framework for autonomous driving. Within it, an ego-guided scene partition first constructs basic flow units to quantify scene flow. The ego-vehicle's forward direction and steering velocity directly shape the partition, which reflects ego motion. Then, based on flow units, spatial and temporal flow predictions are performed to model dynamics of scene flow, encompassing both spatial displacement and temporal variation. The final task-aware enhancement exploits learned spatio-temporal flow dynamics to benefit diverse tasks through object and region-level strategies. We also propose a novel Frames before Correct Planning (FCP) metric to assess the scene understanding capability.
Experiments in both open- and closed-loop evaluations demonstrate FlowAD's generality and effectiveness across perception, end-to-end planning, and VLM analysis. Notably, FlowAD reduces the collision rate by 19% relative to SparseDrive, with FCP improvements of 1.39 frames (60%) on nuScenes, and achieves a driving score of 51.77 on Bench2Drive. Code, model, and configurations will be released.
[156] Combining Microscopy Data and Metadata for Reconstruction of Cellular Traction Forces Using a Hybrid Vision Transformer-U-Net
Yunfei Huang,Elena Van der Vorst,Alexander Richard,Benedikt Sabass
Main category: cs.CV
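TL;DR: This paper proposes ViT+UNet, a hybrid deep learning architecture combining a U-Net with a Vision Transformer, to improve the accuracy, multi-scale generalization, and noise robustness of cellular traction force field prediction from traction force microscopy (TFM) data; it also supports metadata such as cell type to improve prediction specificity.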
Details
Motivation: Existing deep learning methods for TFM data analysis still struggle to infer reliably across multiple spatial scales and to incorporate contextual information such as cell type to improve accuracy.
Method: Proposes the hybrid ViT+UNet model, which combines a U-Net with a Vision Transformer; metadata such as cell type is incorporated through structured input design; the model is evaluated across multiple spatial scales and noise levels.
Result: ViT+UNet outperforms standalone U-Net and Vision Transformer architectures at predicting traction force fields, generalizes better across spatial scales, is more robust to noise, and uses cell-type information effectively to improve prediction accuracy and specificity.
Conclusion: ViT+UNet is a robust, scalable, context-aware deep learning framework suited to TFM data analysis across different experimental conditions and imaging systems.
Abstract: Traction force microscopy (TFM) is a widely used technique for quantifying the forces that cells exert on their surrounding extracellular matrix. Although deep learning methods have recently been applied to TFM data analysis, several challenges remain, particularly achieving reliable inference across multiple spatial scales and integrating additional contextual information such as cell type to improve accuracy. In this study, we propose ViT+UNet, a robust deep learning architecture that integrates a U-Net with a Vision Transformer. Our results demonstrate that this hybrid model outperforms both standalone U-Net and Vision Transformer architectures in predicting traction force fields. Furthermore, ViT+UNet exhibits superior generalization across diverse spatial scales and varying noise levels, enabling its application to TFM datasets obtained from different experimental setups and imaging systems. By appropriately structuring the input data, our approach also allows the inclusion of metadata, in our case cell-type information, to enhance prediction specificity and accuracy.
[157] MAD: Microenvironment-Aware Distillation -- A Pretraining Strategy for Virtual Spatial Omics from Microscopy
Jiashu Han,Kunzan Liu,Yeojin Kim,Saurabh Sinha,Sixian You
Main category: cs.CV
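TL;DR: This paper proposes MAD (microenvironment-aware distillation), a pretraining strategy that learns cell-centric embeddings by jointly self-distilling a cell's morphology view and microenvironment view. Across tissues and imaging modalities it achieves state-of-the-art performance on downstream tasks such as cell subtyping and transcriptomic prediction, even outperforming foundation models with a similar parameter count trained on much larger datasets.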
Details
Motivation: Bridge the gap between microscopy and omics technologies so that single-cell molecular states can be read from images in a label-free, high-throughput way; explore how self-supervised pretraining can encode single-cell identity within tissue microenvironments and how much biological information it can capture.
Method: MAD (microenvironment-aware distillation) jointly self-distills the morphology view and the microenvironment view of the same indexed cell into a unified embedding space to learn cell-centric representations.
Result: Across multiple tissues and imaging modalities, MAD achieves state-of-the-art performance on downstream tasks including cell subtyping, transcriptomic prediction, and bioinformatic inference, and outperforms existing foundation models with a similar parameter count that were trained on substantially larger datasets.
Conclusion: MAD's dual-view joint self-distillation effectively captures the complexity and diversity of cells within tissues; it is a general approach to representation learning for microscopy images, supporting virtual spatial omics and biological insight from large-scale microscopy data.
Abstract: Bridging microscopy and omics would allow us to read molecular states from images, at single-cell resolution and tissue scale, without the cost and throughput limits of omics technologies. Self-supervised pretraining offers a scalable approach with minimal labels, yet how to encode single-cell identity within tissue environments, and the extent of biological information such models can capture, remains an open question. Here, we introduce MAD (microenvironment-aware distillation), a pretraining strategy that learns cell-centric embeddings by jointly self-distilling the morphology view and the microenvironment view of the same indexed cell into a unified embedding space. Across diverse tissues and imaging modalities, MAD achieves state-of-the-art prediction performance on downstream tasks including cell subtyping, transcriptomic prediction, and bioinformatic inference. MAD even outperforms foundation models with a similar number of model parameters that have been trained on substantially larger datasets. These results demonstrate that MAD's dual-view joint self-distillation effectively captures the complexity and diversity of cells within tissues. Together, this establishes MAD as a general tool for representation learning in microscopy, enabling virtual spatial omics and biological insights from vast microscopy datasets.
[158] Event-Driven Video Generation
Chika Maduabuchi
Main category: cs.CV
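TL;DR: This paper proposes Event-Driven Video Generation (EVD), a framework whose event-aware modeling and sampling substantially reduce the physical-interaction hallucinations common in current text-to-video models.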
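This entry describes event-gated sampling with hysteresis but gives no thresholds or update rule. As a rough illustration of hysteresis gating in general, the sketch below turns a gate on when predicted event activity rises above an upper threshold and turns it off only once activity falls below a lower one, so the gate does not flicker around a single cutoff. The function name `hysteresis_gate` and both threshold values are assumptions, not values from the paper.

```python
def hysteresis_gate(activity, on_threshold=0.6, off_threshold=0.4):
    """Return a list of booleans: True where updates would be allowed.

    activity: per-step event-activity scores in [0, 1] (hypothetical
    outputs of an event head). Thresholds are illustrative assumptions.
    """
    if off_threshold > on_threshold:
        raise ValueError("off_threshold must not exceed on_threshold")
    gate = []
    active = False
    for score in activity:
        if not active and score >= on_threshold:
            active = True       # activity rose high enough: open the gate
        elif active and score <= off_threshold:
            active = False      # activity fell low enough: close the gate
        gate.append(active)
    return gate


if __name__ == "__main__":
    scores = [0.1, 0.5, 0.7, 0.5, 0.45, 0.3, 0.5]
    print(hysteresis_gate(scores))
```

Scores between the two thresholds keep whatever state the gate was already in, which is the point of hysteresis: 0.5 leaves the gate closed before the event and open during it.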
Details
Motivation: State-of-the-art text-to-video models produce realistic frames but frequently fail on physical interactions (contact, support, action execution). The authors attribute this mainly to the frame-first denoising paradigm, which lacks an explicit model of when and where interactions occur.
Method: Proposes EVD: (1) a lightweight event head predicts token-aligned event activity; (2) an event-state coupling loss ties activity to state change; (3) event-gated sampling with hysteresis and early-step scheduling suppresses spurious updates outside interactions.
Result: On EVD-Bench, EVD consistently improves human preference scores and VBench dynamics metrics, substantially reducing four failure modes (state persistence, spatial accuracy, support relations, and contact stability) without harming visual quality.
Conclusion: Explicit event grounding is a practical and effective abstraction for reducing interaction hallucinations in video generation.
Abstract: State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing failure modes in state persistence, spatial accuracy, support relations, and contact stability without sacrificing appearance. These results indicate that explicit event grounding is a practical abstraction for reducing interaction hallucinations in video generation.
[159] Diabetic Retinopathy Grading with CLIP-based Ranking-Aware Adaptation: A Comparative Study on Fundus Image
Sungjun Cho
Main category: cs.CV
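TL;DR: This paper studies three CLIP-based approaches to five-class diabetic retinopathy (DR) severity grading, evaluated on the APTOS 2019 and Messidor-2 datasets; the ranking-aware prompting model performs best (accuracy 93.42%, AUROC 0.9845).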
Details
Motivation: Diabetic retinopathy is a leading cause of preventable blindness, and efficient, automated fundus image screening is needed to support large-scale early diagnosis.
Method: Proposes and compares three CLIP-based approaches: (1) a zero-shot baseline using prompt engineering; (2) a hybrid FCN-CLIP model augmented with CBAM attention; (3) a ranking-aware prompting model that encodes the ordinal structure of DR progression. Training data combine APTOS 2019 and Messidor-2 (n=5,406), with resampling and class-specific thresholding to mitigate class imbalance.
Result: The ranking-aware model achieves the highest overall accuracy (93.42%, AUROC 0.9845) and high recall on clinically critical severe cases; the hybrid FCN-CLIP model (92.49%, AUROC 0.99) is best at detecting proliferative DR; the zero-shot baseline is weakest (55.17%, AUROC 0.75).
Conclusion: The three approaches have complementary strengths; ranking-aware modeling and architectural fusion both substantially improve DR grading over the zero-shot baseline, offering a practical option for clinical screening.
Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness, and automated fundus image grading can play an important role in large-scale screening. In this work, we investigate three CLIP-based approaches for five-class DR severity grading: (1) a zero-shot baseline using prompt engineering, (2) a hybrid FCN-CLIP model augmented with CBAM attention, and (3) a ranking-aware prompting model that encodes the ordinal structure of DR progression. We train and evaluate on a combined dataset of APTOS 2019 and Messidor-2 (n=5,406), addressing class imbalance through resampling and class-specific optimal thresholding. Our experiments show that the ranking-aware model achieves the highest overall accuracy (93.42%, AUROC 0.9845) and strong recall on clinically critical severe cases, while the hybrid FCN-CLIP model (92.49%, AUROC 0.99) excels at detecting proliferative DR. Both substantially outperform the zero-shot baseline (55.17%, AUROC 0.75). We analyze the complementary strengths of each approach and discuss their practical implications for screening contexts.
[160] Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion
Yang Yang,Tianyi Zhang,Wei Huang,Jinwei Chen,Boxi Wu,Xiaofei He,Deng Cai,Bo Li,Peng-Tao Jiang
Main category: cs.CV
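TL;DR: This paper proposes Anchor Forcing, a framework that uses an anchor-guided re-cache mechanism and a tri-region RoPE design to address the quality degradation and weakened motion dynamics caused by prompt switching in interactive long video generation.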
Details
Motivation: Existing streaming video diffusion models face two problems in interactive long video generation: at prompt switches they cannot retain both semantic context and recent latent cues, and during distillation a positional-encoding distribution shift weakens motion priors.
Method: Proposes the Anchor Forcing framework: (1) an anchor-guided re-cache mechanism that uses anchor caches to stabilize KV states at prompt switches; (2) a tri-region RoPE with RoPE re-alignment distillation, which bridges the gap between unbounded streaming indices and the pretrained bounded RoPE.
Result: In long-video experiments, the method improves perceptual quality and motion metrics in interactive settings, outperforming existing streaming baselines.
Conclusion: Anchor Forcing mitigates weak boundary conditioning and degraded motion priors in interactive long video generation, offering a new approach to high-quality, long-horizon, interactive video generation.
Abstract: Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose Anchor Forcing, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings.
Project page: https://github.com/vivoCameraResearch/Anchor-Forcing
[161] Nuanced Emotion Recognition Based on a Segment-based MLLM Framework Leveraging Qwen3-Omni for AH Detection
Liang Tang,Hongda Li,Jiayu Zhang,Long Chen,Shuxian Li,Siqi Pei,Tiaonan Duan,Yuhao Cheng
Main category: cs.CV
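TL;DR: This paper proposes a video emotion recognition framework that combines temporal segment modeling with multimodal large language models (MLLMs) to identify subtle psychological states such as Ambivalence and Hesitancy. Long videos are split into clips of at most 5 seconds, and the Qwen3-Omni-30B-A3B model is fine-tuned on the BAH dataset, reaching 85.1% test accuracy and significantly outperforming existing methods.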
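The segmentation step above can be sketched in a few lines: given a video duration, compute clip boundaries so that no clip exceeds 5 seconds. The function name `segment_bounds` and the choice to cut at fixed 5-second strides with a shorter final clip are assumptions; the entry only states the maximum clip length, not how boundaries are chosen.

```python
import math


def segment_bounds(duration_s, max_len_s=5.0):
    """Split [0, duration_s] into consecutive clips of at most max_len_s seconds.

    Returns a list of (start, end) tuples in seconds. Fixed-stride cutting
    is an illustrative assumption, not the paper's stated procedure.
    """
    if max_len_s <= 0:
        raise ValueError("max_len_s must be positive")
    if duration_s <= 0:
        return []
    n_clips = math.ceil(duration_s / max_len_s)
    return [
        (i * max_len_s, min((i + 1) * max_len_s, duration_s))
        for i in range(n_clips)
    ]


if __name__ == "__main__":
    # A 12-second video yields two full clips and one 2-second remainder.
    print(segment_bounds(12.0))
```

Each (start, end) pair could then be passed to whatever video loader is in use to extract the frames and audio for that clip.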
Details
Motivation: Recognizing subtle psychological states in video such as ambivalence and hesitancy is valuable for behavioral intervention and digital health, but these states often appear as cross-modal inconsistencies (for example, conflicts among facial expression, vocal tone, and textual semantics), which makes automated recognition very challenging.
Method: A segment-based strategy splits long videos into clips of at most 5 seconds to ease computational and token constraints; using the MS-Swift framework with LoRA and full-parameter fine-tuning strategies, the Qwen3-Omni-30B-A3B multimodal large language model is trained on the BAH dataset to analyze visual and auditory signals jointly.
Result: The method reaches 85.1% accuracy on the test set, significantly outperforming existing benchmarks and confirming the ability of multimodal large language models to capture complex emotional conflicts; the code is open source.
Conclusion: Combining temporal segment modeling with multimodal large language models offers an efficient and robust new paradigm for recognizing emotional states marked by cross-modal inconsistency.
Abstract: Emotion recognition in videos is a pivotal task in affective computing, where identifying subtle psychological states such as Ambivalence and Hesitancy holds significant value for behavioral intervention and digital health. Ambivalence and Hesitancy states often manifest through cross-modal inconsistencies such as discrepancies between facial expressions, vocal tones, and textual semantics, posing a substantial challenge for automated recognition. This paper proposes a recognition framework that integrates temporal segment modeling with Multimodal Large Language Models. To address computational efficiency and token constraints in long video processing, we employ a segment-based strategy, partitioning videos into short clips with a maximum duration of 5 seconds. We leverage the Qwen3-Omni-30B-A3B model, fine-tuned on the BAH dataset using LoRA and full-parameter strategies via the MS-Swift framework, enabling the model to synergistically analyze visual and auditory signals. Experimental results demonstrate that the proposed method achieves an accuracy of 85.1% on the test set, significantly outperforming existing benchmarks and validating the superior capability of Multimodal Large Language Models in capturing complex and nuanced emotional conflicts. The code is released at https://github.com/dlnn123/A-H-Detection-with-Qwen-Omni.git.
[162] Bridging the Visual-to-Physical Gap: Physically Aligned Representations for Fall Risk Analysis
Xianqi Zhang
Main category: cs.CV
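TL;DR: This paper proposes PHARL, which uses physics-aware alignment representation learning to improve vision-based fall risk analysis without clinical outcome labels and exhibits zero-shot ordinality.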
Details
Motivation: Existing vision-based fall analysis methods depend on real injury labels that are hard to obtain and noisy, which limits supervised learning; moreover, visually similar motions can lead to very different physical outcomes, and key factors such as contact mechanics and protective responses are hard to infer from appearance alone.
Method: Proposes the PHARL (PHysics-aware Alignment Representation Learning) framework: a trajectory-level temporal consistency constraint stabilizes representation learning, and multi-class physics alignment constraints use simulation-derived contact outcomes to shape embedding geometry. Video windows are paired with temporally aligned simulation descriptors for training, while inference stays purely feed-forward.
Result: On four public datasets, PHARL consistently improves risk-aligned representation quality over visual-only baselines while maintaining strong fall-detection performance; it also exhibits zero-shot ordinality, with an interpretable severity ordering (Head > Trunk > Supported) emerging without explicit ordinal supervision.
Conclusion: PHARL shows that physical priors are effective for representation learning without clinical outcome labels, offering a more robust and interpretable approach to high-risk motion analysis that does not require manual annotation.
Abstract: Vision-based fall analysis has advanced rapidly, but a key bottleneck remains: visually similar motions can correspond to very different physical outcomes because small differences in contact mechanics and protective responses are hard to infer from appearance alone. Most existing approaches handle this by supervised injury prediction, which depends on reliable injury labels. In practice, such labels are difficult to obtain: video evidence is often ambiguous (occlusion, viewpoint limits), and true injury events are rare and cannot be safely staged, leading to noisy supervision. We address this problem with PHARL (PHysics-aware Alignment Representation Learning), which learns physically meaningful fall representations without requiring clinical outcome labels. PHARL regularizes motion embeddings with two complementary constraints: (1) trajectory-level temporal consistency for stable representation learning, and (2) multi-class physics alignment, where simulation-derived contact outcomes shape embedding geometry. By pairing video windows with temporally aligned simulation descriptors, PHARL captures local impact-relevant dynamics while keeping inference purely feed-forward. Experiments on four public datasets show that PHARL consistently improves risk-aligned representation quality over visual-only baselines while maintaining strong fall-detection performance.
Notably, PHARL also exhibits zero-shot ordinality: an interpretable severity structure (Head > Trunk > Supported) emerges without explicit ordinal supervision.
[163] WAT: Online Video Understanding Needs Watching Before Thinking
Zifan Han,Hongbo Sun,Jinglin Xu,Canhui Tang,Yulong Lei,Xuchong Zhang,Hongbin Sun,Zhongjiang He,Hao Sun
Main category: cs.CV
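TL;DR: WAT is a two-stage framework for online video stream reasoning. A hierarchical memory system (short-term and long-term memory) and a query-aware retrieval mechanism enable efficient cross-temporal reasoning under memory constraints, and the method reaches state-of-the-art performance on several online video benchmarks.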
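To make the memory design in this entry concrete, here is a toy sketch of a two-level memory: a short-term buffer of recent frame features that spills into a fixed-capacity long-term store, which evicts its most redundant entry when full. Everything specific here is an assumption for illustration: the class name `HierarchicalMemory`, plain Python lists as frame features, cosine similarity as the redundancy measure, and forming the retrieval key by adding the query to the mean of the short-term buffer. The paper's actual eviction policy and retrieval mechanism are not specified in this entry.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class HierarchicalMemory:
    """Toy STM + LTM memory; an illustrative sketch, not WAT's implementation."""

    def __init__(self, stm_size, ltm_capacity):
        self.stm = deque()            # recent frame features, oldest first
        self.stm_size = stm_size
        self.ltm = []                 # long-term summary of older frames
        self.ltm_capacity = ltm_capacity

    def watch(self, feature):
        """Query-independent step: buffer a new frame feature."""
        self.stm.append(feature)
        if len(self.stm) > self.stm_size:
            self._store(self.stm.popleft())

    def _store(self, feature):
        self.ltm.append(feature)
        if len(self.ltm) > self.ltm_capacity:
            # Redundancy-aware eviction (assumed rule): drop the entry whose
            # nearest neighbour in the LTM is most similar to it.
            # Ties resolve toward the older entry.
            worst_index, worst_sim = 0, -math.inf
            for i, f in enumerate(self.ltm):
                sim = max(cosine(f, g) for j, g in enumerate(self.ltm) if j != i)
                if sim > worst_sim:
                    worst_index, worst_sim = i, sim
            del self.ltm[worst_index]

    def retrieve(self, query, k=1):
        """Query-triggered step: rank LTM entries against query + STM context."""
        key = list(query)
        if self.stm:
            n = len(self.stm)
            stm_mean = [sum(f[d] for f in self.stm) / n for d in range(len(query))]
            key = [q + m for q, m in zip(query, stm_mean)]  # assumed combination
        ranked = sorted(self.ltm, key=lambda f: cosine(key, f), reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    memory = HierarchicalMemory(stm_size=1, ltm_capacity=2)
    for frame in ([1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [1.0, 1.0]):
        memory.watch(frame)
    print(memory.ltm)                       # near-duplicate frame was evicted
    print(memory.retrieve([0.0, 1.0], k=1))
```

The eviction loop is quadratic in the LTM size, which is acceptable for a small fixed capacity; a real system would likely cache pairwise similarities.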
Details
Motivation: Existing Video LLMs struggle in online streaming scenarios to preserve long temporal context under strict memory limits, and so fall short of real-time video understanding needs.
Method: Proposes WAT (Watching Before Thinking), a two-stage framework. The watching stage builds a hierarchical memory (an STM that buffers recent frames and an LTM that maintains a summary of history with a redundancy-aware policy); the thinking stage retrieves relevant historical frames from the LTM, based on the query and the STM context, for cross-temporal reasoning. The WAT-85K dataset of streaming-style annotations is built to support training.
Result: WAT achieves 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while running at real-time frame rates.
Conclusion: By decoupling perception from reasoning and designing an efficient memory and retrieval scheme, WAT addresses the tension between long-horizon temporal modeling and memory constraints in online video reasoning, offering a new paradigm for real-time video understanding.
Abstract: Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical frames from the LTM for cross-temporal reasoning. To support training for online video tasks, we introduce WAT-85K, a dataset containing streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting. Experiments show that WAT achieves state-of-the-art performance on online video benchmarks, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while operating at real-time frame rates.
[164] Distance-aware Soft Prompt Learning for Multimodal Valence-Arousal Estimation
Byeongjin Jung,Chanyeong Park,Sejoon Lim
Main category: cs.CV
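TL;DR: This paper proposes a multimodal valence-arousal (VA) emotion estimation framework based on Distance-aware Soft Prompt Learning. It extracts features with a CLIP image encoder and an AST audio encoder and combines GRU temporal modeling with hierarchical fusion, achieving accurate continuous emotion regression on the Aff-Wild2 dataset.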
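The distance-aware soft labels in this entry come from a 3×3 partition of the VA space and a Gaussian kernel over the Euclidean distance to each region center. A minimal sketch of that construction follows. The VA range of [-1, 1] per axis, the resulting center coordinates, the kernel width `sigma`, and the normalization to a probability vector are all assumptions; the entry does not state them.

```python
import math

# Centers of a 3x3 grid over [-1, 1] x [-1, 1] (assumed VA range),
# ordered valence-major: index 4 is the neutral center (0, 0).
GRID_AXIS = (-2.0 / 3.0, 0.0, 2.0 / 3.0)
CENTERS = [(v, a) for v in GRID_AXIS for a in GRID_AXIS]


def soft_labels(valence, arousal, sigma=0.5):
    """Gaussian-kernel soft labels over the nine region centers.

    sigma is an illustrative kernel width; weights are normalized to sum to 1.
    """
    weights = [
        math.exp(-((valence - cv) ** 2 + (arousal - ca) ** 2) / (2.0 * sigma ** 2))
        for cv, ca in CENTERS
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    labels = soft_labels(0.5, 0.6)
    for center, weight in zip(CENTERS, labels):
        print(center, round(weight, 3))
```

Compared with a one-hot region label, a ground-truth point near a region boundary spreads its weight across the neighboring regions, which is what lets a prompt-based model see gradual emotional transitions.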
Details
Motivation: Existing methods built on vision-language models such as CLIP are limited by the discrete nature of text prompts and handle continuous emotion regression tasks such as VA poorly.
Method: The VA space is partitioned into a 3×3 grid, with a textual description for each region; a Gaussian kernel over the distance from the ground-truth coordinates to each region center produces soft labels; a CLIP image encoder and an AST audio encoder extract features, GRUs model temporal structure, and hierarchical multimodal fusion combines cross-modal attention with gated fusion.
Result: The approach significantly improves VA estimation accuracy on the Aff-Wild2 dataset, with competitive performance in unconstrained in-the-wild scenarios.
Conclusion: Semantic-guided soft prompt learning and hierarchical multimodal fusion bridge the gap between pretrained vision-language models and continuous emotion regression, offering a new approach to fine-grained emotion analysis in natural settings.
Abstract: Valence-arousal (VA) estimation is crucial for capturing the nuanced nature of human emotions in naturalistic environments. While pre-trained Vision-Language models like CLIP have shown remarkable semantic alignment capabilities, their application in continuous regression tasks is often limited by the discrete nature of text prompts. In this paper, we propose a novel multimodal framework for VA estimation that introduces Distance-aware Soft Prompt Learning to bridge the gap between semantic space and continuous dimensions. Specifically, we partition the VA space into a 3×3 grid, defining nine emotional regions, each associated with distinct textual descriptions. Rather than a hard categorization, we employ a Gaussian kernel to compute soft labels based on the Euclidean distance between the ground truth coordinates and the region centers, allowing the model to learn fine-grained emotional transitions. For multimodal integration, our architecture utilizes a CLIP image encoder and an Audio Spectrogram Transformer (AST) to extract robust spatial and acoustic features. These features are temporally modeled via Gated Recurrent Units (GRUs) and integrated through a hierarchical fusion scheme that sequentially combines cross-modal attention for alignment and gated fusion for adaptive refinement. Experimental results on the Aff-Wild2 dataset demonstrate that our proposed semantic-guided approach significantly enhances the accuracy of VA estimation, achieving competitive performance in unconstrained "in-the-wild" scenarios.
[165] MIBench: Evaluating LMMs on Multimodal Interaction
Yu Miao,Zequn Yang,Yake Wei,Ziheng Chen,Haotian Ni,Haodong Duan,Kai Chen,Di Hu
Main category: cs.CV
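TL;DR: This paper introduces MIBench, a benchmark for evaluating the multimodal interaction abilities of Large Multimodal Models (LMMs). It covers information sourcing (vision-centric and text-centric) and joint synergistic generation, evaluated hierarchically at three cognitive levels: Recognition, Understanding, and Reasoning. Experiments show that current LMMs still have clear limitations in multimodal interaction.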
Details
Motivation: Existing LMMs lack a systematic evaluation of multimodal interaction, an ability that is central to judging their true multimodal intelligence; a comprehensive benchmark covering diverse interaction patterns and cognitive levels is needed.
Method: Proposes the MIBench benchmark, which organizes 10,000+ vision-text context pairs and 32 tasks as (con_v, con_t, task) triplets, and evaluates three aspects (vision-centric information sourcing, text-centric information sourcing, and cross-modal synergistic generation) hierarchically at three cognitive levels: Recognition, Understanding, and Reasoning.
Result: Experiments show that (1) LMMs' multimodal interaction ability remains constrained and does not improve markedly with more parameters or data; (2) they are easily distracted by text, which weakens visual processing; (3) they mostly have only basic cross-modal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability.
Conclusion: MIBench exposes key shortcomings of current LMMs in multimodal interaction and provides a concrete evaluation tool and direction for building stronger, more robust multimodal models.
Abstract: In different multimodal scenarios, a model needs to integrate and utilize information across modalities in a specific way based on the demands of the task. The different ways of integrating modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs), which formulates each instance as a (con_v, con_t, task) triplet with contexts from vision and text, necessitating that LMMs employ correct forms of multimodal interaction to effectively complete the task. MIBench assesses models from three key aspects: the ability to source information from vision-centric or text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluation of state-of-the-art LMMs shows that: (1) LMMs' ability on multimodal interaction remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by textual modalities when processing vision information; (3) they mostly possess a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability.
We expect that these observations can serve as a reference for developing LMMs with more enhanced multimodal ability in the future.
[166] A Deformable Attention-Based Detection Transformer with Cross-Scale Feature Fusion for Industrial Coil Spring Inspection
Matteo Rossi,Pony Matt
Main category: cs.CV
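TL;DR: This paper proposes MSD-DETR, a multi-scale deformable detection Transformer framework for automated visual inspection of locomotive coil springs. Through structural re-parameterization, a deformable attention mechanism, and cross-scale feature fusion, it markedly improves detection accuracy while maintaining high inference speed.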
Details
Motivation: Surface defects on locomotive coil springs vary widely in morphology and scale and appear against complex industrial backgrounds, and existing methods struggle to deliver both accuracy and real-time performance.
Method: MSD-DETR combines (1) a structural re-parameterization strategy that decouples the training and inference structures; (2) a deformable attention mechanism for content-adaptive spatial sampling; and (3) a cross-scale feature fusion architecture that incorporates GSConv and VoVGSCSP.
Result: On a real-world locomotive spring dataset, the model reaches 92.4% mAP@0.5 at 98 FPS, improving on YOLOv8 by 3.1% and on RT-DETR by 2.8% at comparable speed.
Conclusion: MSD-DETR sets a new benchmark for industrial coil spring quality inspection and achieves a better balance between accuracy and efficiency.
Abstract: Automated visual inspection of locomotive coil springs presents significant challenges due to the morphological diversity of surface defects, substantial scale variations, and complex industrial backgrounds. This paper proposes MSD-DETR (Multi-Scale Deformable Detection Transformer), a novel detection framework that addresses these challenges through three key innovations: (1) a structural re-parameterization strategy that decouples training-time multi-branch topology from inference-time efficiency, enhancing feature extraction while maintaining real-time performance; (2) a deformable attention mechanism that enables content-adaptive spatial sampling, allowing dynamic focus on defect-relevant regions regardless of morphological irregularity; and (3) a cross-scale feature fusion architecture incorporating GSConv modules and VoVGSCSP blocks for effective multi-resolution information aggregation. Comprehensive experiments on a real-world locomotive coil spring dataset demonstrate that MSD-DETR achieves 92.4% mAP@0.5 at 98 FPS, outperforming state-of-the-art detectors including YOLOv8 (+3.1% mAP) and the baseline RT-DETR (+2.8% mAP) while maintaining comparable inference speed, establishing a new benchmark for industrial coil spring quality inspection.
[167] Spatial Transcriptomics as Images for Large-Scale Pretraining
Yishun Zhu,Jiaxin Qi,Jian Wang,Yuhua Zheng,Jianqiang Huang
Main category: cs.CV
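TL;DR: This paper proposes treating spatial transcriptomics (ST) data as croppable multi-channel images. By cropping fixed-size spatial patches from raw slides and defining gene subset selection rules, it builds datasets suited to large-scale pretraining, preserving spatial context while improving downstream performance.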
Details
Motivation: In existing ST pretraining, treating each spot as an independent sample discards spatial dependencies, while treating a whole slide as a single sample yields inputs that are too large and samples that are too few for effective pretraining.
Method: ST data are modeled as multi-channel images. Spatial cropping produces fixed-size patches, and gene subset selection rules along the channel dimension control input dimensionality and improve pretraining stability.
Result: The image-like dataset construction consistently outperforms conventional pretraining schemes on multiple downstream tasks; ablations confirm that both spatial cropping and channel design are necessary.
Conclusion: The work establishes a unified, practical paradigm for organizing ST data that supports large-scale pretraining and provides a basis for future spatial omics models.
Abstract: Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, preserving spatial context essential for clinical and pathological studies. With rising sequencing throughput and advancing platforms, the expanding data volumes motivate large-scale ST pretraining. However, the fundamental unit for pretraining, i.e., what constitutes a single training sample, remains ill-posed. Existing choices fall into two camps: (1) treating each spot as an independent sample, which discards spatial dependencies and collapses ST into single-cell transcriptomics; and (2) treating an entire slide as a single sample, which produces prohibitively large inputs and drastically fewer training examples, undermining effective pretraining. To address this gap, we propose treating spatial transcriptomics as croppable images. Specifically, we define a multi-channel image representation with fixed spatial size by cropping patches from raw slides, thereby preserving spatial context while substantially increasing the number of training samples. Along the channel dimension, we define gene subset selection rules to control input dimensionality and improve pretraining stability. Extensive experiments show that the proposed image-like dataset construction for ST pretraining consistently improves downstream performance, outperforming conventional pretraining schemes. Ablation studies verify that both spatial patching and channel design are necessary, establishing a unified, practical paradigm for organizing ST data and enabling large-scale pretraining.
[168] CtrlAttack: A Unified Attack on World-Model Control in Diffusion Models
Shuhan Xu,Siyuan Liang,Hongling Zheng,Yong Luo,Han Hu,Lefei Zhang,Dacheng Tao
Main category: cs.CV
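TL;DR: This paper presents the first analysis of robustness vulnerabilities in image-to-video (I2V) diffusion models at the level of state transitions and proposes CtrlAttack, a new trajectory-control attack. It models the perturbation as a low-dimensional velocity field and builds a continuous displacement field via temporal integration, so that the perturbation itself stays temporally consistent while interfering with the model's state evolution. The attack significantly disrupts the temporal consistency of generated videos in both white-box and black-box settings, revealing security risks of I2V models at the world-modeling level.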
Details
Motivation: Research on I2V models has focused on visual quality and controllability, while the robustness of the temporal state transitions they implicitly learn has not been systematically evaluated, leaving a gap in the understanding of their security risks.
Method: The trajectory-control attack CtrlAttack models the perturbation as a low-dimensional velocity field and generates a continuous displacement field via temporal integration to interfere with state evolution. The perturbation is also mapped to the observation space, which supports both white-box and black-box attacks.
Result: Under low-dimensional, strongly regularized perturbations, the attack success rate exceeds 90% in the white-box setting and 80% in the black-box setting, while changes in FID and FVD stay within 6 and 130, respectively.
Conclusion: I2V models are markedly vulnerable at the level of state dynamics. CtrlAttack exposes a new security challenge for them as implicit world models and provides a warning and a basis for future work on robustness and trustworthy video generation.
Abstract: Diffusion-based image-to-video (I2V) models increasingly exhibit world-model-like properties by implicitly capturing temporal dynamics. However, existing studies have mainly focused on visual quality and controllability, and the robustness of the state transition learned by the model remains understudied. To fill this gap, we are the first to analyze the vulnerability of I2V models, find that temporal control mechanisms constitute a new attack surface, and reveal the challenge of modeling them uniformly under different attack settings. Based on this, we propose a trajectory-control attack, called CtrlAttack, to interfere with state evolution during the generation process. Specifically, we represent the perturbation as a low-dimensional velocity field and construct a continuous displacement field via temporal integration, thereby affecting the model's state transitions while maintaining temporal consistency; meanwhile, we map the perturbation to the observation space, making the method applicable to both white-box and black-box attack settings. Experimental results show that even under low-dimensional and strongly regularized perturbation constraints, our method can still significantly disrupt temporal consistency by increasing the attack success rate (ASR) to over 90% in the white-box setting and over 80% in the black-box setting, while keeping the variation of the FID and FVD within 6 and 130, respectively, thus revealing the potential security risk of I2V models at the level of state dynamics.
[169] Vision-Language Based Expert Reporting for Painting Authentication and Defect Detection
Eman Ouda,Mohammed Salah,Arsenii O. Chulkov,Gianfranco Gargiulo,Gian Luca Tartaglia,Stefano Sfarra,Yusra Abdulrahman
Main category: cs.CV
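TL;DR: This paper proposes a fully automated thermography-vision-language model (VLM) framework for pulsed active infrared thermography (AIRT) analysis in cultural heritage conservation. It combines multi-modal thermographic processing with structured natural-language report generation to improve the consistency and reproducibility of interpretation.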
Details
Motivation: Thermography is important for assessing the authenticity and condition of artworks, but interpretation and reporting currently depend heavily on expert experience and lack a standardized, explainable framework, which limits comparison across collections and systematic use.
Method: Multi-modal AIRT analysis combines Principal Component Thermography (PCT), Thermographic Signal Reconstruction (TSR), and Pulsed Phase Thermography (PPT). The resulting anomaly masks are fused into a consensus segmentation, which is passed to a VLM that generates a structured text report covering location, thermal behavior, physical interpretation, and a statement of uncertainty.
Result: Validated on two marquetry pieces, the framework delivers stable anomaly detection and consistent structured interpretation, indicating good reproducibility and generalization across samples.
Conclusion: The fully automated VLM framework could support standardized, explainable, and systematic use of thermography in cultural heritage conservation, reducing human bias and improving cross-institution collaboration and documentation.
Abstract: Authenticity and condition assessment are central to conservation decision-making, yet interpretation and reporting of thermographic output remain largely bespoke and expert-dependent, complicating comparison across collections and limiting systematic integration into conservation documentation. Pulsed Active Infrared Thermography (AIRT) is sensitive to subsurface features such as material heterogeneity, voids, and past interventions; however, its broader adoption is constrained by artifact misinterpretation, inter-laboratory variability, and the absence of standardized, explainable reporting frameworks. Although multi-modal thermographic processing techniques are established, their integration with structured natural-language interpretation has not been explored in cultural heritage. A fully automated thermography-vision-language model (VLM) framework is presented. It combines multi-modal AIRT analysis with modality-aware textual reporting, without human intervention during inference. Thermal sequences are processed using Principal Component Thermography (PCT), Thermographic Signal Reconstruction (TSR), and Pulsed Phase Thermography (PPT), and the resulting anomaly masks are fused into a consensus segmentation that emphasizes regions supported by multiple thermal indicators while mitigating boundary artifacts. The fused evidence is provided to a VLM, which generates structured reports describing the location of the anomaly, thermal behavior, and plausible physical interpretations while explicitly acknowledging the uncertainty and diagnostic limitations.
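The idea of a consensus segmentation that "emphasizes regions supported by multiple thermal indicators" can be illustrated with a small sketch. The abstract does not state the fusion rule, so the per-pixel voting rule, the `min_votes` parameter, and the function name `fuse_masks` are assumptions made purely for illustration and are not the authors' method.

```python
def fuse_masks(masks, min_votes=2):
    """Fuse binary anomaly masks by per-pixel voting (hypothetical rule).

    masks: list of equally sized 2D lists containing 0 or 1, one per
    thermographic method (for example PCT, TSR, PPT).
    A pixel is kept only if at least `min_votes` masks flag it.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    rows, cols = len(masks[0]), len(masks[0][0])
    fused = []
    for r in range(rows):
        fused_row = []
        for c in range(cols):
            votes = sum(mask[r][c] for mask in masks)
            fused_row.append(1 if votes >= min_votes else 0)
        fused.append(fused_row)
    return fused


# Toy 2x3 masks standing in for the three processing outputs.
pct = [[0, 1, 1], [0, 1, 0]]
tsr = [[0, 1, 0], [0, 1, 1]]
ppt = [[1, 1, 0], [0, 0, 0]]
consensus = fuse_masks([pct, tsr, ppt])
```

Pixels flagged by a single method are dropped, which is one simple way to suppress isolated boundary artifacts; a weighted or soft fusion would be an equally plausible reading of the abstract.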
Evaluation on two marquetries demonstrates consistent anomaly detection and stable structured interpretations, indicating reproducibility and generalizability across samples.
[170] Draft-and-Target Sampling for Video Generation Policy
Qikang Zhang,Yingjie Lei,Wei Liu,Daochang Liu
Main category: cs.CV
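TL;DR: This paper proposes Draft-and-Target Sampling, a training-free diffusion inference paradigm. Using two-trajectory self-play denoising (a coarse draft plus fine verification), token chunking, and a progressive acceptance strategy, it markedly improves the inference efficiency of video-generation robot policies, with up to 2.1x speedup and minimal loss in success rate.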
Details
Motivation: Robot policies based on video generation models suffer from high computational cost and long inference latency, so an efficient, training-free inference optimization is needed.
Method: Draft-and-Target Sampling consists of (1) two-trajectory self-play denoising, in which draft sampling takes large steps to quickly generate a global trajectory and target sampling takes small steps to verify it; and (2) token chunking and a progressive acceptance strategy to reduce redundant computation.
Result: On three benchmarks the method achieves up to 2.1x inference speedup and improves the efficiency of state-of-the-art methods with almost no loss in task success rate.
Conclusion: Draft-and-Target Sampling is a training-free, broadly applicable inference paradigm for efficient video generation and offers a feasible route to real-time deployment of robot policies.
Abstract: Video generation models have been used as a robot policy to predict the future states of executing a task conditioned on task description and observation. Previous works ignore their high computational cost and long inference time. To address this challenge, we propose Draft-and-Target Sampling, a novel diffusion inference paradigm for video generation policy that is training-free and can improve inference efficiency. We introduce a self-play denoising approach by utilizing two complementary denoising trajectories in a single model: draft sampling takes large steps to generate a global trajectory in a fast manner, and target sampling takes small steps to verify it. To further speed up generation, we introduce token chunking and a progressive acceptance strategy to reduce redundant computation. Experiments on three benchmarks show that our method can achieve up to 2.1x speedup and improve the efficiency of current state-of-the-art methods with minimal compromise to the success rate. Our code is available.
[171] LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models
Chenglin Wang,Yucheng Zhou,Shawn Chen,Tao Wang,Kai Zhang
Main category: cs.CV
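TL;DR: This paper proposes LADR, a training-free acceleration method for discrete diffusion language models. By exploiting the spatial Markov property of images, it achieves roughly 4x inference speedup while maintaining or even improving generation quality, especially in spatial reasoning.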
Details
Motivation: Discrete diffusion language models perform well in multimodal generation, but iterative decoding causes high inference latency. Existing acceleration methods either require expensive retraining or fail to exploit the 2D spatial redundancy of visual data.
Method: Locality-Aware Dynamic Rescue (LADR) is a training-free method that (1) exploits the spatial Markov property of images; (2) focuses token recovery on the "generation frontier" (the neighborhood of already observed pixels) to maximize information gain; and (3) combines morphological neighbor identification, a risk-bounded filtering mechanism, and manifold-consistent inverse scheduling.
Result: On four text-to-image generation benchmarks, LADR achieves roughly 4x speedup over standard baselines while maintaining or improving generation fidelity, with the largest gains on spatial reasoning tasks.
Conclusion: LADR offers an efficient, high-quality, training-free inference acceleration paradigm for discrete diffusion language models with a state-of-the-art trade-off between efficiency and quality.
Abstract: Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the "generation frontier", regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that our LADR achieves an approximate 4x speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.
[172] Synthetic Melanoma Image Generation and Evaluation Using Generative Adversarial Networks
Pei-Yu Lin,Yidan Shen,Neville Mathew,Renjie Hu,Siyu Huang,Courtney M. Queen,Cameron E. West,Ana Ciurea,George Zouridakis
Main category: cs.CV
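TL;DR: This paper systematically evaluates four GAN architectures (DCGAN, StyleGAN2, and two StyleGAN3 variants) for high-resolution melanoma image synthesis. StyleGAN2 performs best in image quality, preservation of diagnostically relevant features, and mitigation of class imbalance, and its clinical usefulness is supported by dermatologist review and a downstream classifier.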
Details
Motivation: Melanoma image data are scarce and severely class-imbalanced, which limits deep learning for skin-lesion analysis. High-quality, diagnostically credible synthetic melanoma images are needed to improve model performance.
Method: Four GANs (DCGAN, StyleGAN2, StyleGAN3-T/R) are trained and tuned under a unified setup on two expert-annotated datasets, ISIC 2018 and ISIC 2020, with particular attention to R1 regularization. Image quality and diagnostic fidelity are assessed with FID, FMD, qualitative dermoscopic inspection, a frozen EfficientNet classifier, and blinded review by two dermatologists. A synthetic-image augmentation experiment then measures the effect on classification AUC.
Result: StyleGAN2 is best overall on FID (24.8/7.96), classifier recognition rate (83% classified as melanoma), dermatologists' real-versus-synthetic accuracy (66.5%, close to chance), and kappa (0.17). Adding StyleGAN2 synthetic images raises melanoma detection AUC from 0.925 to 0.945.
Conclusion: Melanoma images generated by StyleGAN2 have good perceptual quality and preserve diagnostically relevant features, can mitigate data imbalance, and provide a reliable data augmentation option for AI-assisted melanoma diagnosis.
Abstract: Melanoma is the most lethal form of skin cancer, and early detection is critical for improving patient outcomes. Although dermoscopy combined with deep learning has advanced automated skin-lesion analysis, progress is hindered by limited access to large, well-annotated datasets and by severe class imbalance, where melanoma images are substantially underrepresented. To address these challenges, we present the first systematic benchmarking study comparing four GAN architectures, namely DCGAN, StyleGAN2, and two StyleGAN3 variants (T/R), for high-resolution melanoma-specific synthesis. We train and optimize all models on two expert-annotated benchmarks (ISIC 2018 and ISIC 2020) under unified preprocessing and hyperparameter exploration, with particular attention to R1 regularization tuning. Image quality is assessed through a multi-faceted protocol combining distribution-level metrics (FID), sample-level representativeness (FMD), qualitative dermoscopic inspection, downstream classification with a frozen EfficientNet-based melanoma detector, and independent evaluation by two board-certified dermatologists. StyleGAN2 achieves the best balance of quantitative performance and perceptual quality, attaining FID scores of 24.8 (ISIC 2018) and 7.96 (ISIC 2020) at gamma=0.8.
The frozen classifier recognizes 83% of StyleGAN2-generated images as melanoma, while dermatologists distinguish synthetic from real images at only 66.5% accuracy (chance = 50%), with low inter-rater agreement (kappa = 0.17). In a controlled augmentation experiment, adding synthetic melanoma images to address class imbalance improved melanoma detection AUC from 0.925 to 0.945 on a held-out real-image test set. These findings demonstrate that StyleGAN2-generated melanoma images preserve diagnostically relevant features and can provide a measurable benefit for mitigating class imbalance in melanoma-focused machine learning pipelines.
[173] ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning
Eric Nazarenus,Chuqiao Li,Yannan He,Xianghui Xie,Jan Eric Lenssen,Gerard Pons-Moll
Main category: cs.CV
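TL;DR: ActionPlan is a unified motion diffusion framework. It introduces a per-frame action plan (frame-level text latents) as semantic anchors and denoises with combined semantic and motion cues, so that a single model supports both real-time streaming generation and high-quality offline generation, as well as zero-shot motion editing and in-betweening.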
Details
Motivation: Existing methods struggle to offer both real-time streaming generation and high-quality offline generation. A unified framework is needed to improve efficiency and quality and to support flexible editing.
Method: A per-frame action plan (text latents) serves as dense semantic anchors, and latent-specific diffusion steps allow each motion latent to be denoised independently and sampled in flexible orders.
Result: Real-time streaming generation is 5.25x faster and FID improves by 18%. The model also supports zero-shot motion editing and in-betweening.
Conclusion: ActionPlan combines real-time capability and high-quality generation in one model and extends to zero-shot editing, supporting the effectiveness and generality of the action planning mechanism.
Abstract: We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming with high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues. To support this structured workflow, we design latent-specific diffusion steps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation. The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming is 5.25x faster while also achieving 18% motion quality improvement over the best previous method in terms of FID.
[174] LibraGen: Playing a Balance Game in Subject-Driven Video Generation
Jiahao Zhu,Shanshan Lao,Lijie Liu,Gen Li,Tianhao Qi,Wei Han,Bingchuan Li,Fangfang Liu,Zhuowei Chen,Tianxiang Ma,Qian HE,Yi Zhou,Xiaohua Xie
Main category: cs.CV
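TL;DR: LibraGen is a new framework for subject-to-video (S2V) generation. Treating data quality as the "fulcrum" and following the philosophy of "Raising the Fulcrum, Tuning to Balance", it balances the intrinsic strengths of video generation foundation models (VGFMs), such as motion coherence, visual aesthetics, and prompt alignment, against the newly added S2V capability. Its core components are hybrid data filtering, a Tune-to-Balance fine-tuning paradigm, dual DPO optimization (Consis-DPO and Real-Fake DPO), and a time-dependent dynamic classifier-free guidance scheme. With only thousand-scale training data it outperforms open-source and commercial S2V models.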
Details
Motivation: Existing S2V methods often sacrifice the intrinsic priors of the VGFM (motion coherence, aesthetics, prompt alignment) to strengthen customization, and lack a systematic balance between the two.
Method: LibraGen (1) treats data quality as the "fulcrum" and builds a hybrid automated and manual filtering pipeline; (2) uses a Tune-to-Balance post-training paradigm that incorporates cross-pair and in-pair supervised data and applies model merging; (3) designs and merges two DPO pipelines, Consis-DPO and Real-Fake DPO; and (4) introduces time-dependent dynamic classifier-free guidance for inference-time control.
Result: Using only thousand-scale training data, LibraGen outperforms both open-source and commercial S2V models.
Conclusion: Balancing the intrinsic priors of the VGFM with the new S2V capability is key. Through quality-driven data construction, staged joint optimization, and dynamic inference control, LibraGen achieves efficient, controllable, high-quality subject-to-video generation.
Abstract: With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of "Raising the Fulcrum, Tuning to Balance," we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM's native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control.
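For readers unfamiliar with classifier-free guidance, the standard combination rule is eps = eps_uncond + w * (eps_cond - eps_uncond); a time-dependent scheme lets the weight w vary with the denoising step. The abstract does not give LibraGen's actual schedule, so the linear schedule, the default values `w_start` and `w_end`, and both function names below are hypothetical, shown only to make the idea concrete.

```python
def guidance_scale(t, w_start=7.5, w_end=2.0):
    """Hypothetical linear guidance schedule.

    t is the normalized sampling progress in [0, 1], where 0 is the first
    denoising step and 1 is the last. The linear form and the default
    weights are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return w_start + (w_end - w_start) * t


def guided_prediction(eps_uncond, eps_cond, t):
    """Standard classifier-free guidance with a time-dependent weight."""
    w = guidance_scale(t)
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions with two components.
early = guided_prediction([0.0, 1.0], [1.0, 1.0], t=0.0)
late = guided_prediction([0.0, 1.0], [1.0, 1.0], t=1.0)
```

With these assumed defaults the conditional signal is amplified strongly at the start of sampling and less at the end; other schedules (increasing, piecewise) fit the same interface.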
Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.
[175] MIRAGE: Model-agnostic Industrial Realistic Anomaly Generation and Evaluation for Visual Anomaly Detection
Jinwei Hu,Francesco Borsatti,Arianna Stropeni,Davide Dalle Pezze,Manuel Barusco,Gian Antonio Susto
Main category: cs.CV
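TL;DR: This paper proposes MIRAGE, a framework that automatically generates realistic industrial anomaly images and pixel-level masks without real defect samples and without training. It uses large-model APIs, VLM-generated prompts, and CLIP-based quality filtering, and adds a training-free dual-branch change detection module for mask generation. It is validated on MVTec AD and VisA for both anomaly segmentation and image quality, and more than 13,000 image-mask pairs are released.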
Details
Motivation: Industrial visual anomaly detection methods are mostly trained on normal samples only, yet a small amount of anomalous data can improve performance substantially. Current anomaly generation methods require real defect samples, depend on expensive hardware, or produce results that lack realism.
Method: The MIRAGE pipeline (1) calls any generative model (for example Gemini 2.5 Flash Image) as a black box; (2) uses a vision-language model (VLM) to generate defect prompts automatically; (3) applies CLIP-based quality filtering to ensure image-text alignment; and (4) uses a training-free dual-branch semantic change detection module that fuses Grounding DINO (text-conditioned) and YOLOv26-Seg (structural detail) features to produce pixel-level masks.
Result: On MVTec AD and VisA, images generated by MIRAGE perform well on downstream anomaly segmentation and score highly on visual quality metrics (IS, IC-LPIPS), and a human perceptual study (31 participants, 1,550 pairwise votes) confirms their realism. A large-scale benchmark dataset of more than 13,000 image-mask pairs is released.
Conclusion: MIRAGE provides a scalable, accessible paradigm for anomaly generation and evaluation in industrial defect detection that needs no real defect data and supports practical adoption of unsupervised and weakly supervised VAD.
Abstract: Industrial visual anomaly detection (VAD) methods are typically trained on normal samples only, yet performance improves substantially when even limited anomalous data is available. Existing anomaly generation approaches either require real anomalous examples, demand expensive hardware, or produce synthetic defects that lack realism. We present MIRAGE (Model-agnostic Industrial Realistic Anomaly Generation and Evaluation), a fully automated pipeline for realistic anomalous image generation and pixel-level mask creation that requires no training and no anomalous images. Our pipeline accesses any generative model as a black box via API calls, uses a VLM for automatic defect prompt generation, and includes a CLIP-based quality filter to retain only well-aligned generated images. For mask generation at scale, we introduce a lightweight, training-free dual-branch semantic change detection module combining text-conditioned Grounding DINO features with fine-grained YOLOv26-Seg structural features. We benchmark four generation methods using Gemini 2.5 Flash Image (Nano Banana) as the generative backbone, evaluating performance on MVTec AD and VisA across two distinct tasks: (i) downstream anomaly segmentation and (ii) visual quality of the generated images, assessed via standard metrics (IS, IC-LPIPS) and a human perceptual study involving 31 participants and 1,550 pairwise votes.
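A CLIP-based quality filter of the kind mentioned in the abstract usually reduces to thresholding the cosine similarity between a prompt embedding and each image embedding. The sketch below assumes the embeddings have already been computed by some CLIP-style encoder; the `threshold` value and the function names are hypothetical and are not taken from the MIRAGE code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length embedding")
    return dot / (norm_u * norm_v)


def filter_by_alignment(prompt_emb, image_embs, threshold=0.25):
    """Return indices of images whose embedding aligns with the prompt.

    `threshold` is an illustrative assumption; a real pipeline would tune it.
    """
    return [
        i for i, emb in enumerate(image_embs)
        if cosine(prompt_emb, emb) >= threshold
    ]


# Toy 2D embeddings: image 0 matches, image 1 is orthogonal, image 2 is partial.
kept = filter_by_alignment([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Only the indices of well-aligned images are returned, so the caller can drop the rest before mask generation.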
The results demonstrate that MIRAGE offers a scalable, accessible foundation for anomaly-aware industrial inspection that requires no real defect data. As a final contribution, we publicly release a large-scale dataset comprising 500 image-mask pairs per category for every MVTec AD and VisA class, over 13,000 pairs in total, alongside all generation prompts and pipeline code.
[176] A Systematic Benchmark of GAN Architectures for MRI-to-CT Synthesis
Alessandro Pesci,Valerio Guarrasi,Marco Alì,Isabella Castiglioni,Paolo Soda
Main category: cs.CV
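TL;DR: This paper presents a unified benchmark of 10 GAN architectures for MRI-to-CT translation on the SynthRAD2025 dataset, covering anatomical regions, metric dimensions, and computational complexity. Supervised paired models perform better, Pix2Pix is the best overall, and the code is released for reproducibility.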
Details
Motivation: Existing MRI-to-CT translation research lacks systematic, fair comparison across models. Such a comparison would support MRI-only clinical workflows and reduce exposure to ionizing radiation.
Method: Ten GAN architectures (paired and unpaired) are trained and evaluated on three anatomical regions of SynthRAD2025 (abdomen, thorax, head-and-neck) with unified preprocessing, optimization settings, and validation protocol. Evaluation uses metrics for voxel-wise accuracy, structural fidelity, perceptual quality, and distribution-level realism, plus an analysis of computational complexity.
Result: Supervised paired models consistently outperform unpaired ones. Pix2Pix is the most balanced across regions with the best quality-to-complexity trade-off. Multi-region joint training improves structural robustness, while single-region training improves voxel-level accuracy.
Conclusion: The benchmark offers quantitative and computational guidance for model selection in MRI-only radiotherapy and establishes a reproducible framework for future comparative studies.
Abstract: The translation from magnetic resonance imaging (MRI) to computed tomography (CT) has been proposed as an effective solution to facilitate MRI-only clinical workflows while limiting exposure to ionizing radiation. Although numerous Generative Adversarial Network (GAN) architectures have been proposed for MRI-to-CT translation, systematic and fair comparisons across heterogeneous models remain limited. We present a comprehensive benchmark of ten GAN architectures evaluated on the SynthRAD2025 dataset across three anatomical districts (abdomen, thorax, head-and-neck). All models were trained under a unified validation protocol with identical preprocessing and optimization settings. Performance was assessed using complementary metrics capturing voxel-wise accuracy, structural fidelity, perceptual quality, and distribution-level realism, alongside an analysis of computational complexity. Supervised Paired models consistently outperformed Unpaired approaches, confirming the importance of voxel-wise supervision. Pix2Pix achieved the most balanced performance across districts while maintaining a favorable quality-to-complexity trade-off. Multi-district training improved structural robustness, whereas intra-district training maximized voxel-wise fidelity. This benchmark provides quantitative and computational guidance for model selection in MRI-only radiotherapy workflows and establishes a reproducible framework for future comparative studies.
To ensure the reproducibility of our experiments we make our code public, together with the overall results, at the following link: https://github.com/arco-group/MRI_TO_CT.git
[177] Eleven Primitives and Three Gates: The Universal Structure of Computational Imaging
Chengshuai Yang,Xin Yuan
Main category: cs.CV
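TL;DR: This paper proposes a universal structural theory of computational imaging systems. It proves that every imaging forward model decomposes into a directed acyclic graph over only 11 physically typed primitives (Finite Primitive Basis Theorem), and that every reconstruction failure can be attributed to three independent root causes: information deficiency, carrier noise, and operator mismatch (Triad Decomposition). Both results are validated across 12 modalities and 5 carrier families.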
Details
Motivation: Computational imaging systems are numerous and complex to model, and there is no unified framework for structural analysis. Their common structure needs to be identified to guide design, diagnosis, and correction.
Method: Through theoretical modeling and mathematical proof, the paper proposes the Finite Primitive Basis Theorem and the Triad Decomposition, and validates them experimentally on 12 imaging modalities and all five carrier families.
Result: The work establishes a universal grammar for computational imaging consisting of 11 physical primitives and 3 failure gates, and achieves reconstruction gains of +0.8 to +13.9 dB on deployed instruments.
Conclusion: The work establishes the first universal grammar for designing, diagnosing, and correcting computational imaging systems, combining theoretical completeness with engineering practicality.
Abstract: Computational imaging systems -- from coded-aperture cameras to cryo-electron microscopes -- span five carrier families yet share a hidden structural simplicity. We prove that every imaging forward model decomposes into a directed acyclic graph over exactly 11 physically typed primitives (Finite Primitive Basis Theorem) -- a sufficient and minimal basis that provides a compositional language for designing any imaging modality. We further prove that every reconstruction failure has exactly three independent root causes: information deficiency, carrier noise, and operator mismatch (Triad Decomposition). The three gates map to the system lifecycle: Gates 1 and 2 guide design (sampling geometry, carrier selection); Gate 3 governs deployment-stage calibration and drift correction. Validation across 12 modalities and all five carrier families confirms both results, with +0.8 to +13.9 dB recovery on deployed instruments. Together, the 11 primitives and 3 gates establish the first universal grammar for designing, diagnosing, and correcting computational imaging systems.
[178] Hide and Seek: Investigating Redundancy in Earth Observation Imagery
Tasos Papazafeiropoulos,Nikolaos Ioannis Bountos,Nikolas Papadopoulos,Ioannis Papoutsis
Main category: cs.CV
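TL;DR: This paper argues that Earth Observation (EO) data exhibit multidimensional redundancy (spectral, temporal, spatial, and semantic) and systematically verifies that it is pervasive and useful. Exploiting this redundancy reaches about 98.5% of baseline performance at roughly one quarter of the compute, consistently across tasks, regions, sensors, and model architectures.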
Details
Motivation: Current machine learning research for Earth Observation overlooks the inherent multidimensional redundancy of EO data, although this property strongly affects modeling efficiency and generalization.
Method: A systematic EO-specific study analyzes the existence, consistency, and practical impact of redundancy along the spectral, temporal, spatial, and semantic dimensions, and checks its robustness across tasks, geographic locations, sensors, ground sampling distances, and network architectures.
Result: Redundancy in EO data is substantial and pervasive. Exploiting it reduces training and inference compute by about 4x (GFLOPs) while keeping about 98.5% of baseline performance, with highly consistent results across settings.
Conclusion: Multidimensional redundancy is a structural property of EO data, not an experimental artifact. The finding provides a theoretical basis and a practical path toward more efficient, scalable, and accessible large-scale EO models.
Abstract: The growing availability of Earth Observation (EO) data and recent advances in Computer Vision have driven rapid progress in machine learning for EO, producing domain-specific models at ever-increasing scales. Yet this progress risks overlooking fundamental properties of EO data that distinguish it from other domains. We argue that EO data exhibit a multidimensional redundancy (spectral, temporal, spatial, and semantic) which has a more pronounced impact on the domain and its applications than what current literature reflects. To validate this hypothesis, we conduct a systematic domain-specific investigation examining the existence, consistency, and practical implications of this phenomenon across key dimensions of EO variability. Our findings confirm that redundancy in EO data is both substantial and pervasive: exploiting it yields comparable performance (≈98.5% of baseline) at a fraction of the computational cost (≈4× fewer GFLOPs), at both training and inference. Crucially, these gains are consistent across tasks, geospatial locations, sensors, ground sampling distances, and architectural designs, suggesting that multi-faceted redundancy is a structural property of EO data rather than an artifact of specific experimental choices. These results lay the groundwork for more efficient, scalable, and accessible large-scale EO models.
[179] SAIF: A Stability-Aware Inference Framework for Medical Image Segmentation with Segment Anything Model
Ke Wu,Shiqi Chen,Yiheng Zhong,Hengxian Liu,Yingxue Su,Yifang Wang,Junhao Jin,Guangyu Ren
Main category: cs.CV
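TL;DR: This paper proposes SAIF, a training-free, plug-and-play inference framework that improves the inference stability and robustness of SAM in medical image segmentation by modeling the uncertainty of prompt boxes and thresholds.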
Details
Motivation: When deployed as a frozen backbone, SAM is unstable at inference time, mainly because of localization errors in bounding-box prompts and decision uncertainty from fixed-threshold binarization. This leads to high prediction variance and poor reliability near boundaries.
Method: SAIF builds a joint uncertainty space from structured bounding-box perturbations and threshold variations, evaluates each hypothesis by decision stability and boundary consistency, and introduces a stability-consistency score to filter unstable candidates and perform stability-weighted fusion in probability space.
Result: On Synapse, CVC-ClinicDB, Kvasir-SEG, and CVC-300, SAIF consistently improves segmentation accuracy and robustness and reaches state-of-the-art performance without retraining or model modification.
Conclusion: SAIF is a lightweight, general, training-free inference enhancement that mitigates the instability of SAM caused by prompt and threshold sensitivity in medical image segmentation and improves reliability for clinical deployment.
Abstract: The Segment Anything Model (SAM) enables scalable medical image segmentation but suffers from inference-time instability when deployed as a frozen backbone. In practice, bounding-box prompts often contain localization errors, and fixed threshold binarization introduces additional decision uncertainty. These factors jointly cause high prediction variance, especially near object boundaries, degrading reliability. We propose the Stability-Aware Inference Framework (SAIF), a training-free and plug-and-play inference framework that improves robustness by explicitly modeling prompt and threshold uncertainty. SAIF constructs a joint uncertainty space via structured box perturbations and threshold variations, evaluates each hypothesis using decision stability and boundary consistency, and introduces a stability-consistency score to filter unstable candidates and perform stability-weighted fusion in probability space. Experiments on Synapse, CVC-ClinicDB, Kvasir-SEG, and CVC-300 demonstrate that SAIF consistently improves segmentation accuracy and robustness, achieving state-of-the-art performance without retraining or architectural modification. Our anonymous code is released at https://anonymous.4open.science/r/SAIF.
[180] NumColor: Precise Numeric Color Control in Text-to-Image Generation
Muhammad Atif Butt,Diego Hernandez,Alexandra Gomez-Villa,Kai Wang,Javier Vazquez-Corral,Joost Van De Weijer
Main category: cs.CV
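TL;DR: This paper proposes NumColor to address the inability of text-to-image diffusion models to interpret numerical color codes such as hex or RGB values. Using a Color Token Aggregator and a ColorBook module, it geometrically aligns colors in CIE Lab space with text embeddings, transfers zero-shot across multiple models, and markedly improves color accuracy.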
Details
Motivation: In existing text-to-image diffusion models, subword tokenization breaks color codes (such as #FF5733) into meaningless tokens, so the text encoder cannot map them to a coherent color representation, which makes precise control of numerical colors in generated images difficult.
Method: The NumColor framework has two core components: (1) a Color Token Aggregator that detects color specifications regardless of tokenization; and (2) a ColorBook of 6,707 learnable embeddings tied to the perceptually uniform CIE Lab color space. A directional alignment loss and an interpolation consistency loss enforce geometric correspondence between Lab space and embedding space. A synthetic dataset, NumColor-Data (500K rendered images with explicit pixel-level color annotation), is built for training.
Result: Trained on FLUX, NumColor transfers zero-shot to SD3, SD3.5, PixArt-α, and PixArt-Σ, improving numerical color accuracy by 4-9x and color harmony scores on GenColorBench by 10-30x.
Conclusion: NumColor bridges the semantic gap between numerical color descriptions and the text encoding of diffusion models, generalizes across architectures, and offers a general, efficient way to control color precisely in image generation without model-specific fine-tuning.
Abstract: Text-to-image diffusion models excel at generating images from natural language descriptions, yet fail to interpret numerical colors such as hex codes (#FF5733) and RGB values (rgb(255,87,51)). This limitation stems from subword tokenization, which fragments color codes into semantically meaningless tokens that text encoders cannot map to coherent color representations. We present NumColor, which enables precise numerical color control across multiple diffusion architectures. NumColor comprises two components: a Color Token Aggregator that detects color specifications regardless of tokenization, and a ColorBook containing 6,707 learnable embeddings that map colors to the embedding space of the text encoder in the perceptually uniform CIE Lab space. We introduce two auxiliary losses, directional alignment and interpolation consistency, to enforce geometric correspondence between Lab and embedding spaces, enabling smooth color interpolation. To train the ColorBook, we construct NumColor-Data, a synthetic dataset of 500K rendered images with unambiguous color-to-pixel correspondence, eliminating the annotation ambiguity inherent in photographic datasets. Although trained solely on FLUX, NumColor transfers zero-shot to SD3, SD3.5, PixArt-α, and PixArt-Σ without model-specific adaptation.
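To make the link between a hex code and the CIE Lab space concrete, the following sketch converts a 6-digit hex color to Lab using the standard sRGB to XYZ to Lab formulas with a D65 white point. It only illustrates the color-space conversion; the function name `hex_to_lab` is an assumption, and nothing here reproduces NumColor's ColorBook or losses.

```python
def hex_to_lab(code):
    """Convert a hex color such as '#FF5733' to CIE Lab (D65 white point)."""
    code = code.lstrip("#")
    if len(code) != 6:
        raise ValueError("expected a 6-digit hex color")
    rgb = [int(code[i:i + 2], 16) / 255.0 for i in (0, 2, 4)]

    # Undo the sRGB transfer function to obtain linear RGB.
    linear = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in rgb
    ]
    r, g, b = linear

    # Linear sRGB to CIE XYZ (D65).
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    # Normalize by the D65 reference white and apply the Lab nonlinearity.
    xn, yn, zn = 0.95047, 1.0, 1.08883
    delta = 6.0 / 29.0

    def f(t):
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta * delta) + 4.0 / 29.0

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


white = hex_to_lab("#FFFFFF")
black = hex_to_lab("#000000")
orange = hex_to_lab("#FF5733")
```

Because Lab is approximately perceptually uniform, Euclidean distance between two such triples is a rough proxy for perceived color difference, which is presumably why it is a convenient target space for learned color embeddings.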
NumColor improves numerical color accuracy by 4-9x across five models, while simultaneously improving color harmony scores by 10-30x on the GenColorBench benchmark.
[181] Semantic Aware Feature Extraction for Enhanced 3D Reconstruction
Ronald Nap,Andy Xiao
Main category: cs.CV
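TL;DR: This paper proposes a semantic-aware feature extraction framework that uses multi-task learning to jointly train keypoint detection, keypoint description, and semantic segmentation and, combined with a deep matching module, achieves multi-level 3D reconstruction with semantic annotation and elevation estimation from monocular fisheye camera input.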
Details
Motivation: Existing deep-learning feature matching methods focus mainly on geometric attributes and neglect higher-level semantic information, which limits matching consistency and 3D reconstruction accuracy in complex scenes such as multi-floor parking structures. Method: A multi-task learning framework jointly optimizes keypoint detection, description, and semantic segmentation; a deep matching module improves correspondences; the system is trained and tested end to end on monocular fisheye camera data from a multi-floor parking environment. Result: Produces semantically annotated 3D point clouds with better structural detail and elevation information, supporting multi-level semantic 3D reconstruction with altitude estimation. Conclusion: Joint training with semantic cues makes feature matching more consistent and improves semantic 3D reconstruction in complex scenes. Abstract: Feature matching is a fundamental problem in computer vision with wide-ranging applications, including simultaneous localization and mapping (SLAM), image stitching, and 3D reconstruction. While recent advances in deep learning have improved keypoint detection and description, most approaches focus primarily on geometric attributes and often neglect higher-level semantic information. This work proposes a semantic-aware feature extraction framework that employs multi-task learning to jointly train keypoint detection, keypoint description, and semantic segmentation. The method is benchmarked against standard feature matching techniques and evaluated in the context of 3D reconstruction. To enhance feature correspondence, a deep matching module is integrated. The system is tested using input from a single monocular fisheye camera mounted on a vehicle and evaluated within a multi-floor parking structure. The proposed approach supports semantic 3D reconstruction with altitude estimation, capturing elevation changes and enabling multi-level mapping. Experimental results demonstrate that the method produces semantically annotated 3D point clouds with improved structural detail and elevation information, underscoring the effectiveness of joint training with semantic cues for more consistent feature matching and enhanced 3D reconstruction.
[182] Performance evaluation of deep learning models for image analysis: considerations for visual control and statistical metrics
Christof A. Bertram,Jonas Ammeling,Alexander Bartel,Gillian Beamer,Marc Aubreville
Main category: cs.CV
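TL;DR: This paper reviews how the performance of deep learning-based automated image analysis (DL-AIA) models is assessed in veterinary pathology, compares visual and statistical evaluation, and stresses that combining the two reveals model performance and sources of error more fully.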
Details
Motivation: To ensure that DL-AIA is safe and reliable in practical applications such as diagnostic pathology, toxicologic pathology, and research, objective and rigorous assessment of generalization performance and robustness is needed. Method: Literature review and methodological analysis: identifies and compares the two main performance-assessment practices in veterinary pathology (exclusive visual control vs. statistical control) and systematically discusses the key elements of statistical evaluation (metric selection, test set construction, label quality, resampling, comparison of multiple models, stability evaluation, etc.). Result: Visual evaluation is easy to perform but subjective; statistical evaluation is more objective but depends on high-quality data and a rigorous workflow; the two are complementary. Conclusion: Visual and statistical evaluation should be combined to fully understand DL model performance and the root causes of error, improving trustworthiness and usefulness in clinical and research veterinary pathology. Abstract: Deep learning-based automated image analysis (DL-AIA) has been shown to outperform trained pathologists in tasks related to feature quantification. Given these capabilities, the use of DL-AIA tools is currently extending from proof-of-principle studies to routine applications such as patient samples (diagnostic pathology), regulatory safety assessment (toxicologic pathology), and recurrent research tasks. To ensure that DL-AIA applications are safe and reliable, it is critical to conduct a thorough and objective generalization performance assessment (i.e., the ability of the algorithm to accurately predict patterns of interest) and possibly evaluate model robustness (i.e., the algorithm's capacity to maintain predictive accuracy on images from different sources). In this article, we review the practices for performance assessment in veterinary pathology publications and identify two approaches: 1) Exclusive visual performance control (i.e., eyeballing of algorithmic predictions) plus validation of the model's application utilizing secondary performance indices, and 2) Statistical performance control (alongside the other methods), which requires creating a dataset and separating a hold-out test set prior to model training. This article compares the strengths and weaknesses of statistical and visual performance control methods. Furthermore, we discuss relevant considerations for rigorous statistical performance evaluation including metric selection, test dataset image composition, ground truth label quality, resampling methods such as bootstrapping, statistical comparison of multiple models, and evaluation of model stability.
We conclude that visual and statistical evaluation have complementary strengths and that combining both provides the greatest insight into a DL model's performance and sources of error.
[183] DiveUp: Learning Feature Upsampling from Diverse Vision Foundation Models
Xiaoqiong Liu,Heng Fan
Main category: cs.CV
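TL;DR: This paper proposes DiveUp, a framework that guides feature upsampling with the structural consensus of multiple vision foundation models (VFMs), avoiding the location misalignment and high-norm artifacts that come from relying on a single model. It introduces a universal relational feature representation based on a local center-of-mass (COM) field and a spikiness-aware selection strategy to align features across models and select reliable experts. The method is encoder-agnostic, requires no per-model retraining, and improves performance on a range of dense prediction tasks.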
Details
Motivation: Existing feature upsampling methods rely on high-resolution features from a single VFM for self-reconstruction, so they inherit the source model's location misalignment and high-norm artifacts and generalize poorly. Method: Proposes the DiveUp framework: 1) multiple VFMs serve as a panel of experts whose structural consensus regularizes upsampling; 2) a universal relational feature representation based on a local center-of-mass (COM) field extracts intrinsic geometric structure and supports cross-model interaction; 3) a spikiness-aware selection strategy assesses the spatial reliability of each VFM region by region and aggregates guidance only from the most reliable expert. Result: DiveUp reaches state-of-the-art performance on several downstream dense prediction tasks (such as semantic segmentation and instance segmentation); it is encoder-agnostic, so a single jointly trained model can upsample features from different VFMs without per-model fine-tuning. Conclusion: Multi-VFM relational guidance is an effective paradigm for more robust and generalizable feature upsampling; through structural-consensus modeling, a universal relational representation, and reliability-aware fusion, DiveUp overcomes the basic limitation of single-model dependency and offers a new approach to pixel-level understanding with VFMs. Abstract: Recently, feature upsampling has gained increasing attention owing to its effectiveness in enhancing vision foundation models (VFMs) for pixel-level understanding tasks. Existing methods typically rely on high-resolution features from the same foundation model to achieve upsampling via self-reconstruction. However, relying solely on intra-model features forces the upsampler to overfit to the source model's inherent location misalignment and high-norm artifacts. To address this fundamental limitation, we propose DiveUp, a novel framework that breaks away from single-model dependency by introducing multi-VFM relational guidance. Instead of naive feature fusion, DiveUp leverages diverse VFMs as a panel of experts, utilizing their structural consensus to regularize the upsampler's learning process, effectively preventing the propagation of inaccurate spatial structures from the source model. To reconcile the unaligned feature spaces across different VFMs, we propose a universal relational feature representation, formulated as a local center-of-mass (COM) field, that extracts intrinsic geometric structures, enabling seamless cross-model interaction. Furthermore, we introduce a spikiness-aware selection strategy that evaluates the spatial reliability of each VFM, effectively filtering out high-norm artifacts to aggregate guidance from only the most reliable expert at each local region. DiveUp is a unified, encoder-agnostic framework; a jointly trained model can universally upsample features from diverse VFMs without requiring per-model retraining.
Extensive experiments demonstrate that DiveUp achieves state-of-the-art performance across various downstream dense prediction tasks, validating the efficacy of multi-expert relational guidance. Our code and models are available at: https://github.com/Xiaoqiong-Liu/DiveUp
[184] Analytical Logit Scaling for High-Resolution Sea Ice Topology Retrieval from Weakly Labeled SAR Imagery
Reda Elwaradi,Julien Gimenez,Stéphane Hordoir,Mehdi Ait Hamma,Adrien Chan-Hon-Tong,Flora Weissgerber
Main category: cs.CV
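TL;DR: This paper proposes a weakly supervised deep learning method that fuses Sentinel-1 SAR and AMSR-2 radiometer data, combines a U-Net with a region-based loss, and introduces analytical logit scaling to achieve fine sea ice segmentation at 40-meter resolution without pixel-level annotations.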
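The entry describes the logit scaling step only as deriving a temperature and a bias from the 2% and 98% percentiles of each scene. One plausible reading is sketched below; the midpoint bias, the rule that maps the two percentiles to probabilities 0.02 and 0.98, and the name `analytical_logit_scaling` are assumptions made for illustration, not the authors' formula.

```python
import numpy as np

def analytical_logit_scaling(logits, lo_pct=2.0, hi_pct=98.0, target=0.98):
    """Rescale one scene's logits so its percentiles map to confident probabilities.

    Assumed rule: the bias is the midpoint of the two percentiles, and the
    temperature stretches them to +/- logit(target).
    """
    lo, hi = np.percentile(logits, [lo_pct, hi_pct])
    bias = 0.5 * (lo + hi)
    target_logit = np.log(target / (1.0 - target))
    temperature = max((hi - lo) / (2.0 * target_logit), 1e-8)
    # Clip to keep exp() finite for extreme outliers.
    scaled = np.clip((logits - bias) / temperature, -50.0, 50.0)
    probs = 1.0 / (1.0 + np.exp(-scaled))
    return probs, temperature, bias
```

Because the statistics are computed per scene, an under-confident map whose logits cluster near zero is stretched toward 0 and 1 without retraining the network.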
Details
Motivation: Operational sea ice charts provide only coarse region-level polygons (weak labels), so automated segmentation models struggle to reach pixel-level accuracy and predict under-confident, blurred concentration maps. Method: Builds a weakly supervised U-Net pipeline that fuses Sentinel-1 SAR and AMSR-2 data and is trained with a region-level loss; proposes Analytical Logit Scaling, which computes temperature and bias dynamically from the 2% and 98% percentiles of each scene's latent space to obtain a physically motivated post-inference binarization. Result: Reaches 78% accuracy on highly fragmented summer sea ice scenes and reveals sea ice fractures (leads) at 40-meter resolution while preserving regional macroscopic concentration. Conclusion: The method bridges the gap between weakly supervised learning and high-resolution physical segmentation, extracting fine-grained sea ice structure and keeping macroscopic concentrations consistent without manual pixel-level annotation. Abstract: High-resolution sea ice mapping using Synthetic Aperture Radar (SAR) is crucial for Arctic navigation and climate monitoring. However, operational ice charts provide only coarse, region-level polygons (weak labels), forcing automated segmentation models to struggle with pixel-level accuracy and often yielding under-confident, blurred concentration maps. In this paper, we propose a weakly supervised deep learning pipeline that fuses Sentinel-1 SAR and AMSR-2 radiometry data using a U-Net architecture trained with a region-based loss. To overcome the severe under-confidence caused by weak labels, we introduce an Analytical Logit Scaling method applied post-inference. By dynamically calculating the temperature and bias based on the latent space percentiles (2% and 98%) of each scene, we force a physical binarization of the predictions. This adaptive scaling acts as a topological extractor, successfully revealing fine-grained sea ice fractures (leads) at a 40-meter resolution without requiring any manual pixel-level annotations. Our approach not only resolves local topology but also perfectly preserves regional macroscopic concentrations, achieving a 78% accuracy on highly fragmented summer scenes, thereby bridging the gap between weakly supervised learning and high-resolution physical segmentation.
[185] LingoMotion: An Interpretable and Unambiguous Symbolic Representation for Human Motion
Yao Zhang,Zhuchenyang Liu,Yu Xiao
Main category: cs.CV
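TL;DR: This paper proposes LingoMotion, a hierarchical symbolic representation of human motion inspired by natural language: it defines a motion alphabet from joint angles and builds morphology and syntax rules on top, giving an interpretable, unambiguous description of both simple actions and complex activities.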
Details
Motivation: Existing motion representations (such as MotionGPT) are mostly black-box latent vectors with poor interpretability, and their reliance on joint positions can cause ambiguity. Method: Proposes the LingoMotion motion language framework, comprising a motion alphabet based on joint angles, morphology rules that describe simple actions and their attributes (such as speed and scale), and syntax rules that combine words and phrases to express complex activities; the motion alphabet is implemented and evaluated on the large-scale Motion-X dataset. Result: Preliminary experiments indicate that LingoMotion represents human motion with high fidelity. Conclusion: LingoMotion offers an interpretable, unambiguous, hierarchical paradigm for symbolic motion representation and a new basis for motion understanding, generation, and interaction. Abstract: Existing representations for human motion, such as MotionGPT, often operate as black-box latent vectors with limited interpretability and build on joint positions which can cause ambiguity. Inspired by the hierarchical structure of natural languages - from letters to words, phrases, and sentences - we propose LingoMotion, a motion language that facilitates interpretable and unambiguous symbolic representation for both simple and complex human motion. In this paper, we introduce the concept design of LingoMotion, including the definition of a motion alphabet based on joint angles, the morphology for forming words and phrases to describe simple actions like walking and their attributes like speed and scale, as well as the syntax for describing more complex human activities with sequences of words and phrases. The preliminary results, including the implementation and evaluation of the motion alphabet using a large-scale motion dataset Motion-X, demonstrate the high fidelity of motion representation.
[186] Opportunistic Cardiac Health Assessment: Estimating Phenotypes from Localizer MRI through Multi-Modal Representations
Busra Nur Zeybek,Özgün Turgut,Yundi Zhang,Jiazhen Pan,Robert Graf,Sophie Starck,Daniel Rueckert,Sevgi Gokce Kafali
Main category: cs.CV
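TL;DR: This paper proposes C-TRIP, a multi-modal framework that fuses localizer MRI, ECG, and tabular data and uses cheap, readily available MR localizer images in place of costly high-resolution CMR to predict cardiac phenotypes (CPs) accurately, improving clinical accessibility.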
Details
Motivation: Assessing cardiac phenotypes (CPs) currently depends on cine CMR, which is costly and requires high spatio-temporal resolution. The rapid, coarse localizer images acquired in every routine MR examination are discarded, yet despite low quality and no temporal information they carry structural information; ECG and patient tabular data provide complementary cues about cardiac health, so these sources should be integrated. Method: Proposes C-TRIP, a three-stage multi-modal framework: 1) train uni-modal encoders separately for localizer MRI, ECG signals, and tabular metadata; 2) fuse the pre-trained encoders to align and unify the multi-modal latent space; 3) predict CPs from the fused representation using localizer MRI alone. Result: C-TRIP predicts functional CPs accurately and shows high correlations for structural CPs; because inference relies only on rapid, low-cost localizer MRI, it makes CP assessment more accessible clinically. Conclusion: C-TRIP shows that localizer MRI otherwise ignored in routine examinations, combined with ECG and tabular data, can serve as an alternative to CMR for cardiac phenotype assessment, suggesting a new paradigm for low-cost, wide-coverage cardiac health screening. Abstract: Cardiovascular diseases are the leading cause of death. Cardiac phenotypes (CPs), e.g., ejection fraction, are the gold standard for assessing cardiac health, but they are derived from cine cardiac magnetic resonance imaging (CMR), which is costly and requires high spatio-temporal resolution. Every magnetic resonance (MR) examination begins with rapid and coarse localizers for scan planning, which are discarded thereafter. Despite non-diagnostic image quality and lack of temporal information, localizers can provide valuable structural information rapidly. In addition to imaging, patient-level information, including demographics and lifestyle, influences cardiac health assessment. Electrocardiograms (ECGs) are inexpensive, routinely ordered in clinical practice, and capture the temporal activity of the heart. Here, we introduce C-TRIP (Cardiac Tri-modal Representations for Imaging Phenotypes), a multi-modal framework that aligns localizer MRI, ECG signals, and tabular metadata to learn a robust latent space and predict CPs using localizer images as an opportunistic alternative to CMR. By combining these three modalities, we leverage cheap spatial and temporal information from localizers and ECG, respectively, while benefiting from patient-specific context provided by tabular data. Our pipeline consists of three stages. First, encoders are trained independently to learn uni-modal representations. The second stage fuses the pre-trained encoders to unify the latent space.
The final stage uses the enriched representation space for CP prediction, with inference performed exclusively on localizer MRI. The proposed C-TRIP yields accurate functional CPs and high correlations for structural CPs. Since localizers are inherently rapid and low-cost, our C-TRIP framework could enable better accessibility for CP estimation.
[187] A Grid-Based Framework for E-Scooter Demand Representation and Temporal Input Design for Deep Learning: Evidence from Austin, Texas
Mohammad Sahnoon Merkebe Getachew Demissie,Roberto Souza
Main category: cs.CV
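TL;DR: This paper proposes a reproducible data-processing pipeline and a statistically grounded method for designing temporal input structures for image-to-image shared micromobility demand prediction, and validates them on large-scale e-scooter data.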
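Two steps this entry names, turning trip records into hourly grid images and screening candidate history lags by correlation, can be sketched as follows. This is a minimal illustration that assumes trips have already been filtered and mapped to integer grid cells and hour indices; the function names are hypothetical and the paper's actual pipeline is not reproduced.

```python
import numpy as np

def hourly_demand_images(rows, cols, hours, grid_shape, n_hours):
    """Count trips per (hour, row, col) cell to form one demand image per hour."""
    images = np.zeros((n_hours, *grid_shape), dtype=np.int64)
    # np.add.at accumulates correctly when the same cell appears more than once.
    np.add.at(images, (hours, rows, cols), 1)
    return images

def activity_mask(images, min_total=1):
    """Mark cells with at least `min_total` trips over the whole history."""
    return images.sum(axis=0) >= min_total

def lag_correlation(series, lag):
    """Pearson correlation between a demand series and itself shifted by `lag` hours."""
    return float(np.corrcoef(series[:-lag], series[lag:])[0, 1])
```

Lags whose correlation stays high (for instance 24 or 168 hours when demand has daily and weekly cycles) are natural candidates for the input stack, which an error-based ablation can then confirm or reject.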
Details
Motivation: In deep learning for shared micromobility demand prediction, temporal input structures are rarely designed systematically or validated statistically; historical demand features are often chosen heuristically, which affects model performance and generalization. Method: Builds a grid-based spatiotemporal dataset by converting trip records into hourly pickup and dropoff demand images, with a pipeline covering trip filtering, Census Tract mapping, grid construction, demand aggregation, and a global activity mask; proposes a temporal input selection method combining correlation and error analysis, and determines the optimal temporal depth through ablation with a baseline UNET model and non-parametric tests with Holm correction. Result: The resulting temporal structures capture short-term persistence and daily and weekly cycles; compared with adjacent-hour and fixed-period baselines, mean squared error falls by up to 37% for next-hour prediction and 35% for next-24-hour prediction. Conclusion: Statistically validated temporal input design and rigorous dataset construction are key to better spatiotemporal micromobility demand prediction. Abstract: Despite progress in deep learning for shared micromobility demand prediction, the systematic design and statistical validation of temporal input structures remain underexplored. Temporal features are often selected heuristically, even though historical demand strongly affects model performance and generalizability. This paper introduces a reproducible data-processing pipeline and a statistically grounded method for designing temporal input structures for image-to-image demand prediction. Using large-scale e-scooter data from Austin, Texas, we build a grid-based spatiotemporal dataset by converting trip records into hourly pickup and dropoff demand images. The pipeline includes trip filtering, mapping Census Tracts to spatial locations, grid construction, demand aggregation, and creation of a global activity mask that limits evaluation to historically active areas. This representation supports consistent spatial learning while preserving demand patterns. We then introduce a combined correlation- and error-based procedure to identify informative historical inputs. Optimal temporal depth is selected through an ablation study using a baseline UNET model with paired non-parametric tests and Holm correction. The resulting temporal structures capture short-term persistence as well as daily and weekly cycles. Compared with adjacent-hour and fixed-period baselines, the proposed design reduces mean squared error by up to 37 percent for next-hour prediction and 35 percent for next-24-hour prediction.
These results highlight the value of principled dataset construction and statistically validated temporal input design for spatiotemporal micromobility demand prediction.
[188] Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis
Dayou Li,Lulin Liu,Bangya Liu,Shijie Zhou,Jiu Feng,Ziqi Lu,Minghui Zheng,Chenyu You,Zhiwen Fan
Main category: cs.CV
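TL;DR: This paper proposes EgoHOI, an egocentric human-object interaction (HOI) world model that generates physically plausible, contact-consistent interactions from action signals alone, without future object states as input, and improves dynamic realism by distilling 3D geometric and kinematic priors into physics-informed embeddings.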
Details
Motivation: Most existing world models are conditional video generators that depend on future object states and do not truly simulate physics-driven interaction; egocentric HOI modeling additionally faces rapid head motion, severe occlusion, and high-DoF hand articulation, so a physically consistent simulation method that does not rely on future states is needed. Method: EgoHOI extracts geometric and kinematic priors from 3D estimates, distills them into physics-informed embeddings, and uses these to regularize egocentric video generation, producing physically plausible rollouts from action signals alone. Result: On the HOT3D dataset EgoHOI consistently outperforms strong baselines; ablations confirm the effectiveness of the physics-informed embedding design. Conclusion: EgoHOI removes the dependence on future object trajectories and achieves action-driven, physically consistent egocentric HOI world modeling, providing a more credible interaction simulation basis for embodied AI. Abstract: To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
[189] Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models
Bo Yu,Fengze Yang,Yiming Liu,Chao Wang,Xuewen Luo,Taozhe Li,Ruimin Ke,Xiaofan Zhou,Chenxi Liu
Main category: cs.CV
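TL;DR: This paper proposes Geo-ADAPT, a framework that screens images with an optimized locatability score, builds a locatability-stratified reasoning dataset, and uses a two-stage GRPO curriculum to adapt reasoning depth, achieving state-of-the-art geo-localization performance with substantially fewer hallucinations.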
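For readers unfamiliar with GRPO, its defining step is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that normalization; the reward values are placeholders, and Geo-ADAPT's customized reward functions are not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    # Responses above the group mean get positive advantage, the rest negative.
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Because the baseline is the group mean, no separate value network is needed; a group whose responses all earn the same reward contributes zero advantage.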
Details
Motivation: RAG methods are limited by the quality of the retrieval database, while reasoning-driven methods fail to internalize image locatability and rely on fixed-depth reasoning paths, which increases hallucinations and lowers accuracy. Method: Proposes an Optimized Locatability Score that quantifies how suitable an image is for deep reasoning; builds the locatability-stratified Geo-ADAPT-51K dataset; designs a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Result: Geo-ADAPT achieves state-of-the-art performance on multiple geo-localization benchmarks and substantially reduces hallucinations, with more efficient, adaptive reasoning. Conclusion: By modeling image locatability explicitly and pairing it with an adaptive reasoning policy, the work offers a new paradigm for robust, efficient VLM reasoning in geo-localization. Abstract: The emergence of Vision-Language Models (VLMs) has introduced new paradigms for global image geo-localization through retrieval-augmented generation (RAG) and reasoning-driven inference. However, RAG methods are constrained by retrieval database quality, while reasoning-driven approaches fail to internalize image locatability, relying on inefficient, fixed-depth reasoning paths that increase hallucinations and degrade accuracy. To overcome these limitations, we introduce an Optimized Locatability Score that quantifies an image's suitability for deep reasoning in geo-localization. Using this metric, we curate Geo-ADAPT-51K, a locatability-stratified reasoning dataset enriched with augmented reasoning trajectories for complex visual scenes. Building on this foundation, we propose a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Our framework, Geo-ADAPT, learns an adaptive reasoning policy, achieves state-of-the-art performance across multiple geo-localization benchmarks, and substantially reduces hallucinations by reasoning both adaptively and efficiently.
[190] Causal Attribution via Activation Patching
Amirmohammad Izadi,Mohammadali Banayeeanzade,Alireza Mirrokni,Hosein Hasani,Mobin Bagherian,Faridoun Mehri,Mahdieh Soleymani Baghshah
Main category: cs.CV
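TL;DR: This paper proposes CAAP, a method that attributes the causal contribution of image patches to a ViT's prediction by intervening directly on internal activations rather than perturbing the input, improving attribution faithfulness and spatial localization.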
Details
Motivation: Existing ViT attribution methods struggle to isolate the causal contribution of the internal representations associated with individual image patches, because class-relevant evidence arises from interactions between patch tokens across layers and input-level perturbations do not reflect the internal evidence the model actually uses. Method: Proposes Causal Attribution via Activation Patching (CAAP): for each patch, the corresponding source-image activations are inserted into a neutral target context over an intermediate range of layers, and the resulting target-class score serves as the attribution signal, directly estimating the causal effect of the patch's internal representation. Result: Across multiple ViT backbones and standard metrics, CAAP significantly outperforms existing methods and produces more faithful, more spatially localized attribution maps. Conclusion: By intervening causally on activations at intermediate layers, CAAP avoids the spatial blurring caused by late-layer global mixing and gives ViTs a more principled and more effective attribution method. Abstract: Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing gradient-based and perturbation-based techniques often fail to isolate the causal contribution of internal representations associated with individual image patches. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers, and input-level perturbations can be poor proxies for patch importance, since they may fail to reconstruct the internal evidence actually used by the model. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal effect of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing class-relevant evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity.
Across multiple ViT backbones and standard metrics, CAAP significantly outperforms existing methods and produces more faithful and localized attributions.
[191] FMS$^2$: Unified Flow Matching for Segmentation and Synthesis of Thin Structures
Babak Asadi,Peiyang Wu,Mani Golparvar-Fard,Viraj Shah,Ramez Hajj
Main category: cs.CV
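TL;DR: This paper proposes FMS$^2$, a framework comprising SegFlow (a flow-matching segmentation model) and SynFlow (a mask-conditioned image generator), which improves the continuity, sharpness, and topological accuracy of thin-structure segmentation (e.g., cracks and vessels) and generalizes well in limited-label and cross-domain settings.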
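The flow-matching idea behind SegFlow, regressing a velocity field and integrating an ODE from image to mask, can be illustrated on toy arrays. The straight-line interpolation path and the Euler integrator below are common choices assumed for illustration; the entry does not specify SegFlow's actual path, network, or solver, and all function names are hypothetical.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its target velocity."""
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0

def flow_matching_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred_velocity - target_velocity) ** 2))

def euler_integrate(velocity_fn, x0, n_steps=10):
    """Transport x0 to t = 1 by integrating dx/dt = velocity_fn(x, t) with Euler steps."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x
```

In training, a network would replace `velocity_fn` and be fit with `flow_matching_loss` at randomly sampled times, so supervision covers the whole trajectory rather than only the final logits.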
Details
Motivation: Thin-structure segmentation is hampered by topology-sensitive geometry, high annotation cost, and poor cross-domain generalization, and existing methods address these problems only in isolation. Method: Proposes FMS$^2$: (1) SegFlow uses an encoder-decoder backbone and models segmentation as continuous image-to-mask transport, learning a time-indexed velocity field by flow-matching regression and producing the mask by ODE integration; (2) SynFlow is a mask-conditioned image generator that injects mask geometry at multiple scales and combines edge-aware gating with a controllable mask generator to synthesize pixel-aligned image-mask pairs. Result: On five crack and vessel benchmarks SegFlow outperforms CNN, Transformer, Mamba, and generative baselines, improving mIoU by 17.2% and reducing Betti matching error by 37.3%; with SynFlow, 25% of the real annotations suffice to recover near-full-supervision performance, and cross-domain IoU improves by 0.11 on average. Conclusion: Through trajectory-level supervision and synthetic data with controllable structural shifts, FMS$^2$ addresses continuity, topological fidelity, and generalization in thin-structure segmentation together, without complex loss design or post-processing. Abstract: Segmenting thin structures like infrastructure cracks and anatomical vessels is a task hampered by topology-sensitive geometry, high annotation costs, and poor generalization across domains. Existing methods address these challenges in isolation. We propose FMS$^2$, a flow-matching framework with two modules. (1) SegFlow is a 2.96M-parameter segmentation model built on a standard encoder-decoder backbone that recasts prediction as continuous image $\rightarrow$ mask transport. It learns a time-indexed velocity field with a flow-matching regression loss and outputs the mask via ODE integration, rather than supervising only end-state logits. This trajectory-level supervision improves thin-structure continuity and sharpness, compared with tuned topology-aware loss baselines, without auxiliary topology heads, post-processing, or multi-term loss engineering. (2) SynFlow is a mask-conditioned mask $\rightarrow$ image generator that produces pixel-aligned synthetic image-mask pairs. It injects mask geometry at multiple scales and emphasizes boundary bands via edge-aware gating, while a controllable mask generator expands sparsity, width, and branching regimes. On five crack and vessel benchmarks, SegFlow alone outperforms strong CNN, Transformer, Mamba, and generative baselines, improving the volumetric metric (mean IoU) from 0.511 to 0.599 (+17.2%) and reducing the topological metric (Betti matching error) from 82.145 to 51.524 (-37.3%).
When training with limited labels, augmenting SegFlow with SynFlow-generated pairs recovers near-full performance using 25% of real annotations and improves cross-domain IoU by 0.11 on average. Unlike classical data augmentation that promotes invariance via label-preserving transforms, SynFlow provides pixel-aligned paired supervision with controllable structural shifts (e.g., sparsity, width, branching), which is particularly effective under domain shift.
[192] Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision
Yunhe Gao,Yabin Zhang,Chong Wang,Jiaming Liu,Maya Varma,Jean-Benoit Delbrouck,Akshay Chaudhari,Curtis Langlotz
Main category: cs.CV
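TL;DR: This paper proposes MASS (MAsk-guided Self-Supervised learning), a mask-guided self-supervised method for 3D medical imaging that uses automatically generated class-agnostic masks as structural supervision to learn general-purpose, anatomically meaningful representations without manual annotation, and outperforms existing self-supervised methods on few-shot segmentation, low-data supervised training, and frozen-encoder classification of unseen pathologies.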
Details
Motivation: Existing self-supervised methods for 3D medical imaging rely on low-level reconstruction or contrastive objectives that fail to capture key anatomical semantics, limiting downstream transfer; vision and language already have foundation models, and medical imaging lacks a comparable approach. Method: MASS uses in-context segmentation as the pretext task, takes structural supervision from automatically generated class-agnostic masks, and trains on thousands of mask proposals spanning anatomical structures and pathological findings to learn holistic semantic representations of appearance, shape, spatial context, and anatomical relationships. Result: Effective across data regimes, from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations; it achieves (i) few-shot segmentation of novel structures, (ii) full-supervision performance with only 20-40% of the labeled data, outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification of unseen pathologies on par with fully supervised models. Conclusion: MASS shows that mask-guided self-supervised pretraining captures broadly generalizable medical knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Abstract: Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples.
Mask-guided self-supervised pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code is available at https://github.com/Stanford-AIMI/MASS.
[193] TSDCRF: Balancing Privacy and Multi-Object Tracking via Time-Series CRF and Normalized Control Penalty
Bo Ma,Jinsong Wu,Weiqi Yan
Main category: cs.CV
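TL;DR: This paper proposes TSDCRF, a framework that balances privacy protection and tracking performance in multi-object tracking, using differential-privacy noise, a normalized control penalty, and a time-series dynamic conditional random field to keep IDs stable and trajectories robust.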
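The differential-privacy component can be illustrated with the classical Gaussian mechanism, whose noise scale is $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \varepsilon$ for $0 < \varepsilon < 1$. Whether TSDCRF uses exactly this calibration is an assumption; the sensitivity value and the helper names below are hypothetical.

```python
import math
import numpy as np

def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Noise scale of the classical (epsilon, delta) Gaussian mechanism."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("this bound assumes 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def privatize_region(region, epsilon, delta, sensitivity=1.0, rng=None):
    """Add calibrated Gaussian noise to a sensitive region (e.g., a box crop)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return region + rng.normal(0.0, sigma, size=region.shape)
```

A smaller privacy budget `epsilon` gives a larger `sigma`, which is the source of the tracking degradation that the NCP and DCRF stages are meant to offset.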
Details
Motivation: Multi-object tracking often depends on appearance or location cues that can reveal identity, while adding noise directly tends to cause ID switches or target loss, so a method that preserves both privacy and tracking accuracy is needed. Method: Proposes TSDCRF with three components: (i) $(\varepsilon,\delta)$-differentially private Gaussian noise under a configurable privacy budget; (ii) a Normalized Control Penalty (NCP) applied before noise injection to suppress unstable class predictions; (iii) a time-series dynamic conditional random field (DCRF) that enforces cross-frame consistency and corrects trajectory deviation. The framework is decoupled from the detector and tracker. Result: Evaluated on MOT16, MOT17, Cityscapes, and KITTI; compared with white noise and prior methods (NTPD, PPDTSA), TSDCRF achieves lower KL-divergence shift, lower tracking RMSE, and better robustness to trajectory hijacking while preserving privacy. Conclusion: TSDCRF is a plug-in, privacy-enhancing tracking refinement framework that improves the privacy-utility trade-off without changing the tracker or detector and suits practical privacy-sensitive scenarios. Abstract: Multi-object tracking in video often requires appearance or location cues that can reveal sensitive identity information, while adding privacy-preserving noise typically disrupts cross-frame association and causes ID switches or target loss. We propose TSDCRF, a plug-in refinement framework that balances privacy and tracking by combining three components: (i) $(\varepsilon,\delta)$-differential privacy via calibrated Gaussian noise on sensitive regions under a configurable privacy budget; (ii) a Normalized Control Penalty (NCP) that down-weights unstable or conflicting class predictions before noise injection to stabilize association; and (iii) a time-series dynamic conditional random field (DCRF) that enforces temporal consistency and corrects trajectory deviation after noise, mitigating ID switches and improving resilience to trajectory hijacking. The pipeline is agnostic to the choice of detector and tracker (e.g., YOLOv4 and DeepSORT). We evaluate on MOT16, MOT17, Cityscapes, and KITTI. Results show that TSDCRF achieves a better privacy-utility trade-off than white noise and prior methods (NTPD, PPDTSA): lower KL-divergence shift, lower tracking RMSE, and improved robustness under trajectory hijacking while preserving privacy. Source code is available at https://github.com/mabo1215/TSDCRF.git
[194] SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment
Mahdi Naseri,Zhou Wang
Main category: cs.CV
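TL;DR: SHAMISA is a no-reference image quality assessment (NR-IQA) method that needs neither reference images nor human annotations: using non-contrastive self-supervised learning, it builds soft, controllable, distortion-aware relations from synthetic metadata and intrinsic feature structure, and trains a quality model on unlabeled distorted images that generalizes well and is robust.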
Details
Motivation: Existing NR-IQA methods depend heavily on large numbers of costly human subjective quality labels, which limits scalability and practicality; moreover, conventional self-supervised methods impose rigid binary similarity constraints that cannot model diverse distortion patterns and content sensitivity. Method: Proposes SHAMISA: 1) a compositional distortion engine generates controllable families of distorted images from continuous parameter spaces, varying one factor at a time; 2) dual-source relation graphs combine explicit synthetic distortion priors with implicit feature-structure affinities; 3) a convolutional encoder is trained with non-contrastive self-supervision, then frozen, and a linear regressor on its features predicts quality. Result: Strong performance on synthetic, authentic, and cross-dataset NR-IQA benchmarks, with improved cross-dataset generalization and robustness, using no human quality annotations or contrastive losses. Conclusion: SHAMISA demonstrates that structured relational modeling with controllable distortion synthesis is a feasible and effective route to annotation-free NR-IQA, offering a new way past the labeling bottleneck. Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features.
Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.
[195] Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
Sungrae Hong,Jiwon Jeong,Jisu Shin,Donghee Han,Sol Lee,Kyungeun Kim,Mun Yong Yi
Main category: cs.CV
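TL;DR: This paper proposes a mistake-severity-aware multiple instance learning (MIL) training strategy for whole slide image (WSI) diagnosis that uses a hierarchical class structure, a severity-weighted loss, probabilistic alignment, and a semantic feature remix to reduce clinically critical misclassifications.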
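A severity-weighted cross-entropy can take several forms, and the entry does not give the authors' formula. The sketch below shows one hypothetical variant: the per-sample cross-entropy is scaled by one plus the expected severity of the predicted distribution, where `severity[true, pred]` is an assumed asymmetric cost matrix with a zero diagonal.

```python
import numpy as np

def severity_weighted_ce(probs, labels, severity):
    """Cross-entropy scaled by the expected severity of the predicted errors.

    Hypothetical weighting: weight = 1 + sum_k severity[y, k] * p_k.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    severity = np.asarray(severity, dtype=float)
    idx = np.arange(len(labels))
    ce = -np.log(np.clip(probs[idx, labels], 1e-12, 1.0))
    # Row `labels[i]` of the matrix holds the cost of each wrong prediction.
    expected_severity = (probs * severity[labels]).sum(axis=1)
    return float(np.mean((1.0 + expected_severity) * ce))
```

With an asymmetric matrix, placing probability mass on a benign class when the true class is malignant can cost more than the reverse error, even when the probability assigned to the true class is the same.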
Details
Motivation: Existing MIL methods ignore diagnostic priorities and cannot distinguish misclassifications of different severity in multiclass settings, so clinically critical errors go unaddressed. Method: Organizes diagnostic classes into a hierarchy; designs a severity-weighted cross-entropy loss; enforces hierarchical consistency through probabilistic alignment and a semantic feature remix; proposes an asymmetric Mikel's Wheel-based metric to quantify error severity in medical settings. Result: Fewer critical errors on public and real-world in-house pathology datasets; additional results on natural-domain data support generalizability. Conclusion: The mistake-severity-aware framework makes MIL more reliable and practical for clinical diagnosis, balancing accuracy with clinical relevance. Abstract: Multiple Instance Learning (MIL) has emerged as a promising paradigm for Whole Slide Image (WSI) diagnosis, offering effective learning with limited annotations. However, existing MIL frameworks overlook diagnostic priorities and fail to differentiate the severity of misclassifications in multiclass settings, leaving clinically critical errors unaddressed. We propose a mistake-severity-aware training strategy that organizes diagnostic classes into a hierarchical structure, with each level optimized using a severity-weighted cross-entropy loss that penalizes high-severity misclassifications more strongly. Additionally, hierarchical consistency is enforced through probabilistic alignment and a semantic feature remix applied to the instance bag to robustly train class priority and accommodate clinical cases involving multiple symptoms. An asymmetric Mikel's Wheel-based metric is also introduced to quantify the severity of errors specific to medical fields. Experiments on challenging public and real-world in-house datasets demonstrate that our approach significantly mitigates critical errors in MIL diagnosis compared to existing methods. We present additional experimental results on natural domain data to demonstrate the generalizability of our proposed method beyond medical contexts.
[196] RSEdit: Text-Guided Image Editing for Remote Sensing
Chen Zhenyuan,Zhang Zechuan,Zhang Feng
Main category: cs.CV
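TL;DR: RSEdit is a text-guided editing framework designed for remote sensing imagery. By adapting pretrained diffusion models and training on bi-temporal remote sensing data, it produces precise, physically coherent edits that preserve geospatial content, and it clearly outperforms general-purpose and commercial baselines.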
Details
Motivation: General-purpose text-guided image editors tend to introduce artifacts, hallucinate objects, and break orthographic constraints on remote sensing imagery. The root causes are that pretrained models lack remote sensing domain knowledge and that their conditioning schemes do not match the bi-temporal structure and spatial priors of remote sensing data.
Method: Proposes RSEdit, which adapts pretrained U-Net and DiT diffusion models into instruction-driven remote sensing editors via channel concatenation and in-context token concatenation, trained on more than 60,000 bi-temporal remote sensing image pairs.
Result: Clearly outperforms general and commercial baselines on remote sensing editing tasks such as disaster impacts, urban growth, and seasonal change, with strong generalization; the authors state that code, models, evaluation protocols, training logs, and generated results will be released for full reproducibility.
Conclusion: RSEdit bridges the gap between general image editing and the specialized needs of remote sensing, serving as a reliable data engine for downstream remote sensing analysis.
Abstract: General-domain text-guided image editors achieve strong photorealism but introduce artifacts, hallucinate objects, and break the orthographic constraints of remote sensing (RS) imagery. We trace this gap to two high-level causes: (i) limited RS world knowledge in pretrained models, and (ii) conditioning schemes that misalign with the bi-temporal structure and spatial priors of Earth observation data. We present RSEdit, a unified framework that adapts pretrained text-to-image diffusion models - both U-Net and DiT - into instruction-following RS editors via channel concatenation and in-context token concatenation. Trained on over 60,000 semantically rich bi-temporal remote sensing image pairs, RSEdit learns precise, physically coherent edits while preserving geospatial content. Experiments show clear gains over general and commercial baselines, demonstrating strong generalizability across diverse scenarios including disaster impacts, urban growth, and seasonal shifts, positioning RSEdit as a robust data engine for downstream analysis. We will release code, pretrained models, evaluation protocols, training logs, and generated results for full reproducibility. Code: https://github.com/Bili-Sakura/RSEdit-Preview
[197] Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking
Yabin Zhu,Jianqi Li,Chenglong Li,Jiaxiang Wang,Chengjie Gu,Jin Tang
Main category: cs.CV
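TL;DR: This paper proposes SDMoEA, a new parameter-efficient fine-tuning (PEFT) framework for multi-modal tracking. It combines a sparse-dense mixture-of-experts adapter with a Gram-matrix-based semantic alignment hypergraph fusion module to jointly model modality-specific and cross-modal shared information and to improve high-order, multi-level fusion.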
Details
Motivation: Existing PEFT methods for multi-modal tracking struggle to represent heterogeneous cross-modal features within a unified shared-parameter framework, and they lack the ability to model high-order, multi-level fusion relationships.
Method: Proposes the Sparse-Dense Mixture of Experts Adapter (SDMoEA), comprising a sparse MoE (capturing modality-specific information) and a dense-shared MoE (modeling shared cross-modal information), together with a Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module that uses Gram matrices for cross-modal semantic alignment and builds a hypergraph to model high-order dependencies.
Result: Outperforms other PEFT methods on multiple multi-modal tracking benchmarks, including LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.
Conclusion: By decoupling and jointly modeling modality-specific and shared information, and by adding a hypergraph structure for high-order fusion, SDMoEA significantly improves the effectiveness and generalization of PEFT in multi-modal tracking.
Abstract: Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter models shared cross-modal information. Furthermore, to overcome limitations of existing tracking methods in modeling high-order correlations during multi-level multi-modal fusion, we introduce a Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module. It first employs Gram matrices for cross-modal semantic alignment, ensuring that the constructed hypergraph accurately reflects semantic similarity and high-order dependencies between modalities. The aligned features are then integrated into the hypergraph structure to exploit its ability to model high-order relationships, enabling deep fusion of multi-level multi-modal information.
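As a rough illustration of the sparse/dense split described in this abstract, the following Python sketch routes a feature vector through top-k sparse experts and through a densely mixed set of shared experts, then adds both outputs to the input as a residual. The linear experts, the top-k router, the residual sum, and every name here (`SparseDenseMoE`, `top_k`, and so on) are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
import math
import random


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rng, rows, cols, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class SparseDenseMoE:
    """Hypothetical adapter: sparse top-k experts plus densely mixed shared experts."""

    def __init__(self, dim, n_sparse=4, n_dense=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.sparse_experts = [make_matrix(rng, dim, dim) for _ in range(n_sparse)]
        self.dense_experts = [make_matrix(rng, dim, dim) for _ in range(n_dense)]
        self.sparse_router = make_matrix(rng, n_sparse, dim)
        self.dense_router = make_matrix(rng, n_dense, dim)

    def sparse_branch(self, x):
        # Only the top-k scoring experts run (assumed to carry modality-specific cues).
        scores = matvec(self.sparse_router, x)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        gates = softmax([scores[i] for i in chosen])
        out = [0.0] * len(x)
        for gate, idx in zip(gates, chosen):
            y = matvec(self.sparse_experts[idx], x)
            out = [o + gate * v for o, v in zip(out, y)]
        return out, chosen

    def dense_branch(self, x):
        # Every shared expert runs and is softly mixed (assumed to carry shared cues).
        gates = softmax(matvec(self.dense_router, x))
        out = [0.0] * len(x)
        for gate, weights in zip(gates, self.dense_experts):
            y = matvec(weights, x)
            out = [o + gate * v for o, v in zip(out, y)]
        return out

    def forward(self, x):
        sparse_out, chosen = self.sparse_branch(x)
        dense_out = self.dense_branch(x)
        # Residual combination is an assumption, typical of adapter modules.
        y = [xi + s + d for xi, s, d in zip(x, sparse_out, dense_out)]
        return y, chosen
```

Calling `SparseDenseMoE(dim=4, top_k=2).forward([1.0, 0.5, -0.5, 2.0])` returns the adapted vector together with the indices of the two sparse experts that were activated.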
Extensive experiments demonstrate that the proposed method achieves superior performance compared with other PEFT approaches on several multi-modal tracking benchmarks, including LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.
[198] Bodhi VLM: Privacy-Alignment Modeling for Hierarchical Visual Representations in Vision Backbones and VLM Encoders via Bottom-Up and Top-Down Feature Search
Bo Ma,Jinsong Wu,Wei Qi Yan
Main category: cs.CV
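TL;DR: This paper proposes Bodhi VLM, a framework for modeling how noise injected into hierarchical visual representations aligns with a privacy budget. Using sensitive-concept localization, multi-scale feature-region detection, and an Expectation-Maximization Privacy Assessment (EMPA) module, it produces an interpretable budget-alignment signal and applies to a range of object detectors and VLM visual encoders.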
Details
Motivation: Existing privacy-preserving learning systems inject noise into hierarchical visual representations, but there is no interpretable, cross-model method for modeling alignment with the privacy budget.
Method: Proposes the Bodhi VLM framework: (1) maps sensitive concepts onto the hierarchy using NCP and MDAV clustering; (2) locates sensitive feature regions with bottom-up (BUA) and top-down (TDA) strategies; (3) designs the EMPA module, which produces a budget-alignment signal by comparing the fitted sensitive-feature distribution with a specified reference distribution (such as Laplace or Gaussian).
Result: On YOLO, PPDPTS, DETR, CLIP, LLaVA, and BLIP, the BUA and TDA strategies show consistent deviation trends and EMPA provides a stable alignment signal; the method is reported to outperform the Chi-square, KL, MMD, MomentReg, NoiseMLE, and Wass-1 baselines.
Conclusion: Bodhi VLM offers the first learnable, interpretable privacy-alignment modeling framework for hierarchical neural representations, going beyond the traditional post hoc audit paradigm and giving a new perspective on designing privacy-preserving vision and multimodal models.
Abstract: Learning systems that preserve privacy often inject noise into hierarchical visual representations; a central challenge is to model how such perturbations align with a declared privacy budget in a way that is interpretable and applicable across vision backbones and vision-language models (VLMs). We propose Bodhi VLM, a privacy-alignment modeling framework for hierarchical neural representations: it (1) links sensitive concepts to layer-wise grouping via NCP and MDAV-based clustering; (2) locates sensitive feature regions using bottom-up (BUA) and top-down (TDA) strategies over multi-scale representations (e.g., feature pyramids or vision-encoder layers); and (3) uses an Expectation-Maximization Privacy Assessment (EMPA) module to produce an interpretable budget-alignment signal by comparing the fitted sensitive-feature distribution to an evaluator-specified reference (e.g., Laplace or Gaussian with scale $c/ε$). The output is reference-relative and is not a formal differential-privacy estimator. We formalize BUA/TDA over hierarchical feature structures and validate the framework on object detectors (YOLO, PPDPTS, DETR) and on the visual encoders of VLMs (CLIP, LLaVA, BLIP). BUA and TDA yield comparable deviation trends; EMPA provides a stable alignment signal under the reported setups. We compare with generic discrepancy baselines (Chi-square, K-L, MMD) and with task-relevant baselines (MomentReg, NoiseMLE, Wass-1).
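To make the idea of a reference-relative alignment signal concrete, here is a minimal Python sketch that fits a Laplace scale to observed perturbations and compares it with a reference scale $c/ε$. It is a simplified stand-in, not the paper's EMPA module: the Expectation-Maximization fit is replaced by the closed-form Laplace maximum-likelihood estimate (mean absolute deviation about the median), and the function names, the `relative_deviation` score, and the synthetic sampler are all hypothetical.

```python
import random
import statistics


def fit_laplace_scale(samples):
    """Closed-form Laplace MLE: mean absolute deviation about the median."""
    median = statistics.median(samples)
    return sum(abs(s - median) for s in samples) / len(samples)


def budget_alignment(samples, c, epsilon):
    """Compare the fitted noise scale with the reference scale c / epsilon.

    The relative deviation is an illustrative score only; like the signal the
    abstract describes, it is not a formal differential-privacy estimate.
    """
    reference_scale = c / epsilon
    fitted_scale = fit_laplace_scale(samples)
    return {
        "fitted_scale": fitted_scale,
        "reference_scale": reference_scale,
        "relative_deviation": abs(fitted_scale - reference_scale) / reference_scale,
    }


def sample_laplace(rng, scale, n):
    """Zero-mean Laplace noise as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return [rng.expovariate(rate) - rng.expovariate(rate) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    noise = sample_laplace(rng, scale=2.0, n=20000)
    # Reference matches the injected noise (c / epsilon = 2.0): small deviation.
    print(budget_alignment(noise, c=1.0, epsilon=0.5))
    # Reference assumes a looser budget (c / epsilon = 0.5): large deviation.
    print(budget_alignment(noise, c=1.0, epsilon=2.0))
```

A small deviation indicates the observed perturbation is consistent with the declared reference; a large one flags a mismatch worth inspecting.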
Results are reported as mean$\pm$std over multiple seeds with confidence intervals in the supplementary materials. This work contributes a learnable, interpretable modeling perspective for privacy-aligned hierarchical representations rather than a post hoc audit only. Source code: https://github.com/mabo1215/bodhi-vlm.git
[199] UniVid: Pyramid Diffusion Model for High Quality Video Generation
Xinyu Xiao,Binbin Yang,Tingtian Li,Yipeng Yu,Sen Lei
Main category: cs.CV
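TL;DR: This paper proposes UniVid, a unified video generation model controlled by both text and a reference image. By introducing temporal-pyramid cross-frame spatial-temporal attention modules and a dual-stream cross-attention mechanism, it achieves high-quality, temporally consistent text-to-video, image-to-video, and joint text-and-image-to-video generation.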
Details
Motivation: Existing diffusion models struggle to unify text-to-video (T2V) and image-to-video (I2V) generation in a single model and lack flexible support for multimodal (text plus image) control.
Method: Extends a pretrained text-to-image diffusion model with temporal-pyramid cross-frame spatial-temporal attention modules and convolutions to strengthen temporal consistency, and designs a dual-stream cross-attention mechanism that supports freely weighted fusion of, and interpolation between, the text and image modalities.
Result: Shows better temporal consistency and generation quality than existing methods on T2V, I2V, and (T+I)2V tasks.
Conclusion: UniVid provides a unified video generation framework driven by both text and image, balancing semantic description with visual detail fidelity and offering a new paradigm for multi-condition video generation.
Abstract: Diffusion-based text-to-video (T2V) and image-to-video (I2V) generation have emerged as a prominent research focus. However, integrating the two generative paradigms into a unified model remains a challenge. In this paper, we present a unified video generation model (UniVid) with hybrid conditions of the text prompt and reference image. Given these two available controls, our model can extract objects' appearance and their motion descriptions from textual prompts, while obtaining texture details and structural information from image clues to guide the video generation process. Specifically, we scale up the pre-trained text-to-image diffusion model for generating temporally coherent frames by introducing our temporal-pyramid cross-frame spatial-temporal attention modules and convolutions. To support bimodal control, we introduce a dual-stream cross-attention mechanism, whose attention scores can be freely re-weighted to interpolate between single-modality and dual-modality control during inference. Extensive experiments showcase that our UniVid achieves superior temporal coherence on T2V, I2V and (T+I)2V tasks.
[200] Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
Zengyan Wang,Sirshapan Mitra,Rajat Modi,Grace Lim,Yogesh Rawat
Main category: cs.CV
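TL;DR: This paper introduces the three-view Sky2Ground dataset and the SkyNet model for cross-view localization and reconstruction under large altitude differences, significantly improving multi-view alignment.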
Details
Motivation: Existing methods degrade on multi-source imagery with large altitude differences and near-orthogonal views (satellite, aerial, and ground), and there is no suitable benchmark or method for handling sparse geometric overlap, large viewpoint differences, and real-image noise.
Method: Builds the Sky2Ground dataset covering 51 sites with satellite, aerial, and ground images, and proposes SkyNet, which uses a curriculum learning strategy to progressively introduce satellite imagery and strengthen cross-view consistency.
Result: SkyNet exceeds existing methods by 9.6% on RRA@5 and 18.1% on RTA@5; the benchmark exposes the limits of current pose estimation and reconstruction methods under large altitude variation.
Conclusion: Sky2Ground and SkyNet together form a comprehensive testbed and new baseline for large-scale, multi-altitude 3D perception and generalizable camera localization.
Abstract: We introduce Sky2Ground, a three-view dataset designed for varying altitude camera localization, correspondence learning, and reconstruction. The dataset combines structured synthetic imagery with real, in-the-wild images, providing both controlled multi-view geometry and realistic scene noise. Each of the 51 sites contains thousands of satellite, aerial, and ground images spanning wide altitude ranges and nearly orthogonal viewing angles, enabling rigorous evaluation across global-to-local contexts. We benchmark state-of-the-art pose estimation models, including MASt3R, DUSt3R, Map Anything, and VGGT, and observe that the use of satellite imagery often degrades performance, highlighting the challenges under large altitude variations. We also examine reconstruction methods, highlighting the challenges introduced by sparse geometric overlap, varying perspectives, and the use of real imagery, which often introduces noise and reduces rendering quality. To address some of these challenges, we propose SkyNet, a model which enhances cross-view consistency when incorporating satellite imagery with a curriculum-based training strategy to progressively incorporate more satellite views. SkyNet significantly strengthens multi-view alignment and outperforms existing methods by 9.6% on RRA@5 and 18.1% on RTA@5 in terms of absolute performance. Sky2Ground and SkyNet together establish a comprehensive testbed and baseline for advancing large-scale, multi-altitude 3D perception and generalizable camera localization. Code and models will be released publicly for future research.
[201] Ego-1K -- A Large-Scale Multiview Video Dataset for Egocentric Vision
Jae Yong Lee,Daniel Scharstein,Akash Bapat,Hao Hu,Andrew Fu,Haoru Zhao,Paul Sammut,Xiang Li,Stephen Jeapes,Anik Gupta,Lior David,Saketh Madhuvarasu,Jay Girish Joshi,Jason Wither
Main category: cs.CV
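TL;DR: This paper presents Ego-1K, a large-scale collection of time-synchronized egocentric multiview videos intended to advance neural 3D video synthesis and dynamic scene understanding.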
Details
Motivation: As multi-camera smart glasses become more common, high-quality egocentric datasets are needed to support research such as egocentric scene reconstruction.
Method: Builds a custom capture rig with 12 synchronized cameras surrounding a 4-camera VR headset, records nearly 1,000 short egocentric videos centered on hand motions and hand-object interactions, and calibrates, processes, and releases the data.
Result: Experiments show the dataset poses unique challenges for existing 3D/4D novel view synthesis methods because of the large disparities and image motion caused by close dynamic objects and rig egomotion.
Conclusion: Ego-1K provides an important benchmark and resource for egocentric 3D/4D understanding and helps advance the field.
Abstract: We present Ego-1K, a large-scale collection of time-synchronized egocentric multiview videos designed to advance neural 3D video synthesis and dynamic scene understanding. The dataset contains nearly 1,000 short egocentric videos captured with a custom rig with 12 synchronized cameras surrounding a 4-camera VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods, an important research area as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to large disparities and image motion caused by close dynamic objects and rig egomotion. Our dataset supports future research in this challenging domain. It is available at https://huggingface.co/datasets/facebook/ego-1k.
[202] Multi-Object Advertisement Creative Generation
Jialu Gao,Mithun Das Gupta,Qun Li,Raveena Kshatriya,Andrew D. Wilson,Keng-hao Chang,Balasaravanan Thoravi Kumaravel
Main category: cs.CV
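TL;DR: This paper presents CreativeAds, a system for automatically generating high-quality lifestyle images for e-commerce advertising at scale. A three-module pipeline addresses product pairing, layout generation, and background generation, and users can make custom adjustments.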
Details
Motivation: Although generative AI can produce realistic images, in e-commerce advertising it struggles to represent products authentically in realistic scenes, requires heavy manual intervention, and is hard to scale to large product catalogs.
Method: Proposes CreativeAds, a multi-product ad generation system with a pipeline of three modules (product pairing, layout generation, and background generation) and an intuitive UI that supports oversight at scale and per-image customization.
Result: A user study and image evaluations show that CreativeAds can generate large numbers of high-quality lifestyle ad images at scale without requiring GenAI expertise.
Conclusion: CreativeAds addresses the core challenges of applying GenAI to e-commerce advertising at scale and improves the quality and usability of automated ad image generation.
Abstract: Lifestyle images are photographs that capture environments and objects in everyday settings. In furniture product marketing, advertisers often create lifestyle images containing products to resonate with potential buyers, allowing buyers to visualize how the products fit into their daily lives. While recent advances in Generative Artificial Intelligence (GenAI) have given rise to realistic image content creation, their application in e-commerce advertising is challenging because high-quality ads must authentically represent the products in realistic scenarios. Therefore, manual intervention is usually required for individual generations, making it difficult to scale to larger product catalogs. To understand the challenges faced by advertisers using GenAI to create lifestyle images at scale, we conducted evaluations on ad images generated using state-of-the-art image generation models and identified the major challenges. Based on our findings, we present CreativeAds, a multi-product ad creation system that supports scalable automated generation with customized parameter adjustment for individual generation. To ensure automated high-quality ad generation, CreativeAds innovates a pipeline that consists of three modules to address challenges in product pairing, layout generation, and background generation separately. Furthermore, CreativeAds contains an intuitive user interface to allow users to oversee generation at scale, and it also supports detailed controls on individual generation for user-customized adjustments.
We performed a user study on CreativeAds and extensive evaluations of the generated images, demonstrating CreativeAds's ability to create a large number of high-quality images at scale for advertisers without requiring expertise in GenAI tools.
[203] QTrack: Query-Driven Reasoning for Multi-modal MOT
Tajamul Ashraf,Tavaheed Tariq,Sonia Yadav,Abrar Ul Riyaz,Wasif Tak,Moloud Abdar,Janibul Bashir
Main category: cs.CV
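TL;DR: This paper proposes a new multi-object tracking paradigm driven by natural language queries. It builds the RMOT26 benchmark and designs QTrack, an end-to-end vision-language model, together with a temporal perception-aware policy optimization strategy, to enable semantics-driven, reasoning-centric tracking.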
Details
Motivation: Traditional multi-object tracking (MOT) does not account for users specifying targets through natural language instructions and lacks semantic guidance and selective reasoning.
Method: Proposes a query-driven tracking paradigm; builds the RMOT26 benchmark with grounded text queries and sequence-level splits; designs QTrack, an end-to-end vision-language model; and introduces Temporal Perception-Aware Policy Optimization (TPAPO) with structured rewards.
Result: Validates QTrack on RMOT26 for language-guided, reasoning-oriented tracking, with marked improvements in semantic consistency, temporal coherence, and identity stability.
Conclusion: Language queries can effectively guide multi-object tracking; the proposed paradigm, dataset, and model open a new direction for reasoning-oriented visual tracking.
Abstract: Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization. Additionally, we introduce a Temporal Perception-Aware Policy Optimization strategy with structured rewards to encourage motion-aware reasoning. Extensive experiments demonstrate the effectiveness of our approach for reasoning-centric, language-guided tracking. Code and data are available at https://github.com/gaash-lab/QTrack
[204] PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment
Zhexiao Xiong,Yizhi Song,Liu He,Wei Xiong,Yu Yuan,Feng Qiao,Nathan Jacobs
Main category: cs.CV
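TL;DR: This paper proposes PhysAlign, a framework that achieves physics-coherent image-to-video generation by synthesizing physics-annotated video data and constructing a physical latent space, significantly improving temporal stability and physical reasoning.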
Details
Motivation: Existing video diffusion models generate temporally incoherent content that violates basic physical intuition, which limits practical use.
Method: Builds a controllable synthetic data pipeline based on rigid-body simulation to create a dataset with accurate physics and 3D annotations; couples explicit 3D geometry constraints with Gram-based spatio-temporal relational alignment to extract kinematic priors from video foundation models and construct a unified physical latent space.
Result: PhysAlign significantly outperforms existing video diffusion models on tasks requiring complex physical reasoning and temporal stability while preserving zero-shot visual quality.
Conclusion: PhysAlign has the potential to bridge the gap between purely visual synthesis and rigid-body kinematics, offering a practical paradigm for genuinely physics-grounded video generation.
Abstract: Video Diffusion Models (VDMs) offer a promising approach for simulating dynamic scenes and environments, with broad applications in robotics and media generation. However, existing models often generate temporally incoherent content that violates basic physical intuition, significantly limiting their practical applicability. We propose PhysAlign, an efficient framework for physics-coherent image-to-video (I2V) generation that explicitly addresses this limitation. To overcome the critical scarcity of physics-annotated videos, we first construct a fully controllable synthetic data generation pipeline based on rigid-body simulation, yielding a highly-curated dataset with accurate, fine-grained physics and 3D annotations. Leveraging this data, PhysAlign constructs a unified physical latent space by coupling explicit 3D geometry constraints with a Gram-based spatio-temporal relational alignment that extracts kinematic priors from video foundation models. Extensive experiments demonstrate that PhysAlign significantly outperforms existing VDMs on tasks requiring complex physical reasoning and temporal stability, without compromising zero-shot visual quality. PhysAlign shows the potential to bridge the gap between raw visual synthesis and rigid-body kinematics, establishing a practical paradigm for genuinely physics-grounded video generation. The project page is available at https://physalign.github.io/PhysAlign.
[205] Brain Tumor Classification from 3D MRI Using Persistent Homology and Betti Features: A Topological Data Analysis Approach on BraTS2020
Faisal Ahmed
Main category: cs.CV
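TL;DR: This paper proposes a brain tumor classification framework based on topological data analysis (TDA). It uses persistent homology to extract 100 topological features (Betti-0/1/2) from 3D MRI volumes and feeds them to classical machine learning models such as Random Forest for binary HGG/LGG classification, reaching 89.19% accuracy while remaining efficient and interpretable.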
Details
Motivation: Brain tumor classification from MRI is challenged by high dimensionality and structural complexity, and existing deep learning methods depend on large datasets and complex models while lacking interpretability.
Method: Applies persistent homology to 3D FLAIR MRI volumes to extract topological features described by Betti numbers (connected components, loops, and voids), builds a compact 100-dimensional topological feature vector, and feeds it to classical classifiers such as Random Forest and XGBoost for binary HGG/LGG classification.
Result: On the BraTS 2020 dataset, Random Forest combined with selected Betti features reaches 89.19% classification accuracy.
Conclusion: Persistent homology can extract discriminative and interpretable 3D tumor morphology features from MRI, offering a lightweight, interpretable route for medical image analysis.
Abstract: Accurate and interpretable brain tumor classification from medical imaging remains a challenging problem due to the high dimensionality and complex structural patterns present in magnetic resonance imaging (MRI). In this study, we propose a topology-driven framework for brain tumor classification based on Topological Data Analysis (TDA) applied directly to three-dimensional (3D) MRI volumes. Specifically, we analyze 3D Fluid Attenuated Inversion Recovery (FLAIR) images from the BraTS 2020 dataset and extract interpretable topological descriptors using persistent homology. Persistent homology captures intrinsic geometric and structural characteristics of the data through Betti numbers, which describe connected components (Betti-0), loops (Betti-1), and voids (Betti-2). From the 3D MRI volumes, we derive a compact set of 100 topological features that summarize the underlying topology of brain tumor structures. These descriptors represent complex 3D tumor morphology while significantly reducing data dimensionality. Unlike many deep learning approaches that require large-scale training data or complex architectures, the proposed framework relies on computationally efficient topological features extracted directly from the images. These features are used to train classical machine learning classifiers, including Random Forest and XGBoost, for binary classification of high-grade glioma (HGG) and low-grade glioma (LGG). Experimental results on the BraTS 2020 dataset show that the Random Forest classifier combined with selected Betti features achieves an accuracy of 89.19%.
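For readers unfamiliar with Betti features, the following Python sketch computes only the simplest one: Betti-0, the number of connected components, for a thresholded 3D volume, and then traces how that count changes across thresholds. It is a toy under stated assumptions (voxels treated as points with 6-connectivity, a superlevel threshold sweep, nested lists as the volume, and hypothetical function names); the paper's pipeline uses full persistent homology, which also tracks loops and voids and is normally computed with a dedicated TDA library.

```python
from collections import deque

# Face-sharing neighbours of a voxel (6-connectivity, an assumption of this sketch).
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def betti0(volume, threshold):
    """Count connected components of voxels with intensity >= threshold."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    components = 0
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if volume[z][y][x] < threshold or (z, y, x) in seen:
                    continue
                # New component found: flood-fill it with breadth-first search.
                components += 1
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in NEIGHBORS:
                        pz, py, px = cz + dz, cy + dy, cx + dx
                        if not (0 <= pz < nz and 0 <= py < ny and 0 <= px < nx):
                            continue
                        if (pz, py, px) in seen or volume[pz][py][px] < threshold:
                            continue
                        seen.add((pz, py, px))
                        queue.append((pz, py, px))
    return components


def betti0_curve(volume, thresholds):
    """Betti-0 at each threshold; such curves can be used as feature vectors."""
    return [betti0(volume, t) for t in thresholds]


if __name__ == "__main__":
    # 3x3x3 toy volume: two bright corners and a dimmer centre voxel.
    vol = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    vol[0][0][0] = 0.9
    vol[2][2][2] = 0.9
    vol[1][1][1] = 0.5
    print(betti0_curve(vol, [1.0, 0.8, 0.4, 0.0]))  # prints [0, 2, 3, 1]
```

The toy curve shows components appearing as the threshold drops and finally merging into one when the whole volume is included, which is the kind of birth-and-merge behaviour that persistent homology records.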
These findings highlight the potential of persistent homology as an effective and interpretable approach for analyzing complex 3D medical images and performing brain tumor classification.
[206] AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison
Xi Jiang,Yue Guo,Jian Li,Yong Liu,Bin-Bin Gao,Hanqiu Deng,Jun Liu,Heng Zhao,Chengjie Wang,Feng Zheng
Main category: cs.CV
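TL;DR: This paper proposes AD-Copilot, an interactive multimodal large language model specialized for industrial anomaly detection (IAD). It strengthens fine-grained perception through visual in-context comparison and introduces the dedicated Chat-AD dataset and the new MMAD-BBox benchmark, clearly surpassing existing methods on several metrics and even exceeding human expert performance on some tasks.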
Details
Motivation: Existing MLLMs perform poorly on industrial anomaly detection because their training data skews toward general web images and because they encode images independently and compare them only in language space, which makes subtle visual differences hard to capture.
Method: Proposes AD-Copilot: (1) a data pipeline that mines knowledge from industrial images to produce the high-quality multimodal dataset Chat-AD; (2) a cross-attention-based Comparison Encoder for fine-grained visual comparison of image pairs; (3) a multi-stage training strategy that incorporates domain knowledge; (4) a new localization benchmark, MMAD-BBox, with bounding-box-based evaluation.
Result: Reaches 82.3% accuracy on the MMAD benchmark, leading all other models without data leakage; improves over the baseline by up to 3.35x on MMAD-BBox; generalizes well across specialized and general-purpose benchmarks; surpasses human experts on some IAD tasks.
Conclusion: AD-Copilot bridges the gap between general-purpose MLLMs and industrial anomaly detection and shows strong potential as a reliable assistant for real-world industrial inspection; all data and models will be open-sourced.
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline.
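The cross-attention between paired image features mentioned in this abstract can be pictured with a bare-bones Python sketch. It shows standard scaled dot-product cross-attention in which tokens of a query image attend to tokens of a reference image; learned projection matrices, multiple heads, and everything else specific to the paper's Comparison Encoder are omitted, and the function names and toy token values are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, reference_tokens):
    """Scaled dot-product cross-attention without learned projections.

    Each query-image token is replaced by a softmax-weighted average of the
    reference-image tokens, weighted by their dot-product similarity.
    """
    dim = len(query_tokens[0])
    attended = []
    for q in query_tokens:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in reference_tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, reference_tokens)) for j in range(dim)]
        )
    return attended


def comparison_features(query_tokens, reference_tokens):
    """Hypothetical comparison signal: query token minus its attended reference."""
    attended = cross_attention(query_tokens, reference_tokens)
    return [[q - a for q, a in zip(qt, at)] for qt, at in zip(query_tokens, attended)]


if __name__ == "__main__":
    normal = [[1.0, 0.0], [0.0, 1.0]]   # toy features of a defect-free reference
    test = [[1.0, 0.0], [0.9, 0.8]]     # second token deviates from the reference
    for row in comparison_features(test, normal):
        print([round(v, 3) for v in row])
```

Comparing in feature space this way, rather than only through generated text, is what lets small visual deviations register before any language decoding happens.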
AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.
[207] RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
Xuezhen Wang,Li Ma,Yulin Shen,Zeyu Wang,Pedro V. Sander
Main category: cs.CV
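TL;DR: This paper proposes RetimeGS, an improved 4D Gaussian Splatting method. By explicitly modeling how Gaussians change over time and adding optical flow-guided initialization and supervision and triple-rendering supervision, it mitigates temporal aliasing and reconstructs dynamic scenes at arbitrary timestamps without ghosting and with temporal consistency.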
Details
Motivation: Existing 4D Gaussian Splatting methods overfit to discrete frames and struggle to model continuous time, which produces ghosting during temporal interpolation; the underlying problem is temporal aliasing.
Method: Proposes RetimeGS, which explicitly defines the temporal behavior of the 3D Gaussian, introduces optical flow-guided initialization and supervision, adopts triple-rendering supervision, and adds other targeted strategies to improve temporal smoothness and consistency.
Result: On datasets with fast motion, non-rigid deformation, and severe occlusion, RetimeGS outperforms state-of-the-art methods in both reconstruction quality and temporal consistency.
Conclusion: Explicit temporal modeling combined with multi-level supervision effectively addresses temporal aliasing in 4D Gaussian Splatting and markedly improves the fidelity and coherence of rendering at arbitrary timestamps.
Abstract: Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow-guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.
[208] Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective
Junjie Zhou,Bao Xue,Meiling Wang,Wei Shao,Daoqiang Zhang
Main category: cs.CV
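TL;DR: This paper proposes HFGPI, a framework that models the biological hierarchy from genes to proteins to tissue images and combines molecular encoding, gene-regulated protein fusion, and protein-guided hypergraph learning to improve the accuracy of cancer prognosis prediction.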
Details
Motivation: Existing multimodal survival analysis methods overlook the proteome as a key intermediate layer between the genome and histopathological features, and they ignore the inherent biological hierarchy among data sources, which leads to incomplete information fusion.
Method: Proposes the hierarchical fusion framework HFGPI: (1) Molecular Tokenizer builds biologically informed representations for genes and proteins; (2) Gene-Regulated Protein Fusion (GRPF) uses graph-aware cross-attention to model gene-protein regulatory relationships; (3) Protein-Guided Hypergraph Learning (PGHL) uses hypergraph convolution to model higher-order morphological associations between proteins and image patches; the features are then fused layer by layer for survival prediction.
Result: Experiments on five benchmark datasets show that HFGPI significantly outperforms current state-of-the-art methods.
Conclusion: A multimodal fusion strategy guided by biological hierarchy improves cancer survival prediction and confirms the value of a systems biology perspective in computational pathology.
Abstract: To enhance the precision of cancer prognosis, recent research has increasingly focused on multimodal survival methods by integrating genomic data and histology images. However, current approaches overlook the fact that the proteome serves as an intermediate layer bridging genomic alterations and histopathological features while providing complementary biological information essential for survival prediction. This biological reality exposes another architectural limitation: existing integrative analysis studies fuse these heterogeneous data sources in a flat manner that fails to capture their inherent biological hierarchy. To address these limitations, we propose HFGPI, a hierarchical fusion framework that models the biological progression from genes to proteins to histology images from a systems biology perspective. Specifically, we introduce Molecular Tokenizer, a molecular encoding strategy that integrates identity embeddings with expression profiles to construct biologically informed representations for genes and proteins. We then develop Gene-Regulated Protein Fusion (GRPF), which employs graph-aware cross-attention with structure-preserving alignment to explicitly model gene-protein regulatory relationships and generate gene-regulated protein representations. Additionally, we propose Protein-Guided Hypergraph Learning (PGHL), which establishes associations between proteins and image patches, leveraging hypergraph convolution to capture higher-order protein-morphology relationships.
The final features are progressively fused across hierarchical layers to achieve precise survival outcome prediction. Extensive experiments on five benchmark datasets demonstrate the superiority of HFGPI over state-of-the-art methods.
[209] Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space
Quoc-Huy Trinh,Xi Ding,Yang Liu,Zhenyue Qin,Xingjian Li,Gorkem Durak,Halil Ertugrul Aktas,Elif Keles,Ulas Bagci,Min Xu
Main category: cs.CV
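TL;DR: This paper proposes an agentic pipeline that automatically generates 3D medical spatial VQA data and builds SpatialMed, the first benchmark for evaluating 3D spatial intelligence in medical MLLMs (nearly 10K question-answer pairs). Experiments show that existing models fall well short in medical 3D spatial reasoning.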
Details
Motivation: Visual spatial intelligence is critical for medical image interpretation, yet it remains largely unstudied for multimodal large language models (MLLMs) on 3D imaging, mainly because datasets with structured 3D spatial annotations are lacking.
Method: Designs an agentic pipeline that automatically synthesizes spatial visual question-answering (VQA) data through multi-agent collaboration, computational tools such as volume and distance calculators, and validation by expert radiologists.
Result: Builds SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, with nearly 10,000 question-answer pairs across multiple organs and tumor types; evaluation and in-depth analysis of 14 state-of-the-art MLLMs show that current models are weak at medical 3D spatial reasoning.
Conclusion: Existing medical MLLMs generally lack reliable 3D spatial reasoning, and high-quality spatially annotated data and targeted modeling methods are needed to improve their clinical usefulness.
Abstract: Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations on 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.
[210] ALTIS: Automated Loss Triage and Impact Scoring from Sentinel-1 SAR for Property-Level Flood Damage Assessment
Amogh Vinaykumar,Prem Kamasani
Main category: cs.CV
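TL;DR: This paper proposes ALTIS, a five-stage pipeline built on Sentinel-1 SAR data that produces property-level flood impact scores and a prioritized list for insurance claims within 24-48 hours of a disaster, substantially improving inspection efficiency.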
Details
Motivation: Existing SAR flood detection research relies on academic metrics (such as IoU and F1) that do not meet the insurance industry's need for fast, actionable, property-level decision support; traditional manual inspection is slow, expensive, and geographically constrained.
Method: Proposes the five-stage ALTIS pipeline: (i) multi-temporal dual-polarization SAR change detection (VV/VH intensity plus InSAR coherence); (ii) physics-informed depth estimation fused with high-resolution DEMs; (iii) property-level statistics from parcel footprints; (iv) depth-damage calibration against NFIP claims data; (v) confidence-weighted triage ranking. Also defines Insurance-Grade Flood Triage (IGFT) and the new IRR and TES evaluation metrics.
Result: In a Hurricane Harvey case study (2017, Harris County, Texas), ALTIS is projected to achieve an Inspection Reduction Rate (IRR) of about 0.52 while maintaining 90% recall of high-severity claims, which would cut more than half of unnecessary field dispatches.
Conclusion: ALTIS is the first to tightly couple SAR remote sensing with the insurance claims workflow, establishing a methodological baseline for turning earth observation research into quantifiable insurance benefit.
Abstract: Floods are among the costliest natural catastrophes globally, yet the property and casualty insurance industry's post-event response remains heavily reliant on manual field inspection: slow, expensive, and geographically constrained. Satellite Synthetic Aperture Radar (SAR) offers cloud-penetrating, all-weather imaging uniquely suited to rapid post-flood assessment, but existing research evaluates SAR flood detection against academic benchmarks such as IoU and F1-score that do not capture insurance-workflow requirements. We present ALTIS: a five-stage pipeline transforming raw Sentinel-1 GRD and SLC imagery into property-level impact scores within 24-48 hours of flood peak. Unlike prior approaches producing pixel-level maps or binary outputs, ALTIS delivers a ranked, confidence-scored triage list consumable by claims platforms, integrating (i) multi-temporal SAR change detection using dual-polarization VV/VH intensity and InSAR coherence, (ii) physics-informed depth estimation fusing flood extent with high-resolution DEMs, (iii) property-level zonal statistics from parcel footprints, (iv) depth-damage calibration against NFIP claims, and (v) confidence-scored triage ranking. We formally define Insurance-Grade Flood Triage (IGFT) and introduce the Inspection Reduction Rate (IRR) and Triage Efficiency Score (TES).
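The abstract names the Inspection Reduction Rate (IRR) without giving its formula. The Python sketch below assumes one plausible reading: rank properties by impact score, inspect from the top until a target share of the truly high-severity claims has been reached, and report the fraction of properties that never needed a visit. This definition, the function name, and the toy numbers are assumptions for illustration and may differ from the paper's formal definition.

```python
import math


def inspection_reduction_rate(scores, high_severity, target_recall=0.9):
    """Fraction of properties left uninspected at a target recall (assumed definition).

    scores        -- triage score per property (higher means inspect sooner)
    high_severity -- 1 if the property truly had a high-severity claim, else 0
    target_recall -- share of high-severity claims that must be reached
    """
    if len(scores) != len(high_severity):
        raise ValueError("scores and high_severity must have the same length")
    total_high = sum(high_severity)
    if total_high == 0:
        return 1.0  # nothing severe to find, so no inspection is required
    needed = math.ceil(target_recall * total_high)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = 0
    for inspected, idx in enumerate(ranked, start=1):
        found += high_severity[idx]
        if found >= needed:
            return 1.0 - inspected / len(scores)
    return 0.0


if __name__ == "__main__":
    # Ten toy properties, three of them truly high severity.
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    print(inspection_reduction_rate(scores, labels, target_recall=0.9))
```

Under this reading, a better ranking pushes high-severity properties toward the top of the list, so fewer inspections are needed to hit the recall target and the rate rises.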
Using Hurricane Harvey (2017) across Harris County, Texas, we present preliminary analysis grounded in validated sub-components suggesting ALTIS is designed to achieve an IRR of approximately 0.52 at 90% recall of high-severity claims, potentially eliminating over half of unnecessary dispatches. By blending SAR flood intelligence with the realities of claims management, ALTIS establishes a methodological baseline for translating earth observation research into measurable insurance outcomes.
[211] Efficient Semi-Automated Material Microstructure Analysis Using Deep Learning: A Case Study in Additive Manufacturing
Sanjeev S. Navaratna,Nikhil Thawari,Gunashekhar Mari,Amritha V P,Murugaiyan Amirthalingam,Rohit Batra
Main category: cs.CV
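TL;DR: This paper proposes an active-learning-based semi-automated image segmentation framework that combines a U-Net, an interactive annotation interface, and a new SMILE sampling strategy, substantially improving defect segmentation accuracy for AM materials while sharply reducing manual annotation cost.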
Details
Motivation: Materials micrographs are highly heterogeneous and high-quality labeled data are scarce, so conventional image processing and deep learning methods generalize poorly and depend on slow, labor-intensive manual annotation. Method: Builds an active-learning segmentation pipeline that combines a U-Net model, an interactive user-correction interface, and core-set image selection (in particular SMILE: maximin Latin hypercube sampling from embeddings), and compares three subset selection strategies over six refinement rounds. Result: The SMILE strategy raises the macro F1 score from 0.74 to 0.93 and cuts manual annotation time by about 65%; the segmentation results are further used for defect classification and mapping to process parameters. Conclusion: The framework substantially reduces the labeling burden while maintaining scalability and robustness, and applies to image analysis across a range of materials systems. Abstract: Image segmentation is fundamental to microstructural analysis for defect identification and structure-property correlation, yet remains challenging due to pronounced heterogeneity in materials images arising from varied processing and testing conditions. Conventional image processing techniques often fail to capture such complex features, rendering them ineffective for large-scale analysis. Even deep learning approaches struggle to generalize across heterogeneous datasets due to scarcity of high-quality labeled data. Consequently, segmentation workflows often rely on manual expert-driven annotations, which are labor intensive and difficult to scale. Using an additive manufacturing (AM) dataset as a case study, we present a semi-automated active learning based segmentation pipeline that integrates a U-Net based convolutional neural network with an interactive user annotation and correction interface and a representative core-set image selection strategy. The active learning workflow iteratively updates the model by incorporating user corrected segmentations into the training pool while the core-set strategy identifies representative images for annotation. Three subset selection strategies (manual selection, uncertainty-driven sampling, and the proposed maximin Latin hypercube sampling from embeddings (SMILE) method) were evaluated over six refinement rounds. The SMILE strategy consistently outperformed other approaches, improving the macro F1 score from 0.74 to 0.93 while reducing manual annotation time by about 65 percent. 
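The abstract does not spell out how SMILE turns embeddings into a selection, so the following is only a guess at the general idea: draw several Latin hypercube designs in the (rescaled) embedding space, keep the one whose closest pair of points is farthest apart (the maximin criterion), and pick the image nearest each design point. All function names, the random-restart search, and the nearest-neighbour matching are assumptions made for illustration.

```python
import math
import random

def latin_hypercube(n, dim, rng):
    """One random Latin hypercube design: n points in [0, 1)^dim,
    with exactly one point per stratum along every axis."""
    columns = []
    for _ in range(dim):
        strata = list(range(n))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n)]

def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two points (needs n >= 2)."""
    return min(math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])

def maximin_lhs(n, dim, rng, candidates=50):
    """Keep the candidate design whose closest pair is farthest apart."""
    designs = (latin_hypercube(n, dim, rng) for _ in range(candidates))
    return max(designs, key=min_pairwise_distance)

def select_images(embeddings, n, seed=0):
    """Pick n distinct image indices whose min-max scaled embeddings
    lie nearest the maximin design points (hypothetical matching rule)."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    lows = [min(e[d] for e in embeddings) for d in range(dim)]
    highs = [max(e[d] for e in embeddings) for d in range(dim)]
    scaled = [
        tuple((e[d] - lows[d]) / ((highs[d] - lows[d]) or 1.0) for d in range(dim))
        for e in embeddings
    ]
    chosen = []
    for target in maximin_lhs(n, dim, rng):
        remaining = [i for i in range(len(scaled)) if i not in chosen]
        chosen.append(min(remaining, key=lambda i: math.dist(scaled[i], target)))
    return chosen

# Toy 2-D "embeddings" for 30 images; select 5 to annotate.
data_rng = random.Random(42)
embeddings = [(data_rng.random(), data_rng.random()) for _ in range(30)]
picked = select_images(embeddings, 5)
```

In practice the embeddings would come from the segmentation network and would likely be reduced to a few dimensions first, since Latin hypercube designs spread poorly in very high-dimensional spaces.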
The segmented defect regions were further analyzed using a coupled classification model to categorize defects based on microstructural characteristics and map them to corresponding AM process parameters. The proposed framework reduces labeling effort while maintaining scalability and robustness and is broadly applicable to image based analysis across diverse materials systems.[212] MOGeo: Beyond One-to-One Cross-View Object Geo-localization
Bo Lv,Qingwang Zhang,Le Wu,Yuanyuan Li,Yingying Zhu
Main category: cs.CV
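TL;DR: This paper proposes a new task, Cross-View Multi-Object Geo-Localization (CVMOGL), builds the benchmark CMLocation, and introduces the method MOGeo, validating its effectiveness in complex real-world scenarios.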
Details
Motivation: Existing cross-view object geo-localization methods usually assume that the query image contains a single object, which does not meet the multi-object localization needs of real applications, so a more realistic task is needed. Method: Proposes the new CVMOGL task, builds the benchmark CMLocation with two subsets, and designs MOGeo, a new cross-view multi-object geo-localization method. Result: Extensive experiments across multiple application scenarios show that MOGeo outperforms existing SOTA methods in the more realistic setting, although the problem remains challenging. Conclusion: CVMOGL is a more realistic and challenging new task, and CMLocation and MOGeo provide a foundation and direction for follow-up research. Abstract: Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object, which does not align with the complex, multi-object geo-localization requirements in real-world applications, making them unsuitable for practical scenarios. To bridge the gap between the realistic setting and the existing task, we propose a new task, called Cross-View Multi-Object Geo-Localization (CVMOGL). To advance the CVMOGL task, we first construct a benchmark, CMLocation, which includes two datasets: CMLocation-V1 and CMLocation-V2. Furthermore, we propose a novel cross-view multi-object geo-localization method, MOGeo, and benchmark it against existing state-of-the-art methods. Extensive experiments are conducted under various application scenarios to validate the effectiveness of our method. The results demonstrate that cross-view object geo-localization in the more realistic setting remains a challenging problem, encouraging further research in this area.[213] VFM-Loc: Zero-Shot Cross-View Geo-Localization via Aligning Discriminative Visual Hierarchies
Jun Lu,Zehao Sang,Haoqi Wei,Xiangyun Liu,Kun Zhu,Haitao Guo,Zhihui Gong,Lei Ding
Main category: cs.CV
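TL;DR: This paper proposes VFM-Loc, a training-free framework for zero-shot cross-view geo-localization (CVGL) that uses the general-purpose visual representations of vision foundation models (VFMs); through hierarchical clue extraction and statistical manifold alignment, it substantially improves localization accuracy under large viewpoint differences without supervised training.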
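The hierarchical clue extraction in this entry uses Generalized Mean (GeM) pooling. A minimal sketch of the standard GeM formula follows, assuming a feature map stored as a list of spatial locations, each holding one activation per channel; the function name and the default exponent are illustrative choices, and the paper's Scale-Weighted RMAC and alignment steps are not shown.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized Mean pooling: (mean(x ** p)) ** (1 / p) per channel.
    p = 1 is average pooling; large p approaches max pooling."""
    n = len(feature_map)
    channels = len(feature_map[0])
    pooled = []
    for c in range(channels):
        # Clamp to eps so non-positive activations do not break the power.
        mean_pow = sum(max(loc[c], eps) ** p for loc in feature_map) / n
        pooled.append(mean_pow ** (1.0 / p))
    return pooled

# Two spatial locations, two channels.
fmap = [[1.0, 2.0], [3.0, 2.0]]
avg_like = gem_pool(fmap, p=1.0)
gem_default = gem_pool(fmap, p=3.0)
max_like = gem_pool(fmap, p=50.0)
```

Raising `p` shifts the pooled value for the first channel from the mean (2.0) toward the maximum (3.0), which is why GeM is often used to emphasize the most distinctive responses.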
Details
Motivation: Existing supervised methods perform well on closed datasets but generalize poorly in real scenes because of large viewpoint differences and severe dataset bias, so a training-free, well-generalizing zero-shot solution is needed. Method: Proposes the VFM-Loc framework: 1) hierarchical clue extraction using Generalized Mean pooling and Scale-Weighted RMAC; 2) a statistical manifold alignment pipeline based on domain-wise PCA and Orthogonal Procrustes analysis, which linearly aligns heterogeneous features in a shared metric space. Result: Achieves strong zero-shot performance on standard benchmarks and exceeds supervised methods by over 20% in Recall@1 on the LO-UCV dataset, which has large oblique angles. Conclusion: Principled alignment of pre-trained features can effectively bridge the cross-view gap, establishing a robust, training-free paradigm for real-world CVGL. Abstract: Cross-View Geo-Localization (CVGL) in remote sensing aims to locate a drone-view query by matching it to geo-tagged satellite images. Although supervised methods have achieved strong results on closed-set benchmarks, they often fail to generalize to unconstrained, real-world scenarios due to severe viewpoint differences and dataset bias. To overcome these limitations, we present VFM-Loc, a training-free framework for zero-shot CVGL that leverages the generalizable visual representations from vision foundation models (VFMs). VFM-Loc identifies and matches discriminative visual clues across different viewpoints through a progressive alignment strategy. First, we design a hierarchical clue extraction mechanism using Generalized Mean pooling and Scale-Weighted RMAC to preserve distinctive visual clues across scales while maintaining hierarchical confidence. Second, we introduce a statistical manifold alignment pipeline based on domain-wise PCA and Orthogonal Procrustes analysis, linearly aligning heterogeneous feature distributions in a shared metric space. Experiments demonstrate that VFM-Loc exhibits strong zero-shot accuracy on standard benchmarks and surpasses supervised methods by over 20% in Recall@1 on the challenging LO-UCV dataset with large oblique angles. This work highlights that principled alignment of pre-trained features can effectively bridge the cross-view gap, establishing a robust and training-free paradigm for real-world CVGL. The relevant code is made available at: https://github.com/DingLei14/VFM-Loc.[214] Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery
Bohan Zhang,Weidong Tang,Zhixiang Chi,Yi Jin,Zhenbo Li,Yang Wang,Yanan Wu
Main category: cs.CV
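TL;DR: This paper proposes the Learning through Creation (LTC) framework, which uses online pseudo-unknown sample generation and a dual max-margin loss to resolve the mismatch between the offline training objective and the online discovery objective, substantially improving all-class accuracy on seven benchmarks.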
Details
Motivation: Existing OCD methods do not explicitly model the online discovery task during offline training, which misaligns the optimization objectives; they also often rely on hash encodings or heavy feature compression, which harms representational capacity. Method: Proposes the LTC framework: a lightweight online pseudo-unknown generator based on kernel-energy minimization and entropy maximization (MKEE) that synthesizes pseudo-novel samples dynamically; end-to-end training with a dual max-margin loss with adaptive thresholding; hash-free and purely feature-based throughout. Result: On seven OCD benchmarks, all-class accuracy improves by 1.5% to 13.1% over prior methods, clearly surpassing the SOTA. Conclusion: By embedding "creating the unknown" into offline learning, LTC bridges the gap between training and inference objectives and confirms that explicitly modeling discovery ability is key to OCD performance. Abstract: On-the-Fly Category Discovery (OCD) aims to recognize known classes while simultaneously discovering emerging novel categories during inference, using supervision only from known classes during offline training. Existing approaches rely either on fixed label supervision or on diffusion-based augmentations to enhance the backbone, yet none of them explicitly train the model to perform the discovery task required at test time. It is fundamentally unreasonable to expect a model optimized on limited labeled data to carry out a qualitatively different discovery objective during inference. This mismatch creates a clear optimization misalignment between the offline learning stage and the online discovery stage. In addition, prior methods often depend on hash-based encodings or severe feature compression, which further limits representational capacity. To address these issues, we propose Learning through Creation (LTC), a fully feature-based and hash-free framework that injects novel-category awareness directly into offline learning. At its core is a lightweight, online pseudo-unknown generator driven by kernel-energy minimization and entropy maximization (MKEE). Unlike previous methods that generate synthetic samples once before training, our generator evolves jointly with the model dynamics and synthesizes pseudo-novel instances on the fly at negligible cost. These samples are incorporated through a dual max-margin objective with adaptive thresholding, strengthening the model's ability to delineate and detect unknown regions through explicit creation. 
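The abstract does not define the dual max-margin objective, so the sketch below shows only one generic way such a loss could look: two hinge terms that push known-class confidence scores above a threshold and pseudo-unknown scores below it, with the threshold recomputed from each batch. The midpoint rule for the threshold, the margin value, and all names are assumptions rather than the LTC formulation.

```python
def dual_max_margin_loss(known_scores, pseudo_unknown_scores, margin=0.2):
    """Hypothetical two-sided hinge loss with a batch-adaptive threshold.
    Returns (loss, threshold)."""
    mean_known = sum(known_scores) / len(known_scores)
    mean_unknown = sum(pseudo_unknown_scores) / len(pseudo_unknown_scores)
    # Assumed adaptive rule: threshold sits midway between the two group means.
    threshold = 0.5 * (mean_known + mean_unknown)
    # Known samples should score at least `margin` above the threshold.
    known_term = sum(max(0.0, threshold + margin - s) for s in known_scores) / len(known_scores)
    # Pseudo-unknown samples should score at least `margin` below it.
    unknown_term = sum(
        max(0.0, s - (threshold - margin)) for s in pseudo_unknown_scores
    ) / len(pseudo_unknown_scores)
    return known_term + unknown_term, threshold

# Well-separated batch: both hinge terms vanish.
separated_loss, separated_thr = dual_max_margin_loss([0.9, 0.8], [0.1, 0.2])
# Fully overlapping batch: both groups violate the margin.
overlap_loss, overlap_thr = dual_max_margin_loss([0.5], [0.5])
```

The point of the two-sided form is that synthesized pseudo-unknown samples contribute a training signal of their own instead of only acting as negatives for the known classes.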
Extensive experiments across seven benchmarks show that LTC consistently outperforms prior work, achieving improvements ranging from 1.5 percent to 13.1 percent in all-class accuracy. The code is available at https://github.com/brandinzhang/LTC.[215] Geo-ID: Test-Time Geometric Consensus for Cross-View Consistent Intrinsics
Alara Dirik,Stefanos Zafeiriou
Main category: cs.CV
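TL;DR: This paper proposes Geo-ID, a framework that couples the independent predictions of single-view intrinsic image predictors at test time through sparse geometric correspondences, producing multi-view-consistent PBR parameter decompositions without retraining or inverse rendering.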
Details
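Motivation: Existing single-view intrinsic image decomposition methods give inconsistent predictions across views, which hurts downstream applications such as editable neural scenes and 3D reconstruction; video models improve cross-frame consistency but require dense, ordered video sequences and substantial compute, making them ill-suited to sparse, unordered image sets. Method: Proposes Geo-ID, a test-time framework that takes pretrained single-view intrinsic predictors, builds uncertainty-aware consensus targets from sparse geometric correspondences, and couples the independent per-view predictions to obtain cross-view consistent decompositions; the method is model-agnostic and needs no retraining or inverse rendering. Result: On synthetic benchmarks and real scenes, cross-view intrinsic consistency improves substantially as the number of views increases, while single-view decomposition performance stays comparable; the resulting consistent intrinsics support coherent appearance editing and relighting in downstream neural scene representations. Conclusion: Geo-ID offers an efficient, general, plug-and-play test-time consistency enhancement for multi-view intrinsic image decomposition, extending the usefulness of single-view models in 3D understanding and editing tasks. Abstract: Intrinsic image decomposition aims to estimate physically based rendering (PBR) parameters such as albedo, roughness, and metallicity from images. While recent methods achieve strong single-view predictions, applying them independently to multiple views of the same scene often yields inconsistent estimates, limiting their use in downstream applications such as editable neural scenes and 3D reconstruction. Video-based models can improve cross-frame consistency but require dense, ordered sequences and substantial compute, limiting their applicability to sparse, unordered image collections. We propose Geo-ID, a novel test-time framework that repurposes pretrained single-view intrinsic predictors to produce cross-view consistent decompositions by coupling independent per-view predictions through sparse geometric correspondences that form uncertainty-aware consensus targets. Geo-ID is model-agnostic, requires no retraining or inverse rendering, and applies directly to off-the-shelf intrinsic predictors. Experiments on synthetic benchmarks and real-world scenes demonstrate substantial improvements in cross-view intrinsic consistency as the number of views increases, while maintaining comparable single-view decomposition performance. We further show that the resulting consistent intrinsics enable coherent appearance editing and relighting in downstream neural scene representations.[216] Zero-Forgetting CISS via Dual-Phase Cognitive Cascades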
Yuquan Lu,Yifu Guo,Zishan Xu,Siyu Zhang,Yu Huo,Siyue Chen,Siyan Wu,Chenghua Zhu,Ruixuan Wang
Main category: cs.CV
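TL;DR: This paper proposes Cognitive Cascade Segmentation (CogCaS), a new method for addressing catastrophic forgetting in continual semantic segmentation (CSS). It uses a dual-phase cascade that decouples the task into class-existence detection and class-specific segmentation, enabling more effective continual learning in the class-incremental setting.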
Details
Motivation: Existing class-incremental semantic segmentation (CISS) methods, especially Strict Parameter Isolation (SPI) strategies, are prone to catastrophic forgetting under Softmax classification heads and model task affiliation probability inadequately, which limits continual learning. Method: Proposes the CogCaS framework, which follows the dual-phase intuition of human annotators and decouples the CSS task into two phases: the first performs class-existence detection (deciding whether a class is present in the image), and the second performs class-specific segmentation (pixel-level segmentation of the detected classes), mitigating forgetting while supporting new classes. Result: CogCaS is validated on the PASCAL VOC 2012 and ADE20K benchmarks and clearly outperforms existing SOTA methods, especially with long sequences of incremental tasks. Conclusion: By decoupling the task structure, CogCaS effectively mitigates catastrophic forgetting in CISS and offers a new theoretical perspective and a practical framework for continual semantic segmentation. Abstract: Continual semantic segmentation (CSS) is a cornerstone task in computer vision that enables a large number of downstream applications, but faces the catastrophic forgetting challenge. In conventional class-incremental semantic segmentation (CISS) frameworks using Softmax-based classification heads, catastrophic forgetting originates from Catastrophic forgetting and task affiliation probability. We formulate these problems and provide a theoretical analysis to more deeply understand the limitations in existing CISS methods, particularly Strict Parameter Isolation (SPI). To address these challenges, we follow a dual-phase intuition from human annotators, and introduce Cognitive Cascade Segmentation (CogCaS), a novel dual-phase cascade formulation for CSS tasks in the CISS setting. By decoupling the task into class-existence detection and class-specific segmentation, CogCaS enables more effective continual learning, preserving previously learned knowledge while incorporating new classes. Using two benchmark datasets, PASCAL VOC 2012 and ADE20K, we have shown significant improvements in a variety of challenging scenarios, particularly those with long sequences of incremental tasks, when compared to existing state-of-the-art methods. Our code will be made publicly available upon paper acceptance.[217] Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering
Lin Fan,Yafei Ou,Zhipeng Deng,Pengyu Dai,Hou Chongxian,Jiale Yan,Yaqian Li,Kaiwen Long,Xun Gong,Masayuki Ikebe,Yefeng Zheng
Main category: cs.CV
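TL;DR: This paper presents Step-CoT, a large-scale, expert-annotated dataset of structured multi-step chain-of-thought (CoT) medical reasoning that ties clinical diagnostic workflows to imaging evidence, together with a teacher-student framework with a dynamic graph-structured focusing mechanism, improving the reasoning accuracy and interpretability of medical visual question answering (VQA).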
Details
Motivation: Chains of thought in existing medical VQA are mostly free-form and do not model the structured diagnostic process used in clinical practice, making the reasoning untraceable and uninterpretable. Method: Builds the Step-CoT dataset (10K clinical cases, 70K VQA pairs), where each sample carries a structured multi-step CoT organized by diagnostic workflow; proposes a teacher-student training framework with a dynamic graph-structured focusing mechanism that automatically identifies and emphasizes key diagnostic steps. Result: On medical VQA, using Step-CoT with the proposed framework improves reasoning accuracy and the quality of human-understandable reasoning paths; model outputs follow clinical logic more closely and can be traced back to imaging evidence. Conclusion: Structured, workflow-aligned multi-step CoT supervision is an effective way to improve reasoning ability and trustworthiness in medical VQA; Step-CoT provides a high-quality benchmark and a new paradigm for interpretable AI-assisted diagnosis. Abstract: Chain-of-thought (CoT) reasoning has advanced medical visual question answering (VQA), yet most existing CoT rationales are free-form and fail to capture the structured reasoning process clinicians actually follow. This work asks: Can traceable, multi-step reasoning supervision improve reasoning accuracy and the interpretability of Medical VQA? To this end, we introduce Step-CoT, a large-scale medical reasoning dataset with expert-curated, structured multi-step CoT aligned to clinical diagnostic workflows, implicitly grounding the model's reasoning in radiographic evidence. Step-CoT comprises more than 10K real clinical cases and 70K VQA pairs organized around diagnostic workflows, providing supervised intermediate steps that guide models to follow valid reasoning trajectories. To effectively learn from Step-CoT, we further introduce a teacher-student framework with a dynamic graph-structured focusing mechanism that prioritizes diagnostically informative steps while filtering out less relevant contexts. Our experiments show that using Step-CoT can improve reasoning accuracy and interpretability. Benchmark: github.com/hahaha111111/Step-CoT. Dataset Card: huggingface.co/datasets/fl-15o/Step-CoT.[218] Dual-Strategy Improvement of YOLOv11n for Multi-Scale Object Detection in Remote Sensing Images
Shuaiyu Zhu,Sergey Ablameyko
Main category: cs.CV
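TL;DR: To address the insufficient accuracy of YOLOv11n for object detection in remote sensing images, this paper proposes two improvement strategies: the first introduces the LSKA mechanism and the Gold-YOLO structure; the second combines Gold-YOLO with a MultiSEAMHead detection head. While keeping the model lightweight, they raise mAP@0.5 by 1.3% and 1.8%, respectively.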
Details
Motivation: The YOLOv11n model has insufficient detection accuracy on satellite remote sensing images, which are high-resolution, contain complex scenes, and show large variations in target scale. Method: Proposes two improvement strategies: Method 1 introduces a Large Separable Kernel Attention (LSKA) mechanism into the backbone to strengthen small-object feature extraction and embeds a Gold-YOLO structure in the neck for multi-scale feature fusion; Method 2 also uses the Gold-YOLO structure and adds a MultiSEAMHead detection head to further improve representation and detection of small and multi-scale objects. Result: Experiments on the DOTAv1 dataset show that, while keeping the model lightweight, the two methods improve mAP@0.5 by 1.3% and 1.8%, respectively, over the YOLOv11n baseline. Conclusion: The proposed strategies effectively improve YOLOv11n's accuracy for object detection in remote sensing images and have practical value. Abstract: Satellite remote sensing images pose significant challenges for object detection due to their high resolution, complex scenes, and large variations in target scales. To address the insufficient detection accuracy of the YOLOv11n model in remote sensing imagery, this paper proposes two improvement strategies. Method 1: (a) a Large Separable Kernel Attention (LSKA) mechanism is introduced into the backbone network to enhance feature extraction for small objects; (b) a Gold-YOLO structure is incorporated into the neck network to achieve multi-scale feature fusion, thereby improving the detection performance of objects at different scales. Method 2: (a) the Gold-YOLO structure is also integrated into the neck network; (b) a MultiSEAMHead detection head is combined to further strengthen the representation and detection capability for small and multi-scale objects. To verify the effectiveness of the proposed improvements, experiments are conducted on the DOTAv1 dataset. The results show that, while maintaining the lightweight advantage of the model, the proposed methods improve detection accuracy (mAP@0.5) by 1.3% and 1.8%, respectively, compared with the baseline YOLOv11n, demonstrating the effectiveness and practical value of the proposed approaches for object detection in remote sensing images.[219] Fine-tuning MLLMs Without Forgetting Is Easier Than You Think
He Li,Yuhui Zhang,Xiaohan Wang,Kaifeng Lyu,Serena Yeung-Levy
Main category: cs.CV
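TL;DR: By adjusting the fine-tuning recipe for multimodal large language models (MLLMs), this paper effectively mitigates catastrophic forgetting, and it proposes a data-hybrid training strategy to address task-specific overfitting, substantially improving continual learning performance.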
Details
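Motivation: Prior work widely assumes that MLLMs are prone to catastrophic forgetting during fine-tuning and need complex mechanisms to mitigate it; this paper tests whether simple fine-tuning adjustments suffice across different distribution shifts and explores more robust, practical adaptation methods. Method: Designs a 2x2 experimental framework that evaluates the model on in-distribution/out-of-distribution combinations of image and text inputs; applies regularization such as parameter freezing and low learning rates; proposes a data-hybrid training strategy to mitigate task-specific overfitting; extends this strategy to continual learning. Result: Appropriate regularization effectively prevents forgetting caused by OOD images; a task-specific-overfitting form of forgetting is identified for ID images with OOD text; the data-hybrid strategy substantially improves VQA and continual learning performance, surpassing existing methods that rely on complex auxiliary mechanisms. Conclusion: MLLMs are inherently quite robust; without complex architectures or mechanisms, optimizing the fine-tuning recipe alone (e.g., regularization and data mixing) efficiently mitigates the various forms of forgetting, giving simple, practical guidelines for deployment. Abstract: This paper demonstrates that simple adjustments of the fine-tuning recipes of multimodal large language models (MLLMs) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2x2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting to task-specific overfitting and address this issue by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods with complex auxiliary mechanisms. In general, our findings challenge the prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.[220] SCoCCA: Multi-modal Sparse Concept Decomposition via Canonical Correlation Analysis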
Ehud Gordon,Meir Yossef Levi,Guy Gilboa
Main category: cs.CV
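TL;DR: This paper proposes Concept CCA (CoCCA) and its sparse variant SCoCCA, which combine concept-based explainability with Canonical Correlation Analysis (CCA) to improve the interpretability and alignment of cross-modal embeddings in vision-language models; the approach needs no additional training and reaches SOTA on concept discovery and manipulation tasks.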
Details
Motivation: Existing concept-based explainability methods are limited to a single modality (images) and struggle with the distribution mismatch between modalities (the modality gap) in vision-language models such as CLIP, which makes cross-modal reasoning hard to interpret; CCA can align heterogeneous feature distributions but has not been used for concept-level multi-modal analysis. Method: Shows that the CCA objective is intrinsically related to the InfoNCE objective and, on that basis, proposes the training-free CoCCA framework: using pretrained CLIP embeddings, it aligns the visual and textual concept subspaces via CCA and performs concept decomposition; adding a sparsity constraint yields SCoCCA, which produces more disentangled and discriminative concepts. Result: Achieves state-of-the-art performance on concept discovery, reconstruction, and semantic manipulation (e.g., concept ablation) tasks, supporting the effectiveness and generality of cross-modal concept explanations. Conclusion: CCA can serve as a lightweight, training-free cross-modal alignment tool; combining it with concept-based explainability helps bridge the modality gap and supports trustworthy deployment of multi-modal AI in safety-critical settings. Abstract: Interpreting the internal reasoning of vision-language models is essential for deploying AI in safety-critical domains. Concept-based explainability provides a human-aligned lens by representing a model's behavior through semantically meaningful components. However, existing methods are largely restricted to images and overlook the cross-modal interactions. Text-image embeddings, such as those produced by CLIP, suffer from a modality gap, where visual and textual features follow distinct distributions, limiting interpretability. Canonical Correlation Analysis (CCA) offers a principled way to align features from different distributions, but has not been leveraged for multi-modal concept-level analysis. We show that the objectives of CCA and InfoNCE are closely related, such that optimizing CCA implicitly optimizes InfoNCE, providing a simple, training-free mechanism to enhance cross-modal alignment without affecting the pre-trained InfoNCE objective. Motivated by this observation, we couple concept-based explainability with CCA, introducing Concept CCA (CoCCA), a framework that aligns cross-modal embeddings while enabling interpretable concept decomposition. We further extend it and propose Sparse Concept CCA (SCoCCA), which enforces sparsity to produce more disentangled and discriminative concepts, facilitating improved activation, ablation, and semantic manipulation. 
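For reference, the standard CCA objective that this line of work builds on can be written as below. The notation is assumed here and not taken from the paper: $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariance matrices of the image and text embeddings, $\Sigma_{xy}$ is their cross-covariance, and $w_x$, $w_y$ are the projection directions being sought. The abstract does not state the form of the sparsity penalty used in SCoCCA, so it is not shown.

```latex
\[
(w_x^\ast, w_y^\ast) \;=\; \arg\max_{w_x,\, w_y}\;
\frac{w_x^\top \Sigma_{xy}\, w_y}
     {\sqrt{w_x^\top \Sigma_{xx}\, w_x}\;\sqrt{w_y^\top \Sigma_{yy}\, w_y}}
\]
```

The ratio is the correlation between the two projected views, so maximizing it finds the pair of directions along which image and text embeddings agree most.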
Our approach generalizes concept-based explanations to multi-modal embeddings and achieves state-of-the-art performance in concept discovery, evidenced by reconstruction and manipulation tasks such as concept ablation.[221] Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen
Main category: cs.CV
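TL;DR: This paper proposes a safety guardrail for computer-using agents (CUAs) that targets the "visual confused deputy" problem caused by visual perception errors; a dual-channel contrastive classification method independently verifies the click target and the action intent, improving the safety of GUI interaction.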
Details
Motivation: Existing CUA systems treat screen-perception failures as a performance issue, whereas the authors argue they are fundamentally a security issue: an agent may perform dangerous actions because of visual misjudgment, malicious screenshot tampering, or TOCTOU race conditions. Method: Proposes a "dual-channel contrastive classification" guardrail that (1) independently evaluates the visual click target and (2) analyzes the agent's textual reasoning about the action against deployment-specific knowledge bases; execution is blocked if either channel detects risk. Result: In tests on controlled attacks, real GUI screenshots, and agent traces, the combined dual-channel guardrail clearly outperforms either single channel and effectively identifies and blocks actions that look visually harmless but are semantically dangerous. Conclusion: CUA safety depends not only on better action generation but also on independent verification of what the agent believes it is clicking and why; the work offers a new paradigm for GUI-level agent safety. Abstract: Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds, rather than whether the agent is acting on the correct object at all. We argue that this is fundamentally a security problem. We formalize the visual confused deputy: a failure mode in which an agent authorizes an action based on a misperceived screen state, due to grounding errors, adversarial screenshot manipulation, or time-of-check-to-time-of-use (TOCTOU) races. This gap is practically exploitable: even simple screen-level manipulations can redirect routine clicks into privileged actions while remaining indistinguishable from ordinary agent mistakes. To mitigate this threat, we propose the first guardrail that operates outside the agent's perceptual loop. Our method, dual-channel contrastive classification, independently evaluates (1) the visual click target and (2) the agent's reasoning about the action against deployment-specific knowledge bases, and blocks execution if either channel indicates risk. The key insight is that these two channels capture complementary failure modes: visual evidence detects target-level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls. Across controlled attacks, real GUI screenshots, and agent traces, the combined guardrail consistently outperforms either channel alone. Our results suggest that CUA safety requires not only better action generation, but independent verification of what the agent believes it is clicking and why. 
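The decision rule described above (block if either channel indicates risk) can be sketched as follows. This is a toy illustration under stated assumptions: embeddings are plain vectors, each knowledge base is a pair of hypothetical "safe" and "risky" exemplar lists, and the contrastive check is a nearest-exemplar comparison by cosine similarity. None of these names or choices come from the paper or its repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def channel_flags_risk(embedding, safe_examples, risky_examples):
    """Contrastive check (assumed form): risky when the closest risky
    exemplar is more similar than the closest safe exemplar."""
    best_safe = max(cosine(embedding, e) for e in safe_examples)
    best_risky = max(cosine(embedding, e) for e in risky_examples)
    return best_risky > best_safe

def guardrail_allows(click_embedding, reasoning_embedding, visual_kb, text_kb):
    """Allow the action only when neither channel reports risk."""
    visual_risk = channel_flags_risk(click_embedding, *visual_kb)
    text_risk = channel_flags_risk(reasoning_embedding, *text_kb)
    return not (visual_risk or text_risk)

# Toy 2-D knowledge bases: (safe exemplars, risky exemplars) per channel.
visual_kb = ([(1.0, 0.0)], [(0.0, 1.0)])
text_kb = ([(1.0, 0.0)], [(0.0, 1.0)])

benign = guardrail_allows((0.9, 0.1), (0.8, 0.2), visual_kb, text_kb)
# The click target looks harmless, but the stated reasoning resembles a risky intent.
hidden_intent = guardrail_allows((0.9, 0.1), (0.1, 0.9), visual_kb, text_kb)
```

Combining the channels with a logical OR trades some false blocks for coverage of both failure modes: a mismatched click target and a dangerous intent behind an innocuous-looking control.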
Model, benchmark, and code are available at https://github.com/vllm-project/semantic-router.[222] Multi-Modal Character Localization and Extraction for Chinese Text Recognition
Qilong Li,Chongsheng Zhang
Main category: cs.CV
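TL;DR: This paper proposes LER, a new method for Chinese scene text recognition (STR) that decouples each character and recognizes it independently while accounting for the complex inner structure of Chinese characters, substantially improving recognition of both Chinese and English text.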
Details
Motivation: Because Chinese characters have complex structures and a very large number of categories, existing English STR methods hit an accuracy bottleneck on Chinese, raising the question of whether English models should be applied directly to Chinese STR. Method: LER has three modules: Localization, Extraction, and Recognition. The localization module uses multimodal information to locate each character precisely; the extraction module separates all characters in parallel; the recognition module uses the distinctive inner structure of Chinese characters to predict the text. Result: On large-scale Chinese benchmarks, LER clearly outperforms existing methods; it also performs strongly on English text across six English benchmarks and the Union14M dataset. Conclusion: LER supports the value of models designed specifically for Chinese STR while generalizing well across languages, offering a unified and efficient solution for Chinese and English STR. Abstract: Scene text recognition (STR) methods have demonstrated excellent capability on English text images. However, the complex inner structures of Chinese characters and the large number of character categories make recognizing Chinese text in images challenging. Recently, studies have shown that the methods designed for English text recognition encounter an accuracy bottleneck when recognizing Chinese text images. This raises the question: Is it appropriate to apply the model developed for English to the Chinese STR task? To explore this issue, we propose a novel method named LER, which explicitly decouples each character and independently recognizes characters while taking into account the complex inner structures of Chinese. LER consists of three modules: Localization, Extraction, and Recognition. Firstly, the localization module utilizes multimodal information to determine the character's position precisely. Then, the extraction module dissociates all characters in parallel. Finally, the recognition module considers the unique inner structures of Chinese to provide the text prediction results. Extensive experiments conducted on large-scale Chinese benchmarks indicate that our method significantly outperforms existing methods. Furthermore, extensive experiments conducted on six English benchmarks and the Union14M benchmark show impressive results in English text recognition by LER. Code is available at https://github.com/Pandarenlql/LER.[223] MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal
Yiqi Nie,Fei Wang,Junjie Chen,Kun Li,Yudi Cai,Dan Guo,Chenglong Li,Meng Wang
Main category: cs.CV
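TL;DR: This paper introduces the Meme Reappraisal task: transforming negatively framed memes into constructive, positive ones while preserving the original scenario, entities, and structural layout. To support it, the authors build MER-Bench, a benchmark with fine-grained multimodal annotations, and design a structured evaluation framework based on judging by a multimodal large language model.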
Details
Motivation: Inspired by cognitive reappraisal in psychology, the work targets the lack of emotion-controllable, structure-preserving multimodal transformation in existing meme understanding and generation research. Method: Proposes the new Meme Reappraisal task; builds MER-Bench, a benchmark of real-world memes with emotion labels, positively rewritten text, visual editing specifications, and taxonomy annotations; designs a multi-dimensional MLLM-as-a-Judge evaluation framework (modality-level generation quality, affect controllability, structural fidelity, global affective alignment). Result: Experiments on several image-editing and multimodal-generation systems show that current methods still fall well short on structure preservation, semantic consistency, and affective transformation. Conclusion: MER-Bench lays a foundation for research on controllable meme editing and emotion-aware multimodal generation. Abstract: Memes represent a tightly coupled, multimodal form of social expression, in which visual context and overlaid text jointly convey nuanced affect and commentary. Inspired by cognitive reappraisal in psychology, we introduce Meme Reappraisal, a novel multimodal generation task that aims to transform negatively framed memes into constructive ones while preserving their underlying scenario, entities, and structural layout. Unlike prior works on meme understanding or generation, Meme Reappraisal requires emotion-controllable, structure-preserving multimodal transformation under multiple semantic and stylistic constraints. To support this task, we construct MER-Bench, a benchmark of real-world memes with fine-grained multimodal annotations, including source and target emotions, positively rewritten meme text, visual editing specifications, and taxonomy labels covering visual type, sentiment polarity, and layout structure. We further propose a structured evaluation framework based on a multimodal large language model (MLLM)-as-a-Judge paradigm, decomposing performance into modality-level generation quality, affect controllability, structural fidelity, and global affective alignment. Extensive experiments across representative image-editing and multimodal-generation systems reveal substantial gaps in satisfying the constraints of structural preservation, semantic consistency, and affective transformation. We believe MER-Bench establishes a foundation for research on controllable meme editing and emotion-aware multimodal generation. 
Our code is available at: https://github.com/one-seven17/MER-Bench.[224] CT-Conditioned Diffusion Prior with Physics-Constrained Sampling for PET Super-Resolution
Liutao Yang,Zi Wang,Peiyuan Jing,Xiaowen Wang,Javier A. Montoya-Zegarra,Kuangyu Shi,Daoqiang Zhang,Guang Yang
Main category: cs.CV
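TL;DR: This paper proposes a CT-guided, physics-constrained diffusion framework that addresses the under-constrained nature of PET super-resolution, which stems from the lack of paired multi-resolution data and from heterogeneous system physics, substantially improving reconstruction quality and clinical relevance.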
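The physics-constrained sampling in this entry enforces measurement consistency through gradient-based refinement against a forward model. The 1-D sketch below shows one such gradient step, x ← x − η·Aᵀ(Ax − y), under a deliberately simple assumed forward model A: a symmetric 3-tap blur standing in for the PSF, followed by 2x downsampling. The kernel, the function names, and the step size are illustrative assumptions; the paper's scanner-aware model and its coupling to the diffusion sampler are not reproduced here.

```python
KERNEL = (0.25, 0.5, 0.25)  # assumed symmetric 1-D stand-in for the PSF

def blur(x):
    """Zero-padded convolution with KERNEL (self-adjoint because KERNEL is symmetric)."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(KERNEL):
            j = i + k - 1
            if 0 <= j < n:
                acc += w * x[j]
        out.append(acc)
    return out

def forward(x, factor=2):
    """Toy forward model A: blur, then keep every `factor`-th sample."""
    return blur(x)[::factor]

def adjoint(r, n, factor=2):
    """Adjoint A^T: zero-insert back to length n, then blur."""
    up = [0.0] * n
    for i, v in enumerate(r):
        up[i * factor] = v
    return blur(up)

def data_consistency_step(x, y, step=1.0):
    """One gradient step on 0.5 * ||A x - y||^2."""
    residual = [a - b for a, b in zip(forward(x), y)]
    grad = adjoint(residual, len(x))
    return [xi - step * gi for xi, gi in zip(x, grad)]

def squared_residual(x, y):
    return sum((a - b) ** 2 for a, b in zip(forward(x), y))

# Simulate a low-resolution measurement of a small "lesion", then refine from zero.
x_true = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0]
y = forward(x_true)
x = [0.0] * len(x_true)
initial = squared_residual(x, y)
for _ in range(20):
    x = data_consistency_step(x, y)
final = squared_residual(x, y)
```

Data consistency alone cannot recover the detail lost to blur and downsampling, which is why the framework pairs it with a learned, CT-conditioned prior.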
Details
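Motivation: PET super-resolution is highly under-constrained: paired multi-resolution scans of the same subject are lacking, and effective resolution depends on scanner-specific physical factors (such as the point spread function (PSF), detector geometry, and acquisition parameters), which makes supervised learning difficult and leaves purely image-domain generative methods prone to hallucinated structures. Method: Models PET super-resolution as posterior inference under heterogeneous system configurations; proposes a CT-conditioned diffusion framework that, during training, uses high-quality PET/CT pairs and cross-attention to inject anatomical priors without LR-HR PET pairs, and, during inference, combines a scanner-aware forward model (with explicit PSF modeling) with gradient-based data-consistency refinement. Result: In both standard and out-of-distribution (OOD) settings, the method consistently outperforms strong baselines on experimental metrics and lesion-level clinical relevance indicators while reducing hallucination artifacts and improving structural fidelity. Conclusion: CT-conditioned, physics-constrained diffusion modeling effectively eases the under-constrained nature of PET super-resolution, improving reconstruction accuracy and robustness while preserving anatomical plausibility. Abstract: PET super-resolution is highly under-constrained because paired multi-resolution scans from the same subject are rarely available, and effective resolution is determined by scanner-specific physics (e.g., PSF, detector geometry, and acquisition settings). This limits supervised end-to-end training and makes purely image-domain generative restoration prone to hallucinated structures when anatomical and physical constraints are weak. We formulate PET super-resolution as posterior inference under heterogeneous system configurations and propose a CT-conditioned diffusion framework with physics-constrained sampling. During training, a conditional diffusion prior is learned from high-quality PET/CT pairs using cross-attention for anatomical guidance, without requiring paired LR--HR PET data. During inference, measurement consistency is enforced through a scanner-aware forward model with explicit PSF effects and gradient-based data-consistency refinement. Under both standard and OOD settings, the proposed method consistently improves experimental metrics and lesion-level clinical relevance indicators over strong baselines, while reducing hallucination artifacts and improving structural fidelity.[225] Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition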
Seokmin Lee,Yunghee Lee,Byeonghyun Pak,Byeongju Woo
Main category: cs.CV
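TL;DR: This paper proposes CroBo, a framework that learns visual state representations through a global-to-local reconstruction objective, letting robots extract fine-grained scene information, including semantic identity and spatial location, from video streams and thereby improving sequential decision making in dynamic environments.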
Details
Motivation: Existing self-supervised learning methods do not specify what a "good visual state" should encode; the authors argue that an effective visual state must jointly encode the semantic identity (what) and spatial location (where) of scene elements to reliably detect subtle dynamics across frames. Method: Proposes the CroBo framework with a global-to-local reconstruction objective: a reference frame is compressed into a compact bottleneck token (the global representation), which is then used as context to reconstruct heavily masked patches in a local target crop from sparse visible cues. Result: Achieves SOTA performance on multiple vision-based robot policy learning benchmarks; reconstruction analyses and perceptual straightness experiments show that the learned representations preserve pixel-level scene composition and accurately encode what-moves-where across frames. Conclusion: By forcing the bottleneck token to encode fine-grained global semantic-spatial information, CroBo produces visual state representations suited to temporal modeling and decision making, supporting the importance of the what-is-where principle for robot visual representation learning. Abstract: For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. 
Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.[226] Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation
Stefan Ainetter,Thomas Deixelberger,Edoardo A. Dominici,Philipp Drescher,Konstantinos Vardis,Markus Steinberger
Main category: cs.CV
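TL;DR: GuidedSceneGen is a text-to-3D indoor scene generation framework that combines global layout guidance, panoramic diffusion modeling, video-diffusion-driven navigation, and Gaussian Splatting fusion to generate metrically accurate, globally consistent, and semantically interpretable 3D scenes.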
Details
Motivation: Addresses the geometric drift and scale ambiguity of existing text-driven 3D generation methods, improving the metric accuracy, global consistency, and semantic interpretability of indoor scenes. Method: Proposes a four-stage pipeline: 1) predict a global 3D semantic-geometric layout from text; 2) generate layout-aligned 360° imagery with a semantics- and depth-conditioned panoramic diffusion model; 3) guide a video diffusion model along optimized camera trajectories to explore unobserved regions; 4) fuse the multi-view images with 3D Gaussian Splatting into a navigable 3D scene at absolute scale. Result: In quantitative evaluation and a user study, the method shows clearly better 3D consistency and layout plausibility than recent panoramic text-to-3D baselines; it supports accurate transfer of object poses and semantic labels from layout to reconstruction and progressive scene expansion without re-alignment; sampling is up to 10x faster than exhaustive path exploration. Conclusion: GuidedSceneGen overcomes the scale and geometric-consistency bottlenecks of text-to-3D generation, offering a new paradigm for high-quality, navigable, semantically rich indoor scene generation. Abstract: We present GuidedSceneGen, a text-to-3D generation framework that produces metrically accurate, globally consistent, and semantically interpretable indoor scenes. Unlike prior text-driven methods that often suffer from geometric drift or scale ambiguity, our approach maintains an absolute world coordinate frame throughout the entire generation process. Starting from a textual scene description, we predict a global 3D layout encoding both semantic and geometric structure, which serves as a guiding proxy for downstream stages. A semantics- and depth-conditioned panoramic diffusion model then synthesizes 360° imagery aligned with the global layout, substantially improving spatial coherence. To explore unobserved regions, we employ a video diffusion model guided by optimized camera trajectories that balance coverage and collision avoidance, achieving up to 10x faster sampling compared to exhaustive path exploration. The generated views are fused using 3D Gaussian Splatting, yielding a consistent and fully navigable 3D scene in absolute scale. GuidedSceneGen enables accurate transfer of object poses and semantic labels from layout to reconstruction, and supports progressive scene expansion without re-alignment. Quantitative results and a user study demonstrate greater 3D consistency and layout plausibility compared to recent panoramic text-to-3D baselines.[227] Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
Yuting Tan,Xilong Cheng,Yunxiao Qin,Zhengnan Li,Jingjing Zhang
Main category: cs.CV
TL;DR: EgoViT is an annotation-free vision Transformer framework designed for egocentric video; by jointly discovering and stabilizing "proto-objects", it improves unsupervised object discovery and semantic segmentation.
Details
Motivation: Inspired by how humans develop visual intelligence by perceiving their environment from an egocentric viewpoint, the work asks how artificial systems can learn stable object representations from continuous, unlabeled first-person video.
Method: The EgoViT framework combines three synergistic mechanisms: (1) Proto-object Learning (intra-frame distillation); (2) Depth Regularization (a geometric structure constraint); (3) Teacher-Filtered Temporal Consistency (identity preservation across frames). It is trained end-to-end on unlabeled first-person video.
Result: +8.0% CorLoc (unsupervised object discovery) and +4.8% mIoU (semantic segmentation) on standard benchmarks, outperforming existing methods.
Conclusion: EgoViT lays a foundation for robust visual abstraction in embodied intelligence and supports the effectiveness of self-supervised learning for building stable object representations in complex dynamic scenes.
Abstract: Humans develop visual intelligence through perceiving and interacting with their environment - a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.
[228] Evaluation of Visual Place Recognition Methods for Image Pair Retrieval in 3D Vision and Robotics
Dennis Haitz,Athradi Shritish Shetty,Michael Weinmann,Markus Ulrich
Main category: cs.CV
TL;DR: This paper reframes Visual Place Recognition (VPR) as an image pair retrieval task that serves as the front-end of registration pipelines, and systematically evaluates the performance and suitability of several mainstream VPR methods on multiple challenging datasets.
Details
Motivation: VPR is traditionally used for single-image retrieval and localization; this work instead examines its effectiveness and robustness as an image pair retrieval module for downstream registration tasks such as scene registration, SLAM, and SfM.
Method: A comparative evaluation of NetVLAD-style baselines, classification-based global descriptors (CosPlace, EigenPlaces), feature-mixing methods (MixVPR), and foundation-model-driven methods (AnyLoc, SALAD, MegaLoc) on image pair retrieval across three cross-domain datasets: Tanks and Temples, ScanNet-GS, and KITTI.
Result: Modern global descriptor methods work well off the shelf in challenging scenarios such as perceptual aliasing and incomplete sequences; the methods show clear, domain-dependent strengths and weaknesses.
Conclusion: VPR has practical value as an image pair retrieval front-end, but the method must be chosen carefully for the application domain (e.g., outdoor, indoor, autonomous driving) to keep registration and mapping robust.
Abstract: Visual Place Recognition (VPR) is a core component in computer vision, typically formulated as an image retrieval task for localization, mapping, and navigation. In this work, we instead study VPR as an image pair retrieval front-end for registration pipelines, where the goal is to find top-matching image pairs between two disjoint image sets for downstream tasks such as scene registration, SLAM, and Structure-from-Motion. We comparatively evaluate state-of-the-art VPR families - NetVLAD-style baselines, classification-based global descriptors (CosPlace, EigenPlaces), feature-mixing (MixVPR), and foundation-model-driven methods (AnyLoc, SALAD, MegaLoc) - on three challenging datasets: object-centric outdoor scenes (Tanks and Temples), indoor RGB-D scans (ScanNet-GS), and autonomous-driving sequences (KITTI). We show that modern global descriptor approaches are increasingly suitable as off-the-shelf image pair retrieval modules in challenging scenarios including perceptual aliasing and incomplete sequences, while exhibiting clear, domain-dependent strengths and weaknesses that are critical when choosing VPR components for robust mapping and registration.
[229] OpenCOOD-Air: Prompting Heterogeneous Ground-Air Collaborative Perception with Spatial Conversion and Offset Prediction
Xianke Wu,Songlin Bai,Chengxiang Li,Zhiyao Luo,Yulin Tian,Fenghua Zhu,Yisheng Lv,Yonglin Tian
Main category: cs.CV
TL;DR: This paper proposes OpenCOOD-Air, a framework that brings unmanned aerial vehicles (UAVs) into Vehicle-to-Vehicle (V2V) collaborative perception. It uses transfer learning, a Cross-Domain Spatial Converter (CDSC), and a Spatial Offset Prediction Transformer (SOPT) to handle the ground-air domain gap and spatial information loss, introduces the OPV2V-Air benchmark, and improves 2D/3D detection accuracy.
Details
Motivation: Existing V2V collaborative perception is limited by ground-level occlusions and the restricted viewpoint of vehicle-mounted sensors, which cause critical perception blind spots.
Method: The OpenCOOD-Air framework: 1) fine-tune pre-trained V2V models via transfer learning to adapt them to UAVs; 2) use the CDSC and SOPT modules for heterogeneous ground-air collaborative perception with altitude supervision; 3) build the OPV2V-Air benchmark to validate the method.
Result: Compared with SOTA methods, 2D and 3D AP@0.7 improve by 4% and 7%, respectively.
Conclusion: Introducing UAVs as extensible platforms, combined with targeted cross-domain modeling, overcomes inherent limitations of V2V perception and improves collaborative perception performance.
Abstract: While Vehicle-to-Vehicle (V2V) collaboration extends sensing ranges through multi-agent data sharing, its reliability remains severely constrained by ground-level occlusions and the limited perspective of chassis-mounted sensors, which often result in critical perception blind spots. We propose OpenCOOD-Air, a novel framework that integrates UAVs as extensible platforms into V2V collaborative perception to overcome these constraints. To mitigate gradient interference from ground-air domain gaps and data sparsity, we adopt a transfer learning strategy to fine-tune UAV weights from pre-trained V2V models. To prevent the spatial information loss inherent in this transition, we formulate ground-air collaborative perception as a heterogeneous integration task with explicit altitude supervision and introduce a Cross-Domain Spatial Converter (CDSC) and a Spatial Offset Prediction Transformer (SOPT). Furthermore, we present the OPV2V-Air benchmark to validate the transition from V2V to Vehicle-to-Vehicle-to-UAV. Compared to state-of-the-art methods, our approach improves 2D and 3D AP@0.7 by 4% and 7%, respectively.
[230] Discriminative Flow Matching Via Local Generative Predictors
Om Govind Jha,Manoj Bamniya,Ayon Borthakur
Main category: cs.CV
TL;DR: This paper proposes Discriminative Flow Matching, which models classification and detection as a conditional flow transport process: it learns a continuous vector field from a noise distribution to a task target manifold (such as class embeddings or box coordinates), bringing generative modeling ideas into discriminative vision models to improve robustness and iterative refinement.
Details
Motivation: Traditional discriminative vision models rely on static mappings and lack the iterative refinement and robustness of biological vision and modern generative models; generative and discriminative learning need to be bridged.
Method: A conditional flow matching framework with a shared backbone and multiple independent flow predictors; each predictor computes its gradients independently under a local flow matching objective; blocks can be updated sequentially or in parallel to fit memory and hardware constraints; the approach is applied to image classification and object detection.
Result: Robust, generative-inspired inference on image classification and object detection, compatible with architectures including CNNs and ViTs.
Conclusion: Discriminative Flow Matching brings generative flow matching into discriminative tasks, improving interpretability, robustness, and flexibility while maintaining efficiency.
Abstract: Traditional discriminative computer vision relies predominantly on static projections, mapping input features to outputs in a single computational step. Although efficient, this paradigm lacks the iterative refinement and robustness inherent in biological vision and modern generative modelling. In this paper, we propose Discriminative Flow Matching, a framework that reformulates classification and object detection as a conditional transport process. By learning a vector field that continuously transports samples from a simple noise distribution toward a task-aligned target manifold -- such as class embeddings or bounding box coordinates -- we are at the interface between generative and discriminative learning. Our method attaches multiple independent flow predictors to a shared backbone. These predictors are trained using local flow matching objectives, where gradients are computed independently for each block. We formulate this approach for standard image classification and extend it to the complex task of object detection, where targets are high-dimensional and spatially distributed. This architecture provides the flexibility to update blocks either sequentially to minimise activation memory or in parallel to suit different hardware constraints. By aggregating the predictions from these independent flow predictors, our framework enables robust, generative-inspired inference across diverse architectures, including CNNs and vision transformers.
[231] Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting
Jonas V. Funk,Lukas Roming,Andreas Michel,Paul Bäcker,Georg Maier,Thomas Längle,Markus Klute
Main category: cs.CV
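TL;DR: This paper proposes Bidirectional Cross-Attention Fusion (BCAF), a modality fusion method that aligns and fuses RGB and hyperspectral imaging (HSI) at the pixel level to improve material segmentation accuracy in automated industrial waste sorting. BCAF uses a dual-branch Swin Transformer design that processes RGB and HSI separately and aligns them on their native-resolution grids with localized bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. It reaches state-of-the-art performance on SpectralWaste and is also evaluated on the new K3I-Cycling dataset.
Details
Motivation: Industrial waste sorting requires pixel-accurate segmentation on fast conveyor belts; RGB imaging has high spatial resolution but confuses materials that look alike, while HSI separates materials spectrally but has low spatial resolution, so the two need to be fused effectively.
Method: Bidirectional Cross-Attention Fusion (BCAF): 1) two independent backbones, a standard Swin Transformer for RGB and a Swin variant with 3D tokenization and spectral self-attention for HSI; 2) localized, bidirectional cross-attention alignment on the native grids, without pre-upsampling or feature-channel compression; 3) an analysis of the trade-off between RGB resolution and the number of HSI spectral slices.
Result: 76.4% mIoU (31 img/s) and 75.4% mIoU (55 img/s) on SpectralWaste; on the new industrial dataset K3I-Cycling, 62.3% mIoU for material segmentation and 66.2% mIoU for plastic-type segmentation; the method is modality-agnostic and extends to other co-registered multimodal settings.
Conclusion: By preserving each modality's native characteristics and aligning them at fine granularity, BCAF outperforms existing fusion methods on waste segmentation and offers a practical, extensible solution for real-time, high-accuracy industrial vision systems.
Abstract: Growing waste streams and the transition to a circular economy require efficient automated waste sorting. In industrial settings, materials move on fast conveyor belts, where reliable identification and ejection demand pixel-accurate segmentation. RGB imaging delivers high-resolution spatial detail, which is essential for accurate segmentation, but it confuses materials that look similar in the visible spectrum. Hyperspectral imaging (HSI) provides spectral signatures that separate such materials, yet its lower spatial resolution limits detail. Effective waste sorting therefore needs methods that fuse both modalities to exploit their complementary strengths. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. We also analyze trade-offs between RGB input resolution and the number of HSI spectral slices. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors.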
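The localized, bidirectional cross-attention idea can be sketched for a single window as follows. This is a minimal illustration under stated assumptions, not the BCAF implementation: it uses one attention head, no learned query/key/value projections, and a residual sum, and the names `cross_attend` and `fuse_window` are hypothetical. The window pairs a 2x2 block of RGB tokens with the one HSI token that covers the same area, so neither modality is resampled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Single-head dot-product attention without learned projections (a simplification)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim) for kv in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(dim)])
    return attended

def fuse_window(rgb_tokens, hsi_tokens):
    """Bidirectional fusion inside one local window, with residual connections."""
    rgb_update = cross_attend(rgb_tokens, hsi_tokens)  # RGB queries read from HSI
    hsi_update = cross_attend(hsi_tokens, rgb_tokens)  # HSI queries read from RGB
    fused_rgb = [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(rgb_tokens, rgb_update)]
    fused_hsi = [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(hsi_tokens, hsi_update)]
    return fused_rgb, fused_hsi

# One window: four high-resolution RGB tokens co-located with one low-resolution HSI token.
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
hsi = [[2.0, 4.0]]
fused_rgb, fused_hsi = fuse_window(rgb, hsi)
print(len(fused_rgb), len(fused_hsi))  # token counts are unchanged on both grids
```

Both outputs keep their original token counts, which is the point of fusing at native grids: the RGB branch stays at high spatial resolution and the HSI branch stays at low resolution.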
On the benchmark SpectralWaste dataset, BCAF achieves state-of-the-art performance of 76.4% mIoU at 31 images/s and 75.4% mIoU at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.).
[232] Sat-JEPA-Diff: Bridging Self-Supervised Learning and Generative Diffusion for Remote Sensing
Kursat Komurcu,Linas Petkevicius
Main category: cs.CV
TL;DR: This paper proposes Sat-JEPA-Diff, which combines self-supervised learning (IJEPA) with a latent diffusion model (LDM) to balance structural accuracy and textural detail in satellite image prediction, improving boundary sharpness and perceptual quality.
Details
Motivation: Standard deterministic methods (e.g., PredRNN, SimVP) suffer from "regression to the mean" and produce blurry outputs; generative models yield realistic textures but often introduce structural anomalies. Both structural accuracy and realistic texture are needed.
Method: Sat-JEPA-Diff: an IJEPA module predicts stable semantic representations, which drive a frozen Stable Diffusion backbone through a lightweight cross-attention adapter to give structure-guided, high-quality texture synthesis.
Result: Leading perceptual scores on a global Sentinel-2 dataset (GSSIM: 0.8984, FID: 0.1475), outperforming deterministic baselines, with particular strength in reconstructing sharp boundaries.
Conclusion: Sat-JEPA-Diff bridges structural fidelity and textural realism, supporting the effectiveness and practicality of the combined SSL+LDM paradigm for temporal prediction in remote sensing.
Abstract: Predicting satellite imagery requires a balance between structural accuracy and textural detail. Standard deterministic methods like PredRNN or SimVP minimize pixel-based errors but suffer from the "regression to the mean" problem, producing blurry outputs that obscure subtle geospatial features. Generative models provide realistic textures but often introduce spurious structural anomalies. To bridge this gap, we introduce Sat-JEPA-Diff, which combines Self-Supervised Learning (SSL) with Latent Diffusion Models (LDM). An IJEPA module predicts stable semantic representations, which then guide a frozen Stable Diffusion backbone via a lightweight cross-attention adapter. This ensures that the synthesized high-fidelity textures are grounded in accurate structural predictions. Evaluated on a global Sentinel-2 dataset, Sat-JEPA-Diff excels at resolving sharp boundaries. It achieves leading perceptual scores (GSSIM: 0.8984, FID: 0.1475) and significantly outperforms deterministic baselines, despite standard autoregressive stability limits. The code and dataset are publicly available at https://github.com/VU-AIML/SAT-JEPA-DIFF.
[233] DCP-CLIP:A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction
Jing Wang,Huimin Shi,Quan Zhou,Qibo Liu,Suofei Zhang,Huimin Lu
Main category: cs.CV
TL;DR: This paper proposes the DCP-CLIP framework, which dynamically constructs category-relevant textual features and models dual image-text interactions for efficient and accurate open-vocabulary semantic segmentation.
Details
Motivation: Existing open-vocabulary semantic segmentation methods suffer from insufficient cross-modal interaction between image and text, and from high computational cost caused by the large number of categories.
Method: A coarse-to-fine DCP-CLIP framework: first use CLIP to identify the semantic categories relevant to the image and dynamically generate textual features; then perform a coarse cross-modal segmentation; then fuse spatially enriched encoder features for fine-grained segmentation; finally use the segmentation result to refine the category predictions.
Result: On multiple OVSS benchmarks, DCP-CLIP outperforms existing methods in both accuracy and efficiency.
Conclusion: Through dynamic textual feature construction and dual-interaction modeling, DCP-CLIP mitigates insufficient image-text interaction and high computational cost, improving open-vocabulary semantic segmentation performance.
Abstract: Recent years have witnessed remarkable progress in open-vocabulary semantic segmentation (OVSS) using vision-language foundation models, yet existing methods still suffer from the following fundamental challenges: (1) insufficient cross-modal communication between textual and visual spaces, and (2) significant computational costs from the interactions with a massive number of categories. To address these issues, this paper describes a novel coarse-to-fine framework, called DCP-CLIP, for OVSS. Unlike prior efforts that mainly relied on pre-established category content and the inherent spatial-class interaction capability of CLIP, we dynamically construct category-relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP's open-vocabulary recognition capability to identify semantic categories relevant to the image context, upon which we dynamically generate corresponding textual features to serve as initial textual guidance. Subsequently, we conduct a coarse segmentation by cross-modally integrating semantic information from textual guidance into the visual representations and achieve refined segmentation by integrating spatially enriched features from the encoder to recover fine-grained details and enhance spatial resolution. Finally, we leverage spatial information from the segmentation side to refine category predictions for each mask, facilitating more precise semantic labeling.
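The first two steps described above (prune the vocabulary to image-relevant categories, then segment coarsely against only those categories) can be sketched with plain cosine similarity. This is a toy illustration under stated assumptions rather than the DCP-CLIP pipeline: the three-dimensional embeddings are made up, and `select_categories` and `coarse_segment` are hypothetical helper names that stand in for CLIP image-text matching and the cross-modal decoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_categories(image_emb, text_embs, k):
    """Keep the k category names whose text embedding best matches the whole image."""
    ranked = sorted(text_embs, key=lambda name: cosine(image_emb, text_embs[name]), reverse=True)
    return ranked[:k]

def coarse_segment(pixel_feats, text_embs, kept):
    """Label each pixel feature with the best-matching category among the kept ones only."""
    return [max(kept, key=lambda name: cosine(p, text_embs[name])) for p in pixel_feats]

# Made-up embeddings for a three-word vocabulary.
text_embs = {"cat": [1.0, 0.0, 0.0], "grass": [0.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0]}
image_emb = [0.8, 0.6, 0.1]
pixel_feats = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.1, 0.2, 0.9]]

kept = select_categories(image_emb, text_embs, k=2)
labels = coarse_segment(pixel_feats, text_embs, kept)
print(kept, labels)
```

The per-pixel step compares against two text embeddings instead of the full vocabulary, which is where the cost saving comes from. The third pixel shows the trade-off: because "car" was pruned, it falls back to the closest kept category, which is the kind of error a later refinement stage would need to correct.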
Experiments on multiple OVSS benchmarks demonstrate that DCP-CLIP outperforms existing methods by delivering both higher accuracy and greater efficiency.
[234] IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation
Chenru Wang,Yunyi Chen,Zijun Yang,Joey Tianyi Zhou,Chi Zhang
Main category: cs.CV
TL;DR: This paper proposes two new strategies (Inversion-Matching and Selective Subgroup Sampling) to improve diffusion-based dataset distillation, raising the discriminative power and generalization of distilled data on classification tasks.
Details
Motivation: Diffusion models are optimized for generative likelihood, so distilled samples over-concentrate in high-density regions and under-cover the boundary samples that are crucial for classification, a mismatch between the generative objective and discriminative needs.
Method: Inversion-Matching (IM) uses inversion-guided fine-tuning to align denoising trajectories with inversion trajectories and increase diversity; Selective Subgroup Sampling (S^3) is a training-free sampling mechanism that selects synthetic subsets that are both representative and distinctive to improve inter-class separability.
Result: Improved discriminative quality and generalization of distilled data on multiple benchmarks, with the best performance among diffusion-based dataset distillation methods.
Conclusion: Objective alignment is key to the discriminative utility of diffusion-based dataset distillation; IM and S^3 together address insufficient distribution coverage and inter-class confusion.
Abstract: Dataset Distillation aims to synthesize compact datasets that can approximate the training efficacy of large-scale real datasets, offering an efficient solution to the increasing computational demands of modern deep learning. Recently, diffusion-based dataset distillation methods have shown great promise by leveraging the strong generative capacity of diffusion models to produce diverse and structurally consistent samples. However, a fundamental goal misalignment persists: diffusion models are optimized for generative likelihood rather than discriminative utility, resulting in over-concentration in high-density regions and inadequate coverage of boundary samples crucial for classification. To address this issue, we propose two complementary strategies. Inversion-Matching (IM) introduces an inversion-guided fine-tuning process that aligns denoising trajectories with their inversion counterparts, broadening distributional coverage and enhancing diversity. Selective Subgroup Sampling (S^3) is a training-free sampling mechanism that improves inter-class separability by selecting synthetic subsets that are both representative and distinctive. Extensive experiments demonstrate that our approach significantly enhances the discriminative quality and generalization of distilled datasets, achieving state-of-the-art performance among diffusion-based methods.
[235] USIS-PGM: Photometric Gaussian Mixtures for Underwater Salient Instance Segmentation
Lin Hong,Xiangtong Yao,Mürüvvet Bozkurt,Xin Wang,Fumin Zhang
Main category: cs.CV
TL;DR: This paper proposes USIS-PGM, a single-stage framework that improves underwater salient instance segmentation through frequency-aware, dynamic weighting, and Transformer-based instance activation modules, combined with multi-scale Gaussian heatmap supervision.
Details
Motivation: Underwater image degradation makes underwater salient instance segmentation (USIS) more challenging than its terrestrial counterpart, which calls for effective methods.
Method: The single-stage USIS-PGM framework: the encoder contains a frequency-aware module and a dynamic weighting module; the decoder adds a Transformer-based instance activation module; Photometric Gaussian Mixture (PGM) generates multi-scale Gaussian heatmaps that supervise intermediate features.
Result: Experiments indicate that USIS-PGM has superior performance and practical applicability on underwater salient instance segmentation.
Conclusion: Through cooperating modules and a novel supervision strategy, USIS-PGM mitigates the effect of underwater image degradation and improves salient instance localization and the structural coherence of predicted masks.
Abstract: Underwater salient instance segmentation (USIS) is crucial for marine robotic systems, as it enables both underwater salient object detection and instance-level mask prediction for visual scene understanding. Compared with its terrestrial counterpart, USIS is more challenging due to underwater image degradation. To address this issue, this paper proposes USIS-PGM, a single-stage framework for USIS. Specifically, the encoder enhances boundary cues through a frequency-aware module and performs content-adaptive feature reweighting via a dynamic weighting module. The decoder incorporates a Transformer-based instance activation module to better distinguish salient instances. In addition, USIS-PGM employs multi-scale Gaussian heatmaps generated from ground-truth masks through Photometric Gaussian Mixture (PGM) to supervise intermediate decoder features, thereby improving salient instance localization and producing more structurally coherent mask predictions. Experimental results demonstrate the superiority and practical applicability of the proposed USIS-PGM model.
[236] VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction
Hiroto Nakata,Yawen Zou,Shunsuke Sakai,Shun Maeda,Chunzhi Gu,Yijin Wei,Shangce Gao,Chao Zhang
Main category: cs.CV
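TL;DR: This paper proposes the VID-AD dataset and a language-based anomaly detection framework to address the difficulty of identifying logical anomalies in industrial inspection under visual distraction (e.g., illumination shift, blur); text descriptions and contrastive learning are used to model logical constraints, improving robust detection of rule violations.
Details
Motivation: Existing industrial anomaly detection benchmarks lack controlled settings in which the logical state is fixed while visual nuisance factors (such as background clutter, illumination shift, and blur) vary, so models struggle to focus on logical violations.
Method: Build the VID-AD dataset (10 manufacturing scenarios × 5 capture conditions), with each scenario defined by two logical constraints chosen from quantity, length, type, placement, and relation; propose a language-based detection framework that relies only on text descriptions generated from normal images and uses contrastive learning, with the original descriptions as positives and synthesized contradictory descriptions as negatives, to learn embeddings of logical attributes.
Result: Across all 50 one-class tasks of VID-AD, the method consistently outperforms multiple baselines, supporting its robustness to visual distraction and its ability to detect logical anomalies.
Conclusion: Logical anomaly detection should decouple visual representation from logical reasoning; VID-AD provides a standardized benchmark for this direction, and the language-driven method shows that text descriptions with contradiction-based augmentation are effective for modeling logical constraints.
Abstract: Logical anomaly detection in industrial inspection remains challenging due to variations in visual appearance (e.g., background clutter, illumination shift, and blur), which often distract vision-centric detectors from identifying rule-level violations. However, existing benchmarks rarely provide controlled settings where logical states are fixed while such nuisance factors vary. To address this gap, we introduce VID-AD, a dataset for logical anomaly detection under vision-induced distraction. It comprises 10 manufacturing scenarios and five capture conditions, totaling 50 one-class tasks and 10,395 images. Each scenario is defined by two logical constraints selected from quantity, length, type, placement, and relation, with anomalies including both single-constraint and combined violations. We further propose a language-based anomaly detection framework that relies solely on text descriptions generated from normal images. Using contrastive learning with positive texts and contradiction-based negative texts synthesized from these descriptions, our method learns embeddings that capture logical attributes rather than low-level features. Extensive experiments demonstrate consistent improvements over baselines across the evaluated settings. The dataset is available at: https://github.com/nkthiroto/VID-AD.
[237] Leveraging a Statistical Shape Model for Efficient Generation of Annotated Training Data: A Case Study on Liver Landmarks Segmentation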
Denis Krnjaca,Lorena Krames,Werner Nahm
Main category: cs.CV
TL;DR: This paper proposes a method that uses a statistical shape model (SSM) to automatically generate large amounts of annotated data for anatomical landmark segmentation, reducing the manual annotation burden, and validates it on detecting the anterior ridge and falciform ligament of the liver.
Details
Motivation: Existing deep learning methods rely on large manually annotated datasets, which are costly and time-consuming to produce; an automated, low-cost way to generate annotated data is needed.
Method: Build a statistical shape model (SSM) whose mean shape is manually labeled only once and use it to generate 8,800 synthetic annotated liver shapes; train a specialized deep network on them and evaluate it on synthetic and clinical data.
Result: 91.4% mean IoU on 500 unseen synthetic SSM shapes (87.4% for the anterior ridge and 87.6% for the falciform ligament); qualitative evaluation on clinical data suggests good generalization.
Conclusion: The SSM-driven data generation strategy eases the manual annotation bottleneck and supports building large, high-quality training sets; the method has potential to transfer to other anatomical structures and tasks.
Abstract: Anatomical landmark segmentation serves as a critical initial step for robust multimodal registration during computer-assisted interventions. Current approaches predominantly rely on deep learning, which often necessitates the extensive manual generation of annotated datasets. In this paper, we present a novel strategy for creating large annotated datasets using a statistical shape model (SSM) based on a mean shape that is manually labeled only once. We demonstrate the method's efficacy through its application to deep-learning-based anatomical landmark segmentation, specifically targeting the detection of the anterior ridge and the falciform ligament in 3D liver shapes. A specialized deep learning network was trained with 8,800 annotated liver shapes generated by the SSM. The network's performance was evaluated on 500 unseen synthetic SSM shapes, yielding a mean Intersection over Union of 91.4% (87.4% for the anterior ridge and 87.6% for the falciform ligament). Subsequently, the network was applied to clinical patient liver shapes, with qualitative evaluation indicating promising results and highlighting the generalizability of the proposed approach. Our findings suggest that the SSM-based data generation approach alleviates the labor-intensive process of manual labeling while enabling the creation of large annotated training datasets for machine learning. Although our study focuses on liver anatomy, the proposed methodology holds potential for a broad range of applications where annotated training datasets play a pivotal role in developing accurate deep-learning models.
[238] When Visual Privacy Protection Meets Multimodal Large Language Models
Xiaofei Hui,Qian Wu,Haoxuan Qu,Majid Mirmehdi,Hossein Rahmani,Jun Liu
Main category: cs.CV
TL;DR: This paper proposes a new visual privacy protection framework for black-box Multimodal Large Language Model (MLLM) services; with a Pareto-optimal learning objective and critical-history enhanced optimization, it balances privacy protection against MLLM performance without access to the model's internals.
Details
Motivation: MLLM cloud services (such as GPT-4V) are widely used and require users to upload images and videos, which creates serious risks of visual data privacy leakage; privacy protection in the black-box setting remains under-explored.
Method: A new framework that 1) designs a learning objective based on Pareto optimality to trade off visual privacy protection against MLLM performance, and 2) introduces critical-history enhanced optimization to suit the gradient-free optimization that a black-box MLLM requires.
Result: Experiments show the method is effective on multiple benchmarks, improving the privacy protection of visual inputs while preserving MLLM task performance.
Conclusion: The paper provides a feasible and effective solution for visual privacy protection with black-box MLLMs, advancing practical privacy-enhanced multimodal AI services.
Abstract: The emergence of Multimodal Large Language Models (MLLMs) and the widespread usage of MLLM cloud services such as GPT-4V have raised great concerns about privacy leakage in visual data. As these models are typically deployed in cloud services, users are required to submit their images and videos, posing serious privacy risks. However, how to tackle such privacy concerns is an under-explored problem. Thus, in this paper, we aim to conduct a new investigation to protect visual privacy when enjoying the convenience brought by MLLM services. We address the practical case where the MLLM is a "black box", i.e., we only have access to its input and output without knowing its internal model information. To tackle such a challenging yet demanding problem, we propose a novel framework, in which we carefully design the learning objective with Pareto optimality to seek a better trade-off between visual privacy and MLLM's performance, and propose critical-history enhanced optimization to effectively optimize the framework with the black-box MLLM. Our experiments show that our method is effective on different benchmarks.
[239] VAD4Space: Visual Anomaly Detection for Planetary Surface Imagery
Fabrizio Genilotti,Arianna Stropeni,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Gian Antonio Susto
Main category: cs.CV
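TL;DR: This paper proposes Visual Anomaly Detection (VAD) as a framework for automatically discovering rare phenomena in planetary exploration, presents the first evaluation of state-of-the-art feature-based VAD methods on real lunar and Martian imagery, and builds two new benchmark datasets; the results indicate that the approach can efficiently identify rare surface phenomena in resource-constrained settings and support several mission-critical applications.
Details
Motivation: Space missions produce volumes of high-resolution imagery that cannot be inspected manually; traditional supervised learning struggles to discover genuinely novel scientific phenomena because labels are scarce and closed-world assumptions apply.
Method: Feature-based VAD methods are evaluated empirically on two self-built planetary imagery benchmarks (a lunar dataset from LROC Narrow Angle Camera imagery and a Mars rover surface imagery dataset), with a focus on computationally efficient solutions suited to onboard and edge deployment.
Result: Feature-based VAD methods identify rare planetary surface phenomena effectively while remaining feasible on resource-constrained platforms such as orbiters and rovers.
Conclusion: VAD offers an open-world perception paradigm for planetary exploration; the benchmarks and empirical results advance its practical use in tactical planning, landing site selection, hazard detection, bandwidth-aware data prioritization, and the discovery of unanticipated geological processes.
Abstract: Space missions generate massive volumes of high-resolution orbital and surface imagery that far exceed the capacity for manual inspection. Detecting rare phenomena is scientifically critical, yet traditional supervised learning struggles due to scarce labeled examples and closed-world assumptions that prevent discovery of genuinely novel observations. In this work, we investigate Visual Anomaly Detection (VAD) as a framework for automated discovery in planetary exploration. We present the first empirical evaluation of state-of-the-art feature-based VAD methods on real planetary imagery, encompassing both orbital lunar data and Mars rover surface imagery. To support this evaluation, we introduce two benchmarks: (i) a lunar dataset derived from Lunar Reconnaissance Orbiter Camera Narrow Angle imagery, comprising fresh and degraded craters as anomalies alongside normal terrain; and (ii) a Mars surface dataset designed to reflect the characteristics of rover-acquired imagery. We evaluate multiple VAD approaches with a focus on computationally efficient, edge-oriented solutions suitable for onboard deployment, applicable to both orbital platforms surveying the lunar surface and surface rovers operating on Mars. Our results demonstrate that feature-based VAD methods can effectively identify rare planetary surface phenomena while remaining feasible for resource-constrained environments.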
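As a rough illustration of what "feature-based" anomaly detection means here, one common family of detectors stores patch features from normal images in a memory bank and scores a test patch by its distance to the nearest stored feature. The sketch below follows that generic idea and is not a reproduction of any method evaluated in the paper; the two-dimensional features and the function names are assumptions made for the example.

```python
import math

def nearest_distance(feature, memory_bank):
    """Euclidean distance from one patch feature to its closest normal feature."""
    return min(math.dist(feature, normal) for normal in memory_bank)

def image_anomaly_score(patch_features, memory_bank):
    """Image-level score: the most anomalous patch decides."""
    return max(nearest_distance(f, memory_bank) for f in patch_features)

# Memory bank built from patches of normal terrain (toy 2-D features).
memory_bank = [[0.0, 0.0], [1.0, 0.0]]

normal_image = [[0.1, 0.0], [0.9, 0.0]]   # every patch is close to the bank
crater_image = [[0.0, 0.0], [3.0, 4.0]]   # one patch is far from anything seen

normal_score = image_anomaly_score(normal_image, memory_bank)
crater_score = image_anomaly_score(crater_image, memory_bank)
print(normal_score < crater_score)
```

No labeled anomalies are needed to build the bank, which fits the open-world setting the abstract describes, and inference reduces to nearest-neighbor lookups whose cost can be bounded by limiting the bank size for onboard use.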
By grounding anomaly detection in planetary science, this work establishes practical benchmarks and highlights the potential of open-world perception systems to support a range of mission-critical applications, including tactical planning, landing site selection, hazard detection, bandwidth-aware data prioritization, and the discovery of unanticipated geological processes.
[240] Human-like Object Grouping in Self-supervised Vision Transformers
Hossein Adeli,Seoyoung Ahn,Andrew Luo,Mengmi Zhang,Nikolaus Kriegeskorte,Gregory Zelinsky
Main category: cs.CV
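TL;DR: This paper proposes a behavioral benchmark that assesses how well vision foundation models align with human object perception, and finds that Transformer-based models trained with the DINO self-supervised objective perform best; it further proposes a new metric that quantifies the object-centricity of representations and shows that Gram matrix structure contributes to perceptual alignment.
Details
Motivation: Vision foundation models show strong generalization and emergent segmentation properties, but how they align with human object perception is unclear, which calls for interpretable, behavior-level evaluation.
Method: Run a behavioral experiment with more than 1000 trials of same/different object judgments for dot pairs on natural scenes; predict subjects' reaction times with a simple readout from model representations; propose a new metric that quantifies within-object versus between-object patch similarity to measure object-centricity; use Gram matrix distillation to match the similarity structure of supervised models to that of a self-supervised model.
Result: Prediction of human behavior improves steadily across model generations; DINO self-supervision with a Transformer architecture works best; stronger object-centricity predicts human segmentation behavior more accurately; Gram matrix distillation improves the alignment of supervised models with human behavior.
Conclusion: Self-supervised vision models encode object structure in a human-like way, and the patch similarity structure carried by the Gram matrix is a driver of perceptual alignment.
Abstract: Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to quantify the object-centric component of representations by measuring patch similarity within and between objects. Across models, stronger object-centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, capturing similarity structure across image patches, with that of a self-supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3's feature quality.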
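The abstract does not give the formula for the object-centric metric, so the following is only one plausible form, stated as an assumption: build the Gram matrix of cosine similarities between patch features, then subtract the mean similarity of patch pairs on different objects from the mean similarity of pairs on the same object. The function names and toy features are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gram_matrix(patches):
    """Pairwise cosine similarities between patch features."""
    unit = [normalize(p) for p in patches]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def object_centric_score(patches, object_ids):
    """Mean within-object similarity minus mean between-object similarity (assumed form)."""
    gram = gram_matrix(patches)
    within, between = [], []
    for i in range(len(patches)):
        for j in range(i + 1, len(patches)):
            bucket = within if object_ids[i] == object_ids[j] else between
            bucket.append(gram[i][j])
    return sum(within) / len(within) - sum(between) / len(between)

# Four patches: the first two lie on one object, the last two on another.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = object_centric_score(patches, [0, 0, 1, 1])     # features follow object boundaries
misaligned = object_centric_score(patches, [0, 1, 0, 1])  # features cut across objects
print(aligned, misaligned)
```

A score near 1 means patch similarity follows the object segmentation, while a score at or below 0 means it does not. Since the same Gram matrix is what the distillation step matches between models, the score can be read as a one-number summary of that matrix relative to a given segmentation.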
Together, these results demonstrate that self-supervised vision models capture object structure in a behaviorally human-like manner, and that Gram matrix structure plays a role in driving perceptual alignment.
[241] PhyGaP: Physically-Grounded Gaussians with Polarization Cues
Jiale Wu,Xiaoyang Bai,Zongqi He,Weiwei Xu,Yifan Peng
Main category: cs.CV
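TL;DR: This paper proposes PhyGaP, which uses polarization cues to improve how 3D Gaussian Splatting (3DGS) reconstructs physical attributes (such as albedo and reflectance) of reflective objects and supports high-quality relighting; polarimetric deferred rendering (PolarDR) and a self-occlusion-aware environment map (GridMap) improve reflection decomposition and indirect lighting modeling, outperforming existing RGB-based methods on synthetic and real scenes.
Details
Motivation: Existing 3DGS methods rely on RGB images, which lack shape and material information, so albedo and reflectance are hard to reconstruct accurately and high-fidelity relighting is out of reach.
Method: PhyGaP: 1) polarimetric deferred rendering (PolarDR) models polarization by reflection; 2) a self-occlusion-aware environment map (GridMap) handles indirect lighting of non-convex objects; polarization cues are used jointly for reflection decomposition and physically consistent rendering.
Result: Validated on multiple synthetic and real scenes (including some with only partial polarization cues): surface normal cosine distance improves by 45.7% and PSNR by about 2 dB over existing RGB-based methods on average; inverse rendering and relighting reach state-of-the-art performance.
Conclusion: By introducing polarization as a physical prior, PhyGaP improves the physical interpretability and relighting capability of 3DGS for reflective objects and offers a new direction for learning-based inverse rendering.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via deferred rendering (DR). However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of shape and material information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects (~2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability. Our code will be released soon.
[242] U-Face: An Efficient and Generalizable Framework for Unsupervised Facial Attribute Editing via Subspace Learning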
Bo Liu,Xuan Cui,Run Zeng,Wei Duan,Chongwen Liu,Jinrui Qian,Lianggui Tang,Hongping Gan
Main category: cs.CV
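TL;DR: This paper proposes U-Face, an unsupervised framework for controllable facial attribute editing that casts semantic vector learning as a subspace learning problem and introduces orthogonal non-negative constraints and attribute boundary vectors to improve disentanglement and controllability; it also designs AIDC, an alternating iterative algorithm with closed-form updates and convergence guarantees.
Details
Motivation: Existing unsupervised latent-space facial attribute editing methods disentangle attributes poorly: editing one attribute tends to affect others, which leads to poor fine-grained controllability.
Method: Cast semantic vector learning as a subspace learning problem in which a semantic vector matrix spans a low-dimensional semantic subspace; interpret it equivalently from a projection-reconstruction perspective and generalize it to an autoencoder framework; add orthogonal non-negative constraints and attribute boundary vectors to strengthen disentanglement; design the Alternating Iterative Disentanglement and Controllability (AIDC) algorithm, with closed-form updates and a theoretical convergence guarantee.
Result: U-Face achieves better disentanglement and attribute editing controllability on multiple benchmark datasets, supported by both quantitative and qualitative experiments.
Conclusion: The method offers a flexible, interpretable paradigm for unsupervised facial attribute editing with theoretical support for disentangled representation learning, improving editing precision and controllability.
Abstract: Latent space-based facial attribute editing methods have gained popularity in applications such as digital entertainment, virtual avatar creation, and human-computer interaction systems due to their potential for efficient and flexible attribute manipulation, particularly for continuous edits. Among these, unsupervised latent space-based methods, which discover effective semantic vectors without relying on labeled data, have attracted considerable attention in the research community. However, existing methods still encounter difficulties in disentanglement, as manipulating a specific facial attribute may unintentionally affect other attributes, complicating fine-grained controllability. To address these challenges, we propose a novel framework designed to offer an effective and adaptable solution for unsupervised facial attribute editing, called Unsupervised Facial Attribute Controllable Editing (U-Face). The proposed method frames semantic vector learning as a subspace learning problem, where latent vectors are approximated within a lower-dimensional semantic subspace spanned by a semantic vector matrix. This formulation can also be equivalently interpreted from a projection-reconstruction perspective and further generalized into an autoencoder framework, providing a foundation that can support disentangled representation learning in a flexible manner. To improve disentanglement and controllability, we impose orthogonal non-negative constraints on the semantic vectors and incorporate attribute boundary vectors to reduce entanglement in the learned directions.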
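The projection-reconstruction view of subspace learning can be made concrete with a small sketch. It assumes the semantic vectors are already given and orthonormal, so it shows what the learned subspace is used for rather than how U-Face or AIDC learns it; the basis, latent vector, and function names are all illustrative.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(latent, basis):
    """Coefficients of the latent vector along each semantic direction."""
    return [dot(latent, v) for v in basis]

def reconstruct(coeffs, basis):
    """Map subspace coefficients back to the latent space."""
    dim = len(basis[0])
    return [sum(c * v[i] for c, v in zip(coeffs, basis)) for i in range(dim)]

def edit(latent, basis, k, alpha):
    """Move the latent vector by alpha along semantic direction k."""
    return [z + alpha * d for z, d in zip(latent, basis[k])]

# Two orthonormal, non-negative semantic directions in a 3-D latent space (toy values).
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
latent = [2.0, 3.0, 5.0]

coeffs = project(latent, basis)            # encoder view: latent -> subspace
approx = reconstruct(coeffs, basis)        # decoder view: subspace -> latent
edited = edit(latent, basis, k=0, alpha=1.5)
edited_coeffs = project(edited, basis)
print(coeffs, approx, edited_coeffs)
```

Because the directions are orthogonal, moving along direction 0 leaves the coefficient along direction 1 unchanged, which is the disentanglement property the orthogonality constraint is meant to encourage. The component of the latent vector outside the subspace (the third coordinate here) is what the reconstruction discards.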
Although these constraints make the optimization problem challenging, we design an alternating iterative algorithm, called Alternating Iterative Disentanglement and Controllability (AIDC), with closed-form updates and provable convergence under specific conditions.
[243] Towards Generalizable Deepfake Detection via Real Distribution Bias Correction
Ming-Hui Liu,Harry Cheng,Xin Luo,Xin-Shun Xu,Mohan S. Kankanhalli
Main category: cs.CV
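TL;DR: This paper proposes the Real Distribution Bias Correction (RDBC) framework, which improves the generalization of deepfake detectors to unseen forgery types by exploiting the stability of the population distribution of real images and the inherent Gaussianity of individual real images.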
Details
Motivation: Existing methods try to simulate ever-evolving forgery types from limited source-domain data, but predicting an unbounded set of unknown forgeries is infeasible. The focus should therefore shift to the invariant properties of real data itself.
Method: The RDBC framework comprises two modules: (1) a Real Population Distribution Estimation module, which uses the i.i.d. property of real samples to model and estimate the normal-distribution parameters of real-data statistics; (2) a Distribution-Sampled Feature Whitening module, which builds on the inherent Gaussianity of real images and applies a sampling-based whitening operation to amplify the Gaussianity gap between real and fake samples.
Result: RDBC achieves SOTA performance on both in-domain and cross-domain deepfake detection.
Conclusion: Exploiting the distributional invariance of real data (population-level normality and per-image Gaussianity) effectively improves generalization to unseen forgery types, and RDBC offers a new paradigm for general forgery detection.
Abstract: To generalize deepfake detectors to future unseen forgeries, most existing methods attempt to simulate the dynamically evolving forgery types using available source domain data. However, predicting an unbounded set of future manipulations from limited prior examples is infeasible. To overcome this limitation, we propose to exploit the invariance of real data from two complementary perspectives: the fixed population distribution of the entire real class and the inherent Gaussianity of individual real images. Building on these properties, we introduce the Real Distribution Bias Correction (RDBC) framework, which consists of two key components: the Real Population Distribution Estimation module and the Distribution-Sampled Feature Whitening module. The former utilizes the independent and identically distributed (i.i.d.) property of real samples to derive the normal distribution form of their statistics, from which the distribution parameters can be estimated using limited source domain data. Based on the learned population distribution, the latter utilizes the inherent Gaussianity of real data as a discriminative prior and performs a sampling-based whitening operation to amplify the Gaussianity gap between real and fake samples. Through synergistic coupling of the two modules, our model captures the real-world properties of real samples, thereby enhancing its generalizability to unseen target domains. Extensive experiments demonstrate that RDBC achieves state-of-the-art performance in both in-domain and cross-domain deepfake detection.
[244] Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification
Jiachen Li,Xiaojin Gong,Dongping Zhang
Main category: cs.CV
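TL;DR: This paper proposes MUVA, a CLIP-based multi-grained vision-language alignment framework that improves domain generalized person re-identification through multi-grained text prompts and an adaptively masked multi-head self-attention mechanism, supervised by body-part pseudo labels generated with an MLLM.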
Details
Motivation: Existing pure-vision domain generalized Re-ID methods still leave room for improvement, and directly transferring a VLM helps little because its global features struggle to capture subtle identity differences.
Method: A CLIP-based multi-grained vision-language alignment framework is proposed: (1) multi-grained text prompts describe body parts; (2) an adaptively masked multi-head self-attention module extracts fine-grained visual features; (3) an MLLM-based visual grounding expert automatically generates body-part pseudo labels for supervised training.
Result: The approach achieves superior performance under both single-source and multi-source domain generalization protocols, markedly improving DG Re-ID accuracy.
Conclusion: Multi-grained vision-language alignment effectively mitigates the insensitivity of VLM global features to identity details and improves cross-domain generalization.
Abstract: Domain Generalized person Re-identification (DG Re-ID) is a challenging task, where models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, their performance can still be improved. Recently, Vision-Language Models (VLMs) have shown outstanding generalization capabilities in various visual applications. However, directly adapting a VLM to Re-ID shows limited generalization improvement. This is because the VLM only produces global features that are insensitive to ID nuances. To tackle this problem, we propose a CLIP-based multi-grained vision-language alignment framework in this work. Specifically, several multi-grained prompts are introduced in language modality to describe different body parts and align with their counterparts in vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is employed to precisely extract specific part features. To train the proposed module, an MLLM-based visual grounding expert is employed to automatically generate pseudo labels of body parts for supervision. Extensive experiments conducted on both single- and multi-source generalization protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.
[245] EI-Part: Explode for Completion and Implode for Refinement
Wanhu Sun,Zhongjin Luo,Heliang Zheng,Jiahao Chang,Chongjie Ye,Huiang He,Shengchu Zhao,Rongfei Jia,Xiaoguang Han
Main category: cs.CV
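TL;DR: This paper proposes the EI-Part framework, which uses a two-state Explode and Implode representation together with self-attention to achieve high-quality, structurally coherent, geometrically plausible, and efficient part-level 3D generation.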
Details
Motivation: Existing part-level 3D generation methods fall short in structural coherence, geometric plausibility, accuracy, and efficiency, and struggle to generate parts that are both geometrically plausible and semantically meaningful.
Method: The EI-Part framework uses an Explode state for part completion and an Implode state for refining geometric detail, and applies self-attention in both states to keep the parts structurally consistent.
Result: EI-Part significantly outperforms existing methods on multiple benchmarks, producing semantically meaningful, structurally coherent, and richly detailed 3D parts, and reaches SOTA performance in part-level 3D generation.
Conclusion: Through staged representations and attention-driven feature fusion, EI-Part effectively addresses the structural and geometric difficulties of part-level 3D generation while remaining efficient and high in quality.
Abstract: Part-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components, characterized by strong structural coherence, geometric plausibility, geometric fidelity, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy fully leverages spatial resolution, enabling flexible part completion and fine geometric detail generation. To maintain structural coherence between parts, a self-attention mechanism is incorporated in both exploded and imploded states, facilitating effective information perception and feature fusion among components during generation. Extensive experiments on multiple benchmarks demonstrate that EI-Part efficiently produces semantically meaningful and structurally coherent parts with fine-grained geometric details, achieving state-of-the-art performance in part-level 3D generation. Project page: https://cvhadessun.github.io/EI-Part/
[246] A Hyperbolic Perspective on Hierarchical Structure in Object-Centric Scene Representations
Neelu Madan,Àlex Pujol,Andreas Møgelmose,Sergio Escalera,Kamal Nasrollahi,Graham W. Taylor,Thomas B. Moeslund
Main category: cs.CV
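TL;DR: This paper proposes a post-hoc method that projects Euclidean slot attention embeddings into hyperbolic space (the Lorentz hyperboloid) to reveal the hierarchical structure implicit in visual scenes. It finds that hyperbolic geometry naturally exposes a scene-level to object-level organization, and that there is a tradeoff between curvature and task performance.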
Details
Motivation: Slot attention learns slot representations in Euclidean space, which offers no geometric inductive bias for the hierarchy inherent in visual scenes, so a geometry better suited to hierarchical relationships is worth exploring.
Method: A post-hoc pipeline that requires no change to the training procedure projects Euclidean slot embeddings onto the Lorentz hyperboloid. Five-level visual hierarchies are built from slot attention masks, and the approach is validated on SPOT, VideoSAUR, and SlotContrast.
Result: Hyperbolic projection consistently reveals a scene-level to object-level hierarchical organization: coarse slots occupy greater manifold depth, a pattern that is not visible in Euclidean space. A curvature of c = 0.2 favors parent slot retrieval, while c = 0.5 gives better inter-level separation.
Conclusion: Slot representations already encode latent hierarchical information, and hyperbolic geometry can reveal it; the results support end-to-end hyperbolic slot learning as a future research direction.
Abstract: Slot attention has emerged as a powerful framework for unsupervised object-centric learning, decomposing visual scenes into a small set of compact vector representations called slots, each capturing a distinct region or object. However, these slots are learned in Euclidean space, which provides no geometric inductive bias for the hierarchical relationships that naturally structure visual scenes. In this work, we propose a simple post-hoc pipeline to project Euclidean slot embeddings onto the Lorentz hyperboloid of hyperbolic space, without modifying the underlying training pipeline. We construct five-level visual hierarchies directly from slot attention masks and analyse whether hyperbolic geometry reveals latent hierarchical structure that remains invisible in Euclidean space. Integrating our pipeline with SPOT (images), VideoSAUR (video), and SlotContrast (video), we find that hyperbolic projection exposes a consistent scene-level to object-level organisation, where coarse slots occupy greater manifold depth than fine slots, which is absent in Euclidean space. We further identify a "curvature-task tradeoff": low curvature ($c = 0.2$) matches or outperforms Euclidean on parent slot retrieval, while moderate curvature ($c = 0.5$) achieves better inter-level separation. Together, these findings suggest that slot representations already encode latent hierarchy that hyperbolic geometry reveals, motivating end-to-end hyperbolic training as a natural next step.
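As a rough illustration of the post-hoc projection described above, the sketch below lifts a Euclidean vector onto the Lorentz hyperboloid using the exponential map at the origin. The abstract does not state which mapping the authors use, so the choice of the origin exponential map, the function names, and the sample slot vector are assumptions; only the curvature values 0.2 and 0.5 come from the text.

```python
import math

def exp_map_origin(v, c):
    """Lift Euclidean vector v onto the hyperboloid of curvature -c (c > 0).

    Returns [x0, x1, ..., xn] satisfying -x0^2 + sum(xi^2) = -1/c.
    """
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # The zero vector maps to the hyperboloid origin.
        return [1.0 / sqrt_c] + [0.0] * len(v)
    time = math.cosh(sqrt_c * norm) / sqrt_c
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    return [time] + [scale * x for x in v]

def lorentz_inner(x, y):
    """Lorentzian inner product: negative time term plus spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def depth_from_origin(x, c):
    """Geodesic distance from the hyperboloid origin to point x."""
    sqrt_c = math.sqrt(c)
    return math.acosh(max(1.0, sqrt_c * x[0])) / sqrt_c

slot = [0.3, -1.2, 0.5]  # hypothetical slot embedding, not real model output
for c in (0.2, 0.5):
    p = exp_map_origin(slot, c)
    # lorentz_inner(p, p) should be close to -1/c for a point on the manifold.
    print(c, lorentz_inner(p, p), depth_from_origin(p, c))
```

Checking that the Lorentzian self-inner product equals -1/c is a quick way to confirm that a projected point lies on the manifold. Under this particular map, the distance from the origin equals the Euclidean norm of the input vector, which gives a second sanity check.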
Code and models are available at https://github.com/NeeluMadan/HHS.
[247] High-speed Imaging through Turbulence with Event-based Light Fields
Yu-Hsiang Huang,Levi Burner,Sachin Shah,Ziyuan Qu,Adithya Pediredla,Christopher A. Metzler
Main category: cs.CV
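TL;DR: This paper presents an event-based light field camera system that, combined with a machine learning reconstruction algorithm, is the first to image fast-moving, non-rigid extended objects at high frame rates through strong atmospheric turbulence.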
Details
Motivation: Event cameras can estimate high-speed imagery at thousands of frames per second, but they cannot distinguish scene motion from turbulence-induced artifacts, which limits their use in strongly turbulent environments.
Method: An event-based light field camera captures multiple views of the scene simultaneously, and a machine learning reconstruction algorithm uses the strength of event correlation across views to separate motion-induced dynamics (strongly correlated) from turbulence-induced dynamics (weakly correlated).
Result: Tabletop experiments show that the method successfully images objects moving at up to 16,000 pixels per second under strong turbulence.
Conclusion: Combining event-based light field cameras with machine learning effectively overcomes atmospheric turbulence and offers a new paradigm for robust imaging of fast, non-rigid targets.
Abstract: This work introduces and demonstrates the first system capable of imaging fast-moving extended non-rigid objects through strong atmospheric turbulence at high frame rates. Event cameras are a novel sensing architecture capable of estimating high-speed imagery at thousands of frames per second. However, on their own event cameras are unable to disambiguate scene motion from turbulence. In this work, we overcome this limitation using event-based light field cameras: By simultaneously capturing multiple views of a scene, event-based light field cameras and machine learning-based reconstruction algorithms are able to disambiguate motion-induced dynamics, which produce events that are strongly correlated across views, from turbulence-induced dynamics, which produce events that are weakly correlated across views. Tabletop experiments demonstrate that event-based light fields can overcome strong turbulence while imaging high-speed objects traveling at up to 16,000 pixels per second.
[248] Intrinsic Tolerance in C-Arm Imaging: How Extrinsic Re-optimization Preserves 3D Reconstruction Accuracy
Lin Li,Benjamin Aubert,Paul Kemper,Aric Plumley
Main category: cs.CV
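TL;DR: This study re-optimizes the extrinsic parameters of C-arm X-ray systems to compensate for intrinsic calibration errors, maintaining submillimeter 3D reconstruction accuracy even when the intrinsic errors are large.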
Details
Motivation: 3D reconstruction with C-arm fluoroscopy depends on accurate intrinsic calibration, which is hard to achieve in clinical practice, so a method that is robust to intrinsic errors is needed.
Method: In simulation and real-world experiments, controlled perturbations are applied to the intrinsic parameters (focal length, principal point) of five commercial C-arm systems. 3D points are reconstructed from known phantom geometries, the extrinsic poses are re-estimated with standard optimization, and the reconstruction and reprojection errors are then evaluated.
Result: Even with focal length errors of 500 pixels (about 100 mm), the mean 3D reconstruction error stays below 0.2 mm, and at 700 pixels it rises only to about 0.3 mm. After a principal point shift of 200 pixels, re-optimizing the extrinsic parameters leaves the reconstruction error almost unchanged, and the reprojection error increases by less than 0.5 pixels.
Conclusion: Moderate intrinsic calibration errors can be effectively compensated by extrinsic re-optimization, maintaining submillimeter reconstruction accuracy. This relaxes the precision required of intrinsic calibration and simplifies clinical C-arm deployment and workflow.
Abstract: Purpose: C-arm fluoroscopy's 3D reconstruction relies on accurate intrinsic calibration, which is often challenging in clinical practice. This study ensures high-precision reconstruction accuracy by re-optimizing the extrinsic parameters to compensate for intrinsic calibration errors. Methods: We conducted both simulation and real-world experiments using five commercial C-arm systems. Intrinsic parameters were perturbed in controlled increments. Focal length was increased by 100 to 700 pixels ($\approx$20 mm to 140 mm) and principal point by 20 to 200 pixels. For each perturbation, we (1) reconstructed 3D points from known phantom geometries, (2) re-estimated extrinsic poses using standard optimization, and (3) measured reconstruction and reprojection errors relative to ground truth. Results: Even with focal length errors up to 500 pixels ($\approx$100 mm, assuming a nominal focal length of $\sim$1000 mm), mean 3D reconstruction error remained under 0.2 mm. Larger focal length deviations (700 pixels) elevated error to only $\approx$0.3 mm. Principal point shifts up to 200 pixels introduced negligible reconstruction error once extrinsic parameters were re-optimized, with reprojection error increases below 0.5 pixels. Conclusion: Moderate errors in intrinsic calibration can be effectively mitigated by extrinsic re-optimization, preserving submillimeter 3D reconstruction accuracy.
This intrinsic tolerance suggests a practical pathway to relax calibration precision requirements, thereby simplifying C-arm system setup and reducing clinical workflow burden without compromising performance.
[249] EyeWorld: A Generative World Model of Ocular State and Dynamics
Ziyu Gao,Xinyuan Wu,Xiaolan Chen,Zhuoran Liu,Ruoyu Chen,Bowen Liu,Bingjie Yan,Zhenhan Wang,Kai Jin,Jiancheng Yang,Yih Chung Tham,Mingguang He,Danli Shi
Main category: cs.CV
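TL;DR: EyeWorld is a generative world model for ophthalmology that treats the eye as a partially observed dynamical system. Using a stable latent state shared across modalities, it performs fine-grained parsing, structure-preserving cross-modality translation, and quality-robust enhancement, and it supports forecasting of clinical progression.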
Details
Motivation: Most existing medical foundation models are static and cope poorly with modality differences and changes in acquisition conditions, whereas ophthalmic decisions depend on subtle lesion cues that evolve over time across multimodal imaging.
Method: EyeWorld builds a world model of ocular dynamics grounded in clinical imaging, learns an observation-stable latent ocular state shared across modalities, and uses longitudinal supervision to model time-conditioned state transitions.
Result: Within a single framework, the model delivers fine-grained anatomical parsing, structure-preserving cross-modality image translation, quality-robust enhancement, and clinically meaningful forecasting of disease progression.
Conclusion: By moving from static representation learning to explicit dynamical modeling, EyeWorld offers a new paradigm for robust multimodal interpretation and prognosis-oriented simulation in medicine.
Abstract: Ophthalmic decision-making depends on subtle lesion-scale cues interpreted across multimodal imaging and over time, yet most medical foundation models remain static and degrade under modality and acquisition shifts. Here we introduce EyeWorld, a generative world model that conceptualizes the eye as a partially observed dynamical system grounded in clinical imaging. EyeWorld learns an observation-stable latent ocular state shared across modalities, unifying fine-grained parsing, structure-preserving cross-modality translation and quality-robust enhancement within a single framework. Longitudinal supervision further enables time-conditioned state transitions, supporting forecasting of clinically meaningful progression while preserving stable anatomy. By moving from static representation learning to explicit dynamical modeling, EyeWorld provides a unified approach to robust multimodal interpretation and prognosis-oriented simulation in medicine.
[250] A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
Yichang Xu,Gaowen Liu,Ramana Rao Kompella,Tiansheng Huang,Sihao Hu,Fatih Ilhan,Selim Furkan Tekin,Zachary Yahn,Ling Liu
Main category: cs.CV
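TL;DR: This paper proposes A4VL, a multi-agent perception-action exploration alliance that improves the efficiency and quality of long-video reasoning through multi-round perception-action loops and event-driven block alignment.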
Details
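Motivation: Existing vision-language models (VLMs) face high computational cost, low reasoning quality, and difficulty focusing on key events when handling real-world long videos, so an efficient, scalable, and accurate method for long-video question answering (VideoQA) is needed.
Method: A4VL forms an alliance of multiple VLM agents that performs a two-stage exploration in each round: (1) perception exploration, where each agent extracts query-relevant perception clues from sampled frames and performs clue-guided video block alignment; (2) action exploration, where each agent produces an initial answer and the agents reach consensus through cross-review and relevance ranking. If no satisfactory consensus is reached, low-performing agents are pruned and a new round begins; otherwise the final answer is output. The approach combines event-driven video partitioning with clue-guided alignment.
Result: On five mainstream VideoQA benchmarks, A4VL outperforms 18 representative VLMs and 10 recent methods optimized for long videos, with substantially lower inference latency.
Conclusion: Through multi-agent collaboration, a multi-round perception-action loop, and structured video understanding, A4VL improves both the accuracy and the efficiency of long-video reasoning and offers a scalable paradigm for complex video understanding.
Abstract: This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest performing agent) and re-staging (e.g., new-clue and matching block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning.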
Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 10 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.
[251] TMPDiff: Temporal Mixed-Precision for Diffusion Models
Basile Lewandowski,Simon Kurz,Aditya Shankar,Robert Birke,Jian-Jia Chen,Lydia Y. Chen
Main category: cs.CV
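TL;DR: This paper proposes TMPDiff, a temporal mixed-precision framework for diffusion models that reduces inference latency by assigning different numeric precisions to different denoising timesteps, and that improves perceptual quality over uniform-precision baselines at matched speedup.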
Details
Motivation: Diffusion models suffer from high inference latency in text-to-image generation, and existing quantization methods use a fixed precision, leaving precision optimization along the time axis unexplored.
Method: The TMPDiff framework hypothesizes that quantization errors accumulate additively across timesteps and, on that basis, designs an adaptive bisection algorithm that assigns a precision to each timestep with linear evaluation complexity.
Result: Across four state-of-the-art diffusion models and three datasets, TMPDiff improves perceptual quality by 10% to 20% over uniform-precision baselines at the same speedup; on FLUX.1-dev it reaches 90% of full-precision SSIM at a 2.5x speedup.
Conclusion: Mixed-precision quantization along the time dimension is an effective and efficient way to accelerate diffusion models while balancing speed and generation quality.
Abstract: Diffusion models are the go-to method for Text-to-Image generation, but their iterative denoising process has high inference latency. Quantization reduces compute time by using lower bitwidths, but applies a fixed precision across all denoising timesteps, leaving an entire optimization axis unexplored. We propose TMPDiff, a temporal mixed-precision framework for diffusion models that assigns different numeric precision to different denoising timesteps. We hypothesize that quantization errors accumulate additively across timesteps, which we then validate experimentally. Based on our observations, we develop an adaptive bisection-based algorithm, which assigns per-step precisions with linear evaluation complexity, reducing an otherwise exponential search problem. Across four state-of-the-art diffusion models and three datasets, TMPDiff consistently outperforms uniform-precision baselines at matched speedup, achieving 10 to 20% improvement in perceptual quality. On FLUX.1-dev, TMPDiff achieves 90% SSIM relative to the full-precision model at a speedup of 2.5x over 16-bit inference.
[252] MotionCFG: Boosting Motion Dynamics via Stochastic Concept Perturbation
Byungjun Kim,Soobin Um,Jong Chul Ye
Main category: cs.CV
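TL;DR: This paper proposes MotionCFG, a framework that injects Gaussian noise into concept embeddings to create localized negative samples, implicitly mining motion-related hard negatives. This improves the quality of dynamic motion in text-to-video generation while avoiding the content-motion drift caused by explicit negative prompts.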
Details
Motivation: Existing text-to-video (T2V) methods rely on explicit negative prompts (such as "static" or "blurry") combined with Classifier-Free Guidance to suppress undesired motion, but this tends to cause Content-Motion Drift, which harms semantic consistency and object integrity.
Method: MotionCFG injects Gaussian noise into the target concept embedding to build localized negative anchors, enabling implicit hard negative mining, and pairs this with a piecewise guidance schedule that intervenes only in the early denoising steps.
Result: The method markedly improves motion dynamics across several SOTA T2V models with minimal computational overhead and negligible loss in visual quality, and it can also steer complex non-linear concepts such as object count.
Conclusion: MotionCFG provides a more robust and semantically safe mechanism for enhancing motion, overcoming the limits of explicit negative prompts and opening a path toward fine-grained temporal control in T2V.
Abstract: Despite recent advances in Text-to-Video (T2V) synthesis, generating high-fidelity and dynamic motion remains a significant challenge. Existing methods primarily rely on Classifier-Free Guidance (CFG), often with explicit negative prompts (e.g., "static", "blurry"), to suppress undesired artifacts. However, such explicit negations frequently introduce unintended semantic bias and distort object integrity, a phenomenon we define as Content-Motion Drift. To address this, we propose MotionCFG, a framework that enhances motion dynamics by contrasting a target concept with its noise-perturbed counterparts. Specifically, by injecting Gaussian noise into the concept embeddings, MotionCFG creates localized negative anchors that encapsulate a broad complementary space of sub-optimal motion variations. Unlike explicit negations, this approach facilitates implicit hard negative mining without shifting the global semantic identity, allowing for a focused refinement of temporal details. Combined with a piecewise guidance schedule that confines intervention to the early denoising steps, MotionCFG consistently improves motion dynamics across state-of-the-art T2V frameworks with negligible computational overhead and minimal compromise in visual quality.
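The guidance scheme described above can be sketched in a few lines. This is a toy illustration rather than the authors' implementation: the denoiser is a stand-in function, and the guidance formula (standard classifier-free guidance with the noise-perturbed embedding as the negative branch), the noise scale `sigma`, the guidance weight `w`, and the cutoff `early_frac` are all assumptions.

```python
import random

def toy_denoiser(x, cond):
    # Hypothetical stand-in for a T2V denoiser: any function of (latent, embedding).
    return [xi - 0.1 * ci for xi, ci in zip(x, cond)]

def perturb(cond, sigma, rng):
    """Noise-perturbed copy of the concept embedding (a localized negative anchor)."""
    return [ci + sigma * rng.gauss(0.0, 1.0) for ci in cond]

def guided_prediction(x, cond, step, num_steps, w=3.0, sigma=0.5,
                      early_frac=0.3, rng=random):
    """CFG-style contrast against a perturbed embedding, early steps only."""
    pos = toy_denoiser(x, cond)
    if step >= early_frac * num_steps:
        # Piecewise schedule: later steps use the plain conditional prediction.
        return pos
    neg = toy_denoiser(x, perturb(cond, sigma, rng))
    # Extrapolate away from the negative branch, as in classifier-free guidance.
    return [n + w * (p - n) for p, n in zip(pos, neg)]

rng = random.Random(0)
latent, concept = [1.0, 2.0], [0.5, -0.5]
print(guided_prediction(latent, concept, step=0, num_steps=10, rng=rng))
print(guided_prediction(latent, concept, step=8, num_steps=10, rng=rng))
```

With `sigma` set to zero the negative branch coincides with the positive one and the guidance term vanishes, which makes the role of the perturbation easy to isolate when experimenting.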
Additionally, we demonstrate that this noise-induced contrastive mechanism is effective not only for sharpening motion trajectories but also for steering complex, non-linear concepts such as precise object numerosity, which are typically difficult to modulate via standard text-based guidance.
[253] Self-Supervised Uncertainty Estimation For Super-Resolution of Satellite Images
Zhe Zheng,Valéry Dewil,Pablo Arias
Main category: cs.CV
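TL;DR: This paper proposes a self-supervised super-resolution method that needs no ground-truth high-resolution data. A new self-supervised loss enables estimation of reconstruction uncertainty, and the calibration of these estimates is validated on a synthetic dataset.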
Details
Motivation: Satellite image super-resolution lacks paired low- and high-resolution data. Existing self-supervised methods exploit temporal redundancy but cannot quantify reconstruction uncertainty.
Method: A new self-supervised loss grounded in decision theory is introduced; minimizing the corresponding Bayesian risk yields the posterior mean and variance as optimal estimators.
Result: On a synthetic SkySat L1B dataset, the method produces calibrated uncertainty estimates comparable to those of supervised methods.
Conclusion: The work combines self-supervised image restoration with uncertainty quantification, providing a practical framework for uncertainty-aware image reconstruction.
Abstract: Super-resolution (SR) of satellite imagery is challenging due to the lack of paired low-/high-resolution data. Recent self-supervised SR methods overcome this limitation by exploiting the temporal redundancy in burst observations, but they lack a mechanism to quantify uncertainty in the reconstruction. In this work, we introduce a novel self-supervised loss that allows uncertainty in image super-resolution to be estimated without ever accessing the ground-truth high-resolution data. We adopt a decision-theoretic perspective and show that minimizing the corresponding Bayesian risk yields the posterior mean and variance as optimal estimators. We validate our approach on a synthetic SkySat L1B dataset and demonstrate that it produces calibrated uncertainty estimates comparable to supervised methods. Our work bridges self-supervised restoration with uncertainty quantification, providing a practical framework for uncertainty-aware image reconstruction.
[254] SGR-OCC: Evolving Monocular Priors for Embodied 3D Occupancy Prediction via Soft-Gating Lifting and Semantic-Adaptive Geometric Refinement
Yiran Guo,Simone Mentasti,Xiaofeng Jin,Matteo Frosi,Matteo Matteucci
Main category: cs.CV
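TL;DR: This paper proposes the SGR-OCC framework, which combines soft-gated feature lifting, dynamic ray-constrained anchor refinement, and a two-phase progressive training strategy to address depth ambiguity and cold-start instability in monocular 3D semantic occupancy prediction, reaching SOTA performance on multiple benchmarks.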
Details
Motivation: Online monocular 3D semantic occupancy prediction currently faces two bottlenecks: the depth ambiguity inherent in monocular depth estimation, which causes "feature bleeding" at object boundaries, and "cold start" instability from poorly initialized temporal fusion layers, which damages spatial priors.
Method: The SGR-OCC framework comprises: (1) a Soft-Gating Feature Lifter, which models depth uncertainty with a Gaussian gate to suppress background noise; (2) Dynamic Ray-Constrained Anchor Refinement, which reduces the 3D displacement search to a 1D depth correction along the camera ray; (3) a Two-Phase Progressive Training Strategy, which uses identity-initialized fusion layers to ease the cold start.
Result: SGR-OCC achieves SOTA results on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks. In local prediction it reaches a completion IoU of 58.55% and a semantic mIoU of 49.89%, surpassing EmbodiedOcc++ by 3.65% and 3.69% respectively; in embodied prediction it reaches 55.72% SC-IoU and 46.22% mIoU.
Conclusion: Guided by the idea of "Inheritance and Evolution", SGR-OCC balances monocular spatial representation with temporal consistency modeling and markedly improves the accuracy, boundary sharpness, and training stability of 3D occupancy prediction.
Abstract: 3D semantic occupancy prediction is a cornerstone for embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation that causes "feature bleeding" at object boundaries, and the "cold start" instability where uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of "Inheritance and Evolution". To perfectly inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module simplifies complex 3D displacement searches into efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, effectively resolving the cold start problem and shielding spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance.
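To make the soft-gating idea concrete, here is a minimal sketch of a Gaussian gate over depth bins along a single camera ray. The abstract only states that depth uncertainty is modeled with a Gaussian gate; the bin layout, the per-pixel mean `mu` and standard deviation `sigma`, the normalization, and the function names are illustrative assumptions.

```python
import math

def gaussian_gate(depth_bins, mu, sigma):
    """Soft weights over depth bins, peaked at the predicted depth mu."""
    raw = [math.exp(-0.5 * ((d - mu) / sigma) ** 2) for d in depth_bins]
    total = sum(raw)
    return [r / total for r in raw]

def lift_feature(feature, depth_bins, mu, sigma):
    """Spread one pixel feature along its ray, scaled by the gate weights."""
    weights = gaussian_gate(depth_bins, mu, sigma)
    return [[w * f for f in feature] for w in weights]

bins = [0.5 * i for i in range(1, 11)]  # hypothetical bins from 0.5 m to 5.0 m
confident = gaussian_gate(bins, mu=2.0, sigma=0.2)
uncertain = gaussian_gate(bins, mu=2.0, sigma=1.0)
# A small sigma concentrates the feature near mu; a large sigma spreads it out.
print(max(confident), max(uncertain))
```

In this toy version a confident depth estimate places almost all of the feature mass in one bin, while an uncertain one distributes it over neighboring bins instead of committing to a single, possibly wrong, depth.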
In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55% and a semantic mIoU of 49.89%, surpassing the previous best method, EmbodiedOcc++, by 3.65% and 3.69% respectively. In challenging embodied prediction tasks, our model reaches 55.72% SC-IoU and 46.22% mIoU. Qualitative results further confirm our model's superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.
[255] Enhancing Eye Feature Estimation from Event Data Streams through Adaptive Inference State Space Modeling
Viet Dung Nguyen,Mobina Ghorbaninejad,Chengyi Ma,Reynold Bailey,Gabriel J. Diaz,Alexander Fix,Ryan J. Suess,Alexander Ororbia
Main category: cs.CV
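TL;DR: This paper proposes the adaptive inference state space model (AISSM) for eye feature extraction from event-based eye-tracking data. It dynamically adjusts the weight given to current versus past information, using a dynamic confidence network to estimate signal-to-noise ratio and event density, and improves prediction performance across transitions between gaze behaviors.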
Details
Motivation: Existing eye feature extractors for event-based data struggle with the abrupt changes in event density caused by changes in gaze behavior, which degrades prediction performance.
Method: The adaptive inference state space model (AISSM) introduces a dynamic confidence network that estimates signal-to-noise ratio and event density in order to adapt the relative weight placed on current versus recent information. A new training technique is also designed to improve training efficiency.
Result: Experiments show that AISSM outperforms current state-of-the-art models on event-based eye feature extraction.
Conclusion: AISSM copes effectively with abrupt changes in event density, improving the robustness and accuracy of eye feature extraction and suggesting a new direction for low-power, real-time eye tracking.
Abstract: Eye feature extraction from event-based data streams can be performed efficiently and with low energy consumption, offering great utility to real-world eye tracking pipelines. However, few eye feature extractors are designed to handle sudden changes in event density caused by the changes between gaze behaviors that vary in their kinematics, leading to degraded prediction performance. In this work, we address this problem by introducing the adaptive inference state space model (AISSM), a novel architecture for feature extraction that is capable of dynamically adjusting the relative weight placed on current versus recent information. This relative weighting is determined via estimates of the signal-to-noise ratio and event density produced by a complementary dynamic confidence network. Lastly, we craft and evaluate a novel learning technique that improves training efficiency. Experimental results demonstrate that the AISSM system outperforms state-of-the-art models for event-based eye feature extraction.
[256] Effective Feature Learning for 3D Medical Registration via Domain-Specialized DINO Pretraining
Eytan Kats,Mattias P. Heinrich
Main category: cs.CV
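TL;DR: This paper proposes DINO-style self-supervised pretraining that learns dense volumetric features directly on 3D medical images for deformable registration. On cross-modality, interpatient abdominal MRI/CT registration it outperforms models pretrained on natural images (such as DINOv2) and existing registration models, and it is notably more robust and efficient in out-of-domain evaluation.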
Details
Motivation: Intensity-based registration methods are sensitive to scanner differences and complex deformations. Feature-based methods are more robust but lack high-quality dense features suited to medical images, so a computationally efficient and generalizable medical-specific feature representation is needed.
Method: A DINO-style self-supervised learning framework is pretrained end to end on 3D medical imaging data to learn dense volumetric features suited to deformable registration, and is evaluated on cross-modality (MRI/CT), interpatient abdominal registration.
Result: The domain-specialized pretraining clearly outperforms DINOv2 (pretrained on natural images) and mainstream registration models on cross-modality abdominal registration, with lower inference cost and better results in out-of-domain tests.
Conclusion: Self-supervised pretraining that is task-agnostic yet focused on medical imaging improves the robustness and efficiency of 3D medical image registration and offers a new route toward general medical representation learning.
Abstract: Medical image registration is a critical component of clinical imaging workflows, enabling accurate longitudinal assessment, multi-modal data fusion, and image-guided interventions. Intensity-based approaches often struggle with interscanner variability and complex anatomical deformations, whereas feature-based methods offer improved robustness by leveraging semantically informed representations. In this work, we investigate DINO-style self-supervised pretraining directly on 3D medical imaging data, aiming to learn dense volumetric features well suited for deformable registration. We assess the resulting representations on a challenging interpatient abdominal registration task across both MRI and CT modalities. Our domain-specialized pretraining outperforms the DINOv2 model trained on a large-scale collection of natural images, while requiring substantially lower computational resources at inference time. Moreover, it surpasses established registration models under out-of-domain evaluation, demonstrating the value of task-agnostic yet medical imaging-focused pretraining for robust and efficient 3D image registration.
[257] Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution
Dan Wang,Haiyan Sun,Shan Du,Z. Jane Wang,Zhaochong An,Serge Belongie,Xinrui Cui
Main category: cs.CV
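TL;DR: This paper proposes SpaSemSR, a spatial-semantic guided diffusion framework that uses two complementary mechanisms, spatial-grounded textual guidance and semantic-enhanced visual guidance, to achieve a better balance between perceptual quality and fidelity in image super-resolution.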
Details
Motivation: Existing GAN-based methods struggle to generate realistic fine-grained textures, while diffusion models synthesize rich detail but tend to deviate from the input, hallucinating structures and lowering fidelity. The key challenge is to exploit the strong generative priors of diffusion models without sacrificing fidelity.
Method: The SpaSemSR framework has two components: (1) spatial-grounded textual guidance, which combines object-level spatial cues with semantic prompts and aligns textual and visual structures to reduce distortion; (2) semantic-enhanced visual guidance, which uses a multi-encoder design and semantic degradation constraints to unify multimodal semantic priors. The two are adaptively fused into the diffusion process through spatial-semantic attention.
Result: Experiments on multiple benchmarks show that SpaSemSR achieves a better perception-distortion tradeoff than existing methods, producing super-resolved results that are both realistic and faithful.
Conclusion: By introducing dual spatial and semantic guidance, SpaSemSR effectively mitigates distortion and hallucination in diffusion-based super-resolution and offers a new approach to high-fidelity, high-quality image reconstruction.
Abstract: Image super-resolution (SR) aims to reconstruct high-resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off. GAN-based SR methods reduce distortion but still struggle with realistic fine-grained textures, whereas diffusion-based approaches synthesize rich details but often deviate from the input, hallucinating structures and degrading fidelity. This tension raises a key challenge: how to exploit the powerful generative priors of diffusion models without sacrificing fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework with two complementary guidances. First, spatial-grounded textual guidance integrates object-level spatial cues with semantic prompts, aligning textual and visual structures to reduce distortion. Second, semantic-enhanced visual guidance with a multi-encoder design and semantic degradation constraints unifies multimodal semantic priors, improving perceptual realism under severe degradations. These complementary guidances are adaptively fused into the diffusion process via spatial-semantic attention, suppressing distortion and hallucination while retaining the strengths of diffusion models. Extensive experiments on multiple benchmarks show that SpaSemSR achieves a superior perception-distortion balance, producing both realistic and faithful restorations.
[258] Improving Visual Reasoning with Iterative Evidence Refinement
Zeru Shi,Kai Mei,Yihao Quan,Dimitris N. Metaxas,Ruixiang Tang
Main category: cs.CV
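TL;DR: This paper proposes the SIEVE framework, which lets VLMs revisit visual evidence during reasoning through their internal visual representations, without external image operations, and which yields consistent gains in visual reasoning performance.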
Details
Motivation: Existing VLMs rely on external image operations (such as cropping or zooming) to recover details during visual reasoning, which adds encoding overhead and disrupts the coherence of reasoning. The authors argue that the model already has strong internal signals for identifying and reusing visual evidence.
Method: SIEVE is an end-to-end self-revisit framework that automatically extracts embeddings of salient image regions and injects them into the reasoning chain as needed. Reinforcement learning trains the model to decide when to trigger a revisit and which region embeddings to select.
Result: SIEVE yields consistent gains on multiple visual reasoning benchmarks and on perception, reasoning, and hallucination evaluations, with an average improvement of 8%.
Conclusion: VLMs can support image-grounded reasoning through internal representations, and SIEVE demonstrates that a self-revisit mechanism without external tools is both effective and practical.
Abstract: Vision language models (VLMs) are increasingly capable of reasoning over images, but robust visual reasoning often requires re-grounding intermediate steps in the underlying visual evidence. Recent approaches typically rely on external image operations such as zooming or cropping to re-access fine-grained details during inference, which requires additional image re-encoding and can disrupt the reasoning trajectory. We argue that VLMs already provide strong internal signals for identifying and reusing visual evidence, and that these signals can be directly leveraged to support image-grounded reasoning. Motivated by this insight, we propose an end-to-end self-revisit framework, SIEVE, that trains models to re-engage image evidence through internal representations. SIEVE automatically extracts embeddings of salient image regions and injects them into the reasoning chain when additional grounding is needed, enabling later steps to condition on relevant visual cues without external tool calls or re-encoding. We use reinforcement learning to teach the model when to trigger visual revisiting and which region embeddings to retrieve and insert during the reasoning process. Experiments on multiple visual reasoning benchmarks, together with perception, reasoning, and hallucination evaluations, show that SIEVE yields consistent gains, improving performance by 8 percent on average across several benchmarks.
[259] Low-Field Magnetic Resonance Image Quality Enhancement using Undersampled k-Space and Out-of-Distribution Generalisation
Daniel Tweneboah Anyimadu,Mohammed M. Abdelsamea,Ahmed Karam Eldaly
Main category: cs.CV
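TL;DR: This paper proposes a k-space-based deep learning framework that reconstructs high-quality, high-field-like images directly from undersampled low-field MRI data, and adds uncertainty quantification and out-of-distribution (OOD) generalization evaluation.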
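To make the input representation concrete, here is a minimal sketch of how undersampled k-space could be split into real and imaginary channels for a dual-channel network. It is illustrative only: the naive DFT (usable on tiny toy images, not real scans), the uniform row-skipping mask, and the function names `dft2`, `undersample_rows`, and `to_two_channels` are all assumptions made here, not the authors' pipeline.

```python
import cmath
from typing import List, Tuple

Complex2D = List[List[complex]]
Real2D = List[List[float]]


def dft2(img: Real2D) -> Complex2D:
    """Naive 2D discrete Fourier transform (toy-sized images only)."""
    n, m = len(img), len(img[0])
    return [
        [
            sum(
                img[y][x] * cmath.exp(-2j * cmath.pi * (u * y / n + v * x / m))
                for y in range(n)
                for x in range(m)
            )
            for v in range(m)
        ]
        for u in range(n)
    ]


def undersample_rows(kspace: Complex2D, keep_every: int) -> Complex2D:
    """Zero every k-space row except each keep_every-th one (hypothetical mask)."""
    return [
        list(row) if u % keep_every == 0 else [0j] * len(row)
        for u, row in enumerate(kspace)
    ]


def to_two_channels(kspace: Complex2D) -> Tuple[Real2D, Real2D]:
    """Split complex k-space into a real channel and an imaginary channel."""
    real = [[z.real for z in row] for row in kspace]
    imag = [[z.imag for z in row] for row in kspace]
    return real, imag


# Toy usage: a 4x4 "image" transformed, undersampled by 2, and split in two.
toy = [[float(4 * y + x + 1) for x in range(4)] for y in range(4)]
real_ch, imag_ch = to_two_channels(undersample_rows(dft2(toy), keep_every=2))
```

Practical masks usually keep the low-frequency centre of k-space densely sampled; the uniform mask here is only the simplest choice for illustration.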
Details
Motivation: Low-field MRI is inexpensive but suffers from poor image quality and long scan times; most existing deep learning methods are trained on in-distribution (InD) data and lack evaluation of robustness and generalizability in out-of-distribution (OOD) scenarios.
Method: Proposes a k-space dual-channel U-Net that jointly processes the real and imaginary parts of undersampled k-space, combined with an ensemble strategy that generates uncertainty maps, for end-to-end k-space reconstruction and quality enhancement.
Result: In low-field brain MRI experiments, the method significantly outperforms spatial-domain postprocessing and other SOTA methods on OOD data, with reconstructed image quality close to that of fully sampled high-field k-space.
Conclusion: The work is among the first to unify low-field MRI reconstruction, k-space-driven quality enhancement, and uncertainty quantification in a single framework, and it demonstrates strong generalization under OOD conditions.
Abstract: Low-field magnetic resonance imaging (MRI) offers affordable access to diagnostic imaging but faces challenges such as prolonged acquisition times and reduced image quality. Although accelerated imaging via k-space undersampling helps reduce scan time, image quality enhancement methods often rely on spatial-domain postprocessing. Deep learning has achieved state-of-the-art results in both domains. However, most models are trained and evaluated using in-distribution (InD) data, creating a significant gap in understanding model performance when tested using out-of-distribution (OOD) data. To address these issues, we propose a novel framework that reconstructs high-field-like MR images directly from undersampled low-field MRI k-space, quantifies the impact of reduced sampling, and evaluates the generalisability of the model using OOD data. Our approach utilises a k-space dual-channel U-Net to jointly process the real and imaginary components of undersampled k-space, restoring missing frequency content, and incorporates an ensemble strategy to generate uncertainty maps. Experiments on low-field brain MRI demonstrate that our k-space-driven image quality enhancement outperforms the counterpart spatial-domain and other state-of-the-art baselines, achieving image quality comparable to full high-field k-space acquisitions using OOD data. To the best of our knowledge, this work is among the first to combine low-field MR image reconstruction, quality enhancement using undersampled k-space, and uncertainty quantification within a unified framework.
[260] Low-Field Magnetic Resonance Image Enhancement using Undersampled k-Space
Daniel Tweneboah Anyimadu,Mohammed Abdalla,Mohammed M. Abdelsamea,Ahmed Karam Eldaly
Main category: cs.CV
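TL;DR: This paper proposes a new deep learning framework based on a U-Net variant that super-resolves low-field MRI images directly in k-space, using undersampled k-space data to model accelerated acquisition and image quality enhancement in a single unified process.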
Details
Motivation: Low-field MRI is low-cost and promising for resource-limited regions, but it faces two bottlenecks: long scan times and poor image quality.
Method: Proposes a U-Net variant that operates directly in k-space and merges undersampled k-space reconstruction and super-resolution into a single process, instead of the conventional reconstruct-then-postprocess approach.
Result: Validated on synthetic and real low-field brain MRI data, the k-space-driven method outperforms conventional spatial-domain methods; undersampled reconstructions match the quality of full k-space acquisitions, substantially shortening scan time without compromising diagnostic value.
Conclusion: Deep learning with joint k-space modeling can deliver both accelerated acquisition and image enhancement for low-field MRI, offering a more practical solution for resource-constrained settings.
Abstract: Low-field magnetic resonance imaging (MRI) offers a cost-effective alternative for medical imaging in resource-limited settings. However, its widespread adoption is hindered by two key challenges: prolonged scan times and reduced image quality. Accelerated acquisition can be achieved using k-space undersampling, while image enhancement traditionally relies on spatial-domain postprocessing. In this work, we propose a novel deep learning framework based on a U-Net variant that operates directly in k-space to super-resolve low-field MR images from undersampled data while quantifying the impact of reduced k-space sampling. Unlike conventional approaches that treat image super-resolution as a postprocessing step following image reconstruction from undersampled k-space, our unified model integrates both processes, leveraging k-space information to achieve superior image fidelity. Extensive experiments on synthetic and real low-field brain MRI datasets demonstrate that k-space-driven image super-resolution outperforms conventional spatial-domain counterparts. Furthermore, our results show that undersampled k-space reconstructions achieve comparable quality to full k-space acquisitions, enabling substantial scan-time acceleration without compromising diagnostic utility.
[261] Implementation and discussion of the Pith Estimation on Rough Log End Images using Local Fourier Spectrum Analysis method
Henry Marichal,Diego Passarella,Gregory Randall
Main category: cs.CV
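TL;DR: This paper analyzes and implements the "Pith Estimation on Rough Log End images using Local Fourier Spectrum Analysis" method by Rudolf Schraml and Andreas Uhl, and tests it on two datasets.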
Details
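Motivation: Locating the pith in rough log end images is difficult, so a robust, automated estimation method is needed.
Method: Uses local Fourier spectrum analysis to estimate the pith position and provides a Python implementation.
Result: The algorithm was tested on two different datasets, which suggests some generalization ability and practical usefulness.
Conclusion: The implemented pith estimation algorithm based on local Fourier spectrum analysis is feasible and provides a useful tool for wood image analysis.
Abstract: In this article, we analyze and propose a Python implementation of the method "Pith Estimation on Rough Log End images using Local Fourier Spectrum Analysis", by Rudolf Schraml and Andreas Uhl. The algorithm is tested over two datasets.
[262] Diffusion Reinforcement Learning via Centered Reward Distillation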
Yuanzhi Zhu,Xi Wang,Stéphane Lathuilière,Vicky Kalogeiton
Main category: cs.CV
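TL;DR: This paper proposes Centered Reward Distillation (CRD), a reinforcement learning fine-tuning framework for diffusion models. Within-prompt centering sidesteps the intractable normalizing constant, and several techniques control distribution drift; together they significantly improve prompt fidelity, compositional correctness, and text rendering in text-to-image generation while reducing reward hacking and speeding up convergence.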
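The claim that within-prompt centering removes the normalizing constant can be checked with a short derivation. This is a sketch under the standard KL-regularized objective; the symbols $\beta$, $p_{\mathrm{ref}}$, $Z(c)$, $\ell_i$, and the group size $K$ are notation assumed here and may differ from the paper's.

```latex
% Assumed objective for a prompt c with reward r and KL weight \beta > 0:
\max_{p}\; \mathbb{E}_{x \sim p(\cdot \mid c)}\big[r(x,c)\big]
  - \beta\, \mathrm{KL}\big(p(\cdot \mid c)\,\|\,p_{\mathrm{ref}}(\cdot \mid c)\big)

% Its well-known maximizer is a reward-tilted reference distribution:
p^{*}(x \mid c) = \frac{p_{\mathrm{ref}}(x \mid c)\,\exp\!\big(r(x,c)/\beta\big)}{Z(c)},
\qquad
Z(c) = \mathbb{E}_{x \sim p_{\mathrm{ref}}(\cdot \mid c)}\big[\exp\!\big(r(x,c)/\beta\big)\big]

% Taking logs isolates the intractable term \log Z(c), which depends only on c:
\beta \log \frac{p^{*}(x \mid c)}{p_{\mathrm{ref}}(x \mid c)} = r(x,c) - \beta \log Z(c)

% For K samples x_1, \dots, x_K of the same prompt, subtracting the group mean cancels it:
\beta\Big(\ell_i - \tfrac{1}{K}\sum_{j=1}^{K} \ell_j\Big)
  = r(x_i,c) - \tfrac{1}{K}\sum_{j=1}^{K} r(x_j,c),
\qquad
\ell_i = \log \frac{p^{*}(x_i \mid c)}{p_{\mathrm{ref}}(x_i \mid c)}
```

Because $\log Z(c)$ is identical for every sample drawn for one prompt, the centered log-ratio only has to match the centered reward, which is why a reward-matching objective can be well posed without ever estimating $Z(c)$.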
Details
Motivation: Existing diffusion and flow models fall short on key behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering, because their pretraining objectives (score matching or flow matching) only weakly constrain these behaviors; conventional RL fine-tuning, meanwhile, suffers from high variance, high memory cost, or reward hacking caused by distribution drift.
Method: Proposes Centered Reward Distillation (CRD), derived from KL-regularized reward maximization and built on forward-process fine-tuning. The core idea is within-prompt centering, under which the intractable normalizing constant cancels. Three techniques are added: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to suppress long-run distribution drift, and (iii) reward-adaptive KL strength to balance early learning speed against late-stage robustness.
Result: In text-to-image fine-tuning experiments with GenEval and OCR rewards, CRD achieves competitive SOTA reward optimization with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
Conclusion: CRD is an efficient, stable, and practical RL fine-tuning framework for diffusion models that addresses distribution drift and reward hacking in forward-process diffusion RL, and it offers a new approach to high-quality controllable generation.
Abstract: Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score or flow matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward-process approaches converge faster but can suffer from distribution drift, and hence reward hacking. In this work, we present Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization built on forward-process-based fine-tuning. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pre-trained model, and (iii) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
[263] DualSwinFusionSeg: Multimodal Martian Landslide Segmentation via Dual Swin Transformer with Multi-Scale Fusion and UNet++
Shahriar Kabir,Abdullah Muhammed Amimul Ehsan,Istiak Ahmmed Rifti,Md Kaykobad Reza
Main category: cs.CV
TL;DR: 本文提出DualSwinFusionSeg模型,用于火星滑坡的多模态遥感图像分割,通过双Swin Transformer编码器分别处理RGB与地球物理数据,并在多尺度上融合特征,结合UNet++解码器提升边界精度,在MMLSv2数据集上取得优异性能。
Details
Motivation: Automated segmentation of Martian landslides matters for planetary geology, hazard assessment, and future robotic exploration, but remote-sensing modalities are highly heterogeneous and labeled samples are scarce, so existing methods handle multi-source heterogeneous data poorly.
Method: Proposes the DualSwinFusionSeg architecture: two parallel Swin Transformer V2 encoders extract modality-specific features from RGB imagery and auxiliary geophysical data (such as DEMs, slope maps, and thermal inertia); cross-modal features are fused at multiple scales; a UNet++ decoder with dense nested skip connections recovers fine boundary details.
Result: On the MMLSv2 dataset from the PBVS 2026 Mars-LS Challenge, the model reaches 0.867 mIoU and 0.905 F1 on the development set and 0.783 mIoU on the test set, supporting the effectiveness of modality-specific encoding and simple concatenation-based fusion with limited training data.
Conclusion: DualSwinFusionSeg offers an efficient and robust solution for multimodal remote-sensing segmentation of planetary surfaces such as Mars, and highlights the role of modality decoupling and multi-scale fusion in improving segmentation with few samples.
Abstract: Automated segmentation of Martian landslides, particularly in tectonically active regions such as Valles Marineris, is important for planetary geology, hazard assessment, and future robotic exploration. However, detecting landslides from planetary imagery is challenging due to the heterogeneous nature of available sensing modalities and the limited number of labeled samples. Each observation combines RGB imagery with geophysical measurements such as digital elevation models, slope maps, thermal inertia, and contextual grayscale imagery, which differ significantly in resolution and statistical properties. To address these challenges, we propose DualSwinFusionSeg, a multimodal segmentation architecture that separates modality-specific feature extraction and performs multi-scale cross-modal fusion. The model employs two parallel Swin Transformer V2 encoders to independently process RGB and auxiliary geophysical inputs, producing hierarchical feature representations. Corresponding features from the two streams are fused at multiple scales and decoded using a UNet++ decoder with dense nested skip connections to preserve fine boundary details. Extensive ablation studies evaluate modality contributions, loss functions, decoder architectures, and fusion strategies. Experiments on the MMLSv2 dataset from the PBVS 2026 Mars-LS Challenge show that modality-specific encoders and simple concatenation-based fusion improve segmentation accuracy under limited training data. The final model achieves 0.867 mIoU and 0.905 F1 on the development benchmark and 0.783 mIoU on the held-out test set, demonstrating strong performance for multimodal planetary surface segmentation.
[264] CIPHER: Culvert Inspection through Pairwise Frame Selection and High-Efficiency Reconstruction
Seoyoung Lee,Zhangyang Wang
Main category: cs.CV
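TL;DR: This paper proposes an efficient RGB-based 3D reconstruction pipeline for automated inspection and reconstruction of culvert-like structures in visually repetitive environments, balancing viewpoint diversity, valid correspondence matching, and real-time joint estimation of RGB appearance, geometry, and semantics.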
Details
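Motivation: To improve the safety and efficiency of flood management operations by addressing the difficulty of automated inspection of culvert-like structures in visually repetitive environments.
Method: Designs a two-stage pipeline: a plug-and-play module first selects frame pairs with maximal viewpoint diversity and reliable correspondence matching; a real-time reconstruction model then jointly estimates RGB appearance, geometry, and semantics.
Result: Experiments show that the method generates accurate 3D reconstructions and depth maps and improves culvert inspection efficiency with minimal human intervention.
Conclusion: The proposed RGB-based 3D reconstruction pipeline offers an efficient, robust, and practical technical solution for automated culvert inspection.
Abstract: Automated culvert inspection systems can help increase the safety and efficiency of flood management operations. As a key step toward such a system, we present an efficient RGB-based 3D reconstruction pipeline for culvert-like structures in visually repetitive environments. Our approach first selects informative frame pairs to maximize viewpoint diversity while ensuring valid correspondence matching using a plug-and-play module, followed by a reconstruction model that simultaneously estimates RGB appearance, geometry, and semantics in real time. Experiments demonstrate that our method effectively generates accurate 3D reconstructions and depth maps, enhancing culvert inspection efficiency with minimal human intervention.
[265] Seeing Through the PRISM: Compound & Controllable Restoration of Scientific Images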
Rupa Kurinchi-Vendhan,Pratyusha Sharma,Antonio Torralba,Sara Beery
Main category: cs.CV
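TL;DR: PRISM is a prompted conditional diffusion model for joint removal and selective restoration of compound degradations in scientific and environmental images. Using compound-aware supervision and a weighted contrastive disentanglement objective, it achieves high-fidelity, controllable restoration while preserving scientific signal.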
Details
Motivation: Scientific and environmental images often contain mixtures of noise from the sensor and the environment. Existing methods remove degradations one at a time, which leads to cascading artifacts, overcorrection, or loss of important signal; scientific applications need to handle compound degradations jointly while letting experts selectively remove specific distortions.
Method: Proposes PRISM, a prompt-driven conditional diffusion model that combines compound-aware supervision over mixed degradations with a weighted contrastive disentanglement objective aligning degradation primitives and their mixtures in latent space, enabling interpretable degradation separation and a compositional geometry.
Result: On microscopy, wildlife monitoring, remote sensing, and urban weather datasets, PRISM outperforms SOTA methods on complex compound degradations, including zero-shot mixtures unseen during training; selective restoration significantly improves the accuracy of downstream scientific tasks (such as species counting and cloud detection).
Conclusion: PRISM is a general, controllable, high-fidelity image restoration framework oriented toward scientific utility, combining joint degradation removal with user-interpretable, interactive restoration.
Abstract: Scientific and environmental imagery often suffers from complex mixtures of noise related to the sensor and the environment. Existing restoration methods typically remove one degradation at a time, leading to cascading artifacts, overcorrection, or loss of meaningful signal. In scientific applications, restoration must be able to simultaneously handle compound degradations while allowing experts to selectively remove subsets of distortions without erasing important features. To address these challenges, we present PRISM (Precision Restoration with Interpretable Separation of Mixtures). PRISM is a prompted conditional diffusion framework which combines compound-aware supervision over mixed degradations with a weighted contrastive disentanglement objective that aligns primitives and their mixtures in the latent space. This compositional geometry enables high-fidelity joint removal of overlapping distortions while also allowing flexible, targeted fixes through natural language prompts. Across microscopy, wildlife monitoring, remote sensing, and urban weather datasets, PRISM outperforms state-of-the-art baselines on complex compound degradations, including zero-shot mixtures not seen during training. Importantly, we show that selective restoration significantly improves downstream scientific accuracy in several domains over standard "black-box" restoration. These results establish PRISM as a generalizable and controllable framework for high-fidelity restoration in domains where scientific utility is a priority.
[266] SK-Adapter: Skeleton-Based Structural Control for Native 3D Generation
Anbang Wang,Yuzhuo Ao,Shangzhe Wu,Chi-Keung Tang
Main category: cs.CV
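TL;DR: This paper proposes SK-Adapter, a lightweight structural adapter that injects the 3D skeleton as a first-class control signal into a frozen 3D generation backbone, enabling precise skeletal control of native 3D generative models, and builds the Objaverse-TMS dataset to support training.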
Details
Motivation: Native 3D generative models are efficient and high-fidelity but cannot controllably generate precise structure (such as skeletal joints); text or image prompts cannot meet the required structural precision.
Method: Designs SK-Adapter, a lightweight adapter network that encodes the joint coordinates and topology of a 3D skeleton into learnable tokens and injects them into a frozen 3D generation backbone via cross-attention; builds the Objaverse-TMS dataset of 24k text-mesh-skeleton triplets.
Result: While preserving the geometry and texture quality of the foundation model, the method significantly outperforms existing baselines and achieves robust structural control; it is also the first to support skeleton-guided local editing of 3D assets.
Conclusion: SK-Adapter provides an efficient, precise, and extensible approach to structural control for native 3D generation, broadening its use in controllable generation and editing.
Abstract: Native 3D generative models have achieved remarkable fidelity and speed, yet they suffer from a critical limitation: they cannot prescribe precise structural articulations, and precise structural control within the native 3D space remains underexplored. This paper proposes SK-Adapter, a simple yet highly efficient and effective framework that unlocks precise skeletal manipulation for native 3D generation. Moving beyond text or image prompts, which can be ambiguous for precise structure, we treat the 3D skeleton as a first-class control signal. SK-Adapter is a lightweight structural adapter network that encodes joint coordinates and topology into learnable tokens, which are injected into the frozen 3D generation backbone via cross-attention. This design allows the model to not only effectively "attend" to specific 3D structural constraints but also preserve its original generative priors. To bridge the data gap, we contribute the Objaverse-TMS dataset, a large-scale dataset of 24k text-mesh-skeleton pairs. Extensive experiments confirm that our method achieves robust structural control while preserving the geometry and texture quality of the foundation model, significantly outperforming existing baselines. Furthermore, we extend this capability to local 3D editing, enabling region-specific editing of existing assets with skeletal guidance, which is unattainable by previous methods. Project Page: https://sk-adapter.github.io/
[267] Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories
Junyao Hu,Zhongwei Cheng,Waikeung Wong,Xingxing Zou
Main category: cs.CV
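TL;DR: This paper introduces Garments2Look, the first large-scale multimodal dataset for outfit-level virtual try-on (VTON), together with a synthesis pipeline and baselines, and reveals the limitations of current VTON methods on multi-garment, layered, and styled try-on.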
Details
Motivation: Existing virtual try-on systems are limited to single garments and cannot handle real-world full outfits involving multiple garments, accessories, fine-grained categories, layering, and diverse styling; large-scale outfit-level datasets with sufficient diversity and fine-grained annotation are also lacking.
Method: Builds the Garments2Look dataset: 80K "many garments to one look" samples covering 40 major categories and 300+ fine-grained subcategories; each sample has 3-12 reference garment images, a model image wearing the outfit, and detailed text annotations. Proposes a synthesis pipeline that heuristically constructs outfit lists before generation, followed by strict automated filtering and human validation.
Result: Evaluating existing SOTA VTON and general-purpose image editing models on the dataset shows widespread garment misalignment, layering errors, style inconsistency, and generation artifacts in full-outfit try-on; the task is substantially harder than single-garment VTON.
Conclusion: Garments2Look fills the data gap for outfit-level VTON, exposes key challenges in joint multi-garment modeling, layering inference, and style consistency, and provides a benchmark and direction for next-generation VTON research.
Abstract: Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple garments, accessories, fine-grained categories, layering, and diverse styling, which remain beyond current VTON systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the first large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 300+ fine-grained subcategories. Each pair includes an outfit with 3-12 reference garment images (average 4.48), a model image wearing the outfit, and detailed item and try-on textual annotations. To balance authenticity and diversity, we propose a synthesis pipeline. It involves heuristically constructing outfit lists before generating try-on results, with the entire process subjected to strict automated filtering and human validation to ensure data quality. To probe task difficulty, we adapt SOTA VTON methods and general-purpose image editing models to establish baselines. Results show current methods struggle to try on complete outfits seamlessly and to infer correct layering and styling, leading to misalignment and artifacts.
[268] BluRef: Unsupervised Image Deblurring with Dense-Matching References
Bang-Dang Pham,Anh Tran,Cuong Pham,Minh Hoai
Main category: cs.CV
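TL;DR: This paper proposes a new unsupervised image deblurring method that needs no paired data. A dense matching model establishes correspondences between unpaired blurry and sharp images to generate pseudo-ground-truth data, yielding deblurring that is efficient, general, and suitable for networks of different sizes, including those for low-resource devices.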
Details
Motivation: Existing methods depend on paired blurry-sharp image data, which is hard to collect and generalizes poorly; an unsupervised deblurring approach that does not depend on paired data, is easier to deploy, and adapts well is needed.
Method: Uses unpaired blurry images and sharp reference images of similar scenes; a dense matching model establishes pixel-level correspondences to generate pseudo-sharp images as supervision for end-to-end unsupervised training.
Result: Achieves state-of-the-art performance on multiple benchmarks, supporting the method's effectiveness and generalization.
Conclusion: The method removes the dependence on paired data and pre-trained networks, improving the practicality, scalability, and deployment flexibility of deblurring and offering a new approach to unsupervised image restoration.
Abstract: This paper introduces a novel unsupervised approach for image deblurring that utilizes a simple process for training data collection, thereby enhancing the applicability and effectiveness of deblurring methods. Our technique does not require meticulously paired data of blurred and corresponding sharp images; instead, it uses unpaired blurred and sharp images of similar scenes to generate pseudo-ground truth data by leveraging a dense matching model to identify correspondences between a blurry image and reference sharp images. Thanks to the simplicity of the training data collection process, our approach does not rely on existing paired training data or pre-trained networks, making it more adaptable to various scenarios and suitable for networks of different sizes, including those designed for low-resource devices. We demonstrate that this novel approach achieves state-of-the-art performance, marking a significant advancement in the field of image deblurring.
[269] Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
Ruiying Peng,Xueyu Wu,Jing Lei,Lu Hou,Yuanzheng Ma,Xiaohui Li
Main category: cs.CV
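TL;DR: This paper proposes Visual Region-Guided Attention (VRGA), a training-free framework that selects and reweights visual attention heads using an entropy-focus criterion, mitigating the visual perception degradation that attention dispersion causes in multimodal large language models during multi-step reasoning.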
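As a rough illustration of what an entropy-based head selection could look like, the sketch below scores each head by the Shannon entropy of its attention over image tokens and keeps the most focused ones. This is an assumption-laden toy, not the VRGA criterion itself: the function names `attention_entropy` and `select_focused_heads`, the lowest-entropy rule, and the fixed `k` are all hypothetical, and the reweighting step is left out.

```python
import math
from typing import List


def attention_entropy(weights: List[float]) -> float:
    """Shannon entropy (in nats) of attention weights, renormalized to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def select_focused_heads(head_attn: List[List[float]], k: int) -> List[int]:
    """Return the indices of the k heads with the lowest attention entropy.

    head_attn[h] holds head h's attention over the image tokens (hypothetical layout).
    """
    order = sorted(range(len(head_attn)), key=lambda h: attention_entropy(head_attn[h]))
    return sorted(order[:k])


# Toy usage: head 1 concentrates on one image token, so it is the most "focused".
heads = [
    [0.25, 0.25, 0.25, 0.25],  # fully dispersed
    [0.97, 0.01, 0.01, 0.01],  # sharply focused
    [0.40, 0.40, 0.10, 0.10],  # in between
]
focused = select_focused_heads(heads, k=1)
```

Low entropy means attention mass sits on few image tokens; a dispersed head approaches the maximum entropy of log(number of tokens).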
Details
Motivation: MLLMs often lose visual perception ability in extended reasoning modes, especially on VQA tasks; the authors find the root cause is that visual attention disperses during multi-step reasoning and drifts away from question-relevant regions.
Method: Analyzes MLLM attention maps and finds that reasoning prompts markedly weaken attention to key image regions; building on a strong correlation between the model's overall attention on image tokens and the spatial dispersion of that attention, designs the training-free VRGA framework, which selects and reweights visual attention heads by an entropy-focus criterion.
Result: Experiments on multiple vision-language benchmarks show that VRGA mitigates perceptual degradation, improves visual grounding and reasoning accuracy, and provides interpretable insight into how visual information is processed.
Conclusion: Attention dispersion is a key cause of visual perception degradation during MLLM reasoning; VRGA dynamically steers attention toward question-relevant regions and improves performance and interpretability without training.
Abstract: Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.
[270] Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models
Advaith Ravishankar,Serena Liu,Mingyang Wang,Todd Zhou,Jeffrey Zhou,Arnav Sharma,Ziling Hu,Léopold Das,Abdulaziz Sobirov,Faizaan Siddique,Freddy Yu,Seungjoo Baek,Yan Luo,Mengyu Wang
Main category: cs.CV
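TL;DR: This paper systematically benchmarks text-to-image generative models, focusing on how one-step models compare with multi-step models under fair conditions, and proposes a new metric, MMHM, to balance multiple evaluation dimensions.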
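This digest names the MinMax Harmonic Mean (MMHM) but does not give its formula. One plausible reading, assumed here and not confirmed by the source, is to min-max normalize each metric across the configurations in a sweep, flip lower-is-better metrics such as FID, and take the harmonic mean so that one collapsed metric drags the whole score down. The function name, the metric keys, and the `eps` smoothing are hypothetical.

```python
from typing import Dict, List

# Assumption: FID is the only lower-is-better metric among the four.
LOWER_IS_BETTER = {"fid"}


def minmax_harmonic_mean(runs: List[Dict[str, float]], eps: float = 1e-8) -> List[float]:
    """Score each run by the harmonic mean of its min-max-normalized metrics."""
    names = list(runs[0].keys())
    lo = {m: min(r[m] for r in runs) for m in names}
    hi = {m: max(r[m] for r in runs) for m in names}
    scores = []
    for r in runs:
        normed = []
        for m in names:
            span = hi[m] - lo[m]
            # A metric that never varies carries no preference; use a neutral 0.5.
            x = 0.5 if span == 0 else (r[m] - lo[m]) / span
            if m in LOWER_IS_BETTER:
                x = 1.0 - x
            normed.append(x)
        # eps keeps the reciprocal finite when a normalized metric is exactly 0.
        scores.append(len(normed) / sum(1.0 / (x + eps) for x in normed))
    return scores


# Toy sweep with made-up numbers: run 0 has the best FID but the worst of everything else.
sweep = [
    {"fid": 2.0, "inception": 200.0, "clip": 0.25, "pick": 0.20},
    {"fid": 3.0, "inception": 250.0, "clip": 0.30, "pick": 0.22},
    {"fid": 5.0, "inception": 300.0, "clip": 0.31, "pick": 0.21},
]
mmhm = minmax_harmonic_mean(sweep)
best = mmhm.index(max(mmhm))
```

In this toy sweep the balanced middle configuration wins even though it is best on only one metric, which is the kind of behavior a harmonic mean is chosen for.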
Details
Motivation: One-step text-to-image models reduce inference cost, but fair comparison with multi-step models is lacking because sampling steps and CFG settings are mismatched; it is also unclear how one-step models scale to multi-step inference, and OOD evaluation standards for generators conditioned on ImageNet label IDs are incomplete.
Method: Under a unified class-conditional protocol, benchmarks 8 models (one-step flow models, multi-step baselines, and established systems) on ImageNet validation, ImageNetV2, and the newly built OOD dataset reLAIONet, using FID, Inception Score, CLIP Score, and Pick Score, and proposes the composite metric MinMax Harmonic Mean (MMHM) to guide hyperparameter selection.
Result: FID-focused model development and CFG tuning are misleading in few-step regimes: improving FID can harm text-image alignment, human preference, and perceived quality. Leading one-step models become substantially more competitive under multi-step scaling but still show local distortions. MMHM stabilizes hyperparameter selection across CFG and step sweeps.
Conclusion: The potential of one-step models is underestimated, and their performance depends heavily on the evaluation protocol; optimization driven by a single metric (such as FID) should give way to balanced multi-metric evaluation and standardized OOD testing.
Abstract: State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals and worsening perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
[271] Deep Learning From Routine Histology Improves Risk Stratification for Biochemical Recurrence in Prostate Cancer
Clément Grisi,Khrystyna Faryna,Nefise Uysal,Vittorio Agosti,Enrico Munari,Solène-Florence Kammerer-Jacquet,Paulo Guilherme de Oliveira Salles,Yuri Tolkach,Reinhard Büttner,Sofiya Semko,Maksym Pikul,Axel Heidenreich,Jeroen van der Laak,Geert Litjens
Main category: cs.CV
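TL;DR: This paper proposes a deep learning-based biomarker that predicts continuous, patient-specific risk of biochemical recurrence (BCR) after radical prostatectomy directly from routine H&E-stained whole-slide images, and validates its generalization and added clinical value in multicenter cohorts.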
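Discrimination of time-to-event risk scores like this one is usually summarized by a concordance index, which the abstract below reports. The following is a minimal sketch of Harrell's C for right-censored data, included only to make the metric concrete; it is not the study's evaluation code, and the function name and argument layout are assumptions.

```python
from typing import List


def concordance_index(times: List[float], events: List[int], risks: List[float]) -> float:
    """Harrell's C: share of comparable pairs in which the earlier failure has the higher risk.

    times  : follow-up time per patient
    events : 1 if recurrence was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means earlier expected recurrence)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an observed event strictly before j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


# Toy usage: risks perfectly ordered against time give a C-index of 1.0.
c = concordance_index([2.0, 4.0, 6.0, 8.0], [1, 1, 0, 1], [0.9, 0.7, 0.3, 0.1])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported gains of a few hundredths are modest but consistent improvements in ranking.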
Details
Motivation: Existing clinicopathological risk models describe tissue morphology too coarsely, leaving much prognostic information unused.
Method: Develops an end-to-end deep learning model that predicts time-to-event (BCR) risk directly from H&E-stained whole-slide images; validates it across multiple cohorts in combination with the CAPRA-S clinical score; performs outcome-grounded interpretability analyses.
Result: The model is robust across four independent international cohorts; combined with CAPRA-S, the C-index rises from 0.725-0.772 to 0.749-0.788; the analyses reveal subtle histomorphological patterns that conventional scores miss.
Conclusion: Deep learning can extract reproducible, clinically generalizable biomarkers from routine prostate histopathology images to support postoperative risk stratification and personalized management.
Abstract: Accurate prediction of biochemical recurrence (BCR) after radical prostatectomy is critical for guiding adjuvant treatment and surveillance decisions in prostate cancer. However, existing clinicopathological risk models reduce complex morphology to relatively coarse descriptors, leaving substantial prognostic information embedded in routine histopathology underexplored. We present a deep learning-based biomarker that predicts continuous, patient-specific risk of BCR directly from H&E-stained whole-slide prostatectomy specimens. Trained end-to-end on time-to-event outcomes and evaluated across four independent international cohorts, our model demonstrates robust generalization across institutions and patient populations. When integrated with the CAPRA-S clinical risk score, the deep learning risk score consistently improved discrimination for BCR, increasing concordance indices from 0.725-0.772 to 0.749-0.788 across cohorts. To support clinical interpretability, outcome-grounded analyses revealed subtle histomorphological patterns associated with recurrence risk that are not captured by conventional clinicopathological risk scores. This multicohort study demonstrates that deep learning applied to routine prostate histopathology can deliver reproducible and clinically generalizable biomarkers that augment postoperative risk stratification, with potential to support personalized management of prostate cancer in real-world clinical settings.
[272] Joint Segmentation and Grading with Iterative Optimization for Multimodal Glaucoma Diagnosis
Zhiwei Wang,Yuxing Li,Meilu Zhu,Defeng He,Edmund Y. Lam
Main category: cs.CV
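TL;DR: This paper proposes an iterative multimodal optimization model (IMO) that jointly processes fundus and OCT images through mid-level feature fusion and cross-modal feature alignment (CMFA), combined with an iterative decoder based on a denoising diffusion mechanism, to achieve fine-grained optic cup/disc segmentation and glaucoma grading.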
Details
Motivation: Most existing glaucoma diagnosis methods rely on a single modality (fundus or OCT), which makes subtle early-stage lesions hard to capture and provides incomplete pathological information.
Method: Proposes the iterative multimodal optimization model (IMO), which integrates fundus and OCT features with a mid-level fusion strategy, adds a cross-modal feature alignment (CMFA) module to reduce modality discrepancies, and uses an iterative refinement decoder based on a denoising diffusion mechanism for joint segmentation and grading.
Result: Experiments show that IMO fuses multimodal features effectively and yields more comprehensive, clinically meaningful results on optic cup/disc segmentation and glaucoma grading.
Conclusion: IMO offers a robust, interpretable, and clinically practical multimodal approach to precise early glaucoma assessment.
Abstract: Accurate diagnosis of glaucoma is challenging, as early-stage changes are subtle and often lack clear structural or appearance cues. Most existing approaches rely on a single modality, such as fundus or optical coherence tomography (OCT), capturing only partial pathological information and often missing early disease progression. In this paper, we propose an iterative multimodal optimization model (IMO) for joint segmentation and grading. IMO integrates fundus and OCT features through a mid-level fusion strategy, enhanced by a cross-modal feature alignment (CMFA) module to reduce modality discrepancies. An iterative refinement decoder progressively optimizes the multimodal features through a denoising diffusion mechanism, enabling fine-grained segmentation of the optic disc and cup while supporting accurate glaucoma grading. Extensive experiments show that our method effectively integrates multimodal features, providing a comprehensive and clinically significant approach to glaucoma assessment. Source codes are available at https://github.com/warren-wzw/IMO.git.
[273] Walking Further: Semantic-aware Multimodal Gait Recognition Under Long-Range Conditions
Zhiyang Lu,Wen Jiang,Tianren Wu,Zhichao Wang,Changwang Zhang,Siqi Shen,Ming Cheng
Main category: cs.CV
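TL;DR: This paper presents LRGait, the first LiDAR-camera multimodal gait recognition benchmark for long-range, cross-distance scenarios, and designs the end-to-end framework EMGaitNet, whose semantic-guided fusion (SeMi + SGA), Symmetric Cross-Attention Fusion (SCAF), and Spatio-Temporal (ST) modules significantly improve the robustness of long-range gait recognition.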
Details
Motivation: Existing gait recognition methods are confined to short-range, unimodal settings and generalize poorly to real-world long-range and cross-distance scenarios.
Method: Presents the LRGait multimodal benchmark and builds the EMGaitNet framework, comprising CLIP-based Semantic Mining (SeMi), Semantic-Guided Alignment (SGA), Symmetric Cross-Attention Fusion (SCAF), and Spatio-Temporal (ST) modules, to align and fuse RGB images and point clouds across modalities.
Result: Experiments on multiple gait datasets validate the method and show significantly improved long-range and cross-distance gait recognition.
Conclusion: The LRGait benchmark and EMGaitNet framework offer a new approach to long-range multimodal gait recognition; the semantic-guided cross-modal fusion strategy mitigates the modality gap.
Abstract: Gait recognition is an emerging biometric technology that enables non-intrusive and hard-to-spoof human identification. However, most existing methods are confined to short-range, unimodal settings and fail to generalize to long-range and cross-distance scenarios under real-world conditions. To address this gap, we present LRGait, the first LiDAR-Camera multimodal benchmark designed for robust long-range gait recognition across diverse outdoor distances and environments. We further propose EMGaitNet, an end-to-end framework tailored for long-range multimodal gait recognition. To bridge the modality gap between RGB images and point clouds, we introduce a semantic-guided fusion pipeline. A CLIP-based Semantic Mining (SeMi) module first extracts human body-part-aware semantic cues, which are then employed to align 2D and 3D features via a Semantic-Guided Alignment (SGA) module within a unified embedding space. A Symmetric Cross-Attention Fusion (SCAF) module hierarchically integrates visual contours and 3D geometric features, and a Spatio-Temporal (ST) module captures global gait dynamics. Extensive experiments on various gait datasets validate the effectiveness of our method.
[274] Selective Noise Suppression and Discriminative Mutual Interaction for Robust Audio-Visual Segmentation
Kai Peng,Yunzhe Shen,Miao Zhang,Leiye Liu,Yidong Han,Wei Ji,Jingjing Li,Yongri Piao,Huchuan Lu
Main category: cs.CV
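TL;DR: This paper proposes SDAVS, which uses a Selective Noise-Resilient Processor (SNRP) and a Discriminative Audio-Visual Mutual Fusion (DAMF) strategy to improve audio-visual segmentation in dynamic scenes, reaching SOTA performance especially in complex multi-source scenes.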
Details
Motivation: Existing audio-visual segmentation (AVS) methods remain weak at suppressing audio noise and at discriminative interaction between the audio and visual modalities; effective cross-modal cooperation needs further study.
Method: Proposes the SDAVS framework with two core components: (1) the Selective Noise-Resilient Processor (SNRP), which selectively emphasizes relevant auditory cues and suppresses audio noise; (2) the Discriminative Audio-Visual Mutual Fusion (DAMF) strategy, which promotes consistent and discriminative fusion of audio-visual representations.
Result: Achieves SOTA performance on several mainstream AVS benchmark datasets, especially in multi-source and complex scenes.
Conclusion: Combining SNRP and DAMF improves robustness to audio noise and discriminative audio-visual interaction, advancing the practical use of AVS in real, complex scenes.
Abstract: The ability to capture and segment sounding objects in dynamic visual scenes is crucial for the development of Audio-Visual Segmentation (AVS) tasks. While significant progress has been made in this area, the interaction between audio and visual modalities still requires further exploration. In this work, we aim to answer the following questions: How can a model effectively suppress audio noise while enhancing relevant audio information? How can we achieve discriminative interaction between the audio and visual modalities? To this end, we propose SDAVS, equipped with the Selective Noise-Resilient Processor (SNRP) module and the Discriminative Audio-Visual Mutual Fusion (DAMF) strategy. The proposed SNRP mitigates audio noise interference by selectively emphasizing relevant auditory cues, while DAMF ensures more consistent audio-visual representations. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on benchmark AVS datasets, especially in multi-source and complex scenes. The code and model are available at https://github.com/happylife-pk/SDAVS.
[275] DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution
Axi Niu,Kang Zhang,Qingsen Yan,Hao Jin,Jinqiu Sun,Yanning Zhang
Main category: cs.CV
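TL;DR: This paper proposes DualTSR, a single multimodal transformer architecture trained with a dual diffusion objective for scene text image super-resolution (STISR). It needs no external OCR module and models the continuous distribution of images and the discrete distribution of text end to end, improving readability and recognition accuracy.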
Details
Motivation: Existing STISR methods depend on external OCR models for textual priors or use complex multi-component architectures that are hard to train and reproduce.
Method: Proposes the DualTSR framework, which uses a single multimodal transformer backbone and jointly optimizes Conditional Flow Matching (modeling the continuous distribution of HR images) and discrete diffusion (modeling the discrete distribution of text), so that visual and textual information interact at every layer.
Result: On synthetic Chinese datasets and a real-world evaluation protocol, DualTSR achieves strong perceptual quality and text fidelity.
Conclusion: With a unified, simple end-to-end design, DualTSR replaces OCR dependence and complex multi-branch structures while remaining efficient and high-performing on STISR.
Abstract: Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors internally instead of relying on an external OCR module. Compared with prior multi-branch diffusion systems, DualTSR offers a simpler end-to-end formulation with fewer hand-crafted components. Experiments on synthetic Chinese benchmarks and a curated real-world evaluation protocol show that DualTSR achieves strong perceptual quality and text fidelity.
[276] ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
Shishi Xiao,Tongyu Zhou,David Laidlaw,Gromit Yeuk-Yin Chan
Main category: cs.CV
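TL;DR: This paper proposes ChArtist, a domain-specific diffusion model for pictorial chart generation that uses a skeleton representation to provide spatial control and subject-driven control, balancing data faithfulness and visual aesthetics.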
Details
Motivation: Existing methods struggle to incorporate flexible visual elements while keeping chart data accurate, and conventional structural cues based on edge or depth maps are ill-suited as conditioning signals for pictorial chart generation. Method: Proposes ChArtist, built on the Diffusion Transformer (DiT), which introduces a skeleton-based spatial control representation, an adaptive position encoding mechanism, and Spatially Gated Attention, and builds a large-scale dataset of 30,000 triplets for fine-tuning. Result: Enables generation controlled jointly by chart structure and by the visual characteristics of a reference image, achieves good data faithfulness and visual quality, and proposes a unified data accuracy metric. Conclusion: Task-specific structured representations such as skeletons suit pictorial chart generation better than general-purpose structural cues, indicating that generative models can achieve data-driven visual storytelling through customized conditions. Abstract: A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: https://chartist-ai.github.io/.
[277] UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation
Xingyuan Li,Songcheng Du,Yang Zou,HaoYuan Xu,Zhiying Jiang,Jinyuan Liu
Main category: cs.CV
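TL;DR: This paper proposes UniFusion, a unified image fusion framework that uses DINOv3 feature extraction, a reconstruction-alignment loss, and a bilevel optimization strategy to generalize across tasks while effectively preserving source image information.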
Details
Motivation: Most existing image fusion methods are task-specific and struggle to preserve source image information during fusion, mainly because of task-specific architectures and the information degradation caused by deep-layer propagation. Method: 1) Uses DINOv3 for modality-consistent feature extraction to build a shared semantic space; 2) introduces a reconstruction-alignment loss to keep fused outputs consistent with the inputs; 3) adopts a bilevel optimization strategy to decouple and jointly optimize the reconstruction and fusion objectives. Result: Experiments on multiple image fusion tasks show that UniFusion outperforms existing methods in visual quality, generalization ability, and adaptability to real-world scenarios. Conclusion: UniFusion achieves general-purpose, cross-task image fusion that balances information fidelity with fusion quality, offering a new direction for unified fusion frameworks. Abstract: Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion's superior visual quality, generalization ability, and adaptability to real-world scenarios. Code is available at https://github.com/dusongcheng/UniFusion.
[278] Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining
Chongxin Li,Hanzhang Wang,Lian Duan
Main category: cs.CV
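TL;DR: This paper proposes Safety-Potential Pruning, which uses one-shot pruning to activate otherwise dormant safety-related subnetworks in VLMs, substantially improving defense against jailbreak attacks without retraining.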
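One-shot pruning of this kind amounts to ranking weights by some score and zeroing the lowest-ranked fraction in a single pass. The sketch below shows only that generic mechanism. How the paper scores a weight's responsiveness to safety prompts is not described in this listing, so `scores` is a hypothetical input and `prune_by_score` is an illustrative name, not the authors' procedure.

```python
def prune_by_score(weights, scores, sparsity):
    """Zero out the `sparsity` fraction of weights with the lowest scores.

    weights, scores: flat lists of equal length; sparsity in [0, 1].
    Assumption: `scores` stands in for a per-weight responsiveness measure.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from least to most responsive.
    order = sorted(range(len(weights)), key=lambda i: scores[i])
    drop = set(order[:n_prune])
    # Single pass, no retraining: pruned weights become exactly zero.
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

For example, with four weights and `sparsity=0.5`, the two weights with the smallest scores are removed and the other two are left unchanged.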
Details
Motivation: Safety prompts offer an interpretable defense against jailbreak attacks, but their effectiveness is limited by the model's latent structural responsiveness; the authors find that safety prompts engage only a sparse set of parameters that stay quiescent during benign tasks, which motivates the Safety Subnetwork Hypothesis. Method: Proposes Safety-Potential Pruning, a one-shot pruning framework that removes weights that respond weakly to safety prompts, thereby amplifying safety-relevant activations without additional fine-tuning. Result: Across three mainstream VLM architectures and three jailbreak benchmarks, attack success rates drop by up to 22% relative to prompting alone, while benign-task performance remains strong. Conclusion: Pruning is not only a compression technique but also a structural intervention that can surface alignment-relevant subnetworks, offering a new path to robust jailbreak defense. Abstract: Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but as a structural intervention that surfaces alignment-relevant subnets, offering a new path to robust jailbreak resistance.
[279] FIND: A Simple yet Effective Baseline for Diffusion-Generated Image Detection
Jie Li,Yingying Feng,Chi Xie,Jie Hu,Lei Tan,Jiayi Ji
Main category: cs.CV
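TL;DR: This paper proposes Forgery Identification via Noise Disturbance (FIND), which trains a binary classifier by adding Gaussian noise to real images and labeling the noisy versions as synthetic. It exploits a fundamental difference between real and synthetic images in how easily they are fit by Gaussian distributions, yielding efficient, fast, and model-agnostic detection of AI-generated images.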
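The key training-time operation in this summary (noisy copies of real images labeled as synthetic) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the noise level `sigma`, the label encoding (0 = real, 1 = synthetic), the optional inclusion of genuine synthetic images, and the function names are all assumptions.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Return a noisy copy of `image` (flat list of floats in [0, 1]), clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in image]


def build_training_set(real_images, synthetic_images, sigma=0.1, seed=0):
    """Build (image, label) pairs with label 0 = real and 1 = synthetic.

    Noisy copies of real images receive label 1, as the summary describes.
    Assumption: genuine synthetic images, if any are passed, are also labeled 1.
    """
    rng = random.Random(seed)
    samples = [(img, 0) for img in real_images]
    samples += [(img, 1) for img in synthetic_images]
    samples += [(add_gaussian_noise(img, sigma, rng), 1) for img in real_images]
    return samples
```

Any ordinary binary classifier can then be trained on the returned pairs; no diffusion model or reconstruction step is involved in building the data.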
Details
Motivation: Existing reconstruction-error-based detectors are computationally expensive and depend heavily on specific diffusion models, so they generalize poorly; the authors find that real images are harder to fit with Gaussian distributions than synthetic ones, a distributional difference that is more fundamental and more general. Method: Proposes FIND, which needs only a simple binary classifier; during training, Gaussian noise is added to real images and the results are labeled as synthetic, so that the classifier learns the statistical differences between the two distributions; the authors prove theoretically that noise-augmented real images approach diffusion-generated images in ease of Gaussian fitting while remaining visually similar to the originals. Result: Improves detection performance by 11.7% on the GenImage benchmark and runs 126x faster than existing methods, without calling an auxiliary diffusion model or performing reconstruction. Conclusion: FIND is an efficient, practical, model-agnostic, and generalizable approach to detecting AI-generated images; by exploiting the underlying distributional difference, it removes the dependence on reconstruction and on specific generative models. Abstract: The remarkable realism of images generated by diffusion models poses critical detection challenges. Current methods utilize reconstruction error as a discriminative feature, exploiting the observation that real images exhibit higher reconstruction errors when processed through diffusion models. However, these approaches require costly reconstruction computations and depend on specific diffusion models, making their performance highly model-dependent. We identify a fundamental difference: real images are more difficult to fit with Gaussian distributions compared to synthetic ones. In this paper, we propose Forgery Identification via Noise Disturbance (FIND), a novel method that requires only a simple binary classifier. It eliminates reconstruction by directly targeting the core distributional difference between real and synthetic images. Our key operation is to add Gaussian noise to real images during training and label these noisy versions as synthetic. This step allows the classifier to focus on the statistical patterns that distinguish real from synthetic images. We theoretically prove that the noise-augmented real images resemble diffusion-generated images in their ease of Gaussian fitting. Furthermore, simply by adding noise, they still retain visual similarity to the original images, highlighting the most discriminative distribution-related features. The proposed FIND improves performance by 11.7% on the GenImage benchmark while running 126x faster than existing methods. By removing the need for auxiliary diffusion models and reconstruction, it offers a practical, efficient, and generalizable way to detect diffusion-generated content.
[280] Not All Directions Matter: Toward Structured and Task-Aware Low-Rank Adaptation
Xi Xiao,Chenrui Ma,Yunbei Zhang,Chen Liu,Zhuxuanzi Wang,Yanshu Li,Lin Zhao,Guosheng Hu,Tianyang Wang,Hao Xu
Main category: cs.CV
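TL;DR: This paper proposes StructLoRA, which uses an Information Bottleneck-guided filter and a graph-based coordinator to address semantic drift and structural incoherence in LoRA, achieving state-of-the-art performance across a range of models with no added inference cost.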
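The claim of no added inference cost follows from how LoRA-style adapters are usually deployed: the low-rank update is folded into the base weight once training ends. The sketch below shows that merge for vanilla LoRA only. It does not implement StructLoRA's filter or coordinator, and the `scale` factor and helper names are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in m]


def lora_merge(w, b, a, scale=1.0):
    """Fold a low-rank update into the base weight: W' = W + scale * (B @ A).

    w: d_out x d_in, b: d_out x r, a: r x d_in, with r much smaller than d_in.
    """
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

After merging, a forward pass uses a single matrix of the original shape, so any module that exists only during training leaves inference latency unchanged.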
Details
Motivation: LoRA suffers from two problems, semantic drift (treating all update directions as equally important) and structural incoherence (adapting each layer independently), which limit its effectiveness for parameter-efficient fine-tuning. Method: Proposes the StructLoRA framework: (1) an Information Bottleneck-guided filter that prunes task-irrelevant update directions to mitigate semantic drift; (2) a lightweight, training-only graph-based coordinator that enforces inter-layer consistency to resolve structural incoherence. Result: On large language, multimodal, and vision models such as LLaMA, LLaVA, and ViT, it clearly outperforms vanilla LoRA and advanced methods based on dynamic rank allocation and sparsity, with larger gains in low-rank and low-data regimes, and adds no inference overhead. Conclusion: StructLoRA broadens the focus of PEFT from pure parameter compression to joint optimization of information quality and structural integrity. Abstract: Low-Rank Adaptation (LoRA) has become a cornerstone of parameter-efficient fine-tuning (PEFT). Yet, its efficacy is hampered by two fundamental limitations: semantic drift, by treating all update directions with equal importance, and structural incoherence, from adapting layers independently, resulting in suboptimal, uncoordinated updates. To remedy these, we propose StructLoRA, a framework that addresses both limitations through a principled, dual-component design: (1) an Information Bottleneck-guided filter that prunes task-irrelevant directions to mitigate semantic drift, and (2) a lightweight, training-only graph-based coordinator that enforces inter-layer consistency to resolve structural incoherence. Extensive experiments across large language models, vision-language models, and vision models (including LLaMA, LLaVA, and ViT) demonstrate that StructLoRA consistently establishes a new state-of-the-art, outperforming not only vanilla LoRA but also advanced dynamic rank allocation and sparsity-based methods. Notably, the benefits are particularly pronounced in challenging low-rank and low-data regimes. Crucially, since our proposed modules operate only during training, StructLoRA enhances performance with zero additional inference cost, advancing the focus of PEFT -- from mere parameter compression to a more holistic optimization of information quality and structural integrity.
[281] S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction
Renhe Zhang,Yuyang Tan,Jingyu Gong,Zhizhong Zhang,Lizhuang Ma,Yuan Xie,Xin Tan
Main category: cs.CV
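TL;DR: S2GS is a causal, incremental 3D Gaussian semantic field framework for online joint scene reconstruction and understanding on long image streams; it avoids repeated global computation and scales markedly better to long sequences.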
Details
Motivation: Existing offline feed-forward methods repeatedly run global computation over an ever-growing history of observations when processing long image streams, so runtime and GPU memory grow rapidly with sequence length, which limits scalability. Method: Proposes Streaming Semantic Gaussian Splatting (S2GS) with a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to update Gaussians incrementally; the semantic branch combines a 2D foundation vision model with a query-driven decoder to predict segmentation masks and identity embeddings, stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Result: S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, with much slower growth in runtime and GPU memory on long sequences (1,000+ frames), whereas offline methods typically run out of memory at around 80 frames. Conclusion: S2GS delivers online, causal, and scalable joint scene understanding and reconstruction for long-sequence 3D perception. Abstract: Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.
[282] FOCUS: Bridging Fine-Grained Recognition and Open-World Discovery across Domains
Vaibhav Rathore,Divyam Gupta,Moloud Abdar,Subhasis Chaudhuri,Biplab Banerjee
Main category: cs.CV
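TL;DR: This paper proposes FoCUS, the first unified framework for fine-grained domain-generalized generalized category discovery (FG-DG-GCD). Using domain-consistent part discovery and uncertainty-aware feature augmentation, it improves clustering accuracy on newly built fine-grained DG-GCD benchmarks while remaining efficient and generalizable.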
Details
Motivation: Existing generalized category discovery (GCD) assumes that labeled and unlabeled data share the same distribution, so it handles open-world recognition under real-world domain shift poorly; fine-grained recognition, with small inter-class differences and large intra-class variation, makes domain generalization especially hard. Method: Builds the first fine-grained DG-GCD benchmarks (painting and sketch domains for CUB, Cars, and Aircraft) and proposes FoCUS, a single-stage framework with two core modules: Domain-Consistent Parts Discovery (DCPD) and Uncertainty-Aware Feature Augmentation (UFA). Result: On the proposed benchmarks, FoCUS exceeds strong GCD, FG-GCD, and DG-GCD baselines in clustering accuracy by 3.28%, 9.68%, and 2.07%, respectively; it stays competitive on coarse-grained DG-GCD tasks with nearly 3x the computational efficiency of the current state of the art. Conclusion: FoCUS offers a scalable, efficient, and robust approach to fine-grained open-world recognition, moving DG-GCD closer to practical deployment. Abstract: We introduce the first unified framework for Fine-Grained Domain-Generalized Generalized Category Discovery (FG-DG-GCD), bringing open-world recognition closer to real-world deployment under domain shift. Unlike conventional GCD, which assumes labeled and unlabeled data come from the same distribution, DG-GCD learns only from labeled source data and must both recognize known classes and discover novel ones in unseen, unlabeled target domains. This problem is especially challenging in fine-grained settings, where subtle inter-class differences and large intra-class variation make domain generalization significantly harder. To support systematic evaluation, we establish the first FG-DG-GCD benchmarks by creating identity-preserving painting and sketch domains for CUB-200-2011, Stanford Cars, and FGVC-Aircraft using controlled diffusion-adapter stylization. On top of this, we propose FoCUS, a single-stage framework that combines Domain-Consistent Parts Discovery (DCPD) for geometry-stable part reasoning with Uncertainty-Aware Feature Augmentation (UFA) for confidence-calibrated feature regularization through uncertainty-guided perturbations. Extensive experiments show that FoCUS outperforms strong GCD, FG-GCD, and DG-GCD baselines by 3.28%, 9.68%, and 2.07%, respectively, in clustering accuracy on the proposed benchmarks. It also remains competitive on coarse-grained DG-GCD tasks while achieving nearly 3x higher computational efficiency than the current state of the art. Code and datasets will be released upon acceptance.
[283] CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control
Zhiyi Kuang,Chengan He,Egor Zakharov,Yuxuan Xue,Shunsuke Saito,Olivier Maury,Timur Bagautdinov,Youyi Zheng,Giljoo Nam
Main category: cs.CV
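TL;DR: CamLit is the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image, generating spatio-temporally consistent, high-quality video with precise control over camera pose and lighting.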
Details
Motivation: Existing methods usually treat novel view synthesis and relighting as separate tasks, which complicates pipelines and weakens coordination between them; this work aims to build a unified generative model that simplifies video generation and improves consistency and realism across multiple control dimensions. Method: Proposes CamLit, a unified diffusion-based video generation model that takes a single image, a user-specified camera trajectory, and an environment map as input, and in a single generative process outputs temporally coherent relit novel-view frames together with the corresponding albedo frames. Result: In qualitative and quantitative experiments, CamLit reaches fidelity on par with state-of-the-art methods on both NVS and relighting without sacrificing visual quality in either task, confirming that a single model can jointly control camera and lighting. Conclusion: CamLit shows that a single diffusion model can integrate camera pose and lighting control, simplifying the generation pipeline while keeping competitive performance and consistent realism. Abstract: We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from new viewpoints under the specified illumination. Within a single generative process, our model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and corresponding albedo frames, enabling high-quality control of both camera pose and lighting. Qualitative and quantitative experiments demonstrate that CamLit achieves high-fidelity outputs on par with state-of-the-art methods in both novel view synthesis and relighting, without sacrificing visual quality in either task. We show that a single generative model can effectively integrate camera and lighting control, simplifying the video generation pipeline while maintaining competitive performance and consistent realism.
[284] BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification
Haoxuan Xu,Guanglin Niu
Main category: cs.CV
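TL;DR: This paper proposes a new network, Bi-directional Interaction Transformation (BIT), which uses a matching-based strategy to explicitly model the interaction between visible and infrared image pairs, addressing the large modality gap and distribution shift in visible-infrared person re-identification (VI-ReID).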
Details
Motivation: Existing methods overlook the complex, implicit correlations between the visible and infrared modalities, and their performance suffers under distribution shift, where infrared samples are far fewer than visible ones. Method: Proposes the BIT network with an encoder-decoder architecture: the encoder extracts preliminary features, and the decoder performs bi-directional feature integration and query-aware scoring to strengthen cross-modality correspondence. Result: Experiments on several benchmark datasets show that BIT achieves state-of-the-art performance. Conclusion: BIT is the first method to introduce pairwise matching-driven interaction in VI-ReID, and it effectively mitigates the modality gap and distribution shift. Abstract: Visible-Infrared Person Re-Identification (VI-ReID) is a challenging retrieval task due to the substantial modality gap between visible and infrared images. While existing methods attempt to bridge this gap by learning modality-invariant features within a shared embedding space, they often overlook the complex and implicit correlations between modalities. This limitation becomes more severe under distribution shifts, where infrared samples are often far fewer than visible ones. To address these challenges, we propose a novel network termed Bi-directional Interaction Transformation (BIT). Instead of relying on rigid feature alignment, BIT adopts a matching-based strategy that explicitly models the interaction between visible and infrared image pairs. Specifically, BIT employs an encoder-decoder architecture where the encoder extracts preliminary feature representations, and the decoder performs bi-directional feature integration and query-aware scoring to enhance cross-modality correspondence. To the best of our knowledge, BIT is the first to introduce such pairwise matching-driven interaction in VI-ReID. Extensive experiments on several benchmarks demonstrate that BIT achieves state-of-the-art performance, highlighting its effectiveness in the VI-ReID task.
[285] OAHuman: Occlusion-Aware 3D Human Reconstruction from Monocular Images
Yuanwang Yang,Hongliang Liu,Muxin Zhang,Nan Ma,Jingyu Yang,Yu-Kun Lai,Kun Li
Main category: cs.CV
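TL;DR: This paper proposes OAHuman, a framework that explicitly decouples geometry reconstruction from texture synthesis to handle 3D human reconstruction from a single RGB image under occlusion, improving the completeness and realism of the reconstructed model.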
Details
Motivation: In real-world scenes, occlusion (by objects, other people, or image truncation) leaves monocular 3D human reconstruction with missing geometry and unreliable appearance cues, and existing neural implicit methods degrade under occlusion because they model shape and texture in an entangled way. Method: Proposes OAHuman with a decoupling-perception paradigm: geometry reconstruction is perceptually reinforced in occluded regions and isolated from texture interference; texture synthesis is learned only from visible regions to avoid propagating errors. Result: Experiments on occlusion-rich benchmarks show that OAHuman significantly outperforms existing methods in structural completeness, surface detail, and texture realism. Conclusion: By explicitly decoupling geometry and texture modeling, OAHuman improves the robustness and fidelity of monocular 3D human reconstruction under occlusion, offering a new approach to this long-standing challenge. Abstract: Monocular 3D human reconstruction in real-world scenarios remains highly challenging due to frequent occlusions from surrounding objects, people, or image truncation. Such occlusions lead to missing geometry and unreliable appearance cues, severely degrading the completeness and realism of reconstructed human models. Although recent neural implicit methods achieve impressive results on clean inputs, they struggle under occlusion due to entangled modeling of shape and texture. In this paper, we propose OAHuman, an occlusion-aware framework that explicitly decouples geometry reconstruction and texture synthesis for robust 3D human modeling from a single RGB image. The core innovation lies in the decoupling-perception paradigm, which addresses the fundamental issue of geometry-texture cross-contamination in occluded regions. Our framework ensures that geometry reconstruction is perceptually reinforced even in occluded areas, isolating it from texture interference. In parallel, texture synthesis is learned exclusively from visible regions, preventing texture errors from being transferred to the occluded areas. This decoupling approach enables OAHuman to achieve robust and high-fidelity reconstruction under occlusion, which has been a long-standing challenge in the field. Extensive experiments on occlusion-rich benchmarks demonstrate that OAHuman achieves superior performance in terms of structural completeness, surface detail, and texture realism, significantly improving monocular 3D human reconstruction under occlusion conditions.
[286] MistExit: Learning to Exit for Early Mistake Detection in Procedural Videos
Sagnik Majumder,Anish Nethi,Ziad Al-Halah,Kristen Grauman
Main category: cs.CV
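TL;DR: This paper proposes MistExit, a model for early mistake detection in video that combines a mistake detector with a reinforcement learning policy to judge whether a keystep is performed correctly while observing as few video frames as possible.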
Details
Motivation: To detect mistakes early in procedural activity videos, reducing unnecessary video observation and improving real-time responsiveness and efficiency. Method: Proposes MistExit, which comprises a mistake detector that estimates the correctness of the current keystep and anticipates future visual features, and a reinforcement-learning-based adaptive exit policy that decides when to stop processing new frames and output the final prediction. Result: On several real-world procedural video datasets, MistExit achieves higher mistake detection accuracy than existing methods while substantially reducing the fraction of video that must be observed. Conclusion: MistExit makes earlier mistake judgments while keeping accuracy high, offering an efficient option for real-time video analysis. Abstract: We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: https://vision.cs.utexas.edu/projects/mist_exit.
[287] ZOTTA: Test-Time Adaptation with Gradient-Free Zeroth-Order Optimization
Ronghao Zhang,Shuaicheng Niu,Qi Deng,Yanjie Dong,Jian Chen,Runhao Zeng
Main category: cs.CV
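TL;DR: ZOTTA is a backpropagation-free test-time adaptation (TTA) framework that uses zeroth-order optimization (ZOO) to adapt models with forward passes only. It combines Distribution-Robust Layer Selection with Spatial Feature Aggregation Alignment to improve stability and efficiency, and suits non-differentiable or quantized models on edge devices.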
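Zeroth-order optimization replaces backpropagation with finite differences of the loss, so each update needs only forward evaluations. The sketch below shows a generic two-point random-direction estimator on a toy quadratic. It is not ZOTTA's procedure: the unit-norm directions, step sizes, and `toy_loss` are assumptions chosen to keep the example small.

```python
import math
import random


def zo_gradient(loss, theta, mu, rng):
    """Two-point zeroth-order gradient estimate along one random unit direction."""
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    norm = math.sqrt(sum(x * x for x in u)) or 1.0
    u = [x / norm for x in u]
    # Two forward evaluations, no backward pass.
    plus = loss([t + mu * d for t, d in zip(theta, u)])
    minus = loss([t - mu * d for t, d in zip(theta, u)])
    slope = (plus - minus) / (2.0 * mu)  # directional derivative estimate
    return [slope * d for d in u]


def zo_minimize(loss, theta, steps=200, lr=0.2, mu=1e-3, seed=0):
    """Plain descent using zeroth-order gradient estimates."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g = zo_gradient(loss, theta, mu, rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


def toy_loss(theta):
    """Hypothetical stand-in for an unsupervised adaptation loss."""
    target = [1.0, -2.0, 0.5]
    return sum((t - c) ** 2 for t, c in zip(theta, target))
```

Each step moves along a single random direction, so the estimate becomes noisier as the number of parameters grows. Updating fewer layers shrinks that dimension, which is consistent with the role the summary gives to layer selection.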
Details
Motivation: Most existing TTA methods rely on backpropagation, which is computationally costly and incompatible with non-differentiable models such as quantized models, making deployment on edge devices difficult; existing BP-free methods are either architecture-specific or too limited in optimization capacity for high-dimensional models. Method: Proposes the ZOTTA framework: 1) Distribution-Robust Layer Selection, which automatically identifies and freezes layers that already extract distribution-invariant features and updates only domain-sensitive layers to reduce the optimization dimensionality; 2) Spatial Feature Aggregation Alignment, which stabilizes zeroth-order optimization and reduces gradient variance by aligning globally aggregated spatial features between the source and target domains. Result: On ImageNet-C/R/Sketch/A, ZOTTA outperforms or matches BP-based methods; for example, on ImageNet-C it reduces memory usage by 84% and improves accuracy by 3.9% relative to SAR. Conclusion: ZOTTA achieves architecture-agnostic, stable, and efficient BP-free TTA, providing a practical way to improve the robustness of lightweight and quantized models on edge devices. Abstract: Test-time adaptation (TTA) aims to improve model robustness under distribution shifts by adapting to unlabeled test data, but most existing methods rely on backpropagation (BP), which is computationally costly and incompatible with non-differentiable models such as quantized models, limiting practical deployment on numerous edge devices. Recent BP-free approaches alleviate overhead but remain either architecture-specific or limited in optimization capacity to handle high-dimensional models. We propose ZOTTA, a fully BP-free TTA framework that performs efficient adaptation using only forward passes via Zeroth-Order Optimization (ZOO). While ZOO is theoretically appealing, naive application leads to slow convergence under high-dimensional parameter spaces and unstable optimization due to the lack of labels. ZOTTA overcomes these challenges through 1) Distribution-Robust Layer Selection, which automatically identifies and freezes layers that already extract distribution-invariant features, updating only domain-sensitive layers to reduce the optimization dimensionality and accelerate convergence; 2) Spatial Feature Aggregation Alignment, which stabilizes ZOO by aligning globally aggregated spatial features between source and target to reduce gradient variance. Together, these components enable architecture-agnostic and stable BP-free adaptation. Extensive experiments on ImageNet-C/R/Sketch/A show that ZOTTA outperforms or matches BP-based methods, e.g., it reduces memory usage by 84% and improves accuracy by 3.9% over SAR on ImageNet-C.
[288] DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
Ngoc-Son Nguyen,Thanh V. T. Tran,Jeongsoo Choi,Hieu-Nghia Huynh-Nguyen,Truong-Son Hy,Van Nguyen
Main category: cs.CV
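TL;DR: This paper proposes DiFlowDubber, a two-stage video dubbing framework based on discrete flow matching. A FaPro module captures prosody and style cues from facial expressions, and a Synchronizer module provides precise speech-lip synchronization; the method outperforms existing approaches on multiple metrics.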
Details
Motivation: Existing video dubbing methods are limited by scarce data or by two-stage pipelines, and struggle to produce expressive prosody, rich acoustic characteristics, and precise speech-lip synchronization. Method: Proposes DiFlowDubber with two core modules: the FaPro module extracts global prosody and stylistic cues from facial expressions to guide speech modeling; the Synchronizer module bridges the modality gap among text, video, and speech to improve cross-modal alignment and lip synchronization; discrete flow matching serves as the generative backbone. Result: Experiments on two mainstream benchmark datasets show that DiFlowDubber outperforms previous methods on multiple evaluation metrics. Conclusion: Through its two-stage training framework and cross-modal modeling, DiFlowDubber improves the expressiveness, naturalness, and synchronization accuracy of video-driven dubbing. Abstract: Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.
[289] Toward Clinically Ready Foundation Models in Medical Image Analysis: Adaptation Mechanisms and Deployment Trade-offs
Karma Phuntsho,Abdullah,Kyungmi Lee,Ickjai Lee,Euijoon Ahn
Main category: cs.CV
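TL;DR: This paper presents a strategy-centric framework for adapting foundation models (FMs) in medical image analysis. It groups adaptation into five mechanisms and systematically analyzes their trade-offs in robustness, label efficiency, computational cost, auditability, and regulatory feasibility, emphasizing controlled representational change under clinical deployment constraints.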
Details
Motivation: Existing surveys focus on architectural innovation and application breadth while neglecting how FM adaptation mechanisms affect robustness, calibration, and regulatory feasibility in medical imaging, so a structured treatment is needed. Method: Defines five FM adaptation mechanisms: parameter-, representation-, objective-, data-centric, and architectural/sequence-level; compares them systematically along six dimensions (adaptation depth, label efficiency, domain robustness, computational cost, auditability, and regulatory burden), covering clinically relevant failure modes in classification, segmentation, and detection tasks. Result: Shows how different adaptation strategies affect clinically important properties such as calibration stability, multi-institutional deployment, fit with validation protocols, and regulatory compliance, yielding adaptation guidance oriented toward clinical deployment. Conclusion: Reframes FM adaptation as controlled representational change under clinical constraints, providing a practical framework for building robust, auditable, and deployable medical AI systems. Abstract: Foundation models (FMs) have demonstrated strong transferability across medical imaging tasks, yet their clinical utility depends critically on how pretrained representations are adapted to domain-specific data, supervision regimes, and deployment constraints. Prior surveys primarily emphasize architectural advances and application coverage, while the mechanisms of adaptation and their implications for robustness, calibration, and regulatory feasibility remain insufficiently structured. This review introduces a strategy-centric framework for FM adaptation in medical image analysis (MIA). We conceptualize adaptation as a post-pretraining intervention and organize existing approaches into five mechanisms: parameter-, representation-, objective-, data-centric, and architectural/sequence-level adaptation. For each mechanism, we analyze trade-offs in adaptation depth, label efficiency, domain robustness, computational cost, auditability, and regulatory burden. We synthesize evidence across classification, segmentation, and detection tasks, highlighting how adaptation strategies influence clinically relevant failure modes rather than only aggregate benchmark performance. Finally, we examine how adaptation choices interact with validation protocols, calibration stability, multi-institutional deployment, and regulatory oversight. By reframing adaptation as a process of controlled representational change under clinical constraints, this review provides practical guidance for designing FM-based systems that are robust, auditable, and compatible with clinical deployment.
[290] All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation
Xudong Wang,Gan Li,Zhiyu Liu,Yao Wang,Lianqing Liu,Zhi Han
Main category: cs.CV
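TL;DR: This paper proposes Tucker Adaptation (TuKA) to address catastrophic forgetting when vision-and-language navigation (VLN) agents learn continually across multiple scenes, and builds the AlldayWalker agent for all-day multi-scene navigation.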
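A Tucker decomposition writes a high-order tensor as a small core tensor multiplied by one factor matrix per mode. The sketch below reconstructs a third-order tensor from such factors to make that structure concrete. It is generic Tucker algebra, not TuKA itself: which modes correspond to shared subspaces or scenario-specific experts is not specified in this listing, and `tucker_reconstruct` is an illustrative name.

```python
def tucker_reconstruct(core, u1, u2, u3):
    """Rebuild T[i][j][k] = sum over a, b, c of core[a][b][c] * u1[i][a] * u2[j][b] * u3[k][c].

    core: A x B x C nested lists; u1: I x A, u2: J x B, u3: K x C factor matrices.
    """
    n_i, n_j, n_k = len(u1), len(u2), len(u3)
    n_a, n_b, n_c = len(core), len(core[0]), len(core[0][0])
    return [[[sum(core[a][b][c] * u1[i][a] * u2[j][b] * u3[k][c]
                  for a in range(n_a) for b in range(n_b) for c in range(n_c))
              for k in range(n_k)]
             for j in range(n_j)]
            for i in range(n_i)]
```

Under the hypothetical reading that the third mode indexes scenarios, fixing `k` yields one matrix-shaped weight update per scenario, with `u1`, `u2`, and `core` shared across all of them. A plain two-dimensional LoRA factorization has no such extra mode.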
Details
Motivation: Existing VLN agents are prone to catastrophic forgetting when fine-tuned on a specific scenario, which hinders flexible long-term deployment; conventional parameter-efficient adapters such as LoRA cannot model multi-hierarchical navigation knowledge across scenes because of their two-dimensional matrix form. Method: Proposes Tucker Adaptation (TuKA), based on a high-order tensor representation and Tucker decomposition, which decouples navigation knowledge into shared subspaces and scenario-specific experts, and designs a decoupled knowledge incremental learning strategy that consolidates the shared subspaces while constraining the specific experts. Result: The proposed AlldayWalker agent learns continually across multiple navigation scenarios, and experiments show that it consistently outperforms state-of-the-art baselines. Conclusion: TuKA effectively models the multi-hierarchical knowledge structure of multi-scene VLN and, combined with the decoupled incremental learning strategy, substantially mitigates catastrophic forgetting, supporting long-term, flexible, multi-scene deployment of VLN agents. Abstract: Deploying vision-and-language navigation (VLN) agents requires adaptation across diverse scenes and environments, but fine-tuning on a specific scenario often causes catastrophic forgetting in others, which severely limits flexible long-term deployment. We formalize this challenge as the all-day multi-scenes lifelong VLN (AML-VLN) problem. Existing parameter-efficient adapters (e.g., LoRA and its variants) are limited by their two-dimensional matrix form, which fails to capture the multi-hierarchical navigation knowledge spanning multiple scenes and environments. To address this, we propose Tucker Adaptation (TuKA), which represents the multi-hierarchical navigation knowledge as a high-order tensor and leverages Tucker decomposition to decouple the knowledge into shared subspaces and scenario-specific experts. We further introduce a decoupled knowledge incremental learning strategy to consolidate shared subspaces while constraining specific experts for decoupled lifelong learning. Building on TuKA, we also develop a VLN agent named AlldayWalker, which continually learns across multiple navigation scenarios, achieving all-day multi-scenes navigation. Extensive experiments show that AlldayWalker consistently outperforms state-of-the-art baselines.
[291] DC-ViT: Modulating Spatial and Channel Interactions for Multi-Channel Images
Umar Marikkar,Syed Sameed Husain,Muhammad Awais,Sara Atito
Main category: cs.CV
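TL;DR: This paper proposes the Decoupled Vision Transformer (DC-ViT), which uses Decoupled Self-Attention (DSA) to separate spatial and channel update pathways and introduces Decoupled Aggregation (DAG) to learn task-specific channel importance, improving representations under channel heterogeneity in multi-channel imaging.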
Details
Motivation: In multi-channel imaging (MCI), differences in staining protocols, sensors, and acquisition settings produce heterogeneous channel configurations that defeat conventional fixed-channel encoders; existing MC-ViTs accept flexible channel inputs, but unrestricted cross-channel token interaction tends to dilute features and harms critical channel-specific semantics. Method: Proposes DC-ViT: 1) Decoupled Self-Attention (DSA), which splits token updates into spatial updates (modeling intra-channel structure) and channel-wise updates (adaptively integrating cross-channel information); 2) Decoupled Aggregation (DAG), which learns the importance of each channel for the downstream task. Result: On three MCI benchmarks, DC-ViT consistently outperforms existing MC-ViT methods, supporting its balance between preserving channel-specific semantics and enabling useful cross-channel interaction. Conclusion: The decoupled design (DSA plus DAG) is an effective way to handle channel heterogeneity in MCI: it avoids informational collapse while allowing controlled cross-channel interaction, offering a new direction for multi-channel medical image analysis. Abstract: Training and evaluation in multi-channel imaging (MCI) remains challenging due to heterogeneous channel configurations arising from varying staining protocols, sensor types, and acquisition settings. This heterogeneity limits the applicability of fixed-channel encoders commonly used in general computer vision. Recent Multi-Channel Vision Transformers (MC-ViTs) address this by enabling flexible channel inputs, typically by jointly encoding patch tokens from all channels within a unified attention space. However, unrestricted token interactions across channels can lead to feature dilution, reducing the ability to preserve channel-specific semantics that are critical in MCI data. To address this, we propose Decoupled Vision Transformer (DC-ViT), which explicitly regulates information sharing using Decoupled Self-Attention (DSA), which decomposes token updates into two complementary pathways: spatial updates that model intra-channel structure, and channel-wise updates that adaptively integrate cross-channel information. This decoupling mitigates informational collapse while allowing selective inter-channel interaction. To further exploit these enhanced channel-specific representations, we introduce Decoupled Aggregation (DAG), which allows the model to learn task-specific channel importances. Extensive experiments across three MCI benchmarks demonstrate consistent improvements over existing MC-ViT approaches.
[292] Multi-Period Texture Contrast Enhancement for Low-Contrast Wafer Defect Detection and Segmentation
Zihan Zhang
Main category: cs.CV
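TL;DR: This paper proposes TexWDS, a framework that combines multi-scale feature retention with frequency-domain perturbation modeling to resolve the conflict between tiny defects and periodic background textures in wafer defect segmentation, substantially improving detection accuracy and robustness.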
Details
Motivation: Wafer defect segmentation faces an inherent conflict between microscale defects and strongly periodic background textures. Existing deep learning methods perform poorly because downsampling dilutes features and because they lack an explicit mechanism for decoupling low-contrast defects from process noise.
Method: TexWDS is a texture-aware framework with three innovations: (1) a multi-scale receptive field reweighting strategy that suppresses aliasing and preserves high-frequency defect detail; (2) the Multi-scale Unified Semantic Enhancer (MUSE), which fuses local appearance with global context; (3) a plug-and-play Multi-Periodic Texture Contrast Enhancement (MPTCE) module that models texture disruptions in the frequency domain and explicitly separates non-periodic defects from structured backgrounds.
Result: On real-world industrial datasets, TexWDS surpasses the baseline by 8.3% in mAP50-95 and 7.7% in recall, and reduces the false positive rate by about 8.6%, reaching a new state of the art.
Conclusion: TexWDS improves the detection of tiny defects against complex periodic backgrounds and has practical value for high-precision manufacturing inspection.
Abstract: Wafer defect segmentation is pivotal for semiconductor yield optimization yet remains challenged by the intrinsic conflict between microscale anomalies and highly periodic, overwhelming background textures. Existing deep learning paradigms often falter due to feature dilution during downsampling and the lack of explicit mechanisms to disentangle low-contrast defects from process-induced noise. To transcend these limitations, we propose TexWDS, a texture-aware framework that harmonizes multi-scale feature retention with frequency-domain perturbation modeling. Our methodology incorporates three strategic innovations: (1) A Multi-scale Receptive Field Reweighting strategy is introduced to mitigate aliasing effects and preserve high-frequency details of micro-defects often lost in standard pyramidal architectures. (2) The Multi-scale Unified Semantic Enhancer (MUSE) integrates local appearance with global context encoding, effectively enhancing feature discriminability in low-visibility regions. (3) Crucially, we design a plug-and-play Multi-Periodic Texture Contrast Enhancement (MPTCE) module. By modeling texture disruptions in the frequency domain, MPTCE explicitly decouples non-periodic anomalies from structured backgrounds, boosting contrast for camouflaged defects. Extensive experiments on real-world industrial datasets demonstrate that TexWDS achieves a new state-of-the-art, surpassing the baseline by 8.3% in mAP50-95 and 7.7% in recall, while reducing the false positive rate by approximately 8.6%. These results underscore the framework's robustness in handling complex periodic patterns and its suitability for high-precision manufacturing inspection.
[293] RegFormer++: An Efficient Large-Scale 3D LiDAR Point Registration Network with Projection-Aware 2D Transformer
Jiuming Liu,Guangming Wang,Zhe Liu,Chaokang Jiang,Haoang Li,Mengmeng Liu,Tianchen Deng,Marc Pollefeys,Michael Ying Yang,Hesheng Wang
Main category: cs.CV
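TL;DR: This paper proposes RegFormer++, an end-to-end differentiable transformer network for large-scale LiDAR point cloud registration that needs no post-processing. It extracts global features with a hierarchical cylindrical-projection 2D transformer and adds a Bijective Association Transformer (BAT) and a feature-transformed optimal transport module to improve outlier robustness, registration accuracy, and efficiency.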
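The cylindrical projection named in the summary above can be illustrated with a minimal sketch. This is not the authors' implementation: the image size (`H`, `W`), the vertical field of view, and the function names `project_point` and `build_range_image` are assumptions chosen for illustration. Each 3D point is mapped to a pixel by its azimuth and elevation, and the original xyz coordinates are stored at that pixel, so a 2D network can process the scan efficiently while still seeing 3D geometry.

```python
import math

def project_point(x, y, z, H=64, W=1800, fov_up_deg=2.0, fov_down_deg=-24.8):
    """Map one LiDAR point to (row, col) on a cylindrical range image.

    H, W and the vertical field of view are illustrative assumptions,
    not values taken from the paper.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return None                          # the sensor origin has no direction
    yaw = math.atan2(y, x)                   # azimuth in [-pi, pi]
    pitch = math.asin(z / r)                 # elevation angle
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    if not (fov_down <= pitch <= fov_up):
        return None                          # outside the assumed vertical field of view
    u = 0.5 * (1.0 - yaw / math.pi) * W      # horizontal pixel coordinate
    v = (fov_up - pitch) / (fov_up - fov_down) * H   # vertical pixel coordinate
    col = min(W - 1, max(0, int(u)))
    row = min(H - 1, max(0, int(v)))
    return row, col

def build_range_image(points, H=64, W=1800):
    """Fill each pixel with the xyz of the nearest point that lands on it."""
    image = {}  # sparse map: (row, col) -> (range, (x, y, z))
    for x, y, z in points:
        pixel = project_point(x, y, z, H, W)
        if pixel is None:
            continue
        rng = math.sqrt(x * x + y * y + z * z)
        if pixel not in image or rng < image[pixel][0]:
            image[pixel] = (rng, (x, y, z))
    return image
```

Keeping only the nearest point per pixel is one common way to resolve collisions; other choices (averaging, keeping all points) are equally plausible and the entry does not say which is used.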
Details
Motivation: Large-scale LiDAR registration has received little study and is challenged by huge point counts, complex point distributions, and many outliers. Mainstream two-stage methods (local descriptors followed by RANSAC) depend heavily on hand-designed descriptors and post-processing strategies.
Method: The end-to-end RegFormer++ has three components: (1) a hierarchical projection-aware 2D transformer that projects LiDAR points onto a cylindrical surface and fills in the 3D coordinates for efficient global feature extraction; (2) a Bijective Association Transformer (BAT) that combines cross attention with all-to-all point gathering to reduce wrong matches; (3) a feature-transformed optimal transport module that improves training stability and the robustness of pose regression.
Result: The model reaches state-of-the-art accuracy and efficiency on the KITTI, NuScenes, and Argoverse datasets without post-processing such as RANSAC.
Conclusion: RegFormer++ supports the effectiveness of end-to-end, projection-aware transformers with global modeling for large-scale LiDAR registration, and offers a new paradigm for robust real-time registration in outdoor scenes.
Abstract: Although point cloud registration has achieved remarkable advances in object-level and indoor scenes, large-scale LiDAR registration methods have rarely been explored. Challenges mainly arise from the huge point scale, complex point distribution, and numerous outliers within outdoor LiDAR scans. In addition, most existing registration works adopt a two-stage paradigm: they first find correspondences by extracting discriminative local descriptors and then leverage robust estimators (e.g. RANSAC) to filter outliers, which makes them highly dependent on well-designed descriptors and post-processing choices. To address these problems, we propose a novel end-to-end differentiable transformer network, termed RegFormer++, for large-scale point cloud alignment without requiring any further post-processing. Specifically, a hierarchical projection-aware 2D transformer with linear complexity is proposed to project raw LiDAR points onto a cylindrical surface and extract global point features, which can improve resilience to outliers due to long-range dependencies. Because we fill original 3D coordinates into 2D projected positions, our designed transformer can benefit from both high efficiency in 2D processing and accuracy from 3D geometric information. Furthermore, to effectively reduce wrong point matching, a Bijective Association Transformer (BAT) is designed, combining both cross attention and all-to-all point gathering. To improve training stability and robustness, a feature-transformed optimal transport module is also designed for regressing the final pose transformation. Extensive experiments on KITTI, NuScenes, and Argoverse datasets demonstrate that our model achieves state-of-the-art performance in terms of both accuracy and efficiency.
[294] Seeking Physics in Diffusion Noise
Chujun Tang,Lei Zhong,Fangqiang Ding
Main category: cs.CV
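TL;DR: This paper examines whether the intermediate denoising representations of a video diffusion model (DiT) encode signals of physical plausibility, and proposes progressive trajectory selection, an inference-time strategy that uses a lightweight physics verifier to score and prune parallel denoising trajectories at intermediate checkpoints, improving physical consistency while lowering inference cost.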
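The score-and-prune loop summarized above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: `step_fn` stands in for one denoising step, `score_fn` stands in for the physics verifier, and `keep_ratio` and the checkpoint positions are hypothetical parameters.

```python
def progressive_selection(init_states, step_fn, score_fn,
                          total_steps, checkpoints, keep_ratio=0.5):
    """Advance parallel trajectories and prune low scorers at checkpoints.

    Returns the best surviving state and the total number of
    per-trajectory steps executed (a proxy for inference cost).
    """
    alive = list(init_states)
    steps_used = 0
    for t in range(total_steps):
        alive = [step_fn(state, t) for state in alive]
        steps_used += len(alive)
        if t in checkpoints and len(alive) > 1:
            ranked = sorted(alive, key=score_fn, reverse=True)
            keep = max(1, int(len(ranked) * keep_ratio))
            alive = ranked[:keep]            # drop low-scoring candidates early
    return max(alive, key=score_fn), steps_used

# Toy usage: a state is (trajectory_id, value); higher ids drift to higher scores.
toy_step = lambda state, t: (state[0], state[1] + 0.1 * state[0])
toy_score = lambda state: state[1]
best, cost = progressive_selection(
    [(i, 0.0) for i in range(4)], toy_step, toy_score,
    total_steps=10, checkpoints={2, 5},
)
```

In this toy run the four trajectories cost 22 steps in total, whereas running all four to completion (Best-of-K with K = 4) would cost 40, which is the kind of saving the entry describes.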
Details
Motivation: To determine whether the intermediate features of a pretrained video diffusion model (such as a DiT) carry extractable signals related to physical plausibility, rather than depending only on visual quality or generator identity.
Method: Analyze how separable physically plausible and implausible videos are in DiT mid-layer features across noise levels; building on that finding, design progressive trajectory selection, which uses a lightweight physics verifier at a few intermediate denoising steps to score trajectories and prune low-scoring ones.
Result: On PhyGenBench, the method improves physical consistency while substantially reducing the number of denoising steps, matching Best-of-K sampling at lower compute cost.
Conclusion: Frozen DiT intermediate features contain usable physics-related cues; without fine-tuning the model, trajectory filtering at inference time alone can efficiently improve the physical plausibility of generated videos.
Abstract: Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
[295] RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360°Image Quality Assessment
Yujia Wang,Yuyan Li,Jiuming Liu,Fang-Lue Zhang,Xinhu Zheng,Neil A. Dodgson
Main category: cs.CV
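TL;DR: This paper proposes RL-ScanIQA, a reinforcement learning framework for blind 360° image quality assessment that jointly optimizes scanpath generation and quality assessment, and introduces multi-level rewards and distortion-space augmentation to improve stability and cross-dataset generalization.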
Details
Motivation: Existing scanpath methods separate gaze trajectory generation from quality assessment, which rules out end-to-end optimization and task-aligned exploration; meanwhile, the viewport restriction of 360° images makes viewing behavior critical to quality perception.
Method: RL-ScanIQA is a PPO-based reinforcement learning framework that jointly trains a scanpath policy (which receives quality-driven feedback) and a quality assessor; it uses multi-level rewards that include scanpath diversity and an equator-bias prior, and applies distortion-space augmentation with rank-consistent losses for robustness.
Result: On three benchmark datasets, RL-ScanIQA clearly outperforms existing methods, with better in-dataset performance and cross-dataset generalization.
Conclusion: Jointly modeling scanpaths and quality assessment, together with task-driven reinforcement learning and regularization, improves the accuracy and generalization of blind 360° image quality assessment.
Abstract: Blind 360° image quality assessment (IQA) aims to predict perceptual quality for panoramic images without a pristine reference. Unlike conventional planar images, 360° content in immersive environments restricts viewers to a limited viewport at any moment, making viewing behaviors critical to quality perception. Although existing scanpath-based approaches have attempted to model viewing behaviors by approximating the human view-then-rate paradigm, they treat scanpath generation and quality assessment as separate steps, preventing end-to-end optimization and task-aligned exploration. To address this limitation, we propose RL-ScanIQA, a reinforcement-learned framework for blind 360° IQA. RL-ScanIQA optimizes a PPO-trained scanpath policy and a quality assessor, where the policy receives quality-driven feedback to learn task-relevant viewing strategies. To improve training stability and prevent mode collapse, we design multi-level rewards, including scanpath diversity and equator-biased priors. We further boost cross-dataset robustness using distortion-space augmentation together with rank-consistent losses that preserve intra-image and inter-image quality orderings. Extensive experiments on three benchmarks show that RL-ScanIQA achieves superior in-dataset performance and cross-dataset generalization. Codes are available at https://github.com/wangyuji1/RLScanIQA.git.
[296] Show Me When and Where: Towards Referring Video Object Segmentation in the Wild
Mingqi Gao,Jinyu Yang,Jingnan Luo,Xiantong Zhen,Jungong Han,Giovanni Montana,Feng Zheng
Main category: cs.CV
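TL;DR: This paper introduces a new in-the-wild setting for referring video object segmentation (RVOS), builds YoURVOS, a large-scale benchmark of untrimmed videos, and designs the Object-level Multimodal TransFormers (OMFormer) model as a baseline, which substantially improves localization of when the target appears.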
Details
Motivation: Existing RVOS tasks use carefully trimmed videos in which the target appears in every frame, ignoring the real challenge of targets that appear and disappear, so they do not reflect the complexity of practical applications.
Method: Build YoURVOS, a large-scale RVOS benchmark from untrimmed YouTube videos; propose Object-level Multimodal TransFormers (OMFormer), which models object-level multimodal interactions for efficient, global spatial-temporal localization.
Result: Previous VOS methods degrade markedly on YoURVOS (especially as target-absent frames increase), whereas OMFormer remains consistently strong; YoURVOS serves as an important new benchmark for making RVOS practical.
Conclusion: The work moves RVOS from an idealized setting toward real scenes and, through a new dataset and method, stresses the importance of jointly modeling "when" and "where", providing key support for follow-up research.
Abstract: Referring video object segmentation (RVOS) has recently gained great popularity in computer vision due to its widespread applications. The existing RVOS setting contains elaborately trimmed videos, with text-referred objects always appearing in all frames, which fails to fully reflect the realistic challenges of this task. This simplified setting requires RVOS methods to predict only where objects are, with no need to show when the objects appear. In this work, we introduce a new setting towards in-the-wild RVOS. To this end, we collect a new benchmark dataset using Youtube Untrimmed videos for RVOS - YoURVOS, which contains 1,120 in-the-wild videos with 7 times more duration and scenes than existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object-level Multimodal TransFormers (OMFormer) to tackle the challenges, which are characterized by encoding object-level multimodal interactions for efficient and global spatial-temporal localisation. We demonstrate that previous VOS methods struggle on our YoURVOS benchmark, especially with the increase of target-absent frames, while our OMFormer consistently performs well. Our YoURVOS dataset offers an imperative benchmark, which will push forward the advancement of RVOS methods for practical applications.
[297] 4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding
Mohamed Rayan Barhdadi,Samir Abdaljalil,Rasul Khanbayov,Erchin Serpedin,Hasan Kurban
Main category: cs.CV
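TL;DR: This paper proposes 4D Synchronized Fields, a new 4D representation that structurally couples geometric reconstruction, object-level motion modeling, and language semantics in a unified Gaussian field, supporting open-vocabulary temporal queries.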
Details
Motivation: Existing 4D representations decouple geometry, motion, and semantics: reconstruction methods ignore interpretable motion structure; language-grounded methods attach semantics after the fact and are blind to how objects move; motion-aware methods encode dynamics only as opaque per-point residuals with no object-level organization.
Method: 4D Synchronized Fields is a 4D representation built on Gaussian trajectories, each decomposed into shared object motion plus an implicit residual; a kinematic-conditioned ridge map predicts temporal semantic variation; language is synchronized to the kinematics through a per-object conditioned field.
Result: On HyperNeRF, PSNR reaches 28.52 dB (the highest among language-grounded and motion-aware methods, close to reconstruction-only methods); on temporal-state retrieval, tIoU reaches 0.733, well above 4D LangSplat and LangSplat; ablation shows kinematic conditioning contributes +0.45 tIoU.
Conclusion: 4D Synchronized Fields is the only method that outputs both interpretable motion primitives and temporally grounded language fields from a single trained representation, structurally unifying reconstruction, motion, and semantics.
Abstract: Current 4D representations decouple geometry, motion, and semantics: reconstruction methods discard interpretable motion structure; language-grounded methods attach semantics after motion is learned, blind to how objects move; and motion-aware methods encode dynamics as opaque per-point residuals without object-level organization. We propose 4D Synchronized Fields, a 4D Gaussian representation that learns object-factored motion in-loop during reconstruction and synchronizes language to the resulting kinematics through a per-object conditioned field. Each Gaussian trajectory is decomposed into shared object motion plus an implicit residual, and a kinematic-conditioned ridge map predicts temporal semantic variation, yielding a single representation in which reconstruction, motion, and semantics are structurally coupled and enabling open-vocabulary temporal queries that retrieve both objects and moments. On HyperNeRF, 4D Synchronized Fields achieves 28.52 dB mean PSNR, the highest among all language-grounded and motion-aware baselines, within 1.5 dB of reconstruction-only methods. On targeted temporal-state retrieval, the kinematic-conditioned field attains 0.884 mean accuracy, 0.815 mean vIoU, and 0.733 mean tIoU, surpassing 4D LangSplat (0.620, 0.433, and 0.439 respectively) and LangSplat (0.415, 0.304, and 0.262). Ablation confirms that kinematic conditioning is the primary driver, accounting for +0.45 tIoU over a static-embedding-only baseline. 4D Synchronized Fields is the only method that jointly exposes interpretable motion primitives and temporally grounded language fields from a single trained representation. Code will be released.
[298] A Physically-Grounded Attack and Adaptive Defense Framework for Real-World Low-Light Image Enhancement
Tongshun Zhang,Pingping Liu,Yuqing Lei,Zixuan Zhong,Qiuzhan Zhou,Zhiyuan Zha
Main category: cs.CV
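TL;DR: This paper proposes a low-light image enhancement method built on a physics-based attack and display-adaptive defense paradigm: Physics-based Degradation Synthesis (PDS) models real noise, and a dual-layer defense system (a noise predictor, a degradation-aware MoE, and Adaptive Metric Defense) substantially improves the denoising ability and structural fidelity of existing LLIE methods.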
Details
Motivation: Existing low-light image enhancement (LLIE) methods treat enhancement as a black-box mapping and overlook how physical noise is transformed during imaging, which limits performance.
Method: A physics-based attack-defense paradigm. On the attack side, a Physics-based Degradation Synthesis (PDS) pipeline simulates ISP inversion, photon and read noise injection, and re-projection to sRGB. On the defense side, the system comprises a noise predictor, a degradation-aware Mixture of Experts (DA-MoE), and an Adaptive Metric Defense (AMD) mechanism.
Result: Across multiple benchmarks, the approach substantially improves existing LLIE methods as a plug-in, suppressing real-world noise while preserving structural detail; the code is open source.
Conclusion: The physics-driven attack-defense framework offers a more realistic, controllable, and robust way to model LLIE, and supports the key role of explicitly modeling the degradation process.
Abstract: Limited illumination often causes severe physical noise and detail degradation in images. Existing Low-Light Image Enhancement (LLIE) methods frequently treat the enhancement process as a blind black-box mapping, overlooking the physical noise transformation during imaging, leading to suboptimal performance. To address this, we propose a novel LLIE approach, conceptually formulated as a physics-based attack and display-adaptive defense paradigm. Specifically, on the attack side, we establish a Physics-based Degradation Synthesis (PDS) pipeline. Unlike standard data augmentation, PDS explicitly models Image Signal Processor (ISP) inversion to the RAW domain, injects physically plausible photon and read noise, and re-projects the data to the sRGB domain. This generates high-fidelity training pairs with explicitly parameterized degradation vectors, effectively simulating realistic attacks on clean signals. On the defense side, we construct a dual-layer fortified system. A noise predictor estimates degradation parameters from the input sRGB image. These estimates guide a degradation-aware Mixture of Experts (DA-MoE), which dynamically routes features to experts specialized in handling specific noise intensities. Furthermore, we introduce an Adaptive Metric Defense (AMD) mechanism, dynamically calibrating the feature embedding space based on noise severity, ensuring robust representation learning under severe degradation. Extensive experiments demonstrate that our approach offers significant plug-and-play performance enhancement for existing benchmark LLIE methods, effectively suppressing real-world noise while preserving structural fidelity. The source code is available at https://github.com/bywlzts/Attack-defense-llie.
[299] In-Field 3D Wheat Head Instance Segmentation From TLS Point Clouds Using Deep Learning Without Manual Labels
Tomislav Medic,Liangliang Nan
Main category: cs.CV
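TL;DR: This paper proposes a two-stage method, requiring no manual annotation, for 3D instance segmentation of wheat heads in the field: the first stage generates initial 3D instance proposals through multi-view projection and Grounded SAM, and the second stage uses them as noisy pseudo-labels to train a 3D panoptic segmentation network.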
Details
Motivation: 3D instance segmentation of LiDAR point clouds in complex, cluttered scenes (such as in-field plant phenotyping) cannot easily rely on manual annotation, so the work explores a feasible approach that needs no manual 3D labels.
Method: A two-stage unsupervised/weakly supervised pipeline: stage one uses 3D-to-2D multi-view projections, Grounded SAM zero-shot 2D segmentation, and multi-view label fusion to produce initial 3D instance proposals; stage two uses those proposals as noisy pseudo-labels to train a 3D panoptic-style segmentation network.
Result: The method is shown to be feasible on in-field wheat head TLS point clouds and outperforms Wheat3DGS, which is based on multi-view RGB and 3D Gaussian Splatting; each of the two stages can on its own deliver usable annotation-free 3D instance segmentation.
Conclusion: The approach offers a low-cost, transferable paradigm for 3D instance segmentation of TLS point clouds, especially for settings such as agricultural remote sensing where manual annotation is difficult.
Abstract: 3D instance segmentation for laser scanning (LiDAR) point clouds remains a challenge in many remote sensing-related domains. Successful solutions typically rely on supervised deep learning and manual annotations, and consequently focus on objects that can be well delineated through visual inspection and manual labeling of point clouds. However, for tasks with more complex and cluttered scenes, such as in-field plant phenotyping in agriculture, such approaches are often infeasible. In this study, we tackle the task of in-field wheat head instance segmentation directly from terrestrial laser scanning (TLS) point clouds. To address the problem and circumvent the need for manual annotations, we propose a novel two-stage pipeline. To obtain the initial 3D instance proposals, the first stage uses 3D-to-2D multi-view projections, the Grounded SAM pipeline for zero-shot 2D object-centric segmentation, and multi-view label fusion. The second stage uses these initial proposals as noisy pseudo-labels to train a supervised 3D panoptic-style segmentation neural network. Our results demonstrate the feasibility of the proposed approach and show performance improvements relative to Wheat3DGS, a recent alternative solution for in-field wheat head instance segmentation without manual 3D annotations based on multi-view RGB images and 3D Gaussian Splatting, showcasing TLS as a competitive sensing alternative. Moreover, the results show that both stages of the proposed pipeline can deliver usable 3D instance segmentation without manual annotations, indicating promising, low-effort transferability to other comparable TLS-based point cloud segmentation tasks.
[300] Direct Object-Level Reconstruction via Probabilistic Gaussian Splatting
Shuai Guo,Ao Guo,Junchao Zhao,Qi Chen,Yuxiang Qi,Zechuan Li,Dong Chen,Tianjia Shao,Mingliang Xu
Main category: cs.CV
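TL;DR: This paper proposes an efficient single-object 3D reconstruction method based on 2D Gaussian Splatting. By introducing foreground-background probability cues, dynamically pruning low-probability Gaussians, supervising with continuous probability masks, and applying a dual-stage filtering strategy, it cuts the number of Gaussians to roughly 1/10 while maintaining reconstruction quality, and shows strong self-correction against mask errors.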
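Two of the ideas in this summary, pruning Gaussians with low foreground probability and supervising with continuous rather than binary masks, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Gaussian` class, the `fg_prob` attribute, the pruning threshold, and the choice of a cross-entropy loss with soft targets are not taken from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class Gaussian:
    position: tuple   # (x, y, z) centre of the primitive
    fg_prob: float    # learned foreground probability in [0, 1]

def prune_low_probability(gaussians, threshold=0.3):
    """Keep only Gaussians whose foreground probability reaches the threshold."""
    return [g for g in gaussians if g.fg_prob >= threshold]

def soft_mask_loss(rendered, target, eps=1e-7):
    """Mean cross-entropy between rendered probabilities and soft mask targets.

    Targets are continuous values in [0, 1], so uncertain boundary pixels
    (for example 0.5) are not forced towards 0 or 1.
    """
    total = 0.0
    for p, t in zip(rendered, target):
        p = min(1.0 - eps, max(eps, p))      # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(rendered)
```

For example, `prune_low_probability([Gaussian((0, 0, 0), 0.9), Gaussian((1, 0, 0), 0.1)])` keeps only the first primitive, which is the mechanism by which background Gaussians would drop out during training.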
Details
Motivation: Existing Gaussian Splatting-based 3D reconstruction methods mostly target full scenes and bring in a great deal of redundant background, causing heavy computation and storage costs that ill suit efficient single-object reconstruction.
Method: A single-object reconstruction framework based on 2D Gaussian Splatting: probability masks generated by YOLO and SAM supervise the Gaussians' probability attributes; Gaussians with low foreground probability are pruned dynamically; a dual-stage initial filtering strategy suppresses background Gaussians; and rendered probability masks are fed back to refine supervision, improving cross-view boundary consistency.
Result: On the MIP-360, T&T, and NVOS datasets, the method needs only about 1/10 of the Gaussians of standard 3DGS, achieves comparable reconstruction quality, and is robust to and self-correcting against mask errors.
Conclusion: The method achieves single-object 3D reconstruction that combines high fidelity with efficiency, offering a practical paradigm for cultural heritage digitization, industrial manufacturing, and virtual reality.
Abstract: Object-level 3D reconstruction plays an important role across domains such as cultural heritage digitization, industrial manufacturing, and virtual reality. However, existing Gaussian Splatting-based approaches generally rely on full-scene reconstruction, in which substantial redundant background information is introduced, leading to increased computational and storage overhead. To address this limitation, we propose an efficient single-object 3D reconstruction method based on 2D Gaussian Splatting. By directly integrating foreground-background probability cues into Gaussian primitives and dynamically pruning low-probability Gaussians during training, the proposed method focuses on the object of interest and improves memory and computational efficiency. Our pipeline leverages probability masks generated by YOLO and SAM to supervise probabilistic Gaussian attributes, replacing binary masks with continuous probability values to mitigate boundary ambiguity. Additionally, we propose a dual-stage filtering strategy at the start of training to suppress background Gaussians. During training, rendered probability masks are in turn employed to refine supervision and enhance boundary consistency across views. Experiments conducted on the MIP-360, T&T, and NVOS datasets demonstrate that our method exhibits strong self-correction capability in the presence of mask errors and achieves reconstruction quality comparable to standard 3DGS approaches, while requiring only approximately 1/10 of their Gaussian count. These results validate the efficiency and robustness of our method for single-object reconstruction and highlight its potential for applications requiring both high fidelity and computational efficiency.
[301] Early Failure Detection and Intervention in Video Diffusion Models
Kwon Byung-Ki,Sohwi Lim,Nam Hyeon-Woo,Moon Ye-Bin,Tae-Hyun Oh
Main category: cs.CV
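TL;DR: This paper proposes an early failure detection and diagnostic intervention pipeline for latent text-to-video (T2V) diffusion models. A Real-time Inspection (RI) module converts latents into intermediate video previews so that existing text-video alignment scorers can inspect them in RGB space efficiently (in just 39.2 ms), and a hierarchical early-exit intervention is triggered when failure is predicted. This greatly reduces regeneration overhead, with up to 2.64 times less time overhead on VBench, and the pipeline is broadly compatible and extends to larger models and higher resolutions.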
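The efficiency claim can be put in proportion with simple arithmetic. The sketch below uses the 39.2 ms inspection time from the summary and the 4.3 s per denoising step that this entry's abstract reports for CogVideoX-5B (480p, 49 frames, NVIDIA A100); the function name is hypothetical.

```python
def inspection_overhead(inspection_seconds, step_seconds):
    """Return the inspection time as a fraction of one denoising step."""
    return inspection_seconds / step_seconds

# 39.2 ms per inspection versus 4.3 s per denoising step.
overhead = inspection_overhead(39.2e-3, 4.3)
print(f"One inspection costs {overhead:.2%} of a denoising step")
```

Under these reported figures a single inspection adds under 1% of the cost of one denoising step, which is why checking intermediate previews is cheap relative to regenerating a failed video.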
Details
Motivation: T2V diffusion outputs can suffer from poor text-video alignment or low perceptual quality; because sampling is non-deterministic, it is hard to predict during inference whether a generation will fail, which leads to computationally expensive trial-and-error regeneration.
Method: A Real-time Inspection (RI) module quickly converts latents into intermediate video previews so that mature text-video alignment scorers can be applied in RGB space; when a potential failure is detected, a hierarchical, early-exit intervention mechanism is triggered; the whole pipeline is lightweight, plug-and-play, and orthogonal to and compatible with methods such as prompt refinement and sampling guidance.
Result: On CogVideoX-5B and Wan2.1-1.3B, consistency metrics on VBench improve and time overhead falls by up to 2.64 times; the method remains effective on Wan2.1-14B (720p, 81 frames); experiments show that failure signals emerge early in denoising and can be reliably captured in intermediate previews with standard multimodal evaluators.
Conclusion: The early detection and intervention pipeline efficiently identifies and mitigates T2V generation failures, improves inference efficiency and stability, generalizes well, and offers a new paradigm for practical T2V systems.
Abstract: Text-to-video (T2V) diffusion models have rapidly advanced, yet generations still occasionally fail in practice, for example through low text-video alignment or low perceptual quality. Since diffusion sampling is non-deterministic, it is difficult to know during inference whether a generation will succeed or fail, incurring high computational cost due to trial-and-error regeneration. To address this, we propose an early failure detection and diagnostic intervention pipeline for latent T2V diffusion models. For detection, we design a Real-time Inspection (RI) module that converts latents into intermediate video previews, enabling the use of established text-video alignment scorers for inspection in the RGB space. The RI module completes the conversion and inspection process in just 39.2 ms. This is highly efficient considering that CogVideoX-5B requires 4.3 s per denoising step when generating a 480p, 49-frame video on an NVIDIA A100 GPU. Subsequently, we trigger a hierarchical and early-exit intervention pipeline only when failure is predicted. Experiments on CogVideoX-5B and Wan2.1-1.3B demonstrate consistency gains on VBench with up to 2.64 times less time overhead compared to post-hoc regeneration. Our method also generalizes to a higher-capacity setting, remaining effective on Wan2.1-14B with 720p resolution and 81-frame generation. Furthermore, our pipeline is plug-and-play and orthogonal to existing techniques, showing seamless compatibility with prompt refinement and sampling guidance methods. We also provide evidence that failure signals emerge early in the denoising process and are detectable within intermediate video previews using standard vision-language evaluators.
[302] Personalized Cell Segmentation: Benchmark and Framework for Reference-Guided Cell Type Segmentation
Bisheng Wang,Jaime S. Cardoso,Lin Wu
Main category: cs.CV
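TL;DR: This paper introduces the Personalized Cell Segmentation (PerCS) task, which aims to segment all cells of a specific type given a reference cell, and builds a benchmark of 1,372 images and more than 110,000 annotated cells. It also proposes PerCS-DINO, a DINOv2-based framework that fuses image features with reference embeddings through cross-attention and contrastive learning, and experiments demonstrate its effectiveness.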
Details
Motivation: Existing deep learning methods are mostly limited to generic cell segmentation and cannot distinguish specific cell types, so a new task is needed that segments the target cell type precisely from a reference cell.
Method: PerCS-DINO uses DINOv2 as the backbone and combines image features with reference cell embeddings, using cross-attention and contrastive learning to perform personalized cell segmentation.
Result: Experiments on the newly built PerCS benchmark show PerCS-DINO is effective and outperforms baseline models, and they reveal how challenging personalized cell segmentation is.
Conclusion: The PerCS task and its benchmark provide a new paradigm for cell-type-specific segmentation, and PerCS-DINO is a first effective solution that may advance cell-based biomedical research.
Abstract: Accurate cell segmentation is critical for biological and medical imaging studies. Although recent deep learning models have advanced this task, most methods are limited to generic cell segmentation, lacking the ability to differentiate specific cell types. In this work, we introduce the Personalized Cell Segmentation (PerCS) task, which aims to segment all cells of a specific type given a reference cell. To support this task, we establish a benchmark by reorganizing publicly available datasets, yielding 1,372 images and over 110,000 annotated cells. As a pioneering solution, we propose PerCS-DINO, a framework built on the DINOv2 backbone. By integrating image features and reference embeddings via a cross-attention transformer and contrastive learning, PerCS-DINO effectively segments cells matching the reference. Extensive experiments demonstrate the effectiveness of the proposed PerCS-DINO and highlight the challenges of this new task. We expect PerCS to serve as a useful testbed for advancing research in cell-based applications.
[303] How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images
Guimeng Liu,Tianze Yu,Somayeh Ebrahimkhani,Lin Zhi Zheng Shawn,Kok Pin Ng,Ngai-Man Cheung
Main category: cs.CV
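TL;DR: This paper is the first to show systematically that inadequate visual grounding is a key reason medical multimodal large language models (MLLMs) underperform in zero-shot medical image understanding. It builds VGMED, an expert-guided evaluation dataset, and proposes VGRefine, a training-free inference-time method that substantially improves performance on six medical VQA benchmarks.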
Details
Motivation: Existing medical MLLMs underperform on zero-shot medical tasks for reasons that are not well understood; in particular, their visual grounding ability, which is essential to clinical image understanding, has not been systematically evaluated.
Method: Build VGMED, an expert-guided visual grounding evaluation dataset, with new quantitative metrics and qualitative analyses; systematically evaluate eight SOTA medical MLLMs; propose VGRefine, an inference-time attention refinement method.
Result: Current medical MLLMs often fail to focus on clinically relevant image regions, a deficiency that is not evident on natural images; VGRefine reaches SOTA on six Med-VQA benchmarks (110K samples, 8 modalities) with no extra training or external models.
Conclusion: Inadequate visual grounding is one of the key factors limiting medical MLLM performance; VGMED and VGRefine provide a reproducible evaluation protocol and a practical solution for improving medical multimodal understanding.
Abstract: Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks, particularly in zero-shot settings where generalization is critical, remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. In this work, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle visual grounding from semantic grounding, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. We introduce new quantitative metrics and conduct detailed qualitative analyses. Our study across eight state-of-the-art (SOTA) medical MLLMs validates that they often fail to ground their predictions in clinically relevant image regions. We note that this finding is specific to medical image analysis; in contrast, prior work has shown that MLLMs are capable of grounding their predictions in the correct image regions when applied to natural scene images. Motivated by these findings, we propose VGRefine, a simple yet effective inference-time method that refines attention distribution to improve visual grounding in medical settings. Our approach achieves SOTA performance across 6 diverse Med-VQA benchmarks (over 110K VQA samples from 8 imaging modalities) without requiring additional training or external expert models. Overall, our work, for the first time, systematically validates inadequate visual grounding as one of the key contributing factors for medical MLLMs' underperformance. Additional experiments are included in the supplementary material.
[304] AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising
Liyuan Cui,Wentao Hu,Wenyuan Zhang,Zesong Yang,Fan Shi,Xiaoqiang Liu
Main category: cs.CV
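TL;DR: This paper proposes AvatarForcing, a one-step streaming diffusion framework for real-time talking avatar generation. Through local-future window denoising, dual-anchor temporal forcing, and two-stage streaming distillation, it markedly improves the stability and lip-sync quality of long-sequence generation while keeping latency low.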
Details
Motivation: Existing autoregressive methods suffer from exposure bias, which causes errors to accumulate, while full-sequence diffusion models are computationally heavy; neither meets the dual demand of real-time, long-form talking avatar generation for low latency and temporal stability.
Method: AvatarForcing has three parts: (1) one-step streaming diffusion over a fixed local-future window with heterogeneous noise levels; (2) dual-anchor temporal forcing, with a style anchor (re-indexed RoPE plus anchor-audio zero-padding) and a temporal anchor (reusing recently emitted clean blocks); (3) two-stage streaming distillation (offline ODE backfill plus distribution matching).
Result: On standard benchmarks and a new 400-video long-form benchmark, a 1.3B-parameter student model runs real-time inference at 34 ms/frame with strong visual quality and lip synchronization.
Conclusion: AvatarForcing balances real-time speed, stability, and generation quality, offering a new paradigm for long-form streaming talking avatars.
Abstract: Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame using a 1.3B-parameter student model for real-time streaming. Our page is available at: https://cuiliyuan121.github.io/AvatarForcing/
[305] UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding
Yang Zhan,Yuan Yuan
Main category: cs.CV
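TL;DR: This paper presents the UAVBench benchmark and the UAVIT-1M instruction tuning dataset for evaluating and improving the vision-language understanding of multimodal large language models (MLLMs) in low-altitude UAV scenarios; experiments show that fine-tuning on UAVIT-1M substantially improves open-source MLLMs in this domain.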
Details
Motivation: Existing datasets focus on a few specific low-altitude visual tasks and cannot comprehensively assess MLLMs in real-world low-altitude UAV applications, so more comprehensive, high-quality, and realistic evaluation and training resources are needed.
Method: Build UAVBench, a comprehensive benchmark with 43 test units and 966k samples, and UAVIT-1M, an instruction tuning dataset with about 1.24 million instructions covering 789k images and many spatial resolutions; all data are real low-altitude images spanning rich weather conditions and manually verified; run systematic evaluation and fine-tuning experiments on 11 SOTA MLLMs.
Result: The evaluation finds that open-source MLLMs lag far behind closed-source models in conversations about low-altitude visual content, but fine-tuning on UAVIT-1M improves them substantially and narrows the gap.
Conclusion: UAVBench and UAVIT-1M provide key infrastructure for adapting MLLMs to real-world low-altitude UAV applications and help bridge the gap between current model capabilities and practical needs.
Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in natural images and satellite remote sensing images. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets primarily focus on a few specific low-altitude visual tasks, which cannot fully assess the ability of MLLMs in real-world low-altitude UAV applications. Therefore, we introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction tuning dataset, designed to evaluate and improve MLLMs' abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image-level and region-level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks. UAVBench and UAVIT-1M feature pure real-world visual images and rich weather conditions, and involve manual verification to ensure high quality. Our in-depth analysis of 11 state-of-the-art MLLMs using UAVBench reveals that open-source MLLMs cannot generate accurate conversations about low-altitude visual content, lagging behind closed-source MLLMs. Extensive experiments demonstrate that fine-tuning open-source MLLMs on UAVIT-1M significantly narrows this gap. Our contributions pave the way for bridging the gap between current MLLMs and low-altitude UAV real-world application demands. (Project page: https://UAVBench.github.io/)
[306] On the Nature of Attention Sink that Shapes Decoding Strategy in MLLMs
Suho Yoo,Youngjoon Jang,Joon Son Chung
Main category: cs.CV
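TL;DR: This paper studies the role of the attention sink in large language models, finds that it encodes structured global information that influences decoding, and accordingly proposes OutRo, a lightweight inference-time strategy that improves multimodal large language models on video question answering through feature alignment and attention that goes beyond the causal constraint.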
Details
Motivation: To understand the actual role of the attention sink during LLM inference rather than treating it as an incidental artifact, and to examine what it represents and how it shapes model behavior.
Method: Analysis shows that attention sinks encode structured global information. OutRo then (i) aligns non-sink token representations with the sink token in feature space, and (ii) lets the sink token attend beyond the causal mask to interact with non-sink tokens.
Result: OutRo consistently improves several representative multimodal large language models (MLLMs) on seven video QA benchmarks, generalizes well, and adds only 1.1x decoding overhead.
Conclusion: The attention sink is a functional structure that carries key global information, and exploiting it properly can strengthen model reasoning; OutRo offers an efficient inference-time optimization that needs no additional forward passes and no access to attention maps.
Abstract: Large language models and their multimodal extensions have achieved remarkable success across diverse tasks, yet the internal mechanisms that govern their reasoning behaviour remain only partially understood. In particular, the attention sink, a token that attracts disproportionate attention mass, has been observed in transformer architectures, but its role is still unclear. Our goal is to understand what attention sinks represent and how they shape model behaviour during inference, rather than considering them as incidental artifacts. Through our analysis, we find that attention sink representations encode structured global information that influences the decoding process. Building on our findings, we introduce OutRo, a lightweight inference-time strategy that leverages the sink token to enhance contextual representations: (i) non-sink token representations are aligned with the sink representation in the feature space; and (ii) the sink token is allowed to attend beyond the causal constraint, facilitating information exchange with non-sink tokens. This design enhances the reasoning process without requiring additional forward passes or access to attention maps. Based on extensive experiments, OutRo consistently improves performance across representative MLLMs on seven video QA benchmarks and demonstrates strong generalisation, while incurring only a 1.1x decoding overhead.
[307] AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models
Jiarui Zhang,Junqi Hu,Zurong Mai,Yuhang Chen,Shuohong Lou,Henglian Huang,Lingyuan Zhao,Jianxi Huang,Yutong Lu,Haohuan Fu,Juepeng Zheng
Main category: cs.CV
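TL;DR: This paper introduces the AgroOmni dataset and the AgroNVILA model, which uses a Perception-Reasoning Decoupling architecture (with a View-Conditioned Meta-Net and Agriculture-aware Relative Policy Optimization) to address scale confusion and logic drift in agricultural multimodal LLMs, improving multi-altitude agricultural reasoning by 15.18%.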
Details
Motivation: Existing agricultural multimodal LLMs have a "terrestrial-centric" bias that hampers spatial understanding across scales (ground, UAV, satellite), causing scale confusion and logic drift and leaving them unable to support complex agricultural planning. Method: Builds AgroOmni (288K), the first large-scale multi-view agricultural multimodal dataset, and proposes AgroNVILA with a Perception-Reasoning Decoupling (PRD) architecture: on the perception side, a View-Conditioned Meta-Net (VCMN) injects macroscopic spatial context to resolve scale ambiguity; on the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) uses reinforcement learning to align the model with expert agricultural logic. Result: AgroNVILA outperforms SOTA MLLMs on multi-altitude agricultural reasoning by +15.18%, supporting its robustness for holistic agricultural spatial planning. Conclusion: Perception-reasoning decoupling is an important paradigm for improving cross-scale spatial understanding and logical consistency in agricultural multimodal models, and AgroOmni and AgroNVILA provide a scalable data and model foundation for precision-agriculture AI. Abstract: Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multi-modal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, causing scale confusion and logic drift during complex agricultural planning. To address this, we introduce AgroOmni (288K), the first large-scale multi-view training corpus designed to capture diverse spatial topologies and scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM that utilizes a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, we incorporate a View-Conditioned Meta-Net (VCMN), which injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) leverages reinforcement learning to align the model's decision-making with expert agricultural logic, preventing statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, achieving significant improvements (+15.18%) in multi-altitude agricultural reasoning, reflecting its robust capability for holistic agricultural spatial planning.
[308] BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy
Alexandre Pereira,Bruno Fernandes,Pablo Barros
Main category: cs.CV
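TL;DR: This paper proposes a heavily regularized multimodal fusion framework that uses visual, acoustic, and linguistic features (including a purpose-built statistical text modality) together with a Particle Swarm Optimization (PSO) hard-voting ensemble to recognize Ambivalence and Hesitancy (A/H) in naturalistic videos, reaching a Macro F1 of 0.7465.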
Details
Motivation: Recognizing complex behavioral states such as Ambivalence and Hesitancy (A/H) in naturalistic video is difficult because they appear as subtle multimodal conflicts that require deep temporal and contextual understanding, which conventional methods struggle to model. Method: Builds a multimodal feature-extraction pipeline (including a newly designed statistical text modality), evaluates 15 modality combinations with several classifiers (MLP, Random Forest, GBDT), selects well-calibrated models by validation BCE loss, and fuses them with a PSO hard-voting ensemble whose fitness includes a train-validation gap penalty (lambda). Result: Linguistic features are the strongest single-modality predictor; the proposed PSO ensemble (lambda = 0.2) achieves the highest Macro F1, 0.7465, on the test set, clearly outperforming single modalities and conventional fusion methods. Conclusion: Modeling A/H as a multimodal conflict and evaluating it with an intelligently weighted, committee-style ensemble is an effective and robust framework for real-world behavior analysis. Abstract: Recognizing complex behavioral states such as Ambivalence and Hesitancy (A/H) in naturalistic video settings remains a significant challenge in affective computing. Unlike basic facial expressions, A/H manifests as subtle, multimodal conflicts that require deep contextual and temporal understanding. In this paper, we propose a highly regularized, multimodal fusion pipeline to predict A/H at the video level. We extract robust unimodal features from visual, acoustic, and linguistic data, introducing a specialized statistical text modality explicitly designed to capture temporal speech variations and behavioral cues. To identify the most effective representations, we evaluate 15 distinct modality combinations across a committee of machine learning classifiers (MLP, Random Forest, and GBDT), selecting the most well-calibrated models based on validation Binary Cross-Entropy (BCE) loss. Furthermore, to optimally fuse these heterogeneous models without overfitting to the training distribution, we implement a Particle Swarm Optimization (PSO) hard-voting ensemble. The PSO fitness function dynamically incorporates a train-validation gap penalty (lambda) to actively suppress redundant or overfitted classifiers. Our comprehensive evaluation demonstrates that while linguistic features serve as the strongest independent predictor of A/H, our heavily regularized PSO ensemble (lambda = 0.2) effectively harnesses multimodal synergies, achieving a peak Macro F1-score of 0.7465 on the unseen test set.
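As a rough illustration of the fusion step described in this abstract, the sketch below scores one candidate set of ensemble weights by weighted hard voting and subtracts a penalty proportional to the train-validation gap. The abstract does not give the exact fitness formula, so the form `val_f1 - lam * |train_f1 - val_f1|`, all function names, and the use of Macro F1 inside the fitness are assumptions made for illustration only; a PSO loop would evaluate `penalized_fitness` on each particle's weight vector.

```python
def hard_vote(predictions, weights):
    """Weighted hard vote over binary predictions (one row per classifier)."""
    total = sum(weights)
    voted = []
    for i in range(len(predictions[0])):
        score = sum(w * p[i] for w, p in zip(weights, predictions))
        # Predict 1 only when the weighted votes for 1 exceed half the total weight.
        voted.append(1 if score * 2 > total else 0)
    return voted


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2


def penalized_fitness(weights, train_preds, y_train, val_preds, y_val, lam=0.2):
    """Hypothetical fitness: validation Macro F1 minus lam times the train-validation gap."""
    train_f1 = macro_f1(y_train, hard_vote(train_preds, weights))
    val_f1 = macro_f1(y_val, hard_vote(val_preds, weights))
    return val_f1 - lam * abs(train_f1 - val_f1)
```

Under this assumed form, an ensemble that scores perfectly on the training split but poorly on validation is ranked below one with the same validation score and a smaller gap, which is one plausible way a gap penalty can discourage overfitted committee members.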
These results emphasize that treating ambivalence and hesitancy as a multimodal conflict, evaluated through an intelligently weighted committee, provides a robust framework for in-the-wild behavioral analysis.
[309] AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control
Peng Xu,Zhengnan Deng,Jiayan Deng,Zonghua Gu,Shaohua Wan
Main category: cs.CV
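TL;DR: This paper proposes AerialVLA, an end-to-end framework for UAV Vision-Language Navigation (VLN) that uses dual-view perception, fuzzy directional prompting, and a unified continuous control space to remove the dependence on dense oracle guidance and external detectors, achieving SOTA performance and strong generalization on the TravelUAV benchmark.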
Details
Motivation: Existing UAV VLN methods depend on dense oracle guidance or auxiliary object detectors, which creates semantic gaps and limits autonomy; a lighter, more autonomous end-to-end approach is needed. Method: Proposes the AerialVLA framework: (1) a dual-view perception strategy that reduces visual redundancy while keeping the cues needed for navigation and grounding; (2) a fuzzy directional prompting mechanism based on onboard sensors that replaces oracle guidance; (3) a unified control space that combines continuous 3-DoF motion commands with an intrinsic landing signal, so no external detector is required. Result: On the TravelUAV benchmark, AerialVLA reaches SOTA performance in seen environments and nearly three times the success rate of baseline methods in unseen scenarios, supporting its strong generalization. Conclusion: A minimalist, autonomy-centric end-to-end paradigm learns more robust visual-motor representations than complex modular systems. Abstract: Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.
[310] Representation Alignment for Just Image Transformers is not Easier than You Think
Jaeyo Shin,Jiwook Kim,Hyunjung Shim
Main category: cs.CV
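TL;DR: This paper proposes PixelREPA to address the failure of Representation Alignment (REPA) in pixel-space diffusion Transformers such as JiT. It improves the alignment target with a Masked Transformer Adapter, which markedly improves FID and Inception Score on ImageNet and accelerates convergence.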
Details
Motivation: REPA works for latent-space diffusion Transformers but fails for the pixel-space JiT model, worsening FID and collapsing diversity. The root cause is that denoising happens in the high-dimensional image space while the semantic target is heavily compressed, which creates an information asymmetry. Method: Proposes PixelREPA, which replaces the compressed semantic alignment target with a representation better suited to pixel space and constrains alignment with a Masked Transformer Adapter (a shallow transformer adapter plus partial token masking). Result: On ImageNet 256×256, PixelREPA-JiT-B/16 lowers FID from 3.66 to 3.17 and raises IS from 275.1 to 284.6, with more than 2x faster convergence; PixelREPA-H/16 reaches FID = 1.81 and IS = 317.2. Conclusion: PixelREPA mitigates the REPA failure caused by information asymmetry in pixel-space diffusion Transformers, balancing training efficiency and generation quality, and offers a new direction for tokenizer-free diffusion modeling. Abstract: Representation Alignment (REPA) has emerged as a simple way to accelerate Diffusion Transformers training in latent space. At the same time, pixel-space diffusion transformers such as Just Image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thus avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT-B$/16$ and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet $256 \times 256$, while achieving $> 2\times$ faster convergence. Finally, PixelREPA-H$/16$ achieves FID$=1.81$ and IS$=317.2$. Our code is available at https://github.com/kaist-cvml/PixelREPA.
[311] HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task
Xiaoya Lu,Yijin Zhou,Zeren Chen,Ruocheng Wang,Bingrui Sima,Enshen Zhou,Lu Sheng,Dongrui Liu,Jing Shao
Main category: cs.CV
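TL;DR: This paper proposes HomeGuard, an architecture-agnostic safeguard built on Context-Guided Chain-of-Thought (CG-CoT), which performs active perception followed by semantic judgment, substantially improving contextual risk identification by vision-language models in embodied agents and supporting downstream planning tasks.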
Details
Motivation: Existing VLMs face contextual safety risks in complex environments, and conventional rule-based or prompt-engineering methods cannot deliver both scalability and accurate perception. Method: Proposes the Context-Guided Chain-of-Thought (CG-CoT) mechanism, which combines active visual anchoring with semantic judgment, and trains it in two stages using a curated grounding dataset and Reinforcement Fine-Tuning (RFT) with process rewards. Result: HomeGuard improves the risk match rate by more than 30% while reducing oversafety; the generated visual anchors can serve as spatial constraints for collision avoidance and safe trajectory generation. Conclusion: CG-CoT is a general, effective, and interpretable safety-enhancement paradigm that balances risk-identification accuracy with usefulness for downstream tasks. Abstract: Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate. Rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception that sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy utilizing Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model HomeGuard significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safe trajectory generation. Code and data are released at https://github.com/AI45Lab/HomeGuard
[312] The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics
Xiangbo Gao,Mingyang Wu,Siyuan Yang,Jiongze Yu,Pardis Taghavi,Fangzhou Lin,Zhengzhong Tu
Main category: cs.CV
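TL;DR: This paper proposes Visual Chronometer, a method that estimates the physical frame rate (PhyFPS) directly from a video's visual dynamics, to address the "chronometric hallucination" that arises in generated videos when training data mixes different time scales. Newly built benchmarks are used to validate the method and its effect on the naturalness of generated video.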
Details
Motivation: Current generative video models are visually realistic but lack a reliable motion rhythm tied to a real-world time scale, so generated motion speeds are ambiguous, unstable, and uncontrollable. The cause is training on videos with mixed real-world speeds without regard for physical temporal consistency. Method: Proposes the Visual Chronometer predictor, trained via controlled temporal resampling, which regresses the physical frame rate (PhyFPS) directly from the visual motion cues of an input video without relying on unreliable metadata; also builds two new benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen, for systematic evaluation. Result: Experiments show that state-of-the-art video generators suffer from severe PhyFPS misalignment and temporal instability; applying PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos. Conclusion: Recovering a video's intrinsic physical time scale is a key step toward trustworthy physical simulation, and Visual Chronometer provides a generalizable tool and benchmarks for evaluating and correcting the temporal realism of generated videos. Abstract: While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real-world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real-world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen. Our evaluations reveal a harsh reality: state-of-the-art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos. Our project page is https://xiangbogaobarry.github.io/Visual_Chronometer/.
[313] LoCAtion: Long-time Collaborative Attention Framework for High Dynamic Range Video Reconstruction
Qianyu Zhang,Bolun Zheng,Lingyu Zhu,Aiai Huang,Zongpeng Li,Shiqi Wang
Main category: cs.CV
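TL;DR: This paper proposes LoCAtion, a framework that uses long-time collaborative attention to reconstruct HDR video without explicit alignment, markedly reducing ghosting and flicker and improving temporal stability and visual quality.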
Details
Motivation: Existing HDR video reconstruction methods rely on a fragile spatial alignment-and-fusion paradigm, and in complex dynamic scenes registration errors lead to ghosting and flicker. Method: Proposes LoCAtion, which anchors the scene on the medium-exposure frames as a backbone and uses collaborative attention to dynamically aggregate irradiance cues from unaligned exposure frames; a learned global sequence solver uses bidirectional context and long-range temporal modeling to keep the whole video consistent. Result: Achieves SOTA visual quality and temporal stability on multiple benchmarks while combining high accuracy with computational efficiency. Conclusion: Dropping explicit alignment in favor of collaborative feature routing is a more robust path for HDR video reconstruction, and LoCAtion offers an effective and scalable solution in that direction. Abstract: Prevailing High Dynamic Range (HDR) video reconstruction methods are fundamentally trapped in a fragile alignment-and-fusion paradigm. While explicit spatial alignment can successfully recover fine details in controlled environments, it becomes a severe bottleneck in unconstrained dynamic scenes. By forcing rigid alignment across unpredictable motions and varying exposures, these methods inevitably translate registration errors into severe ghosting artifacts and temporal flickering. In this paper, we rethink this conventional prerequisite. Recognizing that explicit alignment is inherently vulnerable to real-world complexities, we propose LoCAtion, a Long-time Collaborative Attention framework that reformulates HDR video generation from a fragile spatial warping task into a robust, alignment-free collaborative feature routing problem. Guided by this new formulation, our architecture explicitly decouples the highly entangled reconstruction task. Rather than struggling to rigidly warp neighboring frames, we anchor the scene on a continuous medium-exposure backbone and utilize collaborative attention to dynamically harvest and inject reliable irradiance cues from unaligned exposures. Furthermore, we introduce a learned global sequence solver. By leveraging bidirectional context and long-range temporal modeling, it propagates corrective signals and structural features across the entire sequence, inherently enforcing whole-video coherence and eliminating jitter. Extensive experiments demonstrate that LoCAtion achieves state-of-the-art visual quality and temporal stability, offering a highly competitive balance between accuracy and computational efficiency.
[314] StAR: Segment Anything Reasoner
Seokju Yun,Dongheon Lee,Noori Bae,Jaesung Jun,Chanseul Cho,Youngmin Ro
Main category: cs.CV
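TL;DR: This paper proposes the Segment Anything Reasoner (StAR) framework, which substantially improves visual reasoning segmentation by refining several design dimensions (parameter tuning, reward functions, learning strategies, answer format) and by introducing parallel test-time scaling to segmentation for the first time. It also builds a new benchmark, ReasonSeg-X, and uses rollout-expanded selective-tuning to activate the base model's latent reasoning ability, achieving large gains with only 5k samples.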
Details
Motivation: Existing reasoning segmentation methods do not sufficiently elicit the visual reasoning ability of the base model, and benchmarks covering broader and deeper reasoning types are lacking. Method: Proposes the StAR framework, which refines four design aspects (parameter-tuning scheme, reward functions, learning strategies, and answer format), introduces parallel test-time scaling, builds the new ReasonSeg-X benchmark, and trains with a rollout-expanded selective-tuning strategy. Result: Substantially outperforms recent baselines on multiple benchmarks, achieves large gains with only 5k training samples, and applies parallel test-time scaling to a segmentation task for the first time. Conclusion: StAR activates reasoning ability that lies dormant in the base model, showing that coordinated multi-dimensional optimization and a high-quality reasoning benchmark are key to better vision-language reasoning segmentation. Abstract: As AI systems are being integrated more rapidly into diverse and complex real-world environments, the ability to perform holistic reasoning over an implicit query and an image to localize a target is becoming increasingly important. However, recent reasoning segmentation methods fail to sufficiently elicit the visual reasoning capabilities of the base model. In this work, we present Segment Anything Reasoner (StAR), a comprehensive framework that refines the design space from multiple perspectives (parameter-tuning scheme, reward functions, learning strategies, and answer format) and achieves substantial improvements over recent baselines. In addition, for the first time, we successfully introduce parallel test-time scaling to the segmentation task, pushing the performance boundary even further. To extend the scope and depth of reasoning covered by existing benchmarks, we also construct ReasonSeg-X, which compactly defines reasoning types and includes samples that require deeper reasoning. Leveraging this dataset, we train StAR with a rollout-expanded selective-tuning approach to activate the base model's latent reasoning capabilities, and establish a rigorous benchmark for systematic, fine-grained evaluation of advanced methods. With only 5k training samples, StAR achieves significant gains over its base counterparts across extensive benchmarks, demonstrating that our method effectively brings dormant reasoning competence to the surface.
[315] PGcGAN: Pathological Gait-Conditioned GAN for Human Gait Synthesis
Mritula Chandrasekaran,Sanket Kachole,Jarek Francik,Dimitrios Makris
Main category: cs.CV
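TL;DR: This paper proposes a Pathological Gait-conditioned Generative Adversarial Network (PGcGAN) that generates pathology-specific gait sequences from 3D pose keypoint trajectories to address scarce and variable clinical data, and validates its usefulness for data augmentation and gait recognition.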
Details
Motivation: Pathological gait analysis is limited by small, low-diversity clinical datasets, which makes it hard to model a range of gait impairments. Method: Proposes the PGcGAN model, which embeds one-hot encoded pathology labels in both the generator and the discriminator; the generator is a conditional autoencoder trained with adversarial and reconstruction losses to preserve the structural and temporal characteristics of gait. Result: Experiments on the Pathological Gait Dataset show that synthetic and real sequences agree closely in PCA/t-SNE distributions, kinematic visualizations, and downstream classification; augmenting with synthetic data improves pathological gait recognition for GRU, LSTM, and CNN models. Conclusion: Pathology-conditioned gait generation can effectively support data augmentation in pathological gait analysis, improving model generalization and recognition performance. Abstract: Pathological gait analysis is constrained by limited and variable clinical datasets, which restrict the modeling of diverse gait impairments. To address this challenge, we propose a Pathological Gait-conditioned Generative Adversarial Network (PGcGAN) that synthesises pathology-specific gait sequences directly from observed 3D pose keypoint trajectory data. The framework incorporates one-hot encoded pathology labels within both the generator and discriminator, enabling controlled synthesis across six gait categories. The generator adopts a conditional autoencoder architecture trained with adversarial and reconstruction objectives to preserve structural and temporal gait characteristics. Experiments on the Pathological Gait Dataset demonstrate strong alignment between real and synthetic sequences through PCA and t-SNE analyses, visual kinematic inspection, and downstream classification tasks. Augmenting real data with synthetic sequences improved pathological gait recognition across GRU, LSTM, and CNN models, indicating that pathology-conditioned gait synthesis can effectively support data augmentation in pathological gait analysis.
[316] G-ZAP: A Generalizable Zero-Shot Framework for Arbitrary-Scale Pansharpening
Zhiqi Yang,Shan Yin,Jingze Liang,Liang-Jian Deng
Main category: cs.CV
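TL;DR: This paper proposes G-ZAP, a generalizable zero-shot framework for arbitrary-scale pansharpening that uses a feature-based implicit neural representation fusion network and a multi-scale semi-supervised training strategy to generalize across resolutions, scenes, and sensors while supporting weight reuse.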
Details
Motivation: Existing deep learning methods depend on large-scale pretraining and generalize poorly; zero-shot methods generalize better to real scenes but require per-image optimization, cannot reuse weights, and are usually limited to a fixed scale. Method: Proposes the G-ZAP framework, which uses a feature-based implicit neural representation (INR) fusion network as its backbone and introduces a multi-scale semi-supervised training strategy. Result: Reaches SOTA visual quality and quantitative metrics under PAN-scale fusion on multiple real-world datasets; supports weight reuse with performance comparable to per-pair retraining. Conclusion: G-ZAP delivers efficient, flexible, and strongly generalizable arbitrary-scale pansharpening with potential for practical deployment. Abstract: Pansharpening aims to fuse a high-resolution panchromatic (PAN) image and a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) image. Recent deep models have achieved strong performance, yet they typically rely on large-scale pretraining and often generalize poorly to unseen real-world image pairs. Prior zero-shot approaches improve real-scene generalization but require per-image optimization, hindering weight reuse, and the above methods are usually limited to a fixed scale. To address these issues, we propose G-ZAP, a generalizable zero-shot framework for arbitrary-scale pansharpening, designed to handle cross-resolution, cross-scene, and cross-sensor generalization. G-ZAP adopts a feature-based implicit neural representation (INR) fusion network as the backbone and introduces a multi-scale, semi-supervised training scheme to enable robust generalization. Extensive experiments on multiple real-world datasets show that G-ZAP achieves state-of-the-art results under PAN-scale fusion in both visual quality and quantitative metrics. Notably, G-ZAP supports weight reuse across image pairs while maintaining competitiveness with per-pair retraining, demonstrating strong potential for efficient real-world deployment.
[317] Histo-MExNet: A Unified Framework for Real-World, Cross-Magnification, and Trustworthy Breast Cancer Histopathology
Enam Ahmed Taufik,Md Ahasanul Arafath,Abhijit Kumar Ghosh,Md. Tanzim Reza,Md Ashad Alam
Main category: cs.CV
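TL;DR: Histo-MExNet is a unified, scale-invariant, and uncertainty-aware framework for breast cancer histopathology image classification that combines a multi-expert architecture, prototype learning, and physics-informed regularization to balance accuracy, robustness, and interpretability.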
Details
Motivation: To address the sensitivity of existing deep learning models to magnification changes and their lack of interpretability, improving the reliability and trustworthiness of histopathology image classification in clinical diagnosis. Method: Proposes the Histo-MExNet framework: a gated multi-expert structure integrating DenseNet, ConvNeXt, and EfficientNet; a prototype learning module for example-driven interpretability; physics-informed regularization to preserve morphology and spatial coherence; and Monte Carlo Dropout to quantify predictive uncertainty. Result: Reaches 96.97% accuracy on the BreaKHis dataset under multi-magnification training and generalizes better to unseen magnifications than single-expert models; uncertainty estimates help identify out-of-distribution samples and reduce overconfident errors. Conclusion: Histo-MExNet strikes a good balance between accuracy, robustness, and interpretability, providing a reliable tool for clinical decision support. Abstract: Accurate and reliable histopathological image classification is essential for breast cancer diagnosis. However, many deep learning models remain sensitive to magnification variability and lack interpretability. To address these challenges, we propose Histo-MExNet, a unified framework designed for scale-invariant and uncertainty-aware classification. The model integrates DenseNet, ConvNeXt, and EfficientNet backbones within a gated multi-expert architecture, incorporates a prototype learning module for example-driven interpretability, and applies physics-informed regularization to enforce morphology preservation and spatial coherence during feature learning. Monte Carlo Dropout is used to quantify predictive uncertainty. On the BreaKHis dataset, Histo-MExNet achieves 96.97% accuracy under multi-magnification training and demonstrates improved generalization to unseen magnification levels compared to single-expert models, while uncertainty estimation helps identify out-of-distribution samples and reduce overconfident errors, supporting a balanced combination of accuracy, robustness, and interpretability for clinical decision support.
[318] Deep EM with Hierarchical Latent Label Modelling for Multi-Site Prostate Lesion Segmentation
Wen Yan,Yipei Wang,Shiqi Huang,Natasha Thorley,Mark Emberton,Vasilis Stavrinides,Yipeng Hu,Dean Barratt
Main category: cs.CV
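TL;DR: This paper proposes a hierarchical expectation-maximisation (HierEM) framework that models site-specific annotation noise to improve the cross-site generalisation of prostate lesion segmentation.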
Details
Motivation: Annotations in multi-site datasets differ from site to site, so segmentation networks overfit to local annotation styles and generalise poorly across sites. Method: Treats each annotation as a noisy observation of a latent "clean" lesion mask and builds a hierarchical EM framework that alternates between (1) inferring the voxel-wise posterior over the latent mask and (2) training a CNN with that posterior as a soft label while estimating each site's sensitivity and specificity under a hierarchical prior. The prior decomposes label quality into a global mean plus site- and case-level deviations, which suppresses site bias. Result: Validated on three cohorts: per-site mean DSC is 29.50%-39.69% in pooled-dataset evaluation and 27.91%-32.67% in leave-one-site-out generalisation, significantly better than comparison methods (p<0.039); the method also yields interpretable per-site label-quality estimates (sensitivity alpha 31.5%-47.3%, specificity beta approximately 0.99). Conclusion: Explicitly modelling site-dependent annotation variability can effectively improve cross-site generalisation. Abstract: Label variability is a major challenge for prostate lesion segmentation. In multi-site datasets, annotations often reflect centre-specific contouring protocols, causing segmentation networks to overfit to local styles and generalise poorly to unseen sites at inference. We treat each observed annotation as a noisy observation of an underlying latent 'clean' lesion mask, and propose a hierarchical expectation-maximisation (HierEM) framework that alternates between: (1) inferring a voxel-wise posterior distribution over the latent mask, and (2) training a CNN using this posterior as a soft target and estimating site-specific sensitivity and specificity under a hierarchical prior. This hierarchical prior decomposes label-quality into a global mean with site- and case-level deviations, reducing site-specific bias by penalising the likelihood term contributed only by site deviations. Experiments on three cohorts demonstrate that the proposed hierarchical EM framework enhances cross-site generalisation compared to state-of-the-art methods. For pooled-dataset evaluation, the per-site mean DSC ranges from 29.50% to 39.69%; for leave-one-site-out generalisation, it ranges from 27.91% to 32.67%, yielding statistically significant improvements over comparison methods (p<0.039). The method also produces interpretable per-site latent label-quality estimates (sensitivity alpha ranges from 31.5% to 47.3%, with specificity beta approximately 0.99), supporting post-hoc analyses of cross-site annotation variability.
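To make step (1) of the alternation more concrete, the sketch below applies Bayes' rule to a single voxel, combining a foreground prior (for example, the CNN's current probability) with one observed annotation and that site's sensitivity and specificity. This is the textbook noisy-label posterior, not necessarily the authors' exact update: the hierarchical prior and the site- and case-level deviations are omitted, and the function and parameter names are hypothetical.

```python
def latent_posterior(prior_fg, observed, sensitivity, specificity):
    """Posterior probability that a voxel is truly lesion, given one noisy label.

    prior_fg:    prior probability of lesion (e.g. current network output)
    observed:    annotated label for this voxel, 1 (lesion) or 0 (background)
    sensitivity: assumed P(annotated 1 | truly lesion) for the annotating site
    specificity: assumed P(annotated 0 | truly background) for the annotating site
    """
    # Likelihood of the observed label under each hypothesis about the latent mask.
    like_fg = sensitivity if observed == 1 else 1.0 - sensitivity
    like_bg = 1.0 - specificity if observed == 1 else specificity
    numerator = prior_fg * like_fg
    denominator = numerator + (1.0 - prior_fg) * like_bg
    # Fall back to the prior if both likelihoods are zero.
    return numerator / denominator if denominator > 0 else prior_fg
```

With a sensitivity in the reported 31.5%-47.3% range and a specificity near 0.99, a voxel annotated as background still keeps a sizeable posterior lesion probability when the prior is uncertain, which illustrates why the resulting soft targets can differ from the raw annotations.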
These results indicate that explicitly modelling site-dependent annotation can improve cross-site generalisation.
[319] GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos
Minghan Li,Tongna Chen,Tianrui Lv,Yishuai Zhang,Suchao An,Guodong Zhou
Main category: cs.CV
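TL;DR: This paper presents GenState-AI, an AI-generated text-to-video retrieval benchmark centered on state changes. It focuses on precisely modeling and distinguishing end-states and uses temporal and semantic hard negatives to diagnose, at a fine-grained level, where models fail in temporal reasoning and semantic understanding.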
Details
Motivation: Existing text-to-video retrieval benchmarks are mostly built on real-world footage whose semantics can often be inferred from a single frame, so temporal reasoning and explicit end-state grounding are under-evaluated; a controllable benchmark focused on state changes is needed. Method: Builds the GenState-AI benchmark by generating short clips with Wan2.2-TI2V-5B that emphasize precise changes in position, quantity, and object relations; designs triplets (main video, temporal hard negative, semantic hard negative); and introduces triplet-based diagnostic analyses using relative-order statistics and breakdowns by transition category. Result: On two MLLM baselines, models frequently confuse the main video with the temporal hard negative and prefer clips that are temporally plausible but end in the wrong state, exposing weak grounding to end-state evidence; they are comparatively robust to semantic substitutions, and the diagnostic analyses clearly separate temporal from semantic failure sources. Conclusion: GenState-AI provides a controllable, interpretable testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, encouraging models to align precisely with end-states. Abstract: Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns: both frequently confuse the main video with the temporal hard negative and over-prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding to decisive end-state evidence, while being comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on huggingface.co.
[320] End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction
Haoyu Zhang,Wei Zhai,Yuhang Yang,Yang Cao,Zheng-Jun Zha
Main category: cs.CV
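TL;DR: This paper proposes THO, an end-to-end spatial-temporal Transformer for 4D human-object interaction reconstruction from monocular video, which uses spatial-temporal HOI tuple priors to improve accuracy and speed and runs in real time at 31.5 FPS.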
Details
Motivation: To address the challenges that depth ambiguity and frequent occlusion pose for monocular 4D human-object interaction reconstruction, and the high latency, lack of real-time capability, and error accumulation caused by the multi-stage or iterative optimization in existing methods. Method: Proposes an end-to-end Spatial-Temporal Transformer (THO) that jointly predicts human motion and coordinated object motion using spatial priors (inferring occluded object features from human cues via contact-region proximity) and temporal priors (cross-frame kinematic correlations that strengthen physical coherence). Result: Runs at 31.5 FPS on a single RTX 4090 GPU, more than 600x faster than optimization-based methods, while also improving reconstruction accuracy and temporal consistency. Conclusion: By introducing HOI tuple priors, THO achieves efficient, accurate, and consistent monocular 4D HOI reconstruction, making real-time applications feasible. Abstract: Monocular 4D human-object interaction (HOI) reconstruction - recovering a moving human and a manipulated object from a single RGB video - remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, leading to high inference latency, failing to meet real-time requirements, and susceptibility to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a forward fashion from the given video and 3D template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical coherence. Extensive experiments demonstrate that THO operates at an inference speed of 31.5 FPS on a single RTX 4090 GPU, achieving a >600x speedup over prior optimization-based methods while simultaneously improving reconstruction accuracy and temporal consistency. The project page is available at: https://nianheng.github.io/THO-project/
[321] Uni-MDTrack: Learning Decoupled Memory and Dynamic States for Parameter-Efficient Visual Tracking in All Modality
Wenrui Cai,Zhenyi Lu,Yuzhe Li,Yongchao Feng,Jinqing Zhang,Qingjie Liu,Yunhong Wang
Main category: cs.CV
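TL;DR: This paper proposes Uni-MDTrack, which uses two modules, Memory-Aware Compression Prompt (MCP) and Dynamic State Fusion (DSF), to efficiently combine long-term memory with continuous dynamic information. With low computational overhead and about 30% trainable parameters, it reaches SOTA performance on 10 cross-modal datasets and shows strong generality and plug-and-play capability.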
Details
Motivation: Existing Transformer-based one-stream trackers face three problems when adding spatio-temporal context: relying on only a few historical frames underuses context and is computationally expensive; methods that retrieve from an external memory bank fuse features inadequately; and modeling with discrete frames ignores the target's continuous dynamics. Method: Proposes the Uni-MDTrack framework with two core modules: (1) Memory-Aware Compression Prompt (MCP), which compresses memory features into prompt tokens that interact deeply with the input throughout the backbone; and (2) Dynamic State Fusion (DSF), which progressively fuses updated dynamic state features from shallow to deep layers to compensate for the limits of discrete memory. The framework supports unified tracking across RGB, RGB-D/T/E, and RGB-Language. Result: Training only MCP, DSF, and the prediction head (about 30% of parameters) achieves SOTA on 10 datasets spanning five modalities; both modules are highly general, work as plug-and-play components that markedly improve various baseline trackers, and outperform existing parameter-efficient training methods. Conclusion: By jointly modeling compressed memory and continuous dynamics, Uni-MDTrack advances cross-modal tracking performance at high efficiency and low parameter cost, offering a new paradigm for lightweight, efficient visual tracking. Abstract: With the advent of Transformer-based one-stream trackers that possess strong capability in inter-frame relation modeling, recent research has increasingly focused on how to introduce spatio-temporal context. However, most existing methods rely on a limited number of historical frames, which not only leads to insufficient utilization of the context, but also inevitably increases the length of input and incurs prohibitive computational overhead. Methods that query an external memory bank, on the other hand, suffer from inadequate fusion between the retrieved spatio-temporal features and the backbone. Moreover, using discrete historical frames as context overlooks the rich dynamics of the target. To address the issues, we propose Uni-MDTrack, which consists of two core components: Memory-Aware Compression Prompt (MCP) module and Dynamic State Fusion (DSF) module. MCP effectively compresses rich memory features into memory-aware prompt tokens, which deeply interact with the input throughout the entire backbone, significantly enhancing the performance while maintaining a stable computational load. DSF complements the discrete memory by capturing the continuous dynamic, progressively introducing the updated dynamic state features from shallow to deep layers, while also preserving high efficiency. Uni-MDTrack also supports unified tracking across RGB, RGB-D/T/E, and RGB-Language modalities.
Experiments show that in Uni-MDTrack, training only the MCP, DSF, and prediction head, which keeps the proportion of trainable parameters around 30%, yields substantial performance gains and achieves state-of-the-art results on 10 datasets spanning five modalities. Furthermore, both MCP and DSF exhibit excellent generality, functioning as plug-and-play components that can boost the performance of various baseline trackers, while significantly outperforming existing parameter-efficient training approaches.
[322] LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos
Rongyi Yu,Chenyuan Duan,Wentao Zhang
Main category: cs.CV
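TL;DR: This paper introduces the LongVidSearch benchmark for evaluating multi-hop evidence retrieval planning in long videos, emphasizing retrieval necessity, a standardized interface, and analysis of the accuracy-efficiency trade-off.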
Details
Motivation: Existing long-video QA benchmarks lack strict multi-hop retrieval requirements and a standardized evidence-access interface, making it hard to distinguish retrieval-planning failures from answer-generation failures. Method: Builds the LongVidSearch benchmark: 3,000 questions over 447 long videos (average 26 minutes), covering four reasoning categories with 2- to 4-hop evidence requirements; designs a unified tool interface that fixes the retrieval backend and isolates retrieval-planning ability; introduces a tool-call cost metric and evaluates VideoAgent-style QA systems with three-judge majority voting. Result: GPT-5 achieves the highest accuracy on LongVidSearch (42.43%) but still falls below 50%; with gold evidence, performance is near-perfect, confirming that multi-hop retrieval planning is the main bottleneck. Conclusion: LongVidSearch exposes the limits of current video agents in multi-hop retrieval planning and provides a controlled, reproducible evaluation framework for future research. Abstract: Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories: State Mutation, Causal Inference, Global Summary, and Visual Tracking, with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent's ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs using three-judge majority voting.
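As a rough illustration of the evaluation protocol described above, the following Python sketch shows a leave-one-out check for retrieval necessity, three-judge majority voting, and accuracy plus mean tool-call cost computed from per-question records. The function names, the `is_solvable` oracle, and the record fields (`judge_verdicts`, `tool_calls`) are hypothetical placeholders chosen for this sketch; they are not taken from the LongVidSearch release.

```python
def satisfies_retrieval_necessity(clips, is_solvable):
    """Hop-k check: solvable with all k clips, unsolvable if any one is removed.

    `is_solvable` is an assumed oracle (for example, a human or model judge).
    """
    if not is_solvable(clips):
        return False
    return all(
        not is_solvable(clips[:i] + clips[i + 1:])
        for i in range(len(clips))
    )


def majority_vote(verdicts):
    """True when more than half of the judges mark the answer correct."""
    return sum(verdicts) * 2 > len(verdicts)


def evaluate(records):
    """Return (accuracy in percent, mean tool calls per question).

    Each record is assumed to hold three boolean judge verdicts and the
    number of tool calls the agent spent on that question.
    """
    n = len(records)
    correct = sum(majority_vote(r["judge_verdicts"]) for r in records)
    mean_calls = sum(r["tool_calls"] for r in records) / n
    return 100.0 * correct / n, mean_calls
```

Reporting accuracy next to the mean tool-call count is one simple way to expose the accuracy-efficiency trade-off: two agents with equal accuracy can then be ranked by how many retrieval calls they needed.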
GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet remains below 50%, highlighting the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.
[323] Wi-Spike: A Low-power WiFi Human Multi-action Recognition Model with Spiking Neural Networks
Nengbo Zhang,Yao Ying,Lu Wang,Kaishun Wu,Jieming Ma,Fei Luo
Main category: cs.CV
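TL;DR: This paper proposes Wi-Spike, a spiking neural network (SNN) based framework for WiFi human action recognition that balances high accuracy with low power consumption, reaching a new SOTA in multi-action recognition while substantially reducing energy use.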
Details
Motivation: Existing WiFi sensing models mostly focus on recognition accuracy and neglect power consumption and energy efficiency, which makes them a poor fit for real-time sensing at the edge. Method: Proposes the Wi-Spike framework: event-driven spiking convolutional layers extract spatio-temporal features, a new temporal attention mechanism strengthens discriminative representations, and spiking fully connected layers with a voting layer perform encoding and classification. Result: On three datasets (NTU-Fi-HAR, NTU-Fi-HumanID, and UT-HAR), recognition accuracy reaches 95.83% in human activity recognition, multi-action recognition outperforms existing methods, and energy consumption drops by at least 50%. Conclusion: Wi-Spike sharply reduces energy consumption while maintaining high accuracy, offering a new paradigm for real-time, energy-efficient edge WiFi sensing and setting a new SOTA for multi-action recognition. Abstract: WiFi-based human action recognition (HAR) has gained significant attention due to its non-intrusive and privacy-preserving nature. However, most existing WiFi sensing models predominantly focus on improving recognition accuracy, while issues of power consumption and energy efficiency remain insufficiently discussed. In this work, we present Wi-Spike, a bio-inspired spiking neural network (SNN) framework for efficient and accurate action recognition using WiFi channel state information (CSI) signals. Specifically, leveraging the event-driven and low-power characteristics of SNNs, Wi-Spike introduces spiking convolutional layers for spatio-temporal feature extraction and a novel temporal attention mechanism to enhance discriminative representation. The extracted features are subsequently encoded and classified through spiking fully connected layers and a voting layer. Comprehensive experiments on three benchmark datasets (NTU-Fi-HAR, NTU-Fi-HumanID, and UT-HAR) demonstrate that Wi-Spike achieves competitive accuracy in single-action recognition and superior performance in multi-action recognition tasks. As for energy consumption, Wi-Spike reduces the energy cost by at least half compared with other methods, while still achieving 95.83% recognition accuracy in human activity recognition. More importantly, Wi-Spike establishes a new state-of-the-art in WiFi-based multi-action HAR, offering a promising solution for real-time, energy-efficient edge sensing applications.
[324] V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
Lorenzo Mur-Labadia,Matthew Muckley,Amir Bar,Mido Assran,Koustuv Sinha,Mike Rabbat,Yann LeCun,Nicolas Ballas,Adrien Bardes
Main category: cs.CV
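TL;DR: V-JEPA 2.1 is a self-supervised vision model that achieves SOTA performance on image and video understanding tasks through a dense predictive loss, deep self-supervision, multi-modal tokenizers, and effective scaling.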
Details
Motivation: Improve the density, semantic coherence, and spatio-temporal consistency of visual representations while strengthening global scene understanding. Method: Combines four key techniques: a masking-based dense predictive loss in which both visible and masked tokens contribute to training; deep self-supervision across intermediate encoder layers; multi-modal tokenizers that support unified image and video training; and effective scaling of model capacity and data. Result: Reaches SOTA on several benchmarks, including Ego4D and EPIC-KITCHENS, with strong results on TartanDrive, NYUv2, and Something-Something-V2; real-robot grasping success improves by 20 points, and downstream tasks such as depth estimation and navigation perform well. Conclusion: V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling, and its representations are shown to be spatially structured, semantically coherent, and temporally consistent. Abstract: We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.
[325] Refining 3D Medical Segmentation with Verbal Instruction
Kangxian Xie,Jiancheng Yang,Nandor Pinter,Chao Wu,Behzad Bozorgtabar,Mingchen Gao
Main category: cs.CV
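TL;DR: This paper proposes the CoWTalk benchmark and a method that iteratively refines 3D anatomical segmentation results from language instructions, addressing inaccurate shape predictions in medical image segmentation.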
Details
Motivation: Automated models often predict poor shapes in 3D anatomical segmentation because of limited training data, low annotation quality, and distribution shift, while clinicians' verbal correction instructions lack paired data to learn from. Method: Builds the CoWTalk benchmark, which contains controllable synthesized anatomical errors and their repair instructions; proposes a refinement model that represents 3D shapes as vector sets and updates them iteratively by interacting with textual instructions. Result: Experiments show that the method significantly improves over the corrupted inputs and several baseline models. Conclusion: Demonstrates the feasibility of language-driven, clinician-in-the-loop iterative refinement for 3D medical shape modeling. Abstract: Accurate 3D anatomical segmentation is essential for clinical diagnosis and surgical planning. However, automated models frequently generate suboptimal shape predictions due to factors such as limited and imbalanced training data, inadequate labeling quality, and distribution shifts between training and deployment settings. A natural solution is to iteratively refine the predicted shape based on the radiologists' verbal instructions. However, this is hindered by the scarcity of paired data that explicitly links erroneous shapes to corresponding corrective instructions. As an initial step toward addressing this limitation, we introduce CoWTalk, a benchmark comprising 3D arterial anatomies with controllable synthesized anatomical errors and their corresponding repairing instructions. Building on this benchmark, we further propose an iterative refinement model that represents 3D shapes as vector sets and interacts with textual instructions to progressively update the target shape. Experimental results demonstrate that our method achieves significant improvements over corrupted inputs and competitive baselines, highlighting the feasibility of language-driven clinician-in-the-loop refinement for 3D medical shape modeling.
[326] WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning
Stefan Englmeier,Katharina Winter,Fabian B. Flohr
Main category: cs.CV
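TL;DR: This paper proposes WorldVLM, a hybrid architecture that combines a vision-language model (VLM) with a world model (WM): the VLM handles high-level semantic understanding and generates behavior commands that guide the WM's prediction of the dynamic environment, improving the context awareness, interpretability, and generalization of autonomous driving systems.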
Details
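Motivation: Existing VLMs have limited spatial understanding and are hard to use directly for end-to-end driving; WMs are good at dynamics modeling but lack high-level semantic reasoning; combining the strengths of both is needed to address the joint challenge of scene understanding and motion prediction in autonomous driving. Method: Proposes the WorldVLM hybrid architecture: the VLM parses vision-language inputs and generates high-level behavior commands (for example, 'slow down and yield'), and the WM, acting as the driving core, receives the command and predicts future scene evolution; the work explores several conditioning strategies so that the VLM can guide the WM effectively. Result: Shows that the VLM can generate semantically reasonable behavior commands that drive the WM to make context-aware dynamic predictions; analyzes how different conditioning strategies affect performance and identifies key design challenges such as cross-modal alignment and command executability. Conclusion: WorldVLM shows that decoupling semantic reasoning (VLM) from physical dynamics modeling (WM) and letting them cooperate is feasible and effective, offering a new direction for interpretable, well-generalizing next-generation autonomous driving models. Abstract: Autonomous driving systems depend on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision-Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WMs) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM, a hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.
[327] Mapping Dark-Matter Clusters via Physics-Guided Diffusion Models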
Diego Royo,Brandon Zhao,Adolfo Muñoz,Diego Gutierrez,Katherine L. Bouman
Main category: cs.CV
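TL;DR: This paper proposes a fully automated method for reconstructing the surface mass density of galaxy clusters. Built on the newly constructed large-scale simulated dataset DarkClusters-15k, it combines a diffusion prior with weak- and strong-lensing observations to deliver efficient, accurate, physics-driven mass reconstructions with well-calibrated uncertainties.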
Details
Motivation: Existing galaxy cluster mass reconstruction methods lack scalability and large-scale benchmarks, so they cannot cope with the hundreds of thousands of clusters that future wide-field surveys will produce. Method: Builds the DarkClusters-15k dataset of 15,000 simulated clusters; trains a plug-and-play diffusion prior on it that learns the statistical relationship between mass and light; and draws posterior samples constrained by weak- and strong-lensing observations to reconstruct mass. Result: The method needs no manual tuning, cuts runtime to minutes, is more accurate than traditional methods, and matches expert-tuned results on the MACS 1206 cluster. Conclusion: The method offers a scalable, accurate, physically grounded approach with controlled uncertainty for cluster mass reconstruction in future wide-field cosmological surveys; the method and dataset are released openly. Abstract: Galaxy clusters are powerful probes of astrophysics and cosmology through gravitational lensing: the clusters' mass, dominated by 85% dark matter, distorts background light. Yet, mass reconstruction lacks the scalability and large-scale benchmarks to process the hundreds of thousands of clusters expected from forthcoming wide-field surveys. We introduce a fully automated method to reconstruct cluster surface mass density from photometry and gravitational lensing observables. Central to our approach is DarkClusters-15k, our new dataset of 15,000 simulated clusters with paired mass and photometry maps, the largest benchmark to date, spanning multiple redshifts and simulation frameworks. We train a plug-and-play diffusion prior on DarkClusters-15k that learns the statistical relationship between mass and light, and draw posterior samples constrained by weak- and strong-lensing observables; this yields principled reconstructions driven by explicit physics, alongside well-calibrated uncertainties. Our approach requires no expert tuning, runs in minutes rather than hours, achieves higher accuracy, and matches expertly-tuned reconstructions of the MACS 1206 cluster. We release our method and DarkClusters-15k to support development and benchmarking for upcoming wide-field cosmological surveys.
[328] Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLMs
Yiren Zheng,Shibo Li,Jiaming Liu,Haofan Wang,Yiren Song
Main category: cs.CV
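TL;DR: This paper proposes the SVE-ASCII framework, which uses ASCII art as a compact, text-native visual format for LLMs, builds the ASCIIArt-7K dataset and the ASCIIArt-Bench benchmark, and handles both text-to-ASCII and ASCII-to-text tasks through joint instruction tuning; experiments provide empirical evidence that generative training significantly improves visual understanding, confirming a mutually reinforcing cycle between generation and perception in symbolic visual processing.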
Details
Motivation: Existing multimodal methods treat visual generation as an external process and overlook the native visual representation capabilities latent in large language models (LLMs). Method: Proposes the unified SVE-ASCII framework; builds the high-quality ASCIIArt-7K dataset with a 'Seed-and-Evolve' pipeline; and uses a joint instruction-tuning strategy that optimizes Text-to-ASCII generation and ASCII-to-Text understanding together. Result: Experiments show that generative training significantly improves ASCII visual understanding, giving empirical evidence in the visual domain of a mutually reinforcing cycle between generation and perception; the dataset, benchmark (ASCIIArt-Bench), and model are released. Conclusion: ASCII art can serve as a native, efficient symbolic visual medium for LLMs; generation and understanding tasks reinforce each other, providing a new paradigm and a solid baseline for visual intelligence in the pure text space. Abstract: Current multimodal approaches predominantly treat visual generation as an external process, relying on pixel rendering or code execution, thereby overlooking the native visual representation capabilities latent within Large Language Models (LLMs). In this work, we unlock this potential through ASCII art, a compact, efficient, and text-native visual format. We introduce SVE-ASCII, a unified framework designed to elicit and benchmark Symbolic Visual Expression directly within the pure text space. To address the scarcity of systematic resources, we construct ASCIIArt-7K, a high-quality dataset synthesized via a novel "Seed-and-Evolve" pipeline that augments human-curated anchors through in-context stylistic editing. We further implement a unified instruction-tuning strategy that jointly optimizes for both Generation (Text-to-ASCII) and Understanding (ASCII-to-Text). Crucially, our experiments reveal a critical phenomenon regarding task duality: while it is established that perception aids generation, we provide compelling evidence that generative training significantly enhances visual comprehension. This confirms a mutually reinforcing cycle in symbolic visual processing, a relationship previously hypothesized but rarely empirically demonstrated in the visual domain. We release our dataset, the ASCIIArt-Bench benchmark, and the SVE-ASCII model, establishing a robust baseline for native text-based visual intelligence.
[329] Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
Zhuoxuan Peng,Boan Zhu,Xingjian Zhang,Wenying Li,S. -H. Gary Chan
Main category: cs.CV
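TL;DR: This paper proposes EMDUL, which uses unlabeled mmWave data and labeled LiDAR data to expand mmWave human pose estimation datasets, significantly improving model performance and generalization.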
Details
Motivation: Existing mmWave human pose estimation datasets are scarce and lack diversity, which limits model generalization, whereas unlabeled mmWave data and diverse LiDAR data are easy to obtain. Method: EMDUL has two parts: 1) a pseudo-label estimator is trained to generate pseudo-labels for unlabeled mmWave point clouds; 2) a cross-modal point cloud conversion module translates annotated LiDAR point clouds into mmWave form. Result: The expanded mmWave dataset reduces the pose estimation error of HPE models by 15.1% in the in-domain setting and 18.9% in the out-of-domain setting. Conclusion: EMDUL eases the scarcity and limited diversity of mmWave HPE data, and its cross-modal transfer and pseudo-labeling strategy significantly improves model performance and generalization. Abstract: Current mmWave datasets for human pose estimation (HPE) are scarce and lack diversity in both point cloud (PC) attributes and human poses, severely hampering the generalization ability of their trained models. On the other hand, unlabeled mmWave HPE data and diverse LiDAR HPE datasets are readily available. We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and a LiDAR dataset. EMDUL trains a pseudo-label estimator to annotate the unlabeled mmWave data and is able to convert, or translate, a given annotated LiDAR PC to its mmWave counterpart. Expanded with both LiDAR-converted and pseudo-labeled mmWave PCs, our mmWave dataset significantly boosts the performance and generalization ability of all our HPE models, with substantial 15.1% and 18.9% error reductions for in-domain and out-of-domain settings, respectively.
[330] VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning
Chaoyang Wang,Wenrui Bao,Sicheng Gao,Bingxin Xu,Yu Tian,Yogesh S. Rawat,Yunhao Ge,Yuzhang Shang
Main category: cs.CV
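TL;DR: This paper proposes VLA-Thinker, a vision-language-action reasoning framework that invokes image perception dynamically. Two-stage training (SFT cold start followed by GRPO reinforcement learning) improves performance on long-horizon robotic manipulation tasks, with significant gains on LIBERO and RoboTwin 2.0.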
Details
Motivation: Existing VLA models treat visual input as static context and rely on text-based chain-of-thought reasoning, so they struggle to actively revisit the environment and resolve ambiguities in long-horizon tasks. Method: Proposes the VLA-Thinker framework, which models perception as a dynamically invocable reasoning action, and trains it in two stages: (1) supervised fine-tuning (SFT) on curated visual chain-of-thought data to activate structured reasoning and tool use; (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task success. Result: Significantly improves manipulation performance on the LIBERO and RoboTwin 2.0 benchmarks, reaching a 97.5% success rate on LIBERO and strong results across long-horizon robotic tasks. Conclusion: Treating dynamic image perception as an invocable reasoning action is an effective paradigm for improving VLA models on complex, long-horizon embodied tasks. Abstract: Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and code: https://cywang735.github.io/VLA-Thinker/
[331] LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion
Zengqun Zhao,Ziquan Liu,Yu Cao,Shaogang Gong,Zhensong Zhang,Jifei Song,Jiankang Deng,Ioannis Patras
Main category: cs.CV
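TL;DR: This paper proposes LatSearch, which introduces a latent reward model that provides intermediate, informative, and efficient feedback during the denoising process of video diffusion models, enabling efficient inference-time scaling. Combined with Reward-Guided Resampling and Pruning (RGRP), it significantly improves the quality, controllability, and sample efficiency of video generation.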
Details
Motivation: Existing methods that optimize the initial noise of video diffusion models at inference time suffer from error accumulation, delayed and sparse reward signals, and excessive computational cost, which blocks the use of stronger search algorithms; yet stronger search algorithms could substantially improve controllability, sample efficiency, and generation quality. Method: Proposes a latent reward model that scores partially denoised latents at arbitrary timesteps for visual quality, motion quality, and text alignment; on top of it, designs the LatSearch inference-time search mechanism, with a resampling stage driven by reward-normalised probabilities and a final pruning stage based on cumulative reward. Result: Evaluation on the VBench-2.0 benchmark shows that LatSearch consistently outperforms the baseline Wan2.1 model across multiple dimensions, improving video generation quality. Conclusion: By introducing reward guidance in the intermediate latent space, LatSearch eases the error accumulation and high computational cost of conventional inference-time optimization and offers a new paradigm for efficient inference-time scaling of video diffusion models. Abstract: The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of "golden noise" that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model.
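The two RGRP stages can be pictured with a small Python sketch: stochastic resampling of candidates in proportion to their normalised rewards, followed by a final selection of the candidate with the highest cumulative reward. The abstract does not specify how rewards are normalised, so the shift-and-divide scheme below is an assumption, and the function names (`normalise`, `resample`, `prune`) are hypothetical and do not come from the LatSearch code.

```python
import random


def normalise(rewards, eps=1e-8):
    """Turn raw rewards into probabilities (assumed scheme: shift, then divide).

    Rewards are shifted so the smallest is slightly above zero, then scaled
    to sum to one. A softmax would be an equally plausible choice.
    """
    lowest = min(rewards)
    shifted = [r - lowest + eps for r in rewards]
    total = sum(shifted)
    return [s / total for s in shifted]


def resample(candidates, rewards, k, rng):
    """Draw k candidates with replacement, weighted by normalised reward.

    Sampling, rather than keeping only the top-k, leaves lower-scored
    candidates a chance to survive and so limits reliance on the reward model.
    """
    return rng.choices(candidates, weights=normalise(rewards), k=k)


def prune(candidates, cumulative_rewards):
    """Keep only the candidate with the highest cumulative reward."""
    best = max(range(len(candidates)), key=lambda i: cumulative_rewards[i])
    return candidates[best]
```

In a full search loop, `resample` would be called at selected denoising steps on the current set of latents, each survivor would accumulate its rewards along the trajectory, and `prune` would be applied once at the final scheduled step.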
In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.
[332] Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events
Shuang Guo,Filbert Febryanto,Lei Sun,Guillermo Gallego
Main category: cs.CV
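TL;DR: Interp3R is the first method to use asynchronous event data to temporally interpolate pointmap-based 3D visual foundation models (such as DUSt3R), estimating depth and camera pose at arbitrary time instants; it needs no real-world annotations and generalizes strongly when trained only on synthetic data.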
Details
Motivation: Existing pointmap-based 3D visual foundation models can recover geometry only at the discrete instants when images are captured and cannot model how the scene evolves during the blind time between frames. Method: Interp3R uses asynchronous event data to interpolate the pointmaps produced by frame-based models, aligns the interpolated pointmaps together with the original ones into a consistent spatial framework, and jointly recovers depth and camera pose. Result: Generalizes strongly across a range of synthetic and real-world benchmarks and significantly outperforms two-stage methods that first interpolate 2D video frames and then estimate 3D geometry. Conclusion: Interp3R is the first to achieve temporally continuous 3D geometry reconstruction with pointmap-based models, offering a new paradigm for dynamic scene understanding. Abstract: In recent years, 3D visual foundation models pioneered by pointmap-based approaches such as DUSt3R have attracted a lot of interest, achieving impressive accuracy and strong generalization across diverse scenes. However, these methods are inherently limited to recovering scene geometry only at the discrete time instants when images are captured, leaving the scene evolution during the blind time between consecutive frames largely unexplored. We introduce Interp3R, to the best of our knowledge the first method that enhances pointmap-based models to estimate depth and camera poses at arbitrary time instants. Interp3R leverages asynchronous event data to interpolate pointmaps produced by frame-based models, enabling temporally continuous geometric representations. Depth and camera poses are then jointly recovered by aligning the interpolated pointmaps together with those predicted by the underlying frame-based models into a consistent spatial framework. We train Interp3R exclusively on a synthetic dataset, yet demonstrate strong generalization across a wide range of synthetic and real-world benchmarks. Extensive experiments show that Interp3R outperforms by a considerable margin state-of-the-art baselines that follow a two-stage pipeline of 2D video frame interpolation followed by 3D geometry estimation.
[333] Distilling Latent Manifolds: Resolution Extrapolation by Variational Autoencoders
Jiaming Chu,Tao Wang,Lei Jin
Main category: cs.CV
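TL;DR: This paper reports a counter-intuitive phenomenon in VAE encoder distillation: a lightweight encoder trained only at low resolution reconstructs markedly better on higher, unseen input resolutions. Experiments show that it learns a resolution-consistent latent manifold rather than a resolution-specific pixel mapping, which lowers the compute and data cost of high-resolution distillation.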
Details
Motivation: Prior work assumes that models perform better on samples from their training distribution, but the authors observe that a VAE encoder distilled at low resolution performs better on high-resolution inputs; this challenges that assumption, and the paper investigates its cause and the underlying generalization mechanism. Method: Distills a VAE encoder on ImageNet-256 (training only up to 256×256), evaluates its reconstruction at higher resolutions such as 512×512 through resolution remapping (upsampling inputs and downsampling reconstructions), and analyzes how well latent distributions align across resolutions. Result: At 512² resolution the distilled encoder improves significantly on PSNR, MSE, SSIM, LPIPS, and rFID; latent-space analysis shows that latents from high-resolution inputs lie closer to the teacher model's manifold. Conclusion: VAE encoder distillation learns a resolution-invariant latent manifold rather than a resolution-dependent pixel mapping, so high-resolution reconstruction ability can be obtained without high-resolution training data or heavy compute. Abstract: Variational Autoencoder (VAE) encoders play a critical role in modern generative models, yet their computational cost often motivates the use of knowledge distillation or quantization to obtain compact alternatives. Existing studies typically assume that a model works better on samples close to its training data distribution than on unseen data distributions. In this work, we report a counter-intuitive phenomenon in VAE encoder distillation: a compact encoder distilled only at low resolutions exhibits poor reconstruction performance at its native resolution, but achieves dramatically improved results when evaluated at higher, unseen input resolutions. Despite never being trained beyond $256^2$ resolution, the distilled encoder generalizes effectively to $512^2$ resolution inputs, partially inheriting the teacher model's resolution preference. We further analyze latent distributions across resolutions and find that higher-resolution inputs produce latent representations more closely aligned with the teacher's manifold. Through extensive experiments on ImageNet-256, we show that simple resolution remapping (upsampling inputs before encoding and downsampling reconstructions for evaluation) leads to substantial gains across PSNR, MSE, SSIM, LPIPS, and rFID metrics. These findings suggest that VAE encoder distillation learns resolution-consistent latent manifolds rather than resolution-specific pixel mappings. This also means that high training costs in memory, time, and high-resolution data are not necessary for distilling a VAE capable of high-resolution image reconstruction.
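The resolution-remapping evaluation described above can be sketched in a few lines of NumPy: upsample the input, run it through the autoencoder, downsample the reconstruction, and score it against the original. This is only an illustration under stated assumptions: the abstract does not say which interpolation was used, so nearest-neighbour upsampling and mean-pool downsampling are stand-ins, and `autoencode` is a hypothetical callable representing the distilled encoder plus decoder.

```python
import numpy as np


def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling of an (H, W, C) image (assumed choice)."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)


def downsample_mean(img, factor):
    """Mean-pool an (H, W, C) image by an integer factor (assumed choice)."""
    h, w, c = img.shape
    blocks = img.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def remapped_psnr(img, autoencode, factor=2):
    """Score a reconstruction obtained at `factor` times the native resolution.

    `autoencode` is a hypothetical stand-in for encode-then-decode.
    """
    reconstruction = autoencode(upsample_nearest(img, factor))
    return psnr(img, downsample_mean(reconstruction, factor))
```

Comparing `remapped_psnr(img, autoencode, factor=2)` with `psnr(img, autoencode(img))` on the same images would reproduce the kind of native-versus-remapped comparison the abstract reports, here with PSNR only.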
Even on low-resolution datasets, the distilled model can still learn the teacher model's detailed knowledge of high-resolution image reconstruction.
[334] ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
Surendra Pathak,Bo Han
Main category: cs.CV
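TL;DR: This paper proposes ASAP, a training-free, KV-cache-compatible visual token pruning strategy that mitigates attention shift with a dynamic bidirectional soft attention mask and reduces semantic redundancy with a weighted soft merging mechanism, cutting computation by about 80% with almost no performance loss.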
Details
Motivation: LVLMs incur quadratic computational cost when processing high-resolution images; existing token compression methods make poor use of attention values and ignore both token redundancy and the 'attention shift' phenomenon inherent in LVLMs. Method: Proposes ASAP: 1) a dynamic bidirectional soft attention mask corrects the attention shift and makes the selection of key tokens more accurate; 2) a weighted soft merging component fuses semantically similar visual tokens and keeps the most information-dense visual patches. Result: On LLaVA-NeXT-7B, retains 99.02% of the original performance while cutting FLOPs by about 80%. Conclusion: ASAP is an efficient, plug-and-play visual token compression scheme that preserves performance while delivering a significant computational speed-up, offering a practical route to LVLM deployment. Abstract: While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance. We therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while aggressively slashing computational FLOPs by ~80%.
[335] A comprehensive multimodal dataset and benchmark for ulcerative colitis scoring in endoscopy
Noha Ghatwary,Jiangbei Yue,Ahmed Elgendy,Hanna Nagdy,Ahmed Galal,Hayam Fathy,Hussein El-Amin,Venkataraman Subramanian,Noor Mohammed,Gilberto Ochoa-Ruiz,Sharib Ali
Main category: cs.CV
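TL;DR: This paper presents a multi-centre, multi-resolution endoscopic image dataset for ulcerative colitis (UC), with expert-annotated Mayo Endoscopic Score (MES) and UC Endoscopic Index of Severity (UCEIS) labels and clinical descriptive text, filling research gaps in automatic scoring and image caption generation, and provides benchmark results for a range of models.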
Details
Motivation: Automatic scoring of UC endoscopic images is held back by the lack of public expert-annotated datasets and a unified benchmark; generation of clinically meaningful image descriptions remains unexplored; and differences in equipment and workflows across centres call for multi-centre data to ensure that algorithms generalize. Method: Builds the first multi-centre, multi-resolution endoscopic image dataset that combines classification labels under two scoring standards (MES and UCEIS) with expert-written clinical image descriptions, and benchmarks CNNs, ViTs, hybrid models, and mainstream vision-language captioning models on it. Result: Releases the first multimodal UC endoscopy dataset that supports both classification and image captioning, together with performance benchmarks for a range of models, showing its value as a platform for algorithm development and evaluation. Conclusion: The dataset lays the groundwork for clinically interpretable, well-generalizing multimodal UC assessment algorithms and advances the practical use of AI in endoscopic diagnosis of inflammatory bowel disease. Abstract: Ulcerative colitis (UC) is a chronic mucosal inflammatory condition that places patients at increased risk of colorectal cancer. Colonoscopic surveillance remains the gold standard for assessing disease activity, and reporting typically relies on standardised endoscopic scoring metrics. The most widely used is the Mayo Endoscopic Score (MES), with some centres also adopting the Ulcerative Colitis Endoscopic Index of Severity (UCEIS). Both are descriptive assessments of mucosal inflammation (MES: 0 to 3; UCEIS: 0 to 8), where higher values indicate more severe disease. However, computational methods for automatically predicting these scores remain limited, largely due to the lack of publicly available expert-annotated datasets and the absence of robust benchmarking. There is also a significant research gap in generating clinically meaningful descriptions of UC images, despite image captioning being a well-established computer vision task. Variability in endoscopic systems and procedural workflows across centres further highlights the need for multi-centre datasets to ensure algorithmic robustness and generalisability. In this work, we introduce a curated multi-centre, multi-resolution dataset that includes expert-validated MES and UCEIS labels, alongside detailed clinical descriptions. To our knowledge, this is the first comprehensive dataset that combines dual scoring metrics for classification tasks with expert-generated captions describing mucosal appearance and clinically accepted reasoning for image captioning. This resource opens new opportunities for developing clinically meaningful multimodal algorithms.
In addition to the dataset, we also provide benchmarking using convolutional neural networks, vision transformers, hybrid models, and widely used multimodal vision-language captioning algorithms.
[336] Medical Image Spatial Grounding with Semantic Sampling
Andrew Seohwan Yu,Mohsen Hariri,Kunio Nakamura,Mingrui Yang,Xiaojuan Li,Vipin Chaudhary
Main category: cs.CV
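TL;DR: This paper proposes the MIS-Ground benchmark for medical image spatial grounding and the MIS-SemSam optimization method, improving VLM performance on spatial grounding of 3D anatomical structures.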
Details
Motivation: Spatial grounding of anatomical structures in 3D medical images faces distinctive challenges from image modality, slice direction, coordinate system, and anatomical terminology, and existing VLMs lack both systematic evaluation and targeted optimization. Method: Analyzes the influence of vision components (image modality, slice direction, coordinate system) and language components (anatomical, directional, and relational terminology); runs comparative experiments with several kinds of visual and textual prompts (labels, boxes, masks); builds the MIS-Ground benchmark; and proposes MIS-SemSam, a lightweight, inference-time, model-agnostic optimization based on semantic sampling. Result: MIS-SemSam improves the accuracy of Qwen3-VL-32B on MIS-Ground by 13.06%; the MIS-Ground benchmark is publicly released. Conclusion: The design of vision and language components strongly affects the medical spatial grounding ability of VLMs; MIS-Ground makes systematic testing for vulnerabilities possible, and MIS-SemSam is a general, low-cost optimization. Abstract: Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs represent a bridge between object detection and segmentation, and report understanding and generation. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges. In this study, we examine image modalities, slice directions, and coordinate systems as differentiating factors for vision components of VLMs, and the use of anatomical, directional, and relational terminology as factors for the language components. We then demonstrate that visual and textual prompting systems such as labels, bounding boxes, and mask overlays have varying effects on the spatial grounding ability of VLMs. To enable measurement and reproducibility, we introduce MIS-Ground, a benchmark that comprehensively tests a VLM for vulnerabilities against specific modes of Medical Image Spatial Grounding. We release MIS-Ground to the public at https://anonymous.4open.science/r/mis-ground. In addition, we present MIS-SemSam, a low-cost, inference-time, and model-agnostic optimization of VLMs that improves their spatial grounding ability with the use of Semantic Sampling. We find that MIS-SemSam improves the accuracy of Qwen3-VL-32B on MIS-Ground by 13.06%.
[337] Texel Splatting: Perspective-Stable 3D Pixel Art
Dylan Ebert
Main category: cs.CV
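TL;DR: This paper proposes texel splatting, a new method for pixel-art rendering of 3D scenes under perspective projection; it fixes the cubemap origin in world coordinates and splats texels to the screen, solving the failure of traditional grid-snapping methods under perspective.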
Details
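Motivation: Existing methods keep pixels stable under orthographic projection by snapping the camera to a grid, but this fails under perspective projection because pixels at different depths move at different rates, so a new method is needed for pixel stability. Method: Uses texel splatting: scene geometry is rendered into a cubemap from a fixed point in world coordinates, and each texel is then splatted to the screen as a world-space quad; cubemap indexing gives rotation invariance, and grid-snapping the origin gives translation invariance. Result: The method achieves stable pixel-art rendering under perspective projection and removes the depth dependence of traditional methods, but the fixed origin leads to disocclusion and probe-boundary problems. Conclusion: Texel splatting is an effective solution for pixel-art rendering under perspective; although limited by the visibility trade-off of a fixed origin, it significantly improves pixel stability and stylistic consistency during motion. Abstract: Rendering 3D scenes as pixel art requires that discrete pixels remain stable as the camera moves. Existing methods snap the camera to a grid. Under orthographic projection, this works: every pixel shifts by the same amount, and a single snap corrects all of them. Perspective breaks this. Pixels at different depths drift at different rates, and no single snap corrects all depths. Texel splatting avoids this entirely. Scene geometry is rendered into a cubemap from a fixed point in the world, and each texel is splatted to the screen as a world-space quad. Cubemap indexing gives rotation invariance. Grid-snapping the origin gives translation invariance. The primary limitation is that a fixed origin cannot see all geometry; disocclusion at probe boundaries remains an open tradeoff.
[338] GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data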
Roger Ferrod,Maël Lecene,Krishna Sapkota,George Leifman,Vered Silverman,Genady Beryozkin,Sylvain Lobry
Main category: cs.CV
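TL;DR: This paper introduces a large-scale remote sensing dataset for spatial understanding grounded in cadastral vector data (3.8 million annotated objects, 510k high-resolution images, 135 semantic categories) and validates it with an instruction-tuning benchmark of seven spatial reasoning tasks. The results show that high-quality supervision substantially improves fine-grained spatial grounding in remote sensing for standard multimodal large models such as LLaVA, without complex architectural changes.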
Details
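Motivation: Current multimodal large language models for remote sensing are seriously deficient in fine-grained spatial understanding, mainly because they rely on limited or repurposed legacy datasets that cannot support precise geospatial analysis.
Method: Builds a large-scale remote sensing image dataset grounded in real cadastral vector data and designs an instruction-tuning benchmark covering seven spatial reasoning tasks; runs baseline experiments with a standard LLaVA architecture and compares RS-specialized and commercial models (e.g., Gemini) in zero-shot and supervised fine-tuning settings.
Result: After high-quality supervised fine-tuning, the standard LLaVA model clearly outperforms current RS-specialized and commercial multimodal models on several spatial reasoning tasks; all models perform poorly zero-shot, showing that supervision is essential for spatial grounding.
Conclusion: High-quality, geographically precise annotations are key to improving multimodal models' spatial understanding in remote sensing; fine-grained supervision alone, with no architectural changes, lets general-purpose models handle fine-grained spatial grounding.
Abstract: Precise spatial understanding in Earth Observation is essential for translating raw aerial imagery into actionable insights for critical applications like urban planning, environmental monitoring and disaster management. However, Multimodal Large Language Models exhibit critical deficiencies in fine-grained spatial understanding within Remote Sensing, primarily due to a reliance on limited or repurposed legacy datasets. To bridge this gap, we introduce a large-scale dataset grounded in verifiable cadastral vector data, comprising 3.8 million annotated objects across 510k high-resolution images with 135 granular semantic categories. We validate this resource through a comprehensive instruction-tuning benchmark spanning seven spatial reasoning tasks. Our evaluation establishes a robust baseline using a standard LLaVA architecture. We show that while current RS-specialized and commercial models (e.g., Gemini) struggle in zero-shot settings, high-fidelity supervision effectively bridges this gap, enabling standard architectures to master fine-grained spatial grounding without complex architectural modifications.
[339] Make it SING: Analyzing Semantic Invariants in Classifiers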
Harel Yadid,Meir Yossef Levi,Roy Betser,Guy Gilboa
Main category: cs.CV
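TL;DR: This paper proposes SING, which maps network features to multi-modal vision language models to give semantic interpretations of the invariants in a classifier's null space, revealing differences in how well models preserve class semantics.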
Details
Motivation: Existing approaches struggle to give human-interpretable explanations of the semantic content of invariants in a classifier's null space; this paper aims to fill that gap.
Method: Proposes Semantic Interpretation of the Null-space Geometry (SING), which uses a mapping from network features to multi-modal vision language models to generate equivalent images and assign them semantic interpretations, supporting local analysis of a single image or statistical analysis over image sets.
Result: SING reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DinoViT, pretrained with self-supervision, better maintains class semantics across the invariant space.
Conclusion: SING provides an interpretable, semantic tool for analyzing the geometric invariances of deep classifiers, and helps evaluate and improve models' semantic robustness.
Abstract: All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DinoViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.
[340] A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT Scans
Aadit Nilay,Bhavesh Thapar,Anant Agrawal,Mohammad Nayeem Teli
Main category: cs.CV
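TL;DR: This paper proposes a heterogeneous ensemble that fuses nine deep learning models with different architectures and training strategies, mitigates overfitting with domain-aware augmentation, Focal Loss, and embedding-level Mixup, and applies weighted probability fusion with per-source threshold calibration. It achieves an average macro F1 of 0.9280 across four hospital centres, clearly outperforming any single model.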
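The fusion step named in this entry (score-weighted probability averaging followed by a per-source threshold) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of validation scores as weights, and all numeric values are hypothetical.

```python
def fuse_probabilities(model_probs, model_scores):
    """Score-weighted average of per-model positive-class probabilities."""
    total = sum(model_scores)
    return sum(p * s for p, s in zip(model_probs, model_scores)) / total

def classify(model_probs, model_scores, source, thresholds, default=0.5):
    """Apply a per-source decision threshold to the fused probability."""
    fused = fuse_probabilities(model_probs, model_scores)
    return int(fused >= thresholds.get(source, default))

# Hypothetical values: three models, a validation score per model used as
# its weight, and thresholds that would come from per-source tuning.
probs = [0.9, 0.6, 0.3]
scores = [0.9, 0.8, 0.7]
thresholds = {"centre_a": 0.5, "centre_b": 0.7}
```

With these example numbers the fused probability is 0.625, so the same scan would be labelled positive at `centre_a` and negative at `centre_b`, which is the effect a per-source threshold is meant to have.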
Details
Motivation: For COVID-19 diagnosis, RT-PCR is slow and has a high false-negative rate, while CT screening is faster but depends on expert interpretation; deploying automated CT analysis across multiple centres causes severe domain shift from differences in scanners, protocols, and patient populations, which degrades single-model performance.
Method: Builds a heterogeneous ensemble of nine models: a self-supervised DINOv2 ViT, a RadImageNet-pretrained DenseNet-121, and seven Gated Attention MIL models on different backbones (EfficientNet-B3, ConvNeXt-Tiny, EfficientNetV2-S); increases diversity with random-seed variation and SWA; mitigates overfitting with Focal Loss, embedding-level Mixup, and domain-aware augmentation; fuses outputs by score-weighted probability averaging and calibrates them with per-source threshold optimization.
Result: Average macro F1 of 0.9280 across four hospital centres, 0.031 above the best single model (F1=0.8969); the validation-to-training loss ratio falls from 35x to under 3x, substantially reducing overfitting.
Conclusion: A heterogeneous ensemble combined with source-aware calibration is key to robust multi-centre medical image classification and handles domain shift and overfitting effectively.
Abstract: The COVID-19 pandemic exposed critical limitations in diagnostic workflows: RT-PCR tests suffer from slow turnaround times and high false-negative rates, while CT-based screening offers faster complementary diagnosis but requires expert radiological interpretation. Deploying automated CT analysis across multiple hospital centres introduces further challenges, as differences in scanner hardware, acquisition protocols, and patient populations cause substantial domain shift that degrades single-model performance. To address these challenges, we present a heterogeneous ensemble of nine models spanning three inference paradigms: (1) a self-supervised DINOv2 Vision Transformer with slice-level sigmoid aggregation, (2) a RadImageNet-pretrained DenseNet-121 with slice-level sigmoid averaging, and (3) seven Gated Attention Multiple Instance Learning models using EfficientNet-B3, ConvNeXt-Tiny, and EfficientNetV2-S backbones with scan-level softmax classification. Ensemble diversity is further enhanced through random-seed variation and Stochastic Weight Averaging. We address severe overfitting, reducing the validation-to-training loss ratio from 35x to less than 3x, through a combination of Focal Loss, embedding-level Mixup, and domain-aware augmentation. Model outputs are fused via score-weighted probability averaging and calibrated with per-source threshold optimization.
The final ensemble achieves an average macro F1 of 0.9280 across four hospital centres, outperforming the best single model (F1=0.8969) by +0.031, demonstrating that heterogeneous architectures combined with source-aware calibration are essential for robust multi-site medical image classification.
[341] Continual Few-shot Adaptation for Synthetic Fingerprint Detection
Joseph Geo Benjamin,Anil K. Jain,Karthik Nandakumar
Main category: cs.CV
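TL;DR: This paper proposes a continual few-shot adaptation method for detecting synthetic fingerprint images produced by unseen generative AI models. It combines binary cross-entropy with a supervised contrastive loss and replays samples from earlier styles during fine-tuning to mitigate catastrophic forgetting.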
Details
Motivation: As generative AI advances, synthetic fingerprint images are becoming more realistic, leaving fingerprint recognition systems vulnerable to data injection attacks; detection methods that generalize to unseen generative models are urgently needed.
Method: Formulates synthetic fingerprint detection as a continual few-shot adaptation problem, trains jointly with a binary cross-entropy loss and a supervised contrastive loss on the feature representation, and replays a few samples from previously seen styles each time the detector adapts to a new synthetic style, to prevent catastrophic forgetting.
Result: Experiments with several DNN backbones and a range of real and synthetic fingerprint datasets show that the method strikes a good balance between adapting quickly to new synthetic styles and retaining detection of known styles.
Conclusion: The proposed continual few-shot adaptation framework improves the detector's generalization to unseen generative models while mitigating catastrophic forgetting during incremental learning.
Abstract: The quality and realism of synthetically generated fingerprint images have increased significantly over the past decade fueled by advancements in generative artificial intelligence (GenAI). This has exacerbated the vulnerability of fingerprint recognition systems to data injection attacks, where synthetic fingerprints are maliciously inserted during enrollment or authentication. Hence, there is an urgent need for methods to detect if a fingerprint image is real or synthetic. While it is straightforward to train deep neural network (DNN) models to classify images as real or synthetic, often such DNN models overfit the training data and fail to generalize well when applied to synthetic fingerprints generated using unseen GenAI models. In this work, we formulate synthetic fingerprint detection as a continual few-shot adaptation problem, where the objective is to rapidly evolve a base detector to identify new types of synthetic data. To enable continual few-shot adaptation, we employ a combination of binary cross-entropy and supervised contrastive (applied to the feature representation) losses and replay a few samples from previously known styles during fine-tuning to mitigate catastrophic forgetting. Experiments based on several DNN backbones (as feature extractors) and a variety of real and synthetic fingerprint datasets indicate that the proposed approach achieves a good trade-off between fast adaptation for detecting unseen synthetic styles and forgetting of known styles.
[342] Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion
Mang Ning,Mingxiao Li,Le Zhang,Lanmiao Liu,Matthew B. Blaschko,Albert Ali Salah,Itir Onal Ertugrul
Main category: cs.CV
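TL;DR: This paper studies the diffusability (learnability) of variational autoencoders (VAEs) in latent diffusion and proposes the Spectrum Matching Hypothesis, comprising Encoding Spectrum Matching (ESM) and Decoding Spectrum Matching (DSM). Experiments on CelebA and ImageNet show it outperforms existing methods; the paper also extends the view to representation alignment (REPA) with a DoG-based improvement.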
Details
Motivation: Pixel-space diffusion models trained with an MSE objective are inherently biased toward learning low and mid frequencies, and the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial; this motivates the authors to ask how to improve the diffusability of latent representations.
Method: Proposes the Spectrum Matching Hypothesis (ESM and DSM): ESM is applied by matching the PSD between images and latents, and DSM via frequency-aligned reconstruction with shared spectral masking; extends the spectral view to representation alignment (REPA) and proposes a difference-of-Gaussians (DoG) based method that targets directional spectral energy.
Result: On CelebA and ImageNet, Spectrum Matching clearly improves diffusion generation quality and outperforms prior methods such as VA-VAE and EQ-VAE; the DoG-based method also improves REPA.
Conclusion: Spectral properties are key to understanding and improving VAE latent diffusion; Spectrum Matching offers a unified framework for analyzing and improving latent representations and explains several existing methods.
Abstract: In this paper, we study the diffusability (learnability) of variational autoencoders (VAE) in latent diffusion. First, we show that pixel-space diffusion trained with an MSE objective is inherently biased toward learning low and mid spatial frequencies, and that the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial. Motivated by this result, we propose the Spectrum Matching Hypothesis: latents with superior diffusability should (i) follow a flattened power-law PSD (Encoding Spectrum Matching, ESM) and (ii) preserve frequency-to-frequency semantic correspondence through the decoder (Decoding Spectrum Matching, DSM). In practice, we apply ESM by matching the PSD between images and latents, and DSM via shared spectral masking with frequency-aligned reconstruction. Importantly, Spectrum Matching provides a unified view that clarifies prior observations of over-noisy or over-smoothed latents, and interprets several recent methods as special cases (e.g., VA-VAE, EQ-VAE). Experiments suggest that Spectrum Matching yields superior diffusion generation on CelebA and ImageNet datasets, and outperforms prior approaches. Finally, we extend the spectral view to representation alignment (REPA): we show that the directional spectral energy of the target representation is crucial for REPA, and propose a DoG-based method to further improve the performance of REPA. Our code is available at https://github.com/forever208/SpectrumMatching.
[343] TopoCL: Topological Contrastive Learning for Medical Imaging
Guangyu Meng,Pengfei Gu,Peixian Liang,John P. Lalor,Erin Wolf Chambers,Danny Z. Chen
Main category: cs.CV
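TL;DR: This paper proposes TopoCL, a new topological contrastive learning framework that explicitly exploits topological structure in contrastive learning for medical images through topology-aware augmentations, a hierarchical topology encoder, and an adaptive mixture-of-experts module, and that clearly improves several mainstream contrastive learning methods across multiple medical image classification tasks.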
Details
Motivation: Existing contrastive learning methods focus mainly on visual appearance features and neglect topological characteristics (e.g., connectivity, boundary configurations, cavities) that matter for medical image analysis.
Method: Proposes the TopoCL framework: (1) topology-aware data augmentation based on a relative bottleneck distance between persistence diagrams; (2) a hierarchical topology encoder combining self-attention and cross-attention; (3) an adaptive MoE module that dynamically fuses visual and topological representations. The framework plugs into existing CL methods.
Result: Evaluated on five mainstream contrastive learning methods (SimCLR, MoCo-v3, BYOL, DINO, Barlow Twins) and five medical image classification datasets, with an average gain of +3.26% in linear probe classification accuracy that is statistically significant.
Conclusion: Explicitly incorporating topological priors into contrastive learning improves the quality of medical image representations; TopoCL is a general, effective, plug-and-play framework.
Abstract: Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of +3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.
[344] Human-AI Ensembles Improve Deepfake Detection in Low-to-Medium Quality Videos
Marco Postiglione,Isabel Gortner,V. S. Subrahmanian
Main category: cs.CV
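TL;DR: This paper compares humans and AI detectors on deepfake detection and finds that humans outperform AI detectors on both the standard benchmark DF40 and CharadesDF, a newly built dataset of everyday-activity videos, with a larger margin on the low-quality, phone-recorded CharadesDF. Human and AI errors are complementary, and combining the two substantially reduces high-confidence errors, suggesting that real-world deepfake detection needs human-AI collaboration rather than AI algorithms alone.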
Details
Motivation: Deepfake detection is usually treated as a machine learning problem, but the relative performance of humans and AI detectors under realistic conditions is unclear, and systematic evaluation on non-professionally produced videos (such as phone recordings) is lacking.
Method: Recruits 200 human participants and tests 95 state-of-the-art AI detectors on two datasets (the standard benchmark DF40 and the new phone-recorded everyday-activity dataset CharadesDF), then analyzes human and AI error patterns and the effect of combining them.
Result: Humans significantly outperform AI detectors on both datasets; on CharadesDF, AI accuracy collapses to near chance (0.537) while humans stay robust (0.784); humans tend to miss high-quality deepfakes, whereas AI detectors often flag authentic videos as fake; the errors are complementary, and hybrid human-AI ensembles reduce high-confidence errors.
Conclusion: Deepfake detection in the real world, especially on non-professional videos, cannot rely on AI algorithms alone and should adopt a collaborative human-AI approach.
Abstract: Deepfake detection is widely framed as a machine learning problem, yet how humans and AI detectors compare under realistic conditions remains poorly understood. We evaluate 200 human participants and 95 state-of-the-art AI detectors across two datasets: DF40, a standard benchmark, and CharadesDF, a novel dataset of videos of everyday activities. CharadesDF was recorded using mobile phones, leading to low/moderate quality videos compared to the more professionally captured DF40. Humans outperform AI detectors on both datasets, with the gap widening in the case of CharadesDF where AI accuracy collapses to near chance (0.537) while humans maintain robust performance (0.784). Human and AI errors are complementary: humans miss high-quality deepfakes while AI detectors flag authentic videos as fake, and hybrid human-AI ensembles reduce high-confidence errors. These findings suggest that effective real-world deepfake detection, especially in non-professionally produced videos, requires human-AI collaboration rather than AI algorithms alone.
[345] VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
Daeun Lee,Shoubin Yu,Yue Zhang,Mohit Bansal
Main category: cs.CV
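TL;DR: VisionCoach is an input-adaptive reinforcement learning framework that improves spatio-temporal grounding in video reasoning through visual prompting at training time, and uses self-distillation so that the model grounds accurately at inference without external prompts.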
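The grounding reward in this entry is described as enforcing bounding-box overlap over multiple regions. One common way to score such overlap is intersection over union (IoU); the sketch below shows that measure only and is not the paper's reward. Pairing predicted and ground-truth regions by index and the function names are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def mean_region_iou(pred_boxes, gt_boxes):
    """Average IoU over paired regions; pairing by index is an assumption."""
    pairs = zip(pred_boxes, gt_boxes)
    return sum(box_iou(p, g) for p, g in pairs) / len(gt_boxes)
```

A reward built this way would be 1.0 when every predicted region matches its target exactly and 0.0 when none overlap.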
Details
Motivation: Existing methods struggle to achieve reliable spatio-temporal grounding in video reasoning, and depend on large-scale annotated data or inference-time perception tools, which is costly.
Method: Proposes the VisionCoach framework, comprising a Visual Prompt Selector and a Spatio-Temporal Reasoner; during RL training, visual prompts are applied selectively to amplify relevant evidence and suppress distractors, and the ability is internalized through self-distillation; an object-aware grounding reward enforces object identity consistency and multi-region bounding-box overlap.
Result: Reaches state-of-the-art performance on several video reasoning and understanding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, Charades-STA) with a single efficient inference pathway and no external tools.
Conclusion: Visual prompting during training improves spatio-temporal grounding in video, and self-distillation lets the model keep this ability without prompts.
Abstract: Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation cost or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings, across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools.
Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
[346] EviATTA: Evidential Active Test-Time Adaptation for Medical Segment Anything Models
Jiayi Chen,Yasmeen George,Winston Chong,Jianfei Cai
Main category: cs.CV
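TL;DR: This paper proposes EviATTA, which the authors describe as the first active test-time adaptation framework for medical SAMs. It decomposes uncertainty through evidential modeling and designs hierarchical sampling and dual consistency regularization, substantially improving the reliability of cross-domain medical image segmentation with little expert feedback.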
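For readers unfamiliar with Dirichlet-based evidential modeling, the sketch below shows the standard quantities it builds on: expected class probabilities and the vacuity term K / S, which grows when total evidence is scarce. It assumes the common convention alpha_k = evidence_k + 1 and does not reproduce the paper's specific split into distribution and data uncertainty; the function name is hypothetical.

```python
def dirichlet_summary(evidence):
    """Expected class probabilities and vacuity from non-negative evidence.

    Uses the common evidential-learning convention alpha_k = evidence_k + 1;
    this is an assumption, not necessarily the parameterization in the paper.
    """
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                   # Dirichlet strength S
    probs = [a / strength for a in alpha]   # expected probabilities alpha_k / S
    vacuity = len(alpha) / strength         # K / S, high when evidence is scarce
    return probs, vacuity
```

With no evidence for either of two classes the vacuity is 1.0, and it falls toward 0 as evidence accumulates, which is the kind of signal an active sampler could use to pick shifted samples.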
Details
Motivation: Under large distribution shifts, test-time adaptation (TTA) of foundational medical SAMs is limited, and existing active TTA (ATTA) methods suffer from unreliable uncertainty estimation and inefficient use of sparse annotations.
Method: Proposes Evidential Active Test-Time Adaptation (EviATTA): (1) Dirichlet-based evidential modeling decomposes predictive uncertainty into distribution uncertainty and data uncertainty; (2) a hierarchical evidential sampling strategy uses image-wise distribution uncertainty to select samples and distance-aware data uncertainty to guide pixel annotation; (3) dual consistency regularization applies progressive prompt consistency on sparsely labeled samples and variational feature consistency on unlabeled samples.
Result: Experiments on six medical image segmentation datasets show that EviATTA consistently improves adaptation reliability with minimal expert feedback in both batch-wise and instance-wise TTA settings.
Conclusion: EviATTA is presented as the first ATTA framework designed for medical SAMs; through fine-grained uncertainty modeling and efficient use of annotations, it eases test-time adaptation under large distribution shifts and improves robustness for clinical deployment.
Abstract: Deploying foundational medical Segment Anything Models (SAMs) via test-time adaptation (TTA) is challenging under large distribution shifts, where test-time supervision is often unreliable. While active test-time adaptation (ATTA) introduces limited expert feedback to improve reliability, existing ATTA methods still suffer from unreliable uncertainty estimation and inefficient utilization of sparse annotations. To address these issues, we propose Evidential Active Test-Time Adaptation (EviATTA), which is, to our knowledge, the first ATTA framework tailored for medical SAMs. Specifically, we adopt the Dirichlet-based Evidential Modeling to decompose overall predictive uncertainty into distribution uncertainty and data uncertainty. Building on this decomposition, we design a Hierarchical Evidential Sampling strategy, where image-wise distribution uncertainty is used to select informative shifted samples, while distance-aware data uncertainty guides sparse pixel annotations to resolve data ambiguities. We further introduce Dual Consistency Regularization, which enforces progressive prompt consistency on sparsely labeled samples to better exploit sparse supervision and applies variational feature consistency on unlabeled samples to stabilize adaptation.
Extensive experiments on six medical image segmentation datasets demonstrate that EviATTA consistently improves adaptation reliability with minimal expert feedback under both batch-wise and instance-wise test-time adaptation settings.
[347] Comparative Analysis of 3D Convolutional and 2.5D Slice-Conditioned U-Net Architectures for MRI Super-Resolution via Elucidated Diffusion Models
Hendrik Chiche,Ludovic Corcos,Logan Rouge
Main category: cs.CV
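TL;DR: This paper proposes a brain MRI super-resolution method based on the elucidated diffusion model (EDM) framework and compares a full 3D convolutional U-Net with a 2.5D slice-conditioned U-Net. On the FOMO60K dataset, the 3D model outperforms the baseline and the 2.5D variant on PSNR, SSIM, and LPIPS.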
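Since the comparison in this entry is reported in PSNR, a short reminder of how that metric is computed may help when reading the numbers. The sketch below uses the standard definition, 10 * log10(MAX^2 / MSE); the flat-list input format and the `max_value=1.0` default are assumptions for illustration and say nothing about the authors' evaluation pipeline.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, the roughly 2 dB gap between the 3D model and the other two methods corresponds to a mean squared error that is lower by a factor of about 1.6.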
Details
Motivation: Develop computationally efficient MRI super-resolution as an alternative to expensive high-field scanners, lowering clinical imaging cost and improving image quality.
Method: Uses the EDM framework and compares two U-Net backbones: (i) a full 3D convolutional U-Net that processes volumetric patches with multi-head self-attention, and (ii) a 2.5D slice-conditioned U-Net that super-resolves slice by slice and uses an adjacent slice for inter-slice context; both use the continuous-sigma noise conditioning of Karras et al. and are trained on the NKI cohort of FOMO60K.
Result: On a held-out test set of 5 subjects (6 volumes, 993 slices), the 3D model reaches 37.75 dB PSNR, 0.997 SSIM, and 0.020 LPIPS, better than the EDSR baseline (35.57 dB / 0.024 LPIPS) and the 2.5D model (35.82 dB).
Conclusion: The full 3D diffusion model shows stronger modeling capacity and performance for brain MRI super-resolution, confirming its potential for medical image enhancement.
Abstract: Magnetic resonance imaging (MRI) super-resolution (SR) methods that computationally enhance low-resolution acquisitions to approximate high-resolution quality offer a compelling alternative to expensive high-field scanners. In this work we investigate an elucidated diffusion model (EDM) framework for brain MRI SR and compare two U-Net backbone architectures: (i) a full 3D convolutional U-Net that processes volumetric patches with 3D convolutions and multi-head self-attention, and (ii) a 2.5D slice-conditioned U-Net that super-resolves each slice independently while conditioning on an adjacent slice for inter-slice context. Both models employ continuous-sigma noise conditioning following Karras et al. and are trained on the NKI cohort of the FOMO60K dataset. On a held-out test set of 5 subjects (6 volumes, 993 slices), the 3D model achieves 37.75 dB PSNR, 0.997 SSIM, and 0.020 LPIPS, improving on the off-the-shelf pretrained EDSR baseline (35.57 dB / 0.024 LPIPS) and the 2.5D variant (35.82 dB) across all three metrics under the same test data and degradation pipeline.
[348] E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
Yunsoo Kim,Changki Sung,Dasol Hong,Hyun Myung
Main category: cs.CV
TL;DR: E2EGS is a pose-free framework for event-based novel view synthesis that uses edge information from noisy event streams to improve trajectory estimation and 3D reconstruction quality.
Details
Motivation: Existing NeRF and 3DGS methods require high-quality RGB inputs and accurate poses, limiting robustness in real-world scenarios; event cameras offer advantages but current event-based NVS methods either assume known poses or rely on depth models with limited generalization.
Method: E2EGS extracts edges from event streams using patch-based temporal coherence analysis to distinguish edge-induced consistent events from noise; extracted edges guide structure-aware Gaussian initialization and edge-weighted losses during initialization, tracking, and bundle adjustment.
Result: E2EGS achieves superior reconstruction quality and trajectory accuracy on both synthetic and real datasets, enabling fully pose-free event-based 3D reconstruction.
Conclusion: Edge information from event streams is crucial for robust pose-free 3D reconstruction, and E2EGS establishes a new paradigm by leveraging spatio-temporal edge characteristics for noise-robust edge extraction and structure-aware optimization.
Abstract: The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods either assume known poses or rely on depth estimation models that are bounded by their initial observations, failing to generalize as the camera traverses previously unseen regions. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS.
To extract edges from noisy event streams, we exploit the distinct spatio-temporal characteristics of edges and non-edge regions. The event camera's movement induces consistent events along edges, while non-edge regions produce sparse noise. We leverage this through a patch-based temporal coherence analysis that measures local variance to extract edges while robustly suppressing noise. The extracted edges guide structure-aware Gaussian initialization and enable edge-weighted losses throughout initialization, tracking, and bundle adjustment. Extensive experiments on both synthetic and real datasets demonstrate that E2EGS achieves superior reconstruction quality and trajectory accuracy, establishing a fully pose-free paradigm for event-based 3D reconstruction.
[349] MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model
Jinguang Tong,Jinbo Wu,Kaisiyuan Wang,Zhelun Shen,Xuan Huang,Mochu Xiang,Xuesong Li,Yingying Li,Haocheng Feng,Chen Zhao,Hang Zhou,Wei He,Chuong Nguyen,Jingdong Wang,Hongdong Li
Main category: cs.CV
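TL;DR: This paper proposes MVHOI, a two-stage framework in which a 3D foundation model and a controllable video generation model work together to produce high-quality, long-duration human-object interaction video reenactment, and which is particularly good at complex 3D object manipulation.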
Details
Motivation: Existing methods cannot handle complex non-planar motion in human-object interaction (such as out-of-plane rotation of objects) and are limited to simple image-plane motion.
Method: Proposes the two-stage MVHOI framework: in the first stage, a 3D Foundation Model (3DFM) produces view-consistent object priors conditioned on implicit motion dynamics; in the second, a controllable video generation model synthesizes high-fidelity texture by incorporating multi-view reference images with a retrieval mechanism. The two stages reinforce each other at inference.
Result: Clearly outperforms prior methods on complex 3D human-object interaction video reenactment, especially for long durations and intricate object manipulation.
Conclusion: By bridging multi-view conditions and video foundation models, MVHOI improves the realism and motion complexity of HOI video reenactment and confirms the value of combining 3D priors with controllable generation.
Abstract: Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.
[350] Robust Building Damage Detection in Cross-Disaster Settings Using Domain Adaptation
Asmae Mouradi,Shruti Kshirsagar
Main category: cs.CV
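TL;DR: This paper proposes a two-stage ensemble approach based on supervised domain adaptation (SDA) for building damage classification across geographic regions, substantially improving the robustness and trustworthiness of detection on unseen regions.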
Details
Motivation: Models trained on multi-disaster benchmarks perform worse on unseen geographic regions because of domain shift, which undermines human trust in automated assessments.
Method: Adapts the xView2 first-place method to the Ida-BD dataset using supervised domain adaptation (SDA), systematically analyzes the effect of individual augmentation components, and verifies the key role of SDA through ablation experiments.
Result: On the unseen Ida-BD test split, SDA with unsharp-enhanced RGB input reaches a Macro-F1 of 0.5552; without SDA, damage detection fails entirely.
Conclusion: Domain adaptation is a key step in building trustworthy automated damage assessment modules for disaster response integrated with human-machine systems (HMS).
Abstract: Rapid structural damage assessment from remote sensing imagery is essential for timely disaster response. Within human-machine systems (HMS) for disaster management, automated damage detection provides decision-makers with actionable situational awareness. However, models trained on multi-disaster benchmarks often underperform in unseen geographic regions due to domain shift - a distributional mismatch between training and deployment data that undermines human trust in automated assessments. We explore a two-stage ensemble approach using supervised domain adaptation (SDA) for building damage classification across four severity classes. The pipeline adapts the xView2 first-place method to the Ida-BD dataset using SDA and systematically investigates the effect of individual augmentation components on classification performance. Comprehensive ablation experiments on the unseen Ida-BD test split demonstrate that SDA is indispensable: removing it causes damage detection to fail entirely. Our pipeline achieves the most robust performance using SDA with unsharp-enhanced RGB input, attaining a Macro-F1 of 0.5552. These results underscore the critical role of domain adaptation in building trustworthy automated damage assessment modules for HMS-integrated disaster response.
[351] AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild
Yiting Wang,Tim Brödermann,Hamed Haghighi,Haonan Zhao,Christos Sakaridis,Kurt Debattista,Valentina Donzella
Main category: cs.CV
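TL;DR: This paper introduces AURORA-KITTI, the first large-scale multi-modal, multi-weather benchmark for robust depth completion under adverse weather, and formulates depth completion and denoising (DCD) as a unified joint task. It also proposes DDCD, an efficient distillation-based baseline that uses depth foundation models to inject clean structural priors and reaches state-of-the-art performance on multiple datasets.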
Details
Motivation: Existing RGB-LiDAR fusion methods degrade markedly under adverse weather because both camera images and LiDAR measurements are corrupted, and there is no well-registered multi-modal dataset with realistic weather degradation and ground truth, nor a corresponding task formulation.
Method: Builds the AURORA-KITTI benchmark (over 82K weather-consistent RGBL samples covering multiple weather types and severities, day and night, lens occlusion, textual descriptions, and paired clean references); formulates the unified task Depth Completion and Denoising (DCD); designs the distillation baseline DDCD, which uses depth foundation models to supply structural priors and improve the robustness of in-the-wild training.
Result: DDCD reaches state-of-the-art performance on AURORA-KITTI and the real-world DENSE dataset while remaining efficient; experiments show that weather-aware, physically consistent data improves robustness more than architectural changes alone.
Conclusion: High-quality, physically interpretable multi-weather, multi-modal data is key to robust depth completion; the unified DCD task and foundation-model-based distillation offer a new direction for 3D perception in adverse weather.
Abstract: Robust depth completion is fundamental to real-world 3D scene understanding, yet existing RGB-LiDAR fusion methods degrade significantly under adverse weather, where both camera images and LiDAR measurements suffer from weather-induced corruption. In this paper, we introduce AURORA-KITTI, the first large-scale multi-modal, multi-weather benchmark for robust depth completion in the wild. We further formulate Depth Completion and Denoising (DCD) as a unified task that jointly reconstructs a dense depth map from corrupted sparse inputs while suppressing weather-induced noise. AURORA-KITTI contains over 82K weather-consistent RGBL pairs with metric depth ground truth, spanning diverse weather types, three severity levels, day and night scenes, paired clean references, lens occlusion conditions, and textual descriptions. Moreover, we introduce DDCD, an efficient distillation-based baseline that leverages depth foundation models to inject clean structural priors into in-the-wild DCD training. DDCD achieves state-of-the-art performance on AURORA-KITTI and the real-world DENSE dataset while maintaining efficiency. Notably, our results further show that weather-aware, physically consistent data contributes more to robustness than architectural modifications alone. Data and code will be released upon publication.
[352] Fractal Autoregressive Depth Estimation with Continuous Token Diffusion
Jinchang Zhang,Xinrou Kang,Guoyu Lu
Main category: cs.CV
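TL;DR: This paper proposes a Fractal Visual Autoregressive Diffusion framework that reformulates monocular depth estimation as a coarse-to-fine, scale-by-scale autoregressive generation process. It improves performance and efficiency through multi-scale feature fusion, diffusion modeling in continuous space, and a fractal recursive structure, and adds uncertainty-aware robust consensus aggregation to stabilize inference.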
Details
Motivation: Applying autoregressive modeling directly to monocular depth estimation is hindered by the large modality gap between RGB and depth, inefficient pixel-wise generation, and unstable continuous depth prediction.
Method: Proposes the Fractal Visual Autoregressive Diffusion framework: (1) a VCFR module fuses multi-scale image features with the current depth prediction to strengthen cross-modal conditioning; (2) a conditional denoising diffusion loss models the depth distribution directly in continuous space, avoiding discrete quantization error; (3) a fractal recursive architecture reuses a base visual AR unit to share parameters across scales; (4) an uncertainty-aware robust consensus aggregation scheme supports multi-sample inference and reliability estimation.
Result: Experiments on standard benchmarks show strong performance and validate each design component.
Conclusion: The framework eases the key challenges of autoregressive modeling for monocular depth estimation and improves accuracy, efficiency, and stability together.
Abstract: Monocular depth estimation can benefit from autoregressive (AR) generation, but direct AR modeling is hindered by the modality gap between RGB and depth, inefficient pixel-wise generation, and instability in continuous depth prediction. We propose a Fractal Visual Autoregressive Diffusion framework that reformulates depth estimation as a coarse-to-fine, next-scale autoregressive generation process. A VCFR module fuses multi-scale image features with current depth predictions to improve cross-modal conditioning, while a conditional denoising diffusion loss models depth distributions directly in continuous space and mitigates errors caused by discrete quantization. To improve computational efficiency, we organize the scale-wise generators into a fractal recursive architecture, reusing a base visual AR unit in a self-similar hierarchy. We further introduce an uncertainty-aware robust consensus aggregation scheme for multi-sample inference to improve fusion stability and provide a practical pixel-wise reliability estimate. Experiments on standard benchmarks demonstrate strong performance and validate the effectiveness of the proposed design.
[353] AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers
Salim Khazem
Main category: cs.CV
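TL;DR: This paper proposes AdapterTune, which uses zero-initialized residual low-rank bottleneck adapters to address unstable optimization and the lack of guidance on adapter capacity when fine-tuning on a frozen ViT backbone. Theoretical analysis treats adapter rank as a capacity budget for approximating task shifts in feature space and predicts elbow-shaped accuracy gains as rank increases. Experiments show it clearly outperforms head-only fine-tuning across multiple datasets and model scales, with far better parameter efficiency than full fine-tuning.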
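The zero-initialization property described in this entry can be checked with a few lines of plain Python. The sketch below builds a residual low-rank bottleneck, x + up @ relu(down @ x), and sets the up-projection to zeros so the adapted output equals the input at the start of training. The ReLU, the helper names, and the toy dimensions are assumptions; the abstract does not specify the nonlinearity or the layer sizes.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def adapter_forward(x, down, up):
    """Residual low-rank bottleneck: x + up @ relu(down @ x).

    The ReLU nonlinearity is an assumption; the abstract does not specify one.
    """
    hidden = [max(0.0, h) for h in matvec(down, x)]
    return [xi + di for xi, di in zip(x, matvec(up, hidden))]

dim, rank = 4, 2
# Down-projection (rank x dim) with arbitrary non-zero toy values.
down = [[0.1 * (i + j + 1) for j in range(dim)] for i in range(rank)]
# Up-projection (dim x rank) initialized to zeros, as in the entry above.
up = [[0.0] * rank for _ in range(dim)]
x = [1.0, -2.0, 0.5, 3.0]
y = adapter_forward(x, down, up)  # identical to x while `up` is all zeros
```

Only once training moves `up` away from zero does the adapter start to change the block's output, which is why the adapted network begins exactly at the pretrained function.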
Details
Motivation: Frozen-backbone transfer with Vision Transformers面临两个未被充分解决的问题:1)在固定特征提取器中简单插入适配器导致优化不稳定;2)缺乏对适配器容量(如秩)设置的原则性指导。 Method: 提出AdapterTune:在每个Transformer块中添加残差低秩瓶颈适配器,其上投影矩阵零初始化,确保微调起点严格等于预训练函数,避免早期表征漂移;从理论角度将适配器秩形式化为在特征空间中逼近下游任务偏移的‘容量预算’,并推导出过风险分解以刻画精度-秩关系。 Result: 在9个数据集、3种骨干模型尺度上验证:在核心5数据集迁移套件中,相比仅微调分类头,平均top-1准确率提升+14.9点,仅需全微调0.92%的可训练参数;在15组数据集-骨干组合中,10组超越全微调;在全部测试组合中均优于仅微调头;消融实验证实了秩、位置与初始化设计的有效性。 Conclusion: AdapterTune通过结构化初始化与理论驱动的容量设计,实现了稳定、高效、高性能的冻结主干迁移学习,为ViT适配器设计提供了新范式与实用工具。 Abstract: Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminates early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an ``elbow'' behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5 dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92 of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. 
The code is available at: https://github.com/salimkhazem/adaptertune
[354] Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
Gyeongsik Moon
Main category: cs.CV
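TL;DR: Hand4Whole++ is a modular framework that improves the accuracy and overall consistency of hand poses in 3D whole-body pose estimation by introducing the CHAM module to modulate the whole-body feature stream and by incorporating fine-grained finger poses from a pre-trained hand estimator.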
Details
Motivation: Existing whole-body pose estimators perform poorly on hand detail (especially fingers), while hand-only estimators lack body context, which creates a supervision gap. Method: Proposes the Hand4Whole++ framework: (1) a lightweight CHAM module uses features from a pre-trained hand estimator to modulate the whole-body feature stream and improve wrist orientation prediction; (2) the hand shapes and finger articulations output by the hand estimator are merged directly into the whole-body mesh via differentiable rigid alignment. Result: Hand pose accuracy improves substantially on multiple benchmarks and overall whole-body pose quality also improves, without retraining the whole-body model. Conclusion: Hand4Whole++ bridges the gap between whole-body and hand estimation, adding high-fidelity hand detail while keeping global consistency, and serves as an efficient plug-and-play improvement. Abstract: Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose Hand4Whole++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows Hand4Whole++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that Hand4Whole++ substantially improves hand accuracy and enhances overall full-body pose quality.
[355] Automated Diabetic Screening via Anterior Segment Ocular Imaging: A Deep Learning and Explainable AI Approach
Hasaan Maqsood,Saif Ur Rehman Khan,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim
Main category: cs.CV
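TL;DR: This paper presents a deep learning system for automated diabetes classification from anterior segment eye images captured with an ordinary camera. It identifies diabetic status from visible biomarkers in the iris, sclera, and conjunctiva, which makes it valuable in resource-limited settings.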
Details
Motivation: Conventional diabetic retinopathy screening relies on fundus photography, which needs specialized equipment and staff and is hard to deploy in primary care and resource-poor regions; a more accessible, low-cost alternative is needed. Method: Five mainstream deep learning architectures (EfficientNet-V2-S with SimCLR self-supervised learning, ViT, Swin Transformer, ConvNeXt-Base, and ResNet-50) are evaluated on 2,640 clinically annotated anterior segment images. The preprocessing pipeline combines specular reflection suppression and CLAHE, and self-supervised pretraining uses domain-specific ocular images. Result: EfficientNet-V2-S with SSL performs best, with an F1-score of 98.21%, precision of 97.90%, and recall of 98.55%, clearly ahead of ImageNet-only initialization (F1 94.63%); precision for the Normal class reaches 100%, which reduces unnecessary referrals. Conclusion: Anterior segment images combined with a lightweight, self-supervision-enhanced DL model can classify diabetic status with high accuracy, offering a feasible and reliable automated screening approach for primary care and resource-limited settings. Abstract: Diabetic retinopathy screening traditionally relies on fundus photography, requiring specialized equipment and expertise often unavailable in primary care and resource-limited settings. We developed and validated a deep learning (DL) system for automated diabetic classification using anterior segment ocular imaging, a readily accessible alternative utilizing standard photography equipment. The system leverages visible biomarkers in the iris, sclera, and conjunctiva that correlate with systemic diabetic status. We systematically evaluated five contemporary architectures (EfficientNet-V2-S with self-supervised learning (SSL), Vision Transformer, Swin Transformer, ConvNeXt-Base, and ResNet-50) on 2,640 clinically annotated anterior segment images spanning Normal, Controlled Diabetic, and Uncontrolled Diabetic categories. A tailored preprocessing pipeline combining specular reflection mitigation and contrast limited adaptive histogram equalization (CLAHE) was implemented to enhance subtle vascular and textural patterns critical for classification. SSL using SimCLR on domain-specific ocular images substantially improved model performance. EfficientNet-V2-S with SSL achieved optimal performance with an F1-score of 98.21%, precision of 97.90%, and recall of 98.55%, a substantial improvement over ImageNet-only initialization (94.63% F1).
Notably, the model attained near perfect precision (100%) for Normal classification, critical for minimizing unnecessary clinical referrals.
[356] A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
Yue Zhang,Liqiang Jing,Jia Li,Yapeng Tian,Xinya Du,Yunhui Guo,Vibhav Gogate
Main category: cs.CV
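TL;DR: This paper introduces the MVX-Bench multi-video understanding benchmark and the SAMA multi-video understanding framework to address the limitations of existing models in cross-video reasoning.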
Details
Motivation: Existing multi-video understanding methods suffer from training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination; meanwhile, current multi-video benchmarks focus on event-level comparison and neglect identity matching, fine-grained discrimination, and structured multi-step reasoning. Method: Builds the MVX-Bench benchmark (11 vision task types, 1,442 questions, 4,255 real videos) and proposes the SAMA framework, which combines visual tools, task-specific skills, and a conflict-aware verification mechanism to support iterative, structured reasoning. Result: SAMA clearly outperforms mainstream open-source baselines and GPT on MVX-Bench; ablations confirm the value of the skill design and the conflict resolution mechanism. Conclusion: Together, MVX-Bench and SAMA move multi-video understanding from generalizing single-video models toward genuine cross-video structured reasoning. Abstract: Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.
[357] Efficient Event Camera Volume System
Juan Camilo Soto,Ian Noronha,Saru Bharti,Upinder Kaur
Main category: cs.CV
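TL;DR: This paper proposes the EECVS framework, which models event streams as continuous-time Dirac impulse trains and combines density-driven adaptive selection among DCT, DTFT, and DWT transforms with domain-specific coefficient pruning to achieve artifact-free, low-latency, high-fidelity event camera data compression and reconstruction that generalizes across datasets.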
Details
Motivation: Event camera output is sparse and non-uniform, which makes it hard to fit into standard robotic pipelines; conventional methods based on temporal binning introduce temporal artifacts and cannot adapt to changes in event density. Method: Models the event stream as a continuous-time Dirac impulse train; proposes density-driven adaptive transform selection (DCT/DTFT/DWT); designs a dedicated sparse-coefficient pruning strategy for each transform domain; removes temporal binning by evaluating transforms directly at event timestamps; provides a real-time ROS2 implementation. Result: On the EHPT-XC and MVSEC datasets, DTFT reconstruction gives the lowest Earth Mover Distance; on the EventSAM segmentation task, mean IoU reaches 0.87 on MVSEC (far above the 0.44 of voxel grids); DCT runs in real time with 1.5 ms latency and 2.7x higher throughput. Conclusion: EECVS is the first adaptive event compression framework to combine computational efficiency, reconstruction fidelity, and cross-dataset generalization, and it markedly improves the usefulness of event data in downstream robotic vision tasks. Abstract: Event cameras promise low latency and high dynamic range, yet their sparse output challenges integration into standard robotic pipelines. We introduce EECVS (Efficient Event Camera Volume System), a novel framework that models event streams as continuous-time Dirac impulse trains, enabling artifact-free compression through direct transform evaluation at event timestamps. Our key innovation combines density-driven adaptive selection among DCT, DTFT, and DWT transforms with transform-specific coefficient pruning strategies tailored to each domain's sparsity characteristics. The framework eliminates temporal binning artifacts while automatically adapting compression strategies based on real-time event density analysis. On EHPT-XC and MVSEC datasets, our framework achieves superior reconstruction fidelity with DTFT delivering the lowest earth mover distance. In downstream segmentation tasks, EECVS demonstrates robust generalization. Notably, our approach demonstrates exceptional cross-dataset generalization: when evaluated with EventSAM segmentation, EECVS achieves mean IoU 0.87 on MVSEC versus 0.44 for voxel grids at 24 channels, while remaining competitive on EHPT-XC.
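As a rough illustration of evaluating a transform directly at event timestamps, the sketch below uses the standard identity that an impulse train x(t) = sum_k p_k delta(t - t_k) has Fourier transform X(w) = sum_k p_k exp(-j w t_k), so no temporal binning is needed. The function names, the frequency grid, the sample events, and the top-k pruning rule are illustrative assumptions and are not taken from the EECVS implementation.

```python
import numpy as np

def impulse_train_spectrum(timestamps, polarities, omegas):
    """Evaluate X(w) = sum_k p_k * exp(-1j * w * t_k) at each frequency in `omegas`."""
    t = np.asarray(timestamps, dtype=float)
    p = np.asarray(polarities, dtype=float)
    w = np.asarray(omegas, dtype=float)
    # Rows index frequencies, columns index events; no time grid is involved.
    return np.exp(-1j * np.outer(w, t)) @ p

def keep_largest(coeffs, k):
    """Hypothetical pruning rule: zero all but the k largest-magnitude coefficients."""
    pruned = np.zeros_like(coeffs)
    idx = np.argsort(np.abs(coeffs))[-k:]
    pruned[idx] = coeffs[idx]
    return pruned

# Made-up events at one pixel: irregular timestamps (seconds) and signed polarities.
timestamps = [0.0013, 0.0040, 0.0041, 0.0097]
polarities = [1, -1, 1, 1]
omegas = 2 * np.pi * 50.0 * np.arange(8)  # assumed frequency grid, 0 to 350 Hz

spectrum = impulse_train_spectrum(timestamps, polarities, omegas)
sparse_spectrum = keep_largest(spectrum, k=3)
```

The zero-frequency coefficient equals the sum of the polarities, which gives a quick sanity check on the evaluation.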
Our ROS2 implementation provides real-time deployment with DCT processing achieving 1.5 ms latency and 2.7X higher throughput than alternative transforms, establishing the first adaptive event compression framework that maintains both computational efficiency and superior generalization across diverse robotic scenarios.
[358] TrajMamba: An Ego-Motion-Guided Mamba Model for Pedestrian Trajectory Prediction from an Egocentric Perspective
Yusheng Peng,Gaofeng Zhang,Liping Zheng
Main category: cs.CV
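TL;DR: This paper proposes an ego-motion-guided trajectory prediction network based on the Mamba model for predicting future pedestrian trajectories from an egocentric perspective, and reaches state-of-the-art performance on the PIE and JAAD datasets.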
Details
Motivation: Predicting pedestrian trajectories from an egocentric perspective is difficult because of the complex, dynamic relative motion between the ego-vehicle and the pedestrian. Method: Two Mamba models separately encode pedestrian motion and ego-motion features; an ego-motion-guided Mamba decoder takes pedestrian motion features as historical context and ego-motion features as guiding cues to model their relative motion explicitly; the future trajectory is then generated from the decoded features. Result: State-of-the-art performance on the PIE and JAAD datasets. Conclusion: The ego-motion-guided Mamba architecture models the relative motion between pedestrian and ego-vehicle effectively and improves trajectory prediction accuracy. Abstract: Future trajectory prediction of a tracked pedestrian from an egocentric perspective is a key task in areas such as autonomous driving and robot navigation. The challenge of this task lies in the complex dynamic relative motion between the ego-camera and the tracked pedestrian. To address this challenge, we propose an ego-motion-guided trajectory prediction network based on the Mamba model. Firstly, two Mamba models are used as encoders to extract pedestrian motion and ego-motion features from pedestrian movement and ego-vehicle movement, respectively. Then, an ego-motion-guided Mamba decoder explicitly models the relative motion between the pedestrian and the vehicle by integrating pedestrian motion features as historical context with ego-motion features as guiding cues to obtain decoded features. Finally, the future trajectory is generated from the decoded features corresponding to the future timestamps. Extensive experiments demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on the PIE and JAAD datasets.
[359] PHAC: Promptable Human Amodal Completion
Seung Young Noh,Ju Yong Chang
Main category: cs.CV
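TL;DR: This paper introduces a new task, Promptable Human Amodal Completion (PHAC), in which simple prompts such as points and boxes control the completion of occluded humans. Prompts are encoded with ControlNet and generation uses a diffusion model, and an inpainting-based refinement module keeps visible regions consistent, which markedly improves physical plausibility and prompt alignment.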
Details
Motivation: Existing HAC models offer little user control and cannot reliably satisfy constraints such as a target pose; PGPIS methods support pose guidance but tend to lose instance appearance and are biased toward the training distribution. Method: Proposes the PHAC task, in which dedicated ControlNet modules encode point and box prompts and inject them into a pre-trained diffusion model, with only the cross-attention blocks fine-tuned; an inpainting-based refinement module starts from a slightly noised coarse completion, preserves the visible regions, and smooths occlusion boundaries. Result: Experiments on HAC and PGPIS benchmarks show that the method produces more physically plausible, higher-quality results and that its prompt alignment is clearly better than that of existing HAC and PGPIS methods. Conclusion: PHAC unifies controllability and fidelity, providing a more practical, flexible, and high-quality generation framework for human amodal completion. Abstract: Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model until they obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries.
Extensive experiments on the HAC and PGPIS benchmarks show that our approach yields more physically plausible and higher-quality completions, while significantly improving prompt alignment compared with existing amodal completion and pose-guided synthesis methods.
[360] Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization
Cailing Han,Zhangbin Li,Jinxing Zhou,Wei Qian,Jingjing Hu,Yanghao Zhou,Zhangling Duan,Dan Guo
Main category: cs.CV
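TL;DR: This paper proposes the FSENet framework, which uses facial features to guide sentiment localization. Its Face-guided Sentiment Discovery module, Point-aware Sentiment Semantics Contrast strategy, and Boundary-aware Sentiment Pseudo-label Generation method improve the accuracy and generalization of temporal sentiment localization in videos under point-level weak supervision.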
Details
Motivation: To address imprecise sentiment boundaries in point-level weakly-supervised sentiment localization while reducing the cost of frame-level annotation. Method: Proposes the FSENet framework with three core components: (1) the Face-guided Sentiment Discovery (FSD) module, which fuses facial features into multimodal interaction through dual-branch modeling; (2) the Point-aware Sentiment Semantics Contrast (PSSC) strategy, which uses contrastive learning to distinguish the sentiment semantics of candidate points; (3) the Boundary-aware Sentiment Pseudo-label Generation (BSPG) method, which converts sparse point annotations into temporally smooth pseudo-labels. Result: State-of-the-art performance on the benchmark dataset under full supervision, video-level weak supervision, and point-level weak supervision, confirming strong generalization. Conclusion: FSENet improves the accuracy and robustness of temporal sentiment localization under point-level weak supervision, with particular strength in boundary recognition and generalization across annotation settings. Abstract: Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the costly frame-level labeling. To further tackle the challenges of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (FSENet), a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach first introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling for effective sentiment stimuli clues. We then propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy to discriminate sentiment semantics of candidate points (frame-level) near annotation points via contrastive learning, thereby enhancing the model's ability to recognize sentiment boundaries. Finally, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, achieving state-of-the-art performance under full supervision, video-level, and point-level weak supervision, thereby showcasing the strong generalization ability of our FSENet across different annotation settings.
[361] Topology-Preserving Data Augmentation for Ring-Type Polygon Annotations
Sudip Laudari,Sang Hun Baek
Main category: cs.CV
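TL;DR: This paper proposes an order-preserving polygon data augmentation strategy that applies transformations in mask space and projects the result back into index space to repair the connectivity of ring-type polygons broken by clipping, thereby maintaining topological consistency.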
Details
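Motivation: Existing geometric data augmentation methods assume that polygon annotations are simply connected regions, but in structured domains such as architectural floorplans, ring-type regions are often encoded as a single cyclic polygon chain covering both outer and inner boundaries, and the clipping step in conventional augmentation easily breaks this cyclic connectivity and the structural relationship. Method: Proposes an order-preserving polygon augmentation strategy: geometric transformations are applied in mask space, then the surviving vertices are projected back into index space to restore adjacency relations, preserving the original traversal order and topological consistency. Result: Experiments show that the method reliably restores connectivity and achieves near-perfect Cyclic Adjacency Preservation (CAP) under both single and compound augmentations. Conclusion: At low computational cost, the strategy resolves the topological distortion of ring-type polygons during data augmentation on structured images and improves the robustness of segmentation models to complex geometry. Abstract: Geometric data augmentation is widely used in segmentation pipelines and typically assumes that polygon annotations represent simply connected regions. However, in structured domains such as architectural floorplan analysis, ring-type regions are often encoded as a single cyclic polygon chain connecting outer and inner boundaries. During augmentation, clipping operations may remove intermediate vertices and disrupt this cyclic connectivity, breaking the structural relationship between the boundaries. In this work, we introduce an order-preserving polygon augmentation strategy that performs transformations in mask space and then projects surviving vertices back into index space to restore adjacency relations. This repair maintains the original traversal order of the polygon and preserves topological consistency with minimal computational overhead. Experiments demonstrate that the approach reliably restores connectivity, achieving near-perfect Cyclic Adjacency Preservation (CAP) across both single and compound augmentations.
[362] SSR: A Training-Free Approach for Streaming 3D Reconstruction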
Hui Deng,Yuxin Mao,Yuxin He,Yuchao Dai
Main category: cs.CV
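TL;DR: This paper proposes Self-expressive Sequence Regularization (SSR), a training-free, plug-and-play method that models state evolution in streaming 3D reconstruction from a Grassmannian manifold perspective. It builds an affinity matrix from the self-expressive property and regularizes state updates at inference time, which suppresses geometric drift and improves reconstruction quality.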
Details
Motivation: Streaming 3D reconstruction requires long-horizon state updates under strict latency constraints, but conventional recurrent models tend to drift geometrically as errors accumulate. Method: Proposes the SSR operator, which treats the latent state as a point on a Grassmannian manifold, computes an affinity matrix analytically from the self-expressive property of a window of historical states, and uses it to regularize the current state update so that it returns to a manifold-consistent trajectory. Result: On multiple long-sequence benchmarks, SSR clearly reduces drift and improves reconstruction quality at very low computational cost. Conclusion: SSR is a general, lightweight, training-free state regularization mechanism that offers a new manifold-consistency modeling approach for streaming geometric perception systems. Abstract: Streaming 3D reconstruction demands long-horizon state updates under strict latency constraints, yet stateful recurrent models often suffer from geometric drift as errors accumulate over time. We revisit this problem from a Grassmannian manifold perspective: the latent persistent state can be viewed as a subspace representation, i.e., a point evolving on a Grassmannian manifold, where temporal coherence implies the state trajectory should remain on (or near) this manifold. Based on this view, we propose Self-expressive Sequence Regularization (SSR), a plug-and-play, training-free operator that enforces Grassmannian sequence regularity during inference. Given a window of historical states, SSR computes an analytical affinity matrix via the self-expressive property and uses it to regularize the current update, effectively pulling noisy predictions back toward the manifold-consistent trajectory with minimal overhead. Experiments on long-sequence benchmarks demonstrate that SSR consistently reduces drift and improves reconstruction quality across multiple streaming 3D reconstruction tasks.
[363] AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas
Longhui Yuan
Main category: cs.CV
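TL;DR: AnyPhoto is a diffusion-transformer fine-tuning framework for multi-identity-preserving generation. Through a location-aligned canvas, identity-adaptive modulation, and identity-isolated attention, it reduces the copy-paste tendency under strong conditioning and improves both identity fidelity and text-prompt controllability.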
Details
Motivation: In multi-identity image generation, strong identity and layout constraints push the model toward a copy-paste shortcut, which weakens prompt-driven controllability. Method: Proposes the AnyPhoto framework: (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding; (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection; (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss and adds reference-face replacement and location-canvas degradation to discourage shortcut learning. Result: On MultiID-Bench, AnyPhoto clearly improves identity similarity and reduces the copy-paste tendency, with gains that grow as the number of identities increases; it also supports prompt-driven stylization and accurate placement. Conclusion: AnyPhoto achieves a better balance between identity fidelity and text controllability in multi-identity controllable generation and has good application potential. Abstract: Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity/layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer finetuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains increasing as the number of identities grows. AnyPhoto also supports prompt-driven stylization with accurate placement, showing great potential application value.
[364] Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
Joohyun Kwon,Geonhee Sim,Gyeongsik Moon
Main category: cs.CV
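TL;DR: DynaAvatar is a zero-shot method for reconstructing 3D human avatars from a single image. It uses a Transformer and a static-to-dynamic knowledge transfer strategy, together with the DynaFlow loss and SMPL-X re-annotation, to model motion-dependent cloth dynamics.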
Details
Motivation: Existing single-image 3D human avatar methods rely on rigid joint transformations and struggle to model realistic cloth dynamics; in addition, dynamic capture data is scarce and SMPL-X fittings are of poor quality. Method: Proposes a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations; transfers knowledge from static to dynamic data through static pretraining followed by LoRA fine-tuning; designs the optical-flow-guided DynaFlow loss; re-annotates the SMPL-X fittings in dynamic datasets. Result: Outperforms prior methods in visual richness and generalization and produces high-quality, motion-dependent cloth dynamics in animation. Conclusion: Working zero-shot and without subject-specific optimization, DynaAvatar addresses realistic cloth dynamics modeling from a single image and improves the dynamic expressiveness and practicality of 3D human avatars. Abstract: Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow-guided objective that provides reliable motion-direction geometric cues for cloth dynamics in rendered space. Finally, we reannotate the missing or noisy SMPL-X fittings in existing dynamic capture datasets, as most public dynamic capture datasets contain incomplete or unreliable fittings that are unsuitable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods.
[365] High-Fidelity 3D Facial Avatar Synthesis with Controllable Fine-Grained Expressions
Yikang He,Jichao Zhang,Wei Wang,Nicu Sebe,Yao Zhao
Main category: cs.CV
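TL;DR: This paper proposes a dual-mapper framework that combines a 3D-aware GAN with a 3DMM and uses text-guided optimization to edit fine-grained facial expressions precisely.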
Details
Motivation: Existing 2D-based and 3D-based methods fall short in fine expression control and in particular struggle to deliver fine-grained, interpretable expression editing. Method: Proposes a Dual Mappers module (Texture Mapper and Emotion Mapper) that refines the texture latent code of a pretrained 3D-aware GAN and the expression code of a 3DMM, respectively; introduces Text-Guided Optimization, which combines CLIP text embeddings with a SubSpace Projection mechanism that projects text prompts into the expression subspace for more precise control. Result: Experiments show that, from a single 2D face image, the method produces high-quality, view-consistent 3D renderings with precisely controllable expressions and outperforms existing methods. Conclusion: The method combines generative modeling with interpretable 3D deformation control and offers a new approach to text-driven fine-grained facial expression editing. Abstract: Facial expression editing methods can be mainly categorized into two types based on their architectures: 2D-based and 3D-based methods. The former lacks 3D face modeling capabilities, making it difficult to edit 3D factors effectively. The latter has demonstrated superior performance in generating high-quality and view-consistent renderings using single-view 2D face images. Although these methods have successfully used animatable models to control facial expressions, they still have limitations in achieving precise control over fine-grained expressions. To address this issue, in this paper, we propose a novel approach by simultaneously refining both the latent code of a pretrained 3D-Aware GAN model for texture editing and the expression code of the driven 3DMM model for mesh editing. Specifically, we introduce a Dual Mappers module, comprising Texture Mapper and Emotion Mapper, to learn the transformations of the given latent code for textures and the expression code for meshes, respectively. To optimize the Dual Mappers, we propose a Text-Guided Optimization method, leveraging a CLIP-based objective function with expression text prompts as targets, while integrating a SubSpace Projection mechanism to project the text embedding to the expression subspace such that we can have more precise control over fine-grained expressions. Extensive experiments and comparative analyses demonstrate the effectiveness and superiority of our proposed method.
[366] Mind-of-Director: Multi-modal Agent-Driven Film Previsualization via Collaborative Decision-Making
Shufeng Nan,Mengtian Li,Sixiao Zheng,Yuwei Lu,Han Zhang,Yanwei Fu
Main category: cs.CV
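TL;DR: Mind-of-Director is a multi-modal agent framework for film previsualization (previz). By simulating the collaborative decision-making of a director and production team, it automatically turns a creative idea into a high-quality 3D previz sequence in about 25 minutes and supports real-time interactive editing inside the game engine.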
Details
Motivation: To lower the barrier to film previsualization, speed up the path from idea to visualization, and support automated prototyping and human-AI co-creation. Method: Builds a multi-agent framework with four cooperating modules (script development, virtual scene design, character behaviour control, and camera planning) and integrates a game-engine-based real-time visual editing system. Result: Generates semantically accurate, high-quality film previz sequences in about 25 minutes; experiments and human evaluations confirm its effectiveness and practicality. Conclusion: Multi-agent collaborative modeling can reproduce the complex decision flow of film production and offers a new approach to AI-driven creative content production. Abstract: We present Mind-of-Director, a multi-modal agent-driven framework for film previz that models the collaborative decision-making process of a film production team. Given a creative idea, Mind-of-Director orchestrates multiple specialized agents to produce previz sequences within the game engine. The framework consists of four cooperative modules: Script Development, where agents draft and refine the screenplay iteratively; Virtual Scene Design, which transforms text into semantically aligned 3D environments; Character Behaviour Control, which determines character blocking and motion; and Camera Planning, which optimizes framing, movement, and composition for cinematic camera effects. A real-time visual editing system built in the game engine further enables interactive inspection and synchronized timeline adjustment across scenes, behaviours, and cameras. Extensive experiments and human evaluations show that Mind-of-Director generates high-quality, semantically grounded previz sequences in approximately 25 minutes per idea, demonstrating the effectiveness of agent collaboration for both automated prototyping and human-in-the-loop filmmaking.
[367] Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling
Ernie Chu,Vishal M. Patel
Main category: cs.CV
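TL;DR: This paper introduces F2F-JF, a new audio-visual dataset for modeling the reactive tempo of human conversation. Building on it, the authors design a digital avatar generation task driven by cross-person visual context and show gains in emotional fidelity and video quality.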
Details
Motivation: Most existing audio-visual datasets contain single-speaker monologues, which makes it hard to model the temporal dependency and reactive tempo between speakers in real conversation. Method: Builds F2F-JF, a 70-hour, 14k-clip dataset of two-person talk-show exchanges; uses a semi-automatic pipeline of multi-person tracking, speech diarization, and lightweight human verification to extract aligned host/guest audio-visual clips; on top of this, designs a diffusion model task that generates the host video conditioned on the guest's preceding video. Result: Small but consistent gains on the Emotion-FID and FVD metrics while maintaining good lip-sync quality. Conclusion: The F2F-JF dataset, together with its preprocessing pipeline and baseline model, provides an end-to-end blueprint for studying two-way, sequential interpersonal interaction. Abstract: Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
[368] Global Truncated Loss Minimization for Robust and Threshold-Resilient Geometric Estimation
Tianyu Huang,Liangzu Peng,Xinyue Zhang,Tongfan Guan,Jinhu Dong,Haoang Li,Laurent Kneip,Yun-Hui Liu
Main category: cs.CV
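TL;DR: This paper proposes GTM, the first framework for globally optimal minimization of truncated losses (TL) based on branch-and-bound (BnB). Its hybrid solver design (BnB over n−1 dimensions plus Lipschitz bound optimization over 1 dimension) improves robustness, threshold resilience, and computational efficiency.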
Details
Motivation: Consensus maximization (CM) is robust but sensitive to the threshold, and its discrete nature leads to loose bounds; truncated losses (TL) make better use of residual information, but global BnB optimization of TL has not been studied systematically. Method: Proposes GTM: for an n-dimensional problem, BnB search runs over an (n−1)-dimensional subspace, while the remaining 1D variable is handled by constructing tight, Lipschitz-continuous bounds on the objective and solving them with DIRECT; this hybrid design shrinks the search space and speeds up convergence. Result: On tasks such as robust linear regression, GTM shows strong threshold resilience and the highest computational efficiency; across several geometric estimation problems with different residual forms, it reaches state-of-the-art outlier robustness, threshold resilience, and efficiency. Conclusion: GTM is the first unified, efficient, and scalable BnB framework that applies global TL optimization to a wide range of geometric estimation problems, clearly outperforming existing CM and TL methods. Abstract: To achieve outlier-robust geometric estimation, robust objective functions are generally employed to mitigate the influence of outliers. The widely used consensus maximization (CM) is highly robust when paired with global branch-and-bound (BnB) search. However, CM relies solely on inlier counts and is sensitive to the inlier threshold. Besides, the discrete nature of CM leads to loose bounds, necessitating extensive BnB iterations and computation cost. Truncated losses (TL), a continuous alternative, leverage residual information more effectively and could potentially overcome these issues. But to our knowledge, no prior work has systematically explored globally minimizing TL with BnB and its potential for enhanced threshold resilience or search efficiency. In this work, we propose GTM, the first unified BnB-based framework for globally-optimal TL loss minimization across diverse geometric problems. GTM involves a hybrid solving design: given an n-dimensional problem, it performs BnB search over an (n-1)-dimensional subspace while the remaining 1D variable is solved by bounding the objective function. Our hybrid design not only reduces the search space, but also enables us to derive Lipschitz-continuous bounding functions that are general, tight, and can be efficiently solved by a classic global Lipschitz solver named DIRECT, which brings further acceleration. We conduct a systematic evaluation on various BnB-based methods for CM and TL on the robust linear regression problem, showing that GTM enjoys remarkable threshold resilience and the highest efficiency compared to baseline methods.
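The contrast between counting inliers and using residual magnitudes can be made concrete with a small NumPy sketch. It assumes the common truncated least-squares form, sum of min(r_i^2, tau^2); the paper's exact loss, the toy data, and the function names here are assumptions for illustration, and no branch-and-bound search is performed.

```python
import numpy as np

def consensus_count(residuals, threshold):
    """CM objective (to maximize): number of residuals within the inlier threshold."""
    return int(np.sum(np.abs(residuals) <= threshold))

def truncated_loss(residuals, threshold):
    """Assumed TL objective (to minimize): squared residuals capped at threshold**2."""
    return float(np.sum(np.minimum(residuals ** 2, threshold ** 2)))

# Toy 1D regression y ~ slope * x with one gross outlier (the last point).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 2.1, 3.9, 6.0, 30.0])
threshold = 0.5

def residuals_for(slope):
    return y - slope * x

for slope in (2.0, 2.1):
    r = residuals_for(slope)
    print(slope, consensus_count(r, threshold), round(truncated_loss(r, threshold), 4))
```

Both candidate slopes have the same consensus count of four inliers, so CM cannot rank them, whereas the truncated loss prefers slope 2.0 (about 0.28 against 0.44) because it still sees how well the inliers fit. In both objectives the outlier contributes a bounded amount.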
Furthermore, we apply GTM on different geometric estimation problems with diverse residual forms. Extensive experiments demonstrate that GTM achieves state-of-the-art outlier-robustness and threshold-resilience while maintaining high efficiency across these estimation tasks.
[369] HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System
Kailin Lyu,Kangyi Wu,Pengna Li,Xiuyu Hu,Qingyi Si,Cui Miao,Ning Yang,Zihang Wang,Long Xiao,Lianyu Hu,Jingyuan Sun,Ce Hao
Main category: cs.CV
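TL;DR: This paper proposes HiMemVLN, which introduces a hierarchical memory system to close the performance gap that "navigation amnesia" causes for open-source large models on vision-and-language navigation (VLN), and substantially improves their zero-shot navigation ability.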
Details
Motivation: Most existing LLM-based vision-and-language navigation methods depend on closed-source models, which bring high token costs and data leakage risks; open-source models have been tried but perform far worse, and the core problem is "navigation amnesia". Method: Proposes the HiMemVLN framework, which embeds a hierarchical memory system in a multimodal large model to strengthen visual perception recall and long-term localization and thereby mitigate navigation amnesia. Result: Experiments in simulated and real-world environments show that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. Conclusion: A hierarchical memory mechanism can mitigate navigation amnesia in open-source multimodal large models on VLN tasks, clearly narrows the gap with closed-source models, and points toward efficient and secure zero-shot navigation. Abstract: LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, most zero-shot methods primarily rely on closed-source LLMs as navigators, which face challenges related to high token costs and potential data leakage risks. Recent efforts have attempted to address this by using open-source LLMs combined with a spatiotemporal CoT framework, but they still fall far short compared to closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and amplifies the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.
[370] M2IR: Proactive All-in-One Image Restoration via Mamba-style Modulation and Mixture-of-Experts
Shiwei Wang,Yongzhen Wang,Bingwen Hu,Liyan Zhang,Xiao-Ping Zhang,Mingqiang Wei
Main category: cs.CV
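TL;DR: This paper proposes the M2IR framework, which uses Mamba-Style Transformer (MST) blocks to proactively regulate degradation propagation during encoding and an Adaptive Degradation Expert Collaboration (ADEC) module to efficiently remove residual degradation during decoding. This shifts restoration from passive reaction to active control and improves image restoration quality and generalization.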
Details
Motivation: Existing Transformer-based image restoration models are reactive and cannot actively suppress degraded signals, which interferes with feature learning, raises model complexity, and limits adaptability. Method: Proposes the M2IR framework with two core components: (1) the Mamba-Style Transformer (MST) block, which applies pixel-wise selective state modulation during encoding to mitigate degradation; (2) the Adaptive Degradation Expert Collaboration (ADEC) module, which during decoding uses a DA-CLIP-driven router to dispatch degradation-specific experts and combines them with a shared expert to remove residual degradation cooperatively. Result: M2IR delivers better generalization, stronger adaptability, and finer detail recovery on multiple all-in-one image restoration benchmarks. Conclusion: By adding active degradation regulation in encoding and cooperative removal in decoding, M2IR clearly improves image restoration performance and marks a shift from passive reaction to active control. Abstract: While Transformer-based architectures have dominated recent advances in all-in-one image restoration, they remain fundamentally reactive: propagating degradations rather than proactively suppressing them. In the absence of explicit suppression mechanisms, degraded signals interfere with feature learning, compelling the decoder to balance artifact removal and detail preservation, thereby increasing model complexity and limiting adaptability. To address these challenges, we propose M2IR, a novel restoration framework that proactively regulates degradation propagation during the encoding stage and efficiently eliminates residual degradations during decoding. Specifically, the Mamba-Style Transformer (MST) block performs pixel-wise selective state modulation to mitigate degradations while preserving structural integrity. In parallel, the Adaptive Degradation Expert Collaboration (ADEC) module utilizes degradation-specific experts guided by a DA-CLIP-driven router and complemented by a shared expert to eliminate residual degradations through targeted and cooperative restoration. By integrating the MST block and ADEC module, M2IR transitions from passive reaction to active degradation control, effectively harnessing learned representations to achieve superior generalization, enhanced adaptability, and refined recovery of fine-grained details across diverse all-in-one image restoration benchmarks. Our source codes are available at https://github.com/Im34v/M2IR.
[371] RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models
Ravi Ranjan,Utkarsh Grover,Xiaomin Lin,Agoritsa Polyzou
Main category: cs.CV
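TL;DR: This paper proposes RAZOR, a lightweight, model-agnostic unlearning method for transformer models that uses coordinated multi-layer, multi-head edits to remove sensitive information efficiently, precisely, and stably while preserving model performance.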
Details
Motivation: Transformer-based diffusion and vision-language models face a central safety and compliance challenge: efficiently removing sensitive or undesirable information without retraining.
Method: RAZOR measures how much each layer and attention head contributes to forgetting the target data, identifies the key components, and applies a regularized update rule to perform gradual, coordinated multi-layer and multi-head edits.
Result: On CLIP, Stable Diffusion, and VLMs, RAZOR achieves accurate and stable forgetting on identity, style, and object erasure tasks, remains effective under quantization, and offers stronger retention, better efficiency, and significantly higher speed than conventional methods.
Conclusion: RAZOR is a practical, scalable solution for safe, adaptive unlearning in transformer-based vision models.
Abstract: Transformer-based diffusion and vision-language models have achieved remarkable success; yet, efficiently removing undesirable or sensitive information without retraining remains a central challenge for model safety and compliance. We introduce Ratio-Aware Zero/One-step Optimized Retentive unlearning (RAZOR), a lightweight, model-agnostic unlearning framework that generalizes forgetting updates to coordinated multi-layer and multi-head edits within transformer backbones. RAZOR identifies the most important layers and attention heads by measuring how much they contribute to forgetting the target data while preserving useful knowledge. Then, it updates these parts of the model using a carefully regularized rule to avoid harming overall performance. The set of edited components grows gradually, ensuring precise unlearning without over-editing or damaging unrelated capabilities. We evaluate RAZOR on CLIP, Stable Diffusion, and vision-language models (VLMs) using widely adopted unlearning benchmarks covering identity, style, and object erasure tasks. Our results show that RAZOR achieves highly accurate and stable forgetting, even under quantization. This approach offers stronger retention and better efficiency than prior methods. Notably, it also operates significantly faster than conventional techniques. These results demonstrate that RAZOR is a practical and scalable solution for safe, adaptive unlearning in transformer-based vision models.
[372] RadarXFormer: Robust Object Detection via Cross-Dimension Fusion of 4D Radar Spectra and Images for Autonomous Driving
Yue Sun,Yeqiang Qian,Zhe Wang,Tianhui Li,Chunxiang Wang,Ming Yang
Main category: cs.CV
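TL;DR: This paper proposes RadarXFormer, a cross-modal fusion framework for 3D object detection based on 4D mmWave radar spectra and RGB images. It builds an efficient 3D representation directly from raw radar spectra and fuses multi-scale 3D radar feature cubes with 2D image feature maps, improving detection accuracy and robustness under adverse conditions on the K-Radar dataset while maintaining real-time inference.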
Details
Motivation: Camera- and LiDAR-based perception degrades under adverse weather and lighting, which limits large-scale deployment in intelligent transportation systems. Conventional radar measurements are sparse and lack height resolution, while emerging 4D radar provides elevation information but suffers from signal noise and large data volume.
Method: RadarXFormer drops sparse radar point clouds and works directly on raw 4D radar spectra, building an efficient 3D representation that is compressed yet preserves complete 3D spatial information. A cross-dimension (3D-2D) fusion mechanism fuses multi-scale 3D spherical radar feature cubes with complementary 2D image feature maps.
Result: Experiments on the K-Radar dataset show improved detection accuracy and robustness in challenging scenarios while maintaining real-time inference.
Conclusion: RadarXFormer offers a new paradigm for radar-vision fusion perception that balances environmental robustness, semantic richness, and computational efficiency, supporting reliable deployment of autonomous driving systems in complex real-world traffic.
Abstract: Reliable perception is essential for autonomous driving systems to operate safely under diverse real-world traffic conditions. However, camera- and LiDAR-based perception systems suffer from performance degradation under adverse weather and lighting conditions, limiting their robustness and large-scale deployment in intelligent transportation systems. Radar-vision fusion provides a promising alternative by combining the environmental robustness and cost efficiency of millimeter-wave (mmWave) radar with the rich semantic information captured by cameras. Nevertheless, conventional 3D radar measurements lack height resolution and remain highly sparse, while emerging 4D mmWave radar introduces elevation information but also brings challenges such as signal noise and large data volume. To address these issues, this paper proposes RadarXFormer, a 3D object detection framework that enables efficient cross-modal fusion between 4D radar spectra and RGB images. Instead of relying on sparse radar point clouds, RadarXFormer directly leverages raw radar spectra and constructs an efficient 3D representation that reduces data volume while preserving complete 3D spatial information. The "X" highlights the proposed cross-dimension (3D-2D) fusion mechanism, in which multi-scale 3D spherical radar feature cubes are fused with complementary 2D image feature maps. Experiments on the K-Radar dataset demonstrate improved detection accuracy and robustness under challenging conditions while maintaining real-time inference capability.
[373] Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection
Yewon Han,Yumin Seol,EunGyung Kong,Minsoo Jo,Taesup Kim
Main category: cs.CV
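TL;DR: This paper proposes "Two Birds, One Projection", an efficient inference-time defence that projects cross-modal features onto the null space of an identified bias direction, improving both the safety and the utility of large vision-language models and breaking the conventional safety-utility tradeoff.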
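The null-space projection named in this entry can be illustrated with a few lines of linear algebra. The sketch below is not the authors' implementation: it assumes a single bias direction and a plain feature vector, and `bias_direction` and `feature` are made-up values. Projecting onto the null space of a direction d amounts to subtracting the component of the feature along d.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(feature, direction):
    """Return `feature` with its component along `direction` removed."""
    norm = math.sqrt(dot(direction, direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = dot(feature, unit)  # length of the component along the direction
    return [f - coeff * u for f, u in zip(feature, unit)]


# Hypothetical values, chosen only to make the arithmetic easy to follow.
bias_direction = [1.0, 1.0, 0.0]
feature = [3.0, 1.0, 2.0]
cleaned = project_out(feature, bias_direction)
print(cleaned)
```

After the projection, `cleaned` is orthogonal to `bias_direction`, while the part of the feature unrelated to that direction is left untouched.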
Details
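Motivation: Existing jailbreak defence frameworks for large vision-language models (LVLMs) often face a safety-utility tradeoff, in which strengthening safety harms performance on general visual reasoning tasks. The authors ask whether this tradeoff is inherent, and identify a modality-induced bias direction that appears consistently across datasets, arises from suboptimal coupling between the LLM backbone and the visual encoder, and harms both safety and utility.
Method: At inference time, "Two Birds, One Projection" projects cross-modal features onto the null space of the identified modality-induced bias direction to remove the corresponding component. The method needs only a single forward pass and is computationally efficient.
Result: Across multiple benchmarks the method improves both safety (resistance to jailbreak attacks) and utility (vision-language reasoning performance), breaking the conventional safety-utility tradeoff.
Conclusion: Safety and utility are not inherently antagonistic objectives. Identifying and removing a specific modality bias direction improves both together, and the method offers a new approach to jailbreak defence for LVLMs.
Abstract: Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety-utility tradeoff, where strengthening safety inadvertently degrades performance on general visual-grounded reasoning tasks. In this work, we investigate whether safety and utility are inherently antagonistic objectives. We focus on a modality-induced bias direction consistently observed across datasets, which arises from suboptimal coupling between the Large Language Model backbone and visual encoders. We further demonstrate that this direction undermines performance on both tasks. Leveraging this insight, we propose Two Birds, One Projection, an efficient inference-time jailbreak defence that projects cross-modal features onto the null space of the identified bias direction to remove the corresponding components. Requiring only a single forward pass, our method effectively breaks the conventional tradeoff, simultaneously improving both safety and utility across diverse benchmarks.
[374] SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space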
Zejian Kang,Kai Zheng,Yuanchen Fei,Wentao Yang,Hongyuan Zou,Xiangru Huang
Main category: cs.CV
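TL;DR: This paper proposes SemanticFace, a framework that estimates facial actions in the interpretable ARKit blendshape space through a two-stage semantic distillation paradigm, recasting coefficient prediction as structured semantic reasoning and significantly improving accuracy, perceptual consistency, and cross-identity generalization.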
Details
Motivation: Existing facial action estimation methods usually predict parameters in compact expression spaces that lack explicit semantic interpretability, whereas practical applications such as avatar control and human-computer interaction need interpretable facial actions that correspond to real muscle movements.
Method: SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients, then distills this knowledge into a multimodal large language model that predicts interpretable facial action coefficients from a single image.
Result: Experiments show that language-aligned semantic supervision improves coefficient accuracy and perceptual consistency, with strong cross-identity generalization and robustness to large domain shifts such as cartoon faces.
Conclusion: SemanticFace models facial action estimation as a structured semantic reasoning task, keeping accuracy high while markedly improving interpretability and generalization, and offers a new paradigm for face modeling in interactive applications.
Abstract: Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose SemanticFace, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.
[375] Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis
Tuan-Anh Yang,Bao V. Q. Bui,Chanh-Quang Vo-Van,Truong-Son Hy
Main category: cs.CV
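TL;DR: This paper proposes a deep learning framework that combines 2.5D and 3D representations to detect COVID-19 and classify diseases from chest CT scans. The 2.5D branch uses DINOv3 to extract features from multi-view slices, the 3D branch models volumetric information with a ResNet-18 pretrained using VREx and contrastive learning, and logit-level ensembling improves performance, yielding strong results on the PHAROS-AIF-MIH benchmark.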
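Logit-level ensembling, as mentioned in this entry, means combining the raw class scores of the two branches before applying softmax. The sketch below is only an illustration under assumptions: the equal weighting (`weight_a=0.5`) and the logit values are hypothetical, since the entry does not state how the two branches are weighted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(logits_a, logits_b, weight_a=0.5):
    """Weighted average of two logit vectors (the weighting is an assumption)."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both branches must score the same classes")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(logits_a, logits_b)]


# Made-up logits for a binary task: index 0 = non-COVID, index 1 = COVID.
logits_25d = [2.0, 0.5]  # stand-in for the 2.5D branch output
logits_3d = [0.2, 1.1]   # stand-in for the 3D branch output

fused = ensemble_logits(logits_25d, logits_3d)
probs = softmax(fused)
prediction = max(range(len(probs)), key=probs.__getitem__)
print(fused, probs, prediction)
```

In this toy case the two branches disagree, and the fused logits follow the branch with the larger margin.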
Details
Motivation: Improve robustness across data sources by exploiting the complementary slice-level and volumetric information in CT scans for COVID-19 detection and multi-class disease classification.
Method: A two-branch network: the 2.5D branch processes axial, coronal, and sagittal CT slices with DINOv3; the 3D branch models volumetric context with a ResNet-18 pretrained with VREx and then refined with supervised contrastive learning; the two branches' predictions are fused by logit-level ensembling.
Result: On the PHAROS-AIF-MIH benchmark, the ensemble reaches 94.48% accuracy and a 0.9426 Macro F1 for binary COVID-19 detection. For multi-class classification, the 2.5D DINOv3 model performs best, with 79.35% accuracy and a 0.7497 Macro F1.
Conclusion: Combining pretrained slice-based representations with volumetric modeling significantly improves the robustness and accuracy of multi-source medical imaging analysis.
Abstract: We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at https://github.com/HySonLab/PHAROS-AIF-MIH.
[376] DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery
Yifan Yang,Lei Zou,Wenjing Gong,Kani Fu,Zongrong Li,Siqin Wang,Bing Zhou,Heng Cai,Hao Tian
Main category: cs.CV
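TL;DR: This study proposes DamageArbiter, a CLIP-based, multimodal, disagreement-driven arbitration framework that improves the accuracy, interpretability, and robustness of damage assessment from street-view imagery. On 2,556 post-disaster street-view images it raises accuracy from 74.33% to 82.79% and substantially reduces overconfident errors made by visual models in ambiguous or cluttered scenes.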
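The arbitration idea in this entry (keep predictions on which the models agree, and let a small logistic-regression meta-classifier decide when they disagree) can be sketched as follows. This is a hypothetical illustration, not the authors' pipeline: the input features (the two confidence scores), the weights, and the bias are all invented, and a real meta-classifier would be fitted on held-out data.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def arbitrate(image_pred, clip_pred, image_conf, clip_conf, weights, bias):
    """Pick a label when an image-only model and a CLIP-based model disagree.

    When both models agree, the shared label is returned unchanged.
    Otherwise a logistic score over the two confidences decides which
    model to trust (the feature choice is an assumption of this sketch).
    """
    if image_pred == clip_pred:
        return image_pred
    z = weights[0] * image_conf + weights[1] * clip_conf + bias
    return clip_pred if sigmoid(z) >= 0.5 else image_pred


# Hypothetical meta-classifier parameters.
weights = (-2.0, 2.0)
bias = 0.0

agree = arbitrate("severe", "severe", 0.70, 0.60, weights, bias)
trust_image = arbitrate("minor", "severe", 0.90, 0.60, weights, bias)
trust_clip = arbitrate("minor", "severe", 0.55, 0.80, weights, bias)
print(agree, trust_image, trust_clip)
```

The reported gain is consistent with the stated figures: 82.79 - 74.33 = 8.46 percentage points over the image-only ViT-B/32 baseline.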
Details
Motivation: Conventional models for disaster assessment from street-view imagery lack interpretability and reliability, and struggle to provide the trustworthy, fine-grained assessments that emergency response requires.
Method: DamageArbiter fuses predictions from unimodal (image-only or text-only) and multimodal (CLIP) models and uses a lightweight logistic-regression meta-classifier to arbitrate disagreements. Text descriptions written by humans and generated by large language models are both used in a systematic comparison.
Result: DamageArbiter reaches 82.79% accuracy, 8.46 percentage points above the strongest baseline. It reduces overconfident misjudgments by visual models in ambiguous or cluttered scenes, and supports geographic mapping to analyze spatial differences in model performance.
Conclusion: DamageArbiter moves street-view disaster assessment from coarse severity classification toward a more reliable and interpretable paradigm, providing more trustworthy technical support for emergency response.
Abstract: Analyzing street-view imagery with computer vision models for rapid, hyperlocal damage assessment is becoming popular and valuable in emergency response and recovery, but traditional models often act like black boxes, lacking interpretability and reliability. This study proposes DamageArbiter, a multimodal disagreement-driven arbitration framework powered by Contrastive Language-Image Pre-training (CLIP) models, to improve the accuracy, interpretability, and robustness of damage estimation from street-view imagery. DamageArbiter leverages the complementary strengths of unimodal and multimodal models, employing a lightweight logistic regression meta-classifier to arbitrate cases of disagreement. Using 2,556 post-disaster street-view images, paired with both manually generated and large language model (LLM)-generated text descriptions, we systematically compared the performance of unimodal models (including image-only and text-only models), multimodal CLIP-based models, and DamageArbiter. Notably, DamageArbiter improved the accuracy from 74.33% (ViT-B/32, image-only) to 82.79%, surpassing the 80% accuracy threshold and achieving an absolute improvement of 8.46 percentage points over the strongest baseline model. Beyond improving overall accuracy, DamageArbiter arbitrates discrepancies between unimodal and multimodal predictions and thereby mitigates the overconfidence errors common in image-only visual models, reducing overconfident but incorrect predictions, especially where visual cues of disaster damage are ambiguous or subject to interference. We further mapped and analyzed geo-referenced predictions and misclassifications to compare model performance across locations. Overall, this work advances street-view-based disaster assessment from coarse severity classification toward a more reliable and interpretable framework.
[377] Personalized Federated Learning with Residual Fisher Information for Medical Image Segmentation
Meilu Zhu,Yuxing Li,Zhiwei Wang,Edmund Y. Lam
Main category: cs.CV
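TL;DR: This paper proposes pFL-ResFIM, a personalized federated learning framework based on the Residual Fisher Information Matrix (ResFIM) that identifies and aggregates domain-invariant parameters to build client-adaptive personalized models, and clearly outperforms existing methods on public datasets.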
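The aggregation rule described in this entry (share only domain-invariant parameters through the server and keep domain-sensitive ones local) can be sketched as below. This is a simplified illustration under assumptions: the per-parameter `sensitivity` scores stand in for estimated ResFIM values, the fixed `threshold` and the plain average over all clients are hypothetical choices, and each "parameter" is a single float rather than a tensor.

```python
def personalize(client_params, sensitivity, threshold):
    """Build one personalized parameter set per client.

    client_params: {client: {param_name: value}}
    sensitivity:   {client: {param_name: score}}, stand-in for ResFIM
    Parameters scoring at or below `threshold` are treated as
    domain-invariant and replaced by the average over all clients;
    the rest are treated as domain-sensitive and kept local.
    """
    clients = list(client_params)
    names = list(client_params[clients[0]])
    # Server-side average of every parameter across clients.
    average = {
        name: sum(client_params[c][name] for c in clients) / len(clients)
        for name in names
    }
    personalized = {}
    for c in clients:
        personalized[c] = {
            name: average[name] if sensitivity[c][name] <= threshold
            else client_params[c][name]
            for name in names
        }
    return personalized


# Toy example with two clients and two scalar parameters.
client_params = {"A": {"w1": 1.0, "w2": 10.0}, "B": {"w1": 3.0, "w2": 20.0}}
sensitivity = {"A": {"w1": 0.1, "w2": 0.9}, "B": {"w1": 0.2, "w2": 0.8}}
personalized = personalize(client_params, sensitivity, threshold=0.5)
print(personalized)
```

Here `w1` is shared (both clients receive the average) while `w2` stays client-specific.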
Details
Motivation: Resolve the tension between generalization and personalization caused by data heterogeneity across clients in federated learning, and improve each client's local model performance.
Method: The Residual Fisher Information Matrix (ResFIM) quantifies how sensitive each parameter is to domain discrepancies. A spectral transfer strategy estimates each client's ResFIM under privacy constraints. Parameters are then partitioned into domain-sensitive and domain-invariant groups, and only the domain-invariant parameters are aggregated on the server to build personalized models.
Result: Experiments on multiple public datasets show that pFL-ResFIM consistently outperforms state-of-the-art methods.
Conclusion: Through parameter-level, client-adaptive partitioning and aggregation, pFL-ResFIM improves the performance and practicality of personalized federated learning.
Abstract: Federated learning enables multiple clients (institutions) to collaboratively train machine learning models without sharing their private data. To address the challenge of data heterogeneity across clients, personalized federated learning (pFL) aims to learn customized models for each client. In this work, we propose pFL-ResFIM, a novel pFL framework that achieves client-adaptive personalization at the parameter level. Specifically, we introduce a new metric, Residual Fisher Information Matrix (ResFIM), to quantify the sensitivity of model parameters to domain discrepancies. To estimate ResFIM for each client model under privacy constraints, we employ a spectral transfer strategy that generates simulated data reflecting the domain styles of different clients. Based on the estimated ResFIM, we partition model parameters into domain-sensitive and domain-invariant components. A personalized model for each client is then constructed by aggregating only the domain-invariant parameters on the server. Extensive experiments on public datasets demonstrate that pFL-ResFIM consistently outperforms state-of-the-art methods, validating its effectiveness.
[378] From Artefact to Insight: Efficient Low-Rank Adaptation of BrushNet for Scanning Probe Microscopy Image Restoration
Ziwei Wei,Yao Shen,Wanheng Lu,Ghim Wei Ho,Kaiyang Zeng
Main category: cs.CV
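TL;DR: This paper proposes a lightweight diffusion-based framework for restoring SPM images. Fine-tuning BrushNet with LoRA on a small set of paired data substantially improves artefact removal, and the approach applies to multiple SPM channels.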
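For readers unfamiliar with LoRA, the standard formulation keeps a pretrained weight matrix W frozen and learns a low-rank update, giving an effective weight W + (alpha / r) · B · A with B of shape (out, r) and A of shape (r, in). The sketch below illustrates that formula and the resulting parameter saving for a single layer. The matrices and the 1024 × 1024 layer size are made-up examples, and the per-layer ratio shown is not the model-wide figure reported in this entry.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the rank."""
    rank = len(a)  # A has shape (r, in), B has shape (out, r)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_fraction(out_dim, in_dim, rank):
    """LoRA parameters of one layer relative to its frozen weight count."""
    return rank * (in_dim + out_dim) / (out_dim * in_dim)


# Tiny made-up example: 2x2 frozen weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # shape (out=2, r=1)
a = [[0.5, 0.5]]     # shape (r=1, in=2)
merged = lora_weight(w, a, b, alpha=1.0)
fraction = trainable_fraction(1024, 1024, rank=1)
print(merged, fraction)
```

For a hypothetical 1024 × 1024 layer, a rank-1 update adds 2,048 trainable values against 1,048,576 frozen ones, which shows why adapter-style fine-tuning can touch only a small share of a model's weights.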
Details
Motivation: SPM images often contain structured artefacts (line-scan dropout, gain-induced noise, tip convolution, phase hops). Existing methods mostly treat artefact removal as an isolated denoising or interpolation task, and the generative inpainting perspective is largely unexplored.
Method: A diffusion-based inpainting framework in which BrushNet is fine-tuned with low-rank adaptation (LoRA), adjusting less than 0.2% of its weights, on 7,390 artefact-clean image pairs distilled from 739 experimental scans.
Result: On the SPM InpBench benchmark, PSNR improves by 6.61 dB and LPIPS is halved. Performance matches full-parameter fine-tuning while training on a single GPU. The method generalizes across SPM channels such as height, amplitude, and phase, and suppresses hallucination artefacts inherited from natural-image priors.
Conclusion: This lightweight, efficient framework offers a new paradigm for recovering irreplaceable SPM images and supports broader use of diffusion models in nanoscale imaging analysis.
Abstract: Scanning Probe Microscopy (SPM) offers nanoscale resolution but is frequently marred by structured artefacts such as line-scan dropout, gain-induced noise, tip convolution, and phase hops. While most available methods treat SPM artefact removal as isolated denoising or interpolation tasks, the generative inpainting perspective remains largely unexplored. In this work, we introduce a diffusion-based inpainting framework tailored to scientific grayscale imagery. By fine-tuning less than 0.2% of BrushNet weights with rank-constrained low-rank adaptation (LoRA), we adapt a pretrained diffusion model using only 7,390 artefact-clean pairs distilled from 739 experimental scans. On our forthcoming public SPM InpBench benchmark, the LoRA-enhanced model lifts the Peak Signal-to-Noise Ratio (PSNR) by 6.61 dB and halves the Learned Perceptual Image Patch Similarity (LPIPS) relative to zero-shot inference, while matching or slightly surpassing the accuracy of full retraining, and it is trainable on a single GPU instead of four high-memory cards. The approach generalizes across various SPM image channels including height, amplitude and phase, faithfully restores subtle structural details, and suppresses hallucination artefacts inherited from natural image priors. This lightweight framework enables efficient, scalable recovery of irreplaceable SPM images and paves the way for broader adoption of diffusion models in nanoscopic imaging analysis.
[379] AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Wenhui Huang,Songyan Zhang,Qihang Huang,Zhidong Wang,Zhiqi Mao,Collister Chua,Zhan Chen,Long Chen,Chen Lv
Main category: cs.CV
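TL;DR: This paper proposes AutoMoT, an end-to-end autonomous driving framework that unifies reasoning and action generation in a single vision-language-action (VLA) model. It uses a mixture-of-transformers (MoT) architecture with joint attention sharing to preserve the general reasoning ability of pretrained vision-language models while enabling efficient asynchronous fast-slow inference. Experiments show competitive performance against state-of-the-art methods on multiple benchmarks, and reveal that semantic understanding in pretrained VLMs can be activated by prompting alone, whereas action-level tasks still require fine-tuning.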
Details
Motivation: Existing approaches to integrating vision-language models (VLMs) into end-to-end autonomous driving systems have three main problems: distribution misalignment between the reasoning and action spaces, under-use of the general reasoning ability of pretrained VLMs, and high inference latency during action policy generation, which degrades driving performance.
Method: AutoMoT builds a unified vision-language-action (VLA) model with a mixture-of-transformers (MoT) architecture and joint attention sharing. It supports asynchronous fast-slow inference to match different task frequencies, elicits the abilities of pretrained VLMs through semantic prompting, and systematically analyzes when AD tasks need action-specific fine-tuning.
Result: In open-loop and closed-loop benchmarks, AutoMoT is competitive with state-of-the-art methods. Pretrained VLMs perform well on multi-task scene understanding with semantic prompting alone, but action-level tasks such as decision-making and trajectory planning still need AD-tailored fine-tuning.
Conclusion: AutoMoT bridges the gap between the general reasoning ability of VLMs and action generation in autonomous driving. It shows that "prompting as capability" is feasible for perception and understanding, while action generation still depends on domain fine-tuning, giving a basis for layered use of VLMs in autonomous driving.
Abstract: Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. Demonstration videos and qualitative results are available on the project page: https://automot-website.github.io/
[380] From Horizontal to Rotated: Cross-View Object Geo-Localization with Orientation Awareness
Chenlin Fu,Ao Gong,Yingying Zhu
Main category: cs.CV
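TL;DR: This paper proposes OSGeo, a framework that uses rotated bounding boxes (RBoxes) to improve cross-view object geo-localization, combining high accuracy with low annotation cost.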
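The geometric argument for rotated boxes can be checked numerically: for an elongated, tilted object, the smallest axis-aligned box that encloses it covers far more area than the rotated box itself. The sketch below assumes the common `(cx, cy, w, h, theta)` parameterization with `theta` in radians; that convention and the example object are assumptions, not details taken from the paper.

```python
import math


def rbox_corners(cx, cy, w, h, theta):
    """Corner points of a rotated box centered at (cx, cy)."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t)
            for x, y in half]


def enclosing_hbox_area(corners):
    """Area of the tightest axis-aligned box around the given corners."""
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical elongated object (10 x 2 units) rotated by 45 degrees.
corners = rbox_corners(0.0, 0.0, 10.0, 2.0, math.pi / 4)
rbox_area = 10.0 * 2.0
hbox_area = enclosing_hbox_area(corners)
print(rbox_area, hbox_area)
```

In this example the horizontal box has area 72 against 20 for the rotated box, so most of the horizontal box is background.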
Details
Motivation: Existing detection-based methods lose accuracy because horizontal bounding boxes (HBoxes) fit oriented objects poorly and because of feature map scaling, while segmentation-based methods are accurate but require prohibitively expensive pixel-level annotation.
Method: A detection paradigm based on rotated bounding boxes (RBoxes), with a multi-scale perception module and an orientation-sensitive detection head, together with CVOGL-R, the first dataset with RBox annotations for this task.
Result: OSGeo achieves state-of-the-art performance across experiments, matching or surpassing leading segmentation-based methods while cutting annotation cost by more than an order of magnitude.
Conclusion: RBoxes are an effective way to improve detection accuracy in CVOGL. OSGeo strikes a better balance between accuracy and efficiency and moves the task toward practical use.
Abstract: Cross-view object geo-localization (CVOGL) aims to precisely determine the geographic coordinates of a query object from a ground or drone perspective by referencing a satellite map. Segmentation-based approaches offer high precision but require prohibitively expensive pixel-level annotations, whereas more economical detection-based methods suffer from lower accuracy. This performance disparity in detection is primarily caused by two factors: the poor geometric fit of Horizontal Bounding Boxes (HBoxes) for oriented objects and the degradation in precision due to feature map scaling. Motivated by these observations, we propose leveraging Rotated Bounding Boxes (RBoxes) as a natural extension of the detection-based paradigm. RBoxes provide a much tighter geometric fit to oriented objects. Building on this, we introduce OSGeo, a novel geo-localization framework, meticulously designed with a multi-scale perception module and an orientation-sensitive head to accurately regress RBoxes. To support this scheme, we also construct and release CVOGL-R, the first dataset with precise RBox annotations for CVOGL. Extensive experiments demonstrate that our OSGeo achieves state-of-the-art performance, consistently matching or even surpassing the accuracy of leading segmentation-based methods but with an annotation cost that is over an order of magnitude lower.
[381] Video Detector: A Dual-Phase Vision-Based System for Real-Time Traffic Intersection Control and Intelligent Transportation Analysis
Mustafa Fatih Şen,Halûk Gümüşkaya,Şenol Pazar
Main category: cs.CV
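TL;DR: This paper presents Video Detector (VD), a dual-phase vision-based traffic intersection management system with a real-time control module (VD-RT) and an offline analysis module (VD-Offline). Built on several object detection models, it delivers comprehensive intersection monitoring without embedded road sensors, with up to 90% accuracy, 29.5 mAP@0.5, and real-time throughput of 37 FPS.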
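A "virtual loop" replaces an inductive loop buried in the road with a region or line drawn on the video frame: a vehicle is counted when its tracked position passes over it. The sketch below shows one simple way to do this with a horizontal counting line. It is a hypothetical illustration rather than the system's actual logic, and the track format (`track_id -> list of (x, y) centroids`) and the coordinates are invented.

```python
def count_crossings(tracks, line_y):
    """Count tracks whose centroid moves downward across y = line_y.

    tracks: {track_id: [(x, y), ...]} with centroids in frame order.
    Each track is counted at most once.
    """
    count = 0
    for points in tracks.values():
        for (_, y_prev), (_, y_next) in zip(points, points[1:]):
            if y_prev < line_y <= y_next:
                count += 1
                break  # do not count the same vehicle twice
    return count


# Made-up centroid tracks in pixel coordinates.
tracks = {
    1: [(10, 90), (10, 105)],            # crosses the line
    2: [(40, 50), (40, 60)],             # never reaches the line
    3: [(70, 99), (70, 100), (70, 120)]  # lands exactly on the line, counted once
}
vehicle_count = count_crossings(tracks, line_y=100)
print(vehicle_count)
```

A deployed system would also need per-lane lines, both travel directions, and handling of lost or merged tracks.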
Details
Motivation: Urban traffic management urgently needs low-cost, adaptive intelligent sensing systems to replace traditional inductive loops, which are expensive and hard to deploy and maintain.
Method: VD is a dual-phase vision system: VD-RT handles real-time intersection control and VD-Offline handles offline traffic behavior analysis. It uses three models (SSD Inception v2, Faster R-CNN Inception v2, and CenterNet ResNet-50 V1 FPN) trained on 108,000 annotated images covering 6 to 10 vehicle classes, and supports virtual loops, multi-object tracking, queue estimation, speed analysis, and multiclass classification.
Result: The system reaches 90% test accuracy and 29.5 mAP@0.5, with real-time processing at 37 FPS on HD video streams. It ran stably in field deployments in Istanbul, and the dataset and training pipeline are publicly released.
Conclusion: VD is a scalable, deployable vision-based solution for intelligent transportation systems and smart-city traffic management that significantly reduces dependence on road infrastructure.
Abstract: Urban traffic management increasingly requires intelligent sensing systems capable of adapting to dynamic traffic conditions without costly infrastructure modifications. Vision-based vehicle detection has therefore become a key technology for modern intelligent transportation systems. This study presents Video Detector (VD), a dual-phase vision-based traffic intersection management system designed as a flexible and cost-effective alternative to traditional inductive loop detectors. The framework integrates a real-time module (VD-RT) for intersection control with an offline analytical module (VD-Offline) for detailed traffic behavior analysis. Three system configurations were implemented using SSD Inception v2, Faster R-CNN Inception v2, and CenterNet ResNet-50 V1 FPN, trained on datasets totaling 108,000 annotated images across 6-10 vehicle classes. Experimental results show detection performance of up to 90% test accuracy and 29.5 mAP@0.5, while maintaining real-time throughput of 37 FPS on HD video streams. Field deployments conducted in collaboration with Istanbul IT and Smart City Technologies Inc. (ISBAK) demonstrate stable operation under diverse environmental conditions. The system supports virtual loop detection, vehicle counting, multi-object tracking, queue estimation, speed analysis, and multiclass vehicle classification, enabling comprehensive intersection monitoring without the need for embedded road sensors. The annotated dataset and training pipeline are publicly released to support reproducibility. These results indicate that the proposed framework provides a scalable and deployable vision-based solution for intelligent transportation systems and smart-city traffic management.
[382] RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
Linfei Li,Lin Zhang,Ying Shen
Main category: cs.CV
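TL;DR: This paper proposes the RealVLG framework, which combines the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping, and supports zero-shot perception and manipulation in unseen scenes.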
Details
Motivation: Existing visual-language grounding methods are limited to coarse, object-level localization, while traditional robotic grasping lacks language guidance, so neither handles language-driven manipulation well.
Method: Build the large-scale RealVLG-11B dataset with multi-granularity annotations, then apply reinforcement fine-tuning to a pretrained large vision-language model to obtain RealVLG-R1, which outputs bounding boxes, segmentation masks, grasp poses, and contact points end to end.
Result: RealVLG achieves zero-shot visual-language grounding and grasping in unseen real-world environments and establishes the first semantic-visual multimodal benchmark for language-driven robotic manipulation.
Conclusion: RealVLG unifies the mapping from language instructions to multi-granularity visual outputs and grasping actions, providing integrated data, models, and evaluation for language-guided embodied intelligence.
Abstract: Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. The RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million segmentation, detection, and language annotations, and roughly 11 billion grasping examples. Building on this dataset, RealVLG-R1 employs reinforcement fine-tuning on pretrained large-scale vision-language models to predict bounding boxes, segmentation masks, grasp poses, and contact points in a unified manner given natural language instructions. Experimental results demonstrate that RealVLG supports zero-shot perception and manipulation in real-world unseen environments, establishing a unified semantic-visual multimodal benchmark that provides a comprehensive data and evaluation platform for language-driven robotic perception and grasping policy learning. All data and code are publicly available at https://github.com/lif314/RealVLG-R1.
[383] LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
Soumyaratna Debnath,Bui Duc Manh,Zinan Liu,Lin Wang
Main category: cs.CV
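TL;DR: This paper proposes LLMind, a training-free, bio-inspired visual representation framework that mimics foveated encoding and cortical magnification in human vision to produce adaptive, efficient representations under tight pixel budgets.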
Details
Motivation: Existing vision-language models (VLMs) assume uniform spatial fidelity across the visual input, whereas human vision is adaptive, selective, and resource-efficient. Borrowing from biological vision could make VLMs more efficient.
Method: A Bio-inspired Adaptive Sampling Strategy (BASS) with a Mobius-parameterized module performs non-uniform sampling, and closed-loop semantic feedback (CSF) uses test-time adaptation to align perceptual saliency with textual information.
Result: Average improvements of +20%, +38%, and +37% on VQAv2, Seed-Bench, and A-OKVQA. Using only 1%, 3%, and 5% of the pixels retains up to 82%, 92%, and 97% of full-resolution performance.
Conclusion: LLMind is a lightweight, plug-and-play, general method that requires no changes to the VLM architecture and markedly improves VLM efficiency and performance under tight pixel budgets.
Abstract: Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
[384] SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
Huanjing Yue,Shangbin Xie,Cong Cao,Qian Wu,Lei Zhang,Lei Zhao,Jingyu Yang
Main category: cs.CV
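TL;DR: This paper proposes SpiralDiff, a diffusion-based framework for RGB-to-RAW conversion that introduces a signal-dependent noise weighting strategy and CamLoRA, a camera-aware lightweight adaptation module, to improve reconstruction quality across brightness levels and support adaptation to multiple cameras.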
Details
Motivation: Existing RGB-to-RAW methods overlook two issues: reconstruction difficulty varies with pixel intensity, and differences in ISP characteristics across cameras call for camera-specific adaptation.
Method: The SpiralDiff diffusion framework with a signal-dependent noise weighting strategy, plus the CamLoRA module for lightweight, camera-aware adaptation.
Result: Experiments on four benchmark datasets demonstrate SpiralDiff's advantages in conversion quality and in downstream RAW-based object detection.
Conclusion: SpiralDiff addresses intensity-dependent reconstruction imbalance and cross-camera generalization in RGB-to-RAW conversion, combining high-quality reconstruction with strong adaptability.
Abstract: RAW images preserve superior fidelity and rich scene information compared to RGB, making them essential for tasks in challenging imaging conditions. To alleviate the high cost of data collection, recent RGB-to-RAW conversion methods aim to synthesize RAW images from RGB. However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion-based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. In addition, we introduce CamLoRA, a camera-aware lightweight adaptation module that enables a unified model to adapt to different camera-specific ISP characteristics. Extensive experiments on four benchmark datasets demonstrate the superiority of SpiralDiff in RGB-to-RAW conversion quality and its downstream benefits in RAW-based object detection. Our code and model are available at https://github.com/Chuancy-TJU/SpiralDiff.
[385] PASTE: Physics-Aware Scattering Topology Embedding Framework for SAR Object Detection
Jiacheng Chen,Yuxuan Xiong,Haipeng Wang
Main category: cs.CV
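TL;DR: This paper proposes PASTE, a framework that embeds the electromagnetic scattering topology of targets in SAR images into modern detectors through scattering keypoint generation, topology injection, and prior supervision, significantly improving detection performance and interpretability.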
Details
Motivation: Existing deep learning methods for SAR object detection mostly reuse optical-image techniques and ignore electromagnetic scattering physics. Work that does use scattering points or frequency-domain information is limited by amplitude-based statistical models, high computational cost, or poor generalization, and struggles to incorporate scattering topology priors effectively.
Method: The Physics-Aware Scattering Topology Embedding framework (PASTE): (1) automatically generates scattering keypoints and annotations from the Attributed Scattering Center (ASC) model; (2) uses a scattering topology injection module to guide multi-scale feature learning; and (3) applies a scattering prior supervision strategy that aligns network predictions with the distribution of scattering centers.
Result: On real SAR datasets, PASTE is compatible with various detectors and yields relative mAP gains of 2.9% to 11.3% with acceptable computational overhead. Scattering map visualizations confirm that scattering topology priors are embedded in the feature space and clearly separate target and background scattering regions.
Conclusion: PASTE integrates scattering physics priors with deep detection frameworks in a closed loop, improving both performance and interpretability, and offers a new paradigm for intelligent SAR image interpretation.
Abstract: Current deep learning-based object detection for Synthetic Aperture Radar (SAR) imagery mainly adopts optical image methods, treating targets as texture patches while ignoring inherent electromagnetic scattering mechanisms. Though scattering points have been studied to boost detection performance, most methods still rely on amplitude-based statistical models. Some approaches introduce frequency-domain information for scattering center extraction, but they suffer from high computational cost and poor compatibility with diverse datasets. Thus, effectively embedding scattering topological information into modern detection frameworks remains challenging. To solve these problems, this paper proposes the Physics-Aware Scattering Topology Embedding Framework (PASTE), a novel closed-loop architecture for comprehensive scattering prior integration. By building the full pipeline from topology generation and injection to joint supervision, PASTE integrates scattering physics into modern SAR detectors. Specifically, it designs a scattering keypoint generation and automatic annotation scheme based on the Attributed Scattering Center (ASC) model to produce scalable and physically consistent priors. A scattering topology injection module guides multi-scale feature learning, and a scattering prior supervision strategy constrains network optimization by aligning predictions with scattering center distributions. Experiments on real datasets show that PASTE is compatible with various detectors and brings relative mAP gains of 2.9% to 11.3% over baselines with acceptable computation overhead. Visualization of scattering maps verifies that PASTE successfully embeds scattering topological priors into feature space, clearly distinguishing target and background scattering regions, thus providing strong interpretability for results.
[386] Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs
Jaehoon Lee,Mingi Jung,Soohyuk Jang,Seungryong Yoo,Dahuin Jung,Sungroh Yoon
Main category: cs.CV
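TL;DR: This paper proposes PromPrune, a sample-adaptive visual token selection framework that uses semantic prominence-aware budget allocation and a two-stage selection pipeline to compress visual tokens heavily while preserving model performance.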
Details
Motivation: Existing visual token compression methods use static strategies that cannot adapt to differences in the distribution of semantic prominence across samples, which leads to suboptimal compression.
Method: PromPrune combines a semantic prominence-aware budget allocation mechanism with a two-stage token selection pipeline, dynamically balancing preservation of local saliency against global coverage.
Result: On LLaVA-NeXT-7B, FLOPs fall by 88% and prefill latency by 22% while 97.5% of the original accuracy is preserved.
Conclusion: A sample-adaptive visual token compression strategy trades off computational efficiency and model performance more effectively than a fixed strategy.
Abstract: Large Vision-Language Models (VLMs) achieve strong multimodal understanding capabilities by leveraging high-resolution visual inputs, but the resulting large number of visual tokens creates a major computational bottleneck. Recent work mitigates this issue through visual token compression, typically compressing tokens based on saliency, diversity, or a fixed combination of both. We observe that the distribution of semantic prominence varies substantially across samples, leading to different optimal trade-offs between local saliency preservation and global coverage. This observation suggests that applying a static compression strategy across all samples can be suboptimal. Motivated by this insight, we propose PromPrune, a sample-adaptive visual token selection framework composed of semantic prominence-aware budget allocation and a two-stage selection pipeline. Our method adaptively balances local saliency preservation and global coverage according to the semantic prominence distribution of each sample. By allocating token budgets between locally salient regions and globally diverse regions, our method maintains strong performance even under high compression ratios. On LLaVA-NeXT-7B, our approach reduces FLOPs by 88% and prefill latency by 22% while preserving 97.5% of the original accuracy.
[387] TopoVST: Toward Topology-fidelitous Vessel Skeleton Tracking
Yaoyu Liu,Minghui Zhang,Junjun He,Yun Gu
Main category: cs.CV
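TL;DR: This paper proposes TopoVST, a topology-preserving vessel skeleton tracking method that combines multi-scale sphere graph modeling, joint direction and radius prediction with graph neural networks, gated feature fusion, a geometry-aware weighted loss, and a wave-propagation skeleton tracking algorithm to reduce skeleton breaks and spurious segments, reaching SOTA performance on several metrics.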
Details
Motivation: Automatic extraction of vessel skeletons is essential for clinical applications, but existing methods still struggle to keep the topology intact, with frequent skeleton breaks and spurious segments.
Method: TopoVST builds multi-scale sphere graphs to sample the image; uses graph neural networks to jointly estimate tracking direction and vessel radius; adds a gating mechanism for multi-scale feature fusion; embeds a geometry-aware weighting strategy in the directional loss to ease class imbalance; and designs a skeleton tracking algorithm based on wave propagation and space-occupancy filtering to suppress spurious skeletons.
Result: On two vessel datasets with different geometries, TopoVST outperforms current mainstream methods on both overlap metrics (such as Dice) and topological metrics (such as Hausdorff distance and branch-point error).
Conclusion: Through multi-scale modeling, geometry-aware learning, and tracking with explicit topological constraints, TopoVST markedly improves the continuity and topological fidelity of vessel skeleton extraction, offering a more robust tool for clinical vessel analysis.
Abstract: Automatic extraction of vessel skeletons is crucial for many clinical applications. However, achieving topologically faithful delineation of thin vessel skeletons remains highly challenging, primarily due to frequent discontinuities and the presence of spurious skeleton segments. To address these difficulties, we propose TopoVST, a topology-fidelitous vessel skeleton tracker. TopoVST constructs multi-scale sphere graphs to sample the input image and employs graph neural networks to jointly estimate tracking directions and vessel radii. The utilization of multi-scale representations is enhanced through a gating-based feature fusion mechanism, while the issue of class imbalance during training is mitigated by embedding a geometry-aware weighting scheme into the directional loss. In addition, we design a wave-propagation-based skeleton tracking algorithm that explicitly mitigates the generation of spurious skeletons through space-occupancy filtering. We evaluate TopoVST on two vessel datasets with different geometries. Extensive comparisons with state-of-the-art baselines demonstrate that TopoVST achieves competitive performance in both overlapping and topological metrics. Our source code is available at: https://github.com/EndoluminalSurgicalVision-IMR/TopoVST.
[388] ILV: Iterative Latent Volumes for Fast and Accurate Sparse-View CT Reconstruction
Seungryong Lee,Woojeong Baek,Joosang Lee,Eunbyung Park
Main category: cs.CV
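TL;DR: This paper proposes Iterative Latent Volumes (ILV), a feed-forward framework that combines data-driven priors with classical iterative reconstruction ideas to significantly improve the quality and speed of sparse-view cone-beam CT (CBCT) reconstruction.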
Details
Motivation: Sparse-view CT reconstruction can lower radiation dose and system cost and speed up clinical imaging, but existing feed-forward methods still suffer from artifacts and loss of detail.
Method: ILV builds a 3D latent volume that is updated iteratively, conditioning on multi-view X-ray features and a learned anatomical prior; key modules include an X-ray feature volume, group cross-attention, efficient self-attention, and view-wise feature aggregation.
Result: On a dataset of about 14,000 CT volumes, ILV significantly outperforms existing feed-forward and optimization-based methods in both reconstruction quality and speed.
Conclusion: ILV delivers fast, high-accuracy sparse-view CBCT reconstruction with potential for clinical use.
Abstract: A long-term goal in CT imaging is to achieve fast and accurate 3D reconstruction from sparse-view projections, thereby reducing radiation exposure, lowering system cost, and enabling timely imaging in clinical workflows. Recent feed-forward approaches have shown strong potential toward this overarching goal, yet their results still suffer from artifacts and loss of fine details. In this work, we introduce Iterative Latent Volumes (ILV), a feed-forward framework that integrates data-driven priors with classical iterative reconstruction principles to overcome key limitations of prior feed-forward models in sparse-view CBCT reconstruction. At its core, ILV constructs an explicit 3D latent volume that is repeatedly updated by conditioning on multi-view X-ray features and the learned anatomical prior, enabling the recovery of fine structural details beyond the reach of prior feed-forward models. In addition, we develop and incorporate several key architectural components, including an X-ray feature volume, group cross-attention, efficient self-attention, and view-wise feature aggregation, that efficiently realize its core latent volume refinement concept. Extensive experiments on a large-scale dataset of approximately 14,000 CT volumes demonstrate that ILV significantly outperforms existing feed-forward and optimization-based methods in both reconstruction quality and speed. These results show that ILV enables fast and accurate sparse-view CBCT reconstruction suitable for clinical use. The project page is available at: https://sngryonglee.github.io/ILV/.
[389] EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing
Zitong Xu,Huiyu Duan,Zhongpeng Ji,Xinyun Zhang,Yutao Liu,Xiongkuo Min,Ke Gu,Jian Zhang,Shusong Xu,Jinwei Chen,Bo Li,Guangtao Zhai
Main category: cs.CV
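TL;DR: This paper introduces the EditHF-1M dataset, the EditHF evaluation model, and the EditHF-Reward reward model to address the lack of scalable, human-aligned evaluation methods for text-guided image editing.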
Details
Motivation: Images produced by existing text-guided image editing models often show artifacts, unintended edits, and poor aesthetics, and the lack of scalable human-aligned evaluation models limits the development of reward models based on human feedback.
Method: Build the million-scale image editing dataset EditHF-1M (over 29M preference pairs and 148K MOS ratings); train the multimodal large language model EditHF on it as an evaluator; and propose EditHF-Reward, which uses EditHF's output as a reinforcement learning reward signal to optimize image editing models.
Result: EditHF aligns well with human preferences and generalizes across datasets; fine-tuning Qwen-Image-Edit with EditHF-Reward significantly improves editing performance.
Conclusion: The EditHF line of work provides a high-quality evaluation and optimization paradigm for text-guided image editing and advances human-feedback-driven image editing models.
Abstract: Recent text-guided image editing (TIE) models have achieved remarkable progress, while many edited images still suffer from issues such as artifacts, unexpected edits, and unaesthetic content. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address the challenges, we first introduce EditHF-1M, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated from three dimensions, i.e., visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose EditHF, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback from image editing. Finally, we introduce EditHF-Reward, which utilizes EditHF as the reward signal to optimize the text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune the Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model to scale up image editing. Both the dataset and code will be released in our GitHub repository: https://github.com/IntMeGroup/EditHF.
[390] F²HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
Huanjing Yue,Dawei Li,Shaoxiong Tu,Jingyu Yang
Main category: cs.CV
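TL;DR: This paper proposes F²HDR, a two-stage framework for reconstructing HDR video from alternating-exposure LDR video in dynamic scenes, with a focus on better cross-exposure alignment, motion modeling, and detail recovery.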
Details
Motivation: In dynamic scenes, cross-exposure inconsistencies and complex motion make inter-frame alignment difficult for existing methods, causing ghosting and detail loss, along with inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion regions.
Method: The two-stage F²HDR framework (1) introduces a flow adapter that adapts generic optical flow for robust cross-exposure alignment; (2) uses physical motion modeling to identify salient motion regions; and (3) designs a motion-aware refinement network that aggregates complementary information and removes ghosting and noise.
Result: The method reaches SOTA performance on real-world HDR video benchmarks, handles large motion and exposure changes, and produces ghost-free, high-fidelity HDR video.
Conclusion: By jointly optimizing motion awareness and cross-exposure alignment, F²HDR significantly improves HDR video reconstruction quality and robustness in complex dynamic scenes.
Abstract: Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose F²HDR, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, a physical motion modeling module to identify salient motion regions, and a motion-aware refinement network that aggregates complementary information while removing ghosting and noise. Extensive experiments demonstrate that F²HDR achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations.
[391] Workflow-Aware Structured Layer Decomposition for Illustration Production
Tianyu Zhang,Dongchi Li,Keiichi Sawada,Haoran Xie
Main category: cs.CV
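TL;DR: This paper proposes a structured layer decomposition framework oriented to the anime illustration production workflow. It decomposes an image into semantically clear production layers (line art, flat color, shadow, and highlight) and disentangles them using lightweight layer semantic embeddings and layer-wise losses, addressing the difficulty existing methods have in capturing the structural and stylistic properties of anime images.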
Details
Motivation: Existing layered generative image editing methods based on object segmentation have difficulty accurately modeling the structural and stylistic properties of human-created images such as anime illustrations.
Method: Propose a structured layer decomposition framework oriented to the anime production workflow that decomposes an illustration into four layers (line art, flat color, shadow, highlight); introduce lightweight layer semantic embeddings to give each layer task-specific guidance, with layer-wise losses supervising training; and build a high-quality layered dataset that simulates the standard anime production workflow.
Result: The method yields accurate, visually coherent layer decompositions and is validated on anime illustrations; the layered representation supports downstream editing tasks such as recoloring and texture embedding.
Conclusion: The work provides a layered representation for anime-style images that better matches real creative practice, improving the controllability and practicality of generative image editing in professional art production.
Abstract: Recent generative image editing methods adopt layered representations to mitigate the entangled nature of raster images and improve controllability, typically relying on object-based segmentation. However, such strategies may fail to capture the structural and stylized properties of human-created images, such as anime illustrations. To solve this issue, we propose a workflow-aware structured layer decomposition framework tailored to the illustration production of anime artwork. Inspired by the creation pipeline of anime production, our method decomposes the illustration into semantically meaningful production layers, including line art, flat color, shadow, and highlight. To decouple all these layers, we introduce lightweight layer semantic embeddings to provide specific task guidance for each layer. Furthermore, a set of layer-wise losses is incorporated to supervise the training process of individual layers. To overcome the lack of ground-truth layered data, we construct a high-quality illustration dataset that simulates the standard anime production workflow. Experiments demonstrate that our method achieves accurate and visually coherent layer decompositions. We believe that the resulting layered representation further enables downstream tasks such as recoloring and embedding texture, supporting content creation and illustration editing. Code is available at: https://github.com/zty0304/Anime-layer-decomposition
[392] Video-CoE: Reinforcing Video Event Prediction via Chain of Events
Qile Su,Jing Tang,Rui Chen,Lei Sun,Xiangxiang Chu
Main category: cs.CV
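TL;DR: This paper proposes a new paradigm called Chain of Events (CoE) to improve multimodal large language models (MLLMs) on video event prediction (VEP). It builds temporal event chains to strengthen the model's handling of visual content and of the logical relations to future events, and sets a new SOTA on public benchmarks.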
Details
Motivation: Existing MLLMs perform poorly on video event prediction (VEP), mainly because they lack logical reasoning ability about future events and make insufficient use of visual information.
Method: Propose the Chain of Events (CoE) paradigm, which builds temporal event chains to implicitly steer the MLLM toward the visual content and the logical connections between the video and future events, combined with multiple training strategies to improve reasoning.
Result: On several public VEP benchmarks, the method significantly outperforms leading open-source and commercial MLLMs and sets a new SOTA.
Conclusion: The CoE paradigm eases the weak logical reasoning and insufficient visual grounding of MLLMs on VEP and offers a new approach to temporal video understanding and prediction.
Abstract: Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly enforce the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing the model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task. Codes and models will be released soon.
[393] Relevance Feedback in Text-to-Image Diffusion: A Training-Free And Model-Agnostic Interactive Framework
Wenxi Wang,Hongbin Liu,Mingqian Li,Junyan Yuan,Junqi Zhang
Main category: cs.CV
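TL;DR: This paper proposes the RFD framework, which brings the relevance feedback mechanism from information retrieval into text-to-image generation with diffusion models. Multi-select visual feedback lowers the user's cognitive load, and an expert-built feature repository combined with information-theoretic weighted cumulative preference analysis provides interpretable preference inference, all without training and independent of the model.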
Details
Motivation: Users often have a clear visual intent but cannot express it precisely in language, which leads to ambiguous prompts and mismatched images; existing methods cannot combine low cognitive load, interpretable preference inference, training-free operation, and model agnosticism.
Method: The interactive RFD framework replaces explicit textual dialogue with implicit multi-select visual feedback; builds an expert-annotated feature repository; designs an information-theoretic weighted cumulative preference analysis (white-box, avoiding concatenation of historical feedback); introduces probabilistic sampling for prompt reconstruction to balance exploration and exploitation; and operates entirely in the external text space, so it stays training-free and model-agnostic.
Result: Experiments show that RFD captures users' true visual intent more effectively and significantly outperforms baselines on preference alignment.
Conclusion: RFD is a general plug-and-play solution that balances low cognitive load, interpretability, training-free operation, and model agnosticism, offering a more natural and controllable interaction paradigm for text-to-image generation.
Abstract: Text-to-image generation using diffusion models has achieved remarkable success. However, users often possess clear visual intents but struggle to express them precisely in language, resulting in ambiguous prompts and misaligned images. Existing methods struggle to bridge this gap, typically relying on high-load textual dialogues, opaque black-box inferences, or expensive fine-tuning. They fail to simultaneously achieve low cognitive load and interpretable preference inference while remaining training-free and model-agnostic. To address this, we propose RFD, an interactive framework that adapts the relevance feedback mechanism from information retrieval to diffusion models. In RFD, users replace explicit textual dialogue with implicit, multi-select visual feedback to minimize cognitive load, easily expressing complex, multi-dimensional preferences. To translate feedback into precise generative guidance, we construct an expert-curated feature repository and introduce an information-theoretic weighted cumulative preference analysis. This white-box method calculates preferences from current-round feedback and incrementally accumulates them, avoiding the concatenation of historical interactions and preventing inference degradation caused by lengthy contexts. Furthermore, RFD employs a probabilistic sampling mechanism for prompt reconstruction to balance exploitation and exploration, preventing output homogenization. Crucially, RFD operates entirely within the external text space, making it strictly training-free and model-agnostic as a universal plug-and-play solution. Extensive experiments demonstrate that RFD effectively captures the user's true visual intent, significantly outperforming baselines in preference alignment.
[394] FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving
Yaoru Li,Federico Landi,Marco Godi,Xin Jin,Ruiju Fu,Yufei Ma,Muyang Sun,Heyu Si,Qi Guo
Main category: cs.CV
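TL;DR: This paper proposes FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. Using a multi-view diffusion transformer and a two-stage training strategy, it tackles three challenges (long-horizon consistency, autoregressive degradation, and low-latency interaction), achieving SOTA closed-loop simulation performance on nuScenes with sub-second latency on a single GPU.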
Details
Motivation: Existing autonomous driving systems are limited by the lack of scalable, interactive simulation environments; generative video models have high visual quality but mostly run open-loop and cannot support fine-grained frame-level interaction between actions and the environment.
Method: FAR-Drive includes (1) a multi-view diffusion transformer for geometrically consistent multi-camera video generation; (2) a two-stage training strategy, adaptive reference horizon conditioning plus blend-forcing autoregressive training, to improve long-horizon consistency and resistance to degradation; and (3) system-level inference acceleration to meet low-latency requirements.
Result: On the nuScenes dataset, FAR-Drive reaches the best current performance in closed-loop driving simulation, with single-GPU inference latency under one second.
Conclusion: FAR-Drive addresses key challenges in learning-based closed-loop driving simulation and offers a new paradigm for high-fidelity, highly interactive, low-latency end-to-end simulation.
Abstract: Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.
[395] Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation
Xingtai Gui,Meijie Zhang,Tianyi Yan,Wencheng Han,Jiahao Gong,Feiyang Tan,Cheng-zhong Xu,Jianbing Shen
Main category: cs.CV
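TL;DR: This paper proposes WorldDrive, a framework that unifies vision and motion representations to couple scene generation with real-time planning, improving both planning performance and video generation quality in end-to-end autonomous driving.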
Details
Motivation: Existing driving world models emphasize visual representation and lack an explicit motion representation that the planner can share and inherit, leaving a gap between optimizing scene generation and the needs of precise motion planning.
Method: Propose a Trajectory-aware Driving World Model that conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions; transfer the vision and motion encoders to a multi-modal planner; and design a Future-aware Rewarder that uses future latent representations from the frozen world model to evaluate and select trajectories in real time.
Result: On the NAVSIM, NAVSIM-v2, and nuScenes benchmarks, WorldDrive achieves leading planning performance among vision-only methods while retaining high-fidelity action-controlled video generation.
Conclusion: Unifying vision and motion representations effectively improves the robustness and planning quality of autonomous driving systems, confirming the value of co-designing the world model and the planner.
Abstract: End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future-aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real-time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.
[396] GT-PCQA: Geometry-Texture Decoupled Point Cloud Quality Assessment with MLLM
Guohua Zhang,Jian Jin,Meiqin Liu,Chao Yao,Weisi Lin,Yao Zhao
Main category: cs.CV
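TL;DR: This paper proposes GT-PCQA, a framework that uses 2D-3D joint training and a geometry-texture decoupling strategy to address the poor generalization of MLLMs in point cloud quality assessment (PCQA) caused by data scarcity and texture bias.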
Details
Motivation: Existing PCQA datasets are small and cannot support stable instruction tuning of MLLMs; in addition, MLLM pretraining favors texture-based reasoning and is insensitive to the geometric structural degradations that matter for PCQA.
Method: GT-PCQA uses (1) a 2D-3D joint training strategy that casts PCQA as a relative quality comparison task, merges large-scale IQA data with limited PCQA data, and applies LoRA for parameter-efficient fine-tuning; and (2) a geometry-texture decoupling strategy that combines a dual-prompt mechanism with alternating optimization to weaken the MLLM's texture bias and increase sensitivity to geometric structure.
Result: GT-PCQA shows competitive performance and strong generalization across experiments.
Conclusion: GT-PCQA eases the data scarcity and texture bias that MLLMs face in PCQA and offers a new direction for applying multimodal models to 3D visual quality assessment.
Abstract: With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising generalization. However, directly extending these MLLM-based IQA methods to point cloud quality assessment (PCQA) remains challenging. On the one hand, existing PCQA datasets are limited in scale, which hinders stable and effective instruction tuning of MLLMs. On the other hand, due to large-scale image-text pretraining, MLLMs tend to rely on texture-dominant reasoning and are insufficiently sensitive to geometric structural degradations that are critical for PCQA. To address these gaps, we propose a novel MLLM-based no-reference PCQA framework, termed GT-PCQA, which is built upon two key strategies. First, to enable stable and effective instruction tuning under scarce PCQA supervision, a 2D-3D joint training strategy is proposed. This strategy formulates PCQA as a relative quality comparison problem to unify large-scale IQA datasets with limited PCQA datasets. It incorporates a parameter-efficient Low-Rank Adaptation (LoRA) scheme to support instruction tuning. Second, a geometry-texture decoupling strategy is presented, which integrates a dual-prompt mechanism with an alternating optimization scheme to mitigate the inherent texture-dominant bias of pre-trained MLLMs, while enhancing sensitivity to geometric structural degradations. Extensive experiments demonstrate that GT-PCQA achieves competitive performance and exhibits strong generalization.
[397] Pansharpening for Thin-Cloud Contaminated Remote Sensing Images: A Unified Framework and Benchmark Dataset
Songcheng Du,Yang Zou,Jiaxin Li,Mingxuan Liu,Ying Li,Changjing Shang,Qiang Shen
Main category: cs.CV
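TL;DR: This paper proposes Pan-TCR, an end-to-end unified model for pansharpening under thin-cloud conditions. It jointly models cloud contamination and spatial degradation with a frequency-decoupled restoration (FDR) module and an interactive inter-frequency consistency (IFC) module, and releases PanTCR-GF2, the first real-world thin-cloud dataset.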
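The frequency-domain decoupling mentioned here rests on a standard operation: splitting an image's Fourier spectrum into amplitude and phase and recombining them. The sketch below shows only that operation with NumPy on random arrays; it is not the FDR block, and the function names are made up for this example.

```python
import numpy as np

def split_amplitude_phase(img):
    """Return the amplitude and phase of the 2D Fourier spectrum of a real image."""
    spec = np.fft.fft2(img)
    return np.abs(spec), np.angle(spec)

def combine(amplitude, phase):
    """Rebuild a spatial image from an amplitude and a phase spectrum."""
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))

rng = np.random.default_rng(0)
img_a = rng.random((8, 8))   # stand-in for one band or modality
img_b = rng.random((8, 8))   # stand-in for another

amp_a, pha_a = split_amplitude_phase(img_a)
amp_b, pha_b = split_amplitude_phase(img_b)

roundtrip = combine(amp_a, pha_a)   # amplitude + own phase recovers img_a
hybrid = combine(amp_a, pha_b)      # amplitude of one image, phase of the other
```

Phase largely carries structural layout and amplitude carries intensity statistics, which is the usual intuition behind treating the two components separately.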
Details
Motivation: Pansharpening under thin cloud is practically important but understudied; existing methods handle cloud removal and sharpening sequentially without joint degradation modeling, which causes error accumulation and degraded performance.
Method: Propose the unified model Pan-TCR, which contains a frequency-decoupled restoration (FDR) block (restoration guided by NIR amplitude and PAN phase, respectively) and an interactive inter-frequency consistency (IFC) module, and build PanTCR-GF2, the first real-world thin-cloud pansharpening dataset.
Result: Experiments on real-world and synthetic datasets show that Pan-TCR is superior and robust under thin cloud and sets a new benchmark for the task.
Conclusion: Through frequency-domain joint modeling driven by physical priors and supported by real data, Pan-TCR addresses the key challenges of pansharpening under thin cloud and moves remote sensing image enhancement closer to practical use under complex atmospheric conditions.
Abstract: Pansharpening under thin cloudy conditions is a practically significant yet rarely addressed task, challenged by simultaneous spatial resolution degradation and cloud-induced spectral distortions. Existing methods often address cloud removal and pansharpening sequentially, leading to cumulative errors and suboptimal performance due to the lack of joint degradation modeling. To address these challenges, we propose a Unified Pansharpening Model with Thin Cloud Removal (Pan-TCR), an end-to-end framework that integrates physical priors. Motivated by theoretical analysis in the frequency domain, we design a frequency-decoupled restoration (FDR) block that disentangles the restoration of multispectral image (MSI) features into amplitude and phase components, each guided by complementary degradation-robust prompts: the near-infrared (NIR) band amplitude for cloud-resilient restoration, and the panchromatic (PAN) phase for high-resolution structural enhancement. To ensure coherence between the two components, we further introduce an interactive inter-frequency consistency (IFC) module, enabling cross-modal refinement that enforces consistency and robustness across frequency cues. Furthermore, we introduce the first real-world thin-cloud contaminated pansharpening dataset (PanTCR-GF2), comprising paired clean and cloudy PAN-MSI images, to enable robust benchmarking under realistic conditions. Extensive experiments on real-world and synthetic datasets demonstrate the superiority and robustness of Pan-TCR, establishing a new benchmark for pansharpening under realistic atmospheric degradations.
[398] Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering
Minchan Kwon,Hyounguk Shon,Junmo Kim
Main category: cs.CV
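TL;DR: This paper proposes a question-aware keyframe selection framework that uses pseudo keyframe labels and coverage regularization to improve video question answering (VideoQA), with especially large gains on temporal and causal questions.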
Details
Motivation: Large vision-language models face high inference cost and information dilution in VideoQA; keyframe selection can improve efficiency and reasoning quality but is limited by sparse supervision and by redundant frame choices when it relies only on image-text similarity.
Method: Propose a question-aware keyframe selection framework with two parts: (1) pseudo keyframe labels generated by large vision-language models to give informative supervision; and (2) coverage regularization that encourages diverse, complementary evidence across time.
Result: Experiments on NExT-QA show significant accuracy gains, especially on temporal and causal questions.
Conclusion: Keyframe selection can serve as an effective, learnable module in VideoQA, and this method confirms its potential and practicality.
Abstract: Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.
[399] CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models
Xiaojun Shan,Haoyu Shen,Yucheng Mao,Xiang Zhang,Abhay Anand,Bingnan Li,Haiyang Xu,Zhuowen Tu
Main category: cs.CV
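TL;DR: CyCLeGen is a unified vision-language foundation model that performs both image understanding and image generation within a single autoregressive framework, using cycle-consistent learning over image->layout->image and layout->image->layout loops.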
Details
Motivation: Existing vision models usually rely on separate modules for perception and synthesis and lack unified modeling; this paper aims to build a unified architecture that both understands and generates, with self-reasoning ability and data efficiency.
Method: Propose CyCLeGen, a fully integrated autoregressive architecture with bidirectional image-layout cycle generation, combined with a reinforcement learning objective based on cycle consistency for self-supervised optimization.
Result: The model achieves significant gains on multiple image understanding and generation benchmarks.
Conclusion: Unified vision-language foundation models such as CyCLeGen offer introspection and data efficiency, showing the potential of a new paradigm in which image understanding and generation develop together.
Abstract: We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image->layout->image and layout->image->layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.
[400] GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis
Minjun Kang,Inkyu Shin,Taeyeop Lee,Myungchul Kim,In So Kweon,Kuk-Jin Yoon
Main category: cs.CV
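TL;DR: This paper proposes GeoNVS, a geometry-guided novel view synthesis method. Its Gaussian Splat Feature Adapter (GS-Adapter) introduces 3D geometric constraints in feature space, significantly improving geometric fidelity and camera controllability, and works plug-and-play without additional training.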
Details
Motivation: Existing camera-controlled video diffusion models suffer from geometric distortion and poor camera controllability in novel view synthesis, so stronger 3D geometric consistency and more flexible viewpoint generation are needed.
Method: Propose the GS-Adapter module, which lifts diffusion features of the input views into 3D Gaussian representations, renders novel-view features under geometric constraints, and adaptively fuses them back into the diffusion features; the module works in feature space rather than input space, which avoids color noise disturbing structural consistency, and supports zero-shot transfer to various feed-forward geometry models and video diffusion backbones.
Result: Experiments across 9 scenes and 18 settings show that GeoNVS leads in performance, with improvements of 11.3% over SEVA and 14.9% over CameraCtrl, up to a 2x reduction in translation error, and up to a 7x reduction in Chamfer Distance.
Conclusion: GS-Adapter provides explicit geometric guidance in feature space, balancing geometric accuracy and generation quality; it is a general, efficient, plug-and-play paradigm for novel view synthesis.
Abstract: Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.
[401] Voronoi-based Second-order Descriptor with Whitened Metric in LiDAR Place Recognition
Jaein Kim,Hee Bin Yoo,Dong-Sig Han,Byoung-Tak Zhang
Main category: cs.CV
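TL;DR: This paper proposes a new pooling method for LiDAR place recognition that combines a Voronoi cell prior with second-order statistics. It whitens the global descriptor to implicitly measure the Mahalanobis distance while keeping the cluster property, and addresses numerical instability during learning.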
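The link between whitening and the Mahalanobis distance is a general linear-algebra fact and can be checked numerically: Euclidean distance between whitened vectors equals the Mahalanobis distance between the originals. The sketch below uses random toy descriptors and a plain eigendecomposition; it is not the paper's whitening algorithm and ignores the numerical-stability measures the paper describes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "descriptors": 500 correlated 4-D vectors (illustrative data only).
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Whitening matrix W = cov^(-1/2) from the eigendecomposition of the
# symmetric covariance matrix.
eigval, eigvec = np.linalg.eigh(cov)
W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

def whiten(x):
    """Center and whiten a descriptor."""
    return (x - mu) @ W

a, b = X[0], X[1]
euclid_whitened = np.linalg.norm(whiten(a) - whiten(b))
mahalanobis = np.sqrt((a - b) @ np.linalg.inv(cov) @ (a - b))
```

Because `W @ W.T` equals the inverse covariance, the two quantities agree up to floating-point error, so a nearest-neighbor search with plain Euclidean distance on whitened descriptors ranks candidates by Mahalanobis distance.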
Details
Motivation: Existing second-order pooling methods in LiDAR place recognition (LPR) use conventional implementations and post-normalization, which leaves the descriptor ill-suited to Euclidean distance; the link between NetVLAD and second-order statistics offers a new angle for improvement.
Method: Propose a second-order pooling method with the inductive bias of Voronoi cells: first aggregate local descriptors into a second-order matrix, then whiten the global descriptor to implicitly measure the Mahalanobis distance while keeping the Voronoi cluster property, using several techniques to mitigate numerical instability during learning.
Result: Performance gains are verified on the Oxford Robotcar and Wild-Places benchmarks, and the numerical effect of the proposed whitening algorithm is analyzed.
Conclusion: The method improves the discriminative power and robustness of global descriptors in LPR, combining geometric structure modeling with numerical stability.
Abstract: The pooling layer plays a vital role in aggregating local descriptors into a metrizable global descriptor in LiDAR Place Recognition (LPR). In particular, second-order pooling is capable of capturing higher-order interactions among local descriptors. However, existing second-order methods in LPR adhere to conventional implementations and post-normalization, and yield descriptors unsuitable for Euclidean distance. Based on the recent interpretation that associates NetVLAD with second-order statistics, we propose to integrate second-order pooling with the inductive bias from Voronoi cells. Our novel pooling method aggregates local descriptors to form the second-order matrix and whitens the global descriptor to implicitly measure the Mahalanobis distance while conserving the cluster property from Voronoi cells, addressing its numerical instability during learning with diverse techniques. We demonstrate its performance gains through experiments conducted on the Oxford Robotcar and Wild-Places benchmarks and analyze the numerical effect of the proposed whitening algorithm.
[402] MMSpec: Benchmarking Speculative Decoding for Vision-Language Models
Hui Shen,Xin Wang,Ping Zhang,Yunta Hsieh,Qi Han,Zhongwei Wan,Ziheng Zhang,Jingxuan Zhang,Jing Xiong,Ziyuan Liu,Yifan Zhang,Hangrui Cao,Chenyang Zhao,Mi Zhang
Main category: cs.CV
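TL;DR: This paper introduces MMSpec, a benchmark for evaluating speculative decoding in vision-language models (VLMs), and, based on its findings, proposes a new method, ViSkip, to improve speculative decoding in multimodal settings.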
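For readers unfamiliar with speculative decoding, the generic greedy draft-and-verify loop can be sketched with toy next-token functions. This is the textbook scheme, not ViSkip or any of the benchmarked algorithms; `target_next`, `draft_next`, and the toy "models" are hypothetical stand-ins, and a real system would verify all drafted positions in one batched forward pass of the target model.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; output equals target-only greedy decoding."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) The target model checks the proposals in order.
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: take the target's token
                break
            accepted.append(tok)
        else:
            # All k proposals accepted: the target adds one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]

def greedy_decode(next_fn, prompt, max_new_tokens):
    """Baseline: plain greedy decoding with a single model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_fn(tokens))
    return tokens

def target_next(seq):
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10

def draft_next(seq):
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

spec_out = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
base_out = greedy_decode(target_next, [0], max_new_tokens=12)
```

The two outputs are identical by construction; the speedup in practice comes from accepting several drafted tokens per target-model call, which is why acceptance behavior on vision-heavy contexts matters for VLMs.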
Details
Motivation: Existing speculative decoding methods mainly target text-only large language models (LLMs), perform poorly on VLMs, and lack a unified multimodal evaluation benchmark; the authors aim to fill this gap and improve VLM inference efficiency.
Method: Build MMSpec, the first speculative decoding benchmark for VLMs (600 multimodal samples, 6 task categories, 10 algorithms); systematically analyze how existing methods behave in multimodal settings; and propose ViSkip, a vision-aware dynamic speculation method.
Result: Three findings emerge: text-only speculation methods degrade in multimodal settings; vision awareness matters more at larger batch sizes; and throughput speedup does not accurately reflect real latency improvement. ViSkip reaches SOTA performance on MMSpec.
Conclusion: Speculative decoding needs to be adapted to multimodal characteristics; ViSkip's dynamic vision-aware mechanism significantly improves VLM inference efficiency and latency, providing a benchmark and methodology for future VLM acceleration research.
Abstract: Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.
[403] Thermal Image Refinement with Depth Estimation using Recurrent Networks for Monocular ORB-SLAM3
Hürkan Şahin,Huy Xuan Pham,Van Huyen Dang,Alper Yegenoglu,Erdal Kayacan
Main category: cs.CV
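TL;DR: This paper proposes a lightweight depth estimation and SLAM method based on monocular thermal imaging for autonomous UAV navigation in GPS-denied and visually degraded environments. It uses a lightweight supervised network with recurrent blocks together with a thermal refinement network (T-RefNet), is trained on data from a non-radiometric thermal camera to cut hardware cost significantly, and is validated for depth accuracy and thermal SLAM robustness on several datasets and in real flights.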
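The depth accuracy figures quoted for this entry are absolute relative errors. The entry does not spell out the formula, so the sketch below assumes the definition commonly used in monocular depth estimation, the mean of |pred - gt| / gt over pixels with valid ground truth; the validity rule (`gt > 0`) and the toy arrays are assumptions for illustration.

```python
import numpy as np

def abs_rel_error(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0  # assumed convention: 0 marks pixels without ground truth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

gt = np.array([[2.0, 4.0], [0.0, 10.0]])     # depths in meters; 0.0 = no measurement
pred = np.array([[2.2, 3.6], [5.0, 10.0]])   # toy predictions
err = abs_rel_error(pred, gt)                # (0.1 + 0.1 + 0.0) / 3
```

Under this definition, an error of about 0.06 means predictions are off by roughly 6% of the true depth on average.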
Details
Motivation: Address the difficulty of autonomous UAV navigation in GPS-denied and visually degraded (e.g., low-light) environments while avoiding reliance on expensive radiometric thermal cameras.
Method: Propose a lightweight supervised depth estimation network with recurrent blocks that model temporal dependencies, combined with a thermal refinement network (T-RefNet) that improves feature visibility; feed the predicted depth maps and refined thermal images into ORB-SLAM3 for thermal-only SLAM; train on a custom non-radiometric thermal image dataset.
Result: On the radiometric VIVID++ (indoor-dark) dataset the absolute relative error is about 0.06 (baselines above 0.11); on the custom non-radiometric indoor dataset the error is below 0.10 (baselines above 0.24); thermal-only ORB-SLAM3 keeps the mean trajectory error under 0.4 m.
Conclusion: The method achieves accurate depth estimation and stable SLAM with a low-cost non-radiometric thermal camera, offering a feasible option for autonomous UAV navigation in low-light or no-light scenes.
Abstract: Autonomous navigation in GPS-denied and visually degraded environments remains challenging for unmanned aerial vehicles (UAVs). To this end, we investigate the use of a monocular thermal camera as a standalone sensor on a UAV platform for real-time depth estimation and simultaneous localization and mapping (SLAM). To extract depth information from thermal images, we propose a novel pipeline employing a lightweight supervised network with recurrent blocks (RBs) integrated to capture temporal dependencies, enabling more robust predictions. The network combines lightweight convolutional backbones with a thermal refinement network (T-RefNet) to refine raw thermal inputs and enhance feature visibility. The refined thermal images and predicted depth maps are integrated into ORB-SLAM3, enabling thermal-only localization. Unlike previous methods, the network is trained on a custom non-radiometric dataset, obviating the need for high-cost radiometric thermal cameras. Experimental results on datasets and UAV flights demonstrate competitive depth accuracy and robust SLAM performance under low-light conditions. On the radiometric VIVID++ (indoor-dark) dataset, our method achieves an absolute relative error of approximately 0.06, compared to baselines exceeding 0.11. In our non-radiometric indoor set, baseline errors remain above 0.24, whereas our approach remains below 0.10. Thermal-only ORB-SLAM3 maintains a mean trajectory error under 0.4 m.
[404] Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning
Nasrin Rahimi,Mısra Yavuz,Burak Can Biner,Yunus Bilge Kurt,Ahmet Rasim Emirdağı,Süleyman Aslan,Görkay Aydemir,M. Akın Yılmaz,A. Murat Tekalp
Main category: cs.CV
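TL;DR: This paper finds that pre-trained image editing models (e.g., Qwen-Image-Edit), despite having no explicit temporal modeling, carry latent temporal reasoning ability in their spatial priors. LoRA fine-tuning on only 64 to 256 samples is enough for effective video frame interpolation (VFI), offering an efficient new path for video synthesis under resource constraints.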
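As background on the adaptation technique named here, this is a minimal sketch of a LoRA-style layer: the frozen weight `W` is left untouched and only a low-rank pair `A`, `B` is trained, with the output computed as `W x + (alpha / r) * B (A x)`. It illustrates the general idea only; the matrix sizes, the `alpha` value, and the helper names are assumptions, and nothing here reflects how the paper wires LoRA into Qwen-Image-Edit.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Vector:
    """Frozen base output plus a scaled low-rank update B(Ax)."""
    r = len(A)                      # rank = number of rows of A
    base = matvec(W, x)             # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (toy values)
A = [[1.0, 1.0]]               # r x d_in, with r = 1
B_init = [[0.0], [0.0]]        # d_out x r, zero-initialised
B_trained = [[1.0], [2.0]]     # hypothetical values after fine-tuning

before = lora_forward([1.0, 2.0], W, A, B_init, alpha=2.0)
after = lora_forward([1.0, 2.0], W, A, B_trained, alpha=2.0)
```

With `B` initialised to zero the adapted layer starts out identical to the pre-trained one, and the number of trainable values is `r * (d_in + d_out)` rather than `d_in * d_out`, which is what makes training on a few dozen samples plausible.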
Details
Motivation: Pre-trained image editing models understand spatial and object transformations well but lack explicit temporal modeling; the paper explores whether they can be lightly adapted to video tasks such as frame interpolation, tapping temporal potential that has gone unused.
Method: Fine-tune the purely static image editing model Qwen-Image-Edit with Low-Rank Adaptation (LoRA) in a few-shot setting (64 to 256 samples) to transfer it to video frame interpolation (VFI), without adding any video-specific architecture or motion estimation module.
Result: The fine-tuned model is significantly better at VFI than the original model, which cannot generate coherent intermediate frames at all; this confirms that the spatial priors contain latent temporal reasoning that can be activated.
Conclusion: Foundation image editing models hold undiscovered temporal capability that a small amount of fine-tuning data can unlock. This suggests that spatial and temporal reasoning may be tightly coupled in foundation models and offers a new paradigm for data-efficient video generation.
Abstract: Pre-trained image editing models exhibit strong spatial reasoning and object-aware transformation capabilities acquired from billions of image-text pairs, yet they possess no explicit temporal modeling. This paper demonstrates that these spatial priors can be repurposed to unlock temporal synthesis capabilities through minimal adaptation - without introducing any video-specific architecture or motion estimation modules. We show that a large image editing model (Qwen-Image-Edit), originally designed solely for static instruction-based edits, can be adapted for Video Frame Interpolation (VFI) using only 64-256 training samples via Low-Rank Adaptation (LoRA). Our core contribution is revealing that the model's inherent understanding of "how objects transform" in static scenes contains latent temporal reasoning that can be activated through few-shot fine-tuning. While the baseline model completely fails at producing coherent intermediate frames, our parameter-efficient adaptation successfully unlocks its interpolation capability. Rather than competing with task-specific VFI methods trained from scratch on massive datasets, our work establishes that foundation image editing models possess untapped potential for temporal tasks, offering a data-efficient pathway for video synthesis in resource-constrained scenarios. This bridges the gap between image manipulation and video understanding, suggesting that spatial and temporal reasoning may be more intertwined in foundation models than previously recognized.
[405] Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
Kaixin Zhang,Xiaohe Li,Jiahao Li,Haohua Wu,Xinyu Zhao,Zide Fan,Lei Wang
Main category: cs.CV
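TL;DR: This paper proposes ClueNet, a clue-aware video reasoning framework that improves Video Question Answering (VideoQA) through two-stage supervised fine-tuning, addressing the weaknesses of existing MLLMs in temporal causal reasoning, evidence-grounded answer generation, and visual clue extraction.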
Details
Motivation: Existing end-to-end multi-modal large language models (MLLMs) lack explicit structured reasoning in VideoQA, which causes severe hallucination and poor interpretability, and they leave three core problems unaddressed: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment.
Method: Propose ClueNet with a two-stage supervised fine-tuning paradigm inspired by hierarchical human visual cognition. The first stage uses decoupled supervision to align clue extraction with chain-based reasoning; the second stage applies inference supervision with an adaptive clue filter and adds lightweight modules for inference efficiency, all without extensive changes to the base model.
Result: On NExT-QA, STAR, and MVBench, ClueNet outperforms state-of-the-art methods by at least 1.1%, while showing stronger generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility.
Conclusion: ClueNet bridges the perception-to-generation gap in MLLMs and offers an interpretable, trustworthy reasoning paradigm for high-stakes VideoQA.
Abstract: Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.
[406] Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing
Jiahe Song,Chuang Wang,Yinfan Wang,Hao Zheng,Rui Nie,Bowen Jiang,Xingjian Wei,Junyuan Gao,Yubin Wang,Bin Wang,Lijun Wu,Jiang Wu,Qian Yu,Conghui He
Main category: cs.CV
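TL;DR: This paper proposes new methods that strengthen vision-language models (VLMs) for reaction diagram parsing (RxnDP), namely Identifier as Visual Prompting (IdtVP) and the Re3-DAPO reinforcement learning algorithm, and releases the ScannedRxn benchmark dataset.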
Details
Motivation: Existing VLMs for reaction diagram parsing are limited by the difficulty of aligning visual chemical entities with pre-trained knowledge and by the mismatch between token-level training and reaction-level evaluation.
Method: (1) IdtVP uses molecule identifiers (e.g., bold numerals) as visual prompts to activate the chemical knowledge acquired during VLM pre-training; (2) Re3-DAPO, a reinforcement learning algorithm, directly optimizes reaction-level metrics through verifiable rewards; (3) the ScannedRxn benchmark of scanned reaction diagrams is constructed.
Result: IdtVP shows strong zero-shot and out-of-distribution generalization; Re3-DAPO consistently outperforms supervised fine-tuning in the fine-tuning setting; ScannedRxn strengthens the evaluation of model robustness and generalization.
Conclusion: The paper improves VLM-based RxnDP from two angles, prompting representation and learning paradigm, significantly raising accuracy and generalization, and all resources are to be open-sourced.
Abstract: Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.
[407] Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching
Fangran Miao,Jian Huang,Ting Li
Main category: cs.CV
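TL;DR: This paper proposes Riemannian Motion Generation (RMG), a framework that models human motion as a dynamical process on a product manifold and learns it with Riemannian flow matching, achieving state-of-the-art performance on several benchmarks.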
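To illustrate what interpolation on a product manifold can look like, the sketch below moves along a geodesic in a toy space made of a 3D translation (a straight line) and a unit quaternion (spherical linear interpolation). This is an assumption-laden illustration rather than RMG's representation: the choice of unit quaternions for the rotation factor and the function names `slerp` and `product_geodesic` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def slerp(q0: Sequence[float], q1: Sequence[float], t: float) -> List[float]:
    """Shortest geodesic between two unit quaternions, evaluated at t in [0, 1]."""
    d = sum(a * b for a, b in zip(q0, q1))
    if d < 0.0:
        # q and -q are the same rotation; flip to take the shorter arc.
        q1 = [-b for b in q1]
        d = -d
    theta = math.acos(min(1.0, d))
    if theta < 1e-8:
        return list(q0)  # endpoints coincide numerically
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(q0, q1)]


def product_geodesic(x0: Tuple[Sequence[float], Sequence[float]],
                     x1: Tuple[Sequence[float], Sequence[float]],
                     t: float) -> Tuple[List[float], List[float]]:
    """Interpolate each factor with its own geodesic: line for R^3, slerp for rotation."""
    trans = [(1.0 - t) * a + t * b for a, b in zip(x0[0], x1[0])]
    rot = slerp(x0[1], x1[1], t)
    return trans, rot


identity = [1.0, 0.0, 0.0, 0.0]
quarter_turn_z = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
trans_mid, rot_mid = product_geodesic(([0.0, 0.0, 0.0], identity),
                                      ([2.0, 0.0, 0.0], quarter_turn_z), 0.5)
```

The midpoint rotation stays a unit quaternion (a 45 degree turn about z here), whereas naive linear interpolation of the four quaternion components would leave the manifold and need renormalisation.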
Details
Motivation: Conventional methods model human motion in Euclidean space and ignore its non-Euclidean geometric structure; the paper aims to improve the realism and stability of motion generation through geometry-aware modeling.
Method: RMG factorizes motion into several manifold factors (e.g., translation plus rotations) and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling, giving a scale-free, intrinsically normalized representation.
Result: On HumanML3D, RMG achieves state-of-the-art FID (0.043) and ranks first on all reported metrics under the MotionStreamer format; on MotionMillion it reaches FID 5.6 and R@1 0.86, surpassing strong baselines; ablations show the T+R representation is the most stable and effective.
Conclusion: Geometry-aware manifold modeling is a practical and efficient route to high-fidelity, scalable human motion generation.
Abstract: Human motion generation is often learned in Euclidean spaces, although valid motions follow structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.
[408] Reference-Free Omnidirectional Stereo Matching via Multi-View Consistency Maximization
Lehuai Xu,Weiming Zhang,Yang Li,Sidan Du,Lin Wang
Main category: cs.CV
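TL;DR: This paper proposes FreeOmniMVS, a multi-fisheye stereo matching framework that needs no reference view and achieves robust, visibility-aware, globally consistent omnidirectional depth estimation by maximizing multi-view consistency.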
Details
Motivation: Existing methods do not explicitly model the geometric relationships between views, so they struggle to capture global dependencies, visibility, and scale changes, which limits performance under occlusion, partial overlap, and varying baselines.
Method: Propose the reference-free FreeOmniMVS framework: (1) a View-pair Correlation Transformer (VCT) explicitly models correlation volumes for all camera pairs so that unreliable matches can be dropped; (2) a lightweight adaptive attention mechanism provides visibility-aware, scalable fusion into a global consensus.
Result: Experiments on multiple benchmark datasets show the method's advantage in globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation.
Conclusion: FreeOmniMVS removes the dependence on a reference view and, by modeling multi-view consistency, significantly improves the robustness and generalization of depth estimation in complex scenes.
Abstract: Reliable omnidirectional depth estimation from multi-fisheye stereo matching is pivotal to many applications, such as embodied robotics. Existing approaches either rely on spherical sweeping with heuristic fusion strategies to build the cost columns or perform reference-centric stereo matching based on rectified views. However, these methods fail to explicitly exploit geometric relationships between multiple views, rendering them less capable of capturing the global dependencies, visibility, or scale changes. In this paper, we shift to a new perspective and propose a novel reference-free framework, dubbed FreeOmniMVS, via multi-view consistency maximization. The highlight of FreeOmniMVS is that it can aggregate pair-wise correlations into a robust, visibility-aware, and global consensus. As such, it is tolerant to occlusions, partial overlaps, and varying baselines. Specifically, to achieve global coherence, we introduce a novel View-pair Correlation Transformer (VCT) that explicitly models pairwise correlation volumes across all camera view pairs, allowing us to drop unreliable pairs caused by occlusion or out-of-focus observations. To realize scalable and visibility-aware consensus, we propose a lightweight attention mechanism that adaptively fuses the correlation vectors, eliminating the need for a designated reference view and allowing all cameras to contribute equally to the stereo matching process. Extensive experiments on diverse benchmark datasets demonstrate the superiority of our method for globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation.
[409] One CT Unified Model Training Framework to Rule All Scanning Protocols
Fengzhi Xu,Ziyuan Yang,Zexin Lu,Yingyu Chen,Fenglei Fan,Hongming Shan,Yi Zhang
Main category: cs.CV
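TL;DR: This paper proposes an Uncertainty-Guided Manifold Smoothing (UMS) framework for enhancing non-ideal measurement CT (NICT) images without paired data. It identifies discrete sub-manifolds and uses classifier-predicted uncertainty to guide sample generation, filling the gaps between sub-manifolds and making the feature space continuous.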
Details
Motivation: Existing NICT enhancement methods either depend on paired data that is hard to obtain or assume homogeneous noise and ignore differences between scanning protocols, which leads to poor generalization and model collapse. In practice, different protocols correspond to discrete sub-manifolds in feature space, contradicting the homogeneous-noise assumption.
Method: Propose the Uncertainty-Guided Manifold Smoothing (UMS) framework: (1) a classifier identifies sub-manifolds and outputs uncertainty scores; (2) these scores guide the generation of diverse samples across sub-manifolds; (3) a global- and sub-manifold-driven dynamic network architecture fuses global and subdomain-specific features.
Result: Experiments on multiple public datasets validate UMS across different generation paradigms, with significantly better CT reconstruction quality and generalization in the unpaired setting.
Conclusion: By modeling the sub-manifold structure induced by scanning protocols and introducing uncertainty-guided smoothing, UMS relieves the performance bottleneck that protocol heterogeneity causes in unsupervised NICT enhancement and offers a more practical, robust solution for clinical low-dose CT.
Abstract: Non-ideal measurement computed tomography (NICT), which lowers radiation at the cost of image quality, is expanding the clinical use of CT. Although unified models have shown promise in NICT enhancement, most methods require paired data, which is an impractical demand due to inevitable organ motion. Unsupervised approaches attempt to overcome this limitation, but their assumption of homogeneous noise neglects the variability of scanning protocols, leading to poor generalization and potential model collapse. We further observe that distinct scanning protocols, which correspond to different physical imaging processes, produce discrete sub-manifolds in the feature space, contradicting these assumptions and limiting their effectiveness. To address this, we propose an Uncertainty-Guided Manifold Smoothing (UMS) framework to bridge the gaps between sub-manifolds. A classifier in UMS identifies sub-manifolds and predicts uncertainty scores, which guide the generation of diverse samples across the entire manifold. By leveraging the classifier's capability, UMS effectively fills the gaps between discrete sub-manifolds, and promotes a continuous and dense feature space. Because the global manifold is complex, it is hard to model directly. Therefore, we propose to dynamically incorporate the global- and sub-manifold-specific features. Specifically, we design a global- and sub-manifold-driven architecture guided by the classifier, which enables dynamic adaptation to subdomain variations. This dynamic mechanism improves the network's capacity to capture both shared and domain-specific features, thereby improving reconstruction performance. Extensive experiments on public datasets are conducted to validate the effectiveness of our method across different generation paradigms.
[410] Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Omer Ben Hayun,Roy Betser,Meir Yossef Levi,Levi Kassel,Guy Gilboa
Main category: cs.CV
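TL;DR: This paper proposes STALL, a training-free, zero-shot detector of generated videos that jointly models spatial and temporal features and scores likelihood against real-data statistics, clearly outperforming existing image and video detection baselines.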
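The general idea of scoring content against real-data statistics can be sketched as follows: fit a simple density to features of real videos, then flag inputs whose log-likelihood is low. This is not STALL's model. The diagonal Gaussian, the two-dimensional toy features, and the function names are all assumptions chosen to keep the example small.

```python
import math
from typing import List, Sequence, Tuple


def fit_diag_gaussian(features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Estimate per-dimension mean and variance from real-data features only."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    # A small constant keeps the variance strictly positive.
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / n + 1e-6 for j in range(d)]
    return mean, var


def log_likelihood(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


# Hypothetical per-video features, e.g. one spatial and one temporal statistic.
real_features = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2], [0.1, 0.9]]
mean, var = fit_diag_gaussian(real_features)

score_typical = log_likelihood([0.0, 1.0], mean, var)
score_outlier = log_likelihood([3.0, -2.0], mean, var)
```

No synthetic examples are needed to fit the statistics, which is what makes this family of detectors independent of any particular generator; a threshold on the score would still have to be chosen on held-out real data.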
Details
Motivation: Existing image detectors ignore temporal information and supervised video detectors generalize poorly to new generative models, so a zero-shot detection method is needed that uses no synthetic data, requires no training, and is model-agnostic.
Method: STALL is a zero-shot detector built on a probabilistic framework that jointly models the spatial and temporal features of a video and scores likelihood using real-data statistics, with no training and no synthetic data.
Result: STALL consistently outperforms prior image- and video-based detection baselines on two public benchmarks and on the newly built ComGenVid benchmark, which covers state-of-the-art generative models.
Conclusion: STALL offers a simple, theoretically justified, training-free, and model-agnostic approach to detecting generated videos, addressing the generalization challenge posed by rapidly evolving generative models.
Abstract: Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.
[411] GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
Yang Li,Yuchen Liu,Haoyu Lu,Zhiqiang Xia,Hongzhen Wang,Kaiyang Han,Changpeng Yang,Jinyang Wu,Jiaming Xu,Runyu Shi,Ying Huang
Main category: cs.CV
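TL;DR: This paper presents GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents. It covers 201 mainstream apps and four device types and systematically evaluates the end-to-end abilities of MLLMs in real device environments along five dimensions: perception, planning, reflection, execution, and evaluation.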
Details
Motivation: Existing benchmarks are mostly English-centric and cannot reflect the linguistic and interaction characteristics of the Chinese mobile ecosystem, and they lack a unified, fine-grained framework for evaluating the full capability chain from perception to execution.
Method: Build the GUI-CEval benchmark on real physical devices, with a two-level structure (atomic abilities plus application-level performance) covering five capability dimensions, and ensure data authenticity and reproducibility through multi-stage manual collection and verification.
Result: Experiments on 20 representative MLLMs and multi-agent systems show that Qwen2.5-VL and UI-TARS perform comparatively well, but most models have clear weaknesses in reflective decision-making and post-action self-evaluation.
Conclusion: GUI-CEval provides a comprehensive, interpretable benchmark for Chinese mobile GUI agents that supports capability diagnosis and technical progress.
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.
[412] SRL-MAD: Structured Residual Latents for One-Class Morphing Attack Detection
Diogo J. Paulo,Hugo Proença,João C. Neves
Main category: cs.CV
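TL;DR: This paper proposes SRL-MAD, a one-class, single-image face morphing attack detection method based on structured residual Fourier representations. Using a ring-based frequency representation, a learnable band projection, and cross-band interaction modeling, it significantly improves detection of unseen morphing attacks without training on any morphing attack samples.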
Details
Motivation: Existing supervised morphing attack detection methods depend on labeled attack data and generalize poorly, so one-class detection methods trained only on bona fide face samples are needed to handle unseen attacks.
Method: SRL-MAD builds a residual frequency map that suppresses image-specific trends; uses a ring-based two-dimensional Fourier representation in place of azimuthal averaging; adds a learnable ring-wise spectral projection; groups evidence into low, mid, and high frequency bands and models cross-band interactions; and maps the structured spectral features into a discriminative latent space for direct scoring, avoiding reliance on reconstruction error.
Result: On the FERET-Morph, FRLL-Morph, and MorDIFF datasets, SRL-MAD consistently outperforms recent one-class and supervised MAD methods.
Conclusion: Learning frequency-aware projections is more discriminative than conventional azimuthal spectral summarization and gives a better approach to one-class morphing attack detection.
Abstract: Face morphing attacks represent a significant threat to biometric systems as they allow multiple identities to be combined into a single face. While supervised morphing attack detection (MAD) methods have shown promising performance, their reliance on attack-labeled data limits generalization to unseen morphing attacks. This has motivated increasing interest in one-class MAD, where models are trained exclusively on bona fide samples and are expected to detect unseen attacks as deviations from the normal facial structure. In this context, we introduce SRL-MAD, a one-class single-image MAD that uses structured residual Fourier representations for open-set morphing attack detection. Starting from a residual frequency map that suppresses image-specific spectral trends, we preserve the two-dimensional organization of the Fourier domain through a ring-based representation and replace azimuthal averaging with a learnable ring-wise spectral projection. To further encode domain knowledge about where morphing artifacts arise, we impose a frequency-informed inductive bias by organizing spectral evidence into low, mid, and high-frequency bands and learning cross-band interactions. These structured spectral features are mapped into a latent space designed for direct scoring, avoiding the reliance on reconstruction errors. Extensive evaluation on FERET-Morph, FRLL-Morph, and MorDIFF demonstrates that SRL-MAD consistently outperforms recent one-class and supervised MAD models. Overall, our results show that learning frequency-aware projections provides a more discriminative alternative to azimuthal spectral summarization for one-class morphing attack detection.
[413] The Good, the Better, and the Best: Improving the Discriminability of Face Embeddings through Attribute-aware Learning
Ana Dias,João Ribeiro Pinto,Hugo Proença,João C. Neves
Main category: cs.CV
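TL;DR: This paper proposes an attribute-aware face recognition architecture that jointly learns identity labels together with identity-relevant and non-identity-related facial attributes, improving the discriminability of embeddings and making it possible to diagnose whether a model relies on redundant attributes for shortcut learning.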
Details
Motivation: Existing methods use fixed, heterogeneous sets of facial attributes for auxiliary supervision, implicitly assuming every attribute matters equally for identity recognition, whereas in practice attributes differ in discriminative power and some even introduce harmful bias.
Method: Organize attributes into interpretable groups and jointly supervise identity class labels, identity-relevant attributes, and non-identity-related attributes, explicitly guiding the model to learn or unlearn specific attributes.
Result: On standard face verification benchmarks: (i) supervising with only the identity-relevant attribute subset beats supervising with all attributes; (ii) explicitly making embeddings unlearn non-identity-related attributes improves performance further; and the method can serve as a diagnostic tool that quantifies the degree of shortcut learning.
Conclusion: Attributes should be modeled and supervised differently: identity-relevant attributes strengthen discriminability, while suppressing irrelevant attributes mitigates shortcut learning and improves robustness and interpretability.
Abstract: Despite recent advances in face recognition, robust performance remains challenging under large variations in age, pose, and occlusion. A common strategy to address these issues is to guide representation learning with auxiliary supervision from facial attributes, encouraging the visual encoder to focus on identity-relevant regions. However, existing approaches typically rely on heterogeneous and fixed sets of attributes, implicitly assuming equal relevance across attributes. This assumption is suboptimal, as different attributes exhibit varying discriminative power for identity recognition, and some may even introduce harmful biases. In this paper, we propose an attribute-aware face recognition architecture that supervises the learning of facial embeddings using identity class labels, identity-relevant facial attributes, and non-identity-related attributes. Facial attributes are organized into interpretable groups, making it possible to decompose and analyze their individual contributions in a human-understandable manner. Experiments on standard face verification benchmarks demonstrate that joint learning of identity and facial attributes improves the discriminability of face embeddings with two major conclusions: (i) using identity-relevant subsets of facial attributes consistently outperforms supervision with a broader attribute set, and (ii) explicitly forcing embeddings to unlearn non-identity-related attributes yields further performance gains compared to leaving such attributes unsupervised. Additionally, our method serves as a diagnostic tool for assessing the trustworthiness of face recognition encoders by allowing for the measurement of accuracy gains with suppression of non-identity-relevant attributes, with such gains suggesting shortcut learning from redundant attributes associated with each identity.
[414] ReactMotion: Generating Reactive Listener Motions from Speaker Utterance
Cheng Luo,Bizhu Wu,Bing Li,Jianfeng Ren,Ruibin Bai,Rong Qu,Linlin Shen,Bernard Ghanem
Main category: cs.CV
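TL;DR: This paper introduces a new task, generating reactive listener body motions from a speaker's utterance, builds the ReactMotionNet dataset and preference-oriented evaluation protocols, and proposes ReactMotion, a unified generative framework that outperforms baselines in naturalness, diversity, and reactive appropriateness.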
Details
Motivation: Modeling human nonverbal listening behavior is challenging because reactions are inherently non-deterministic; existing work does not model the one-to-many nature of reactions or evaluate their appropriateness.
Method: Build ReactMotionNet, a large-scale dataset with multiple candidate listener motions per utterance; design preference-oriented evaluation protocols; and propose ReactMotion, a generative framework that jointly models text, audio, emotion, and motion and is trained with preference-based objectives.
Result: ReactMotion significantly outperforms retrieval baselines and cascaded LLM pipelines in naturalness, diversity, and reactive appropriateness.
Conclusion: The work establishes a new benchmark for reactive listener motion generation and confirms the value of preference learning and joint multimodal modeling for improving the appropriateness and diversity of reactions.
Abstract: In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, which conventional motion metrics focused on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
[415] Learning from Limited and Incomplete Data: A Multimodal Framework for Predicting Pathological Response in NSCLC
Alice Natalina Caragliano,Giulia Farina,Fatih Aksu,Camillo Maria Caruso,Claudia Tacconi,Carlo Greco,Lorenzo Nibid,Edy Ippolito,Michele Fiore,Giuseppe Perrone,Sara Ramella,Paolo Soda,Valerio Guarrasi
Main category: cs.CV
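TL;DR: This paper proposes a multimodal deep learning framework that combines foundation model-based CT feature extraction with missing-aware modeling of clinical variables. Without conventional imputation, it accurately predicts major pathological response (pR) after neoadjuvant therapy in non-small cell lung cancer in realistic settings with small cohorts and incomplete clinical data.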
Details
Motivation: Accurate preoperative prediction of major pathological response (pR) after neoadjuvant therapy in non-small cell lung cancer is clinically important, but in the real world it is constrained by small datasets and incomplete clinical information, which makes existing methods hard to apply robustly.
Method: Build a multimodal deep learning framework that (1) extracts CT imaging features with a foundation model, (2) handles incomplete clinical variables with a missing-aware architecture, and (3) integrates the imaging and clinical modalities through a weighted fusion mechanism, avoiding conventional missing-value imputation throughout.
Result: The multimodal model is robust on a small cohort and consistently outperforms unimodal baselines that use only imaging or only clinical variables.
Conclusion: A multimodal strategy that integrates heterogeneous data sources and explicitly models missing information can make pR prediction more reliable and practical under real clinical conditions.
Abstract: Major pathological response (pR) following neoadjuvant therapy is a clinically meaningful endpoint in non-small cell lung cancer, strongly associated with improved survival. However, accurate preoperative prediction of pR remains challenging, particularly in real-world clinical settings characterized by limited data availability and incomplete clinical profiles. In this study, we propose a multimodal deep learning framework designed to address these constraints by integrating foundation model-based CT feature extraction with a missing-aware architecture for clinical variables. This approach enables robust learning from small cohorts while explicitly modeling missing clinical information, without relying on conventional imputation strategies. A weighted fusion mechanism is employed to leverage the complementary contributions of imaging and clinical modalities, yielding a multimodal model that consistently outperforms both unimodal imaging and clinical baselines. These findings underscore the added value of integrating heterogeneous data sources and highlight the potential of multimodal, missing-aware systems to support pR prediction under realistic clinical conditions.
[416] PAKAN: Pixel Adaptive Kolmogorov-Arnold Network Modules for Pansharpening
Haoyu Zhang,Haojing Chen,Zhen Zhong,Liangjian Deng
Main category: cs.CV
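TL;DR: This paper proposes a Pixel Adaptive Kolmogorov-Arnold Network (PAKAN) with 2D and 1D adaptive variants that improve spatial-spectral feature fusion and refinement in pansharpening, significantly outperforming existing methods.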
Details
Motivation: Existing deep neural networks for pansharpening mostly use static activation functions, which makes it hard to model complex non-linear spatial-spectral mappings dynamically; conventional KANs have learnable activations but lack dynamic adaptability at inference time.
Method: Starting from KAN, build two pixel-adaptive variants, a 2D Adaptive KAN that generates spline weights across spatial dimensions and a 1D Adaptive KAN that generates them across spectral channels, and assemble them into PAKAN 2to1 for feature fusion and PAKAN 1to1 for feature refinement.
Result: Extensive experiments show that the proposed modules significantly improve network performance and outperform existing methods across multiple datasets and metrics, confirming the effectiveness of pixel-adaptive activation for pansharpening.
Conclusion: Pixel-adaptive activation models the spatial-spectral relationships in pansharpening more flexibly and precisely, offering a new idea and a better solution for the task.
Abstract: Pansharpening aims to fuse high-resolution spatial details from panchromatic images with the rich spectral information of multispectral images. Existing deep neural networks for this task typically rely on static activation functions, which limit their ability to dynamically model the complex, non-linear mappings required for optimal spatial-spectral fusion. While the recently introduced Kolmogorov-Arnold Network (KAN) utilizes learnable activation functions, traditional KANs lack dynamic adaptability during inference. To address this limitation, we propose a Pixel Adaptive Kolmogorov-Arnold Network framework. Starting from KAN, we design two adaptive variants: a 2D Adaptive KAN that generates spline summation weights across spatial dimensions and a 1D Adaptive KAN that generates them across spectral channels. These two components are then assembled into PAKAN 2to1 for feature fusion and PAKAN 1to1 for feature refinement. Extensive experiments demonstrate that our proposed modules significantly enhance network performance, proving the effectiveness and superiority of pixel-adaptive activation in pansharpening tasks.
[417] VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents
Udi Barzelay,Ophir Azulai,Inbar Shapira,Idan Friedman,Foad Abo Dahood,Madison Lee,Abraham Daniels
Main category: cs.CV
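TL;DR: VAREX is a new benchmark for evaluating how well multimodal foundation models extract structured data from government forms. It uses reverse annotation to generate 1,777 documents with deterministic ground truth, covers four input modalities, and reveals that small models are bottlenecked by structured output compliance and that layout-aware text plays a key role.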
Details
Motivation: Existing benchmarks usually provide a single input representation and lack a systematic analysis of how input modality affects results; in addition, small models (at most 4B parameters) aimed at cost- and latency-sensitive settings have not been adequately evaluated on structured data extraction.
Method: Propose the VAREX benchmark, using a Reverse Annotation pipeline to automatically generate PDF form documents with deterministic ground truth; build a dataset of 1,777 documents with 1,771 unique schemas, each document provided in four input modalities (plain text, layout-preserving text, image, and text plus image); systematically evaluate 20 models from frontier proprietary to small open models (especially those with at most 4B parameters), with ablation and fine-tuning experiments.
Result: (1) For models with at most 4B parameters the main bottleneck is structured output compliance rather than extraction ability, and the "schema echo" phenomenon lowers scores by 45-65 percentage points; (2) a 2B model fine-tuned on the extraction task gains 81 percentage points; (3) layout-preserving text brings a larger accuracy gain than images (+3-18 pp); (4) the benchmark discriminates best among models in the 60-95% accuracy band.
Conclusion: VAREX fills the gap in multimodal structured extraction evaluation for controlled cross-modality comparison and small-model suitability. The results suggest that improving instruction following, structured output in particular, is more effective than simply scaling up the model, and that layout information matters more than visual pixels.
Abstract: We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy -- a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models <=4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.
[418] A Tutorial on ALOS2 SAR Utilization: Dataset Preparation, Self-Supervised Pretraining, and Semantic Segmentation
Nevrez Imamoglu,Ali Caglayan,Toru Kouyama
Main category: cs.CV
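TL;DR: This paper proposes SAR-W-SimMIM, which adds an intensity-weighted loss to reduce the impact of speckle noise and extreme intensity values in SAR images, and builds an ALOS-2 SAR dataset for the Japan region, showing improved semantic segmentation performance over random initialization and SAR-W-MixMAE.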
Details
Motivation: SAR图像因高噪声、缺乏语义标注,限制了掩码自编码器(如MAE)的应用;同时区域特定模型面临地物覆盖分布不均(如水体、森林、沙漠主导)带来的偏差问题。 Method: 提出SAR-W-SimMIM:在SimMIM基础上引入SAR强度加权损失;构建基于ALOS-2单通道(HH极化)的日本地区SAR数据集;采用ViT架构自编码器进行自监督预训练,再微调用于语义分割。 Result: SAR-W-SimMIM在语义分割任务上显著优于随机初始化和先前的SAR-W-MixMAE;所构建数据集支持区域专用基础模型开发,预训练+微调范式带来明显性能提升。 Conclusion: SAR-W-SimMIM与面向日本地区的ALOS-2 SAR数据集为SAR图像自监督学习提供了有效方案,推动了区域专用遥感基础模型的发展。 Abstract: Masked auto-encoders (MAE) and related approaches have shown promise for satellite imagery, but their application to synthetic aperture radar (SAR) remains limited due to challenges in semantic labeling and high noise levels. Building on our prior work with SAR-W-MixMAE, which adds SAR-specific intensity-weighted loss to standard MixMAE for pretraining, we also introduce SAR-W-SimMIM; a weighted variant of SimMIM applied to ALOS-2 single-channel SAR imagery. This method aims to reduce the impact of speckle and extreme intensity values during self-supervised pretraining. We evaluate its effect on semantic segmentation compared to our previous trial with SAR-W-MixMAE and random initialization, observing notable improvements. In addition, pretraining and fine-tuning models on satellite imagery pose unique challenges, particularly when developing region-specific models. Imbalanced land cover distributions such as dominant water, forest, or desert areas can introduce bias, affecting both pretraining and downstream tasks like land cover segmentation. To address this, we constructed a SAR dataset using ALOS-2 single-channel (HH polarization) imagery focused on the Japan region, marking the initial phase toward a national-scale foundation model. This dataset was used to pretrain a vision transformer-based autoencoder, with the resulting encoder fine-tuned for semantic segmentation using a task-specific decoder. Initial results demonstrate significant performance improvements compared to training from scratch with random initialization. 
In summary, this work provides a guide for processing and preparing ALOS-2 observations into a dataset that can be used for self-supervised pretraining of models and for fine-tuning on downstream tasks such as semantic segmentation.
[419] Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors
Yunuo Chen,Chuqin Zhou,Jiangchuan Li,Xiaoyue Ling,Bing He,Jincheng Dai,Li Song,Guo Lu
Main category: cs.CV
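TL;DR: This paper proposes a new paradigm for ultra-low-bitrate image compression that uses a pretrained video diffusion model (VDM) to model generative decoding. A semantically faithful anchor frame serves as an intermediate state, and decoding is treated as a virtual temporal evolution from the anchor frame to the target image, which substantially lowers bitrate and speeds up decoding while maintaining high perceptual quality.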
Details
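Motivation: Existing image-diffusion-based methods for ultra-low-bitrate image compression lack a semantically consistent intermediate representation during decoding, which limits the fidelity and realism of reconstructions. Video diffusion models naturally model temporal evolution between frames and can be transferred to improve the perceptual quality and efficiency of image compression.
Method: Define a compact anchor frame as the intermediate decoding state, preserving scene geometry and semantic layout; reinterpret generative decoding as a virtual temporal transition from this anchor frame to the original image; use a pretrained video diffusion model as a temporal prior, with the anchor frame as the initial frame and the original image as the prediction target, turning decoding into a multi-step or single-step "next-frame prediction" task.
Result: On the CLIC2020 test set, the method achieves over 50% bitrate savings relative to DiffC on LPIPS, DISTS, FID, and KID, with up to 5× faster decoding; subjective and objective evaluations both show better reconstruction quality and realism.
Conclusion: The ULB-IC paradigm, with an explicit anchor frame and a video diffusion temporal prior, decouples semantic structure from detail reconstruction and combines high compression, high fidelity, and efficient decoding, offering a new direction for generative image compression.
Abstract: We present a novel paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the "temporal" evolution in generative image compression. Specifically, we define an explicit intermediate state during decoding: a compact anchor frame, which preserves the scene geometry and semantic layout while discarding high-frequency details. We then reinterpret generative decoding as a virtual temporal transition from this anchor to the final reconstructed image. To model this progression, we leverage a pretrained video diffusion model (VDM) as a temporal prior: the anchor frame serves as the initial frame and the original image as the target frame, transforming the decoding process into a next-frame prediction task. In contrast to image diffusion-based ULB-IC models, our decoding proceeds from a visible, semantically faithful anchor, which improves both fidelity and realism for perceptual image compression. Extensive experiments demonstrate that our method achieves superior objective and subjective performance. On the CLIC2020 test set, our method achieves over 50% bitrate savings across LPIPS, DISTS, FID, and KID compared to DiffC, while also delivering a significant decoding speedup of up to 5×. Code will be released later.
[420] Low-light Image Enhancement with Retinex Decomposition in Latent Space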
Bolun Zheng,Qingshan Lei,Quan Chen,Qianyu Zhang,Kainan Yu,Xu Jia,Lingyu Zhu
Main category: cs.CV
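TL;DR: This paper proposes a Retinex-Guided Transformer (RGT) that improves low-light image enhancement with a two-stage (decomposition and enhancement) strategy. A log transformation and pixel offset turn the multiplicative relationship into an additive one, which stabilizes decomposition, and a U-shaped component refiner with guidance fusion transformer blocks refines the reflectance and illumination components.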
Details
Motivation: Existing Retinex-based methods are limited in how accurately they decompose the reflectance and illumination components, which hurts low-light image enhancement.
Method: Propose the two-stage Retinex-Guided Transformer (RGT). The first stage uses a latent-space decomposition strategy that combines a log transformation with a 1-pixel offset to turn the multiplicative relationship into an additive one. The second stage builds a U-shaped component refiner with guidance fusion transformer blocks that refines reflectance (preserving texture) and illumination (optimizing its distribution).
Result: Experiments on four benchmark datasets show competitive performance on low-light image enhancement and a more stable training process.
Conclusion: RGT improves the accuracy of Retinex decomposition and the quality of enhancement, offering a practical framework for low-light image enhancement.
Abstract: Retinex theory provides a principled foundation for low-light image enhancement, inspiring numerous learning-based methods that integrate its principles. However, existing methods exhibit limitations in accurately decomposing reflectance and illumination components. To address this, we propose a Retinex-Guided Transformer (RGT) model, which is a two-stage model consisting of decomposition and enhancement phases. First, we propose a latent space decomposition strategy to separate reflectance and illumination components. By incorporating the log transformation and 1-pixel offset, we convert the intrinsically multiplicative relationship into an additive formulation, enhancing decomposition stability and precision. Subsequently, we construct a U-shaped component refiner incorporating the proposed guidance fusion transformer block. The component refiner refines the reflectance component to preserve texture details and optimizes the illumination distribution, effectively transforming low-light inputs to normal-light counterparts. Experimental evaluations across four benchmark datasets validate that our method achieves competitive performance in low-light enhancement and a more stable training process.
[421] WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
Hainuo Wang,Mingjia Li,Xiaojie Guo
Main category: cs.CV
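TL;DR: This paper proposes Waypoint Diffusion Transformers (WiT), which factorize the continuous vector field in pixel space through semantic waypoints, resolving the trajectory conflicts that arise in Flow Matching models because the pixel manifold lacks semantic continuity. WiT outperforms strong baselines on ImageNet 256×256 and speeds up training convergence by 2.2×.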
Details
Motivation: Flow Matching models that operate in pixel space avoid the reconstruction bottleneck of latent autoencoders, but the pixel manifold lacks semantic continuity, so optimal transport paths become heavily entangled and trajectories conflict near intersections, yielding sub-optimal solutions.
Method: Propose Waypoint Diffusion Transformers (WiT), which use pre-trained vision models to extract intermediate semantic waypoints and factorize the continuous vector field into prior-to-waypoint and waypoint-to-pixel segments. During iterative denoising, a lightweight generator dynamically infers waypoints from the current noisy state, and these continuously condition the primary diffusion transformer through the Just-Pixel AdaLN mechanism.
Result: On ImageNet 256×256, WiT outperforms strong pixel-space baselines and accelerates JiT training convergence by 2.2×.
Conclusion: By explicitly introducing semantic waypoints to untangle pixel-space generation trajectories, WiT mitigates the entanglement of optimal transport paths without resorting to information-lossy latent representations, improving generation quality and training efficiency.
Abstract: While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
[422] Context-Aware Sensor Modeling for Asynchronous Multi-Sensor Tracking in Stone Soup
Martin Vonheim Larsen,Kim Mathiassen
Main category: cs.CV
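TL;DR: This paper proposes the DetectorContext abstraction for the open-source multi-target tracking framework Stone Soup. By modeling detection probability and clutter intensity as state-dependent functions, it improves multi-sensor tracking with asynchronous, partially overlapping sensors.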
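A minimal sketch of why a state-dependent detection probability matters, assuming a standard Bernoulli existence update for a scan with no associated detection. It does not use Stone Soup or the paper's DetectorContext API; the function names, the range-based coverage rule, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def detection_probability(state_xy, sensor_xy, max_range, base_pd):
    """Hypothetical coverage model: base_pd inside the sensor range, 0 outside."""
    dist = math.hypot(state_xy[0] - sensor_xy[0], state_xy[1] - sensor_xy[1])
    return base_pd if dist <= max_range else 0.0


def missed_detection_update(existence_prob, p_d):
    """Bernoulli existence update after a scan with no associated detection."""
    return existence_prob * (1.0 - p_d) / (1.0 - existence_prob * p_d)


def run_scans(existence_prob, p_d, n_scans):
    """Apply the missed-detection update for n_scans consecutive empty scans."""
    for _ in range(n_scans):
        existence_prob = missed_detection_update(existence_prob, p_d)
    return existence_prob


target = (100.0, 0.0)   # target position, outside the high-rate sensor's range
radar = (0.0, 0.0)      # high-rate sensor position (assumed)

uniform_pd = 0.9        # globally uniform assumption: the sensor "should" see it
context_pd = detection_probability(target, radar, max_range=50.0, base_pd=0.9)

eroded = run_scans(0.9, uniform_pd, n_scans=5)      # track confidence collapses
preserved = run_scans(0.9, context_pd, n_scans=5)   # track confidence is kept
print(eroded, preserved)
```

With a uniform detection probability, five empty scans from a sensor that cannot see the target drive the existence probability toward zero; with the coverage-aware value the update leaves it unchanged, which is the erosion effect the abstract describes.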
Details
Motivation: Probabilistic tracking methods often assume globally uniform observability. With asynchronous, multi-rate sensors that only partially overlap, repeated non-detections from high-rate sensors erode tracks observed only by low-rate sensors, degrading fusion performance.
Method: Propose the DetectorContext abstraction, which makes detection probability and clutter intensity functions of the state and sensing context, evaluated dynamically during hypothesis formation; it integrates with existing probabilistic trackers without changes to their update equations.
Result: Experiments on asynchronous radar-lidar data show that context-aware modeling restores stable fusion and significantly improves HOTA and GOSPA without increasing the number of false tracks.
Conclusion: DetectorContext mitigates the tracking degradation caused by fixed observability assumptions under multi-rate, partially overlapping sensing, improving the robustness and accuracy of heterogeneous multi-sensor fusion.
Abstract: Multi-sensor tracking in the real world involves asynchronous sensors with partial coverage and heterogeneous detection performance. Although probabilistic tracking methods permit detection probability and clutter intensity to depend on state and sensing context, many practical frameworks enforce globally uniform observability assumptions. Under multi-rate and partially overlapping sensing, this simplification causes repeated non-detections from high-rate sensors to erode tracks visible only to low-rate sensors, potentially degrading fusion performance. We introduce DetectorContext, an abstraction for the open-source multi-target tracking framework Stone Soup. DetectorContext exposes detection probability and clutter intensity as state-dependent functions evaluated during hypothesis formation. The abstraction integrates with existing probabilistic trackers without modifying their update equations. Experiments on asynchronous radar-lidar data demonstrate that context-aware modeling restores stable fusion and significantly improves HOTA and GOSPA performance without increasing false tracks.
[423] SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation
Shufan Li,Jiuxiang Gu,Kangning Liu,Zhe Lin,Aditya Grover,Jason Kuen
Main category: cs.CV
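TL;DR: This paper proposes SNCE, a new training objective that addresses the optimization difficulty of large-codebook discrete image generators. It builds a soft categorical distribution instead of a hard one-hot target, improving convergence speed and generation quality.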
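A minimal sketch of a soft-target cross entropy in the spirit of SNCE, not the authors' implementation. The abstract only says each neighboring token's probability is proportional to the proximity between its code embedding and the ground-truth image embedding; the choices below (squared Euclidean distance, `exp(-d / temperature)` as the proximity, a top-`k` neighbor set, and the toy codebook) are assumptions made for illustration.

```python
import math


def soft_neighbor_target(gt_embedding, codebook, k, temperature):
    """Soft target over the k codes nearest to the ground-truth embedding."""
    dists = [sum((c - g) ** 2 for c, g in zip(code, gt_embedding))
             for code in codebook]
    neighbors = sorted(range(len(codebook)), key=lambda i: dists[i])[:k]
    # Assumed proximity: exponentially decaying in squared distance.
    weights = {i: math.exp(-dists[i] / temperature) for i in neighbors}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(logits, target):
    """Cross entropy between the soft target and the model's token distribution."""
    log_probs = log_softmax(logits)
    return -sum(p * log_probs[i] for i, p in target.items())


# Toy codebook with four 2-D code embeddings (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
gt = (0.1, 0.0)                      # ground-truth image embedding
target = soft_neighbor_target(gt, codebook, k=3, temperature=1.0)
logits = [2.0, 1.0, 0.5, -1.0]       # model outputs over the four tokens
loss = soft_cross_entropy(logits, target)
print(target, loss)
```

Setting `k=1` recovers the usual one-hot cross entropy on the nearest code, so the neighbor set size and temperature control how far the supervision is spread over geometrically close tokens.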
Details
Motivation: Scaling the VQ codebook improves reconstruction fidelity, but training generative models with a large codebook remains difficult and typically requires larger models and longer training schedules.
Method: Propose the Stochastic Neighbor Cross Entropy Minimization (SNCE) training objective, which replaces the hard one-hot target with a soft categorical distribution based on the distance between neighboring code embeddings and the ground-truth image embedding.
Result: On class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing, SNCE significantly improves convergence speed and overall generation quality.
Conclusion: SNCE eases the optimization difficulty of large-codebook discrete image generators and performs well across several image generation tasks.
Abstract: Recent advancements in discrete image generation showed that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring larger model size and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
[424] TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
Hua Chang,Xin Xu,Wei Liu,Jiayi Wu,Kui Jiang,Fei Ma,Qi Tian
Main category: cs.CV
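TL;DR: This paper proposes a text-guided dual-branch opera video super-resolution method (TextOVSR). Degradation-descriptive and content-descriptive text guide the negative and positive reconstruction branches, respectively, and a Degradation-Robust Feature Fusion (DRF) module and a Text-Enhanced Discriminator (TED) further improve the restoration quality of old opera videos.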
Details
Motivation: Classic opera videos are heavily degraded by the limitations of early filming equipment and long-term storage, and existing real-world video super-resolution (RWVSR) methods are hard to apply directly. First, real degradations are complex to model: combinations of classical degradation kernels or external noise patches tend to cause style mismatches. Second, without high-level semantic guidance it is difficult to reconstruct realistic, detailed textures.
Method: Propose the TextOVSR network with two text-guided branches: degradation-descriptive text constrains the solution space in the negative branch, and content-descriptive text provides semantic guidance in the positive branch and in the proposed Text-Enhanced Discriminator (TED). A Degradation-Robust Feature Fusion (DRF) module fuses cross-modal features while suppressing degradation interference.
Result: On the authors' OperaLQ benchmark, TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively.
Conclusion: Text prompts help address both the difficulty of modeling real degradations and the lack of semantic guidance; TextOVSR offers a new approach to restoring old artistic footage.
Abstract: Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference.
Experiments on our OperaLQ benchmark show that TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/ChangHua0/TextOVSR.
[425] Vision-Language Model Based Multi-Expert Fusion for CT Image Classification
Jianfa Bai,Kejin Lu,Runtian Yuan,Qingqiu Li,Jilan Xu,Junlin Hou,Yuejie Zhang,Rui Feng
Main category: cs.CV
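TL;DR: This paper proposes a three-stage source-aware multi-expert framework for robust COVID-19 detection in multi-center CT imaging. It combines a lung-aware 3D model, MedSigLIP slice-level and inter-slice context modeling, and model fusion and voting guided by a source classifier, which significantly improves classification on heterogeneous multi-source data.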
Details
Motivation: Multi-institutional CT data suffer from substantial source shift, source imbalance, and unknown source identity of test samples, which makes COVID-19 detection less robust.
Method: Build a three-stage source-aware multi-expert framework: (1) a lung-aware 3D expert that combines original and lung-extracted CT volumes; (2) two MedSigLIP experts, one for slice-wise representation learning and one for Transformer-based inter-slice modeling; (3) a source classifier that predicts the source of each test sample and guides expert fusion and voting.
Result: Stage 1 reaches a macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791; Stages 2a and 2b reach best AUCs of 0.9864 and 0.9854, respectively; the Stage 3 source classifier reaches 0.9107 ACC and 0.9114 F1.
Conclusion: Source-aware expert modeling and hierarchical voting improve the robustness of COVID-19 classification on heterogeneous multi-source CT data.
Abstract: Robust detection of COVID-19 from chest CT remains challenging in multi-institutional settings due to substantial source shift, source imbalance, and hidden test-source identities. In this work, we propose a three-stage source-aware multi-expert framework for multi-source COVID-19 CT classification. First, we build a lung-aware 3D expert by combining original CT volumes and lung-extracted CT volumes for volumetric classification. Second, we develop two MedSigLIP-based experts: a slice-wise representation and probability learning module, and a Transformer-based inter-slice context modeling module for capturing cross-slice dependency. Third, we train a source classifier to predict the latent source identity of each test scan. By leveraging the predicted source information, we perform model fusion and voting based on different experts. On the validation set covering all four sources, the Stage 1 model achieves the best macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791. Stage 2a and Stage 2b achieve the best AUC scores of 0.9864 and 0.9854, respectively. The Stage 3 source classifier reaches 0.9107 ACC and 0.9114 F1. These results demonstrate that source-aware expert modeling and hierarchical voting provide an effective solution for robust COVID-19 CT classification under heterogeneous multi-source conditions.
[426] DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer
Zhengxu He,Jun Li,Zhijian Wu
Main category: cs.CV
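TL;DR: This paper proposes DAIT, which uses a trainable intermediate teacher to adaptively distill knowledge from frozen large-scale vision-language models (VLMs) into lightweight student models, significantly improving fine-grained visual classification.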
Details
Motivation: Large-scale vision-language models (VLMs) carry rich multimodal semantics but are computationally expensive; conventional knowledge distillation performs poorly because of architectural mismatch and the introduction of irrelevant information.
Method: Propose Distillation with Adaptive Intermediate Teacher transfer (DAIT), which builds a trainable intermediate teacher that, supervised by the fine-grained task, extracts and enhances discriminative visual cues from the frozen VLM and then distills them into a lightweight student model.
Result: Gains of 12.63% and 8.34% on FGVC-Aircraft and CUB-200-2011, respectively; effectiveness is validated across multiple FGVC benchmarks and student architectures.
Conclusion: DAIT offers a principled and efficient paradigm for transferring general-purpose VLMs to deployable fine-grained recognition models.
Abstract: Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation contributes to transferring VLM capacity to lightweight classifiers, conventional distillation mechanisms, which directly transfer from a generic VLM to a compact student, often yield suboptimal results due to severe architectural misalignment and the introduction of task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT) in this study, facilitating adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLM representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves respective performance gains of 12.63% and 8.34% on FGVC-Aircraft and CUB-200-2011 datasets, establishing DAIT as a principled paradigm for transferring from general-purpose VLMs to deployable fine-grained recognition models.
[427] Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
Sosuke Yamao,Natsuki Miyahara,Yuankai Qi,Shun Takeuchi
Main category: cs.CV
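TL;DR: This paper proposes the QViC-MF framework, which uses Question-guided Multimodal Selective Attention (QMSA) and a memory feedback mechanism to enable two-way interaction between perception and memory in long video understanding, significantly improving performance on tasks such as temporal reasoning.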
Details
Motivation: Existing methods mostly compress in a one-way perception-to-memory fashion and process each frame independently, which makes it hard to model complete events (such as the temporal ordering tasks in MLVU and VNBench); a feedback mechanism from memory to perception is therefore needed.
Method: Propose Question-guided Visual Compression with Memory Feedback (QViC-MF). Its core is Question-guided Multimodal Selective Attention (QMSA), which, for each video clip, jointly uses the current clip and related past frames from memory to retain question-relevant information, and iterates compression and memory updates.
Result: Improvements of 6.1% on MLVU test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long over current state-of-the-art methods.
Conclusion: A question-guided visual compression framework with memory feedback models event structure and temporal relations in long videos more effectively, offering a new paradigm for long-term video understanding.
Abstract: In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead establish a feedback-driven process in which past visual contexts stored in the context memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA), which learns to preserve visual information related to the given question from both the current clip and the past related frames from the memory. The compressor and memory feedback work iteratively for each clip of the entire video. This simple yet effective design yields large performance gains on long-term video understanding tasks. Extensive experiments show that our method achieves significant improvement over current state-of-the-art methods by 6.1% on MLVU test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long.
The code will be released publicly.
[428] Multimodal Connectome Fusion via Cross-Attention for Autism Spectrum Disorder Classification Using Graph Learning
Ansar Rahman,Hassan Shojaee-Mend,Sepideh Hatamikia
Main category: cs.CV
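TL;DR: This paper proposes a multimodal graph learning framework that combines rs-fMRI functional connectivity, structural MRI, and phenotypic information, using an asymmetric Transformer cross-attention mechanism for function-dominant multimodal fusion, which significantly improves automated ASD classification.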
Details
Motivation: Functional MRI (rs-fMRI) and structural MRI are complementary, but existing methods struggle to integrate heterogeneous multimodal imaging data in a unified framework for autism spectrum disorder (ASD) classification.
Method: Build a population-graph-based multimodal graph learning framework: subjects are nodes, functional and structural features are node attributes, and a pairwise association encoder (PAE) driven by phenotypic information models inter-subject relationships. Two Edge Variational GCNs learn the embeddings, and a novel asymmetric Transformer cross-attention mechanism lets functional embeddings selectively incorporate structural information; an MLP performs the final ASD classification.
Result: On ABIDE-I, 10-fold cross-validation gives an AUC of 87.3% and an accuracy of 84.4%; leave-one-site-out cross-validation (LOSO-CV) gives an average accuracy of 82.0%. These are about 3% and 7% above existing methods, respectively.
Conclusion: The framework integrates heterogeneous multi-site, multimodal data and improves the robustness and accuracy of cross-site ASD classification while keeping functional connectivity dominant.
Abstract: Autism spectrum disorder (ASD) is a complex neurodevelopmental condition characterized by atypical functional brain connectivity and subtle structural alterations. rs-fMRI has been widely used to identify disruptions in large-scale brain networks, while structural MRI provides complementary information about morphological organization. Despite their complementary nature, effectively integrating these heterogeneous imaging modalities within a unified framework remains challenging. This study proposes a multimodal graph learning framework that preserves the dominant role of functional connectivity while integrating structural imaging and phenotypic information for ASD classification. The proposed framework is evaluated on the ABIDE-I dataset. Each subject is represented as a node within a population graph. Functional and structural features are extracted as modality-specific node attributes, while inter-subject relationships are modeled using a pairwise association encoder (PAE) based on phenotypic information. Two Edge Variational GCNs are trained to learn subject-level embeddings. To enable effective multimodal integration, we introduce a novel asymmetric transformer-based cross-attention mechanism that allows functional embeddings to selectively incorporate complementary structural information while preserving functional dominance. The fused embeddings are then passed to an MLP for ASD classification. Using stratified 10-fold cross-validation, the framework achieved an AUC of 87.3% and an accuracy of 84.4%.
Under leave-one-site-out cross-validation (LOSO-CV), the model achieved an average cross-site accuracy of 82.0%, outperforming existing methods by approximately 3% under 10-fold cross-validation and 7% under LOSO-CV. The proposed framework effectively integrates heterogeneous multimodal data from the multi-site ABIDE-I dataset, improving automated ASD classification across imaging sites.
[429] Tracking the Discriminative Axis: Dual Prototypes for Test-Time OOD Detection Under Covariate Shift
Wooseok Lee,Jin Mo Yang,Saewoong Bahk,Hyung-Sin Kim
Main category: cs.CV
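TL;DR: This paper proposes DART, which tracks dual ID and OOD prototypes online to dynamically recover the discriminative axis, addressing streaming OOD detection under covariate shift and significantly improving performance.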
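A minimal sketch of online dual-prototype tracking, assuming a simple exponential-moving-average update and a projection score onto the axis between the two prototypes. It is not the authors' DART implementation: the class name, the update rule, the initial prototypes, and the toy 2-D stream are hypothetical, and the multi-layer fusion and flip correction mentioned in the abstract are omitted.

```python
def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


class DualPrototypeTracker:
    """Tracks one ID and one OOD prototype and scores features along their axis."""

    def __init__(self, id_proto, ood_proto, momentum=0.9):
        self.id_proto = list(id_proto)
        self.ood_proto = list(ood_proto)
        self.momentum = momentum

    def score(self, feature):
        """Positive when the feature lies on the ID side of the midpoint."""
        axis = sub(self.id_proto, self.ood_proto)
        midpoint = [(a + b) / 2.0 for a, b in zip(self.id_proto, self.ood_proto)]
        return dot(sub(feature, midpoint), axis)

    def update(self, feature):
        """Score the feature, then move the matching prototype toward it (EMA)."""
        s = self.score(feature)
        proto = self.id_proto if s >= 0 else self.ood_proto
        m = self.momentum
        for i, x in enumerate(feature):
            proto[i] = m * proto[i] + (1.0 - m) * x
        return s


tracker = DualPrototypeTracker(id_proto=[1.0, 0.0], ood_proto=[-1.0, 0.0], momentum=0.5)

# Toy stream: a covariate shift moves both clusters upward along the second axis.
stream = [([1.0, 0.5], "id"), ([-1.0, 0.5], "ood"),
          ([1.0, 1.0], "id"), ([-1.0, 1.0], "ood"),
          ([1.0, 1.5], "id"), ([-1.0, 1.5], "ood")]
predictions = ["id" if tracker.update(f) >= 0 else "ood" for f, _ in stream]
print(predictions, tracker.id_proto, tracker.ood_proto)
```

Because both prototypes follow the stream, the midpoint and axis drift with the shifted features instead of staying at their initial positions, which is the intuition behind tracking the axis at test time.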
Details
Motivation: In the real world, test-time inputs often arrive as a streaming mixture of ID and OOD samples under evolving covariate shift, whereas existing methods assume a stationary ID distribution and degrade severely.
Method: Building on the finding that covariate-shifted ID (csID) and OOD (csOOD) samples remain separable along a discriminative axis in feature space, propose DART, a test-time online OOD detection method that dynamically tracks dual ID and OOD prototypes to recover the drifting discriminative axis, with multi-layer fusion and flip correction for robustness.
Result: With all datasets subjected to 15 common corruption types at severity level 5, AUROC improves by 15.32 percentage points and FPR@95TPR drops by 49.15 percentage points on ImageNet-C vs. Textures-C.
Conclusion: Tracking the discriminative axis at test time is a promising route to reliable OOD detection in dynamically changing environments.
Abstract: For reliable deployment of deep-learning systems, out-of-distribution (OOD) detection is indispensable. In the real world, where test-time inputs often arrive as streaming mixtures of in-distribution (ID) and OOD samples under evolving covariate shifts, OOD samples are domain-constrained and bounded by the environment, and both ID and OOD are jointly affected by the same covariate factors. Existing methods typically assume a stationary ID distribution, but this assumption breaks down in such settings, leading to severe performance degradation. We empirically discover that, even under covariate shift, covariate-shifted ID (csID) and OOD (csOOD) samples remain separable along a discriminative axis in feature space. Building on this observation, we propose DART, a test-time, online OOD detection method that dynamically tracks dual prototypes (one for ID and the other for OOD) to recover the drifting discriminative axis, augmented with multi-layer fusion and flip correction for robustness. Extensive experiments on a wide range of challenging benchmarks, where all datasets are subjected to 15 common corruption types at severity level 5, demonstrate that our method significantly improves performance, yielding 15.32 percentage points (pp) AUROC gain and 49.15 pp FPR@95TPR reduction on ImageNet-C vs. Textures-C compared to established baselines. These results highlight the potential of the test-time discriminative axis tracking for dependable OOD detection in dynamically changing environments.
[430] HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization
Xuerui Qiu,Yutao Cui,Guozhen Zhang,Junzhe Li,JiaKui Hu,Xiao Zhang,Yang Li,Songtao Liu,Miles Yang,Yu Shi,Zhao Zhong,Liefeng Bo
Main category: cs.CV
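TL;DR: This paper proposes HYDRA-TOK, a representation-harmonized pure-ViT architecture that unifies visual understanding and generation through progressive learning from a Gen-ViT to a Sem-ViT and a Generation-Semantic Bottleneck (GSB), reaching state of the art on several reconstruction, generation, and understanding benchmarks.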
Details
Motivation: Existing unified multimodal models struggle to reconcile the abstract representations needed for visual understanding with the detailed primitives needed for generation; common workarounds such as decoupled encoders, stacking on VAEs, or discrete quantization tend to distort information and cause optimization conflicts.
Method: Propose HYDRA-TOK, which recasts the ViT backbone as a progressive learner with a structure-preserving Gen-ViT and a semantic-encoding Sem-ViT, connected by a Generation-Semantic Bottleneck (GSB) that reduces dimensionality to filter noise and then restores it for semantic enrichment. On this foundation, build the end-to-end unified framework HYDRA, in which perception and generation share a single parameter space.
Result: New state of the art in visual reconstruction (rFID 0.08), generation (GenEval 0.86, DPG-Bench 86.4, WISE 0.53), and eight understanding tasks (+10.0 points on average).
Conclusion: Through representation harmonization, HYDRA-TOK bridges the representational gap between understanding and generation, showing that a pure ViT architecture can handle both high-performance perception and generation in one framework.
Abstract: Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking representation encoders atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we introduce HYDRA-TOK, a representation-harmonized pure ViT built on the insight that visual modeling should evolve from generation to understanding. HYDRA-TOK reformulates the standard backbone into a progressive learner that transitions from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding. Crucially, this transition is mediated by a Generation-Semantic Bottleneck (GSB), which compresses features into a low-dimensional space to filter noise for robust synthesis, then restores dimensionality to empower complex semantic comprehension. Built upon this foundation, we present HYDRA, a native unified framework integrating perception and generation within a single parameter space. Extensive experiments establish HYDRA as a new state-of-the-art.
It sets a benchmark in visual reconstruction (rFID 0.08) and achieves top-tier generation performance on GenEval (0.86), DPG-Bench (86.4), and WISE (0.53), while simultaneously outperforming previous native UMMs by an average of 10.0 points across eight challenging understanding benchmarks.
[431] Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection
Yao Gu,Xiaohao Xu,Yingna Wu
Main category: cs.CV
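TL;DR: This paper proposes a physics-informed instruction tuning framework that injects object properties, motion paradigms, and dynamic constraints through structured prompts, strengthening the causal reasoning of vision-language models for physical anomaly detection and substantially outperforming existing methods on the Phys-AD benchmark.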
Details
Motivation: Existing vision-language models are trained mainly on appearance correlations and lack an understanding of kinematic constraints and causal dynamics, so they struggle to detect physical anomalies such as irregular rotations or violations of mechanical motion.
Method: Design a physics-informed instruction tuning framework that encodes object properties, motion paradigms, and dynamic constraints as structured prompts, and uses multi-turn dialogue to guide the model through causal reasoning step by step.
Result: On the Phys-AD benchmark, video-level anomaly detection reaches 96.7% AUROC, far above the prior SOTA (66.9%); the LLM score for causal explanation quality is 0.777.
Conclusion: Structured physics priors strengthen a VLM's ability to detect and explain dynamic anomalies, making it a reliable physical anomaly detector.
Abstract: Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection, substantially outperforming prior SOTA (66.9%), and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.
[432] HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning
Kuniaki Saito,Risa Shinoda,Shohei Tanaka,Tosho Hirasawa,Fumio Okura,Yoshitaka Ushiku
Main category: cs.CV
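TL;DR: This paper introduces the HalDec-Bench benchmark for systematically evaluating how well vision-language models (VLMs) generalize as hallucination detectors, and reveals a detection bias as well as the effectiveness of VLM-based data filtering.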
Details
Motivation: Hallucination detection lacks a unified, comprehensive benchmark for evaluating how well different VLMs generalize across hallucination types and captioning models.
Method: Build the HalDec-Bench benchmark, which contains captions generated by diverse VLMs, human annotations of hallucination presence, fine-grained hallucination types, and segment-level labels, with evaluation tasks at multiple difficulty levels.
Result: Detectors show a bias toward judging sentences at the beginning of a response as correct; strong VLMs can act as effective filters that substantially reduce dataset noise.
Conclusion: HalDec-Bench provides an interpretable, reproducible evaluation framework for hallucination detection, exposes a systematic bias in how current VLMs judge alignment, and validates VLM-based data cleaning.
Abstract: Hallucination detection in captions (HalDec) assesses a vision-language model's ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination detection is also essential for curating high-quality image-caption pairs used to train VLMs. However, the generalizability of VLMs as hallucination detectors across different captioning models and hallucination types remains unclear due to the lack of a comprehensive benchmark. In this work, we introduce HalDec-Bench, a benchmark designed to evaluate hallucination detectors in a principled and interpretable manner. HalDec-Bench contains captions generated by diverse VLMs together with human annotations indicating the presence of hallucinations, detailed hallucination-type categories, and segment-level labels. The benchmark provides tasks with a wide range of difficulty levels and reveals performance differences across models that are not visible in existing multimodal reasoning or alignment benchmarks. Our analysis further uncovers two key findings. First, detectors tend to recognize sentences appearing at the beginning of a response as correct, regardless of their actual correctness. Second, our experiments suggest that dataset noise can be substantially reduced by using strong VLMs as filters while employing recent VLMs as caption generators. Our project page is available at https://dahlian00.github.io/HalDec-Bench-Page/.
[433] IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning
Konstantinos Almpanakis,Anna Kreshuk
Main category: cs.CV
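TL;DR: IConE is a new self-supervised learning framework that introduces learnable auxiliary instance embeddings and an explicit diversity objective to decouple collapse prevention from batch size, enabling stable training and better performance with small batches (even batch size 1) and under severe class imbalance.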
Details
Motivation: Existing Joint-Embedding Architectures (JEAs) rely on batch interaction (such as negative sampling or statistical regularization) to prevent representation collapse, which fails where batches must be small, as with high-dimensional scientific data subject to memory limits and class imbalance.
Method: Propose the IConE framework, which maintains a global set of learnable auxiliary instance embeddings with an explicit diversity regularizer, moving the anti-collapse mechanism from the transient batch to a dataset-level embedding space.
Result: Across diverse 2D and 3D biomedical modalities, IConE significantly outperforms contrastive and non-contrastive baselines at small batch sizes (B=1 to 64) and is robust to severe class imbalance; geometric analysis shows that it preserves high intrinsic dimensionality and avoids collapse at small batch sizes.
Conclusion: IConE decouples collapse prevention from batch size, providing stable and efficient self-supervised representation learning for resource-constrained, high-dimensional, imbalanced scientific data.
Abstract: Self-supervised learning (SSL) has revolutionized representation learning, with Joint-Embedding Architectures (JEAs) emerging as an effective approach for capturing semantic features. Existing JEAs rely on implicit or explicit batch interaction (via negative sampling or statistical regularization) to prevent representation collapse. This reliance becomes problematic in regimes where batch sizes must be small, such as high-dimensional scientific data, where memory constraints and class imbalance make large, well-balanced batches infeasible. We introduce IConE (Instance-Contrasted Embeddings), a framework that decouples collapse prevention from the training batch size. Rather than enforcing diversity through batch statistics, IConE maintains a global set of learnable auxiliary instance embeddings regularized by an explicit diversity objective. This transfers the anti-collapse mechanism from the transient batch to a dataset-level embedding space, allowing stable training even when batch statistics are unreliable, down to batch size 1. Across diverse 2D and 3D biomedical modalities, IConE outperforms strong contrastive and non-contrastive baselines throughout the small-batch regime (from B=1 to B=64) and demonstrates marked robustness to severe class imbalance. Geometric analysis shows that IConE preserves high intrinsic dimensionality in the learned representations, preventing the collapse observed in existing JEAs as batch sizes shrink.
[434] Exemplar Diffusion: Improving Medical Object Detection with Opportunistic Labels
Victor Wåhlstrand,Jennifer Alvén,Ida Häggström
Main category: cs.CV
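TL;DR: This paper proposes a training-free framework called exemplar diffusion, which uses labels already available at inference time (exemplars) to improve object detection in medical images and also supports quantifying predictive uncertainty.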
Details
Motivation: Annotation for medical object detection is costly and depends on experts, yet the data often have clear spatial structure, so a small number of known labels (exemplars) can be used at test time to improve performance.
Method: Building on existing diffusion-based object detection methods, design an exemplar diffusion mechanism that introduces known bounding boxes as priors at inference time, improving performance without retraining, and extend it to uncertainty estimation.
Result: On medical image datasets with spatial structure, the method increases average precision (AP) and recall, is robust to exemplar quality, supports non-expert annotation, and can quantify predictive uncertainty.
Conclusion: Exemplar diffusion is an efficient, flexible, training-free enhancement suited to annotation-limited medical detection tasks, and it extends what diffusion detection models can do.
Abstract: We present a framework to take advantage of existing labels at inference, called exemplars, in order to improve the performance of object detection in medical images. The method, exemplar diffusion, leverages existing diffusion methods for object detection to enable a training-free approach to adding information of known bounding boxes at test time. We demonstrate that for medical image datasets with clear spatial structure, the method yields an across-the-board increase in average precision and recall, and a robustness to exemplar quality, enabling non-expert annotation. Moreover, we demonstrate how our method may also be used to quantify predictive uncertainty in diffusion detection methods. Source code and data splits are openly available online: https://github.com/waahlstrand/ExemplarDiffusion
[435] Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps
Kim Ouan,Noémie Moreau,Katarzyna Bozek
Main category: cs.CV
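TL;DR: Proposes grading corneal nerve fiber tortuosity without expensive nerve-fiber segmentation maps, using self-supervised features pretrained on ImageNet (specifically DINO); after careful fine-tuning, the model surpasses the previous state of the art in accuracy (84.25%) and sensitivity (77.97%).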
Details
Motivation: Existing methods for grading corneal nerve fiber tortuosity depend heavily on segmentation maps that are expensive and slow to produce, so an efficient segmentation-free alternative is needed.
Method: Takes a DINO model pretrained with self-supervision on ImageNet and carefully fine-tunes it on in vivo confocal microscopy images, so that it extracts the key morphological features for tortuosity grading directly from raw images.
Result: The fine-tuned DINO model exceeds the previous state of the art in both accuracy (84.25%) and sensitivity (77.97%) without using segmentation maps.
Conclusion: Although DINO has been superseded by later versions, it still transfers well to medical imaging; this work supports its clinical applicability and strong performance without segmentation.
Abstract: The tortuosity of corneal nerve fibers is used as an indicator of various diseases. Current state-of-the-art methods for grading the tortuosity heavily rely on expensive segmentation maps of these nerve fibers. In this paper, we demonstrate that self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, although it was superseded by two later versions. After careful fine-tuning, DINO improves upon the state-of-the-art in terms of accuracy (84.25%) and sensitivity (77.97%). Our fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.
[436] Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models
Junlong Ke,Zichen Wen,Boxue Yang,Yantai Yang,Xuyang Liu,Chenfei Liao,Zhaorun Chen,Shaobo Wang,Linfeng Zhang
Main category: cs.CV
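TL;DR: Proposes FlashU, a training-free, task-aware acceleration framework for unified multimodal models that combines task-specific pruning, dynamic layer skipping, diffusion head caching, and visual token pruning to speed up both generation and understanding while preserving performance.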
Details
Motivation: Unified multimodal models are computationally expensive, and conventional static acceleration methods ignore the fundamental difference in computational profile between generation (iterative) and understanding (single forward pass).
Method: Based on a systematic analysis of parameter specialization in unified models, proposes FlashU, a training-free, task-aware acceleration framework. Both tasks use task-specific network pruning and dynamic layer skipping; generation adds a time-varying guidance scale and a Diffusion Head Cache; understanding adds, on top of the pruned model, dynamic visual token pruning via a V-Norm proxy.
Result: On Show-o2, FlashU achieves 1.78x to 2.01x inference acceleration while maintaining SOTA performance, outperforming competing unified models.
Conclusion: Unified multimodal models contain implicit task-specific inference pathways, and task-aware, training-free, fine-grained optimization across parameters, layers, tokens, and time steps is key to efficient deployment.
Abstract: Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task's demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs.
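The token pruning step can be illustrated with a minimal sketch. The abstract does not define the V-Norm proxy, so this example assumes that each visual token is scored by the L2 norm of its value vector and that the top-scoring fraction is kept; the function names and the `keep_ratio` parameter are hypothetical and are not taken from the FlashU code.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def prune_tokens_by_norm(value_vectors, keep_ratio=0.5):
    """Return the indices of the tokens to keep, in their original order.

    ASSUMPTION: token importance is approximated by the L2 norm of the
    token's value vector. keep_ratio is a hypothetical knob.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(value_vectors) * keep_ratio))
    scores = [l2_norm(v) for v in value_vectors]
    # Rank token indices from highest to lowest score, keep the top n_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy value vectors for four visual tokens (made-up numbers).
values = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [1.5, 1.5]]
kept = prune_tokens_by_norm(values, keep_ratio=0.5)
print(kept)  # [1, 3]
```

Returning indices in their original order keeps the positional layout of the surviving tokens intact, which matters if later layers depend on token order.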
Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78$\times$ to 2.01$\times$ inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at https://github.com/Rirayh/FlashU.
[437] Dataset Diversity Metrics and Impact on Classification Models
Théo Sourget,Niclas Claßen,Jack Junchi Xu,Rob van der Goot,Veronika Cheplygina
Main category: cs.CV
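TL;DR: Studies several dataset diversity metrics for images, text, and metadata on MorphoMNIST and PadChest. FID and semantic diversity metrics correlate more strongly with downstream performance than reference-free metrics do; a clinical expert identifies scanners as the main source of diversity in practice, yet adding another scanner to the training set leads to shortcut learning.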
Details
Motivation: Diversity is considered essential for training robust models, but its definition is inconsistent and its quantification is often neglected, so the validity of diversity metrics and their relation to model performance and expert intuition need systematic evaluation.
Method: Uses MorphoMNIST (a toy dataset with controlled perturbations) and PadChest (a public chest X-ray dataset) to compare several image, text, and metadata diversity metrics, analyzing how they correlate with each other, with a clinical expert's intuition, and with downstream AUC, and how they affect training dynamics.
Result: AUC correlates only weakly with reference-free image and metadata diversity metrics but more strongly with FID and semantic diversity metrics; the clinical expert names scanners as the main source of diversity; however, adding a new scanner to the training set triggers shortcut learning.
Conclusion: Diversity metrics should be tied more closely to domain knowledge (such as clinical practice) and semantic-level understanding; adding more acquisition devices does not necessarily improve robustness and may introduce bias, so new algorithms should weigh how diversity is introduced against how the model generalizes.
Abstract: The diversity of training datasets is usually perceived as an important aspect to obtain a robust model. However, diversity is often left undefined or defined differently across papers, and while some metrics exist, the quantification of this diversity is often overlooked when developing new algorithms. In this work, we study the behaviour of multiple dataset diversity metrics for image, text and metadata using MorphoMNIST, a toy dataset with controlled perturbations, and PadChest, a publicly available chest X-ray dataset. We evaluate whether these metrics correlate with each other but also with the intuition of a clinical expert. We also assess whether they correlate with downstream-task performance and how they impact the training dynamics of the models. We find limited correlations between the AUC and image or metadata reference-free diversity metrics, but higher correlations with the FID and the semantic diversity metrics. Finally, the clinical expert indicates that scanners are the main source of diversity in practice. However, we find that the addition of another scanner to the training set leads to shortcut learning. The code used in this study is available at https://github.com/TheoSourget/dataset_diversity_evaluation
[438] GATE-AD: Graph Attention Network Encoding For Few-Shot Industrial Visual Anomaly Detection
Aggelos Psiris,Yannis Panagakis,Maria Vakalopoulou,Georgios Th. Papadopoulos
Main category: cs.CV
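TL;DR: Proposes GATE-AD, a reconstruction-based method for few-shot industrial visual anomaly detection that uses a masked graph attention network with representation alignment; it achieves SOTA results on several benchmarks, combining high accuracy with low inference latency.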
Details
Motivation: Addresses few-shot industrial visual anomaly detection (FS-IVAD) in modern manufacturing, where rare defects must be identified from only a handful of defect-free samples.
Method: Proposes GATE-AD, which treats patch-level visual features as graph nodes and models non-Euclidean local relations with a masked, representation-aligned graph attention network (GAT); a representation alignment component operates in a learnable latent space, and a Scaled Cosine Error (SCE) objective assesses reconstruction residuals to localize defects.
Result: On MVTec AD, VisA, and MPDD, GATE-AD reaches SOTA in the 1- to 8-shot settings: image AUROC improves by up to 1.8% in the 8-shot case on MPDD, and per-image inference is at least 25.05% faster than the best-performing prior methods.
Conclusion: GATE-AD balances accuracy and efficiency in the few-shot setting and supports the value of graph-structured modeling with representation alignment for industrial anomaly detection; the code is open-sourced for reproducibility and follow-up research.
Abstract: Few-Shot Industrial Visual Anomaly Detection (FS-IVAD) is a critical task in modern manufacturing settings, where automated product inspection systems need to identify rare defects using only a handful of normal/defect-free training samples. In this context, the current study introduces a novel reconstruction-based approach termed GATE-AD. In particular, the proposed framework relies on the employment of a masked, representation-aligned Graph Attention Network (GAT) encoding scheme to learn robust appearance patterns of normal samples. By leveraging dense, patch-level, visual feature tokens as graph nodes, the model employs stacked self-attentional layers to adaptively encode complex, irregular, non-Euclidean, local relations. The graph is enhanced with a representation alignment component grounded on a learnable, latent space, where high reconstruction residual areas (i.e., defects) are assessed using a Scaled Cosine Error (SCE) objective function. Extensive comparative evaluation on the MVTec AD, VisA, and MPDD industrial defect detection benchmarks demonstrates that GATE-AD achieves state-of-the-art performance across the $1$- to $8$-shot settings, combining the highest detection accuracy (increase up to $1.8\%$ in image AUROC in the 8-shot case in MPDD) with the lowest per-image inference latency (at least $25.05\%$ faster), compared to the best-performing literature methods. In order to facilitate reproducibility and further research, the source code of GATE-AD is available at https://github.com/gthpapadopoulos/GATE-AD.
[439] Generative Video Compression with One-Dimensional Latent Representation
Zihan Zheng,Zhaoyang Jia,Naifu Xue,Jiahao Li,Bin Li,Zongyu Guo,Xiaoyi Zhang,Zhenghao Chen,Houqiang Li,Yan Lu
Main category: cs.CV
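TL;DR: Proposes GVC1D, which encodes video into extremely compact one-dimensional (1D) latent tokens conditioned on short- and long-term context, overcoming the limits of conventional two-dimensional (2D) latent grids in modeling spatio-temporal redundancy and substantially improving generative video compression efficiency.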
Details
Motivation: Conventional generative video codecs use a 2D latent grid, which retains considerable intra-frame redundancy spatially and models long-range semantic correlations poorly over time, resulting in higher bitrates.
Method: Proposes Generative Video Compression with 1D Latent Representation (GVC1D), encoding video into 1D latent tokens conditioned on short- and long-term context; free of rigid spatial constraints, the 1D tokens attend adaptively to semantic regions and allow token reduction, and a low-cost, semantically rich 1D memory models long-term correlations.
Result: On the HEVC Class B dataset, GVC1D reduces bitrate by 60.4% under LPIPS and 68.8% under DISTS relative to previous methods.
Conclusion: By dropping the 2D structure in favor of a flexible 1D latent representation and memory, GVC1D exploits spatio-temporal redundancy more fully and improves compression efficiency while maintaining reconstruction quality.
Abstract: Recent advancements in generative video codec (GVC) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extremely compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, where it achieves bitrate reductions of 60.4\% under LPIPS and 68.8\% under DISTS on the HEVC Class B dataset, surpassing the previous video compression methods. Project: https://gvc1d.github.io/
[440] UE5-Forest: A Photorealistic Synthetic Stereo Dataset for UAV Forestry Depth Estimation
Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
Main category: cs.CV
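TL;DR: Presents UE5-Forest, a high-fidelity synthetic stereo dataset built in Unreal Engine 5 to address the difficulty of obtaining dense ground-truth disparity maps in forest environments, supporting the training of stereo matching networks for autonomous tasks such as UAV pruning.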
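The rig parameters listed in this entry's abstract (63 mm baseline, 2.8 mm focal length, 3.84 mm sensor width, 1920-pixel image width, orbit distance up to 2 m) imply a disparity range that is easy to check by hand. The sketch below assumes an ideal rectified pinhole stereo model and that the 1920 pixels span the full sensor width; neither assumption is stated in the abstract, and the function names are illustrative.

```python
def focal_length_px(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimetres to pixels (pinhole model)."""
    return focal_mm / sensor_width_mm * image_width_px


def disparity_px(depth_m, baseline_m, focal_px):
    """Disparity of a point at the given depth for a rectified stereo pair."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Values taken from the abstract; the pinhole model is an assumption.
f_px = focal_length_px(focal_mm=2.8, sensor_width_mm=3.84, image_width_px=1920)
d_2m = disparity_px(depth_m=2.0, baseline_m=0.063, focal_px=f_px)
print(round(f_px, 1), round(d_2m, 1))  # 1400.0 44.1
```

Under these assumptions a branch at the maximum 2 m orbit distance shifts by roughly 44 pixels between the two views, and halving the distance doubles the disparity.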
Details
Motivation: In forests, thin overlapping branches and complex canopy geometry prevent conventional depth sensors from capturing dense ground-truth disparity maps, which severely limits supervised training of stereo matching networks for UAV pruning.
Method: Places 115 photogrammetry-scanned trees from Quixel Megascans in virtual UE5 scenes; a simulated stereo rig replicating the ZED Mini camera (63 mm baseline, 2.8 mm focal length, 3.84 mm sensor width) orbits each tree across three elevation bands, yielding 5,520 rectified 1920 x 1080 stereo pairs with pixel-accurate disparity maps, followed by statistical analysis and comparison with real images.
Result: Produces a synthetic stereo dataset for forestry with pixel-level ground truth; the statistical characterisation and the qualitative comparison with real imagery support its photorealism and geometric plausibility.
Conclusion: UE5-Forest fills a gap in high-quality synthetic data for forestry stereo vision and will be released publicly as a benchmark and training resource for depth estimation in autonomous UAV operations.
Abstract: Dense ground-truth disparity maps are practically unobtainable in forestry environments, where thin overlapping branches and complex canopy geometry defeat conventional depth sensors -- a critical bottleneck for training supervised stereo matching networks for autonomous UAV-based pruning. We present UE5-Forest, a photorealistic synthetic stereo dataset built entirely in Unreal Engine 5 (UE5). One hundred and fifteen photogrammetry-scanned trees from the Quixel Megascans library are placed in virtual scenes and captured by a simulated stereo rig whose intrinsics -- 63 mm baseline, 2.8 mm focal length, 3.84 mm sensor width -- replicate the ZED Mini camera mounted on our drone. Orbiting each tree at up to 2 m across three elevation bands (horizontal, +45 degrees, -45 degrees) yields 5,520 rectified 1920 x 1080 stereo pairs with pixel-perfect disparity labels. We provide a statistical characterisation of the dataset -- covering disparity distributions, scene diversity, and visual fidelity -- and a qualitative comparison with real-world Canterbury Tree Branches imagery that confirms the photorealistic quality and geometric plausibility of the rendered data. The dataset will be publicly released to provide the community with a ready-to-use benchmark and training resource for stereo-based forestry depth estimation.
[441] MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction
Jiacheng Dong,Huan Li,Sicheng Zhou,Wenhao Hu,Weili Xu,Yan Wang
Main category: cs.CV
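TL;DR: MeMix is a training-free, plug-and-play module that improves streaming 3D reconstruction by recasting the recurrent state as a Memory Mixture, mitigating state drift and catastrophic forgetting and reducing reconstruction error on several benchmarks.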
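The selective update at the core of this entry (write only to the least-aligned memory patches, leave the rest untouched) can be sketched in a few lines. This is an illustration rather than the authors' implementation: it assumes that alignment is the cosine similarity between a stored patch and the corresponding incoming patch, and that a selected patch is simply overwritten. Both choices, and all names, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def selective_update(state, candidate, num_update=1):
    """Overwrite only the `num_update` least-aligned patches.

    state, candidate: equal-length lists of patches (lists of floats).
    ASSUMPTION: alignment is the cosine similarity between a stored patch
    and the matching candidate patch; selected patches are overwritten.
    """
    if len(state) != len(candidate):
        raise ValueError("state and candidate need the same number of patches")
    scores = [cosine(s, c) for s, c in zip(state, candidate)]
    order = sorted(range(len(state)), key=lambda i: scores[i])
    least_aligned = set(order[:num_update])
    # Untouched patches are returned as the same objects, so they are
    # preserved exactly rather than blended or re-encoded.
    return [list(candidate[i]) if i in least_aligned else state[i]
            for i in range(len(state))]


state = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
new_state = selective_update(state, candidate, num_update=1)
print(new_state)  # [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
```

The state keeps a fixed number of patches regardless of stream length, which is consistent with the constant inference memory the abstract reports.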
Details
Motivation: Recurrent online models for streaming 3D reconstruction degrade on long sequences because of state drift and forgetting, which calls for an inference-time remedy that needs no retraining.
Method: MeMix partitions the recurrent state into multiple independent memory patches and updates only the least-aligned patches while preserving the rest exactly; this selective update adds no learnable parameters and keeps inference memory at O(1).
Result: Across standard benchmarks (ScanNet, 7-Scenes, KITTI, and others), with identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) on 300- to 500-frame streams from 7-Scenes, without fine-tuning or extra parameters.
Conclusion: MeMix is a lightweight, general, plug-and-play inference-time module that improves the robustness and long-horizon stability of existing streaming 3D reconstruction models.
Abstract: Reconstruction is a fundamental task in 3D vision and a fundamental capability for spatial intelligence. Particularly, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer from progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state into a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned memory patches while exactly preserving others. This selective update mitigates catastrophic forgetting while retaining $O(1)$ inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) across 300--500 frame streams on 7-Scenes. The code is available at https://dongjiacheng06.github.io/MeMix/
[442] Oscillating Dispersion for Maximal Light-throughput Spectral Imaging
Jiuyun Zhang,Zhan Shi,Linsen Chen,Xun Cao
Main category: cs.CV
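TL;DR: Proposes ODIS, a spectral imaging system that achieves near-full light throughput by axially translating the disperser, paired with PDAUN, a PAN-guided deep unfolding network for high-quality spectral reconstruction; together they markedly improve imaging under low illumination.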
Details
Motivation: Existing computational spectral imaging systems lose much of the incident light to coded apertures and beam splitters, which severely degrades reconstruction in low light; a high-throughput, single-optical-path design is needed.
Method: Proposes the Oscillating Dispersion Imaging Spectrometer (ODIS), in which the disperser translates axially between the conjugate image plane and a defocused position to capture a panchromatic (PAN) image and a maskless dispersed measurement in sequence; also designs the PAN-guided Dispersion-Aware Deep Unfolding Network (PDAUN), with an FFT-Woodbury preconditioned data-fidelity step and a Dispersion-Aware Deformable Convolution (DADC) module that corrects sub-pixel spectral misalignment.
Result: Achieves SOTA performance on standard benchmarks; cross-system comparisons confirm a clear advantage for ODIS under low illumination; a physical prototype validates high-fidelity spectral reconstruction.
Conclusion: ODIS and PDAUN together form an efficient, robust single-path, high-throughput spectral imaging framework for computational spectral imaging in low light.
Abstract: Existing computational spectral imaging systems typically rely on coded apertures and beam splitters that block a substantial fraction of incident light, degrading reconstruction quality under light-starved conditions. To address this limitation, we develop the Oscillating Dispersion Imaging Spectrometer (ODIS), which for the first time achieves near-full light throughput by axially translating a disperser between the conjugate image plane and a defocused position, sequentially capturing a panchromatic (PAN) image and a dispersed measurement along a single optical path. We further propose a PAN-guided Dispersion-Aware Deep Unfolding Network (PDAUN) that recovers high-fidelity spectral information from maskless dispersion under PAN structural guidance. Its data-fidelity step derives an FFT-Woodbury preconditioned solver by exploiting the cyclic-convolution property of the ODIS forward model, while a Dispersion-Aware Deformable Convolution module (DADC) corrects sub-pixel spectral misalignment using PAN features. Experiments show state-of-the-art performance on standard benchmarks, and cross-system comparisons confirm that ODIS yields decisive gains under low illumination. High-fidelity reconstruction is validated on a physical prototype.
[443] A PPO-Based Bitrate Allocation Conditional Diffusion Model for Remote Sensing Image Compression
Yuming Han,Jooho Kim,Anish Shakya
Main category: cs.CV
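TL;DR: Proposes PCDC, a conditional diffusion compression framework with PPO-based bitrate allocation for high-resolution drone remote sensing images, achieving high compression ratios while preserving perceptual quality and task-relevant information.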
Details
Motivation: Existing remote sensing image compression methods struggle to combine high compression efficiency with preservation of fine details and task-relevant information; high-resolution drone imagery is large (up to hundreds of gigabytes), which makes storage and long-term management difficult.
Method: Proposes the PPO-based bitrate allocation Conditional Diffusion Compression (PCDC) framework, which combines a conditional diffusion decoder with a PPO-based block-wise bitrate allocation strategy.
Result: Reaches compression ratios of 19.3x on DIV2K and 21.2x on the authors' drone image dataset; downstream object detection shows negligible performance loss.
Conclusion: PCDC maintains perceptual quality and task usability at high compression ratios, supporting the combination of reinforcement-learning-driven bitrate allocation with diffusion models for remote sensing image compression.
Abstract: Existing remote sensing image compression methods still struggle to balance high compression efficiency with the preservation of fine details and task-relevant information. Meanwhile, high-resolution drone imagery offers valuable structural details for urban monitoring and disaster assessment, but large-area datasets can easily reach hundreds of gigabytes, creating significant challenges for storage and long-term management. In this paper, we propose a PPO-based bitrate allocation Conditional Diffusion Compression (PCDC) framework. PCDC integrates a conditional diffusion decoder with a PPO-based block-wise bitrate allocation strategy to achieve high compression ratios while maintaining strong perceptual performance. We also release a high-resolution drone image dataset with richer structural details, captured at a consistent low altitude over residential neighborhoods in coastal urban areas. Experimental results show compression ratios of 19.3x on DIV2K and 21.2x on the drone image dataset. Moreover, downstream object detection experiments demonstrate that the reconstructed images preserve task-relevant information with negligible performance loss.
[444] IRIS: Intersection-aware Ray-based Implicit Editable Scenes
Grzegorz Wilczyński,Mikołaj Zieliński,Krzysztof Byrski,Joanna Waczyńska,Dominik Belter,Przemysław Spurek
Main category: cs.CV
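TL;DR: Proposes IRIS, a framework that analytically computes the intersections between rays and scene primitives to enable efficient real-time rendering and interactive scene editing, avoiding the cost of stochastic volumetric sampling and 3D spatial searches.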
Details
Motivation: Neural Radiance Fields are costly to train and render, while 3D Gaussian splatting is fast but less flexible; existing hybrids of the two remain bottlenecked by stochastic volumetric sampling.
Method: Proposes IRIS with (1) an analytical ray sampling strategy that precisely computes ray-primitive intersections and skips empty space, and (2) a continuous feature aggregation mechanism along the ray that interpolates latent attributes from sorted intersections, avoiding 3D neighbor searches.
Result: Enables high-fidelity real-time rendering and flexible shape editing, with improved rendering efficiency and geometric consistency.
Conclusion: IRIS retains the expressiveness of neural radiance fields while approaching the real-time performance of 3D Gaussian splatting and supporting interactive editing.
Abstract: Neural Radiance Fields achieve high-fidelity scene representation but suffer from costly training and rendering, while 3D Gaussian splatting offers real-time performance with strong empirical results. Recent solutions that harness the best of both worlds by using Gaussians as proxies to guide neural field evaluations still suffer from significant computational inefficiencies. They typically rely on stochastic volumetric sampling to aggregate features, which severely limits rendering performance. To address this issue, a novel framework named IRIS (Intersection-aware Ray-based Implicit Editable Scenes) is introduced as a method designed for efficient and interactive scene editing. To overcome the limitations of standard ray marching, an analytical sampling strategy is employed that precisely identifies interaction points between rays and scene primitives, effectively eliminating empty space processing. Furthermore, to address the computational bottleneck of spatial neighbor lookups, a continuous feature aggregation mechanism is introduced that operates directly along the ray. By interpolating latent attributes from sorted intersections, costly 3D searches are bypassed, ensuring geometric consistency, enabling high-fidelity, real-time rendering, and flexible shape editing. Code can be found at https://github.com/gwilczynski95/iris.
[445] Trajectory-Diversity-Driven Robust Vision-and-Language Navigation
Jiangyang Li,Cong Wan,SongLin Dong,Chenhao Ding,Qiang Wang,Zhiheng Ma,Yihong Gong
Main category: cs.CV
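TL;DR: Proposes NavGRPO, a reinforcement learning framework based on Group Relative Policy Optimization for vision-and-language navigation (VLN) that markedly improves robustness and performance in unseen environments.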
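Group Relative Policy Optimization, named in this entry, scores each sampled trajectory against the others in its group instead of against a learned value network. The sketch below shows the commonly used form of that advantage, reward minus the group mean divided by the group standard deviation. How NavGRPO defines its rewards and groups is not stated here, so the reward values and the `eps` stabilizer are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group.

    ASSUMPTION: the standard GRPO-style normalization (mean and population
    standard deviation of the group); eps avoids division by zero when all
    rewards in a group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled for one instruction.
rewards = [0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [-1.0, 1.0, 1.0, -1.0]
```

Trajectories that beat their group average receive positive advantages and are reinforced; if every trajectory in a group earns the same reward, all advantages are zero and that group contributes no gradient.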
Details
Motivation: Current VLN methods rely mainly on imitation learning, which generalizes poorly and is not robust to execution perturbations.
Method: Proposes NavGRPO, built on ScaleVLN, which optimizes the policy through within-group performance comparisons without an additional value network and learns goal-directed navigation by exploring diverse trajectories.
Result: On R2R and REVERIE, SPL in unseen environments improves by +3.0% and +1.71% respectively; under extreme early-stage perturbations, SPL improves by +14.89% over the baseline.
Conclusion: Goal-directed reinforcement learning yields more robust VLN policies, and NavGRPO demonstrates the effectiveness of this approach.
Abstract: Vision-and-Language Navigation (VLN) requires agents to navigate photo-realistic environments following natural language instructions. Current methods predominantly rely on imitation learning, which suffers from limited generalization and poor robustness to execution perturbations. We present NavGRPO, a reinforcement learning framework that learns goal-directed navigation policies through Group Relative Policy Optimization. By exploring diverse trajectories and optimizing via within-group performance comparisons, our method enables agents to distinguish effective strategies beyond expert paths without requiring additional value networks. Built on ScaleVLN, NavGRPO achieves superior robustness on R2R and REVERIE benchmarks with +3.0% and +1.71% SPL improvements in unseen environments. Under extreme early-stage perturbations, we demonstrate +14.89% SPL gain over the baseline, confirming that goal-directed RL training builds substantially more robust navigation policies. Code and models will be released.
[446] Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation
Xiaoxian Zhang,Minghai Shi,Lei Li
Main category: cs.CV
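TL;DR: Proposes SpecDepth, a parameter-efficient adaptation framework whose adaptive spectral rectification module uses a learnable wavelet decomposition to amplify attenuated high-frequency components in feature maps. It targets the failure of vision foundation models on monocular depth estimation in colonoscopy, which stems from a frequency-domain statistical shift (missing high-frequency edges and texture gradients), and reaches SOTA on C3VD and SimCol3D.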
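The idea of amplifying attenuated high-frequency content through a wavelet decomposition can be shown on a one-dimensional signal. The sketch below substitutes a fixed one-level Haar transform and a scalar gain for the learnable decomposition the entry describes, so it illustrates the principle only; the function names and the `gain` parameter are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_forward(signal):
    """One-level Haar transform: (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * _S for a, b in pairs]  # low-frequency part
    detail = [(a - b) * _S for a, b in pairs]  # high-frequency part
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * _S, (a - d) * _S])
    return out


def amplify_high_frequency(signal, gain):
    """Scale the detail band by `gain`, leaving local averages untouched.

    ASSUMPTION: a fixed Haar basis and a scalar gain stand in for the
    learnable wavelet decomposition described in the entry.
    """
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, [gain * d for d in detail])


weak_edges = [1.0, 3.0, 2.0, 2.0]
sharpened = amplify_high_frequency(weak_edges, gain=2.0)
print([round(x, 6) for x in sharpened])  # [0.0, 4.0, 2.0, 2.0]
```

With `gain=1.0` the signal is reconstructed unchanged; larger gains widen local contrast while preserving each pair's mean, which is the sense in which only the high-frequency band is adjusted.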
Details
Motivation: Foundation models trained on natural images do not generalize directly to colonoscopy. The core problem is not a semantic gap but a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edges and texture gradients the models rely on for geometric reasoning.
Method: Proposes SpecDepth with an adaptive spectral rectification module that uses a learnable wavelet decomposition to explicitly model and amplify attenuated high-frequency components in feature maps; this targeted, low-level adjustment realigns the input with the pretrained model's inductive bias and avoids the distortion of high-level semantic features that conventional fine-tuning risks.
Result: On the public C3VD and SimCol3D datasets, SpecDepth achieves absolute relative errors (Abs Rel) of 0.022 and 0.027 respectively, the best reported performance.
Conclusion: Directly addressing spectral mismatch is an effective strategy for adapting vision foundation models to specialized medical imaging tasks.
Abstract: Accurate monocular depth estimation is critical in colonoscopy for lesion localization and navigation. Foundation models trained on natural images fail to generalize directly to colonoscopy. We identify the core issue not as a semantic gap, but as a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edge and texture gradients that these models rely on for geometric reasoning. To address this, we propose SpecDepth, a parameter-efficient adaptation framework that preserves the robust geometric representations of the pre-trained models while adapting to the colonoscopy domain. Its key innovation is an adaptive spectral rectification module, which uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Different from conventional fine-tuning that risks distorting high-level semantic features, this targeted, low-level adjustment realigns the input signal with the original inductive bias of the foundational model. On the public C3VD and SimCol3D datasets, SpecDepth achieved state-of-the-art performance with an absolute relative error of 0.022 and 0.027, respectively. Our work demonstrates that directly addressing spectral mismatches is a highly effective strategy for adapting vision foundation models to specialized medical imaging tasks. The code will be released publicly after the manuscript is accepted for publication.
[447] RieMind: Geometry-Grounded Spatial Agent for Scene Understanding
Fernando Ropero,Erkin Turkoz,Daniel Matos,Junqing Du,Antonio Ruiz,Yanfeng Zhang,Lu Liu,Mingwei Sun,Yongliang Wang
Main category: cs.CV
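TL;DR: Proposes an agentic framework that decouples perception from reasoning by grounding a large language model (LLM) in an explicit 3D scene graph (3DSG). Without task-specific fine-tuning, it improves spatial reasoning on static indoor scenes, exceeding prior methods by up to 16% on the static split of VSI-Bench and base VLMs by 33% to 50% on average.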
Details
Motivation: Vision-language models (VLMs) still struggle with metric and spatial reasoning in indoor scene understanding, and mainstream approaches couple perception with reasoning, which limits reasoning performance.
Method: Proposes a 3DSG-based agentic framework: a dedicated perception module builds a persistent 3DSG (instantiated from ground-truth annotations in the experiments), and the LLM interacts with the scene only through structured geometric tools exposing object dimensions, distances, poses, and spatial relationships, separating perception from reasoning.
Result: On the static split of VSI-Bench, the results give an upper bound under ideal perception that exceeds prior work by up to 16%; compared with base VLMs, the average improvement is 33% to 50%.
Conclusion: Explicit geometric grounding and structured scene representations substantially improve spatial reasoning and offer a compelling alternative to purely end-to-end visual reasoning.
Abstract: Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than previous works, by up to 16\%, without task-specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33\% and 50\%.
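A toy version of the tool interface described here helps make the decoupling concrete: the reasoning side never sees pixels, only answers from geometric queries on a scene graph. The class names, tool names, and object data below are hypothetical and are not taken from the paper; real scene graphs would also carry poses and relations.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    name: str
    center: tuple  # (x, y, z) position in metres
    size: tuple    # (width, depth, height) in metres


class SceneGraph:
    """Minimal stand-in for a 3D scene graph exposing geometric tools."""

    def __init__(self, objects):
        self._objects = {obj.name: obj for obj in objects}

    def get_dimensions(self, name):
        """Tool: bounding-box size of one object."""
        return self._objects[name].size

    def get_distance(self, name_a, name_b):
        """Tool: Euclidean distance between two object centres."""
        a, b = self._objects[name_a], self._objects[name_b]
        return math.dist(a.center, b.center)

    def closest_to(self, name):
        """Tool: name of the object nearest to the given one."""
        others = [n for n in self._objects if n != name]
        return min(others, key=lambda n: self.get_distance(name, n))


# Made-up scene for illustration only.
graph = SceneGraph([
    SceneObject("sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8)),
    SceneObject("table", (3.0, 4.0, 0.4), (1.2, 0.8, 0.75)),
    SceneObject("lamp", (0.5, 0.0, 0.4), (0.3, 0.3, 1.5)),
])
print(graph.get_distance("sofa", "table"))  # 5.0
print(graph.closest_to("sofa"))             # lamp
```

Because each tool returns an exact metric quantity, a question such as "how far is the sofa from the table" reduces to a lookup plus arithmetic rather than visual estimation.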
These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.
[448] AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations
Noe Claudel,Weisi Guo,Yang Xing
Main category: cs.CV
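TL;DR: Proposes a conditional encoder-decoder framework that generates adversarial patches in a single forward pass against cross-camera person re-identification and facial recognition models, supporting both evasion and impersonation attacks and using diffusion models to improve physical realizability; experiments show sharply reduced retrieval accuracy on several benchmarks and cross-model generalization.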
Details
Motivation: Facial identification systems are widely deployed in surveillance but are vulnerable to adversarial evasion and impersonation attacks; existing methods rely on iterative optimisation, which is inefficient and hard to deploy physically.
Method: Proposes a conditional encoder-decoder network that uses multi-scale features from source and target images to generate adversarial patches in a single forward pass; the patches are optimised with a dual adversarial objective comprising pull and push terms, and a pretrained latent diffusion model makes them more natural-looking and physically realizable.
Result: Evasion attacks reduce mAP from 90% to 0.4% in the white-box setting and from 72% to 0.4% in the black-box setting; impersonation attacks reach a 27% success rate on CelebA-HQ; clustering of activation maps indicates which features the attacks rely on.
Conclusion: The method is efficient and general and is more physically feasible than prior approaches, highlighting the practical vulnerability of retrieval-based identification systems and the need for robust defenses.
Abstract: Facial identification systems are increasingly deployed in surveillance and yet their vulnerability to adversarial evasion and impersonation attacks poses a critical risk. This paper introduces a novel framework for generating adversarial patches capable of both evasion and impersonation attacks against deep re-identification models across non-overlapping cameras. Unlike prior approaches that require iterative patch optimisation for each target, our method employs a conditional encoder-decoder network to synthesize adversarial patches in a single forward pass, guided by multi-scale features from source and target images. The patches are optimised with a dual adversarial objective comprising pull and push terms. To enhance imperceptibility and aid physical deployment, we further integrate naturalistic patch generation using pre-trained latent diffusion models. Experiments on standard pedestrian (Market-1501, DukeMTMCreID) and facial recognition (CelebA-HQ, PubFig) benchmark datasets demonstrate the effectiveness of the proposed method. Our adversarial evasion attacks reduce mean Average Precision from 90% to 0.4% in white-box settings and from 72% to 0.4% in black-box settings, showing strong cross-model generalization. In targeted impersonation attacks, our framework achieves a success rate of 27% on CelebA-HQ, competing with other patch-based methods. We go further to use clustering of activation maps to interpret which features are most used by adversarial attacks and propose a pathway for future countermeasures.
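The abstract names pull and push terms without giving their form. As a purely illustrative reading for analysing such objectives, the sketch below assumes that embeddings are compared by cosine similarity, that the pull term draws an embedding toward a target identity, and that the push term moves it away from the source identity. The formula, the `push_weight` parameter, and the toy two-dimensional embeddings are assumptions, not the authors' loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def pull_push_loss(embedding, source, target, push_weight=1.0):
    """Hypothetical dual objective on identity embeddings.

    pull: small when `embedding` is similar to `target`.
    push: small when `embedding` is dissimilar from `source`.
    """
    pull = 1.0 - cosine(embedding, target)
    push = cosine(embedding, source)
    return pull + push_weight * push


source_id = [1.0, 0.0]  # toy embedding of the original identity
target_id = [0.0, 1.0]  # toy embedding of the impersonated identity
print(pull_push_loss(target_id, source_id, target_id))  # 0.0
print(pull_push_loss(source_id, source_id, target_id))  # 2.0
```

Under this reading, an evasion-only objective corresponds to dropping the pull term, which is one way to see why the two attack types can share a single framework and why defenses need to monitor both similarity directions.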
The results highlight the practicality of adversarial patch attacks on retrieval-based systems and underline the urgent need for robust defense strategies.
[449] Pointing-Based Object Recognition
Lukáš Hajdúch,Viktor Kocur
Main category: cs.CV
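TL;DR: Presents a complete pipeline for recognizing the objects targeted by pointing gestures in RGB images, combining object detection, body pose estimation, monocular depth estimation, and vision-language models, and evaluates the contribution of depth information and image captioning models in complex scenes.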
Details
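Motivation: As human-robot interaction moves toward more intuitive interfaces, identifying the targets of non-verbal communication such as pointing gestures becomes crucial.
Method: Integrates existing state-of-the-art methods for object detection, body pose estimation, monocular depth estimation, and vision-language modeling; evaluates the impact of 3D spatial information reconstructed from a single image and the usefulness of image captioning models for correcting classification errors.
Result: Experiments on a custom dataset show that depth information significantly improves target identification, especially in complex scenes with overlapping objects; the modular design allows deployment where dedicated depth sensors are unavailable.
Conclusion: Monocular RGB images combined with depth estimation and vision-language models can identify pointed-at objects, offering a workable option for interaction systems without depth sensors.
Abstract: This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.
[450] Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context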
Mohamed Aziz Younes,Nicolas Saunier,Guillaume-Alexandre Bilodeau
Main category: cs.CV
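TL;DR: Proposes Adaptive Residual Context (ARC), an architecture for incrementally adding new autonomous vehicle categories to video object detection while avoiding catastrophic forgetting and preserving contextual understanding of traffic scenes.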
Details
Motivation: New road users such as autonomous shuttles need ongoing monitoring to evaluate their safety, but conventional detectors tend to forget catastrophically when new classes are added, degrading scene understanding, and contextual knowledge is critical in road safety applications.
Method: Proposes ARC, which has a frozen context branch and trainable task-specific branches linked by a Context-Guided Bridge that uses attention to transfer spatial features while protecting pretrained representations.
Result: On a custom dataset, ARC matches fine-tuned baselines on the new vehicle category while significantly improving knowledge retention, and it is data-efficient.
Conclusion: ARC offers a way to extend detection to new vehicle categories in complex urban environments while balancing accuracy with contextual stability.
Abstract: The progressive automation of transport promises to enhance safety and sustainability through shared mobility. Like other vehicles and road users, and even more so for such a new technology, it requires monitoring to understand how it interacts in traffic and to evaluate its safety. This can be done with fixed cameras and video object detection. However, the addition of new detection targets generally requires a fine-tuning approach for regular detection methods. Unfortunately, this implementation strategy will lead to a phenomenon known as catastrophic forgetting, which causes a degradation in scene understanding. In road safety applications, preserving contextual scene knowledge is of the utmost importance for protecting road users. We introduce the Adaptive Residual Context (ARC) architecture to address this. ARC links a frozen context branch and trainable task-specific branches through a Context-Guided Bridge, utilizing attention to transfer spatial features while preserving pre-trained representations. Experiments on a custom dataset show that ARC matches fine-tuned baselines while significantly improving knowledge retention, offering a data-efficient solution to add new vehicle categories for complex urban environments.
[451] AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation
Zhenyu Xie,Ji Xia,Michael Kampffmeyer,Panwen Hu,Zehua Ma,Yujian Zheng,Jing Wang,Zheng Chong,Xujie Zhang,Xianhang Cheng,Xiaodan Liang,Hao Li
Main category: cs.CV
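TL;DR: Proposes AnyCrowd, a Diffusion Transformer-based framework for controllable multi-character animation that uses an instance-isolated latent representation, tri-stage decoupled attention, and an adaptive gated fusion module to address identity entanglement, identity-pose mis-binding, and ambiguity in overlapping regions.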
Details
Motivation: Existing controllable character animation methods struggle in multi-character scenes: latent identity entanglement causes identity confusion, identities bind imprecisely to driving poses, and spatio-temporal consistency is poor, so a scalable multi-character solution is needed.
Method: Proposes AnyCrowd with (1) an Instance-Isolated Latent Representation (IILR) to prevent identity entanglement; (2) Tri-Stage Decoupled Attention (TSDA), decomposed into instance-aware foreground attention, background-centric interaction, and global foreground-background coordination; and (3) an Adaptive Gated Fusion (AGF) module to reduce token ambiguity in overlapping regions.
Result: AnyCrowd improves identity controllability and spatio-temporal consistency in multi-character animation, supports an arbitrary number of characters, suppresses identity drift and pose mis-binding, and outperforms existing methods on several benchmarks.
Conclusion: Through structured decoupled modeling, AnyCrowd achieves high-quality, controllable multi-character video generation and offers a scalable approach to complex crowd animation.
Abstract: Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...
[452] Gym-V: A Unified Vision Environment System for Agentic Vision Research
Fanqing Meng,Lingxiao Du,Jiawei Gu,Jiaqi Liao,Linjie Li,Zijian Wu,Xiangyan Liu,Ziqi Zhao,Mengkang Hu,Yue Zhang,Zichen Liu,Jiaheng Zhang,Michael Qizhe Shieh
Main category: cs.CV
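TL;DR: This paper introduces Gym-V, a standardized reinforcement learning environment platform for vision agents, comprising 179 procedurally generated visual environments across 10 domains with controllable difficulty. Experiments show that observation scaffolding (such as captions and game rules) matters more for training success than the choice of RL algorithm, that cross-domain training aids generalization, and that narrow-domain training can cause negative transfer.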
Details
Motivation: Vision agents lack infrastructure comparable to the standard "gym" of conventional reinforcement learning, which limits systematic study of their learning mechanisms and identification of where models fall short. Method: Builds Gym-V, a unified platform of 179 procedurally generated, multi-domain visual environments with controllable difficulty, and runs observation scaffolding comparisons, cross-domain transfer experiments, and multi-turn interaction experiments. Result: Observation scaffolding (such as captions and rules) decides training success more than the choice of RL algorithm; cross-domain training generalizes broadly, while narrow-domain training tends to cause negative transfer; multi-turn interaction amplifies these effects. Conclusion: Gym-V provides a reproducible, extensible foundation for research on vision-language models (VLMs) as agents, supporting systematic evaluation and training of vision agents. Abstract: As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.
[453] Real-Time Human Frontal View Synthesis from a Single Image
Fangyu Lin,Yingdong Hu,Lunjie Zhu,Zhening Liu,Yushi Huang,Zehong Lin,Jun Zhang
Main category: cs.CV
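TL;DR: PrismMirror is a geometry-guided framework for real-time frontal view synthesis of humans from a single image. It learns geometric features coarse-to-fine through a cascade learning strategy and is distilled into a lightweight linear attention model, achieving high fidelity and structural accuracy at 24 FPS.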
Details
Motivation: Existing methods struggle to balance visual fidelity with explicit geometric understanding, and they face memory bottlenecks and insufficient real-time performance, performing poorly on intricate regions such as faces and hands. Method: Proposes the PrismMirror framework with a cascade learning strategy: it first directly learns coarse geometric features such as SMPL-X meshes and point clouds, then refines textures through rendering supervision; the framework is distilled into a lightweight linear attention model for efficiency. Result: Achieves the first real-time inference (24 FPS) for monocular human frontal view synthesis, significantly outperforming previous methods in visual authenticity and structural accuracy. Conclusion: By avoiding external geometric modeling, focusing on frontal view synthesis, and learning geometry in a cascade, PrismMirror improves visual integrity and geometric accuracy while remaining real-time, making it suitable for immersive 3D telepresence. Abstract: Photorealistic human novel view synthesis from a single image is crucial for democratizing immersive 3D telepresence, eliminating the need for complex multi-camera setups. However, current rendering-centric methods prioritize visual fidelity over explicit geometric understanding and struggle with intricate regions like faces and hands, leading to temporal instability. Meanwhile, human-centric frameworks suffer from memory bottlenecks since they typically rely on an auxiliary model to provide informative structural priors for geometric modeling, which limits real-time performance. To address these challenges, we propose PrismMirror, a geometry-guided framework for instant frontal view synthesis from a single image. By avoiding external geometric modeling and focusing on frontal view synthesis, our model optimizes visual integrity for telepresence. Specifically, PrismMirror introduces a novel cascade learning strategy that enables coarse-to-fine geometric feature learning. It first directly learns coarse geometric features, such as SMPL-X meshes and point clouds, and then refines textures through rendering supervision. To achieve real-time efficiency, we distill this unified framework into a lightweight linear attention model. Notably, PrismMirror is the first monocular human frontal view synthesis model that achieves real-time inference at 24 FPS, significantly outperforming previous methods in both visual authenticity and structural accuracy.
[454] MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
Zheng Zhang,Qinchuan Zhang,Yuteng Ye,Zhi Chen,Penglei Ji,Mengfei Li,Wenxiao Zhang,Yuan Liu
Main category: cs.CV
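TL;DR: This paper proposes MV2UV, which combines 2D generative priors from multiview generation with the inpainting ability of UV-space refinement to resolve multiview texture inconsistency and missing textures in unseen regions, significantly improving texture generation quality.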
Details
Motivation: Existing methods suffer from multiview texture inconsistency and missing textures in unseen regions, while UV inpainting methods generalize poorly and struggle to exploit 2D diffusion priors. Method: Proposes MV2UV, which builds a UV-space generative model that inpaints the unseen regions of multiview images while also reducing the inconsistency between those images. Result: Experiments show better texture generation quality than existing methods, especially in occluded regions and multiview-inconsistent regions. Conclusion: MV2UV effectively combines the strengths of multiview generation and UV inpainting, improving the overall quality and robustness of texture generation for 3D assets. Abstract: Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from multiview inconsistency and missing textures on unseen parts, while UV inpainting texture methods do not generalize well due to insufficient UV data and cannot make good use of 2D image diffusion priors. In this paper, we propose a new method called MV2UV that combines 2D generative priors from multiview generation and the inpainting ability of UV refinement to get high-quality texture maps. Our key idea is to adopt a UV space generative model that simultaneously inpaints unseen parts of multiview images while resolving the inconsistency of multiview images. Experiments show that our method achieves better texture generation quality than existing methods, especially in unseen occluded and multiview-inconsistent parts.
[455] Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task
Yurui Dong,Ziyue Wang,Shuyun Lu,Dairu Liu,Xuechen Liu,Fuwen Luo,Peng Li,Yang Liu
Main category: cs.CV
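TL;DR: This paper presents EscapeCraft-4D, a customizable 4D environment for evaluating multimodal large models on time-sensitive, selective cross-modal perception tasks, together with a corresponding benchmark. It reveals significant gaps in current models' cross-modal integration under temporal constraints, as well as modality bias.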
Details
Motivation: Existing environments lack support for temporally dependent audio signals and selective cross-modal integration (where modalities may complement or interfere with each other), so models' ability to actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. Method: Builds EscapeCraft-4D, a 4D environment with trigger-based auditory sources, temporally transient evidence, and location-dependent cues; designs a benchmark on top of it to evaluate mainstream Omni models on spatio-temporal reasoning and proactive multimodal integration. Result: Experiments show that models commonly exhibit modality bias and integrate modalities poorly under time constraints; in-depth analysis reveals how modalities interact and jointly influence decisions in complex reasoning. Conclusion: EscapeCraft-4D fills a gap in evaluating time awareness and selective perception in realistic multimodal reasoning, exposes key weaknesses of current MLLMs in dynamic, irreversible multimodal scenarios, and offers a new direction and benchmark for future research. Abstract: Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks, offering limited support for temporally dependent auditory signals and selective cross-modal integration, where different modalities may provide complementary or interfering information, which are essential capabilities for realistic multimodal reasoning. As a result, whether models can actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. To this end, we introduce EscapeCraft-4D, a customizable 4D environment for assessing selective cross-modal perception and time awareness in Omni models. It incorporates trigger-based auditory sources, temporally transient evidence, and location-dependent cues, requiring agents to perform spatio-temporal reasoning and proactive multimodal integration under time constraints. Building on this environment, we curate a benchmark to evaluate corresponding abilities across powerful models. Evaluation results suggest that models struggle with modality bias, and reveal significant gaps in current models' ability to integrate multiple modalities under time constraints. Further in-depth analysis uncovers how multiple modalities interact and jointly influence model decisions in complex multimodal reasoning environments.
[456] Automated Counting of Stacked Objects in Industrial Inspection
Corentin Dumery,Noa Etté,Aoxiang Fan,Ren Li,Jingyi Xu,Hieu Le,Pascal Fua
Main category: cs.CV
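TL;DR: This paper proposes a new 3D visual counting method that estimates the 3D geometry and occupancy ratio of stacked objects from multi-view images, addressing automated counting under heavy occlusion in industrial inspection.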
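The two estimated quantities in this entry (stack geometry and occupancy ratio) suggest a simple arithmetic relation to the final count. The sketch below assumes the count is the occupied volume divided by the volume of one part. The abstract does not state this formula, and the function name, parameters, and numbers are illustrative only.

```python
def estimate_count(stack_volume, occupancy_ratio, unit_volume):
    """Estimate how many identical parts fill a stack (assumed relation).

    stack_volume    -- volume enclosed by the reconstructed stack (e.g. cm^3)
    occupancy_ratio -- fraction of that volume filled by parts, in [0, 1]
    unit_volume     -- volume of a single part, same unit as stack_volume
    """
    if not 0.0 <= occupancy_ratio <= 1.0:
        raise ValueError("occupancy_ratio must lie in [0, 1]")
    if unit_volume <= 0:
        raise ValueError("unit_volume must be positive")
    # Occupied volume divided by per-part volume, rounded to a whole count.
    return round(stack_volume * occupancy_ratio / unit_volume)


# Hypothetical numbers: a 1000 cm^3 stack, 60% filled, 2 cm^3 per part.
count = estimate_count(1000.0, 0.6, 2.0)
```

Splitting the problem this way means errors in geometry and in occupancy can be inspected separately, which is one plausible reason for the decomposition.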
Details
Motivation: In industrial inspection, manufactured parts are often too light or too heavy to be counted reliably by weighing, and existing visual counting methods perform poorly on heavily occluded 3D stacks in containers, pallets, or bins. Method: Decomposes 3D counting into two subproblems, estimating the stack's 3D geometry and its occupancy ratio from multi-view images, and combines geometric reconstruction with deep learning-based depth analysis. Result: The method is validated on large-scale synthetic data and diverse real-world industrial data with manually verified counts, showing strong robustness to irregular stacking and partial occlusion. Conclusion: The proposed 3D counting method offers a reliable, practical approach to accurate, high-throughput inventory tracking and quality assurance in industrial visual inspection. Abstract: Visual object counting is a fundamental computer vision task in industrial inspection, where accurate, high-throughput inventory tracking and quality assurance are critical. Moreover, manufactured parts are often too light to reliably deduce their count from their weight, or too heavy to move the stack on a scale safely and practically, making automated visual counting the more robust solution in many scenarios. However, existing methods struggle with stacked 3D items in containers, pallets, or bins, where most objects are heavily occluded and only a few are directly visible. To address this important yet underexplored challenge, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the stack and its occupancy ratio from multi-view images. By combining geometric reconstruction with deep learning-based depth analysis, our method can accurately count identical manufactured parts inside containers, even when they are irregularly stacked and partially hidden. We validate our 3D counting pipeline on large-scale synthetic and diverse real-world data with manually verified total counts, demonstrating robust performance under realistic inspection conditions.
[457] Anchor then Polish for Low-light Enhancement
Tianle Du,Mingjia Li,Hainuo Wang,Xiaojie Guo
Main category: cs.CV
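TL;DR: This paper proposes an anchor-then-polish (ATP) framework that decouples global energy alignment from local detail refinement through macro anchoring (stabilizing the luminance distribution and correcting color) and micro polishing (detail enhancement in the wavelet domain and chrominance space), achieving efficient low-light image enhancement with a simple linear operator.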
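The abstract of this entry describes macro anchoring as a scene-adaptive projection matrix with 12 degrees of freedom. One reading, assumed here and not confirmed by the abstract, is a 3x4 affine transform on RGB (a 3x3 color mix plus a per-channel offset). The sketch below applies such a matrix to a few pixels; the matrix values are hypothetical and hand-picked, whereas the paper learns them per scene.

```python
def apply_projection(pixels, matrix):
    """Apply a 3x4 affine color transform to RGB pixels with values in [0, 1].

    pixels -- list of (r, g, b) tuples
    matrix -- 3 rows of 4 numbers: three mixing weights and one offset per row
    """
    out = []
    for r, g, b in pixels:
        out.append(tuple(
            # Clip each output channel back into the displayable range.
            min(1.0, max(0.0, m[0] * r + m[1] * g + m[2] * b + m[3]))
            for m in matrix
        ))
    return out


# Identity transform: 12 parameters, leaves pixels unchanged.
IDENTITY = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Hypothetical anchoring matrix: lifts brightness, boosts blue slightly more.
BRIGHTEN = [
    [2.0, 0.0, 0.0, 0.05],
    [0.0, 2.0, 0.0, 0.05],
    [0.0, 0.0, 2.5, 0.05],
]

dark_pixels = [(0.10, 0.12, 0.08), (0.02, 0.03, 0.02)]
anchored = apply_projection(dark_pixels, BRIGHTEN)
```

A global transform of this kind cannot restore local texture, which is consistent with the entry leaving fine detail to a separate polishing stage.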
Details
Motivation: Low-light image enhancement faces entangled degradations such as poor illumination, color shifts, and texture interference; existing methods rely on complex architectures to handle them jointly and tend to overfit physical constraints, causing global distortions. Method: Proposes the anchor-then-polish (ATP) framework: 1) macro anchoring learns a scene-adaptive projection matrix with only 12 degrees of freedom to stabilize luminance and correct color; 2) micro polishing enhances details in the wavelet domain and chrominance space under matrix guidance, with a constrained luminance update strategy to preserve global consistency. Result: Experiments on multiple benchmarks show state-of-the-art performance, with visually natural and quantitatively superior enhancement results. Conclusion: A simple linear operation is enough to align global energy effectively, while detail enhancement should focus on finer representation spaces; ATP achieves efficient, stable, and interpretable low-light image enhancement. Abstract: Low-light image enhancement is challenging due to entangled degradations, mainly including poor illumination, color shifts, and texture interference. Existing methods often rely on complex architectures to address these issues jointly but may overfit simple physical constraints, leading to global distortions. This work proposes a novel anchor-then-polish (ATP) framework to fundamentally decouple global energy alignment from local detail refinement. First, macro anchoring is customized to greatly stabilize the luminance distribution and correct color by learning a scene-adaptive projection matrix with merely 12 degrees of freedom, revealing that a simple linear operator can effectively align global energy. The macro anchoring then reduces the task to micro polishing, which further refines details in the wavelet domain and chrominance space under matrix guidance. A constrained luminance update strategy is designed to ensure global consistency while directing the network to concentrate on fine-grained polishing. Extensive experiments on multiple benchmarks show that our method achieves state-of-the-art performance, producing visually natural and quantitatively superior low-light enhancements.
[458] Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation
Yuanfan Zheng,Kunyu Peng,Xu Zheng,Kailun Yang
Main category: cs.CV
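TL;DR: This paper proposes the EDA-PSeg framework to address geometric FoV distortion and open-set semantic inconsistency in cross-domain panoramic semantic segmentation. Euler-Margin Attention and a Graph Matching Adapter improve viewpoint-invariant representation and generalization to novel classes, reaching SOTA on multiple benchmarks.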
Details
Motivation: Cross-domain panoramic semantic segmentation faces severe geometric FoV distortion and inconsistent open-set semantics across domains; existing methods struggle to handle viewpoint changes and unknown-class recognition at the same time. Method: Proposes the EDA-PSeg framework, consisting of Euler-Margin Attention (which introduces an angular margin plus amplitude and phase modulation) and a Graph Matching Adapter (which builds high-order graph relations to align shared semantics and separate novel categories). Result: Achieves SOTA performance on four benchmark datasets covering camera shift, weather change, and open-set scenarios, showing strong robustness and generalization across diverse viewing geometries and environmental changes. Conclusion: EDA-PSeg effectively mitigates the twin challenges of geometric distortion and open-set semantics in panoramic segmentation, providing a scalable, robust domain adaptation solution for 360° understanding in real scenes. Abstract: Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting, and propose the Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework, which trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin to enhance viewpoint-invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at https://github.com/zyfone/EDA-PSeg.
[459] ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer
Ruonan Yu,Zhenxiong Tan,Zigeng Chen,Songhua Liu,Xinchao Wang
Main category: cs.CV
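TL;DR: This paper proposes ViFeEdit, a tuning framework for video diffusion transformers that requires no video data: controllable video generation and editing are achieved using only 2D images. Its core is decoupling spatial and temporal attention, combined with a dual-path timestep embedding design.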
Details
Motivation: Existing video control and editing methods are constrained by the scarcity of paired video data and high training cost, which makes them hard to scale to the video domain. Method: Proposes the ViFeEdit framework, which uses an architectural reparameterization to decouple spatial independence from 3D attention and introduces dual-path timestep embeddings to support diverse conditioning signals; tuning uses 2D images only. Result: Without any video training data, the method achieves high-quality, temporally consistent controllable video generation and editing; experiments show strong results at low computational cost. Conclusion: ViFeEdit offers an efficient, lightweight, data-friendly tuning paradigm for video diffusion models, substantially lowering the data and compute barriers for video editing tasks. Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to the image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results of controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.
[460] RSGen: Enhancing Layout-Driven Remote Sensing Image Generation with Diverse Edge Guidance
Xianbao Hou,Yonghao He,Zeyd Boukhers,John See,Hu Su,Wei Sui,Cong Yang
Main category: cs.CV
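TL;DR: This paper proposes RSGen, a framework that enhances layout-driven remote sensing image generation through diverse edge guidance, addressing the shortcomings of existing Layout-to-Image models in fine-grained control and adherence to bounding box constraints.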
Details
Motivation: Existing diffusion-based Layout-to-Image generation methods for remote sensing images are limited in fine-grained control and in strict adherence to bounding box constraints. Method: RSGen uses a progressive enhancement strategy: it first uses image-to-image generation to enrich the diversity of edge maps retrieved from training samples, then feeds these diverse edge maps as conditioning into existing L2I models to achieve pixel-level control within bounding boxes. Result: On the DOTA dataset with the CC-Diff model, RSGen improves YOLOScore mAP50/mAP50-95 by +9.8/+12.0 and downstream detection mAP by +1.6. Conclusion: RSGen is a plug-and-play framework that significantly improves the layout control and accuracy of existing L2I models for remote sensing image generation. Abstract: Diffusion models have significantly mitigated the impact of annotated data scarcity in remote sensing (RS). Although recent approaches have successfully harnessed these models to enable diverse and controllable Layout-to-Image (L2I) synthesis, they still suffer from limited fine-grained control and fail to strictly adhere to bounding box constraints. To address these limitations, we propose RSGen, a plug-and-play framework that leverages diverse edge guidance to enhance layout-driven RS image generation. Specifically, RSGen employs a progressive enhancement strategy: 1) it first enriches the diversity of edge maps composited from retrieved training instances via Image-to-Image generation; and 2) subsequently utilizes these diverse edge maps as conditioning for existing L2I models to enforce pixel-level control within bounding boxes, ensuring the generated instances strictly adhere to the layout. Extensive experiments across three baseline models demonstrate that RSGen significantly boosts the capabilities of existing L2I models. For instance, with CC-Diff on the DOTA dataset for oriented object detection, we achieve remarkable gains of +9.8/+12.0 in YOLOScore mAP50/mAP50-95 and +1.6 in mAP on the downstream detection task. Our code will be publicly available: https://github.com/D-Robotics-AI-Lab/RSGen
[461] Real-Time Oriented Object Detection Transformer in Remote Sensing Images
Zeyu Ding,Yong Zhou,Jiaqi Zhao,Wen-Liang Du,Xixi Li,Rui Yao,Abdulmotaleb El Saddik
Main category: cs.CV
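TL;DR: This paper proposes an end-to-end oriented object detection transformer for real-time applications (the O2 series). Through angle distribution refinement, Chamfer distance matching, and oriented contrastive denoising, it significantly improves detection accuracy and training stability for arbitrarily oriented objects in remote sensing images.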
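The Chamfer distance matching mentioned in this entry measures box distance via vertex sets. The sketch below shows one common form of that idea: a symmetric, L2-based Chamfer distance between the corners of two oriented boxes. The helper names and the exact normalization are assumptions for illustration; the cost used in the paper may differ.

```python
import math


def box_vertices(cx, cy, w, h, angle):
    """Return the four corners of an oriented box (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        # Rotate the corner offset, then translate to the box centre.
        corners.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return corners


def chamfer_distance(a, b):
    """Mean nearest-neighbour distance between two point sets, averaged over both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))


# The same rectangle written two ways: (w=4, h=2, 0 deg) and (w=2, h=4, 90 deg).
same_a = box_vertices(0.0, 0.0, 4.0, 2.0, 0.0)
same_b = box_vertices(0.0, 0.0, 2.0, 4.0, math.pi / 2)
d_same = chamfer_distance(same_a, same_b)    # numerically close to zero

# A box shifted by (3, 4) from an otherwise identical box.
shifted = chamfer_distance(box_vertices(3.0, 4.0, 20.0, 10.0, 0.0),
                           box_vertices(0.0, 0.0, 20.0, 10.0, 0.0))
```

Because the distance depends only on where the corners are, two parameterizations of the same rectangle cost nothing to match, whereas comparing `(w, h, angle)` tuples directly would penalize them.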
Details
Motivation: Existing real-time detection transformers do not explicitly model object rotation; in remote sensing images, where objects appear at arbitrary orientations, this causes problems with angle representation, matching cost, and training stability. Method: Proposes angle distribution refinement (recasting angle regression as iterative refinement of a probability distribution), Chamfer distance matching (measuring box distance via vertex sets), and oriented contrastive denoising (analyzing four noise modes and introducing an instability metric). Result: O2-DFINE-L, O2-RTDETR-R50, and O2-DEIM-R50 reach 77.73%/78.45%/80.15% AP50 on DOTA1.0 and 132/119/119 FPS on a 2080 Ti GPU. Conclusion: The method is the first real-time end-to-end oriented object detection transformer, striking a strong balance between accuracy and speed and offering a new paradigm for remote sensing image detection. Abstract: Recent real-time detection transformers have gained popularity due to their simplicity and efficiency. However, these detectors do not explicitly model object rotation, especially in remote sensing imagery where objects appear at arbitrary angles, leading to challenges in angle representation, matching cost, and training stability. In this paper, we propose a real-time oriented object detection transformer, the first real-time end-to-end oriented object detector to the best of our knowledge, that addresses the above issues. Specifically, angle distribution refinement is proposed to reformulate angle regression as an iterative refinement of probability distributions, thereby capturing the uncertainty of object rotation and providing a more fine-grained angle representation. Then, we incorporate a Chamfer distance cost into bipartite matching, measuring box distance via vertex sets, enabling more accurate geometric alignment and eliminating ambiguous matches. Moreover, we propose oriented contrastive denoising to stabilize training and analyze four noise modes. We observe that a ground truth can be assigned to different index queries across different decoder layers, and analyze this issue using the proposed instability metric. We design a series of model variants and experiments to validate the proposed method. Notably, our O2-DFINE-L, O2-RTDETR-R50 and O2-DEIM-R50 achieve 77.73%/78.45%/80.15% AP50 on DOTA1.0 and 132/119/119 FPS on a 2080 Ti GPU. Code is available at https://github.com/wokaikaixinxin/ai4rs.
[462] FreeTalk: Emotional Topology-Free 3D Talking Heads
Federico Nocentini,Thomas Besnier,Claudio Ferrari,Stefano Berretti,Mohamed Daoudi
Main category: cs.CV
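TL;DR: FreeTalk is a template-free, two-stage method for speech-driven 3D facial animation that supports unregistered face meshes of arbitrary topology and models emotional dynamics controllably.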
Details
Motivation: Existing methods depend on registered template meshes and generalize poorly to raw 3D scans of arbitrary topology; they also struggle to model emotional dynamics beyond lip articulation without template-based parameterizations. Method: Proposes the FreeTalk framework: in the first stage, Audio-To-Sparse (ATS) predicts an emotion-conditioned sequence of 3D landmark displacements from speech audio; in the second stage, Sparse-To-Mesh (STM) uses intrinsic surface features and landmark-to-vertex conditioning to map the sparse motion onto any target mesh, without template fitting or correspondence supervision. Result: FreeTalk matches specialized baselines when trained in-domain and is substantially more robust to unseen identities and mesh topologies. Conclusion: FreeTalk delivers general, emotion-controllable speech-driven 3D animation for face meshes of arbitrary topology, removing the dependence on templates and explicit correspondences. Abstract: Speech-driven 3D facial animation has advanced rapidly, yet most approaches remain tied to registered template meshes, preventing effective deployment on raw 3D scans with arbitrary topology. At the same time, modeling controllable emotional dynamics beyond lip articulation remains challenging, and is often tied to template-based parameterizations. We address these challenges by proposing FreeTalk, a two-stage framework for emotion-conditioned 3D talking-head animation that generalizes to unregistered face meshes with arbitrary vertex count and connectivity. First, Audio-To-Sparse (ATS) predicts a temporally coherent sequence of 3D landmark displacements from speech audio, conditioned on an emotion category and intensity. This sparse representation captures both articulatory and affective motion while remaining independent of mesh topology. Second, Sparse-To-Mesh (STM) transfers the predicted landmark motion to a target mesh by combining intrinsic surface features with landmark-to-vertex conditioning, producing dense per-vertex deformations without template fitting or correspondence supervision at test time. Extensive experiments show that FreeTalk matches specialized baselines when trained in-domain, while providing substantially improved robustness to unseen identities and mesh topologies. Code and pre-trained models will be made publicly available.
[463] Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models
Amy Rafferty,Rishi Ramaesh,Ajitha Rajan
Main category: cs.CV
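TL;DR: This paper proposes the CARS framework, which uses anatomically grounded synthetic image generation to improve the robustness and clinical trustworthiness of AI diagnostic models on chest X-rays.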
Details
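Motivation: Public chest X-ray datasets systematically underrepresent critical combinations of clinical features, leaving models under-trained in high-stakes scenarios and short of the robustness that clinical deployment requires. Method: Proposes the CARS (Clinically Aware and Anatomically Grounded Synthetic) framework, which applies targeted perturbations to clinical feature vectors to insert or delete pathological findings in a controlled way while preserving anatomical structure, generating high-quality synthetic images; fine-tuning is evaluated across seven backbone networks on MIMIC-CXR. Result: Compared with prior feature perturbation methods, fine-tuning on CARS-generated images improves precision-recall performance, reduces predictive uncertainty, and improves model calibration; structural and semantic analyses show high anatomical fidelity, strong feature alignment, and low semantic uncertainty; independent evaluation by two expert radiologists confirms clinical realism and agreement. Conclusion: CARS shows that anatomically faithful synthetic data generation is a viable and effective strategy for broadening feature-space coverage and improving the performance and trustworthiness of chest X-ray classification systems, without compromising clinical integrity. Abstract: The clinical deployment of AI diagnostic models demands more than benchmark accuracy - it demands robustness across the full spectrum of disease presentations. However, publicly available chest radiographic datasets systematically underrepresent critical clinical feature combinations, leaving models under-trained precisely where clinical stakes are highest. We present CARS, a clinically aware and anatomically grounded framework that addresses this gap through principled synthetic image generation. CARS applies targeted perturbations to clinical feature vectors, enabling controlled insertion and deletion of pathological findings while explicitly preserving anatomical structure. We evaluate CARS across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior feature perturbation approaches, fine-tuning on CARS-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong feature alignment, and low semantic uncertainty. Independent evaluation by two expert radiologists further confirms realism and clinical agreement.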
As the field moves toward regulated clinical AI, CARS demonstrates that anatomically faithful synthetic data generation for better feature space coverage is a viable and effective strategy for improving both the performance and trustworthiness of chest X-ray classification systems - without compromising clinical integrity.
[464] Kimodo: Scaling Controllable Human Motion Generation
Davis Rempe,Mathis Petrovich,Ye Yuan,Haotian Zhang,Xue Bin Peng,Yifeng Jiang,Tingwu Wang,Umar Iqbal,David Minor,Michael de Ruyter,Jiefeng Li,Chen Tessler,Edy Lim,Eugene Jeong,Sam Wu,Ehsan Hassani,Michael Huang,Jin-Bey Yu,Chaeyeon Chung,Lina Song,Olivier Dionne,Jan Kautz,Simon Yuen,Sanja Fidler
Main category: cs.CV
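TL;DR: This paper presents Kimodo, a motion diffusion model with a two-stage denoiser, trained on 700 hours of optical motion capture data, that generates high-quality human motion driven by text and a range of kinematic constraints (such as keyframes and 2D paths).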
Details
Motivation: Public mocap datasets are small, which limits the motion quality, control accuracy, and generalization of generative models; large-scale, high-quality data and better modeling are needed. Method: Designs a new motion representation and a two-stage denoiser architecture (separating root and body prediction) that supports generation conditioned on text and multiple kinematic constraints (full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths). Result: Experiments on the large-scale mocap dataset validate the design; the effects of dataset size and model size on performance are analyzed empirically; generated motions are high-quality, accurately controlled, and flexible. Conclusion: Through large-scale data and structured modeling, Kimodo markedly improves the quality and practicality of controllable human motion generation, offering a new paradigm for robotics, simulation, and entertainment. Abstract: High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how the scaling of dataset size and model size affect performance.
[465] Self-Distillation of Hidden Layers for Self-Supervised Representation Learning
Scott C. Lowe,Anthony Fuller,Sageev Oore,Evan Shelhamer,Graham W. Taylor
Main category: cs.CV
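TL;DR: This paper proposes Bootleg, in which the model predicts latent representations from multiple hidden layers of a teacher network, learning features at different levels of abstraction simultaneously and significantly outperforming existing self-supervised learning methods on image classification and semantic segmentation.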
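The multi-layer objective in this entry can be pictured as one prediction loss per selected teacher layer, combined into a single training signal. The sketch below assumes a mean squared error per layer and equal layer weights. Neither choice is stated in the abstract, and the function name and layer indices are hypothetical.

```python
def multi_layer_distillation_loss(student_preds, teacher_hiddens):
    """Average per-layer MSE between student predictions and teacher hidden states.

    Both arguments map a layer index to a list of floats of equal length.
    Assumed form: MSE per layer, then an unweighted mean over layers.
    """
    total = 0.0
    for layer, target in teacher_hiddens.items():
        pred = student_preds[layer]
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return total / len(teacher_hiddens)


# Hypothetical targets from a shallow layer (4) and a deep layer (8).
teacher = {4: [1.0, 2.0], 8: [0.0, 0.0]}
student = {4: [1.0, 4.0], 8: [1.0, 1.0]}
loss = multi_layer_distillation_loss(student, teacher)
```

Targets from shallow layers stay closer to the input while deep-layer targets are more abstract, which is the mix of abstraction levels the entry describes.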
Details
Motivation: Existing self-supervised learning methods fall into generative approaches (e.g., MAE) and predictive approaches (e.g., I-JEPA); the former are computationally expensive and do not prioritize high-level conceptual features, while the latter train unstably because they rely on non-stationary targets. Method: Bootleg has the student model predict latent representations from multiple hidden layers of a teacher network, forming a hierarchical prediction target that learns features at several levels of abstraction at once. Result: Bootleg improves on I-JEPA by about 10% on ImageNet-1K and iNaturalist-21 classification, and also clearly outperforms baselines on ADE20K and Cityscapes semantic segmentation. Conclusion: By predicting multiple hidden-layer states, Bootleg combines the strengths of generative and predictive SSL and strikes a better balance among efficiency, stability, and representation quality. Abstract: The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.
[466] Learning Latent Proxies for Controllable Single-Image Relighting
Haoze Zheng,Zihao Wang,Xianfeng Wu,Yajing Bai,Yexin Liu,Yun Li,Xiaogang Xu,Harry Yang
Main category: cs.CV
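TL;DR: This paper proposes LightCtrl, a single-image relighting method that incorporates physical priors: sparse but physically meaningful cues guide a diffusion model, enabling precise, continuous control over lighting direction, intensity, and color.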
Details
Motivation: Single-image relighting is highly under-constrained; existing diffusion-based methods either depend on dense and fragile intrinsic-image or G-buffer supervision, or lack physical grounding and so offer unreliable control. Method: LightCtrl introduces physical priors at two levels: a latent proxy encoder, trained from a small number of PBR samples, that extracts compact material-geometry cues; and a lighting-aware mask that identifies illumination-sensitive regions and steers the denoiser toward shading-relevant pixels. The proxy branch is refined with a DPO objective, and a large-scale dataset, ScaLight, is built to support training. Result: On object-level and scene-level benchmarks, the method achieves photometrically faithful, controllable relighting, with gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts, surpassing prior diffusion and intrinsic-decomposition methods. Conclusion: Sparse but physically meaningful cues are enough to support high-quality relighting without a full intrinsic decomposition; LightCtrl validates the combination of physical priors with diffusion models and offers a new paradigm for controllable, interpretable generative relighting. Abstract: Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary and redundant for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl, which integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies sensitive illumination regions and steers the denoiser toward shading-relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training.
Across object- and scene-level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.
[467] Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
Lexiang Xiong,Qi Li,Jingwen Ye,Xinchao Wang
Main category: cs.CV
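TL;DR: This paper proposes a new paradigm that treats hallucinations in vision-language models (VLMs) as dynamic pathologies of their computational cognitive trajectory. It builds an interpretable cognitive state space with information-theoretic probes and identifies a duality between geometric abnormality and information-theoretic surprisal, enabling efficient, robust, and attributable hallucination detection.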
Details
Motivation: VLMs frequently hallucinate (produce plausible but factually incorrect outputs), which seriously hinders trustworthy deployment; existing methods mostly treat hallucination as a static output error and do not model or diagnose the cognitive mechanisms inside the generation process. Method: Models VLM generation as a dynamic cognitive trajectory based on the principle of computational rationality; designs information-theoretic probes that project it onto a low-dimensional, interpretable cognitive state space; proposes the "geometric-information duality" principle, turning hallucination detection into geometric anomaly detection; introduces three interpretable metrics, Perceptual Entropy, Inferential Conflict, and Decision Entropy, each corresponding to a different type of pathology. Result: Reaches SOTA performance on multiple benchmarks including POPE, MME, and MS-COCO; is efficient (weak supervision) and robust (to contaminated calibration data); supports causal attribution, mapping errors to specific pathological cognitive states. Conclusion: The framework lifts VLM hallucination diagnosis from the output level to the level of the cognitive process, opening a path toward transparent, auditable, and diagnosable AI systems. Abstract: Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection is thus recast as a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy).
Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.
[468] Panoramic Affordance Prediction
Zixin Zhang,Chenfei Liao,Hongfei Zhang,Harold Haodong Chen,Kanghao Chen,Zichen Wen,Litao Guo,Bin Ren,Xu Zheng,Yinchuan Li,Xuming Hu,Nicu Sebe,Ying-Cong Chen
Main category: cs.CV
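TL;DR: This paper presents the first exploration of affordance prediction under panoramic perception, builds PAP-12K, the first large-scale benchmark dataset for panoramic affordance prediction, and designs PAP, a training-free, coarse-to-fine framework that handles the ultra-high resolution and severe distortion of panoramic images.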
Details
Motivation: Existing affordance prediction methods are confined to pinhole camera models, whose narrow field of view and fragmented observations make it hard to capture holistic environmental context; panoramic images provide global spatial relationships and scene understanding but have not yet been explored. Method: Introduces the PAP-12K dataset (over 1,000 ultra-high-resolution 360° images with more than 12k annotated QA pairs and affordance masks); designs PAP, a training-free framework modeled on human foveal vision, with a three-stage pipeline: recursive localization via grid prompting, adaptive-gaze distortion rectification, and cascaded grounding for instance segmentation. Result: Experiments show that conventional methods degrade severely or fail on PAP-12K, while PAP significantly outperforms SOTA baselines. Conclusion: Panoramic perception provides a more robust basis for environmental understanding in embodied intelligence; PAP-12K and the PAP framework together open a new direction for affordance prediction. Abstract: Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision.
In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
[469] Severe Domain Shift in Skeleton-Based Action Recognition: A Study of Uncertainty Failure in Real-World Gym Environments
Aaditya Khanal,Junxiu Zhou
Main category: cs.CV
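TL;DR: This paper studies the practical deployment gap between multi-view 3D skeleton capture and monocular 2D pose estimation, shows how severe domain shift affects model safety, and proposes a lightweight gating mechanism that improves decision reliability on out-of-distribution data.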
Details
Motivation: In practical deployment, moving from multi-view 3D skeleton capture to monocular 2D pose estimation introduces a significant domain shift whose safety implications have not been sufficiently explored. Method: Uses a novel Gym2D dataset (style/viewpoint shift) and the UCF101 dataset (semantic shift) to evaluate the performance and uncertainty estimation of a Skeleton Transformer under cross-domain conditions, and introduces a lightweight finetuned gating mechanism to make selective classification more reliable. Result: The Skeleton Transformer reaches 63.2% accuracy on NTU-120, but zero-shot transfer accuracy drops to 1.6% on Gym2D and 1.16% on UCF101; standard uncertainty methods fail to detect this collapse (the model remains confidently incorrect, with 99.6% risk even at 50% coverage), and although energy-based scoring (AUROC >= 0.91) and Mahalanobis distance give reliable detection signals, they do not guarantee good risk-coverage behavior; the proposed gating mechanism substantially reduces the rate of confident wrong predictions. Conclusion: High OOD detection AUROC is not equivalent to safe decision-making; calibration and selective abstention are both needed for safe real-world deployment of skeleton-based recognition. Abstract: The practical deployment gap (transitioning from controlled multi-view 3D skeleton capture to unconstrained monocular 2D pose estimation) introduces a compound domain shift whose safety implications remain critically underexplored. We present a systematic study of this severe domain shift using a novel Gym2D dataset (style/viewpoint shift) and the UCF101 dataset (semantic shift). Our Skeleton Transformer achieves 63.2% cross-subject accuracy on NTU-120 but drops to 1.6% under zero-shot transfer to the Gym domain and 1.16% on UCF101. Critically, we demonstrate that high Out-Of-Distribution (OOD) detection AUROC does not guarantee safe selective classification. Standard uncertainty methods fail to detect this performance drop: the model remains confidently incorrect with 99.6% risk even at 50% coverage across both OOD datasets. While energy-based scoring (AUROC >= 0.91) and Mahalanobis distance provide reliable distributional detection signals, such high AUROC scores coexist with poor risk-coverage behavior when making decisions. A lightweight finetuned gating mechanism restores calibration and enables graceful abstention, substantially reducing the rate of confident wrong predictions. Our work challenges standard deployment assumptions, providing a principled safety analysis of both semantic and geometric skeleton recognition deployment.
[470] Grounding World Simulation Models in a Real-World Metropolis
Junyoung Seo,Hyunwook Choi,Minkyung Kwon,Jinhyeok Choi,Siyoon Jin,Gayoung Lee,Junho Kim,JoungBin Lee,Geonmo Gu,Dongyoon Han,Sangdoo Yun,Seungryong Kim,Jin-Hwa Kim
Main category: cs.CV
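TL;DR: This paper presents Seoul World Model (SWM), a video generation model for urban scenes grounded in a real city (Seoul). Using retrieval-augmented autoregressive video generation combined with cross-temporal pairing, a large-scale synthetic dataset, and a Virtual Lookahead Sink, it addresses temporal misalignment, limited trajectory diversity, and data sparsity, and generates spatially faithful, temporally consistent, long-horizon, and diverse real-world videos across multiple cities.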
Details
Motivation: Existing generative world models only synthesize visually plausible but artificial environments and cannot accurately model real cities; video generation for real cities faces temporal misalignment, limited trajectory diversity, and sparse street-view imagery. Method: Proposes Seoul World Model (SWM), a retrieval-augmented autoregressive video generation framework, with a cross-temporal pairing strategy, a large-scale synthetic dataset with diverse camera trajectories, a view interpolation pipeline driven by street-view images, and a Virtual Lookahead Sink for long-horizon stability. Result: Across three cities (Seoul, Busan, and Ann Arbor), SWM outperforms existing video world models in spatial fidelity, temporal consistency, long-distance trajectory generation (hundreds of meters), camera-motion diversity, and text-prompted scenario control. Conclusion: SWM achieves large-scale, long-horizon, high-fidelity video world modeling grounded in a real city, demonstrating the feasibility and benefit of tightly coupling generative models with real geographic data. Abstract: What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
[471] Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery
Timing Yang,Sicheng He,Hongyi Jing,Jiawei Yang,Zhijian Liu,Chuhang Zou,Yue Wang
Main category: cs.CV
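TL;DR: This paper presents Fast SAM 3D Body, a training-free acceleration framework that, by decoupling spatial dependencies, applying architecture-aware pruning, and using a direct feedforward mapping, speeds up monocular 3D human mesh recovery by up to 10.9x while matching or even surpassing the original model's accuracy, and is deployed in a vision-only, real-time humanoid teleoperation system.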
Details
Motivation: SAM 3D Body leads in accuracy, but its inference takes several seconds per image, which rules out real-time use. Method: Decouples serial spatial dependencies and applies architecture-aware pruning to enable parallelized multi-crop feature extraction and streamlined transformer decoding; replaces iterative mesh fitting with a direct feedforward mapping to produce SMPL joint-level kinematics quickly. Result: Up to a 10.9x end-to-end speedup, with reconstruction accuracy on par with or surpassing the original model on benchmarks such as LSPET; deployed in a real-time, vision-only teleoperation system driven by a single RGB stream. Conclusion: Fast SAM 3D Body substantially increases inference speed without sacrificing accuracy, offering an efficient and practical solution for real-time 3D human pose estimation and downstream robot control. Abstract: SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that, unlike methods reliant on wearable IMUs, enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
[472] HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
Yukang Cao,Haozhe Xie,Fangzhou Hong,Long Zhuo,Zhaoxi Chen,Liang Pan,Ziwei Liu
Main category: cs.CV
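TL;DR: This paper presents HSImul3R, a framework that uses physically grounded bi-directional optimization (combining Scene-targeted Reinforcement Learning with Direct Simulation Reward Optimization) to produce stable, simulation-ready 3D reconstructions of human-scene interactions from sparse-view images or monocular videos. It also introduces a new benchmark, HSIBench, and shows that the reconstructions can be deployed to real humanoid robots.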
Details
Motivation: Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, causing instability in physics engines and failures in embodied AI applications. Method: Proposes a physically grounded bi-directional optimization pipeline: in the forward direction, Scene-targeted Reinforcement Learning optimizes human motion for both motion fidelity and contact stability; in the reverse direction, Direct Simulation Reward Optimization uses simulation feedback such as gravitational stability and interaction success to refine scene geometry. Result: Produces the first stable, simulation-ready 3D reconstructions of human-scene interactions, which can be deployed directly to real humanoid robots; effectiveness is validated on the new HSIBench benchmark. Conclusion: HSImul3R bridges the gap between perception and simulation, providing a reliable, physically consistent basis for 3D interaction reconstruction in embodied AI. Abstract: We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.
[473] Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion
Zhenghong Zhou,Xiaohang Zhan,Zhiqin Chen,Soo Ye Kim,Nanxuan Zhao,Haitian Zheng,Qing Liu,He Zhang,Zhe Lin,Yuqian Zhou,Jiebo Luo
Main category: cs.CV
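TL;DR: This paper presents Tri-Prompting, a framework that jointly addresses three challenges in video generation: scene composition, multi-view subject consistency, and motion control. With a dual-condition motion module and a ControlNet scale schedule, it balances controllability and realism, and outperforms baselines such as Phantom and DaS in multi-view identity preservation, 3D consistency, and motion accuracy.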
Details
Motivation: Existing video diffusion models achieve high visual quality, but fine-grained control (scene composition, multi-view subject customization, camera and object motion adjustment) remains a bottleneck, and no unified architecture supports joint control. Method: Proposes Tri-Prompting, a unified framework with two-stage training, comprising a dual-condition motion module (3D tracking points drive the background, downsampled RGB cues drive the foreground) and a ControlNet scale schedule applied at inference. Result: Significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity preservation, 3D consistency, and motion accuracy; supports 3D-aware subject insertion into arbitrary scenes and manipulation of subjects already present in an image. Conclusion: Tri-Prompting models the three key control dimensions of video generation in a unified way, making AI video creation more flexible and practical and offering a new paradigm for controllable video generation. Abstract: Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scene and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
[474] GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
Xincheng Shuai,Ziye Li,Henghui Ding,Dacheng Tao
Main category: cs.CV
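TL;DR: This paper presents GlyphPrinter, a preference-optimization-based text rendering method that improves glyph accuracy through region-level preference annotations and Region-Grouped DPO (R-GDPO), and introduces a Regional Reward Guidance inference strategy, substantially improving rendering accuracy for complex or out-of-domain characters while preserving stylization.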
Details
Motivation: Existing text rendering methods are limited by insufficient coverage of glyph variations and by excessive stylization, which leads to poor glyph accuracy for complex or out-of-domain characters; reinforcement learning methods rely on text recognition systems that are insensitive to fine-grained glyph errors as reward models, so they are hard to optimize effectively. Method: Proposes GlyphPrinter, a framework inspired by Direct Preference Optimization (DPO); builds the GlyphCorrector dataset with region-level glyph preference annotations; designs the Region-Grouped DPO (R-GDPO) objective, which jointly optimizes inter-sample and intra-sample region preferences; introduces the Regional Reward Guidance inference strategy. Result: Across experiments, GlyphPrinter outperforms existing methods in glyph accuracy while keeping a favorable balance between stylization and precision. Conclusion: Region-level preference modeling, combined with optimization that needs no explicit reward model, improves glyph accuracy in visual text rendering more effectively and offers a new approach to high-quality glyph generation. Abstract: Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy.
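For reference, the standard DPO objective that the abstract contrasts against is commonly written as below. This is the usual sample-level form, not the paper's R-GDPO objective, whose exact definition the abstract does not give. The symbols (policy $\pi_\theta$, reference model $\pi_{\mathrm{ref}}$, preferred and rejected samples $y_w$ and $y_l$, scaling factor $\beta$) follow common DPO notation and are assumptions with respect to this paper.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Because this loss compares whole samples, a glyph error confined to one region enters only through the overall likelihood ratio, which is the limitation the region-level objective is meant to address.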
Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
[475] Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
Yulin Luo,Hao Chen,Zhuangzhe Wu,Bowen Sui,Jiaming Liu,Chenyang Gu,Zhuoyang Liu,Qiuxuan Feng,Jiale Yu,Shuo Gu,Peng Jia,Pheng-Ann Heng,Shanghang Zhang
Main category: cs.CV
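TL;DR: This paper presents the DeepVision-VLA framework, which strengthens the visual representations of VLA models through a Vision-Language Mixture-of-Transformers (VL-MoT) and Action-Guided Visual Pruning (AGVP), improving over prior methods by 9.0% on simulated tasks and 7.5% on real-world tasks.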
Details
Motivation: Most existing VLA models treat the LLM as a black box, giving little insight into how visual information is used for action generation; the authors find that sensitivity to visual tokens decreases as network depth increases, which motivates the proposed improvements. Method: Proposes the DeepVision-VLA framework with two core components: (1) VL-MoT, which enables shared attention between the vision foundation model and the VLA backbone and injects multi-level visual features into deeper layers; (2) AGVP, which uses shallow-layer attention to prune irrelevant visual tokens while keeping task-relevant visual cues. Result: Outperforms prior state-of-the-art methods by 9.0% on simulated tasks and 7.5% on real-world robot tasks. Conclusion: Fusing visual features into deep layers and refining visual tokens in a task-oriented way are both important for accurate action prediction in VLA models, and together they offer a new design pattern for more robust and precise visually enhanced VLA models. Abstract: Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead.
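The pruning idea described above, keeping the visual tokens that receive the most attention and dropping the rest, can be sketched as a simple top-k selection. This is an illustrative assumption, not the paper's AGVP procedure: the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the use of a single per-token attention score are all hypothetical.

```python
import math
from typing import List, Sequence


def prune_visual_tokens(attn: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Return indices of visual tokens to keep, in their original order.

    `attn[i]` is assumed to be the attention mass that token i receives
    in a shallow layer (a hypothetical stand-in for the paper's signal).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(attn)
    if n == 0:
        return []
    k = max(1, math.ceil(n * keep_ratio))
    # Rank token indices by attention score, highest first, and keep the top k.
    top = sorted(range(n), key=lambda i: attn[i], reverse=True)[:k]
    # Restore positional order so downstream layers see tokens in sequence.
    return sorted(top)


scores = [0.05, 0.30, 0.12, 0.40, 0.03, 0.10]
kept = prune_visual_tokens(scores, keep_ratio=0.5)
```

Returning indices rather than token values keeps the sketch independent of any tensor library; in practice the indices would be used to gather rows from the visual token matrix.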
DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
[476] Towards Generalizable Robotic Manipulation in Dynamic Environments
Heng Fang,Shangru Li,Shuhan Wang,Xuanyang Xi,Dingkang Liang,Xiang Bai
Main category: cs.CV
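TL;DR: This paper presents DOMINO, a dataset and benchmark for dynamic manipulation, and PUMA, a dynamics-aware vision-language-action model, which substantially improve task success rates in dynamic environments and show that training on dynamic data yields transfer gains on static tasks.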