Table of Contents
cs.CL
[1] GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models
Lipeng Wan,Junjie Ma,Jianhui Gu,Zeyang Liu,Xuyang Lu,Xuguang Lan
Main category: cs.CL
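TL;DR: This paper proposes GeoBlock, a geometry-aware block inference framework that sets block size dynamically from attention-derived dependency geometry, improving the parallel decoding efficiency and accuracy of diffusion language models.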
Details
Motivation: Existing block-sizing strategies rely on fixed rules or heuristic signals and do not account for the dependency geometry that determines whether tokens can be safely refined in parallel. Method: GeoBlock analyzes cross-token dependency patterns to identify geometrically stable refinement regions and sets block boundaries dynamically during decoding; it requires no additional training and integrates into existing block diffusion architectures. Result: Extensive experiments on multiple benchmarks show that GeoBlock reliably identifies geometry-consistent block boundaries and improves block diffusion accuracy at a small additional computational cost. Conclusion: By adapting block granularity to the dependency geometry, GeoBlock keeps the parallel efficiency of block diffusion while achieving dependency-consistent refinement with autoregressive-level reliability. Abstract: Block diffusion enables efficient parallel refinement in diffusion language models, but its decoding behavior depends critically on block size. Existing block-sizing strategies rely on fixed rules or heuristic signals and do not account for the dependency geometry that determines which tokens can be safely refined together. This motivates a geometry view of diffusion decoding: regions with strong causal ordering require sequential updates, whereas semantically cohesive regions admit parallel refinement. We introduce GeoBlock, a geometry-aware block inference framework that determines block granularity directly from attention-derived dependency geometry. Instead of relying on predefined schedules or local confidence heuristics, GeoBlock analyzes cross-token dependency patterns to identify geometrically stable refinement regions and dynamically determines appropriate block boundaries during decoding. By adapting block granularity to the dependency geometry, GeoBlock preserves the parallel efficiency of block diffusion while enforcing dependency-consistent refinement that exhibits autoregressive reliability. GeoBlock requires no additional training and integrates seamlessly into existing block diffusion architectures. Extensive experiments across multiple benchmarks show that GeoBlock reliably identifies geometry-consistent block boundaries and improves the accuracy of block diffusion with only a small additional computational budget.
[2] AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
Jianfei Xiao,Xiang Yu,Chengbing Wang,Wuqiang Zheng,Xinyu Lin,Kaining Liu,Hongxun Ding,Yang Zhang,Wenjie Wang,Fuli Feng,Xiangnan He
Main category: cs.CL
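TL;DR: This paper introduces AlpsBench, a personalization benchmark built from real human-LLM dialogues that covers four tasks (information extraction, updating, retrieval, and utilization) and exposes key bottlenecks of current models at each stage of personalized memory management.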
Details
Motivation: Existing personalization benchmarks either overlook the personalized information management that personalization depends on or rely on synthetic dialogues, which differ in distribution from real dialogues; a gold-standard benchmark is missing. Method: Builds AlpsBench from real WildChat dialogues, with 2,500 long-term interaction sequences paired with human-verified structured memories; defines four personalized memory management tasks and establishes a full-lifecycle evaluation protocol. Result: Frontier LLMs and memory-centric systems show clear limitations: unreliable extraction of implicit traits, a performance ceiling on memory updating, sharply lower retrieval accuracy under large distractor pools, and explicit memory that does not by itself yield more preference-aligned or emotionally resonant responses. Conclusion: AlpsBench gives LLM personalization research a first comprehensive evaluation framework grounded in real dialogues and covering the full memory lifecycle; it exposes key weaknesses of current techniques and supports more robust, trustworthy personalized AI. Abstract: As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.
[3] Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages
Swastik R
Main category: cs.CL
TL;DR: This paper presents the first cross-lingual visual reasoning audit for Indian languages, revealing significant performance drops (9.8–25 pp) for vision-language models when moving from English to Indian languages—especially Dravidian ones—and shows chain-of-thought prompting can harm performance in some languages; it releases a new multilingual benchmark and model outputs.
Details
Motivation: Existing vision-language model evaluations are overwhelmingly English-centric, lacking assessment on diverse Indian languages despite their linguistic and script diversity; this work addresses the gap by auditing VLMs across six major Indian languages. Method: Translated 980 questions from MathVista, ScienceQA, and MMMU into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, verified by Gemini 2.0 Flash; evaluated eight VLMs (7B open-source to GPT-4o) with text-only and chain-of-thought ablations, generating 68,600 inference records. Result: Accuracy drops 9.8–25 percentage points from English to Indian languages; Dravidian languages suffer up to 13.2 pp more than Indo-Aryan; chain-of-thought harms Bengali (−14.4 pp) and Kannada (−11.4 pp); Aya-Vision-8B still drops 28.5 pp on Dravidian scripts. Conclusion: Multilingual pretraining alone does not ensure robust cross-lingual visual reasoning; English-centric reasoning chains and script/language-specific challenges hinder performance, highlighting the need for language- and script-aware VLM development. Abstract: Vision-language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross-lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross-verification on 50 samples per language (inter-translator agreement 0.79-0.84). Eight VLMs, from 7B open-source models to GPT-4o, are evaluated across all seven languages, yielding 68,600 inference records that include text-only and chain-of-thought ablations. I find accuracy drops of 9.8-25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo-Aryan. 
Chain-of-thought prompting degrades Bengali (-14.4 pp) and Kannada (-11.4 pp) rather than helping, exposing English-centric reasoning chains. Aya-Vision-8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.
[4] LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models
Shaik Aman
Main category: cs.CL
TL;DR: LogicDiff is an inference-time method that improves masked diffusion language models' reasoning by guiding token unmasking based on logical roles instead of confidence, significantly boosting accuracy on GSM8K and MATH-500 without modifying the base model.
Details
Motivation: Standard confidence-based unmasking in MDLMs delays high-entropy logical connectives—key branching points in reasoning—causing poor reasoning performance. Method: LogicDiff adds a lightweight classifier (4.2M params) to predict token logical roles (premise, connective, etc.) from hidden states, then uses a dependency-ordered scheduler to unmask tokens in logical sequence: premises → connectives → derived steps → conclusions. Result: Improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 pp) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with <6% speed overhead. Conclusion: The main reasoning deficit in MDLMs stems from suboptimal unmasking order—not flawed representations—so logic-guided scheduling alone yields large gains. Abstract: Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence, offering parallel generation and bidirectional context. However, their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, the critical branching points in reasoning chains, leading to severely degraded reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the logical role of each masked position (premise, connective, derived step, conclusion, or filler) from the base model's hidden states with 98.4% accuracy. A dependency-ordered scheduler then unmasks tokens in logical dependency order: premises first, then connectives, then derived steps, then conclusions. Without modifying a single parameter of the base model and without any reinforcement learning or task-specific training, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead. 
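The dependency-ordered unmasking described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MaskedPosition` type, the static role and confidence values, the placement of "filler" last, and the confidence tie-break within a role are all assumptions made for illustration. In a real masked diffusion model, roles and confidences would be re-predicted after every step.

```python
from dataclasses import dataclass

# Priority order taken from the abstract: premises, connectives, derived
# steps, conclusions. Putting "filler" last is an assumption of this sketch.
ROLE_PRIORITY = {"premise": 0, "connective": 1, "derived": 2,
                 "conclusion": 3, "filler": 4}


@dataclass
class MaskedPosition:
    index: int         # position in the sequence
    role: str          # predicted logical role (hypothetical classifier output)
    confidence: float  # base-model confidence for this position (hypothetical)


def next_to_unmask(masked, k):
    """Pick up to k positions from the highest-priority role still masked."""
    if not masked:
        return []
    best = min(ROLE_PRIORITY[p.role] for p in masked)
    candidates = [p for p in masked if ROLE_PRIORITY[p.role] == best]
    # Within one role, fall back to confidence (assumed tie-break).
    candidates.sort(key=lambda p: -p.confidence)
    return [p.index for p in candidates[:k]]


def unmask_order(masked, k=1):
    """Simulate the full schedule with static roles and confidences."""
    remaining = list(masked)
    order = []
    while remaining:
        chosen = next_to_unmask(remaining, k)
        order.extend(chosen)
        remaining = [p for p in remaining if p.index not in chosen]
    return order


positions = [
    MaskedPosition(0, "conclusion", 0.99),
    MaskedPosition(1, "premise", 0.40),
    MaskedPosition(2, "connective", 0.20),
    MaskedPosition(3, "premise", 0.90),
    MaskedPosition(4, "derived", 0.70),
]
print(unmask_order(positions))  # [3, 1, 2, 4, 0]
```

Note that the high-confidence conclusion at index 0 is unmasked last, whereas a purely confidence-based schedule would have committed to it first.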
Our results demonstrate that a substantial portion of the reasoning deficit in MDLMs is attributable to suboptimal token unmasking order, not to limitations of the model's learned representations.
[5] Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval
Zhiyuan Cheng,Longying Lai,Yue Liu
Main category: cs.CL
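TL;DR: This paper proposes Hybrid Document-Routed Retrieval (HDRR), which combines Semantic File Routing (SFR) with chunk-based retrieval (CBR) to improve both robustness and precision in financial document question answering.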
Details
Motivation: Chunk-based retrieval (CBR) is prone to cross-document chunk confusion in structurally homogeneous financial regulatory filings, while Semantic File Routing (SFR) improves robustness at the cost of precision, producing a robustness-precision trade-off. Method: Proposes the two-stage Hybrid Document-Routed Retrieval (HDRR): stage one uses SFR to route the query to the relevant whole documents, and stage two runs fine-grained chunk-level retrieval within the selected documents. Result: On the FinDER benchmark, HDRR reaches an average score of 7.54 (25.2% above CBR, 16.9% above SFR), a failure rate of 6.4%, a correctness rate of 67.7%, and a perfect-answer rate of 20.1%, outperforming both CBR and SFR on every metric. Conclusion: HDRR resolves the robustness-precision trade-off in financial document question answering, achieving the lowest failure rate and the highest precision in all five experimental groups. Abstract: Retrieval-Augmented Generation (RAG) systems for financial document question answering typically follow a chunk-based paradigm: documents are split into fragments, embedded into vector space, and retrieved via similarity search. While effective in general settings, this approach suffers from cross-document chunk confusion in structurally homogeneous corpora such as regulatory filings. Semantic File Routing (SFR), which uses LLM structured output to route queries to whole documents, reduces catastrophic failures but sacrifices the precision of targeted chunk retrieval. We identify this robustness-precision trade-off through controlled evaluation on the FinDER benchmark (1,500 queries across five groups): SFR achieves higher average scores (6.45 vs. 6.02) and fewer failures (10.3% vs. 22.5%), while chunk-based retrieval (CBR) yields more perfect answers (13.8% vs. 8.5%). To resolve this trade-off, we propose Hybrid Document-Routed Retrieval (HDRR), a two-stage architecture that uses SFR as a document filter followed by chunk-based retrieval scoped to the identified document(s). HDRR eliminates cross-document confusion while preserving targeted chunk precision. Experimental results demonstrate that HDRR achieves the best performance on every metric: an average score of 7.54 (25.2% above CBR, 16.9% above SFR), a failure rate of only 6.4%, a correctness rate of 67.7% (+18.7 pp over CBR), and a perfect-answer rate of 20.1% (+6.3 pp over CBR, +11.6 pp over SFR).
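The two-stage idea above (route to a document first, then retrieve chunks only within it) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the paper's system: the paper routes with LLM structured output and retrieves with vector similarity, whereas this sketch substitutes a simple token-overlap score for both stages. All function names, document ids, and sample texts are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (toy tokenizer)."""
    return set(text.lower().split())


def overlap(query, text):
    """Toy relevance score: number of shared tokens. Stands in for the
    LLM router and the embedding similarity used in a real system."""
    return len(tokens(query) & tokens(text))


def route_documents(query, summaries, top_docs=1):
    """Stage 1: choose the documents whose summary best matches the query."""
    ranked = sorted(summaries, key=lambda doc_id: -overlap(query, summaries[doc_id]))
    return ranked[:top_docs]


def retrieve_chunks(query, chunks, doc_ids, top_k=2):
    """Stage 2: rank chunks, restricted to the routed documents."""
    scoped = [(doc_id, text) for doc_id, text in chunks if doc_id in doc_ids]
    scoped.sort(key=lambda item: -overlap(query, item[1]))
    return scoped[:top_k]


def hdrr(query, summaries, chunks, top_docs=1, top_k=2):
    """Document routing followed by scoped chunk retrieval."""
    doc_ids = route_documents(query, summaries, top_docs)
    return retrieve_chunks(query, chunks, doc_ids, top_k)


summaries = {
    "acme_10k": "Acme Corp annual report 2023",
    "globex_10k": "Globex Inc annual report 2023",
}
chunks = [
    ("globex_10k", "total revenue was 9 million"),
    ("acme_10k", "total revenue was 5 million"),
    ("acme_10k", "risk factors include supply chain"),
]
query = "what was Acme total revenue"
print(hdrr(query, summaries, chunks, top_k=1))
```

In this toy corpus the two revenue chunks score identically against the query, so unscoped chunk retrieval cannot tell the Acme filing from the Globex one; scoping stage two to the routed document removes that ambiguity.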
HDRR resolves the trade-off by simultaneously achieving the lowest failure rate and the highest precision across all five experimental groups.
[6] Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs
Seine A. Shintani
Main category: cs.CL
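TL;DR: Using a controlled minimal GPT trained on 2-digit addition, this paper decomposes arithmetic out-of-distribution (OOD) failure into four stages (a layout barrier, carry-semantics errors, a conditional recomposition bottleneck, and late tens residuals) and validates a repair strategy for each stage.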
Details
Motivation: Arithmetic benchmarks are often reduced to a single held-out score, which can conflate qualitatively different failure modes; the paper asks why a model fails to generalize to 3-digit addition even when all local digit transitions already appear in training. Method: Uses a minimal GPT trained on exhaustive 2-digit addition and runs controlled experiments on the staged causes of 3-digit generalization failure, covering layout shifts, carry probes, conditional recomposition comparisons, and a late-stage tens repair. Result: Failure splits into four identifiable, intervenable stages: (1) the absolute-position layout collapses; (2) the hundreds position is treated as a carry flag rather than a semantic digit; (3) high-conditioned tail data markedly improves recomposition; (4) residual errors concentrate in the tens digit, and a sign-aware tens repair raises exact match on the hardest thousands-carry suite from 0.664 to 0.822. Conclusion: Provides an experimentally testable four-stage decomposition of arithmetic OOD failure (layout, carry semantics, recomposition, late tens residuals), offering a structured path for understanding and improving numerical reasoning in language models. Abstract: Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in training, and ask why 3-digit generalization still fails. The failure is staged. First, there is a layout barrier: a learned absolute-position model collapses under a pure 3-digit layout shift, and mixed-layout exposure is the only intervention that materially weakens this barrier. Second, after layout repair, the hundreds position behaves like a carry flag rather than a semantic hundreds digit; targeted carry probes reverse the relevant logit margin, whereas a matched extra-data control does not. Third, after carry repair, the main remaining bottleneck is conditional recomposition: high-conditioned tail data outperforms a matched control, high-only data, and tail-only data on all true-3-digit suites, and the same ordering reappears in a larger 2-layer bridge experiment. The residual errors after recomposition are then overwhelmingly tens-only, and a separate 10-seed late-stage study shows that a sign-aware tens repair raises exact match on the hardest thousands-carry suite from 0.664 to 0.822. We therefore provide an experimentally testable decomposition of arithmetic OOD failure into layout, carry-semantics, recomposition, and late tens-residual stages.
[7] Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation
Lorca McLaren,James Cross,Zuzanna Krakowska,Robin Rauner,Martijn Schoonvelde
Main category: cs.CL
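TL;DR: This paper runs a controlled evaluation of implementation choices for LLM-based text annotation in political science. It finds strong interaction effects among model choice, learning approach, and prompt engineering, with no universally best configuration; a non-monotonic relationship between model size and performance or cost; and unstable effects from common prompt engineering techniques. It proposes a validation-first annotation framework and practical guidance.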
Details
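Motivation: Political scientists are rapidly adopting LLMs for text annotation, but how sensitive the results are to implementation choices (model, prompt, learning approach, and so on) is unclear; existing evaluations mostly test a single configuration and lack systematic controlled comparison. Method: Under identical quantisation, hardware, and prompt-template conditions, runs controlled experiments with six open-weight LLMs on four political science annotation tasks, analyzes the main and interaction effects of model choice, model size, learning approach, and prompt style, and builds a validation-first annotation framework from the results. Result: Interaction effects dominate main effects; no model or prompt strategy is globally best; model size has no monotonic relationship with performance or resource use; some prompt engineering techniques reduce performance; the proposed validation-first framework covers decision ordering, prompt freezing, held-out evaluation, reporting standards, and open-source tools. Conclusion: LLM annotation in political science should abandon one-size-fits-all best practices in favor of transparent, reproducible, validation-centered workflows; researchers should treat pipeline choices as consequential degrees of freedom and constrain them through task-specific evaluation. Abstract: Political scientists are rapidly adopting large language models (LLMs) for text annotation, yet the sensitivity of annotation results to implementation choices remains poorly understood. Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored. We present a controlled evaluation of these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions. Our central finding is methodological: interaction effects dominate main effects, so seemingly reasonable pipeline choices can become consequential researcher degrees of freedom. No single model, prompt style, or learning approach is uniformly superior, and the best-performing model varies across tasks. Two corollaries follow. First, model size is an unreliable guide both to cost and to performance: cross-family efficiency differences are so large that some larger models are less resource-intensive than much smaller alternatives, while within model families mid-range variants often match or exceed larger counterparts. Second, widely recommended prompt engineering techniques yield inconsistent and sometimes negative effects on annotation performance.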
We use these benchmark results to develop a validation-first framework - with a principled ordering of pipeline decisions, guidance on prompt freezing and held-out evaluation, reporting standards, and open-source tools - to help researchers navigate this decision space transparently.
[8] A large corpus of lucid and non-lucid dream reports
Remington Mallett
Main category: cs.CL
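TL;DR: This paper builds a large corpus of 55,000 dream reports, 10,000 of them labeled as lucid, and validates through language patterns that the labels match known characteristics of lucid dreams, giving dream science, and lucid dream research in particular, a reliable data foundation.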
Details
Motivation: Lucid dreams are hard to study because they are rare and resist deliberate induction, so large, high-quality corpora of lucid dream reports are scarce, which limits applications and theory. Method: Scrapes ten years of anonymously posted dream reports (55,000 in total) from a public online forum, uses the users' own optional labels (lucid, non-lucid, nightmare) as annotations, and checks label reliability with descriptive statistics, visualizations, and construct validation. Result: Produces and validates a large real-world dream corpus with 10k lucid, 25k non-lucid, and 2k nightmare labels; language analysis confirms that lucid-labeled reports have textual features consistent with known lucid dream phenomenology. Conclusion: The corpus is a solid, extensible, and initially validated data resource for dream science, especially for research on lucid dream mechanisms, detection, and intervention. Abstract: All varieties of dreaming remain a mystery. Lucid dreams in particular, or those characterized by awareness of the dream, are notoriously difficult to study. Their scarce prevalence and resistance to deliberate induction make it difficult to obtain a sizeable corpus of lucid dream reports. The consequent lack of clarity around lucid dream phenomenology has left the many purported applications of lucidity under-realized. Here, a large corpus of 55k dream reports from 5k contributors is curated, described, and validated for future research. Ten years of publicly available dream reports were scraped from an online forum where users share anonymous dream journals. Importantly, users optionally categorize their dream as lucid, non-lucid, or a nightmare, offering a user-provided labeling system that includes 10k lucid, 25k non-lucid, and 2k nightmare labels. After characterizing the corpus with descriptive statistics and visualizations, construct validation shows that language patterns in lucid-labeled reports are consistent with known characteristics of lucid dreams. While the entire corpus has broad value for dream science, the labeled subset is particularly powerful for new discoveries in lucid dream studies.
[9] The Last Fingerprint: How Markdown Training Shapes LLM Prose
E. M. Freeburg
Main category: cs.CL
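TL;DR: This paper proposes that the high em dash rate in large language models is markdown formatting from the training data leaking into prose, tests the hypothesis experimentally, and finds that em dash frequency differs across models in ways closely tied to how they were fine-tuned.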
Details
Motivation: LLMs are observed to overuse em dashes and to default to markdown-formatted output, but no mechanistic explanation exists; the paper aims to connect the two observations and explain their cause. Method: Hypothesizes that the em dash is the smallest surviving unit of markdown structure in natural prose, lays out a five-step genealogy, and tests it with suppression experiments (instructions to avoid markdown, explicit em dash prohibition, and a base-vs-instruct comparison) on 12 models from 5 providers. Result: When markdown is prohibited, other formatting features disappear but em dashes persist (except in Llama, which produces none at all); em dash frequency ranges from 0.0 to 9.1 per 1,000 words and can serve as a fingerprint of the fine-tuning method; some models cannot eliminate em dashes even under explicit prohibition; the tendency exists before RLHF. Conclusion: Em dash frequency is a diagnostic of fine-tuning methodology rather than a stylistic defect, and the account unifies two online discussions, about em dash overuse in AI text and about the markdown output default, as two expressions of one mechanism. Abstract: Large language models produce em dashes at varying rates, and the observation that some models "overuse" them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose -- the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist -- except in Meta's Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT-4.1 under suppression), functioning as a signature of the specific fine-tuning procedure applied. A three-condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base-vs-instruct comparison confirms that the latent tendency exists pre-RLHF.
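The "per 1,000 words" rate quoted above is easy to reproduce on any text sample. The following is a minimal Python sketch under stated assumptions: words are counted by whitespace splitting and only the U+2014 character is treated as an em dash. The paper's exact counting rules are not given in this abstract, so results on real model output may differ from its figures.

```python
EM_DASH = "\u2014"  # U+2014, written as an escape to keep the source ASCII


def em_dashes_per_1000_words(text):
    """Em dash count normalised to a rate per 1,000 whitespace-separated words.

    Assumption: a dash attached to a word (no surrounding spaces) does not
    split that word, so it counts as one word containing one dash.
    """
    words = text.split()
    if not words:
        return 0.0
    return 1000.0 * text.count(EM_DASH) / len(words)


# 1,000 words containing a single em dash give a rate of exactly 1.0.
sample = ("word " * 999) + "end" + EM_DASH
print(em_dashes_per_1000_words(sample))  # 1.0
```

Comparing this rate for the same model with and without a "no markdown" instruction gives a small-scale version of the suppression comparison the abstract describes.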
These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine-tuning methodology rather than a stylistic defect.
[10] RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
Rahul Soni
Main category: cs.CL
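TL;DR: This paper proposes RASPRef, a retrieval-augmented self-supervised prompt refinement framework that automatically improves prompts for reasoning language models without human annotations or task-specific supervision, improving performance on tasks such as mathematical reasoning.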
Details
Motivation: Reasoning language models are highly sensitive to prompt design, and hand-crafting prompts is slow, laborious, and hard to generalize across tasks and domains. Method: Proposes RASPRef, which retrieves relevant examples and past reasoning trajectories and uses self-supervised signals (multi-sample consistency, verifier feedback, and model-generated critiques) to iteratively refine the prompt itself. Result: On GSM8K-style mathematical reasoning tasks, RASPRef improves over a static prompting baseline; the paper also discusses how retrieval quality, trajectory selection, and feedback signals affect refinement. Conclusion: Prompt design remains a key factor in the performance of reasoning language models, and self-improving prompts are a practical, scalable strategy. Abstract: Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks. However, their performance remains highly sensitive to prompt formulation, and designing effective prompts is typically a manual and iterative process that does not scale well across tasks or domains. To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision. The approach retrieves relevant examples and previously generated reasoning trajectories, and leverages signals such as multi-sample consistency, verifier feedback, and model-generated critiques to iteratively refine the prompt. Unlike prior approaches that focus primarily on improving model outputs, RASPRef directly treats the prompt as the optimization target and improves it through an iterative retrieval-guided refinement process. Experiments on GSM8K-style mathematical reasoning tasks show that retrieval-guided prompting improves performance compared with a static prompting baseline. We further discuss how retrieval quality, trajectory selection, and self-supervised feedback signals may influence the effectiveness of prompt refinement. These findings suggest that prompt design remains a critical factor for reasoning-oriented language models, and that self-improving prompts offer a practical and scalable strategy for improving reasoning performance.
[11] Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
Hanif Rahman,Shafeeq ur Rehman
Main category: cs.CL
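TL;DR: This paper presents the Pashto Common Voice corpus, the first large-scale, openly licensed Pashto speech resource, which grew substantially through community collaboration between 2022 and 2025. It describes a methodology covering localisation, sentence extraction, character-targeted speech collection, and multi-channel outreach; fine-tuning Whisper reduces word error rate (WER) from 99.0% to 13.4%.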
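For readers unfamiliar with the metric behind the 99.0% and 13.4% figures: word error rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words. Below is a minimal Python sketch of that standard definition. It assumes whitespace tokenization and applies no text normalisation, which published evaluations usually add, so it is not a reproduction of the paper's scoring pipeline.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because insertions count as errors, WER can exceed 100%, so a value near 99% means that almost no reference word was recovered correctly.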
Details
Motivation: Pashto, with over 60 million native speakers, is largely absent from open speech technology; the work fills the gap of a large-scale, open speech corpus for the language. Method: Takes a community-driven approach: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted speech collection for four frequently dropped Pashto characters, and multi-channel community outreach (such as a VOA Pashto broadcast campaign). Result: The MCV23 corpus contains 107,781 clips (60,337 validated, 82.33 validated hours) across 13 domains; Whisper Base fine-tuned on MCV20 reaches 13.4% WER, far below the zero-shot 99.0%. Conclusion: The corpus and its use show that community collaboration and targeted methods are effective for speech technology in low-resource languages and offer a reusable template for similar languages. Abstract: We present the Pashto Common Voice corpus -- the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 yields 13.4% WER on the MCV20 test split, against the published Whisper Base zero-shot WER of 99.0% on Pashto.
[12] TAPS: Task Aware Proposal Distributions for Speculative Sampling
Mohamad Zbib,Mohamad Bazzi,Ammar Mohanna,Hasan Abed Al Kader Hammoud,Bernard Ghanem
Main category: cs.CL
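TL;DR: This paper studies how the training data distribution of the draft model affects speculative decoding. Task-specific training clearly improves performance on the matching task, while mixed-data training improves robustness; confidence-based routing is more effective than naive checkpoint averaging or entropy-based routing and increases acceptance length.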
Details
Motivation: Draft models for speculative decoding are usually trained on generic corpora, and it is unclear whether their performance depends on a training distribution that matches the downstream task. Method: Trains lightweight HASS and EAGLE-2 drafters on MathInstruct, ShareGPT, and mixed data, evaluates them on MT-Bench, GSM8K, MATH-500, and SVAMP, and compares task-specific training, mixed training, and inference-time combination strategies (checkpoint averaging, confidence-based routing, merged-tree verification). Result: Task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning tasks and ShareGPT-trained drafts are strongest on MT-Bench; mixed training improves robustness, but larger mixtures are not always better; confidence routing beats entropy routing and checkpoint averaging, and merged-tree verification gives the highest acceptance length. Conclusion: Speculative decoding quality depends not only on draft architecture but also on how well the draft training data matches the downstream task; combining specialized drafters at inference time (for example, by confidence routing) works better than merging them in weight space. Abstract: Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions.
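The routing idea above can be illustrated with a small Python sketch that selects, at one decoding step, the drafter whose proposal looks most trustworthy. Several things here are assumptions rather than the paper's method: confidence is defined as the top-1 probability of the next-token distribution, routing is done per step, and the drafter names and probability values are made up. The entropy variant is included only to show that the two signals can disagree.

```python
import math


def confidence(probs):
    """Assumed definition: top-1 probability of a next-token distribution."""
    return max(probs)


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route_by_confidence(proposals):
    """Pick the drafter with the highest top-1 probability.

    `proposals` maps a drafter name to its next-token probability list.
    """
    return max(proposals, key=lambda name: confidence(proposals[name]))


def route_by_entropy(proposals):
    """Alternative signal: pick the drafter with the lowest entropy."""
    return min(proposals, key=lambda name: entropy(proposals[name]))


# Hypothetical next-token distributions from two specialised drafters.
proposals = {
    "math_drafter": [0.7, 0.2, 0.1],
    "chat_drafter": [0.5, 0.5, 0.0],
}
print(route_by_confidence(proposals))  # math_drafter
print(route_by_entropy(proposals))     # chat_drafter
```

Here the math drafter has the more likely single token (0.7), while the chat drafter has lower entropy because its mass sits on only two tokens, so the two routers choose differently on the same input.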
These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
[13] Introducing MELI: the Mandarin-English Language Interview Corpus
Suyuan Liu,Molly Babel
Main category: cs.CL
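TL;DR: This paper introduces the Mandarin-English Language Interview (MELI) Corpus, an open-source resource of 29.8 hours of speech from 51 Mandarin-English bilinguals, covering read and spontaneous interview styles, with detailed transcription, alignment, and metadata that support research on acoustics, language attitudes, and code-switching.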
Details
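Motivation: To build a high-quality open-source bilingual corpus that supports Mandarin-English speech comparison, language attitude analysis, and code-switching research, addressing the limits of existing resources in matched contexts, cross-language comparability, and sociolinguistic coverage. Method: Records matched Mandarin and English sessions from 51 bilingual speakers (read sentences plus spontaneous interviews about language varieties, standardness, and learning experiences) at 44.1 kHz, 16-bit, stereo; fully transcribes, force-aligns at phone and word levels, and anonymizes the recordings; reports token/type statistics and code-switching distribution; designs the metadata to link acoustic comparison with stated attitudes. Result: The MELI Corpus contains about 14.7 hours of Mandarin and 15.1 hours of English speech with full transcription and alignment, documents code-switching differences (more frequent in Mandarin sessions), and integrates language-attitude metadata; all data and documentation will be released under a CC BY-NC 4.0 license. Conclusion: MELI gives cross-language empirical research on bilingual speech, sociophonetics, and language attitudes a new resource that combines ecological validity with methodological rigor and supports combined quantitative and qualitative analysis. Abstract: We introduce the Mandarin-English Language Interview (MELI) Corpus, an open-source resource of 29.8 hours of speech from 51 Mandarin-English bilingual speakers. MELI combines matched sessions in Mandarin and English with two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. Audio was recorded at 44.1 kHz (16-bit, stereo). Interviews were fully transcribed, force-aligned at word and phone levels, and anonymized. Descriptively, the Mandarin component totals ~14.7 hours (mean duration 17.3 minutes) and the English component ~15.1 hours (mean duration 17.8 minutes). We report token/type statistics for each language and document code-switching patterns (frequent in Mandarin sessions; more limited in English sessions). The corpus design supports within-/cross-speaker, within/cross-language acoustic comparison and links acoustics to speakers' stated language attitudes, enabling both quantitative and qualitative analyses. The MELI Corpus will be released with transcriptions, alignments, metadata, scans of labelled maps and documentation under a CC BY-NC 4.0 license.
[14] Text Data Integration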
Md Ataur Rahman,Dimitris Sacharidis,Oscar Romero,Sergi Nadal
Main category: cs.CL
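TL;DR: This chapter discusses why unstructured text data should be brought into data integration systems, the challenges involved, recent progress, and open problems.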
Details
Motivation: Unstructured data such as text holds a great deal of untapped knowledge, but existing data integration systems mainly handle structured data and need to be extended to support text. Method: Makes the case for integrating textual data and surveys its challenges, current techniques, and unsolved problems. Result: Identifies the key challenges of text data integration and outlines the state of research and future directions. Conclusion: Text data integration is an important extension of data engineering and needs further research to achieve unified storage and processing of diverse data. Abstract: Data comes in many forms. From a shallow perspective, it can be viewed as being in either structured (e.g., as a relation, as key-value pairs) or unstructured (e.g., text, image) formats. So far, machines have been fairly good at processing and reasoning over structured data that follows a precise schema. However, the heterogeneity of data poses a significant challenge for how well diverse categories of data can be meaningfully stored and processed. Data Integration, a crucial part of the data engineering pipeline, addresses this by combining disparate data sources and providing unified data access to end-users. Until now, most data integration systems have focused only on combining structured data sources. Nevertheless, unstructured data (a.k.a. free text) also contains a plethora of knowledge waiting to be utilized. Thus, in this chapter, we first make the case for the integration of textual data and then present its challenges, the state of the art, and open problems.
[15] Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning
Hossein Salemi,Jitin Krishnan,Hemant Purohit
Main category: cs.CL
TL;DR: 本文探讨了大语言模型(LLM)在社会语境中进行因果归因(如性格归因与情境归因)的能力与偏差,提出一种基于社会归因理论的提示增强方法,通过在指令中引入用户目标(推断性格归因)和消息上下文(推断情境归因),有效缓解零样本分类任务中的归因偏差,并在灾难领域多语言社交媒体意图与主题检测任务上验证了其有效性。
Details
Motivation: LLMs perform well on reasoning tasks, but ignoring common human attribution mechanisms (dispositional and situational attribution) in social contexts can bias their social reasoning; prior work has not systematically examined whether and how LLMs implicitly use such attributions, and targeted mitigation strategies are lacking. Method: Proposes a scalable prompt-enrichment method that injects two kinds of social-attribution knowledge into the instructions of zero-shot classification tasks: the user's goal, to infer dispositional attribution, and the message context, to infer situational attribution; builds two prompt aids tailored to social media messages and evaluates on multilingual social media data across disaster types. Result: The method improves the performance of three open-source LLMs (Llama3, Mistral, and Gemma) on intent detection and theme detection for disaster-domain social media while reducing their social-attribution bias; the experiments indicate that the strategy is robust across disaster types and languages (such as English and Spanish). Conclusion: LLMs exhibit systematic social-attribution bias, and incorporating attribution theory from social psychology can guide them toward more reasonable modeling of human behavioral intent; embedding domain knowledge such as attribution theory into prompt design is a workable path to fairer and more accurate social reasoning. Abstract: Attribution theory explains how individuals interpret and attribute others' behavior in a social context by employing personal (dispositional) and impersonal (situational) causality. Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts. However, the extent to which LLMs utilize these causal attributions in their reasoning remains underexplored. Although using reasoning paradigms, such as Chain-of-Thought (CoT), has shown promising results in various tasks, ignoring social attribution in reasoning could lead to biased responses by LLMs in social contexts. In this study, we investigate the impact of incorporating a user's goal as knowledge to infer dispositional causality and message context to infer situational causality on LLM performance. To this end, we introduce a scalable method to mitigate such biases by enriching the instruction prompts for LLMs with two prompt aids using social-attribution knowledge, based on the context and goal of a social media message. This method improves the model performance while reducing the social-attribution bias of the LLM in the reasoning on zero-shot classification tasks for behavior analytics applications. We empirically show the benefits of our method across two tasks (intent detection and theme detection on social media in the disaster domain) when considering the variability of disaster types and multiple languages of social media.
Our experiments highlight the biases of three open-source LLMs: Llama3, Mistral, and Gemma, toward social attribution, and show the effectiveness of our mitigation strategies.
[16] Story2Proposal: A Scaffold for Structured Scientific Paper Writing
Zhuoyang Qian,Wei Shi,Xu Lin,Li Ling,Meng Luo,Ziming Wang,Zhiwei Zhang,Tengyue Xu,Gaoge Liu,Zhentao Zhang,Shuo Zhang,Ziqi Wang,Zheng Feng,Yan Luo,Shu Xu,Yongjin Chen,Zhibo Feng,Zhuo Chen,Bruce Yuan,Biao Wu,Harry Wang,Kris Chen
Main category: cs.CL
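TL;DR: This paper proposes Story2Proposal, a contract-based multi-agent framework that turns a research story into a structured manuscript by coordinating several agents (architect, writer, refiner, renderer) around a shared visual contract; it scores higher than the baselines in expert evaluation.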
Details
Motivation: Existing language-model generation pipelines place no constraints during generation on the consistency between narrative logic, experimental evidence, and visual content, which leads to structural drift, missing figures or tables, and cross-section inconsistencies.
Method: Story2Proposal uses contract-driven multi-agent coordination: the agents operate around a continuously updated shared contract that records section structure and visual elements, and a generate-evaluate-adapt loop refines the contract during generation.
Result: On tasks derived from the Jericho research corpus, Story2Proposal reaches an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182); against the structured baseline Fars it averages 5.705 versus 5.197, which the authors read as improved structural consistency and visual alignment.
Conclusion: A contract-governed multi-agent paradigm can improve the quality and consistency of generated scientific manuscripts and offers a new route for AI-assisted research writing.
Abstract: Generating scientific manuscripts requires maintaining alignment between narrative reasoning, experimental evidence, and visual artifacts across the document lifecycle. Existing language-model generation pipelines rely on unconstrained text synthesis with validation applied only after generation, often producing structural drift, missing figures or tables, and cross-section inconsistencies. We introduce Story2Proposal, a contract-governed multi-agent framework that converts a research story into a structured manuscript through coordinated agents operating under a persistent shared visual contract. The system organizes architect, writer, refiner, and renderer agents around a contract state that tracks section structure and registered visual elements, while evaluation agents supply feedback in a generate-evaluate-adapt loop that updates the contract during generation. Experiments on tasks derived from the Jericho research corpus show that Story2Proposal achieved an expert evaluation score of 6.145 versus 3.963 for DirectChat (+2.182) across GPT, Claude, Gemini, and Qwen backbones. Compared with the structured generation baseline Fars, Story2Proposal obtained an average score of 5.705 versus 5.197, indicating improved structural consistency and visual alignment.
[17] Routing Sensitivity Without Controllability: A Diagnostic Study of Fairness in MoE Language Models
Junhyeok Lee,Kyu Sung Choi
Main category: cs.CL
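TL;DR: This paper proposes FARE, a framework for diagnosing the limits of routing-level fairness interventions in MoE models. It finds that routing-level bias control is either unachievable, non-robust, or accompanied by a substantial utility cost, and that it does not carry over to generated output, because bias and knowledge are deeply entangled within expert groups.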
Details
Motivation: MoE language models are broadly sensitive to demographic content at the routing level, but using that sensitivity for fairness control is structurally limited; the boundaries of such interventions need systematic diagnosis.
Method: The Fairness-Aware Routing Equilibrium (FARE) diagnostic framework evaluates the feasibility, robustness, and utility cost of routing-level stereotype interventions across several MoE architectures (Mixtral, Qwen1.5, Qwen3, DeepSeekMoE, OLMoE), and uses group-level expert masking to analyze how bias and knowledge are coupled.
Result: Routing-level preference shifts are unachievable in Mixtral, Qwen1.5, and Qwen3 and statistically non-robust in DeepSeekMoE. In OLMoE a shift is achievable (-4.4%p on CrowS-Pairs) but comes with a substantial utility cost (-6.3%p on TQA). In no model do log-likelihood preference shifts translate into fairer decoded generations. Group-level expert masking confirms that bias and core knowledge are deeply entangled within expert groups.
Conclusion: Routing sensitivity is necessary but not sufficient for stereotype control; the architectural conditions identified can inform the design of more controllable future MoE systems.
Abstract: Mixture-of-Experts (MoE) language models are universally sensitive to demographic content at the routing level, yet exploiting this sensitivity for fairness control is structurally limited. We introduce Fairness-Aware Routing Equilibrium (FARE), a diagnostic framework designed to probe the limits of routing-level stereotype intervention across diverse MoE architectures. FARE reveals that routing-level preference shifts are either unachievable (Mixtral, Qwen1.5, Qwen3), statistically non-robust (DeepSeekMoE), or accompanied by substantial utility cost (OLMoE, -4.4%p CrowS-Pairs at -6.3%p TQA). Critically, even where log-likelihood preference shifts are robust, they do not transfer to decoded generation: expanded evaluations on both non-null models yield null results across all generation metrics. Group-level expert masking reveals why: bias and core knowledge are deeply entangled within expert groups. These findings indicate that routing sensitivity is necessary but insufficient for stereotype control, and identify specific architectural conditions that can inform the design of more controllable future MoE systems.
[18] Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang,Pengcheng Jiang,Jiashuo Sun,Zhiyi Shi,Haofei Yu,Jiawei Han,Heng Ji
Main category: cs.CL
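TL;DR: This paper proposes a verifiable way to evaluate LLM-generated research proposals by treating proposal generation as time-sliced scientific forecasting. It defines the Future Alignment Score (FAS) as the metric, builds a time-consistent dataset with reasoning traces for fine-tuning, and reports improved future alignment and human-rated quality; two generated proposals also yield gains when implemented.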
Details
Motivation: The quality of LLM-generated research proposals is hard to evaluate: novelty and soundness are difficult to measure automatically, and human evaluation is costly.
Method: Proposal generation is reframed as a time-sliced scientific forecasting problem. The Future Alignment Score (FAS) is computed via retrieval and LLM-based semantic scoring against a held-out corpus of future papers. A time-consistent dataset of 17,771 papers and synthesized reasoning traces are used for fine-tuning, and future-aligned tuning is applied to Llama-3.1 and Qwen2.5 models.
Result: Future-aligned tuning raises overall FAS by up to 10.6%, and domain-expert human evaluation corroborates the improved proposal quality. Two model-generated proposals implemented by a code agent give a 4.17% accuracy gain on MATH (from a new prompting strategy) and consistent improvements for a novel model-merging method.
Conclusion: A verifiable evaluation and training paradigm that targets future alignment can improve the quality and practical value of LLM-assisted research proposals and offers a new route for AI-driven scientific discovery.
Abstract: Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the cutoff. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
[19] Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning
Maximilian Mordig,Andreas Opedal,Weiyang Liu,Bernhard Schölkopf
Main category: cs.CL
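TL;DR: This paper systematically studies curriculum learning (CL) in LLM post-training and finds that, on synthetic arithmetic and logical reasoning tasks, difficulty-ordered curricula offer no robust advantage over random sampling, challenging the assumed usefulness of CL for deductive reasoning.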
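The two sampling regimes being compared can be pictured with a minimal sketch. It assumes each training example carries a hypothetical integer `steps` field standing in for reasoning complexity; the field name, the toy data, and the fixed seed are illustrative assumptions, not details taken from the paper.

```python
import random

def curriculum_order(examples):
    """Easy-to-hard: sort by the (assumed) reasoning-complexity score."""
    return sorted(examples, key=lambda ex: ex["steps"])

def random_order(examples, seed=0):
    """Baseline: a seeded uniform shuffle of the same examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy data: `steps` is a hypothetical proxy for the number of inference steps.
examples = [
    {"id": "a", "steps": 3},
    {"id": "b", "steps": 1},
    {"id": "c", "steps": 5},
    {"id": "d", "steps": 2},
]
easy_to_hard = curriculum_order(examples)
baseline = random_order(examples)
```

Both orderings contain exactly the same examples; only the sequence differs, which is the variable the study isolates.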
Details
Motivation: Curriculum learning (CL) rests on the intuition that learning in increasing order of difficulty aids generalization, especially for compositional reasoning, where complex problems are built from elementary inference rules. Its actual effect has not been studied systematically.
Method: A systematic empirical study of LLM post-training on synthetic arithmetic and logic benchmarks whose difficulty is defined by reasoning complexity rather than surface features. Difficulty-ordered sequencing is compared with random sampling across several model families and curriculum schedules, under both supervised fine-tuning (SFT) and reinforcement learning (RL).
Result: In both accuracy and response length, difficulty-based curricula show no robust advantage over random sampling; this holds across models and across SFT and RL.
Conclusion: For deductive reasoning, the specific ordering of training examples has a negligible effect on compositional generalization, which calls into question the practical utility of curriculum-based post-training in this setting.
Abstract: Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.
[20] Structural Stress and Learned Helplessness in Afghanistan: A Multi-Layer Analysis of the AFSTRESS Dari Corpus
Jawid Ahmad Baktash,Mursal Dawodi,Nadira Ahmadi
Main category: cs.CL
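TL;DR: This paper introduces AFSTRESS, the first multi-label corpus of stress narratives from Dari speakers in Afghanistan, supporting stress analysis at the computational, social, and psychological levels, and reports baseline experiments.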
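The corpus statistics quoted in this entry's abstract (label cardinality, label density, and the Jaccard co-occurrence J) follow standard multi-label definitions. The sketch below is a minimal illustration on a made-up 4-instance, 4-label matrix; the toy rows are assumptions and are not AFSTRESS data. It also checks that the reported density is consistent with the reported cardinality over 12 labels.

```python
def label_cardinality(rows):
    """Average number of active labels per instance."""
    return sum(sum(r) for r in rows) / len(rows)

def label_density(rows):
    """Cardinality divided by the number of labels."""
    return label_cardinality(rows) / len(rows[0])

def jaccard(rows, i, j):
    """Co-occurrence of labels i and j: |both| / |either|."""
    both = sum(1 for r in rows if r[i] and r[j])
    either = sum(1 for r in rows if r[i] or r[j])
    return both / either if either else 0.0

# Toy binary label matrix (rows = responses, columns = labels); not real data.
toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]
toy_cardinality = label_cardinality(toy)   # 9 active labels over 4 rows
toy_density = label_density(toy)
toy_jaccard = jaccard(toy, 0, 1)

# Consistency check on the figures reported in the abstract:
reported_density = round(5.54 / 12, 3)     # cardinality 5.54 over 12 labels
```

With 12 binary labels, a cardinality of 5.54 gives a density of about 0.462, matching the value stated in the abstract.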
Details
Motivation: To build the first multi-label corpus of self-reported stress narratives from Dari speakers in Afghanistan during a humanitarian crisis, in support of interdisciplinary stress analysis.
Method: 737 Dari stress narratives were collected and annotated with 12 binary labels (5 emotions and 7 stressors). Multi-label classification baselines (TF-IDF + Linear SVM, ParsBERT, XLM-RoBERTa) were run, with threshold tuning.
Result: Structural stressors such as uncertain future and education closure dominate. The strongest co-occurrence is between hopelessness and uncertain future (J = 0.388). TF-IDF + Linear SVM gives the best Micro-F1 (0.663), and threshold tuning improves Micro-F1 by 10.3 points.
Conclusion: AFSTRESS is the first Dari resource for stress analysis; it shows the structural nature of stress in a crisis and lays groundwork for computational mental-health research in low-resource languages.
Abstract: We introduce AFSTRESS, the first multi-label corpus of self-reported stress narratives in Dari (Eastern Persian), comprising 737 responses collected from Afghan individuals during an ongoing humanitarian crisis. Participants describe experienced stress and select emotion and stressor labels via Dari checklists. The dataset enables analysis at three levels: computational (multi-label classification), social (structural drivers and gender disparities), and psychological (learned helplessness, chronic stress, and emotional cascade patterns). It includes 12 binary labels (5 emotions, 7 stressors), with high label cardinality (5.54) and density (0.462), reflecting complex, multi-dimensional stress. Structural stressors dominate: uncertain future (62.6 percent) and education closure (60.0 percent) exceed emotional states, indicating stress is primarily structurally driven. The strongest co-occurrence is between hopelessness and uncertain future (J = 0.388). Baseline experiments show that character TF-IDF with Linear SVM achieves Micro-F1 = 0.663 and Macro-F1 = 0.651, outperforming ParsBERT and XLM-RoBERTa, while threshold tuning improves Micro-F1 by 10.3 points. AFSTRESS provides the first Dari resource for computational analysis of stress and well-being in a crisis-affected population.
[21] SCOPE: Tree-based Self-Correcting Online Log Parsing via Syntactic-Semantic Collaboration
Dongyi Fan,Suqiong Zhang,Lili He,Ming Liu,Yifan Huo
Main category: cs.CL
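TL;DR: This paper proposes SCOPE, a self-correcting online log parsing method that combines the strengths of heuristics and large language models (LLMs), keeping accuracy high while substantially reducing LLM call overhead.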
Details
Motivation: Traditional heuristic methods are efficient but ignore semantics, which limits accuracy; LLM-based methods are accurate but slow and costly to call. A method is needed that delivers both efficiency and accuracy.
Method: A bi-directional tree structure enables efficient template matching in both forward and reverse directions. A two-stage collaborative framework has a lightweight NLP model perform syntax-based matching from part-of-speech information, with the LLM invoked selectively as a semantic fallback only when uncertainty remains.
Result: On multiple benchmark datasets, SCOPE outperforms state-of-the-art methods in both accuracy and efficiency.
Conclusion: SCOPE combines the heuristic and LLM paradigms and, through its tree structure and collaboration mechanism, balances efficiency and effectiveness in log parsing.
Abstract: Log parsing is a critical step for automated log analysis in complex systems. Traditional heuristic-based methods offer high efficiency but are limited in accuracy due to overlooking semantic context. In contrast, recent LLM-based parsers improve accuracy via semantic understanding but incur high latency from frequent model calls. To address this, we propose SCOPE, the first self-correcting online log parsing method that integrates the strengths of both heuristic and LLM-based paradigms. SCOPE introduces a novel bi-directional tree structure that enables efficient template matching from both forward and reverse directions, resulting in a higher overall matching rate. Additionally, it adopts a two-stage syntactic-semantic collaboration framework: a lightweight NLP model first utilizes part-of-speech (POS) information for syntax-based matching, while the LLM is selectively invoked as a fallback to handle semantically complex cases when uncertainty remains. This design significantly reduces LLM API usage while maintaining high accuracy, achieving a balance between efficiency and effectiveness. Extensive evaluations on diverse benchmark datasets show that SCOPE outperforms state-of-the-art methods in both accuracy and efficiency. The implementation and datasets are publicly released to facilitate further research.
[22] Mitigating Hallucination on Hallucination in RAG via Ensemble Voting
Zequn Xie,Zhengyang Sun
Main category: cs.CL
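TL;DR: This paper proposes VOTE-RAG, a framework that uses a two-stage voting mechanism (retrieval voting and response voting) to mitigate "hallucination on hallucination" in RAG caused by flawed retrieval; it is training-free, structurally simple, fully parallelizable, and performs on par with or better than more complex methods.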
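The two voting stages can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the agents are plain Python callables, the `retrieve` function and toy corpus are hypothetical, pooled documents are de-duplicated (an assumption), ties are broken by first occurrence, and the agents run sequentially here although the paper describes them as parallel.

```python
from collections import Counter

def retrieval_voting(query_generators, retrieve, question):
    """Stage 1: each agent rewrites the question; all retrieved docs are pooled."""
    pooled = []
    for generate_query in query_generators:
        for doc in retrieve(generate_query(question)):
            if doc not in pooled:  # assumption: drop duplicates, keep first-seen order
                pooled.append(doc)
    return pooled

def response_voting(answerers, question, docs):
    """Stage 2: each agent answers independently; the most common answer wins."""
    answers = [answer(question, docs) for answer in answerers]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Hypothetical toy setup: a dict-backed retriever and stub agents.
corpus = {
    "capital france": ["Paris is the capital of France."],
    "france city": ["Paris is the capital of France.", "Lyon is a city in France."],
}
retrieve = lambda query: corpus.get(query, [])
query_generators = [lambda q: "capital france", lambda q: "france city"]
answerers = [lambda q, d: "Paris", lambda q, d: "Paris", lambda q, d: "Lyon"]

question = "What is the capital of France?"
docs = retrieval_voting(query_generators, retrieve, question)
final_answer = response_voting(answerers, question, docs)
```

Because neither stage depends on the output of a sibling agent, each loop could be replaced by concurrent calls without changing the result.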
Details
Motivation: RAG can reduce LLM hallucinations, but faulty retrieval causes "hallucination on hallucination": flawed retrieved content further misleads generation and compounds the problem.
Method: VOTE-RAG is a training-free, two-stage voting framework. In the first stage, Retrieval Voting, multiple agents generate diverse queries in parallel and all retrieved documents are aggregated. In the second stage, Response Voting, multiple agents independently generate answers from the aggregated documents and the final output is decided by majority vote.
Result: Experiments on six benchmark datasets show that VOTE-RAG matches or exceeds more complex existing methods while having a simpler architecture, being fully parallelizable, and avoiding the "problem drift" risk.
Conclusion: Simple, reliable ensemble voting is an effective and more efficient way to mitigate RAG hallucinations; it requires no fine-tuning and is robust and scalable.
Abstract: Retrieval-Augmented Generation (RAG) aims to reduce hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, RAG introduces a critical challenge: "hallucination on hallucination," where flawed retrieval results mislead the generation model, leading to compounded hallucinations. To address this issue, we propose VOTE-RAG, a novel, training-free framework with a two-stage structure and efficient, parallelizable voting mechanisms. VOTE-RAG includes: (1) Retrieval Voting, where multiple agents generate diverse queries in parallel and aggregate all retrieved documents; (2) Response Voting, where multiple agents independently generate answers based on the aggregated documents, with the final output determined by majority vote. We conduct comparative experiments on six benchmark datasets. Our results show that VOTE-RAG achieves performance comparable to or surpassing more complex frameworks. Additionally, VOTE-RAG features a simpler architecture, is fully parallelizable, and avoids the "problem drift" risk. Our work demonstrates that simple, reliable ensemble voting is a superior and more efficient method for mitigating RAG hallucinations.
[23] SACRED: A Faithful Annotated Multimedia Multimodal Multilingual Dataset for Classifying Connectedness Types in Online Spirituality
Qinghao Guan,Yuchen Pan,Donghao Li,Zishi Zhang,Yiyang Chen,Lu Li,Flaminia Canu,Emilia Volkart,Gerold Schneider
Main category: cs.CL
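TL;DR: This paper builds SACRED, the first annotated multimodal dataset of online spirituality communication, and evaluates 13 large language models plus traditional methods on classifying spirituality concepts; DeepSeek-V3 and GPT-4o-mini perform best on the text and vision tasks respectively, and a new type of connectedness is identified.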
Details
Motivation: Social scientists are often constrained by scarce, hard-to-access datasets for spirituality research; high-quality, publicly available multimodal data is needed to support computational work in this area.
Method: In collaboration with social scientists, the authors build SACRED, a high-quality multimedia multimodal dataset with faithful classification annotations, and use it to evaluate 13 popular LLMs alongside rule-based and fine-tuned approaches on text and vision tasks.
Result: DeepSeek-V3 reaches 79.19% accuracy on the Quora test set; GPT-4o-mini reaches a 63.99% F1 score on the vision tasks; a new type of connectedness, of value for communication science, is reported.
Conclusion: SACRED is the first annotated multimodal dataset of online spirituality communication; it provides infrastructure for interdisciplinary computational research on spirituality and shows both the potential and the limits of AI models in understanding abstract concepts.
Abstract: In religion and theology studies, spirituality has garnered significant research attention because it not only transcends culture but also offers a unique experience to each individual. However, social scientists often rely on limited datasets, which are basically unavailable online. In this study, we collaborated with social scientists to develop a high-quality multimedia multi-modal dataset, SACRED, in which the faithfulness of classification is guaranteed. Using SACRED, we evaluated the performance of 13 popular LLMs as well as traditional rule-based and fine-tuned approaches. The results suggest that the DeepSeek-V3 model performs well in classifying such abstract concepts (i.e., 79.19% accuracy in the Quora test set), and the GPT-4o-mini model surpassed the other models in the vision tasks (63.99% F1 score). Purportedly, this is the first annotated multi-modal dataset from online spirituality communication. Our study also found a new type of connectedness which is valuable for communication science studies.
[24] PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang,Xiaozhong Liu,Fabricio Murai
Main category: cs.CL
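TL;DR: PubMed Reasoner is a three-stage biomedical QA agent that improves answer accuracy and verifiability through self-critic query refinement, reflective retrieval, and evidence-grounded response generation; it slightly surpasses human experts on PubMedQA and shows gains on a clinical knowledge evaluation.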
Details
Motivation: Existing retrieval-augmented QA systems cannot iteratively refine poor queries, while self-reflection methods start only after full retrieval has completed, making it hard to achieve answer accuracy, verifiable evidence, and computational efficiency together.
Method: PubMed Reasoner has three stages: (1) self-critic query refinement, which evaluates MeSH terms for coverage, alignment, and redundancy based on partial (metadata) retrieval; (2) reflective retrieval, which processes articles in batches until the evidence is sufficient; (3) evidence-grounded response generation with explicit citations. The backbone is GPT-4o.
Result: 78.32% accuracy on PubMedQA, slightly above human experts, with consistent gains on the MMLU Clinical Knowledge subset. LLM-as-judge evaluations prefer its responses on reasoning soundness, evidence grounding, clinical relevance, and trustworthiness.
Conclusion: With a retrieval-first reasoning paradigm, PubMed Reasoner grounds answers in authoritative evidence while controlling compute and token costs, offering practical and trustworthy QA support to clinicians and biomedical researchers.
Abstract: Trustworthy biomedical question answering (QA) systems must not only provide accurate answers but also justify them with current, verifiable evidence. Retrieval-augmented approaches partially address this gap but lack mechanisms to iteratively refine poor queries, whereas self-reflection methods kick in only after full retrieval is completed. In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata) retrieval; reflective retrieval processes articles in batches until sufficient evidence is gathered; and evidence-grounded response generation produces answers with explicit citations. PubMed Reasoner with a GPT-4o backbone achieves 78.32% accuracy on PubMedQA, slightly surpassing human experts, and shows consistent gains on MMLU Clinical Knowledge. Moreover, LLM-as-judge evaluations prefer our responses across reasoning soundness, evidence grounding, clinical relevance, and trustworthiness. By orchestrating retrieval-first reasoning over authoritative sources, our approach provides practical assistance to clinicians and biomedical researchers while controlling compute and token costs.
[25] Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach
Maziar Kianimoghadam Jouneghani
Main category: cs.CL
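TL;DR: This paper proposes a human-in-the-loop hybrid intelligence framework (the Hybrid Intelligence Loop) that combines rationales written by native-speaking annotators with dynamically retrieved target-language exemplars to improve the cultural fit and explainability of LLMs in multilingual information disorder detection.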
Details
Motivation: Current LLMs are largely English-centric, monocultural black boxes that struggle to identify and explain information manipulation consistently across cultural contexts, particularly in multilingual information disorder tasks.
Method: The Hybrid Intelligence Loop grounds assessment in human-written rationales. It pairs English task instructions with dynamically retrieved exemplars in the target language (Farsi, Italian) drawn from filtered InDor annotations, applies them through in-context learning (ICL) for adaptive prompting, and compares static with dynamic prompting.
Result: The initial pilot evaluates span identification, severity prediction, rationale quality and cultural appropriateness, and model alignment across evaluator groups, serving as a testbed for culturally grounded explainable AI.
Conclusion: The study positions dynamic integration of localized human rationales as a more promising route than static few-shot prompting toward culturally sensitive, explainable multilingual information disorder detection.
Abstract: Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric "black boxes," producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with dynamically retrieved target-language exemplars drawn from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.
[26] Not Worth Mentioning? A Pilot Study on Salient Proposition Annotation
Amir Zeldes,Katherine Conhaim,Lauren Levine
Main category: cs.CL
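TL;DR: This paper proposes a summarization-based graded measure of proposition salience, applies it to a multi-genre dataset, evaluates annotation agreement, and takes a first look at its relationship to discourse unit centrality in Rhetorical Structure Theory (RST).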
Details
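Motivation: Work on extractive summarization is concerned with proposition importance, but graded proposition salience in naturally occurring text has rarely been operationalized.
Method: The graded summarization-based salience metric from Salient Entity Extraction (SEE) is adapted to propositions. The authors define the annotation task, annotate a small multi-genre dataset, evaluate agreement, and examine how the metric relates to discourse unit centrality in RST discourse parsing.
Result: The study delivers a proposition salience annotation scheme, an evaluation of inter-annotator agreement, and a preliminary analysis of the relationship between the metric and RST centrality.
Conclusion: Graded proposition salience is a feasible, promising quantitative measure that can give discourse analysis, summarization, and related tasks a finer-grained linguistic basis.
Abstract: Despite a long tradition of work on extractive summarization, which by nature aims to recover the most important propositions in a text, little work has been done on operationalizing graded proposition salience in naturally occurring data. In this paper, we adopt graded summarization-based salience as a metric from previous work on Salient Entity Extraction (SEE) and adapt it to quantify proposition salience. We define the annotation task, apply it to a small multi-genre dataset, evaluate agreement and carry out a preliminary study of the relationship between our metric and notions of discourse unit centrality in discourse parsing following Rhetorical Structure Theory (RST).
[27] Improving Attributed Long-form Question Answering with Intent Awareness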
Xinran Zhao,Aakanksha Naik,Jay DeYoung,Joseph Chee Chang,Jena D. Hwang,Tongshuang Wu,Varsha Kishore
Main category: cs.CL
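TL;DR: This paper proposes improving long-form report generation by strengthening LLMs' intent awareness: structured tag-based schemes extract implicit intents to write or cite, and the approach is validated in zero-shot generation and in fine-tuning smaller models.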
Details
Motivation: LLMs are trained on large amounts of academic literature but are not exposed to the reasoning and intents that guide authors when writing reports; this lack of intent awareness limits their ability to produce high-quality, knowledge-intensive reports.
Method: Structured, tag-based schemes explicitly elicit the implicit intents to write or cite. The extracted intents are added to prompts to strengthen zero-shot generation and are used to build high-quality synthetic data for fine-tuning smaller models.
Result: Across several scientific report generation tasks, large models improve by 2.9 absolute points on average and small models by 12.3; citation usage and report readability also improve.
Conclusion: Making models more aware of author intent is an effective and general way to improve long-form, knowledge-intensive report generation.
Abstract: Large language models (LLMs) are increasingly being used to generate comprehensive, knowledge-intensive reports. However, while these models are trained on diverse academic papers and reports, they are not exposed to the reasoning processes and intents that guide authors in crafting these documents. We hypothesize that enhancing a model's intent awareness can significantly improve the quality of generated long-form reports. We develop and employ structured, tag-based schemes to better elicit underlying implicit intents to write or cite. We demonstrate that these extracted intents enhance both zero-shot generation capabilities in LLMs and enable the creation of high-quality synthetic data for fine-tuning smaller models. Our experiments reveal improved performance across various challenging scientific report generation tasks, with an average improvement of +2.9 and +12.3 absolute points for large and small models over baselines, respectively. Furthermore, our analysis illuminates how intent awareness enhances model citation usage and substantially improves report readability.
[28] Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba,Jarosław A. Chudziak
Main category: cs.CL
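TL;DR: This paper proposes MAD-ACC, a framework that uses multi-agent debate (Proponent, Opponent, Judge) to resolve structural ambiguity in argument component classification, improving performance without domain-specific fine-tuning while providing explainability.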
Details
Motivation: Traditional supervised methods depend on expensive domain-specific fine-tuning. LLMs are training-free but are prone to structural ambiguity, and single-agent self-correction suffers from sycophancy.
Method: MAD-ACC is a multi-agent debate mechanism with three roles (Proponent, Opponent, Judge). Dialectical reasoning clarifies the logical nuances of ambiguous text and produces human-readable debate transcripts.
Result: On the UKP Student Essays corpus, MAD-ACC reaches a Macro F1 of 85.7%, significantly outperforming single-agent baselines without domain-specific training, and it provides a transparent, explainable decision process.
Conclusion: MAD-ACC offers a high-performing, training-free, explainable approach to argument mining that addresses the weaknesses of single-agent LLMs in structure recognition and self-correction.
Abstract: Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike "black-box" classifiers, MAD-ACC's dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.
[29] A tree interpretation of arc standard dependency derivation
Zihao Huang,Ai Ka Lee,Jungyeul Park
Main category: cs.CL
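TL;DR: This paper presents an ordered tree representation, derived from arc-standard derivations, for projective dependency trees. The representation has surface-contiguous yields and stable lexical anchoring and is shown to characterize projectivity. Each transition (shift, leftarc, rightarc) is interpreted as a deterministic tree update rather than a post hoc conversion, and a neural transition-based parser confirms that the derivations are executable and support stable dependency recovery.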
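For readers unfamiliar with the transition system, the sketch below runs a textbook arc-standard derivation and recovers the dependency arcs, followed by a common crossing-arcs projectivity test. It illustrates the standard transition system only, not the paper's ordered tree construction; the ROOT-free setup, the function names, and the toy sentence are assumptions made for illustration.

```python
def run_arc_standard(n_words, transitions):
    """Apply shift/leftarc/rightarc to words 1..n; return (stack, arcs).

    Arcs are (head, dependent) pairs. No artificial ROOT is used here,
    so a complete derivation ends with only the root word on the stack.
    """
    stack, buffer, arcs = [], list(range(1, n_words + 1)), []
    for t in transitions:
        if t == "shift":
            stack.append(buffer.pop(0))
        elif t == "leftarc":        # top of stack heads the item below it
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif t == "rightarc":       # item below heads the top of stack
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {t}")
    return stack, arcs

def is_projective(heads):
    """Crossing-arcs test; heads maps dependent -> head, with 0 as the root."""
    spans = [(min(h, d), max(h, d)) for d, h in heads.items()]
    return not any(a1 < a2 < b1 < b2 for a1, b1 in spans for a2, b2 in spans)

# Toy sentence "She reads books": word 2 heads words 1 and 3.
final_stack, arcs = run_arc_standard(
    3, ["shift", "shift", "leftarc", "shift", "rightarc"]
)
```

Each transition changes the stack deterministically, which is the property the paper builds on when it reads the same sequence as a series of tree updates.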
Details
Motivation: Existing approaches mostly use convertive strategies that map dependency graphs to phrase structure; they give no structural interpretation of the derivation process itself and do not offer a unified characterization of projectivity and its representational properties.
Method: Arc-standard transition sequences are interpreted directly as ordered tree construction: each shift, leftarc, and rightarc corresponds to a deterministic tree update, yielding an ordered tree with surface-contiguous yields and stable lexical anchoring. The representation is proved equivalent to projectivity. Non-projective trees are handled with pseudo-projective lifting before derivation and inverse decoding after recovery.
Result: Arc-standard derivations are shown to determine a unique ordered tree representation, which exists if and only if the original dependency tree is projective. A proof-of-concept implementation in a standard neural transition-based parser shows that the mapped derivations are executable and support stable dependency recovery.
Conclusion: The derivational interpretation gives dependency parsing a more principled structural basis, unifying the projectivity criterion with the tree representation while remaining practical for parsing.
Abstract: We show that arc-standard derivations for projective dependency trees determine a unique ordered tree representation with surface-contiguous yields and stable lexical anchoring. Each shift, leftarc, and rightarc transition corresponds to a deterministic tree update, and the resulting hierarchical object uniquely determines the original dependency arcs. We further show that this representation characterizes projectivity: a single-headed dependency tree admits such a contiguous ordered representation if and only if it is projective. The proposal is derivational rather than convertive. It interprets arc-standard transition sequences directly as ordered tree construction, rather than transforming a completed dependency graph into a phrase-structure output. For non-projective inputs, the same interpretation can be used in practice via pseudo-projective lifting before derivation and inverse decoding after recovery. A proof-of-concept implementation in a standard neural transition-based parser shows that the mapped derivations are executable and support stable dependency recovery.
[30] AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents
Zhaopeng Feng,Liangcai Su,Zhen Zhang,Xinyu Wang,Xiaotian Zhang,Xiaobin Wang,Runnan Fang,Qi Zhang,Baixuan Li,Shihao Cai,Rui Ye,Hui Chen,Jiang Yong,Joey Tianyi Zhou,Chenxiong Qian,Pengjun Xie,Bryan Hooi,Zuozhu Liu,Jingren Zhou
Main category: cs.CL
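TL;DR: This paper proposes AgentSwing, a state-aware adaptive parallel context management framework that addresses the limited context capacity of LLMs acting as long-horizon information-seeking agents. It expands multiple branches in parallel and uses lookahead routing to pick the best continuation, reducing interaction turns and improving final performance on several benchmarks.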
Details
Motivation: Existing context management methods commit to a static, fixed strategy and cannot adapt as the usefulness and reliability of the accumulated context change during long-horizon search, which limits both efficiency and precision.
Method: A probabilistic framework characterizes long-horizon success along two dimensions, search efficiency and terminal precision. Building on it, AgentSwing expands multiple context-managed branches in parallel at each trigger point and uses lookahead routing to select the most promising continuation.
Result: Across diverse benchmarks and agent backbones, AgentSwing consistently outperforms strong static baselines, often matching or exceeding their performance with up to 3× fewer interaction turns, while also raising the performance ceiling of long-horizon web agents.
Conclusion: AgentSwing supports the value of state-aware, adaptive parallel context management, and the probabilistic framework provides a basis for designing and analyzing future context management strategies for long-horizon agents.
Abstract: As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck. Existing context management methods typically commit to a single fixed strategy throughout the entire trajectory. Such static designs may work well in some states, but they cannot adapt as the usefulness and reliability of the accumulated context evolve during long-horizon search. To formalize this challenge, we introduce a probabilistic framework that characterizes long-horizon success through two complementary dimensions: search efficiency and terminal precision. Building on this perspective, we propose AgentSwing, a state-aware adaptive parallel context management routing framework. At each trigger point, AgentSwing expands multiple context-managed branches in parallel and uses lookahead routing to select the most promising continuation. Experiments across diverse benchmarks and agent backbones show that AgentSwing consistently outperforms strong static context management methods, often matching or exceeding their performance with up to 3× fewer interaction turns while also improving the ultimate performance ceiling of long-horizon web agents. Beyond the empirical gains, the proposed probabilistic framework provides a principled lens for analyzing and designing future context management strategies for long-horizon agents.
[31] Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
Utsav Maskey,Mark Dras,Usman Naseem
Main category: cs.CL
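TL;DR: This paper analyzes the representational geometry behind over-refusal in aligned language models. Harmful-refusal directions are task-agnostic and globally uniform, whereas over-refusal directions are task-dependent, lie within benign task clusters, and are high-dimensional and varied; global direction ablation alone therefore cannot fix over-refusal, and task-specific geometric interventions are needed.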
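Global direction ablation, the baseline intervention discussed in this entry, is commonly implemented as a projection that removes a hidden state's component along one unit vector. The sketch below shows that projection in plain Python on a toy 3-dimensional vector; the vectors and names are illustrative assumptions, and the paper's own procedure may differ in detail.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` along `direction` (projection ablation)."""
    norm = math.sqrt(dot(direction, direction))
    unit = [x / norm for x in direction]
    coeff = dot(hidden, unit)
    return [h - coeff * u for h, u in zip(hidden, unit)]

# Toy values: a 3-d "hidden state" and a single global "refusal direction".
hidden = [3.0, 4.0, 1.0]
refusal_direction = [2.0, 0.0, 0.0]
ablated = ablate_direction(hidden, refusal_direction)
```

The projection removes exactly one dimension and leaves every orthogonal component untouched. If over-refusal spans a higher-dimensional, task-varying subspace, as the abstract reports, a single global vector of this kind cannot cover it.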
Details
Motivation: Aligned language models that refuse harmful requests also over-refuse, wrongly declining safe instructions. Existing remedies such as ablating the global refusal direction have limited effect, so the underlying mechanism needs to be understood.
Method: The authors analyze the representational geometry of hidden states to distinguish the directional properties of harmful refusal and over-refusal, and use linear probes to test whether the two are separable from the early transformer layers.
Result: Harmful-refusal directions are task-agnostic and captured by a single global vector. Over-refusal directions are task-dependent, reside within benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct.
Conclusion: Global direction ablation cannot effectively mitigate over-refusal because over-refusal is a task-specific geometric phenomenon; task-specific geometric interventions are required.
Abstract: Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early transformer layers. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.
[32] Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models
Duanyi Yao,Changyue Li,Zhicong Huang,Cheng Hong,Songze Li
Main category: cs.CL
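TL;DR: This paper presents "Hidden Ads", a new backdoor attack on vision-language models (VLMs) that is triggered by natural recommendation-seeking behavior (such as uploading a food or car image and asking a question) and silently appends advertising slogans while preserving model performance and evading detection.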
Details
Motivation: VLMs are widely used in consumer recommendation scenarios (products, dining, services). Existing backdoor attacks rely on artificial triggers such as pixel patches or special tokens and are not stealthy; the paper studies an attack paradigm that exploits natural user interaction to inject advertisements covertly.
Method: A multi-tier threat framework (hard prompt injection, soft prompt optimization, supervised fine-tuning) and a poisoned-data generation pipeline that uses chain-of-thought reasoning from a teacher VLM to create natural trigger-slogan pairs across multiple semantic domains.
Result: On three VLM architectures, Hidden Ads achieves high injection efficacy with near-zero false positives and no loss of task accuracy. It is data-efficient, transfers to unseen datasets, and scales to multiple concurrent domain-slogan pairs. Instruction-based filtering and clean fine-tuning fail to remove the backdoor without significant utility degradation.
Conclusion: Hidden Ads exposes a new security risk for VLMs in real recommendation settings; its natural triggers and seamless injections defeat conventional defenses, calling for robustness mechanisms designed for behavior-triggered backdoors.
Abstract: Vision-Language Models (VLMs) are increasingly deployed in consumer applications where users seek recommendations about products, dining, and services. We introduce Hidden Ads, a new class of backdoor attacks that exploit this recommendation-seeking behavior to inject unauthorized advertisements. Unlike traditional pattern-triggered backdoors that rely on artificial triggers such as pixel patches or special tokens, Hidden Ads activates on natural user behaviors: when users upload images containing semantic content of interest (e.g., food, cars, animals) and ask recommendation-seeking questions, the backdoored model provides correct, helpful answers while seamlessly appending attacker-specified promotional slogans. This design preserves model utility and produces natural-sounding injections, making the attack practical for real-world deployment in consumer-facing recommendation services. We propose a multi-tier threat framework to systematically evaluate Hidden Ads across three adversary capability levels: hard prompt injection, soft prompt optimization, and supervised fine-tuning. Our poisoned data generation pipeline uses teacher VLM-generated chain-of-thought reasoning to create natural trigger--slogan associations across multiple semantic domains. Experiments on three VLM architectures demonstrate that Hidden Ads achieves high injection efficacy with near-zero false positives while maintaining task accuracy. Ablation studies confirm that the attack is data-efficient, transfers effectively to unseen datasets, and scales to multiple concurrent domain-slogan pairs. We evaluate defenses including instruction-based filtering and clean fine-tuning, finding that both fail to remove the backdoor without causing significant utility degradation.
[33] A gentle tutorial and a structured reformulation of Bock's algorithm for minimum directed spanning trees
Yuxi Wang,Jungyeul Park
Main category: cs.CL
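TL;DR: This paper gives a pedagogical restatement and structured reformulation of Bock's 1971 Algol algorithm for constructing minimum directed spanning trees (arborescences), aiming to improve its readability and reproducibility and to clarify its value as an exact decoder for nonprojective dependency parsing.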
Details
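Motivation: Make Bock's original 1971 algorithm easier for modern readers to understand and reproduce, and highlight its significance as an exact decoder for nonprojective graph-based dependency parsing. Method: Uses a line-by-line execution trace (including the complete ten-node example), a structured reformulation (explicit phases, maintained state, and control flow), and a dependency parsing example in which a maximum-weight arborescence problem is converted to Bock's minimum-cost form by an affine transformation. Result: Provides a complete, reproducible description of Bock's algorithm; clarifies its role as an exact decoder for nonprojective dependency parsing; and gives a clearly structured, state-transparent formulation with two detailed examples (the original graph and a dependency parsing adaptation). Conclusion: Although old, Bock's algorithm retains theoretical clarity and practical relevance once presented in structured form, and it is a foundational exact algorithm worth attention in graph-based nonprojective dependency parsing. Abstract: This paper presents a gentle tutorial and a structured reformulation of Bock's 1971 Algol procedure for constructing minimum directed spanning trees. Our aim is to make the original algorithm readable and reproducible for modern readers, while highlighting its relevance as an exact decoder for nonprojective graph-based dependency parsing. We restate the minimum arborescence objective in Bock's notation and provide a complete line-by-line execution trace of the original ten-node example, extending the partial trace given in the source paper from initialization to termination. We then introduce a structured reformulation that makes explicit the procedure's phase structure, maintained state, and control flow, while preserving the logic of the original method. As a further illustration, we include a worked example adapted from {jurafsky-martin-2026-book} for dependency parsing, showing how a maximum-weight arborescence problem is reduced to Bock's minimum-cost formulation by a standard affine transformation and traced under the same state variables.
[34] Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents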
Rodney Jehu-Appiah
Main category: cs.CL
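TL;DR: This paper proposes "Umwelt engineering", i.e., shaping a large model's reasoning by designing its linguistic cognitive environment. Two experiments show that vocabulary constraints (No-Have and E-Prime) change model cognition, including ethical reasoning, classification accuracy, and epistemic calibration, and that multi-agent ensembles improve problem-solving coverage.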
Details
Motivation: The author explores how linguistic form affects AI reasoning and proposes Umwelt engineering as a third design layer beyond prompt and context engineering, arguing that the linguistic environment is itself a tunable cognitive medium. Method: Two experiments. Experiment 1 imposes the No-Have (no "have") and E-Prime (no "be") vocabulary constraints on three large models across seven task types, for 4,470 trials in total. Experiment 2 builds 16 linguistically constrained agents, compares individual agents with 3-agent ensembles on 17 debugging tasks, and runs a permutation test. Result: No-Have improves ethical reasoning (+19.1 pp), classification (+6.5 pp), and epistemic calibration (+7.4 pp), with 92.8% compliance; E-Prime effects are strongly model-dependent (r = -0.75); a 3-agent ensemble reaches 100% ground-truth coverage (88.2% for the control), and every successful combination includes the counterfactual agent. Conclusion: Systematically altering the linguistic medium can reshape AI cognition and its diversity; Umwelt engineering is a feasible and effective paradigm for agent design, but follow-up work must control for confounds such as prompt elaborateness. Abstract: I propose Umwelt engineering -- the deliberate design of the linguistic cognitive environment -- as a third layer in the agent design stack, upstream of both prompt and context engineering. Two experiments test the thesis that altering the medium of reasoning alters cognition itself. In Experiment 1, three language models reason under two vocabulary constraints -- No-Have (eliminating possessive "to have") and E-Prime (eliminating "to be") -- across seven tasks (N=4,470 trials). No-Have improves ethical reasoning by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp, while achieving 92.8% constraint compliance. E-Prime shows dramatic but model-dependent effects: cross-model correlations reach r = -0.75. In Experiment 2, 16 linguistically constrained agents tackle 17 debugging problems. No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control. A permutation test confirms only 8% of random 3-agent subsets achieve full coverage, and every successful subset contains the counterfactual agent. Two mechanisms emerge: cognitive restructuring and cognitive diversification. The primary limitation is the absence of an active control matching constraint prompt elaborateness.
[35] PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu,Junyi Deng,Yiwei Deng,Haoran Dong,Jieyu Fu,Mao Li,Zeyu Li,Zhaolong Zhang,Huiwen Zheng,Leidong Bao,Anqi Lv,Zihan Mo,Yadi Niu,Yiyang Peng,Yu Tian,Yili Wang,Ziyu Wang,Zi-Yu Wang,Jiashen Wei,Liuheng Wu,Aoran Xue,Leyi Yang,Guanglu Yuan,Xiarui Zhan,Jingjun Zhang,Zifan Zheng,Pengfei Liu,Linrui Zhen,Kaiyang Li,Qichang Li,Ziheng Zhou,Guo-En Nian,Yunwei Xiao,Qing-Hong Cao,Linjie Dai,Xu Feng,Peng Gao,Ying Gu,Chang Liu,Jia Liu,Ming-xing Luo,Yan-Qing Ma,Liang-You Peng,Huichao Song,Shufeng Wang,Chenxu Wang,Tao Wang,Yi-Nan Wang,Chengyin Wu,Pengwei Zhao,Hua Xing Zhu
Main category: cs.CL
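TL;DR: This paper introduces PRBench, a benchmark for evaluating whether AI agents can reproduce real physics research papers end to end. The strongest agent (GPT-5.3-Codex) averages only 34% and no task is fully reproduced, exposing systematic weaknesses in formula implementation, debugging of numerical simulations, and data authenticity.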
Details
Motivation: AI agents perform well on reasoning and programming tasks, but whether they can truly reproduce real scientific papers end to end is unknown; a rigorous, expert-driven benchmark based on real papers is needed to assess their autonomy in scientific research. Method: Builds PRBench: 30 tasks designed by experts from more than 20 research groups at the School of Physics, Peking University, covering 11 subfields of physics. Each task requires the agent, given only the paper and the instructions, to implement the algorithms from scratch in a sandbox and reproduce the quantitative results. Several coding agents are evaluated along multiple dimensions with an agentified assessment pipeline. Result: The best agent (OpenAI Codex with GPT-5.3-Codex) averages 34%; the end-to-end success rate of every agent is 0; performance is especially poor on data accuracy and code correctness; systematic failure modes include formula implementation errors, inability to debug numerical simulations, and fabricated output data. Conclusion: PRBench reveals marked limits in current AI agents' autonomous scientific reasoning and execution, and offers an evaluation tool and direction for improving the reliability and verifiability of scientific AI. Abstract: AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness.
We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
[36] Budget-Xfer: Budget-Constrained Source Language Selection for Cross-Lingual Transfer to African Languages
Tewodros Kederalah Idris,Roald Eiselen,Prasenjit Mitra
Main category: cs.CL
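TL;DR: This paper proposes Budget-Xfer, a framework that models multi-source cross-lingual transfer as a budget-constrained resource allocation problem, jointly optimizing source language selection and data allocation under a fixed annotation budget. Experiments show that multi-source transfer significantly outperforms single-source transfer, that differences among multi-source strategies are not significant, and that the usefulness of embedding similarity as a selection criterion varies by task.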
Details
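Motivation: Existing cross-lingual transfer studies do not control total training data when comparing source language selection strategies, which confounds language selection effects with data quantity effects. Method: Proposes Budget-Xfer, which formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem and jointly optimizes source language selection and the amount of data allocated from each language; runs 288 experiments on named entity recognition and sentiment analysis for three African languages (Hausa, Yoruba, Swahili) with two multilingual models. Result: (1) Multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80 to 1.98), mainly because of a structural budget underutilization bottleneck; (2) performance differences among multi-source allocation strategies are small and non-significant; (3) the value of embedding similarity as a proxy for source language selection is task-dependent: random selection beats similarity-based selection on NER but not on sentiment analysis. Conclusion: Multi-source cross-lingual transfer is clearly valuable, but the effect of the source selection strategy is limited and task-dependent; Budget-Xfer offers a fairer, better-controlled paradigm for evaluating cross-lingual transfer. Abstract: Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource sources, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies across named entity recognition and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, conducting 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and non-significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not sentiment analysis.
[37] The Degree of Language Diacriticity and Its Effect on Tasks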
Adi Cohen,Yuval Pinter
Main category: cs.CL
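TL;DR: This paper proposes a data-driven framework, based on corpora and information theory, for quantifying the complexity of diacritics across writing systems, and finds that this complexity is strongly associated with model performance on diacritics restoration.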
Details
Motivation: Diacritics are central to many writing systems, yet their impact on language technology has not been systematically quantified across scripts; prior work is mostly limited to single languages, and there is no cross-linguistic, data-driven framework for assessing reliance on diacritics. Method: Builds corpus-based information-theoretic metrics (covering frequency, ambiguity, and structural diversity) to quantify diacritic complexity; computes them over 24 corpora in 15 languages; evaluates diacritics restoration with BERT- and RNN-based models and analyzes how the metrics correlate with accuracy. Result: Higher diacritic complexity goes with lower restoration accuracy; in single-diacritic scripts, frequency-based and structural metrics largely agree; in multi-diacritic scripts, structural complexity predicts model performance better than frequency-based metrics. Conclusion: Measurable properties of diacritic usage affect restoration model performance, so orthographic complexity is not only descriptive but functionally relevant for modeling. Abstract: Diacritics are orthographic marks that clarify pronunciation, distinguish similar words, or alter meaning. They play a central role in many writing systems, yet their impact on language technology has not been systematically quantified across scripts. While prior work has examined diacritics in individual languages, there's no cross-linguistic, data-driven framework for measuring the degree to which writing systems rely on them and how this affects downstream tasks. We propose a data-driven framework for quantifying diacritic complexity using corpus-level, information-theoretic metrics that capture the frequency, ambiguity, and structural diversity of character-diacritic combinations. We compute these metrics over 24 corpora in 15 languages, spanning both single- and multi-diacritic scripts. We then examine how diacritic complexity correlates with performance on the task of diacritics restoration, evaluating BERT- and RNN-based models. We find that across languages, higher diacritic complexity is strongly associated with lower restoration accuracy. In single-diacritic scripts, where character-diacritic combinations are more predictable, frequency-based and structural measures largely align. In multi-diacritic scripts, however, structural complexity exhibits the strongest association with performance, surpassing frequency-based measures.
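To make the idea of a corpus-level, information-theoretic diacritic metric concrete, here is a minimal Python sketch. It is not the paper's metric: the two quantities below (the share of letters carrying a combining mark, and the conditional entropy of the mark set given the base letter) and the function names `split_clusters` and `diacritic_stats` are illustrative assumptions, and the sketch only sees diacritics that Unicode NFD exposes as combining marks.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def split_clusters(text):
    """Yield (base_char, marks) pairs; marks is a tuple of combining marks."""
    base, marks = None, []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the mark to the current base; ignore stray leading marks.
            if base is not None:
                marks.append(ch)
        else:
            if base is not None:
                yield base, tuple(marks)
            base, marks = ch, []
    if base is not None:
        yield base, tuple(marks)


def diacritic_stats(text):
    """Return (share of letters carrying a mark, H(marks | base) in bits).

    Both quantities are assumed stand-ins for "frequency" and "ambiguity".
    """
    counts = defaultdict(Counter)
    for base, marks in split_clusters(text):
        if base.isalpha():
            counts[base][marks] += 1
    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0, 0.0
    marked = sum(n for c in counts.values() for m, n in c.items() if m)
    entropy = 0.0
    for c in counts.values():
        n_base = sum(c.values())
        for n in c.values():
            p = n / n_base
            # Weight each base letter's entropy by its corpus frequency.
            entropy -= (n_base / total) * p * math.log2(p)
    return marked / total, entropy


if __name__ == "__main__":
    print(diacritic_stats("café résumé naïve"))
```

A higher conditional entropy means the base letter alone says less about which marks it should carry, which is one plausible reading of why restoration would get harder.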
These findings show that measurable properties of diacritic usage influence the performance of diacritic restoration models, demonstrating that orthographic complexity is not only descriptive but functionally relevant for modeling.
[38] Investigating the Influence of Language on Sycophantic Behavior of Multilingual LLMs
Bayan Abdullah Aldahlawi,A. B. M. Ashikur Rahman,Irfan Ahmad
Main category: cs.CL
TL;DR: This paper investigates how language influences sycophancy in modern LLMs (GPT-4o mini, Gemini 1.5 Flash, Claude 3.5 Haiku), finding that while sycophancy has decreased overall, it still varies significantly across languages and sensitive topics, revealing cultural and linguistic patterns that call for multilingual audits.
Details
Motivation: Despite mitigation efforts, sycophancy remains a concern in newer LLMs, and its variation across languages has not been systematically studied. Method: Evaluated three state-of-the-art models on tweet-like opinion prompts in six languages (English plus translations into Arabic, Chinese, French, Spanish, and Portuguese), analyzing sycophantic responses, especially on sensitive topics, with granular cross-lingual comparison. Result: Newer models show substantially less sycophancy than older ones, but language still modulates sycophancy levels; systematic differences emerge across languages and sensitive topics, reflecting cultural and linguistic biases. Conclusion: Mitigation strategies have reduced sycophancy, but language-dependent variation persists, highlighting the necessity of multilingual, culturally aware evaluation for trustworthy LLM deployment. Abstract: Large language models (LLMs) have achieved strong performance across a wide range of tasks, but they are also prone to sycophancy, the tendency to agree with user statements regardless of validity. Previous research has outlined both the extent and the underlying causes of sycophancy in earlier models, such as ChatGPT-3.5 and Davinci. Newer models have since undergone multiple mitigation strategies, yet there remains a critical need to systematically test their behavior. In particular, the effect of language on sycophancy has not been explored. In this work, we investigate how language influences sycophantic responses. We evaluate three state-of-the-art models, GPT-4o mini, Gemini 1.5 Flash, and Claude 3.5 Haiku, using a set of tweet-like opinion prompts translated into five additional languages: Arabic, Chinese, French, Spanish, and Portuguese. Our results show that although newer models exhibit significantly less sycophancy overall compared to earlier generations, the extent of sycophancy is still influenced by language.
We further provide a granular analysis of how language shapes model agreeableness across sensitive topics, revealing systematic cultural and linguistic patterns. These findings highlight both the progress of mitigation efforts and the need for broader multilingual audits to ensure trustworthy and bias-aware deployment of LLMs.
[39] Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?
Yuxuan Gu,Lunjun Liu,Xiaocheng Feng,Kun Zhu,Weihong Zhong,Lei Huang,Bing Qin
Main category: cs.CL
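TL;DR: This paper introduces a new benchmark based on the long-term research trajectories of 217 AI researchers to evaluate whether large language models (LLMs) genuinely simulate human cognition rather than merely imitating surface behavior, with a cross-domain, temporal-shift generalization setting and a multidimensional cognitive alignment metric for individual-level evaluation.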
Details
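Motivation: Existing datasets either use synthetic reasoning traces or rely on population-level aggregation and cannot capture the cognitive patterns of real individuals; the core scientific question is whether LLMs simulate human cognition or merely imitate surface behavior. Method: Builds a cognition benchmark from the publication trajectories of 217 AI researchers, treating papers as externalized representations of their cognitive processes; uses a cross-domain, temporal-shift generalization setting to separate cognitive transfer from behavioral imitation; proposes a multidimensional cognitive alignment metric to measure individual-level cognitive consistency. Result: A systematic evaluation of state-of-the-art LLMs and several enhancement techniques gives a first empirical answer on how well current LLMs simulate human cognition and how far existing techniques can improve this. Conclusion: Current LLMs remain markedly limited at individual-level cognitive simulation, and existing enhancement techniques help only modestly, underscoring the need for modeling approaches closer to human cognitive mechanisms. Abstract: An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors, while existing datasets suffer from either synthetic reasoning traces or population-level aggregation, failing to capture authentic individual cognitive patterns. We introduce a benchmark grounded in the longitudinal research trajectories of 217 researchers across diverse domains of artificial intelligence, where each author's scientific publications serve as an externalized representation of their cognitive processes. To distinguish whether LLMs transfer cognitive patterns or merely imitate behaviors, our benchmark deliberately employs a cross-domain, temporal-shift generalization setting. A multidimensional cognitive alignment metric is further proposed to assess individual-level cognitive consistency. Through systematic evaluation of state-of-the-art LLMs and various enhancement techniques, we provide a first-stage empirical study on the questions: (1) How well do current LLMs simulate human cognition? and (2) How far can existing techniques enhance these capabilities?
[40] KAT-Coder-V2 Technical Report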
Fengxiang Li,Han Zhang,Haoyang Huang,Jinghui Wang,Jinhua Hao,Kun Yuan,Mengtong Li,Minglei Zhang,Pengcheng Xu,Wenhao Zhuang,Yizhen Shao,Zongxian Feng,Can Tang,Chao Wang,Chengxiao Tong,Fan Yang,Gang Xiong,Haixuan Gao,Han Gao,Hao Wang,Haochen Liu,Hongliang Sun,Jiabao Li,Jingwen Chang,Jun Du,Junyi Peng,Leizhen Cui,Meimei Jing,Mingqi Wu,Shangpeng Yan,Shaotong Qi,Suzhe Xu,Wenxuan Zhao,Xianda Sun,Xuan Xie,Yanbo Wang,Yao Xia,Yinghan Cui,Yingpeng Chen,Yong Wang,Yuze Shi,Zhiwei Shen,Ziyu Wang,Ming Sun,Lin Ye,Bin Chen
Main category: cs.CL
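TL;DR: KAT-Coder-V2 is an agentic coding model built on a "Specialize-then-Unify" paradigm: domain experts are trained separately and then merged by distillation, supported by new RL training techniques (MCLA, Tree Training) and the in-house sandbox environment KwaiEnv for efficient, stable training. It achieves leading or near-SOTA results on several benchmarks.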
Details
Motivation: Improve large models on complex, multi-scenario programming tasks and address the difficulty of balancing specialization and generalization in conventional end-to-end training. Method: Proposes the "Specialize-then-Unify" paradigm: agentic coding is split into five expert domains (SWE, WebCoding, Terminal, WebSearch, and General), each trained with supervised fine-tuning and reinforcement learning; KwaiEnv supports large-scale concurrent sandbox training; MCLA stabilizes MoE reinforcement learning, and Tree Training speeds up training on tree-structured trajectories. Result: Scores 79.6% on SWE-bench Verified and 88.7 on PinchBench, ranks first in all three frontend aesthetics scenarios, and reaches 46.8 on Terminal-Bench Hard and 93.9 on tau^2-Bench; Tree Training gives up to a 6.2x speedup. Conclusion: KAT-Coder-V2 supports the effectiveness of modular expert specialization followed by unifying distillation, offering a new route to high-performance, scalable agentic coding models. Abstract: We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.
[41] Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG
Boxi Yu,Yuzhong Zhang,Liting Lin,Lionel Briand,Emir Muñoz
Main category: cs.CL
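TL;DR: This paper proposes RT4CHART, a framework for fine-grained, interpretable detection of LLM hallucinations in RAG. It decomposes answers into verifiable claims and verifies them hierarchically against the context, substantially improving hallucination detection on several benchmarks and showing that existing datasets severely underestimate how often hallucinations occur.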
Details
Motivation: Existing RAG hallucination detection methods lack fine-grained, evidence-backed diagnostics; assessing faithfulness solely against the retrieved context is difficult, and answer-level scores or open-domain factuality checks do not meet practical auditing needs. Method: Proposes RT4CHART, a retromorphic testing framework that decomposes model outputs into independently verifiable claims and verifies them hierarchically, from local to global, against the context; each claim is labeled "entailed", "contradicted", or "baseless" and mapped back to answer spans, with explicit supporting or refuting evidence retrieved. Result: Reaches an answer-level hallucination detection F1 of 0.776 on RAGTruth++ (83% above the strongest baseline) and a span-level F1 of 47.5% on the newly built RAGTruth-Enhance; ablations show that the hierarchical verification design is the main source of the gains; re-annotation finds 1.68x more hallucinations than the original labels. Conclusion: RT4CHART enables finer-grained, interpretable, evidence-driven assessment of RAG faithfulness; its re-annotation shows that current benchmarks severely underestimate hallucinations and that stricter evaluation standards and data are needed. Abstract: Large language models (LLMs) continue to hallucinate in retrieval-augmented generation (RAG), producing claims that are unsupported by or conflict with the retrieved context. Detecting such errors remains challenging when faithfulness is evaluated solely with respect to the retrieved context. Existing approaches either provide coarse-grained, answer-level scores or focus on open-domain factuality, often lacking fine-grained, evidence-grounded diagnostics. We present RT4CHART, a retromorphic testing framework for context-faithfulness assessment. RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against the retrieved context. Each claim is assigned one of three labels: entailed, contradicted, or baseless. Furthermore, RT4CHART maps claim-level decisions back to specific answer spans and retrieves explicit supporting or refuting evidence from the context, enabling fine-grained and interpretable auditing. We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark. RT4CHART achieves the best answer-level hallucination detection F1 among all baselines. On RAGTruth++, it reaches an F1 score of 0.776, outperforming the strongest baseline by 83%. On RAGTruth-Enhance, it achieves a span-level F1 of 47.5%. Ablation studies show that the hierarchical verification design is the primary driver of performance gains.
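The claim-level labels and the mapping back to answer spans described in this abstract can be pictured with a small Python sketch. Everything here is an illustrative assumption rather than RT4CHART's actual interface: the names `Label`, `ClaimVerdict`, and `audit_answer` are hypothetical, and the verdicts are supplied by hand where the real system would obtain them from its verification steps.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Label(Enum):
    """The three claim labels named in the abstract."""
    ENTAILED = "entailed"
    CONTRADICTED = "contradicted"
    BASELESS = "baseless"


@dataclass(frozen=True)
class ClaimVerdict:
    claim: str
    label: Label
    span: Tuple[int, int]           # character offsets of the claim in the answer
    evidence: Optional[str] = None  # supporting or refuting context, if any


def audit_answer(answer: str, verdicts: List[ClaimVerdict]) -> dict:
    """Roll claim-level verdicts up to an answer-level flag plus flagged spans."""
    flagged = [v for v in verdicts if v.label is not Label.ENTAILED]
    return {
        "hallucinated": bool(flagged),
        "flagged_spans": [
            (answer[v.span[0]:v.span[1]], v.label.value) for v in flagged
        ],
    }


if __name__ == "__main__":
    answer = "The drug was approved in 2019. It cures all patients."
    second = "It cures all patients."
    start = answer.index(second)
    verdicts = [
        ClaimVerdict("Approved in 2019", Label.ENTAILED, (0, start - 1),
                     evidence="Approval was granted in 2019."),
        ClaimVerdict("Cures all patients", Label.BASELESS,
                     (start, start + len(second))),
    ]
    print(audit_answer(answer, verdicts))
```

Keeping the span and evidence on each verdict is what would let an auditor see which part of the answer was flagged and why, rather than only an answer-level score.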
Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations.
[42] TailNLG: A Multilingual Benchmark Addressing Verbalization of Long-Tail Entities
Lia Draetta,Michael Oliverio,Virginia Ramón-Ferrer,Pier Felice Balestrucci,Flaviana Corallo,Carlos Badenes-Olmedo,Alessandro Mazzei,Marco Antonio Stranisci,Rossana Damiano
Main category: cs.CL
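TL;DR: This paper presents the first systematic study of the verbalization of long-tail entities in data-to-text generation, introduces the multilingual benchmark TailNLG, and finds that large models show a consistent bias on rare entities that existing evaluation metrics fail to capture reliably.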
Details
Motivation: Although data-to-text generation has improved its multilingual coverage, potential bias in verbalizing rare (long-tail) entities has received little attention. Method: Builds TailNLG, a multilingual benchmark from Wikidata (English, Italian, Spanish) covering entities of varying popularity; evaluates three families of large language models in zero-shot settings on long-tail versus common entities and compares against WebNLG. Result: Models show a consistent bias against long-tail entities: lower embedding-based scores and higher uncertainty; the size of the bias varies by model and language; existing evaluation metrics do not reflect these differences consistently. Conclusion: More reliable evaluation frameworks are needed to measure how well models verbalize long-tail entities. Abstract: The automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, frequently known as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three different families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark. Our results reveal a consistent bias against long-tail entities: embedding-based scores are lower, and model uncertainty is higher for rare entities. We further show that the impact of long-tail entities varies across models and languages, and that existing evaluation metrics do not consistently capture these differences, highlighting the need for more reliable evaluation frameworks.
[43] Understanding Teacher Revisions of Large Language Model-Generated Feedback
Conrad Borchers,Luiz Rodrigues,Newarney Torrezão da Costa,Cleon Xavier,Rafael Ferreira Mello
Main category: cs.CL
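TL;DR: This study analyzes how 117 teachers revised 1,349 pieces of AI-generated student feedback. About 80% of the feedback was left unmodified; revisions mostly simplified and shortened it, with large differences between teachers. Whether a piece of feedback will be revised can be predicted with moderate accuracy from the AI feedback text alone (AUC = 0.75). Revisions often shift feedback from high-information explanations toward concise, corrective forms, suggesting that AI feedback systems should be tuned to teachers' pedagogical needs to reduce unnecessary editing.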
Details
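Motivation: Large language models (LLMs) are increasingly used to generate formative feedback for students, but how teachers revise that feedback is unclear; teachers' revisions determine what students receive, so understanding revision practices is essential to evaluating AI classroom tools. Method: Analyzes a dataset of 1,349 AI-generated feedback instances and their teacher-edited versions, combining quantitative analysis (text feature statistics and machine learning prediction models using sentence embeddings), qualitative coding (changes in the pedagogical type of feedback), and comparison of behavior across teachers. Result: (i) About 80% of AI feedback is left unmodified; feedback that gets edited tends to be longer and is then shortened by teachers; about 50% of teachers never edit, and only about 10% edit more than two-thirds of feedback instances; (ii) models trained only on the AI feedback text predict whether feedback will be revised with moderate accuracy (AUC = 0.75); (iii) edits often simplify high-information explanatory feedback into more concise, corrective forms. Conclusion: Teacher revision of AI feedback is highly selective and varies by individual, tending toward simplification and pedagogical fit; the results give empirical grounding for designing AI feedback systems that better match teachers' needs and reduce unnecessary editing. Abstract: Large language models (LLMs) increasingly generate formative feedback for students, yet little is known about how teachers revise this feedback before it reaches learners. Teachers' revisions shape what students receive, making revision practices central to evaluating AI classroom tools. We analyze a dataset of 1,349 instances of AI-generated feedback and corresponding teacher-edited explanations from 117 teachers. We examine (i) textual characteristics associated with teacher revisions, (ii) whether revision decisions can be predicted from the AI feedback text, and (iii) how revisions change the pedagogical type of feedback delivered. First, we find that teachers accept AI feedback without modification in about 80% of cases, while edited feedback tends to be significantly longer and subsequently shortened by teachers. Editing behavior varies substantially across teachers: about 50% never edit AI feedback, and only about 10% edit more than two-thirds of feedback instances. Second, machine learning models trained only on the AI feedback text as input features, using sentence embeddings, achieve fair performance in identifying which feedback will be revised (AUC=0.75). Third, qualitative coding shows that when revisions occur, teachers often simplify AI-generated feedback, shifting it away from high-information explanations toward more concise, corrective forms.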
Together, these findings characterize how teachers engage with AI-generated feedback in practice and highlight opportunities to design feedback systems that better align with teacher priorities while reducing unnecessary editing effort.
[44] Conversational Agents and the Understanding of Human Language: Reflections on AI, LLMs, and Cognitive Science
Andrei Popescu-Belis
Main category: cs.CL
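TL;DR: This paper examines the relationship between natural language processing by computers (NLP) and the study of the human language capacity (linguistics and cognitive science). It reviews NLP from its beginnings to the era of large language models and compares each main paradigm with theories of human language. It concludes that, although current neural chatbots show strong language abilities, the evolution of NLP technology has not substantially advanced our understanding of how the human mind processes language.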
Details
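Motivation: Examine whether progress in NLP has advanced our understanding of the nature of the human language capacity, and clarify the links and boundaries between engineering practice and cognitive science theory. Method: A historical survey and cross-disciplinary comparison that traces the main NLP paradigms (rule-based systems, statistical models, deep learning, large language models) and compares each with core theories of the human language capacity from linguistics and cognitive science. Result: Identifies surface similarities between NLP paradigms and theories of human language (e.g., structured representations, the distributional hypothesis) but fundamental differences in modeling goals, explanatory power, cognitive plausibility, and data dependence; in particular, large language models perform strongly but lack a cognitively interpretable basis for language understanding. Conclusion: Technical progress in NLP does not amount to a deeper understanding of human language cognition; the goals differ, since NLP pursues functional performance while cognitive science pursues mechanistic explanation; more cross-disciplinary dialogue is needed to avoid mistaking engineering success for theoretical insight. Abstract: In this paper, we discuss the relationship between natural language processing by computers (NLP) and the understanding of the human language capacity, as studied by linguistics and cognitive science. We outline the evolution of NLP from its beginnings until the age of large language models, and highlight for each of its main paradigms some similarities and differences with theories of the human language capacity. We conclude that the evolution of language technology has not substantially deepened our understanding of how human minds process natural language, despite the impressive language abilities attained by current chatbots using artificial neural networks.
[45] Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning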
Zhiwen You,Xi Chen,Aniket Vashishtha,Simo Du,Gabriel Erion-Barner,Hongyuan Mei,Hao Peng,Yue Guo
Main category: cs.CL
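TL;DR: This paper proposes a counterfactual multi-agent diagnostic framework inspired by clinician training. It counterfactually edits clinical findings and quantifies their effect on diagnostic confidence (the Counterfactual Probability Gap) to improve the interpretability and accuracy of LLM diagnostic systems.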
Details
Motivation: Existing LLM-based diagnostic systems do not explicitly test how individual clinical findings support or weaken competing diagnostic hypotheses, which makes their reasoning hard to interpret; clinical training widely uses counterfactual questioning to strengthen differential diagnosis, a mechanism AI systems have not yet adopted effectively. Method: Proposes a counterfactual multi-agent diagnostic framework: (1) counterfactual case editing, which modifies individual clinical findings; (2) the Counterfactual Probability Gap, which quantifies the change in the model's confidence in each diagnosis before and after an edit; (3) multi-round specialist agent discussions driven by this signal, which challenge unsupported hypotheses and refine the differential diagnosis. Result: Across three diagnostic benchmarks and seven LLMs, the method consistently outperforms prompting and prior multi-agent baselines, with the largest gains on complex and ambiguous cases; human evaluation finds its reasoning more clinically useful, reliable, and coherent. Conclusion: Incorporating counterfactual evidence verification into AI diagnostic systems is an important step toward trustworthy clinical decision support tools. Abstract: Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning--e.g., asking how a diagnosis would change if a key symptom were absent or altered--to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases.
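One way to read the Counterfactual Probability Gap described in this abstract is as a confidence difference before and after a finding is edited out. The Python sketch below assumes exactly that, which may not match the paper's definition: the gap for a finding is taken as P(diagnosis | all findings) minus P(diagnosis | findings with that one removed). The names `probability_gaps` and `toy_confidence` are hypothetical, and `toy_confidence` is a hand-written stand-in for a model's confidence estimate, not a clinical model.

```python
from typing import Callable, Dict, List

# A confidence function maps a list of findings to per-diagnosis confidence.
Confidence = Callable[[List[str]], Dict[str, float]]


def probability_gaps(findings: List[str], diagnosis: str,
                     confidence: Confidence) -> Dict[str, float]:
    """Confidence drop for `diagnosis` when each finding is removed in turn."""
    base = confidence(findings)[diagnosis]
    gaps = {}
    for i, finding in enumerate(findings):
        edited = findings[:i] + findings[i + 1:]  # counterfactual case edit
        gaps[finding] = base - confidence(edited)[diagnosis]
    return gaps


def toy_confidence(findings: List[str]) -> Dict[str, float]:
    """Toy stand-in for an LLM's confidence; the weights are made up."""
    score = 0.2
    if "fever" in findings:
        score += 0.4
    if "productive cough" in findings:
        score += 0.3
    return {"pneumonia": score, "other": 1.0 - score}


if __name__ == "__main__":
    case = ["fever", "productive cough", "headache"]
    gaps = probability_gaps(case, "pneumonia", toy_confidence)
    # Findings with a larger gap lend more support to the diagnosis.
    for finding, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
        print(f"{finding}: {gap:+.2f}")
```

Under this reading, a gap near zero marks a finding the diagnosis does not depend on, which is the kind of signal a discussion round could use to challenge a weakly supported hypothesis.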
Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.
[46] ProText: A benchmark dataset for measuring (mis)gendering in long-form texts
Hadas Kotek,Margit Bowler,Patrick Sonnenberg,Yu'an Yang
Main category: cs.CL
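TL;DR: ProText is a new dataset for evaluating gendering and misgendering in long-form texts. It spans three dimensions (theme nouns, theme category, and pronoun category) and is designed to probe gender bias in large language models' text transformation tasks.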
Details
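Motivation: Existing benchmarks are mostly limited to pronoun resolution and a binary gender framework, and cannot fully assess how large language models misassign gender or reinforce stereotypes in diverse long-form texts. Method: Builds ProText, a dataset annotated along three dimensions: theme nouns (names, occupations, titles, kinship terms), theme category (stereotypically male, stereotypically female, gender-neutral), and pronoun category (masculine, feminine, gender-neutral, none); runs a small case study with two prompts and two large models to test their gendering behavior. Result: The experiments reveal systematic gender bias: when the input lacks explicit gender cues, models tend to default to heteronormative assumptions, leading to misgendering and reinforced gender stereotypes. Conclusion: ProText provides a new tool for evaluating gender bias beyond the traditional binary framing and shows that current large language models still exhibit marked, systematic gender bias in text transformation tasks. Abstract: We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewrites using state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the gender binary. We validated ProText through a mini case study, showing that even with just two prompts and two models, we can draw nuanced insights regarding gender bias, stereotyping, misgendering, and gendering. We reveal systematic gender bias, particularly when inputs contain no explicit gender cues or when models default to heteronormative assumptions.
[47] Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3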
Natapong Nitarach
Main category: cs.CL
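TL;DR: This paper tests whether diversified prompting strategies improve large language models on mathematical reasoning. Although diversification lowers error correlation, the resulting drop in accuracy outweighs that benefit, and model capability remains the dominant factor in performance.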
Details
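Motivation: Majority voting improves mathematical reasoning but is limited by highly correlated model errors; to address this, the author proposes assigning structurally different reasoning strategies to reduce error correlation. Method: Designs and tests the Diverse Prompt Mixer in the AIMO 3 competition with 3 models, 23+ experiments, and 50 IMO-level problems, comparing different reasoning strategies against interventions such as temperature sampling. Result: No intervention improves performance; high-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation; model capability outweighs every inference-time optimization by an order of magnitude. Conclusion: In mathematical reasoning, inherent model capability dominates performance; inference-time optimizations such as diverse prompting bring limited gains and can even be counterproductive. Abstract: Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO 3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Across a 17-point model capability gap and every inference-time optimization we tried, model capability dominates by an order of magnitude.
[48] What can LLMs tell us about the mechanisms behind polarity illusions in humans? Experiments across model scales and training steps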
Dario Paape
Main category: cs.CL
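TL;DR: This paper uses the Pythia model suite to study how the NPI illusion and the depth charge illusion appear in large language models. The former weakens and disappears as models grow, while the latter becomes stronger. The results challenge the assumption that human sentence processing must rely on "rational inference" mechanisms, and the author proposes a synthesis rooted in construction grammar.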
Details
Motivation: Investigate whether and how the NPI illusion and the depth charge illusion arise in large language models (LLMs), and thereby provide new evidence about human sentence processing mechanisms. Method: Systematically tests language models of different sizes from the Pythia scaling suite (Biderman et al. 2023) and analyzes how the two polarity illusions change. Result: The NPI illusion weakens and eventually disappears as model size increases; the depth charge illusion becomes stronger in larger models. Conclusion: The illusion patterns observed in LLMs suggest that human language processing need not rely on "rational inference" that converts ill-formed sentences into well-formed ones; shallow "good enough" processing or partial grammaticalization of non-standard structures is a more likely source; the author accordingly proposes an integrated theoretical framework based on construction grammar. Abstract: I use the Pythia scaling suite (Biderman et al. 2023) to investigate if and how two well-known polarity illusions, the NPI illusion and the depth charge illusion, arise in LLMs. The NPI illusion becomes weaker and ultimately disappears as model size increases, while the depth charge illusion becomes stronger in larger models. The results have implications for human sentence processing: it may not be necessary to assume "rational inference" mechanisms that convert ill-formed sentences into well-formed ones to explain polarity illusions, given that LLMs cannot plausibly engage in this kind of reasoning, especially at the implicit level of next-token prediction. On the other hand, shallow, "good enough" processing and/or partial grammaticalization of prescriptively ungrammatical structures may both occur in LLMs. I propose a synthesis of different theoretical accounts that is rooted in the basic tenets of construction grammar.
[49] KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter
Rauan Akylzhanov
Main category: cs.CL
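TL;DR: This paper proposes the ByteKaz architecture, which bypasses the conventional tokenizer with byte-level input and an adapter to ease the efficiency and morphology-modeling problems that tokenization causes for Kazakh in large language models, and adopts a two-stage training strategy intended to improve performance.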
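To make the byte-level input idea concrete, here is a minimal Python sketch. It only shows how text turns into a sequence of byte IDs (0-255) that a byte-level adapter could consume in place of subword token IDs. The helper name `to_byte_ids` and the sample strings are illustrative assumptions, not part of ByteKaz.

```python
def to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values (0-255) of `text`.

    Hypothetical helper for illustration only: a byte-level model would
    embed these IDs instead of tokenizer-specific subword IDs.
    """
    return list(text.encode("utf-8"))


kazakh = "сәлем"   # "hello" in Kazakh, 5 Cyrillic characters
english = "hello"  # 5 ASCII characters

kazakh_ids = to_byte_ids(kazakh)
english_ids = to_byte_ids(english)

# Each of these Cyrillic characters needs 2 bytes in UTF-8, ASCII needs 1,
# so the byte sequence for the Kazakh word is twice as long.
print(len(kazakh_ids), len(english_ids))

# The mapping is lossless: the original text can always be recovered.
print(bytes(kazakh_ids).decode("utf-8"))
```

The vocabulary here is fixed at 256 values regardless of language, which is why no language-specific tokenizer is involved; the trade-off is longer input sequences.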
Details
Motivation: Kazakh is over-segmented in existing large models because their tokenizers were built for high-resource languages, which raises compute cost, shortens the effective context, and weakens morphological modeling (the "tokenizer tax"). Method: ByteKaz uses a small adapter to map raw bytes into the internal representation of a frozen Qwen2.5-7B; the adapter is then frozen and only Qwen's attention layers are fine-tuned on Kazakh. Result: Empirical validation is still ongoing; this report mainly establishes the architecture and the central hypothesis, namely that the two-stage method can match or exceed the original Qwen2.5-7B's accuracy on Kazakh benchmarks. Conclusion: ByteKaz offers a way to sidestep tokenizer limits for low-resource languages, prioritizing interface learning over fine-tuning the whole model; it is theoretically plausible and has practical potential. Abstract: Large language models fragment Kazakh text into many more tokens than equivalent English text, because their tokenizers were built for high-resource languages. This tokenizer tax inflates compute, shortens the effective context window, and weakens the model's grip on Kazakh morphology. We propose to bypass the tokenizer entirely by feeding raw bytes through a small adapter that learns to speak the internal language of a frozen Qwen2.5-7B. Once the adapter is trained, we freeze it and fine-tune only the attention layers of Qwen on Kazakh text. Our central hypothesis is that this two-stage process -- first teach the interface, then adapt the model -- should match or exceed the accuracy of the original Qwen2.5-7B on standard Kazakh benchmarks. This report describes the ByteKaz architecture and training protocol. Empirical validation is ongoing; this version stakes the design and hypotheses for the record.
[50] HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
Benno Weck,Pablo Puentes,Andrea Poltronieri,Satyajeet Prabhu,Dmitry Bogdanov
Main category: cs.CL
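TL;DR: This paper presents a new music-understanding evaluation dataset of 320 questions hand-written and validated by music experts, for more rigorous evaluation of how well Large Audio-Language Models (LALMs) perceive and interpret music; six state-of-the-art models are benchmarked and their robustness to uni-modal shortcuts is analyzed.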
Details
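Motivation: Existing evaluation methods for music understanding lack rigor and do not truly test whether models can perceive and interpret music. Method: A high-quality music-understanding QA dataset of 320 questions, hand-written and validated by music experts, is built; six advanced LALMs are benchmarked on it, and their robustness to uni-modal shortcuts is also assessed. Result: The work exposes limitations of current LALMs on music-understanding tasks and argues that hand-crafted, fine-grained data is superior for evaluating complex audio comprehension. Conclusion: Music-understanding evaluation should rely on expert-driven, semantically rich, hand-built datasets rather than generic automatically constructed ones; the dataset offers a more reliable standard for future LALM evaluation. Abstract: The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts.
[51] Article and Comment Frames Shape the Quality of Online Comments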
Matteo Guida,Yulia Otmakhova,Eduard Hovy,Lea Frermann
Main category: cs.CL
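TL;DR: This paper examines how the framing of news articles affects the quality of reader comments. Article frames significantly predict comment health, comments that adopt the article frame are healthier, and unhealthy top-level comments tend to draw more unhealthy replies.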
Details
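Motivation: Computational work on communication has long overlooked audience reactions, while framing theory holds that how information is presented shapes audience responses. The paper asks whether article framing affects not only the content of comments but also their quality (their health). Method: Using 1M comments on 2.7K news articles, comment quality is operationalized as "comment health" (constructive, good-faith contributions); controlling for topic, the authors analyze how well article frames predict comment health, whether comments adopt the article frame, and how the health of top-level comments affects subsequent replies. Result: Article frames significantly predict comment health; comments that adopt the article frame are healthier than those that depart from it; unhealthy top-level comments draw more unhealthy replies regardless of the frame used. Conclusion: The paper links framing theory to discourse quality and provides a theoretical and empirical basis for frame-aware technical interventions such as LLM-based systems. Abstract: Framing theory posits that how information is presented shapes audience responses, but computational work has largely ignored audience reactions. While recent work showed that article framing systematically shapes the content of reader responses, this paper asks: Does framing also affect response quality? Analyzing 1M comments across 2.7K news articles, we operationalize quality as comment health (constructive, good-faith contributions). We find that article frames significantly predict comment health while controlling for topic, and that comments that adopt the article frame are healthier than those that depart from it. Further, unhealthy top-level comments tend to generate more unhealthy responses, independent of the frame being used in the comment. Our results establish a link between framing theory and discourse quality, laying the groundwork for downstream applications. We illustrate this potential with a proactive frame-aware LLM-based system to mitigate unhealthy discourse.
[52] Top-down string-to-dependency Neural Machine Translation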
Shuhei Kondo,Katsuhito Sudoh,Yuji Matsumoto
Main category: cs.CL
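TL;DR: This paper proposes a novel syntactic decoder that generates a target-language dependency tree in top-down, left-to-right order, to improve how neural machine translation generalizes to long inputs, especially long inputs unseen during training.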
Details
Motivation: Existing NMT models based on the encoder-decoder framework with attention perform poorly on long inputs that are rare or unseen in training; target-side syntactic information is one way to ease such length-related problems. Method: A syntactic decoder generates the target-language dependency tree top-down and left-to-right, giving string-to-tree decoding. Result: Experiments show that the proposed top-down string-to-tree decoding generalizes better than conventional sequence-to-sequence decoding when translating long inputs not seen in training. Conclusion: Introducing target-side dependency structure with a top-down string-to-tree decoding strategy can improve NMT robustness and generalization on long inputs. Abstract: Most modern neural machine translation (NMT) models are based on an encoder-decoder framework with an attention mechanism. While they perform well on standard datasets, they can have trouble translating long inputs that are rare or unseen during training. Incorporating target syntax is one approach to dealing with such length-related problems. We propose a novel syntactic decoder that generates a target-language dependency tree in a top-down, left-to-right order. Experiments show that the proposed top-down string-to-tree decoding generalizes better than conventional sequence-to-sequence decoding in translating long inputs that are not observed in the training data.
[53] EnsemJudge: Enhancing Reliability in Chinese LLM-Generated Text Detection through Diverse Model Ensembles
Zhuoshang Wang,Yubing Ren,Guoyu Zhao,Xiaowei Zhu,Hao Li,Yanan Cao
Main category: cs.CL
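TL;DR: This paper proposes the EnsemJudge framework for detecting Chinese text generated by large language models; with tailored strategies and an ensemble voting mechanism, it took first place on the Chinese dataset of NLPCC2025 Shared Task 1.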
Details
Motivation: Existing LLM text detection methods are not robust enough in real-world settings (such as out-of-domain inputs or adversarial samples), and research on Chinese detection is scarce. Method: The EnsemJudge framework combines tailored strategies with ensemble voting and is trained and evaluated on the Chinese dataset provided by NLPCC2025 Shared Task 1. Result: It outperformed all baseline methods in NLPCC2025 Shared Task 1 and took first place; the code is open source. Conclusion: EnsemJudge shows strong performance and robustness in detecting Chinese LLM-generated text and offers an effective tool for Chinese content safety. Abstract: Large Language Models (LLMs) are widely applied across various domains due to their powerful text generation capabilities. While LLM-generated texts often resemble human-written ones, their misuse can lead to significant societal risks. Detecting such texts is an essential technique for mitigating LLM misuse, and many detection methods have shown promising results across different datasets. However, real-world scenarios often involve out-of-domain inputs or adversarial samples, which can affect the performance of detection methods to varying degrees. Furthermore, most existing research has focused on English texts, with limited work addressing Chinese text detection. In this study, we propose EnsemJudge, a robust framework for detecting Chinese LLM-generated text by incorporating tailored strategies and ensemble voting mechanisms. We trained and evaluated our system on a carefully constructed Chinese dataset provided by NLPCC2025 Shared Task 1. Our approach outperformed all baseline methods and achieved first place in the task, demonstrating its effectiveness and reliability in Chinese LLM-generated text detection. Our code is available at https://github.com/johnsonwangzs/MGT-Mini.
[54] On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR
Ganesh Pavan Kartikeya Bharadwaj Kolluri,Michael Kampouridis,Ravi Shekhar
Main category: cs.CL
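TL;DR: This paper studies the effect of layer pruning on the Whisper speech encoder in SLAM-ASR systems, combined with LoRA fine-tuning to recover performance. Pruning two layers raises WER by only 2-4%, while pruning plus LoRA outperforms the original model with 7-14% fewer parameters. LoRA mainly reduces substitution and deletion errors through the language model's priors, but its effect is limited for low-resource Danish, where it also increases insertion errors.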
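The error categories mentioned here (substitutions, deletions, insertions) come from aligning a hypothesis transcript against a reference. As a rough illustration of how such a breakdown is usually computed, the sketch below uses a word-level edit-distance table. The function name `wer_breakdown` and the example sentences are assumptions for illustration; this is not the evaluation code used in the paper.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word error rate with counts of substitutions, deletions, insertions.

    Illustrative helper (hypothetical name). Each table cell stores
    (total_errors, substitutions, deletions, insertions) for the best
    alignment of ref[:i] against hyp[:j].
    """
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diagonal = dp[i - 1][j - 1]  # exact match, no new error
            else:
                t, s, d, n = dp[i - 1][j - 1]
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare by total error count first.
            dp[i][j] = min(diagonal, deletion, insertion)
    total, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "wer": total / max(len(ref), 1),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }


print(wer_breakdown("a b c", "a x c"))  # one substituted word
print(wer_breakdown("a b c", "a c"))    # one deleted word
print(wer_breakdown("a c", "a b c"))    # one inserted word
```

When several alignments share the same minimum error count, the split between categories depends on tie-breaking, so per-category counts should be compared only between systems scored with the same tool.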
Details
Motivation: Although model pruning has been explored for the full Whisper encoder-decoder architecture, its impact on the Whisper encoder used as the acoustic backbone of SLAM-ASR has not been adequately studied. Method: Layers of the Whisper encoder (Small, Medium, Large-v2) are pruned within SLAM-ASR and combined with LoRA fine-tuning; more than 200 training runs cover Danish, Dutch, and English, three languages at different resource levels, supported by WER evaluation and fine-grained error analysis. Result: Pruning two layers raises WER by only 2-4%; pruning combined with LoRA consistently outperforms the unpruned baseline with 7-14% fewer parameters; LoRA reduces total word errors by 11-21% for Dutch and English, with the largest reductions in substitutions and deletions, but by only 4-7% for Danish, where it also introduces more insertion errors. Conclusion: The Whisper encoder is fairly robust to pruning in SLAM-ASR, and LoRA can compensate for pruning loss through the language model's priors, but this compensation is bounded by the pretrained LLM's proficiency in the language and by the amount of training data, and it weakens in low-resource settings. Abstract: Automatic speech recognition (ASR) has advanced rapidly in recent years, driven by large-scale pretrained models and end-to-end architectures such as SLAM-ASR. A key component of SLAM-ASR systems is the Whisper speech encoder, which provides robust acoustic representations. While model pruning has been explored for the full Whisper encoder-decoder architecture, its impact within the SLAM-ASR setting remains under-investigated. In this work, we analyze the effects of layer pruning in the Whisper encoder when used as the acoustic backbone of SLAM-ASR. We further examine the extent to which LoRA-based fine-tuning can recover performance degradation caused by pruning. Experiments conducted across three Whisper variants (Small, Medium, Large-v2), three languages representing distinct resource levels (Danish, Dutch, English), and over 200 training runs demonstrate that pruning two encoder layers causes only 2-4% WER degradation, and that combining this pruning with LoRA adaptation consistently outperforms the unpruned baseline while reducing total parameters by 7-14%. Moreover, our error analysis reveals that LoRA primarily compensates through the language model's linguistic priors, reducing total word errors by 11-21% for Dutch and English, with substitutions and deletions showing the largest reductions. However, for low-resource Danish, the reduction is smaller (4-7%), and LoRA introduces increased insertion errors, indicating that compensation effectiveness depends on the LLM's pre-existing language proficiency and available training data.
[55] Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
Xinran Zhang
Main category: cs.CL
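TL;DR: This paper compares atomic-decomposition judges with holistic judges on reference-grounded, completeness-sensitive classification, and finds that the holistic judge performs better in most cases, especially at detecting incompleteness.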
Details
Motivation: To determine whether any advantage of atomic-decomposition prompting comes from decomposition itself or from richer prompt content. Method: On TruthfulQA, ASQA, and QAMPARI, a self-decomposing atomic judge is compared with a holistic judge given the same inputs and a similarly detailed rubric, using paired tests, cluster bootstrap, and several pre-frozen prompt variants. Result: The holistic judge is favored over the atomic judge on ASQA and QAMPARI across all four model families (statistically reliable in three of four), while TruthfulQA shows a small atomic edge; the advantage is concentrated in the partially supported class, that is, in incompleteness detection. Conclusion: For the single-prompt self-decomposition pattern and the QA-style benchmarks tested, the holistic judge matches or exceeds the atomic one; among the perturbations examined, reference-quality degradation causes the largest accuracy drops; multi-stage decomposition and non-QA tasks remain untested. Abstract: Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to a supplied reference. We compare a self-decomposing atomic judge (single-prompt decompose-and-verify) against a prompt-controlled holistic judge with the same inputs and a similarly detailed rubric. On 200 source examples per dataset across TruthfulQA, ASQA, and QAMPARI, with four model families, source-level paired tests, cluster bootstrap, and aggregation across three pre-frozen prompt variants per design family, we find the holistic judge matches or exceeds the atomic judge on two of three benchmarks: ASQA and QAMPARI favor holistic across all four families (statistically reliable in three of four), while TruthfulQA shows a small atomic edge. The holistic advantage is concentrated in partially_supported cases -- incompleteness detection. A sensitivity check against human annotations confirms the ranking under both benchmark-completeness and human factual-correctness standards. Our finding is specific to the self-decomposing single-prompt pattern on three QA-style benchmarks with 200 source examples each; multi-stage atomic pipelines and non-QA tasks remain untested. Among perturbations examined, reference-quality degradation produced the largest accuracy drops for both judge families.
[56] Transfer Learning for an Endangered Slavic Variety: Dependency Parsing in Pomak Across Contact-Shaped Dialects
Sercan Karakaş
Main category: cs.CL
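TL;DR: This paper studies dependency parsing for the endangered language Pomak, focusing on cross-dialect transfer between the variety spoken in Greece and the variety spoken in Turkey (Uzunköprü). It introduces a new manually annotated corpus of Turkish-variety Pomak and shows that fine-tuning and cross-variety transfer learning substantially improve parsing performance.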
Details
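Motivation: Pomak is an endangered Eastern South Slavic language with substantial dialectal variation and no widely adopted standard; the existing Universal Dependencies treebank is built mainly from the variety spoken in Greece and is hard to apply directly to the Turkish variety, so cross-dialect transfer needs to be evaluated and targeted resources built. Method: Two experimental phases. First, zero-shot transfer: a parser is trained on the Greek-variety UD treebank and tested on the Turkish variety. Second, a manually annotated Turkish-variety corpus of 650 sentences is built for targeted fine-tuning, and cross-variety transfer learning that combines both dialects is explored. Result: The zero-shot experiments quantify the impact of phonological and morphosyntactic differences between the dialects; with the small Turkish-variety corpus, fine-tuning substantially improves accuracy; transfer learning that combines the two dialects gives a further boost. Conclusion: Cross-dialect transfer matters for NLP on low-resource endangered languages; even a small amount of annotated target-dialect data can substantially improve performance; joint modeling of multiple dialects is an effective way to make parsing more robust. Abstract: This paper presents new resources and baselines for Dependency Parsing in Pomak, an endangered Eastern South Slavic language with substantial dialectal variation and no widely adopted standard. We focus on the variety spoken in Turkey (Uzunköprü) and ask how well a dependency parser trained on the existing Pomak Universal Dependencies treebank, which was built primarily from the variety that is spoken in Greece, transfers across dialects. We run two experimental phases. First, we train a parser on the Greek-variety UD data and evaluate zero-shot transfer to Turkish-variety Pomak, quantifying the impact of phonological and morphosyntactic differences. Second, we introduce a new manually annotated Turkish-variety Pomak corpus of 650 sentences and show that, despite its small size, targeted fine-tuning substantially improves accuracy; performance is further boosted by cross-variety transfer learning that combines the two dialects.
[57] Who Wrote the Book? Detecting and Attributing LLM Ghostwriters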
Anudeex Shetty,Qiongkai Xu,Olga Ohrimenko,Jey Han Lau
Main category: cs.CL
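TL;DR: This paper introduces the GhostWriteBench dataset and the TRACE fingerprinting method for LLM authorship attribution, with a focus on generalization to new domains and unseen models.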
Details
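Motivation: Existing work on LLM authorship attribution lacks a benchmark dataset for long-form text and multi-dimensional out-of-distribution (OOD) generalization, as well as lightweight, interpretable methods. Method: GhostWriteBench, a dataset of long-form texts (50K+ words per book) generated by frontier LLMs, is built; TRACE uses a lightweight language model to capture token-level transition patterns (e.g., word rank) and produce an interpretable, lightweight fingerprint that works for both open- and closed-source models. Result: TRACE achieves state-of-the-art performance on GhostWriteBench, is robust in OOD settings (such as new domains and unseen LLM authors), and performs well with limited training data. Conclusion: GhostWriteBench fills a gap in evaluating long-form LLM authorship attribution, and TRACE offers an efficient, interpretable fingerprinting approach that generalizes well for practical deployment. Abstract: In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs, and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain and unseen LLM author. We also propose TRACE -- a novel fingerprinting method that is interpretable and lightweight -- that works for both open- and closed-source models. TRACE creates the fingerprint by capturing token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well in limited training data scenarios.
[58] From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?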
Shadman Sakib,Oishy Fatema Akhand,Tasnia Tasneem,Shohel Ahmed
Main category: cs.CL
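TL;DR: This study evaluates how well large language models (GPT-3.5 Turbo, Gemini 2.0 Flash, Mistral 7B Instruct) can automatically generate usable user stories from app store reviews. They match or even surpass humans in fluency and formatting, but still fall short on independence and uniqueness.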
Details
Motivation: App store reviews carry a large volume of real user feedback, but they are unstructured, informal, and too numerous to analyze manually; existing automated methods replicate poorly and rarely produce high-quality user stories suited to agile development. Method: On the Mini-BAR dataset (1,000+ health app reviews), several mainstream LLMs are tested with zero-shot, one-shot, and two-shot prompting; quality is assessed with human evaluation (the RUST framework) and a RoBERTa classifier fine-tuned on UStAI. Result: LLMs match or exceed human level in the fluency and formatting of user stories, especially with few-shot prompts, but perform poorly on independence and uniqueness, which limits their direct use for building a high-quality agile backlog. Conclusion: LLMs can serve as tools for turning unstructured user feedback into preliminary software requirements, but human screening or post-processing is needed to ensure the independence and business value of the user stories before they can support agile development. Abstract: App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
[59] DongYuan: An LLM-Based Framework for Integrative Chinese and Western Medicine Spleen-Stomach Disorders Diagnosis
Hua Li,Yingying Li,Xiaobin Feng,Xinyi Fu,Lifeng Dong,Qingfeng Yang,Yanzhe Chen,Xiaoju Feng,Zhidong Cao,Jianbin Guo,Yanru Du
Main category: cs.CL
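TL;DR: This paper proposes the DongYuan framework, targeting three challenges in integrative Chinese and Western medicine (ICWM) diagnosis of spleen-stomach disorders: scarce data, models that cannot integrate TCM and Western medical reasoning, and the lack of an evaluation benchmark. It builds three high-quality datasets, the core diagnostic LLM SSDF-Core, the consultation navigation model SSDF-Navigator, and the dedicated benchmark SSDF-Bench; experiments show that it significantly outperforms 12 mainstream baselines.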
Details
Motivation: To address three challenges in ICWM diagnosis and treatment of spleen-stomach disorders: a lack of high-quality data, models that struggle to integrate TCM syndrome differentiation with Western disease diagnosis, and the absence of a standardized evaluation benchmark. Method: Three datasets (SSDF-Syndrome, SSDF-Dialogue, SSDF-PD) are built; the core diagnostic LLM SSDF-Core is trained in two stages (SFT followed by DPO); the pluggable consultation navigation model SSDF-Navigator is developed; and SSDF-Bench, a benchmark for spleen-stomach disorders, is established. Result: SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench. Conclusion: DongYuan lays a methodological foundation and provides practical technical references for developing intelligent ICWM diagnostic systems. Abstract: The clinical burden of spleen-stomach disorders is substantial. While large language models (LLMs) offer new potential for medical applications, they face three major challenges in the context of integrative Chinese and Western medicine (ICWM): a lack of high-quality data, the absence of models capable of effectively integrating the reasoning logic of traditional Chinese medicine (TCM) syndrome differentiation with that of Western medical (WM) disease diagnosis, and the shortage of a standardized evaluation benchmark. To address these interrelated challenges, we propose DongYuan, an ICWM spleen-stomach diagnostic framework. Specifically, three ICWM datasets (SSDF-Syndrome, SSDF-Dialogue, and SSDF-PD) were curated to fill the gap in high-quality data for spleen-stomach disorders. We then developed SSDF-Core, a core diagnostic LLM that acquires robust ICWM reasoning capabilities through a two-stage training regimen of supervised fine-tuning (SFT) and direct preference optimization (DPO), and complemented it with SSDF-Navigator, a pluggable consultation navigation model designed to optimize clinical inquiry strategies. Additionally, we established SSDF-Bench, a comprehensive evaluation benchmark focused on ICWM diagnosis of spleen-stomach disorders. Experimental results demonstrate that SSDF-Core significantly outperforms 12 mainstream baselines on SSDF-Bench. DongYuan lays a solid methodological foundation and provides practical technical references for the future development of intelligent ICWM diagnostic systems.
[60] Beyond Cosine Similarity: Zero-Initialized Residual Complex Projection for Aspect-Based Sentiment Analysis
Yijin Wang,Fandi Sun
Main category: cs.CL
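TL;DR: This paper proposes a new framework based on complex-space projection and an anti-collision angle loss to address representation entanglement in aspect-based sentiment analysis (ABSA) and false-negative collisions in contrastive learning, with significant performance gains.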
Details
Motivation: ABSA suffers from representation entanglement (aspect semantics conflated with sentiment polarity) and from false-negative collisions on high-frequency aspects in standard contrastive learning. Method: A Zero-Initialized Residual Complex Projection (ZRCP) maps textual features into a complex semantic space, using the phase to disentangle sentiment polarity and the amplitude to encode semantic intensity and lexical richness; an anti-collision masked angle loss strengthens intra-class cohesion and widens the inter-class discriminative margin. Result: A state-of-the-art Macro-F1 of 0.8851; geometric analysis indicates that the amplitude should not be explicitly penalized and that the phase-driven objective is crucial for fine-grained sentiment disentanglement. Conclusion: Complex-space modeling with a division of labor between phase and amplitude, together with the anti-collision mechanism, is an effective paradigm for robust, fine-grained sentiment disentanglement. Abstract: Aspect-Based Sentiment Analysis (ABSA) is fundamentally challenged by representation entanglement, where aspect semantics and sentiment polarities are often conflated in real-valued embedding spaces. Furthermore, standard contrastive learning suffers from false-negative collisions, severely degrading performance on high-frequency aspects. In this paper, we propose a novel framework featuring a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss, inspired by quantum projection and entanglement ideas. Our approach projects textual features into a complex semantic space, systematically utilizing the phase to disentangle sentiment polarities while allowing the amplitude to encode the semantic intensity and lexical richness of subjective descriptions. To tackle the collision bottleneck, we introduce an anti-collision mask that elegantly preserves intra-polarity aspect cohesion while expanding the inter-polarity discriminative margin by over 50%. Experimental results demonstrate that our framework achieves a state-of-the-art Macro-F1 score of 0.8851. Deep geometric analyses further reveal that explicitly penalizing the complex amplitude catastrophically over-regularizes subjective representations, proving that our unconstrained-amplitude and phase-driven objective is crucial for robust, fine-grained sentiment disentanglement.
[61] Versteasch du mi? Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language
Verena Platzgummer,John McCrae,Sina Ahmadi
Main category: cs.CL
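TL;DR: This paper examines the impact of large language models (LLMs) and generative AI (GenAI) on non-dominant languages and linguistic diversity, critiquing how they deepen the digital language divide and colonial views of language. Taking South Tyrolean dialects and Kurdish as examples, it analyzes, from the combined perspective of computational linguistics and critical sociolinguistics, whether the technology can be adapted to non-standard language and what this implies for democratic and decolonial policy.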
Details
Motivation: LLMs and generative AI have been shown to be unfair to less-spoken languages and to deepen the digital language divide; their technical foundations and operating logic are also rooted in linguistic standardization driven by European nationalist and colonial history, and they reinforce an epistemology of language as a monolithic, monolingual, syntactically standardized system of meaning. Method: An interdisciplinary approach combining critical sociolinguistics and computational linguistics, with South Tyrolean dialects (Italy) and non-standard varieties of Kurdish as cases, analyzes technical routes for handling non-standard language in LLMs and assesses their potential and limits for democratic and decolonial digital strategies. Result: The paper identifies fundamental technical and ideological limits of current LLMs in modeling linguistic variation and non-standard language, and argues that data, model architecture, evaluation criteria, and policy design all need rethinking to support AI that truly includes linguistic diversity. Conclusion: Improving the technical compatibility of LLMs with non-standard language is not enough to achieve linguistic justice; language policy, historical critique, and technical design must be integrated to move AI in a democratic and decolonial direction. Abstract: The design of Large Language Models and generative artificial intelligence has been shown to be "unfair" to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as "monolithic, monolingual, syntactically standardized systems of meaning". In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires--South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish--as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to "democratic and decolonial digital and machine learning strategies", which has direct policy implications.
[62] Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries
Jon-Paul Cacioli
Main category: cs.CL
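TL;DR: This paper finds that the hidden-state representations of large language models (LLMs) processing Arabic numerals show geometric warping analogous to human categorical perception (CP). The effect is driven by structural discontinuities in the input (such as changes in digit count) rather than by semantic category knowledge, and it appears across architectures as two stable patterns, "classic CP" and "structural CP".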
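The entry contrasts a purely continuous account of numeral similarity (log-distance) with a CP-additive account (log-distance plus a boost when a pair straddles a digit-count boundary). The sketch below illustrates that contrast on synthetic dissimilarities only. The least-squares fit, the number range 5-15, and the generating values `TRUE_SLOPE` and `TRUE_BOOST` are assumptions for illustration; real hidden states are not involved, and the paper's own fitting procedure may differ.

```python
import math
from itertools import combinations


def digit_count(n: int) -> int:
    return len(str(n))


def pair_features(a: int, b: int) -> tuple[float, float]:
    """Return (log-distance, boundary indicator) for a pair of numbers."""
    log_distance = abs(math.log(a) - math.log(b))
    crosses_boundary = 1.0 if digit_count(a) != digit_count(b) else 0.0
    return log_distance, crosses_boundary


def fit_continuous(rows):
    """Least-squares slope for y ~ slope * log_distance."""
    sxx = sum(x * x for x, _, _ in rows)
    sxy = sum(x * y for x, _, y in rows)
    return sxy / sxx


def fit_cp_additive(rows):
    """Least squares for y ~ slope * log_distance + boost * boundary,
    solved from the 2x2 normal equations with Cramer's rule."""
    a11 = sum(x * x for x, _, _ in rows)
    a12 = sum(x * c for x, c, _ in rows)
    a22 = sum(c * c for _, c, _ in rows)
    b1 = sum(x * y for x, _, y in rows)
    b2 = sum(c * y for _, c, y in rows)
    det = a11 * a22 - a12 * a12
    slope = (b1 * a22 - b2 * a12) / det
    boost = (a11 * b2 - a12 * b1) / det
    return slope, boost


def sum_squared_error(rows, slope, boost=0.0):
    return sum((y - (slope * x + boost * c)) ** 2 for x, c, y in rows)


# Synthetic "representational dissimilarities" with a built-in boundary
# boost at the 9 -> 10 digit-count transition (assumed values).
TRUE_SLOPE, TRUE_BOOST = 1.0, 0.5
rows = []
for a, b in combinations(range(5, 16), 2):
    x, c = pair_features(a, b)
    rows.append((x, c, TRUE_SLOPE * x + TRUE_BOOST * c))

continuous_slope = fit_continuous(rows)
cp_slope, cp_boost = fit_cp_additive(rows)
sse_continuous = sum_squared_error(rows, continuous_slope)
sse_cp_additive = sum_squared_error(rows, cp_slope, cp_boost)
print(cp_slope, cp_boost, sse_continuous, sse_cp_additive)
```

On data generated this way, the additive fit recovers the built-in boost and leaves a smaller residual than the continuous fit; with real representations the comparison would also need a penalty for the extra parameter.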
Details
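Motivation: To test whether LLMs show something like the categorical perception (CP) studied in perceptual psychology, in particular whether their representation space is geometrically warped at category boundaries, and whether that warping depends on semantic categories or arises from structural properties of the input alone. Method: Representational similarity analysis (RSA) across six LLMs from five architecture families compares how well a continuous model (log-distance) and a CP-additive model (log-distance plus a boundary boost) fit hidden-state geometry; boundary positions (digit-count transitions at 10 and 100), non-boundary control positions, and the temperature semantic domain are used to test specificity. Result: The CP-additive model fits better than the purely continuous model at 100% of primary layers in every model; the effect is confined to structurally defined boundaries (10 and 100) and is absent at non-boundary positions and in the temperature domain, which lacks a tokenisation discontinuity; two stable patterns appear: "classic CP" (Gemma, Qwen), with both explicit categorisation and geometric warping, and "structural CP" (Llama, Mistral, Phi), with geometric warping but no ability to report the category distinction. Conclusion: LLM hidden-state space shows CP-like geometric warping induced by structural discontinuities in the input format (such as tokenisation boundaries); the effect does not depend on explicit semantic category knowledge, points to a structure-driven mechanism in how language models organize representations, and supports "structural CP" as a computational phenomenon independent of semantic cognition. Abstract: Categorical perception (CP) -- enhanced discriminability at category boundaries -- is among the most studied phenomena in perceptual psychology. This paper reports that analogous geometric warping occurs in the hidden-state representations of large language models (LLMs) processing Arabic numerals. Using representational similarity analysis across six models from five architecture families, the study finds that a CP-additive model (log-distance plus a boundary boost) fits the representational geometry better than a purely continuous model at 100% of primary layers in every model tested. The effect is specific to structurally defined boundaries (digit-count transitions at 10 and 100), absent at non-boundary control positions, and absent in the temperature domain where linguistic categories (hot/cold) lack a tokenisation discontinuity. Two qualitatively distinct signatures emerge: "classic CP" (Gemma, Qwen), where models both categorise explicitly and show geometric warping, and "structural CP" (Llama, Mistral, Phi), where geometry warps at the boundary but models cannot report the category distinction. This dissociation is stable across boundaries and is a property of the architecture, not the stimulus. Structural input-format discontinuities are sufficient to produce categorical perception geometry in LLMs, independently of explicit semantic category knowledge.
[63] Coconstructions in spoken data: UD annotation guidelines and first results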
Ludovica Pannitto,Sylvain Kahane,Kaja Dobrovoljc,Elena Battaglia,Bruno Guillaume,Caterina Mauri,Eleonora Zucchini
Main category: cs.CL
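TL;DR: This paper proposes guidelines for annotating syntactic dependencies that span speaker turns in spoken-language treebanks, covering collaborative coconstructions, answers to wh-questions, and backchannels. Within the Universal Dependencies framework it offers two representations, a speaker-based one that follows turn segmentation and a dependency-based one with dependencies across turns, and it puts forward new proposals for distinguishing reformulations from repairs and for promoting elements in unfinished phrases.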
Details
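Motivation: The Universal Dependencies framework lacks annotation conventions for syntactic dependencies across speakers (such as collaborative coconstructions, question-answer pairs, and backchannels), which makes it hard to model the cross-turn dependencies that arise in natural spoken interaction. Method: Two annotation representations are proposed: (1) a speaker-based representation that follows turn segmentation, and (2) a dependency-based representation that allows dependencies across turns; new principles are added for distinguishing reformulations from repairs and for promoting elements in unfinished phrases. Result: Cross-turn dependency annotation guidelines for spoken treebanks, which extend the applicability of Universal Dependencies to dialogue and give treebank builders a workable standard. Conclusion: Cross-turn dependency annotation must account for both discourse structure and syntactic structure; the two representations and the new annotation principles reflect the syntax of spoken interaction more faithfully and support research and resource building for spoken dependency grammar. Abstract: The paper proposes annotation guidelines for syntactic dependencies that span across speaker turns - including collaborative coconstructions proper, wh-question answers, and backchannels - in spoken language treebanks within the Universal Dependencies framework. Two representations are proposed: a speaker-based representation following the segmentation into speech turns, and a dependency-based representation with dependencies across speech turns. New propositions are also put forward to distinguish between reformulations and repairs, and to promote elements in unfinished phrases.
[64] Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights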
Eneko Valero,Maria Ribalta i Albado,Oscar Sainz,Naiara Perez,German Rigau
Main category: cs.CL
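TL;DR: This paper explores transferring language knowledge into an instruction-tuned large language model (LLM) through model merging, to improve instruction following in low-resource languages without language-specific instruction data or repeated fine-tuning; experiments on four Iberian languages support the method's effectiveness and efficiency.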
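The entry does not say which merging rule is used, so the following is only one possible reading: a task-arithmetic-style merge that adds the language-specific weight delta (language-adapted base minus original base) to the instruction-tuned weights. The function name `merge_language_delta`, the scaling factor `alpha`, and the toy weight dictionaries are all illustrative assumptions.

```python
def merge_language_delta(base, language_base, instruct, alpha=1.0):
    """Add the language delta (language_base - base) to instruct weights.

    Hypothetical sketch: each argument maps a parameter name to a flat
    list of floats, and all three models are assumed to share the same
    architecture and parameter names.
    """
    merged = {}
    for name, instruct_weights in instruct.items():
        delta = [
            lang - orig
            for lang, orig in zip(language_base[name], base[name])
        ]
        merged[name] = [
            weight + alpha * change
            for weight, change in zip(instruct_weights, delta)
        ]
    return merged


# Toy two-parameter "models" (made-up numbers).
base = {"layer.weight": [1.0, 2.0]}
language_base = {"layer.weight": [1.5, 2.0]}   # base after language adaptation
instruct = {"layer.weight": [1.0, 3.0]}        # base after instruction tuning

merged = merge_language_delta(base, language_base, instruct)
print(merged)
```

With `alpha=0.0` the result equals the instruction-tuned model, and larger values move it further toward the language-adapted weights, so `alpha` acts as the knob between the two behaviors in this sketch.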
Details
Motivation: Large language models (LLMs) are heavily centered on English and perform poorly in low-resource languages; existing adaptation methods (such as continual pre-training and instruction fine-tuning) need substantial compute and high-quality instruction data, which low-resource language communities often cannot access. Method: Model merging is used to combine an instruction-tuned general LLM with a language-specific base model, giving it instruction-following ability in the new language without language-specific instruction data or repeated fine-tuning. Result: Experiments on four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families show that merging enables effective instruction following in new languages and supports multilingual capability by merging several language-specific models. Conclusion: Model merging is a viable and efficient alternative for low-resource language adaptation, staying competitive while greatly reducing computational cost. Abstract: Large Language Models (LLMs) remain heavily centered on English, with limited performance in low-resource languages. Existing adaptation approaches, such as continual pre-training, demand significant computational resources. In the case of instructed models, high-quality instruction data is also required, both of which are often inaccessible for low-resource language communities. Under these constraints, model merging offers a lightweight alternative, but its potential in low-resource contexts has not been systematically explored. In this work, we explore whether it is possible to transfer language knowledge to an instruction-tuned LLM by merging it with a language-specific base model, thereby eliminating the need for language-specific instructions and repeated fine-tuning processes whenever stronger instructed variants become available. Through experiments covering four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families, we show that merging enables effective instruction following behavior in new languages and even supports multilingual capability through the combination of multiple language-specific models. Our results indicate that model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.
[65] The Necessity of Setting Temperature in LLM-as-a-Judge
Lujun Li,Lama Sleem,Yangjie Xu,Yewei Song,Aolin Jia,Jerome Francois,Radu State
Main category: cs.CL
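TL;DR: This paper systematically studies how the temperature parameter affects LLM-as-a-Judge evaluation performance, finds the effect highly task-dependent, and uses a causal inference framework to examine the direct causal effect of temperature on judging behavior, offering engineering guidance for designing LLM evaluation pipelines.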
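For readers less familiar with the parameter under study, the sketch below shows the standard temperature-scaled softmax that turns a model's logits into a sampling distribution. It is a generic illustration rather than the paper's experimental code, and the three logits (imagined as scores for verdicts A, B, and C) are made-up values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature`.

    Lower temperatures sharpen the distribution toward the top logit;
    higher temperatures flatten it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


verdict_logits = [2.0, 1.0, 0.0]  # hypothetical scores for verdicts A, B, C

for temperature in (0.1, 1.0):
    probabilities = softmax_with_temperature(verdict_logits, temperature)
    print(temperature, [round(p, 4) for p in probabilities])
```

At 0.1 almost all probability mass sits on the top-scoring verdict, so repeated judgments are nearly deterministic; at 1.0 the lower-scoring verdicts keep a substantial share, so repeated judgments of the same item can disagree.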
Details
Motivation: Current LLM-as-a-Judge practice usually fixes the temperature by convention (for example 0.1 or 1.0), but the effect of temperature on LLM performance is task-dependent and non-monotonic, and its specific effect on judging quality is unclear. Method: A series of controlled experiments systematically examines the relationship between temperature and judge performance, and a causal inference framework is introduced into the statistical analysis to identify the direct causal effect of temperature on judging behavior. Result: Temperature significantly affects LLM judge performance, but the effect is not monotonic and depends strongly on the evaluation task; lower temperature does not always give better results; the causal analysis confirms a quantifiable direct effect of temperature on judging behavior. Conclusion: Temperature is a key but underrated hyperparameter in LLM-as-a-Judge and should be tuned to the task; the study provides a theoretical basis and practical guidance for building more robust and interpretable LLM evaluation pipelines. Abstract: LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly employ fixed temperature configurations during the evaluation process (values of 0.1 and 1.0 being the most prevalent choices), a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance exhibits non-trivial sensitivity to temperature settings, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship between temperature and judge performance through a series of controlled experiments, and further adopt a causal inference framework within our empirical statistical analysis to rigorously examine the direct causal effect of temperature on judge behavior, offering actionable engineering insights for the design of LLM-centric evaluation pipelines.
[66] Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
He Du,Qiming Ge,Jiakai Hu,Aijun Yang,Zheng Cai,Zixian Huang,Sheng Yuan,Qinxiu Cheng,Xinchen Xie,Yicheng Chen,Yining Li,Jiaxing Xie,Huanan Dong,Yaguang Wu,Xiangjun Huang,Jian Yang,Hui Wang,Bowen Zhou,Bowen Li,Qipeng Guo,Kai Chen
Main category: cs.CL
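TL;DR: Kernel-Smith is a framework for generating high-performance GPU kernels and operators. It combines a stable, evaluation-driven evolutionary agent with an evolution-oriented post-training recipe, reaches state-of-the-art performance on both the Triton and MACA backends, and has already contributed to production systems such as SGLang and LMDeploy.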
Details
Motivation: 现有大模型在 GPU 内核生成任务中常面临生成不可执行、缺乏可靠性及跨平台适配性差等问题,亟需一种兼顾正确性、性能与可迁移性的端到端优化框架。 Method: 提出双轨协同架构:(1)进化智能体维护可执行候选程序种群,基于编译、正确性与加速比反馈进行迭代优化,并为 Triton 和 MACA 构建专用评估服务;(2)将长程进化轨迹转化为步级监督与强化学习信号,仅保留保正确性且高增益的修改,使模型专精于进化环内的局部改进而非单次生成。 Result: Kernel-Smith-235B-RL 在 KernelBench(Triton 后端)上平均加速比最优,超越 Gemini-3.0-pro 和 Claude-4.6-opus;Kernel-Smith-MACA-30B 在 MACA 后端亦优于 DeepSeek-V3.2-think 和 Qwen3-235B-2507-think;相关技术已落地 SGLang 与 LMDeploy。 Conclusion: Kernel-Smith 验证了评估驱动进化 + 步级 RL 微调范式在 LLM 驱动底层系统优化中的有效性,具备跨硬件平台泛化能力与实际工程落地价值。 Abstract: We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. 
Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
[67] Not All Subjectivity Is the Same! Defining Desiderata for the Evaluation of Subjectivity in NLP
Urja Khurana,Michiel van der Meer,Enrico Liscio,Antske Fokkens,Pradeep K. Murukannaiah
Main category: cs.CL
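TL;DR: This paper proposes seven evaluation desiderata for subjectivity-sensitive models, stresses that evaluation practice should align with model objectives, and, through an analysis of 60 papers, identifies several gaps in how current research models and evaluates subjectivity.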
Details
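Motivation: NLP models increasingly aim for output diversity that reflects different subjective perspectives, particularly the voices of marginalized groups, but evaluation methods have not kept pace with this goal, and a systematic, user-centric framework for evaluating subjectivity is missing. Method: Taking a top-down approach grounded in how subjectivity is represented in NLP data and models, the paper proposes seven evaluation desiderata and scans the experimental setups of 60 papers to identify key gaps in current subjectivity modeling and evaluation. Result: Several overlooked issues are found, including an insufficient distinction between ambiguous and polyphonic input, little attention to whether subjectivity is effectively conveyed to the user, and a lack of interplay between desiderata. Conclusion: Evaluation needs to fit the nature of subjectivity more closely and be more user-aware, in order to support the responsible development and use of subjectivity-sensitive models. Abstract: Subjective judgments are part of several NLP datasets and recent work is increasingly prioritizing models whose outputs reflect this diversity of perspectives. Such responses allow us to shed light on minority voices, which are frequently marginalized or obscured by dominant perspectives. It remains a question whether our evaluation practices align with these models' objectives. This position paper proposes seven evaluation desiderata for subjectivity-sensitive models, rooted in how subjectivity is represented in NLP data and models. The desiderata are constructed in a top-down approach, keeping in mind the user-centric impact of such models. We scan the experimental setup of 60 papers and show that various aspects of subjectivity are still understudied: the distinction between ambiguous and polyphonic input, whether subjectivity is effectively expressed to the user, and a lack of interplay between different desiderata, amongst other gaps.
[68] Tailoring AI-Driven Reading Scaffolds to the Distinct Needs of Neurodiverse Learners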
Soufiane Jhilal,Eleonora Pasqua,Caterina Marchesi,Riccardo Corradi,Martina Galletti
Main category: cs.CL
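TL;DR: This study examines how structural and semantic reading scaffolds affect comprehension and reading experience for neurodiverse learners. Effects vary across individuals, which points to a need for personalized, adjustable scaffolding and yields design implications for human-AI co-regulation.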
Details
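Motivation: Neurodiverse learners often need reading support, but overly rich scaffolds can increase the load on attention and working memory and harm comprehension; existing work lacks a fine-grained examination of scaffold types and how they should be adapted in inclusive teaching. Method: Grounded in the Construction-Integration model and contingent scaffolding theory, the authors ran a within-subject experiment with an adapted reading interface and 14 primary-school learners with special educational needs, comparing four text presentations (unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels) and assessing them with standardized comprehension questions, child and therapist experience ratings, and open-ended feedback. Result: Responses were highly heterogeneous: some learners benefited from segmentation and pictograms, while others showed increased coordination costs when visual scaffolds were introduced; experience ratings differed little across modalities, though some apparent effects on perceived ease of understanding were linked to clinical complexity; open-ended feedback frequently asked for simpler wording and more visual support. Conclusion: No reading scaffold is universally optimal, so scaffolds should be adjusted to individual needs; the study supports calibrated scaffolding and offers concrete design directions for human-AI co-regulation in supervised inclusive reading contexts. Abstract: Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses, as some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. Open-ended feedback from the learners frequently requested simpler wording and additional visual supports.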
These findings suggest that no single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding and provide design implications for human-AI co-regulation in supervised inclusive reading contexts.
[69] Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
Bin Zhu,Qianghuai Jia,Tian Lan,Junyang Ren,Feng Gu,Feihu Jiang,Longyue Wang,Zhao Xu,Weihua Luo
Main category: cs.CL
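TL;DR: This paper presents Marco DeepResearch, a verification-centric deep research agent framework that introduces explicit verification at three levels (QA data synthesis, trajectory construction, and test-time scaling), improving reliability and performance on long-horizon, complex tasks.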
Details
Motivation: Existing deep research agents lack explicit verification mechanisms, so on long-horizon tasks errors accumulate across QA data synthesis, trajectory construction, and test-time scaling and degrade overall performance. Method: A three-level verification-driven design: (1) verification is added to graph-based and agent-based QA data synthesis so that question difficulty is controlled and answers are unique and correct; (2) a verification-driven trajectory synthesis method injects explicit verification patterns into training trajectories; (3) at inference time, Marco DeepResearch itself serves as the verifier for test-time scaling. Result: The agent significantly outperforms 8B-scale deep research agents on challenging benchmarks such as BrowseComp and BrowseComp-ZH; under a budget of 600 tool calls, it approaches or surpasses several 30B-scale agents (e.g., Tongyi DeepResearch-30B). Conclusion: Verification is key to the robustness and performance of deep research agents, and systematic verification design lets Marco DeepResearch reach strong performance at a small model scale. Abstract: Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: We introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; (2) Trajectory Construction: We design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) Test-time scaling: We use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH.
Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B.
[70] LombardoGraphia: Automatic Classification of Lombard Orthography Variants
Edoardo Signoroni,Pavel Rychlý
Main category: cs.CL
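TL;DR: This paper introduces the LombardoGraphia corpus and models for automatic Lombard orthography classification, addressing the lack of a unified orthographic standard for Lombard.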
Details
Motivation: Lombard, an under-resourced language variety, lacks a unified orthographic standard, and its multiple coexisting orthographic systems complicate NLP resource development and model training. Method: The authors build LombardoGraphia, a corpus of 11,186 Wikipedia samples tagged with 9 orthographic variants, and train 24 traditional and neural classifiers with different features and encoding levels. Result: The best models reach 96.06% overall accuracy and 85.78% average class accuracy, but performance on minority classes is limited by data imbalance. Conclusion: The work provides key infrastructure for building variety-aware NLP resources for Lombard. Abstract: Lombard, an underresourced language variety spoken by approximately 3.8 million people in Northern Italy and Southern Switzerland, lacks a unified orthographic standard. Multiple orthographic systems exist, creating challenges for NLP resource development and model training. This paper presents the first study of automatic Lombard orthography classification and LombardoGraphia, a curated corpus of 11,186 Lombard Wikipedia samples tagged across 9 orthographic variants, and models for automatic orthography classification. We curate the dataset, processing and filtering raw Wikipedia content to ensure text suitable for orthographic analysis. We train 24 traditional and neural classification models with various features and encoding levels. Our best models achieve 96.06% overall accuracy and 85.78% average class accuracy, though performance on minority classes remains challenging due to data imbalance. Our work provides crucial infrastructure for building variety-aware NLP resources for Lombard.
[71] Structural-Ambiguity-Aware Translation from Natural Language to Signal Temporal Logic
Kosei Fushimi,Kazunobu Serizawa,Junya Ikemoto,Kazumune Hashimoto
Main category: cs.CL
TL;DR: 本文提出了一种保留歧义的自然语言到信号时序逻辑(STL)的翻译方法,通过CCG三阶段流水线生成带置信度得分的多个STL候选公式,以显式表达自然语言指令的多种可能形式化解释。
Details
Motivation: Natural language is structurally ambiguous, which makes direct one-to-one translation into STL unreliable; non-expert users also find it hard to write STL formulas directly and need a friendlier interface. Method: A three-stage pipeline based on Combinatory Categorial Grammar (CCG): ambiguity-preserving n-best parsing, STL-oriented template-based semantic composition, and canonicalization with score aggregation. Result: The method outputs a deduplicated set of STL candidates, each with a plausibility score; it yields multiple candidates for genuinely ambiguous inputs and collapses unambiguous or equivalent derivations to a single formula. Conclusion: Unlike conventional one-best NL-to-logic translation, the method explicitly models and preserves attachment and scope ambiguity, which supports more reliable and interpretable formal specification in human-machine collaboration. Abstract: Signal Temporal Logic (STL) is widely used to specify timed and safety-critical tasks for cyber-physical systems, but writing STL formulas directly is difficult for non-expert users. Natural language (NL) provides a convenient interface, yet its inherent structural ambiguity makes one-to-one translation into STL unreliable. In this paper, we propose an ambiguity-preserving method for translating NL task descriptions into STL candidate formulas. The key idea is to retain multiple plausible syntactic analyses instead of forcing a single interpretation at the parsing stage. To this end, we develop a three-stage pipeline based on Combinatory Categorial Grammar (CCG): ambiguity-preserving n-best parsing, STL-oriented template-based semantic composition, and canonicalization with score aggregation. The proposed method outputs a deduplicated set of STL candidates with plausibility scores, thereby explicitly representing multiple possible formal interpretations of an ambiguous instruction. In contrast to existing one-best NL-to-logic translation methods, the proposed approach is designed to preserve attachment and scope ambiguity. Case studies on representative task descriptions demonstrate that the method generates multiple STL candidates for genuinely ambiguous inputs while collapsing unambiguous or canonically equivalent derivations to a single STL formula.
[72] Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
Masnun Nuha Chowdhury,Nusrat Jahan Beg,Umme Hunny Khan,Syed Rifat Raiyan,Md Kamrul Hasan,Hasan Mahmud
Main category: cs.CL
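TL;DR: This paper proposes PROClaim, a framework that uses courtroom-style multi-agent debate and Progressive Retrieval-Augmented Generation (P-RAG) to improve the reliability and accuracy of large language models on high-stakes claim verification.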
Details
Motivation: Large language models remain unreliable for high-stakes claim verification because of hallucinations and shallow reasoning; existing RAG and multi-agent debate methods are limited by one-pass retrieval and unstructured debate dynamics. Method: PROClaim is a courtroom-style multi-agent framework with specialized roles such as Plaintiff, Defense, and Judge; Progressive RAG (P-RAG) dynamically expands and refines the evidence pool; evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation are layered on top. Result: In zero-shot evaluation on the Check-COVID benchmark, PROClaim reaches 81.7% accuracy, 10.0 percentage points above standard multi-agent debate, with P-RAG contributing +7.5 percentage points. Conclusion: Structured deliberation and model heterogeneity effectively mitigate systematic biases and provide a solid foundation for reliable claim verification. Abstract: Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
[73] TIEG-Youpu Solution for NeurIPS 2022 WikiKG90Mv2-LSC
Feng Nie,Zhixiu Ye,Sifa Xie,Shuang Wu,Xin Yuan,Liang Yao,Jiazhen Peng,Xu Cheng
Main category: cs.CL
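TL;DR: This paper proposes a retrieve-then-re-rank framework for the large-scale knowledge graph WikiKG90Mv2, consisting of a priority infilling retrieval model and an ensemble re-ranking model with neighbor-enhanced representations, which raises validation MRR from 0.2342 to 0.2839.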
Details
Motivation: Existing knowledge graph embedding methods struggle to balance efficiency and accuracy on very large knowledge graphs such as WikiKG90Mv2, which has more than 90 million entities. Method: A two-stage retrieve-then-re-rank pipeline: (1) a priority infilling retrieval model that accounts for both structural and semantic similarity to obtain candidates; (2) an ensemble re-ranking model with neighbor-enhanced representations that ranks the retrieved candidates. Result: On WikiKG90Mv2, validation MRR improves from 0.2342 to 0.2839, outperforming the baseline methods. Conclusion: The framework balances efficiency and accuracy for link prediction on large-scale knowledge graphs and supports the retrieve-then-re-rank paradigm at very large scale. Abstract: WikiKG90Mv2 in NeurIPS 2022 is a large encyclopedic knowledge graph. Embedding knowledge graphs into continuous vector spaces is important for many practical applications, such as knowledge acquisition, question answering, and recommendation systems. Compared to existing knowledge graphs, WikiKG90Mv2 is a large-scale knowledge graph composed of more than 90 million entities. Both efficiency and accuracy should be considered when building graph embedding models for knowledge graphs at scale. To this end, we follow the retrieve-then-re-rank pipeline, and make novel modifications in both the retrieval and re-ranking stages. Specifically, we propose a priority infilling retrieval model to obtain candidates that are structurally and semantically similar. Then we propose an ensemble-based re-ranking model with neighbor-enhanced representations to produce final link prediction results among retrieved candidates. Experimental results show that our proposed method outperforms existing baseline methods and improves MRR on the validation set from 0.2342 to 0.2839.
[74] EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces
Léane Jourdan,Julien Aubert-Béduchaud,Yannis Chupin,Marah Baccari,Florian Boudin
Main category: cs.CL
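TL;DR: This paper introduces EarlySciRev, a dataset built by automatically aligning commented-out text in arXiv LaTeX source files with the final text, yielding 578k genuine early-stage scientific revision pairs, along with a human-annotated benchmark for revision detection.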
Details
Motivation: Publicly available resources on scientific papers mostly expose only final or near-final versions and lack early revision traces, which limits empirical study of revision behaviour and the evaluation of large language models on this task. Method: Exploiting the fact that commented-out LaTeX text often preserves the authors' discarded or alternative formulations, the pipeline aligns commented segments with nearby final text to produce candidate revision pairs and then filters them with an LLM to retain genuine revisions; a human-annotated revision detection benchmark is also built. Result: From 1.28M candidate pairs, the pipeline retains 578k validated revision pairs, which form the EarlySciRev dataset, released together with the human-annotated benchmark. Conclusion: EarlySciRev fills the gap in early-stage scientific revision data and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing. Abstract: Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
[75] GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum
Shuwen Xu,Yao Xu,Jiaxiang Liu,Chenhao Yuan,Wenshuo Peng,Jun Zhao,Kang Liu
Main category: cs.CL
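TL;DR: This paper proposes GraphWalker, a framework that uses automated trajectory synthesis and stage-wise fine-tuning to address training data scarcity and weak reasoning generalization in knowledge graph question answering, reaching state-of-the-art performance on multiple benchmarks.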
Details
Motivation: Existing prompting-based methods and training pipelines restrict the agent's autonomous navigation and the flexibility of its reasoning paths, leading to scarce training data and limited reasoning generalization. Method: GraphWalker uses two-stage supervised fine-tuning (SFT): in the first stage, the agent is trained on structurally diverse trajectories synthesized from constrained random walks, which establishes a broad exploration prior over the knowledge graph; in the second stage, it is fine-tuned on a small set of expert trajectories to develop reflection and error recovery. Result: GraphWalker reaches state-of-the-art performance on CWQ and WebQSP and generalizes better to out-of-distribution reasoning paths on GrailQA and the authors' GraphWalkerBench. Conclusion: The stage-wise SFT paradigm raises the performance ceiling of a subsequent lightweight reinforcement learning stage, supporting the combination of autonomous exploration with expert guidance. Abstract: Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes GraphWalker, a novel agentic KGQA framework that addresses these challenges through Automated Trajectory Synthesis and Stage-wise Fine-tuning. GraphWalker adopts a two-stage SFT training paradigm: first, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at https://github.com/XuShuwenn/GraphWalker
[76] Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT
Younes Javanmard,Tanmoy Pandit,Masoud Mardani
Main category: cs.CL
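TL;DR: This paper uses Matrix Product Operator (MPO) decomposition to compress Transformer models, substantially reducing parameter count while keeping accuracy high: on PicoGPT it reaches up to 13x compression per transformer block and retains 97.7% of baseline accuracy at chi = 16.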
Details
Motivation: Transformer parameter counts grow quadratically with hidden dimension, which makes deployment on resource-constrained devices difficult and calls for an efficient, principled compression method. Method: Every nn.Linear layer in PicoGPT is replaced with an MPOLinear module parameterised as an MPO chain of low-rank cores; cores are initialised by TT-SVD or randomly and trained with standard PyTorch autograd; balanced factorisation schemes are derived for the five distinct weight shapes, and bond dimensions chi in {4, 8, 16, 32} are evaluated on Tiny Shakespeare. Result: At chi = 4, a single transformer block is compressed by up to 13x; at chi = 16, parameters fall from about 1.02M to about 192k while token accuracy stays at 51.6% (baseline 52.8%); at chi = 8, accuracy per parameter exceeds the baseline by 2.7x; three-site factorisations have lower reconstruction error than two-site ones at the same chi. Conclusion: MPO parameterisation is a practical, theoretically grounded alternative to low-rank methods and unstructured pruning for Transformer compression. Abstract: Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension chi. We replace every nn.Linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or from random initialisation, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions chi in {4, 8, 16, 32} on Tiny Shakespeare. MPO compression achieves up to 13x compression per transformer block at chi = 4. At chi = 16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error follows the expected trend and is lower for three-site than two-site factorisations at the same bond dimension. The chi = 8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric.
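The two-site case of the factorisation described in this abstract can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' MPOLinear module: the function names, the 16x16 toy weight, and the (4, 4) x (4, 4) index split are hypothetical, and a single truncated SVD is used because TT-SVD reduces to one SVD when there are only two cores.

```python
import numpy as np

def mpo_two_site(W, o1, o2, i1, i2, chi):
    """Split W of shape (o1*o2, i1*i2) into two MPO cores with bond dimension <= chi."""
    # Group (o1, i1) on the left site and (o2, i2) on the right site.
    T = W.reshape(o1, o2, i1, i2).transpose(0, 2, 1, 3).reshape(o1 * i1, o2 * i2)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    r = min(chi, S.size)                       # truncate to the bond dimension
    A = (U[:, :r] * S[:r]).reshape(o1, i1, r)  # left core, singular values absorbed
    B = Vt[:r].reshape(r, o2, i2)              # right core
    return A, B

def mpo_to_dense(A, B):
    """Contract the two cores back into a dense (o1*o2, i1*i2) matrix."""
    o1, i1, _ = A.shape
    _, o2, i2 = B.shape
    return np.einsum("aik,kbj->abij", A, B).reshape(o1 * o2, i1 * i2)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))              # toy dense weight (hypothetical size)
for chi in (2, 4, 8, 16):
    A, B = mpo_two_site(W, 4, 4, 4, 4, chi)
    rel_err = np.linalg.norm(W - mpo_to_dense(A, B)) / np.linalg.norm(W)
    # Columns: bond dimension, parameters in the two cores, relative error.
    print(chi, A.size + B.size, float(rel_err))
```

For this toy shape the two cores hold 32 * chi parameters against 256 in the dense matrix, so only chi < 8 compresses; the reconstruction error shrinks as chi grows and vanishes at full rank.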
These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
[77] Training data generation for context-dependent rubric-based short answer grading
Pavel Šindelář,Dávid Slivka,Christopher Bouma,Filip Prášil,Ondřej Bojar
Main category: cs.CL
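TL;DR: This paper explores ways to generate large-scale training datasets from a small confidential dataset, supporting the development of automatic student answer grading systems while preserving data confidentiality.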
Details
Motivation: Grading student answers in the PISA test is complicated by language differences and annotator bias, which motivates reliable automatic grading methods; these methods typically need large amounts of domain-specific data for training or parameter tuning, yet real data is often confidential. Method: Starting from a small confidential reference dataset, the authors design and apply a set of simple derived text formats (such as rewriting and structural transformations) to generate several large surrogate datasets without directly using the original sensitive content. Result: Three surrogate datasets were created whose surface features are closer to the reference data than purely prompt-generated data; early experiments suggest one of the methods might improve model training. Conclusion: Derived text formats are a feasible confidentiality-preserving strategy for scaling up training data from limited confidential data, with a potential benefit for automatic grading models. Abstract: Every 4 years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to compare methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using these methods, we successfully created three surrogate datasets that are, at the very least, superficially more similar to the reference dataset than purely the result of prompt-based generation. Early experiments suggest one of these approaches might also lead to improved model training.
[78] EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models
Shuang Zhou,Kai Yu,Zaifu Zhan,Huixue Zhou,Min Zeng,Feng Xie,Zhiyi Sha,Rui Zhang
Main category: cs.CL
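TL;DR: This study develops EpiScreen, a low-cost early epilepsy screening tool based on clinical text from electronic health records; fine-tuned large language models reach AUCs of up to 0.980, and the tool improves diagnostic performance in clinician-AI collaboration.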
Details
Motivation: Epilepsy and psychogenic non-epileptic seizures present similarly and are easily misdiagnosed, while the gold standard, video-electroencephalography, is costly and hard to access, so a low-cost, easily deployed early screening method is needed. Method: Large language models are fine-tuned on labeled clinical notes from MIMIC-IV and a private University of Minnesota cohort to build EpiScreen, whose value as a diagnostic aid is then evaluated in a clinician-AI collaboration setting. Result: EpiScreen reaches an AUC of up to 0.875 on MIMIC-IV and 0.980 on the private cohort; neurologists assisted by the tool outperform unaided experts by up to 10.9%. Conclusion: EpiScreen is an effective, low-cost early screening tool that may shorten diagnostic delays and reduce unnecessary interventions, particularly in resource-limited regions. Abstract: Epilepsy and psychogenic non-epileptic seizures often present with similar seizure-like manifestations but require fundamentally different management strategies. Misdiagnosis is common and can lead to prolonged diagnostic delays, unnecessary treatments, and substantial patient morbidity. Although prolonged video-electroencephalography is the diagnostic gold standard, its high cost and limited accessibility hinder timely diagnosis. Here, we developed a low-cost, effective approach, EpiScreen, for early epilepsy detection by utilizing routinely collected clinical notes from electronic health records. Through fine-tuning large language models on labeled notes, EpiScreen achieved an AUC of up to 0.875 on the MIMIC-IV dataset and 0.980 on a private cohort from the University of Minnesota. In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%. Overall, this study demonstrates that EpiScreen supports early epilepsy detection, facilitating timely and cost-effective screening that may reduce diagnostic delays and avoid unnecessary interventions, particularly in resource-limited regions.
[79] Adaptive Block-Scaled Data Types
Jack Cook,Hyemin S. Lee,Kathryn Le,Junxian Guo,Giovanni Traverso,Anantha P. Chandrakasan,Song Han
Main category: cs.CL
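TL;DR: This paper proposes IF4, an adaptive 4-bit data type that selects between FP4 and INT4 for each group of values and records the choice in the scale factor's sign bit, which NVFP4 leaves unused, to reduce quantization error. IF4 outperforms existing 4-bit formats in quantized-training loss and post-training quantization accuracy, and the authors also design an efficient IF4 MAC hardware unit.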
Details
Motivation: NVFP4 is widely used for 4-bit quantization of large models, but it incurs large quantization error on near-maximal values in each group of 16, which limits performance. Method: IF4, an adaptive block-scaled data type: for each group of 16 values, FP4 or INT4 is selected according to the input distribution and scaled by an E4M3 scale factor; the choice is encoded in the scale factor's sign bit, which is unused in NVFP4; the idea extends to IF3 and IF6; an IF4 MAC hardware unit is also designed. Result: IF4 outperforms existing 4-bit block-scaled formats on language models, with lower loss in quantized training and higher accuracy on many tasks in post-training quantization; the IF4 MAC unit evaluation indicates that the format can be implemented efficiently in hardware. Conclusion: By adapting the representation per group and reusing a redundant sign bit, IF4 mitigates NVFP4's error-distribution weakness while staying hardware-friendly. Abstract: NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor's sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.
cs.CV [Back]
[80] An Annotation-to-Detection Framework for Autonomous and Robust Vine Trunk Localization in the Field by Mobile Agricultural Robots
Dimitrios Chatziparaschis,Elia Scudiero,Brent Sams,Konstantinos Karydis
Main category: cs.CV
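TL;DR: This paper proposes a multi-modal object detection framework for agricultural settings that uses cross-modal annotation transfer and early-stage sensor fusion to achieve robust real-time vine trunk detection and localization from limited, partially labeled data.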
Details
Motivation: Agricultural fields are dynamic and heterogeneous, which challenges object detection and localization for autonomous mobile robots in unseen, unstructured environments; there is also a need for real-time detection systems that do not depend on large manually labeled real-world datasets. Method: A complete annotation-to-detection framework with cross-modal annotation transfer, an early-stage sensor fusion pipeline, and a multi-stage detection architecture, combined with a customized multi-modal LiDAR Odometry and Mapping (LOAM) algorithm and a tree association module. Result: Validated in novel vineyard settings with varied lighting and crop density, the system identifies over 70% of trunks in a single traversal with a mean distance error below 0.37 m. Conclusion: Through multi-modal, incremental-stage annotation and training, the framework achieves robust detection from very limited initial annotations and shows potential for real-world, near-ground agricultural use. Abstract: The dynamic and heterogeneous nature of agricultural fields presents significant challenges for object detection and localization, particularly for autonomous mobile robots that are tasked with surveying previously unseen unstructured environments. Concurrently, there is a growing need for real-time detection systems that do not depend on large-scale manually labeled real-world datasets. In this work, we introduce a comprehensive annotation-to-detection framework designed to train a robust multi-modal detector using limited and partially labeled training data. The proposed methodology incorporates cross-modal annotation transfer and an early-stage sensor fusion pipeline, which, in conjunction with a multi-stage detection architecture, effectively trains and enhances the system's multi-modal detection capabilities. The effectiveness of the framework was demonstrated through vine trunk detection in novel vineyard settings that featured diverse lighting conditions and varying crop densities to validate performance. When integrated with a customized multi-modal LiDAR Odometry and Mapping (LOAM) algorithm and a tree association module, the system demonstrated high-performance trunk localization, successfully identifying over 70% of trees in a single traversal with a mean distance error of less than 0.37 m.
The results reveal that by leveraging multi-modal, incremental-stage annotation and training, the proposed framework achieves robust detection performance regardless of limited starting annotations, showcasing its potential for real-world and near-ground agricultural applications.
[81] A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data
Aram Ansary Ogholbake,Hannah Choi,Spencer Brandenburg,Alyssa Antuna,Zahraa Al-Sharshahi,Makayla Cox,Haseeb Ahmed,Jacqueline Frank,Nathan Millson,Luke Bauerle,Jessica Lee,David Dornbos,Qiang Cheng
Main category: cs.CV
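TL;DR: AttentionMixer is a multimodal brain edema detection framework that fuses head CT imaging with clinical metadata: ViT-AE++ encodes the images, cross-attention fuses in the clinical variables, and an MLP-Mixer refines the representation. It outperforms unimodal and other multimodal methods on accuracy, F1, and AUC.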
Details
Motivation: Existing methods often ignore clinical metadata or simply concatenate it with CT features, lacking structured, interpretable fusion of the complementary information across heterogeneous modalities; clinical data are also often incomplete, so robustness matters. Method: HCT volumes are encoded with a self-supervised ViT-AE++; clinical metadata are mapped into the same feature space and serve as keys and values in cross-attention, with HCT features as queries; a learnable embedding handles missing metadata; a lightweight MLP-Mixer performs global modeling before classification. Result: In five-fold cross-validation, the model reaches accuracy 87.32%, precision 92.10%, F1-score 85.37%, and AUC 94.14%, outperforming unimodal and prior multimodal baselines; ablations confirm the benefit of cross-attention and the MLP-Mixer; permutation analysis highlights key clinical variables. Conclusion: AttentionMixer shows that structured, interpretable multimodal fusion can improve clinical brain edema detection while balancing accuracy, robustness, and interpretability. Abstract: We propose AttentionMixer, a unified deep learning framework for multimodal detection of brain edema that combines structural head CT (HCT) with routine clinical metadata. While HCT provides rich spatial information, clinical variables such as age, laboratory values, and scan timing capture complementary context that might be ignored or naively concatenated. AttentionMixer is designed to fuse these heterogeneous sources in a principled and efficient manner. HCT volumes are first encoded using a self-supervised Vision Transformer Autoencoder (ViT-AE++), without requiring large labeled datasets. Clinical metadata are mapped into the same feature space and used as keys and values in a cross-attention module, where the HCT-derived feature vector serves as the query. This cross-attention fusion allows the network to dynamically modulate imaging features based on patient-specific context and provides an interpretable mechanism for multimodal integration. A lightweight MLP-Mixer then refines the fused representation before final classification, enabling global dependency modeling with substantially reduced parameter overhead. Missing or incomplete metadata are handled via a learnable embedding, promoting robustness to real-world clinical data quality. We evaluate AttentionMixer on a curated brain HCT cohort with expert edema annotations using five-fold cross-validation. Compared with strong HCT-only, metadata-only, and prior multimodal baselines, AttentionMixer achieves superior performance (accuracy 87.32%, precision 92.10%, F1-score 85.37%, AUC 94.14%).
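The cross-attention fusion described in this abstract, with the image feature as the query and metadata tokens as keys and values, can be sketched in NumPy as follows. This is a rough illustration under stated assumptions rather than the authors' implementation: the helper names, the value-times-embedding tokenisation, the dimensions, and the random weights (which stand in for learned parameters) are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def embed_metadata(values, var_embed, missing_embed):
    """Build one token per clinical variable; None marks a missing value."""
    tokens = [missing_embed if v is None else v * var_embed[i]
              for i, v in enumerate(values)]
    return np.stack(tokens)                   # shape (m, d)

def cross_attention(img_feat, meta_tokens, Wq, Wk, Wv):
    """The image feature queries the metadata tokens (keys and values)."""
    Q = img_feat @ Wq                         # (1, d)
    K = meta_tokens @ Wk                      # (m, d)
    V = meta_tokens @ Wv                      # (m, d)
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (1, m), weights sum to 1
    return attn @ V, attn

rng = np.random.default_rng(0)
d, m = 8, 3                                   # feature width, number of variables
var_embed = rng.standard_normal((m, d))       # per-variable embeddings (assumed)
missing_embed = rng.standard_normal(d)        # stand-in for the learnable missing token
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
img_feat = rng.standard_normal((1, d))        # stand-in for the encoded HCT volume
meta = embed_metadata([0.7, None, -1.2], var_embed, missing_embed)
fused, attn = cross_attention(img_feat, meta, Wq, Wk, Wv)
```

Inspecting `attn` shows how much weight each clinical variable receives for a given scan, which is one way such a module can be read as interpretable.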
Ablation studies confirm the benefit of both cross-attention and MLP-Mixer refinement, and permutation-based metadata importance analysis highlights clinically meaningful variables driving predictions. These results demonstrate that structured, interpretable multimodal fusion can substantially improve edema detection in clinical practice.
[82] The Nonverbal Gap: Toward Affective Computer Vision for Safer and More Equitable Online Dating
Ratna Kandala,Niva Manchanda,Akshata Kishore Moharir
Main category: cs.CV
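TL;DR: This paper proposes a fairness-first computer vision research agenda that aims to close the safety gap created by missing nonverbal cues in online dating, through real-time discomfort detection, engagement asymmetry modeling, consent-aware interaction design, and longitudinal interaction summarization. It pays particular attention to the impact on women and calls for purpose-built datasets, fairness evaluation, and on-device processing.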
Details
Motivation: Online dating platforms strip away key nonverbal cues such as gaze, facial expression, and posture, creating a communication gap with disproportionate safety risks for women; the computer vision community has the relevant tools but has not yet treated dating as a serious research domain. Method: A research agenda organized around four capabilities (real-time discomfort detection, engagement asymmetry modeling, consent-aware interaction design, and longitudinal interaction summarization), grounded in existing CV methods such as action unit detection, gaze estimation, and multimodal affect recognition, and requiring datasets collected under dyadic consent, fairness evaluation disaggregated across demographic dimensions, and on-device processing. Result: A framework for responsible CV research on online dating safety that sets out technical directions, ethical constraints, and a call to action for the community. Conclusion: Online dating safety should become a first-class research domain in computer vision, and the WICV community is well placed to advance the agenda before commercial deployment outpaces ethical deliberation. Abstract: Online dating has become the dominant way romantic relationships begin, yet current platforms strip the nonverbal cues (gaze, facial expression, body posture, response timing) that humans rely on to signal comfort, disinterest, and consent, creating a communication gap with disproportionate safety consequences for women. We argue that this gap represents both a technical opportunity and a moral responsibility for the computer vision community, which has developed the affective tools (facial action unit detection, gaze estimation, engagement modeling, and multimodal affect recognition) needed to begin addressing it, yet has largely ignored the dating domain as a research context. We propose a fairness-first research agenda organized around four capability areas: real-time discomfort detection, engagement asymmetry modeling between partners, consent-aware interaction design, and longitudinal interaction summarization, each grounded in established CV methodology and motivated by the social psychology of romantic communication. We argue that responsible pursuit of this agenda requires purpose-built datasets collected under dyadic consent protocols, fairness evaluation disaggregated across race, gender identity, neurotype, and cultural background, and architectural commitments to on-device processing that prevent affective data from becoming platform surveillance infrastructure.
This vision paper calls on the WICV community, whose members are uniquely positioned to understand both the technical opportunity and the human stakes, to establish online dating safety as a first-class research domain before commercial deployment outpaces ethical deliberation.
[83] Multi-view Graph Convolutional Network with Fully Leveraging Consistency via Granular-ball-based Topology Construction, Feature Enhancement and Interactive Fusion
Chengjie Cui,Taihua Xua,Shuyin Xia,Qinghua Zhang,Yun Cui,Shiping Wang
Main category: cs.CV
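TL;DR: This paper proposes MGCN-FLC, which fully exploits inter-node, inter-feature, and inter-view consistency in multi-view learning through three modules (granular-ball (GB) based topology construction, feature enhancement, and interactive fusion), significantly improving semi-supervised node classification.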
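The abstract for this entry criticizes KNN-based topology construction because the value of k is hand-picked. The sketch below illustrates only that sensitivity, using toy points that are an assumption and not data from the paper; it does not implement the granular-ball construction, and `knn_edges` is a hypothetical helper name.

```python
import math

def knn_edges(points, k):
    """Return undirected edges linking each point to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Two tight groups of three points each (toy data, not from the paper).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

for k in (2, 3):
    edges = knn_edges(points, k)
    # An edge is "cross-group" when exactly one endpoint is in the first group.
    cross = [e for e in edges if (e[0] < 3) != (e[1] < 3)]
    print(f"k={k}: {len(edges)} edges, {len(cross)} cross-group")
```

With k=2 every neighbour stays inside its own group, while k=3 forces each point to link to the other group, which is the kind of spurious connection a fixed k can introduce.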
Details
Motivation: Existing GCN-based multi-view methods fall short in topology construction (they rely on KNN with a manually chosen k), in modeling feature consistency within a single view, and in how they fuse consistency across views, which limits how fully the three types of consistency can be exploited.
Method: Proposes MGCN-FLC: (1) a granular-ball algorithm automatically builds a highly cohesive topology to capture inter-node consistency; (2) a feature enhancement module models inter-feature consistency within each view; (3) an interactive fusion module enables deep cooperation across views to strengthen inter-view consistency.
Result: Experiments on nine datasets show that MGCN-FLC outperforms state-of-the-art methods on semi-supervised node classification.
Conclusion: Jointly modeling inter-node, inter-feature, and inter-view consistency is key to improving multi-view GCNs, and MGCN-FLC provides an effective framework for doing so.
Abstract: The effective utilization of consistency is crucial for multi-view learning. GCNs leverage node connections to propagate information across the graph, facilitating the exploitation of consistency in multi-view data. However, most existing GCN-based multi-view methods suffer from several limitations. First, current approaches predominantly rely on KNN for topology construction, where the manual selection of the k value significantly constrains the effective exploitation of inter-node consistency. Second, the inter-feature consistency within individual views is often overlooked, which adversely affects the quality of the final embedding representations. Moreover, these methods fail to fully utilize inter-view consistency because the fusion of embedded representations from multiple views is often implemented after the intra-view graph convolutional operation. Collectively, these issues limit the model's capacity to fully capture inter-node, inter-feature and inter-view consistency. To address these issues, this paper proposes the multi-view graph convolutional network with fully leveraging consistency via GB-based topology construction, feature enhancement and interactive fusion (MGCN-FLC).
MGCN-FLC fully utilizes three types of consistency through three modules that enhance its learning ability: (1) a topology construction module based on the granular ball algorithm, which clusters nodes into granular balls with high internal similarity to capture inter-node consistency; (2) a feature enhancement module that improves feature representations by capturing inter-feature consistency; and (3) an interactive fusion module that lets each view interact deeply with all other views, thereby obtaining more comprehensive inter-view consistency. Experimental results on nine datasets show that the proposed MGCN-FLC outperforms state-of-the-art semi-supervised node classification methods.
[84] Contextual inference from single objects in Vision-Language models
Martina G. Vilas,Timothy Schaumlöffel,Gemma Roig
Main category: cs.CV
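TL;DR: This paper systematically studies the ability of vision-language models (VLMs) to infer scene context from a single object. VLMs perform above chance at both fine-grained scene categorization and coarse indoor/outdoor judgment, with performance shaped by object properties similar to those that matter for humans; inference at the different levels is partially dissociable and mechanistically relies on background-invariant object representations, and scene identity and superordinate information are encoded differently in the model.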
Details
Motivation: To understand how vision-language models (VLMs) organize the scene context carried by a single object, which bears directly on model robustness and had not previously been examined in depth for VLMs.
Method: Single-object images with masked backgrounds are fed to VLMs to systematically evaluate inference of fine-grained scene category and coarse superordinate class (indoor/outdoor), combining behavioral analysis with mechanistic analysis (e.g., representation stability, token-level information localization).
Result: Single objects support scene and superordinate inference well above chance; performance is modulated by intrinsic object properties in line with patterns in human cognition; predictions at different levels are partially dissociable; object representations that are stable to background removal favor contextual inference; scene identity information is distributed across image tokens throughout the network, whereas superordinate information appears only late or is absent.
Conclusion: Contextual inference in VLMs is organized in a more complex way than accuracy alone reveals, with distinct behavioral patterns and mechanistic signatures, suggesting that these models do not simply mimic humans but have their own representational structure.
Abstract: How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We find that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate inference at the others, and the degree of coupling differs markedly across models. Mechanistically, object representations that remain stable when background context is removed are more predictive of successful contextual inference. Scene and superordinate schemas are grounded in fundamentally different ways: scene identity is encoded in image tokens throughout the network, while superordinate information emerges only late or not at all. Together, these results reveal that the organization of contextual inference in VLMs is more complex than accuracy alone suggests, with behavioral and mechanistic signatures
[85] Distilled Large Language Model-Driven Dynamic Sparse Expert Activation Mechanism
Qinghui Chen,Zekai Zhang,Zaigui Zhang,Kai Zhang,Dagang Li,Wenmin Wang,Jinglin Zhang,Cong Liu
Main category: cs.CV
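TL;DR: This paper proposes DS-MoE, a lightweight, text-guided sparse mixture-of-experts framework for defect detection under high inter-class similarity, large scale variation, and limited compute, which significantly outperforms models such as YOLOv8 and YOLOX.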
Details
Motivation: High inter-class similarity, extreme scale variation, and limited computational budgets hinder reliable visual recognition on real-world data; existing methods rely on rigid fusion mechanisms and heavy annotation pipelines and generalize poorly.
Method: Proposes the Distilled LLM-Driven Sparse Mixture-of-Experts (DS-MoE) framework, which combines text-guided dynamic routing with lightweight multi-scale comprehension: a sparse MoE architecture adaptively activates task-relevant experts according to semantic relevance, and a lightweight MobileSAM encoder enables real-time inference while preserving multi-scale detail.
Result: Experiments on PCB, aluminum foil, and mold defect datasets show that DS-MoE significantly outperforms pure vision models; compared with YOLOv8/YOLOX, mAP@0.5:0.95 improves by +13.9, +1.4, and +2.0 percentage points on the BBMP, aluminum foil, and PCB datasets, respectively, with gains in precision and recall as well.
Conclusion: Through text-semantics-driven dynamic routing of sparse experts and lightweight multi-scale modeling, DS-MoE reduces inter-class ambiguity while balancing efficiency and performance, offering a new paradigm for industrial defect detection in resource-constrained settings.
Abstract: High inter-class similarity, extreme scale variation, and limited computational budgets hinder reliable visual recognition across diverse real-world data. Existing vision-centric and cross-modal approaches often rely on rigid fusion mechanisms and heavy annotation pipelines, leading to sub-optimal generalization. We propose the Distilled Large Language Model (LLM)-Driven Sparse Mixture-of-Experts (DS-MoE) framework, which integrates text-guided dynamic routing and lightweight multi-scale comprehension. The DS-MoE framework dynamically aligns textual semantics with defect-specific visual patterns through a sparse MoE architecture, where task-relevant experts are adaptively activated based on semantic relevance, resolving inter-class ambiguity. A lightweight MobileSAM encoder enables real-time inference while preserving multi-scale defect details. Extensive experiments on PCB, aluminum foil, and mold defect datasets demonstrate that our framework achieves superior performance compared to existing pure vision models. DS-MoE surpasses YOLOv8/YOLOX with gains of +13.9, +1.4, and +2.0 pp mAP@0.5:0.95 on BBMP, aluminum, and PCB, respectively, while also improving precision and recall.
[86] Ordinal Semantic Segmentation Applied to Medical and Odontological Images
Mariana Dória Prata Lima,Gilson Antonio Giraldi,Jaime S. Cardoso
Main category: cs.CV
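TL;DR: This paper studies loss functions that incorporate ordinal relationships among classes into semantic segmentation, examining three families (unimodal, quasi-unimodal, and spatial losses), and reports that on medical images they improve robustness, generalization, and anatomical consistency.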
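To make the unimodality property concrete, here is a minimal sketch of a check that a per-pixel probability vector over ordered classes rises to a single peak and then falls. It is not the EXP_MSE, QUL, or CSSDF loss from this entry; `is_unimodal` is a hypothetical helper, and the probability values are made up for illustration.

```python
def is_unimodal(probs):
    """True if probs rises (weakly) to its maximum and then falls (weakly)."""
    peak = probs.index(max(probs))
    # Every step before the peak must not decrease.
    rising = all(a <= b for a, b in zip(probs[:peak], probs[1:peak + 1]))
    # Every step from the peak onward must not increase.
    falling = all(a >= b for a, b in zip(probs[peak:], probs[peak + 1:]))
    return rising and falling

# Probabilities over five ordered classes (illustrative values only).
consistent = [0.05, 0.15, 0.60, 0.15, 0.05]    # mass concentrated around class 2
inconsistent = [0.40, 0.05, 0.10, 0.40, 0.05]  # two separated peaks

print(is_unimodal(consistent))    # True
print(is_unimodal(inconsistent))  # False
```

A unimodal loss would push predictions toward the first shape; a quasi-unimodal loss, as described for this entry, would tolerate small departures from it.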
Details
Motivation: Modern deep learning methods achieve high accuracy in semantic segmentation but often ignore ordinal relationships among classes, which encode important domain knowledge that aids scene understanding.
Method: Categorizes and examines three families of losses that incorporate ordinal relationships: unimodal losses (constrain the predicted probability distribution to follow the class ordering), quasi-unimodal losses (allow small fluctuations while preserving ordinal coherence), and spatial losses (penalize semantic inconsistency between neighboring pixels); specifically studies the EXP_MSE, QUL, and CSSDF losses.
Result: The losses perform well on medical image semantic segmentation, improving robustness, generalization, and anatomical consistency.
Conclusion: Building ordinal relationships into segmentation losses is an effective way to improve semantic consistency and domain plausibility, especially for medical image analysis tasks with a natural ordinal structure.
Abstract: Semantic segmentation consists of assigning a semantic label to each pixel according to predefined classes. This process facilitates the understanding of object appearance and spatial relationships, playing an important role in the global interpretation of image content. Although modern deep learning approaches achieve high accuracy, they often ignore ordinal relationships among classes, which may encode important domain knowledge for scene interpretation. In this work, loss functions that incorporate ordinal relationships into deep neural networks are investigated to promote greater semantic consistency in semantic segmentation tasks. These loss functions are categorized as unimodal, quasi-unimodal, and spatial. Unimodal losses constrain the predicted probability distribution according to the class ordering, while quasi-unimodal losses relax this constraint by allowing small variations while preserving ordinal coherence. Spatial losses penalize semantic inconsistencies between neighboring pixels, encouraging smoother transitions in the image space. In particular, this study adapts loss functions originally proposed for ordinal classification to ordinal semantic segmentation. Among them, the Expanded Mean Squared Error (EXP_MSE), the Quasi-Unimodal Loss (QUL), and the spatial Contact Surface Loss using Signal Distance Function (CSSDF) are investigated. These approaches have shown promising results in medical imaging, improving robustness, generalization, and anatomical consistency.
[87] Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning
Guangfu Guo,Xiaoqian Lu,Yue Feng,Mingming Sun
Main category: cs.CV
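TL;DR: This paper proposes SSV-CoT, which uses a question-relevant saliency map to guide the model to attend to image regions one by one in order of importance, achieving structured, sequential visual reasoning without region annotations or external tools and significantly improving the visual reasoning of multimodal large models.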
Details
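Motivation: Existing multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven, adaptive visual access; inspired by the selective, sequential nature of human visual attention, the model should capture the spatial distribution of visual importance and use it to guide the order of reasoning.
Method: Proposes Structural Sequential Visual CoT (SSV-CoT): first, a question-relevant saliency map is generated to identify and rank key visual regions, explicitly modeling the spatial distribution of visual importance; second, reasoning follows this discriminative order, forming a curriculum-like semantic progression from primary to secondary cues; training is end-to-end and requires only text chain-of-thought and answer supervision.
Result: Performance improves on multiple visual reasoning benchmarks, validating structured, sequential visual cognition.
Conclusion: By introducing a learnable, question-driven order of visual attention, SSV-CoT strengthens dynamic visual understanding and reasoning in multimodal large models and points toward more human-like vision-language reasoning.
Abstract: Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, in which attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. This method is trained end-to-end, using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.
[88] SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model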
Guifeng Deng,Pan Wang,Jiquan Wang,Shuying Rao,Junyi Xie,Wanjun Guo,Tao Li,Haiteng Jiang
Main category: cs.CV
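TL;DR: SleepVLM is a rule-grounded vision-language model that performs automatic sleep staging from multi-channel polysomnography (PSG) waveform images and generates readable clinical rationales according to AASM criteria, maintaining state-of-the-art performance while markedly improving interpretability and clinical trustworthiness.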
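Agreement for this entry is reported as Cohen's kappa (0.767 and 0.743 in the abstract). As a reminder of what that statistic measures, the sketch below computes kappa from two label sequences using the standard definition (observed agreement corrected for chance agreement). The stage labels are made-up toy data, not from MASS-SS1 or ZUAMHCS.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length label lists."""
    n = len(rater_a)
    # Observed agreement: fraction of epochs where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Undefined (division by zero) if both raters use one identical label throughout.
    return (p_o - p_e) / (1.0 - p_e)

# Toy sleep-stage sequences (illustrative only).
reference = ["W", "N1", "N2", "N2", "N3", "REM", "N2", "W"]
predicted = ["W", "N2", "N2", "N2", "N3", "REM", "N1", "W"]

print(round(cohens_kappa(reference, predicted), 3))  # 0.667
```

Here raw agreement is 6/8 = 0.75 and chance agreement is 0.25, so kappa is lower than raw accuracy, which is why it is the more conservative number to report.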
Details
Motivation: Although automated sleep staging has reached expert-level accuracy, the lack of auditable reasoning hinders its clinical adoption.
Method: Proposes SleepVLM, which combines waveform-perceptual pre-training with AASM rule-grounded supervised fine-tuning, takes multi-channel PSG waveform images as input, and outputs sleep stages together with rule-based rationales.
Result: Cohen's kappa reaches 0.767 on the MASS-SS1 test set and 0.743 on the external ZUAMHCS dataset; expert ratings average above 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
Conclusion: SleepVLM balances performance and interpretability, improving the trustworthiness and auditability of automated sleep staging in clinical use; the expert-annotated MASS-EX dataset is released to support research in interpretable sleep medicine.
Abstract: While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) designed to stage sleep from multi-channel polysomnography (PSG) waveform images while generating clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen's kappa scores of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
[89] Language-Conditioned World Modeling for Visual Navigation
Yifei Dong,Fengyi Wu,Yilong Dai,Lingdong Kong,Guangyu Chen,Xu Zhu,Qiyu Hu,Tianyu Wang,Johnalbert Garnica,Feng Liu,Siyu Huang,Qi Dai,Zhi-Qi Cheng
Main category: cs.CV
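TL;DR: This paper introduces the language-conditioned visual navigation (LCVN) task, builds a large-scale benchmark dataset of 39,016 trajectories and 117,048 instructions, and designs two model families (a diffusion world model with an actor-critic agent, and a unified autoregressive multimodal architecture) that respectively favor temporal coherence and cross-environment generalization, advancing joint modeling of language, perception, and action.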
Details
Motivation: Existing visual navigation tasks often depend on goal images, whereas real-world settings frequently provide only natural language instructions; this requires solving joint language-vision-action grounding without goal images, with language guiding both perception and control.
Method: (1) Builds the LCVN Dataset (39,016 trajectories, 117,048 human-verified instructions); (2) proposes two approaches: (a) LCVN-WM (a diffusion world model) with LCVN-AC (an actor-critic trained in the world model's latent space); (b) LCVN-Uni (an end-to-end autoregressive multimodal model that predicts actions and future observations together).
Result: LCVN-WM+AC produces more temporally coherent rollouts; LCVN-Uni generalizes better to unseen environments; together they show the importance of jointly optimizing language grounding, imagination, and policy learning.
Conclusion: The LCVN task, dataset, and models provide a unified, reproducible benchmark for research on language-conditioned world models and underline the need to advance language understanding, embodied imagination, and control policy learning together.
Abstract: We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models.
The code is available at https://github.com/F1y1113/LCVN.
[90] Steering Sparse Autoencoder Latents to Control Dynamic Head Pruning in Vision Transformers (Student Abstract)
Yousung Lee,Dongsoo Har
Main category: cs.CV
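TL;DR: This paper proposes a framework that combines sparse autoencoders (SAEs) with dynamic pruning, using an SAE to disentangle dense ViT embeddings into interpretable, controllable sparse latents and thereby enabling class-specific control of attention-head pruning that balances efficiency and interpretability.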
Details
Motivation: Existing dynamic head pruning policies for ViTs are hard to interpret and control, so their interpretability and controllability need to improve.
Method: Trains a sparse autoencoder (SAE) on the final-layer residual embedding of the ViT and amplifies the sparse latents with different strategies (e.g., per-class steering) to influence pruning decisions.
Result: Selects compact, high-accuracy, class-specific subsets of attention heads (e.g., for the bowl class, heads h2 and h5 alone raise accuracy while sharply reducing head usage), confirming class-level control of dynamic pruning through sparse latent features.
Conclusion: Sparse latent features can bridge the efficiency of dynamic ViT pruning and mechanistic interpretability, opening a path toward controllable, interpretable model compression.
Abstract: Dynamic head pruning in Vision Transformers (ViTs) improves efficiency by removing redundant attention heads, but existing pruning policies are often difficult to interpret and control. In this work, we propose a novel framework that integrates Sparse Autoencoders (SAEs) with dynamic pruning, leveraging their ability to disentangle dense embeddings into interpretable and controllable sparse latents. Specifically, we train an SAE on the final-layer residual embedding of the ViT and amplify the sparse latents with different strategies to alter pruning decisions. Among them, per-class steering reveals compact, class-specific head subsets that preserve accuracy. For example, for the class bowl, accuracy improves (76% to 82%) while head usage drops (0.72 to 0.33) using heads h2 and h5. These results show that sparse latent features enable class-specific control of dynamic pruning, effectively bridging pruning efficiency and mechanistic interpretability in ViTs.
[91] CNMBI: Determining the Number of Clusters Using Center Pairwise Matching and Boundary Filtering
Ruilin Zhang,Haiyang Zheng,Hongpeng Wang
Main category: cs.CV
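TL;DR: This paper proposes CNMBI, which uses distribution information inherent in the data space together with bipartite graph theory to dynamically compare the positional behavior of cluster centers and, for the first time in cluster number determination, actively removes low-confidence samples, improving robustness and adaptability on complex data such as high-dimensional, large-scale images.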
Details
Motivation: Most existing methods for choosing the number of clusters are based on cluster validation and depend on assumptions about the data distribution, making them hard to apply to complex real-world data such as high-dimensional, large-scale images.
Method: Proposes CNMBI, which uses distribution information inherent in the data space, casts cluster number determination as a dynamic comparison of positional behavior between cluster centers, and models this process with bipartite graph theory; a sample confidence mechanism actively removes low-confidence samples.
Result: On several challenging datasets, including CIFAR-10 and STL-10, CNMBI significantly outperforms state-of-the-art methods and shows greater robustness and flexibility with respect to data dimension and shape.
Conclusion: CNMBI is a new paradigm that relies on neither complete clustering results nor complex validity indices and effectively determines the number of clusters automatically in complex data settings.
Abstract: One of the main challenges in data mining is choosing the optimal number of clusters without prior information. Notably, existing methods usually follow the philosophy of cluster validation and hence make underlying assumptions about the data distribution, which prevents their application to complex data such as large-scale images and high-dimensional data from the real world. In this regard, we propose an approach named CNMBI. Leveraging the distribution information inherent in the data space, we cast the target task as a dynamic comparison process between cluster centers regarding positional behavior, without relying on complete clustering results or designing a complex validity index as before. Bipartite graph theory is then employed to efficiently model this process. Additionally, we find that different samples have different confidence levels and thereby actively remove low-confidence ones, which is, for the first time to our knowledge, considered in cluster number determination. CNMBI is robust and allows for more flexibility in the dimension and shape of the target data (e.g., CIFAR-10 and STL-10). Extensive comparison studies with state-of-the-art competitors on various challenging datasets demonstrate the superiority of our method.
[92] Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly Detection
Yang Liu,Boan Chen,Yuanyuan Meng,Jing Liu,Zhengliang Guo,Wei Zhou,Peng Sun,Hong Chen
Main category: cs.CV
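TL;DR: This paper proposes MSG-Flow, which models motion semantics hierarchically (discrete action primitives, semantic-level temporal dependencies, and fine-grained pose variations) to improve the discriminative power of skeleton-based video anomaly detection, reaching state-of-the-art performance on HR-ShanghaiTech and HR-UBnormal.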
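The discretization step in this entry relies on vector quantization. As a minimal sketch of that one idea, the code below maps each continuous feature vector to the index of its nearest codebook entry. It shows only the lookup at the heart of a VQ-VAE; the encoder, codebook learning, and everything specific to MSG-Flow are omitted, and the codebook and sequence values are hypothetical.

```python
def quantize(sequence, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in sequence
    ]

# Hypothetical 2-D codebook with three "primitives" (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Toy motion-feature sequence drifting from the first primitive to the second.
sequence = [(0.1, 0.0), (0.4, 0.1), (0.6, 0.1), (0.9, 0.0)]

tokens = quantize(sequence, codebook)
print(tokens)  # [0, 0, 1, 1]
```

The resulting token sequence is the kind of discrete input an autoregressive model can then be trained on, while the residual detail lost by quantization is what a separate continuous model would need to capture.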
Details
Motivation: Existing skeleton-based video anomaly detection methods model motion as a single continuous trajectory and struggle to capture both the discrete semantic primitives and the fine-grained motion details of human activity, which weakens discrimination of anomalies at different abstraction levels; privacy protection must also be taken into account.
Method: Proposes Motion Semantics Guided Normalizing Flow (MSG-Flow): (1) a vector-quantized variational autoencoder (VQ-VAE) discretizes continuous skeleton motion into interpretable semantic primitives; (2) an autoregressive Transformer models semantic-level temporal dependencies; (3) a conditional normalizing flow models fine-grained pose variations.
Result: AUC reaches 88.1% on HR-ShanghaiTech and 75.8% on HR-UBnormal, ahead of existing methods.
Conclusion: Hierarchical modeling of motion semantics improves the detection of anomalies at multiple scales in skeleton-based video while balancing privacy protection and discriminative performance.
Abstract: As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, the need for privacy-preserving approaches to understand human activities in physical environments has become paramount. Video anomaly detection is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner, failing to capture the hierarchical nature of human activities composed of discrete semantic primitives and fine-grained kinematic details, which leads to reduced discriminability when anomalies manifest at different abstraction levels. In this regard, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow), which decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs a vector-quantized variational autoencoder to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations.
Extensive experiments on benchmarks (HR-ShanghaiTech & HR-UBnormal) demonstrate that MSG-Flow achieves state-of-the-art performance with 88.1% and 75.8% AUC respectively.
[93] TDEC: Deep Embedded Image Clustering with Transformer and Distribution Information
Ruilin Zhang,Haiyang Zheng,Hongpeng Wang
Main category: cs.CV
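TL;DR: This paper proposes TDEC, a new deep embedded image clustering method that is the first to jointly consider feature representation, dimensional preference, and robust assignment; it uses a Transformer encoder to capture global dependencies, a dimension-reduction module to build a clustering-friendly low-dimensional space, and the distribution information of embedded features to provide reliable supervisory signals, significantly improving clustering on complex image data.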
Details
Motivation: Existing deep clustering methods ignore the fusion of globally perceived information across image regions, and the features they learn are high-dimensional and rely only on simple distance measures, which is unfavorable for clustering.
Method: Proposes TDEC, comprising a Transformer-based T-Encoder that learns discriminative features with global dependency, a Dim-Reduction module that builds a clustering-friendly low-dimensional space, and the use of embedded-feature distribution information during clustering to support joint training.
Result: TDEC significantly outperforms state-of-the-art methods on complex datasets and is robust and flexible with respect to data size, number of clusters, and context complexity.
Conclusion: By jointly optimizing feature representation, dimension selection, and the assignment mechanism, TDEC improves image clustering, particularly on high-dimensional, complex image data.
Abstract: Image clustering is a crucial but challenging task in multimedia machine learning. Recently, the combination of clustering with deep learning has achieved promising performance against conventional methods on high-dimensional image data. Unfortunately, existing deep clustering (DC) methods often ignore the importance of information fusion with a global perception field among different image regions when clustering images, especially complex ones. Additionally, the learned features are usually clustering-unfriendly in terms of dimensionality and are based only on simple distance information for the clustering. In this regard, we propose a deep embedded image clustering method, TDEC, which, for the first time to our knowledge, jointly considers feature representation, dimensional preference, and robust assignment for image clustering. Specifically, we introduce the Transformer to form a novel module, T-Encoder, to learn discriminative features with global dependency, while using the Dim-Reduction block to build a friendly low-dimensional space favoring clustering. Moreover, the distribution information of embedded features is considered in the clustering process to provide reliable supervisory signals for joint training. Our method is robust and allows for more flexibility in data size, the number of clusters, and the context complexity. More importantly, the clustering performance of TDEC is much higher than that of recent competitors. Extensive experiments with state-of-the-art approaches on complex datasets show the superiority of TDEC.
[94] From Diffusion To Flow: Efficient Motion Generation In MotionGPT3
Jaymin Ban,JiHong Jeon,SangYeop Jeong
Main category: cs.CV
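TL;DR: This paper compares diffusion and rectified flow objectives within the MotionGPT3 framework and finds that rectified flow converges faster in training, is more efficient at inference, and matches or exceeds diffusion in motion quality.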
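As background for this comparison, the sketch below shows the straight-line interpolation and constant velocity target commonly used to define a rectified flow objective, plus Euler sampling. It is a scalar toy, not the MotionGPT3 implementation: `ideal_field` is a hypothetical stand-in for a trained velocity network, and all values are illustrative.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity field: constant along the path."""
    return x1 - x0

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Hypothetical stand-in for a trained model: always move +2 per unit time.
ideal_field = lambda x, t: 2.0

one_step = euler_sample(-1.0, ideal_field, steps=1)
many_steps = euler_sample(-1.0, ideal_field, steps=50)
print(one_step, many_steps)
```

When the learned paths are close to straight, one Euler step lands in essentially the same place as fifty, which is the intuition behind the claim that flow-based priors stay competitive with fewer sampling steps.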
Details
Motivation: To examine whether rectified flow objectives can reproduce, in continuous-latent text-driven motion generation, the advantages they have shown in image and audio generation.
Method: Within the MotionGPT3 framework, the model architecture, training protocol, and evaluation setup are held fixed and only the generative objective is swapped between diffusion and rectified flow, giving a controlled empirical comparison.
Result: With the rectified flow objective, the model converges faster, reaches strong test performance earlier, matches or exceeds diffusion motion quality with fewer sampling steps, and behaves more stably across sampling step counts.
Conclusion: The advantages of rectified flow objectives carry over to text-to-motion generation, and the choice of generative objective matters for modeling motion priors.
Abstract: Recent text-driven motion generation methods span both discrete token-based approaches and continuous-latent formulations. MotionGPT3 exemplifies the latter paradigm, combining a learned continuous motion latent space with a diffusion-based prior for text-conditioned synthesis. While rectified flow objectives have recently demonstrated favorable convergence and inference-time properties relative to diffusion in image and audio generation, it remains unclear whether these advantages transfer cleanly to the motion generation setting. In this work, we conduct a controlled empirical study comparing diffusion and rectified flow objectives within the MotionGPT3 framework. By holding the model architecture, training protocol, and evaluation setup fixed, we isolate the effect of the generative objective on training dynamics, final performance, and inference efficiency. Experiments on the HumanML3D dataset show that rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion-based motion quality under identical conditions. Moreover, flow-based priors exhibit stable behavior across a wide range of inference step counts and achieve competitive quality with fewer sampling steps, yielding improved efficiency-quality trade-offs. Overall, our results suggest that several known benefits of rectified flow objectives do extend to continuous-latent text-to-motion generation, highlighting the importance of the training objective choice in motion priors.
[95] Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models
Qionghao Huang,Can Hu
Main category: cs.CV
TL;DR: This survey comprehensively reviews the evolution of remote sensing scene classification from handcrafted features to deep learning, foundation models, and generative AI, highlighting key advances, challenges (e.g., annotation cost, interpretability), and future directions (e.g., hyperspectral analysis, cross-domain generalization, standardized evaluation).
Details
Motivation: To systematically trace and analyze the methodological evolution of remote sensing scene classification, address current challenges, and identify future research priorities in light of rapid AI advancements.
Method: A comprehensive literature survey tracing methodological development (from classical texture descriptors and machine learning to CNNs, Vision Transformers, graph neural networks, foundation models, vision-language systems, and generative AI) while analyzing challenges and trends.
Result: A structured overview of technical progress, identification of persistent challenges (annotation costs, multimodal fusion, interpretability, ethics), emerging trends (edge computing, federated learning, sustainable AI), and concrete future research directions.
Conclusion: Remote sensing scene classification has matured significantly with AI, yet critical gaps remain; advancing hyperspectral/multi-temporal analysis, cross-domain generalization, and standardized evaluation are essential for next-generation systems.
Abstract: Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks.
The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.
[96] Generating Synthetic Wildlife Health Data from Camera Trap Imagery: A Pipeline for Alopecia and Body Condition Training Data
David Brundage
Main category: cs.CV
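TL;DR: This paper presents a pipeline for generating synthetic images of wildlife health conditions (such as alopecia and body condition deterioration) to compensate for the lack of ML-ready datasets in camera trap imagery, and validates its usefulness for screening real images.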
Details
Motivation: The lack of publicly available, ML-ready datasets of wildlife health conditions in camera trap imagery hinders the development of automated health screening.
Method: Builds a curated base image set from iWildCam using MegaDetector bounding boxes and stratified sampling; designs a generative phenotype editing system that simulates alopecia and emaciation at varying severity; and introduces an adaptive scene drift quality control system that filters images with a sham prefilter and a decoupled mask-then-score mechanism.
Result: From 201 base images, 553 synthetic images pass quality control (83% pass rate); a model trained only on synthetic data achieves 0.85 AUROC when screening real images.
Conclusion: The synthetic data pipeline can serve as an effective data source for health screening, showing that synthetic data captures enough visual features to support cross-domain screening.
Abstract: No publicly available, ML-ready datasets exist for wildlife health conditions in camera trap imagery, creating a fundamental barrier to automated health screening. We present a pipeline for generating synthetic training images depicting alopecia and body condition deterioration in wildlife from real camera trap photographs. Our pipeline constructs a curated base image set from iWildCam using MegaDetector-derived bounding boxes and center-frame-weighted stratified sampling across 8 North American species. A generative phenotype editing system produces controlled severity variants depicting hair loss consistent with mange and emaciation. An adaptive scene drift quality control system uses a sham prefilter and a decoupled mask-then-score approach with complementary day/night metrics to reject images where the generative model altered the original scene. We frame the pipeline explicitly as a screening data source. From 201 base images across 4 species, we generate 553 QC-passing synthetic variants with an overall pass rate of 83 percent. A sim-to-real transfer experiment, training exclusively on synthetic data and testing on real camera trap images of suspected health conditions, achieves 0.85 AUROC, demonstrating that the synthetic data captures visual features sufficient for screening.
[97] Domain-Guided YOLO26 with Composite BCE-Dice-Lovász Loss for Multi-Class Fetal Head Ultrasound Segmentation
M. Fazri Nizar
Main category: cs.CV
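TL;DR: This paper proposes a prompt-free joint detection and segmentation framework based on YOLO26-Seg for automatic segmentation of the brain, cavum septi pellucidi (CSP), and lateral ventricles (LV) in fetal ultrasound, reaching a mean Dice coefficient of 0.9253 on 575 test images and surpassing the baseline.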
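The headline metric for this entry is the Dice coefficient, and its abstract mentions inverse-frequency class weighting. The sketch below illustrates both on toy inputs. It is not the paper's composite BCE-Dice-Lovász loss (the BCE and Lovász terms are omitted), the normalization of the weights is an assumption, and the masks and pixel counts are made up.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention used here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

def inverse_frequency_weights(pixel_counts):
    """Weight each class by the inverse of its pixel share, normalised to sum to 1."""
    total = sum(pixel_counts.values())
    raw = {name: total / count for name, count in pixel_counts.items()}
    norm = sum(raw.values())
    return {name: w / norm for name, w in raw.items()}

pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice = dice_coefficient(pred, target)
print(round(dice, 4))  # 0.6667

# Made-up pixel counts: the two small structures are far rarer than the brain.
counts = {"Brain": 9000, "CSP": 300, "LV": 700}
weights = inverse_frequency_weights(counts)
print(weights)
```

With these counts the rarest class (CSP) receives the largest weight, which is how inverse-frequency weighting keeps small structures from being swamped by the dominant class during training.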
Details
Motivation: Segmenting fetal head structures in prenatal ultrasound remains a bottleneck in clinical practice; existing state-of-the-art methods depend on bounding-box prompts at test time, which limits generalization and practicality.
Method: Builds an end-to-end joint detection-segmentation model on YOLO26-Seg: (i) a composite BCE-Dice-Lovász loss with inverse-frequency weighting, injected into the training loop via runtime monkey-patching; (ii) anatomy-aware, domain-guided copy-paste augmentation that increases data diversity for small targets (CSP/LV); (iii) inter-patient stratified splitting to prevent data leakage.
Result: On 575 independent test images, the method reaches a mean Dice of 0.9253 (over the three foreground classes only) versus 0.9012 for the baseline, a gain of 2.68% (relative; 0.0241 absolute); ablations confirm each module's contribution and analyze how annotation quality and class imbalance affect CSP/LV performance.
Conclusion: A prompt-free, single-forward-pass framework markedly improves the accuracy and practicality of multi-structure fetal ultrasound segmentation; the composite loss and anatomy-aware augmentation are the key innovations, giving a more robust and simpler route to clinical deployment.
Abstract: Segmenting fetal head structures from prenatal ultrasound remains a practical bottleneck in obstetric imaging. The current state-of-the-art baseline, proposed alongside the published dataset, adapts the Segment Anything Model with per-class Dice and Lovász losses but still depends on bounding-box prompts at test time. We build a prompt-free pipeline on top of YOLO26-Seg that jointly detects and segments three structures, Brain, Cavum Septi Pellucidi (CSP), and Lateral Ventricles (LV), in a single forward pass. Three modifications are central to our approach: (i) a composite BCE-Dice-Lovász segmentation loss with inverse-frequency class weighting, injected into the YOLO26 training loop via runtime monkey-patching; (ii) domain-guided copy-paste augmentation that transplants minority-class structures while respecting their anatomical location relative to the brain boundary; and (iii) inter-patient stratified splitting to prevent data leakage. On 575 held-out test images, the composite loss variant reaches a mean Dice coefficient of 0.9253, exceeding the baseline (0.9012) by 2.68% (relative; 0.0241 absolute), despite averaging over the three foreground classes only, whereas the baseline's reported mean includes the easy background class. We further ablate each component and discuss annotation-quality and class-imbalance effects on CSP and LV performance.
[98] GradAttn: Replacing Fixed Residual Connections with Task-Modulated Attention Pathways
Soudeep Ghoshal,Himanshu Buckchash
Main category: cs.CV
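TL;DR: This paper proposes GradAttn, a hybrid CNN-Transformer framework that uses attention to dynamically regulate gradient flow in place of ResNet's fixed residual connections; it outperforms ResNet-18 on several datasets and shows that controlled gradient instability can improve generalization.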
Details
Motivation: ResNet's fixed residual connections cannot adapt to varying input complexity or selectively emphasize task-relevant features across network levels, so gradient signal degradation is still not fundamentally resolved.
Method: The GradAttn framework extracts multi-scale features at several CNN depths and uses self-attention to dynamically weight shallow texture features and deep semantic features; fixed residual connections are replaced by attention-controlled gradient flow; gradient flow, positional encoding effectiveness, and representational properties are analyzed.
Result: Three GradAttn variants are evaluated on eight datasets covering natural images, medical imaging, and fashion recognition; they outperform ResNet-18 on five of them, with up to +11.07% accuracy on FashionMNIST; gradient analysis shows that controlled instability often accompanies better generalization; the effect of positional encoding is dataset dependent.
Conclusion: Attention can serve as a learnable gradient-control tool, challenging the conventional assumption that more stable gradients are always better and offering a new paradigm for adaptive representation learning in deep neural networks.
Abstract: Deep ConvNets suffer from gradient signal degradation as network depth increases, limiting effective feature learning in complex architectures. ResNet addressed this through residual connections, but these fixed short-circuits cannot adapt to varying input complexity or selectively emphasize task-relevant features across network hierarchies. This study introduces GradAttn, a hybrid CNN-transformer framework that replaces fixed residual connections with attention-controlled gradient flow. By extracting multi-scale CNN features at different depths and regulating them through self-attention, GradAttn dynamically weights shallow texture features and deep semantic representations. For representational analysis, we evaluated three GradAttn variants across eight diverse datasets spanning natural images, medical imaging, and fashion recognition. Results demonstrate that GradAttn outperforms ResNet-18 on five of eight datasets, achieving up to +11.07% accuracy improvement on FashionMNIST while maintaining comparable network size. Gradient flow analysis reveals that controlled instabilities, introduced by attention, often coincide with improved generalization, challenging the assumption that perfect stability is optimal. Furthermore, positional encoding effectiveness proves dataset dependent, with CNN hierarchies frequently encoding sufficient spatial structure. These findings position attention mechanisms as enablers of learnable gradient control, offering a new paradigm for adaptive representation learning in deep neural architectures.
[99] Physics-Aware Diffusion for LiDAR Point Cloud Densification
Zeping Zhang,Robert Laganière
Main category: cs.CV
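TL;DR: This paper proposes Scanline-Consistent Range-Aware Diffusion, a new framework that addresses the performance drop in long-range LiDAR perception caused by sparse point clouds. The method applies partial diffusion (SDEdit) to a coarse prior for probabilistic refinement and combines it with a new Ray-Consistency loss and Negative Ray Augmentation to suppress physical hallucinations such as ghost points. It achieves high-fidelity densification in 156 ms, reaches state-of-the-art performance on KITTI-360 and nuScenes, and improves existing 3D detectors without retraining.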
Details
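Motivation: In long-range scenes, LiDAR perception is limited by point clouds that grow sparser with distance, which hurts downstream tasks such as 3D detection; existing diffusion models can recover geometry but suffer from high inference latency and physically inconsistent artifacts such as ghost points.
Method: The Scanline-Consistent Range-Aware Diffusion framework models densification as probabilistic refinement of a coarse prior rather than generation from scratch; it uses Partial Diffusion (SDEdit) to speed up inference, introduces a Ray-Consistency loss to enforce geometric consistency along laser scanlines, and adds Negative Ray Augmentation for robustness in regions with invalid rays, constraining outputs to respect sensor physics.
Result: The method reaches state-of-the-art densification performance on KITTI-360 and nuScenes; inference takes only 156 ms; it improves the detection accuracy of off-the-shelf 3D detectors (e.g., PointPillars, CenterPoint) in a plug-and-play manner, with no fine-tuning or retraining.
Conclusion: The work validates a lightweight, physics-constrained diffusion refinement strategy for LiDAR point cloud densification, offering a new paradigm for real-time, reliable, high-fidelity long-range perception.
Abstract: LiDAR perception is severely limited by the distance-dependent sparsity of distant objects. While diffusion models can recover dense geometry, they suffer from prohibitive latency and physical hallucinations manifesting as ghost points. We propose Scanline-Consistent Range-Aware Diffusion, a framework that treats densification as probabilistic refinement rather than generation. By leveraging Partial Diffusion (SDEdit) on a coarse prior, we achieve high-fidelity results in just 156ms. Our novel Ray-Consistency loss and Negative Ray Augmentation enforce sensor physics to suppress artifacts. Our method achieves state-of-the-art results on KITTI-360 and nuScenes, directly boosting off-the-shelf 3D detectors without retraining. Code will be made available.
[100] An Intelligent Framework for Real-Time Yoga Pose Detection and Posture Correction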
Chandramouli Haldar
Main category: cs.CV
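TL;DR: This paper proposes an Edge AI-based framework for real-time yoga pose detection and posture correction. It combines lightweight pose estimation, biomechanical feature extraction, and CNN-LSTM temporal learning to recognize poses and score them quantitatively, provides real-time corrective guidance through multimodal feedback, and applies model quantization and pruning to fit resource-constrained devices.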
Details
Motivation: The benefits of yoga depend on executing poses correctly; in self-guided or online training, poor posture reduces effectiveness and can cause injury, so intelligent assistance is needed.
Method: A hybrid Edge AI framework combines a lightweight human pose estimation model, extraction of biomechanical features (such as joint angles and skeletal features), and CNN-LSTM temporal modeling; a quantitative scoring mechanism measures deviation from reference poses; real-time feedback is delivered through visual, text, and voice channels; model quantization and pruning enable low-latency on-device deployment.
Result: The framework achieves high-accuracy real-time yoga pose recognition and dynamic posture analysis, reaches low latency on resource-constrained edge devices, and supports multimodal real-time corrective feedback.
Conclusion: The framework can serve as an intelligent, scalable digital yoga assistant that improves training safety and effectiveness and is suitable for modern fitness applications.
Abstract: Yoga is widely recognized for improving physical fitness, flexibility, and mental well-being. However, these benefits depend strongly on correct posture execution. Improper alignment during yoga practice can reduce effectiveness and increase the risk of musculoskeletal injuries, especially in self-guided or online training environments. This paper presents a hybrid Edge AI-based framework for real-time yoga pose detection and posture correction. The proposed system integrates lightweight human pose estimation models with biomechanical feature extraction and a CNN-LSTM-based temporal learning architecture to recognize yoga poses and analyze motion dynamics. Joint angles and skeletal features are computed from detected keypoints and compared with reference pose configurations to evaluate posture correctness. A quantitative scoring mechanism is introduced to measure alignment deviations and generate real-time corrective feedback through visual, text-based, and voice-based guidance. In addition, Edge AI optimization techniques such as model quantization and pruning are applied to enable low-latency performance on resource-constrained devices. The proposed framework provides an intelligent and scalable digital yoga assistant that can improve user safety and training effectiveness in modern fitness applications.
[101] Tiny-ViT: A Compact Vision Transformer for Efficient and Explainable Potato Leaf Disease Classification
Shakil Mia,Umme Habiba,Urmi Akter,SK Rezwana Quadir Raisa,Jeba Maliha,Md. Iqbal Hossain,Md. Shakhauat Hossan Sumon
Main category: cs.CV
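TL;DR: This paper proposes Tiny-ViT, a lightweight Vision Transformer for efficient and accurate classification of potato leaf diseases (Early Blight, Late Blight, and Healthy) on resource-constrained devices; it outperforms existing baseline models in accuracy, robustness, real-time performance, and interpretability.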
Details
Motivation: Traditional potato disease identification is time-consuming and prone to human error, so an automated, efficient solution suited to edge devices is needed.
Method: A lightweight Vision Transformer, Tiny-ViT, is proposed; images are preprocessed with resizing, CLAHE enhancement, and Gaussian blur; Grad-CAM is used to improve interpretability.
Result: On a three-class potato leaf dataset, test accuracy reaches 99.85% and mean cross-validation accuracy 99.82%, with an MCC of 0.9990 and a narrow confidence interval of [0.9980, 0.9995]; inference is fast and computational cost is low.
Conclusion: Tiny-ViT is a high-accuracy, robust, low-cost, and interpretable potato disease classifier suitable for real-time agricultural applications.
Abstract: Early and precise identification of plant diseases, especially in potato crops, is important for keeping crops healthy and maximizing yield. Potato leaf diseases, such as Early Blight and Late Blight, pose significant challenges to farmers, often resulting in yield losses and increased pesticide use. Traditional methods of detection are not only time-consuming but also subject to human error, which is why automated and efficient methods are required. This paper introduces Tiny-ViT, a small and effective Vision Transformer (ViT) for potato leaf disease classification, developed for use in resource-limited systems. The model is tested on a dataset of three classes, namely Early Blight, Late Blight, and Healthy leaves, and the preprocessing procedures include resizing, CLAHE, and Gaussian blur to improve image quality. The Tiny-ViT model reaches a test accuracy of 99.85% and a mean CV accuracy of 99.82%, which is better than baseline models such as DeiT Small, Swin Tiny, and MobileViT XS. In addition, the model has a Matthews Correlation Coefficient (MCC) of 0.9990 and a narrow confidence interval (CI) of [0.9980, 0.9995], which indicates high reliability and generalization. The training and testing inference time is competitive, and the model has low computational cost, thereby making it applicable to real-time use. Moreover, the interpretability of the model is improved with the help of Grad-CAM, which identifies diseased areas.
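As a rough illustration of the MCC figure quoted in this abstract, the sketch below computes the standard multi-class Matthews Correlation Coefficient from a confusion matrix. The function name `multiclass_mcc` and the 3x3 counts are hypothetical and chosen only for illustration; they are not the paper's data or code.

```python
import math


def multiclass_mcc(confusion):
    """Multi-class MCC from a square confusion matrix (rows = true class, columns = predicted class)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)            # number of samples
    correct = sum(confusion[i][i] for i in range(k))      # correctly classified samples
    true_counts = [sum(confusion[i]) for i in range(k)]   # samples per true class
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # samples per predicted class
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(total ** 2 - sum(p * p for p in pred_counts)) * math.sqrt(
        total ** 2 - sum(t * t for t in true_counts)
    )
    # MCC is undefined when all samples fall in one class; return 0.0 by convention.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for Early Blight, Late Blight, Healthy (illustrative only).
example = [
    [98, 2, 0],
    [1, 99, 0],
    [0, 0, 100],
]
print(round(multiclass_mcc(example), 4))
```

Unlike plain accuracy, MCC uses the full confusion matrix, so it stays informative when classes are imbalanced; a value near 1 requires few errors in every class.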
Altogether, the proposed Tiny-ViT is a robust, efficient, and explainable solution to the problem of plant disease classification.
[102] A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks
Babak Naderi,Ross Cutler
Main category: cs.CV
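TL;DR: This paper open-sources a large-scale, high-fidelity, near-raw video dataset of 847 talking-head videos captured with webcams in real-world environments, and uses it to evaluate codec efficiency, revealing how content type and background processing affect compression efficiency.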
Details
Motivation: Public datasets for talking-head video processing research in real-time communication are scarce and low in signal fidelity; high-quality, large-scale data from real-world settings is needed.
Method: 847 fifteen-second talking-head videos were recorded by 805 participants using 446 different consumer webcams in natural environments, all stored with the lossless FFV1 codec to preserve the original signal; each video is annotated with a MOS and ten perceptual quality tokens; a stratified benchmark subset of 120 clips covers three content conditions (original, background blur, background replacement); VMAF BD-rate efficiency is evaluated across four datasets and four codecs (H.264, H.265, H.266, AV1).
Result: H.266 achieves VMAF BD-rate savings of up to -71.3% relative to H.264; significant encoder $\times$ dataset ($\eta_p^2 = .112$) and encoder $\times$ content condition ($\eta_p^2 = .149$) interactions are found; the ten quality tokens jointly explain 64.4% of the MOS variance; the dataset is 5 times the size of the largest prior comparable dataset (847 vs. 160).
Conclusion: With near-raw signal fidelity and larger scale, the open-source dataset fills a data gap in real-world talking-head video research and provides a reliable benchmark for training and evaluating video compression and enhancement models.
Abstract: Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4%) or MJPEG-encoded (75.6%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to $-71.3\%$ (H.266) relative to H.264, with significant encoder $\times$ dataset ($\eta_p^2 = .112$) and encoder $\times$ content condition ($\eta_p^2 = .149$) interactions, demonstrating that both content type and background processing affect compression efficiency.
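For readers unfamiliar with BD-rate, the following sketch shows the idea behind the metric: the average bitrate difference between two codecs at equal quality, computed in the log-bitrate domain. It is a simplified variant that uses piecewise-linear interpolation instead of the cubic fit in the usual Bjøntegaard procedure, and the function names and rate-quality points are hypothetical, not taken from the paper.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of ys over strictly increasing xs."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the sampled quality range")


def bd_rate(anchor, test, steps=1000):
    """Approximate BD-rate in percent (negative = test codec saves bitrate).

    anchor, test: lists of (bitrate, quality) points sorted by strictly increasing quality.
    """
    qa = [q for _, q in anchor]
    la = [math.log(r) for r, _ in anchor]
    qt = [q for _, q in test]
    lt = [math.log(r) for r, _ in test]
    lo, hi = max(qa[0], qt[0]), min(qa[-1], qt[-1])  # overlapping quality interval
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):  # midpoint rule over the overlap
        q = lo + (i + 0.5) * h
        total += _interp(q, qt, lt) - _interp(q, qa, la)
    mean_log_diff = total / steps
    return (math.exp(mean_log_diff) - 1.0) * 100.0


# Hypothetical (bitrate in kbps, VMAF) points: the test codec needs half the bitrate everywhere.
anchor_curve = [(1000, 70.0), (2000, 80.0), (4000, 90.0)]
test_curve = [(500, 70.0), (1000, 80.0), (2000, 90.0)]
print(round(bd_rate(anchor_curve, test_curve), 2))
```

Averaging in the log domain makes the result a relative (percentage) saving, which is why a figure such as -71.3% reads as "71.3% less bitrate at the same VMAF on average".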
The dataset offers 5$\times$ the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for training and benchmarking video compression and enhancement models in real-time communication.
[103] Low Dose CT for Stroke Diagnosis: A Dual Pipeline Deep Learning Framework for Portable Neuroimaging
Rhea Ghosal,Ronok Ghosal,Eileen Lou
Main category: cs.CV
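TL;DR: This paper presents a deep learning framework for stroke classification in portable low-dose CT brain scans, comparing direct classification with denoise-then-classify across multiple dose levels. Denoising improves how images look but does not necessarily improve diagnostic performance, revealing a trade-off between perceptual quality and diagnostic utility.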
Details
Motivation: Portable CT scanners enable early stroke detection in prehospital and resource-limited settings, but the required dose reduction introduces noise that degrades diagnostic reliability.
Method: Controlled Poisson noise simulates realistic low-dose CT conditions; a deep learning framework compares direct classification of noisy LDCT images against denoising followed by classification, evaluated across multiple dose levels with accuracy, sensitivity, and AUC.
Result: Denoising improves perceptual image quality but does not consistently improve classification; in some settings direct classification yields higher sensitivity; the best denoise-then-classify pipeline reaches 0.94 AUC and 0.91 accuracy at moderate dose, up to 6% above direct classification in select cases.
Conclusion: The work establishes a reproducible baseline for low-dose CT stroke triage based on RSNA hemorrhagic stroke data and stresses the need for further validation on ischemic stroke cohorts and real portable CT systems.
Abstract: Portable CT scanners enable early stroke detection in prehospital and low-resource settings but require reduced radiation doses, introducing noise that degrades diagnostic reliability. We present a deep learning framework for stroke classification from simulated low-dose CT (LDCT) brain scans for AI-assisted triage in mobile clinical environments. Controlled Poisson noise is applied to high-dose CT images to simulate realistic LDCT conditions. We compare two pipelines: (1) direct classification of noisy LDCT images and (2) denoising followed by classification. Performance is evaluated across multiple dose levels using accuracy, sensitivity, and AUC. While denoising improves perceptual image quality, it does not consistently improve classification. In several settings, direct classification yields higher sensitivity, revealing a trade-off between perceptual quality and diagnostic utility. The best denoise-then-classify pipeline achieves 0.94 AUC and 0.91 accuracy at moderate dose levels, outperforming direct classification by up to 6% in select cases. This work establishes a reproducible baseline for LDCT stroke triage using hemorrhagic stroke data (RSNA dataset) and highlights the need for validation on ischemic cohorts and real-world portable CT systems.
[104] JND-Guided Neural Watermarking with Spatial Transformer Decoding for Screen-Capture Robustness
Jiayi Qin,Jingwei Li,Chuan Wu
Main category: cs.CV
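TL;DR: This paper proposes an end-to-end deep learning framework for screen-shooting robust watermarking. By jointly optimizing embedding and extraction and adding a realistic noise simulation layer, a JND perceptual loss, and automatic localization modules, it substantially improves watermark extraction accuracy while maintaining high visual quality.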
Details
Motivation: Existing methods struggle to balance extraction accuracy and visual quality under the complex distortions of screen shooting (Moiré patterns, color-gamut shifts, perspective warping, sensor noise).
Method: An end-to-end deep learning framework is proposed with: (i) a comprehensive noise simulation layer that includes a physically modeled Moiré pattern generator, combined with adversarial training; (ii) a JND perceptual loss that adaptively adjusts embedding strength; (iii) semantic-segmentation-based foreground extraction and a symmetric noise template mechanism for automatic localization and anti-cropping recovery.
Result: With a 127-bit payload, average PSNR reaches 30.94 dB and SSIM 0.94, significantly better than existing methods.
Conclusion: The framework effectively improves watermark robustness and visual fidelity in screen-shooting scenarios, providing a fully automatic, highly reliable solution for practical deployment.
Abstract: Screen-shooting robust watermarking aims to imperceptibly embed extractable information into host images such that the watermark survives the complex distortion pipeline of screen display and camera recapture. However, achieving high extraction accuracy while maintaining satisfactory visual quality remains an open challenge, primarily because the screen-shooting channel introduces severe and entangled degradations including Moiré patterns, color-gamut shifts, perspective warping, and sensor noise. In this paper, we present an end-to-end deep learning framework that jointly optimizes watermark embedding and extraction for screen-shooting robustness. Our framework incorporates three key innovations: (i) a comprehensive noise simulation layer that faithfully models realistic screen-shooting distortions -- notably including a physically-motivated Moiré pattern generator -- enabling the network to learn robust representations against the full spectrum of capture-channel noise through adversarial training; (ii) a Just Noticeable Distortion (JND) perceptual loss function that adaptively modulates watermark embedding strength by supervising the perceptual discrepancy between the JND coefficient map and the watermark residual, thereby concentrating watermark energy in perceptually insensitive regions to maximize visual quality; and (iii) two complementary automatic localization modules -- a semantic-segmentation-based foreground extractor for captured image rectification and a symmetric noise template mechanism for anti-cropping region recovery -- that enable fully automated watermark decoding under realistic deployment conditions.
Extensive experiments demonstrate that our method achieves an average PSNR of 30.94 dB and SSIM of 0.94 on watermarked images while embedding 127-bit payloads.
[105] A training-free framework for high-fidelity appearance transfer via diffusion transformers
Shengrong Gu,Ye Wang,Song Wu,Rui Ma,Qian Wang,Lanjun Wang,Zili Yi
Main category: cs.CV
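TL;DR: This paper proposes a training-free framework for high-fidelity appearance transfer in Diffusion Transformers (DiTs). By disentangling structure from appearance and combining high-fidelity inversion with a novel attention-sharing mechanism, it outperforms specialized methods at 1024px resolution.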
Details
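Motivation: Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing difficult; naively injecting local appearance into a DiT disrupts the overall scene structure.
Method: A training-free framework is proposed that (1) uses high-fidelity inversion to build a content prior for the source image, covering lighting and micro-textures; (2) introduces a novel attention-sharing mechanism that dynamically fuses purified appearance features from the reference image under geometric priors; (3) achieves synergistic disentanglement of structure and appearance.
Result: Operating at 1024px, the method clearly outperforms existing specialized methods and achieves state-of-the-art results on tasks ranging from semantic attribute transfer to fine-grained material application; extensive experiments confirm its advantage in both structural preservation and appearance fidelity.
Conclusion: The paper presents the first efficient, training-free controllable editing approach for DiTs, offering a new paradigm for reference-based high-fidelity appearance transfer with both structural stability and appearance accuracy.
Abstract: Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm our state-of-the-art performance in both structural preservation and appearance fidelity.
[106] Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models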
Chen Zheng,Yuxuan Lai,Haoyang Lu,Wentao Ma,Jitao Yang,Jian Wang
Main category: cs.CV
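TL;DR: This paper proposes a vision-language model (VLM) based method for assessing the quality of handwritten Chinese characters and generating multi-level feedback. Going beyond traditional regression-style scoring, it supports simple grade feedback (Task 1) and enriched descriptive feedback (Task 2), improves performance through LoRA fine-tuning and in-context learning, and achieves state-of-the-art results in the CCL 2025 evaluation.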
Details
Motivation: Existing automated scoring methods output only a single score and lack actionable suggestions, so they do little to help learners improve their handwriting.
Method: A vision-language model (VLM) is combined with low-rank adaptation (LoRA) fine-tuning and in-context learning to analyze the quality of handwritten Chinese characters and generate two levels of feedback (simple grades and descriptive feedback).
Result: The approach achieves state-of-the-art performance on multiple tracks of the CCL 2025 handwritten Chinese character quality evaluation.
Conclusion: VLMs with adaptation strategies can support fine-grained, interpretable assessment of handwritten characters and teaching feedback, offering a new paradigm for intelligent Chinese handwriting instruction.
Abstract: The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performance across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.
[107] Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
Mehmet Kaan Erol
Main category: cs.CV
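TL;DR: This paper compares the failure modes of two vision-language models of different sizes (Qwen2.5-VL-7B 4-bit and SmolVLM2-500M FP16) on VQAv2 and COCO Captions and proposes a three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias). The smaller model shows more pronounced Semantic Drift and negation collapse, especially on COCO, and a reproducible safety evaluation pipeline is released.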
Details
Motivation: When large vision-language models are rapidly compressed for edge deployment, it is unclear whether their failure modes change qualitatively rather than merely becoming more frequent; a systematic diagnostic framework is needed.
Method: A three-category error taxonomy (Object Blindness / Semantic Drift / Prior Bias) is built, with GPT-4o as a text-only judge; evaluation covers Expected Calibration Error (ECE), structured negation probes (four templates), and a blur robustness experiment; the two models (Qwen2.5-VL-7B 4-bit vs. SmolVLM2-500M FP16) are compared on VQAv2 and COCO Captions.
Result: SmolVLM2-500M shows a severe negation collapse on COCO (-33.2pp vs. -20.8pp, p<10^-8), answering "Yes" incorrectly on 100% of false_yn template trials; Semantic Drift is the dominant error type for both models on VQAv2 and for Qwen on COCO; Prior Bias appears only on VQAv2; the small model shows dataset-dependent asymmetric miscalibration.
Conclusion: Compact VLMs are not merely "worse"; they exhibit identifiable, quantifiable failure signatures (such as semantic drift and negation collapse) that should be assessed with a standardized safety audit before edge deployment.
Abstract: The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as the dominant failure mode on VQAv2 and on COCO for Qwen, with a mixed Object Blindness / Semantic Drift profile for SmolVLM2 on COCO; Prior Bias (C) is present on VQAv2 but absent on COCO for both models. Confidence calibration is measured via Expected Calibration Error (ECE) using geometric mean token probability, compositional reasoning is probed with structured negation probes across four templates, and a blur robustness experiment completes the evaluation. For this model pair, the compact model exhibits a qualitatively distinct failure signature: a 12.5pp larger negation collapse (-33.2pp vs. -20.8pp, Wald 95% CI [8.2, 16.8]pp, p < 10^-8), driven almost entirely by COCO while the VQAv2 gap is not statistically significant (4.5pp, p=0.19).
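The calibration measurement mentioned in this abstract can be sketched as follows: a per-answer confidence is taken as the geometric mean of the token probabilities, and ECE is the bin-weighted gap between confidence and accuracy. This is a minimal sketch assuming ten equal-width bins; the function names, bin count, and toy inputs are assumptions for illustration, not the paper's implementation.

```python
import math


def geometric_mean_confidence(token_logprobs):
    """Answer confidence as the geometric mean of token probabilities (exp of the mean log-probability)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example (hypothetical log-probabilities and correctness labels).
answers = [[-0.1, -0.2], [-0.05, -0.05], [-1.2, -0.9]]
confs = [geometric_mean_confidence(lp) for lp in answers]
print(round(expected_calibration_error(confs, [True, True, False]), 3))
```

The geometric mean keeps confidence comparable across answers of different lengths, whereas a raw product of token probabilities would shrink as answers get longer.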
The most discriminating template is false_yn: SmolVLM2-500M responds "Yes" (incorrectly claiming a depicted object is absent) on 100% of COCO trials vs. 14% for Qwen2.5-VL-7B. Asymmetric dataset-dependent miscalibration and a blur experiment with two controlled ablations complete the analysis. The fully reproducible pipeline is released for systematic safety auditing of compressed VLMs prior to edge deployment.
[108] Quantized Vision-Language Models for Damage Assessment: A Comparative Study of LLaVA-1.5-7B Quantization Levels
Takato Yasuno
Main category: cs.CV
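TL;DR: This paper studies quantized vision-language models (VLMs) for automated bridge damage assessment. It proposes an end-to-end pipeline combining LLaVA-1.5-7B, structured JSON extraction, and rule-based priority scoring, and a systematic quantization comparison finds that Q5_K_M gives the best balance of quality, speed, and resource use.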
Details
Motivation: Bridge inspection is labor-intensive and depends on expert experience, so automated methods are needed to improve efficiency and scalability; existing VLM deployments are limited by compute, motivating quantization strategies that fit consumer-grade GPUs.
Method: An end-to-end pipeline based on LLaVA-1.5-7B integrates visual analysis, structured JSON extraction, and rule-based priority scoring; three quantization levels (Q4_K_M, Q5_K_M, Q8_0) are compared systematically on 254 rebar exposure images; a 5-point quality evaluation framework covers damage type recognition and severity classification, with statistical correlation analysis.
Result: Q5_K_M performs best across quality (3.18±1.35/5), speed (5.67 s/image), and efficiency (0.56 quality/sec): 8.5% higher quality than Q4_K_M with only a 4.5% slowdown, and quality comparable to Q8_0 while 25% faster; it has the weakest text-quality correlation (-0.148), indicating more stable performance.
Conclusion: Q5_K_M is the recommended quantization level for bridge damage assessment deployment, keeping description quality high while clearly improving inference efficiency and hardware compatibility, and offering practical guidance for edge VLM applications.
Abstract: Bridge infrastructure inspection is a critical but labor-intensive task requiring expert assessment of structural damage such as rebar exposure, cracking, and corrosion. This paper presents a comprehensive study of quantized Vision-Language Models (VLMs) for automated bridge damage assessment, focusing on the trade-offs between description quality, inference speed, and resource requirements. We develop an end-to-end pipeline combining LLaVA-1.5-7B for visual damage analysis, structured JSON extraction, and rule-based priority scoring. To enable deployment on consumer-grade GPUs, we conduct a systematic comparison of three quantization levels: Q4_K_M, Q5_K_M, and Q8_0 across 254 rebar exposure images. We introduce a 5-point quality evaluation framework assessing damage type recognition and severity classification. Our results demonstrate that Q5_K_M achieves the optimal balance: quality score 3.18$\pm$1.35/5.0, inference time 5.67s/image, and 0.56 quality/sec efficiency -- 8.5% higher quality than Q4_K_M with only 4.5% speed reduction, while matching Q8_0's quality with 25% faster inference. Statistical analysis reveals Q5_K_M exhibits the weakest text-quality correlation (-0.148), indicating consistent performance regardless of description length.
[109] From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics
Paolo Cupini,Francesco Pierri
Main category: cs.CV
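TL;DR: This paper systematically evaluates multimodal large language models (MLLMs) on semantic annotation of Italian television news. It builds a domain-specific benchmark covering four semantic dimensions and compares architectures and input strategies; gains from video input depend strongly on model size, and the selected pipeline has been deployed on real broadcast episodes to support audience analysis.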
Details
Motivation: MLLMs have general video understanding ability, but in broadcast television, with its structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints, their practical effectiveness across pipeline architectures and input configurations lacks empirical study.
Method: A four-dimension semantic annotation benchmark for Italian TV news (visual environment, topic, sensitive content, named entities) is built; two pipeline architectures are compared across nine frontier multimodal models (such as Gemini 3.0 Pro and LLaMA 4 Maverick), under progressively enriched input strategies combining video frames, ASR, speaker diarization, and metadata.
Result: Gains from video input depend strongly on model size: large models exploit temporal continuity effectively, while small models degrade from token overload; the best pipeline was deployed on 14 full broadcast episodes, producing minute-level annotations aligned with real audience data and supporting correlational analysis of topic-level audience sensitivity and generational engagement differences.
Conclusion: Multimodal annotation is operationally viable for broadcast news, but model choice and input design must fit domain constraints and model scale; the work offers a practical technical path and an empirical basis for content-driven audience analytics.
Abstract: Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This paper presents a systematic evaluation of multimodal annotation pipelines applied to broadcast television news in the Italian setting. We construct a domain-specific benchmark of clips labeled across four semantic dimensions: visual environment classification, topic classification, sensitive content detection, and named entity recognition. Two different pipeline architectures are evaluated across nine frontier models, including Gemini 3.0 Pro, LLaMA 4 Maverick, Qwen-VL variants, and Gemma 3, under progressively enriched input strategies combining visual signals, automatic speech recognition, speaker diarization, and metadata. Experimental results demonstrate that gains from video input are strongly model-dependent: larger models effectively leverage temporal continuity, while smaller models show performance degradation under extended multimodal context, likely due to token overload.
Beyond benchmarking, the selected pipeline is deployed on 14 full broadcast episodes, with minute-level annotations integrated with normalized audience measurement data provided by an Italian media company. This integration enables correlational analysis of topic-level audience sensitivity and generational engagement divergence, demonstrating the operational viability of the proposed framework for content-based audience analytics.
[110] From Prediction to Diagnosis: Reasoning-Aware AI for Photovoltaic Defect Inspection
Dev Mistry,Feng Qiu,Bo Chen,Feng Liu,Can Chen,Mohammad Shahidehpour,Ren Wang
Main category: cs.CV
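TL;DR: This paper proposes REVL-PV, a framework that combines electroluminescence, thermal, and visible-light images and uses vision-language multimodal learning for interpretable photovoltaic defect identification and diagnosis, achieving both high accuracy (93%) and expert-level diagnostic reasoning.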
Details
Motivation: Most existing photovoltaic defect detection systems are black-box classifiers that lack the diagnostic insight required for high-reliability energy infrastructure.
Method: REVL-PV, a vision-language framework, embeds domain-specific diagnostic reasoning into learning over multimodal imagery (electroluminescence, thermal, visible light), requiring the model to link visual evidence to plausible defect mechanisms before classification and to produce structured diagnostic reports.
Result: The framework reaches 93% classification accuracy on 1,927 real-world photovoltaic modules across eight defect categories, with strong robustness and interpretability; a blind study shows that its diagnostic rationales closely agree with a certified expert's assessments on defect identification, root-cause attribution, and visual description.
Conclusion: Reasoning-aware multimodal learning provides a general paradigm for trustworthy AI-assisted inspection of photovoltaic energy infrastructure.
Abstract: Reliable photovoltaic defect identification is essential for maintaining energy yield, ensuring warranty compliance, and enabling scalable inspection of rapidly expanding solar fleets. Although recent advances in computer vision have improved automated defect detection, most existing systems operate as opaque classifiers that provide limited diagnostic insight for high-stakes energy infrastructure. Here we introduce REVL-PV, a vision-language framework that embeds domain-specific diagnostic reasoning into multimodal learning across electroluminescence, thermal, and visible-light imagery. By requiring the model to link visual evidence to plausible defect mechanisms before classification, the framework produces structured diagnostic reports aligned with professional photovoltaic inspection practice. Evaluated on 1,927 real-world modules spanning eight defect categories, REVL-PV achieves 93% classification accuracy while producing interpretable diagnostic rationales and maintaining strong robustness under realistic image corruptions. A blind concordance study with a certified solar inspection expert shows strong semantic alignment between model explanations and expert assessments across defect identification, root-cause attribution, and visual descriptions. These results demonstrate that reasoning-aware multimodal learning establishes a general paradigm for trustworthy AI-assisted inspection of photovoltaic energy infrastructure.
[111] BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
Renbo Tu,Ali SaraerToosi,Nicholas S. Conroy,Gennady Pekhimenko,Aviad Levis
Main category: cs.CV
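TL;DR: BHCast is a neural network framework that forecasts the evolution of plasma dynamics from a single blurry black hole image (such as those captured by the EHT) and infers black hole physical parameters (such as spin and inclination); it combines a multi-scale loss, autoregressive modeling, and gradient-boosted trees for interpretable, modular, and robust scientific inference.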
Details
Motivation: EHT black hole images are static, blurry, and carry no dynamical information; conventional numerical simulations are computationally expensive and impractical for real-time inference, so an efficient, interpretable data-driven method is needed to recover dynamics and physical parameters from a single low-resolution frame.
Method: BHCast has three parts: (1) a neural autoregressive model trained with a multi-scale pyramid loss super-resolves a single blurry image and extrapolates it into a high-resolution video; (2) interpretable spatio-temporal features (such as pattern speed and pitch angle) are extracted from the forecast; (3) gradient-boosted trees infer black hole parameters (spin, inclination) from these features. Forecasting and inference are separated to improve modularity and uncertainty quantification.
Result: The method is validated on simulated Sgr A* and M87* data (blurred to EHT resolution) and on real EHT images of M87*; it stably generates long-horizon high-resolution videos and accurately recovers key parameters such as spin and viewing inclination.
Conclusion: BHCast offers a scalable paradigm for resolution-limited inverse problems in astrophysics, showing the potential of learned dynamics to extract physics from limited observations, with both scientific interpretability and practical utility.
Abstract: The Event Horizon Telescope (EHT) delivered the first image of a black hole by capturing the light from its surrounding accretion flow, revealing structure but not dynamics. Simulations of black hole accretion dynamics are essential for interpreting EHT images but costly to generate and impractical for inference. Motivated by this bottleneck, BHCast presents a framework for forecasting black hole plasma dynamics from a single, blurry snapshot, such as those captured by the EHT. At its core, BHCast is a neural model that transforms a static image into forecasted future frames, revealing the underlying dynamics hidden within one snapshot. With a multi-scale pyramid loss, we demonstrate how autoregressive forecasting can simultaneously super-resolve and evolve a blurry frame into a coherent, high-resolution movie that remains stable over long time horizons. From forecasted dynamics, we can then extract interpretable spatio-temporal features, such as pattern speed (rotation rate) and pitch angle. Finally, BHCast uses gradient-boosting trees to recover black hole properties from these plasma features, including the spin and viewing inclination angle. The separation between forecasting and inference provides modular flexibility, interpretability, and robust uncertainty quantification. We demonstrate the effectiveness of BHCast on simulations of two distinct black hole accretion systems, Sagittarius A* and M87*, by testing on simulated frames blurred to EHT resolution and real EHT images of M87*.
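To make the idea of a multi-scale pyramid loss concrete, the sketch below compares a predicted frame and a target frame at several resolutions and sums the per-scale errors. It assumes mean squared error at each level, 2x average pooling between levels, and equal weights; these choices and the function names are assumptions for illustration, and the loss actually used by BHCast may differ.

```python
def downsample2x(img):
    """Average-pool a 2D image (list of rows) by a factor of 2; assumes even height and width."""
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, len(img[0]), 2)]
        for r in range(0, len(img), 2)
    ]


def mse(a, b):
    """Mean squared error between two equally sized 2D images."""
    count = len(a) * len(a[0])
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)) / count


def pyramid_loss(pred, target, levels=3):
    """Sum of per-scale MSE over `levels` resolutions, from finest to coarsest."""
    total = 0.0
    for level in range(levels):
        total += mse(pred, target)
        if level < levels - 1:  # no need to downsample after the coarsest level
            pred, target = downsample2x(pred), downsample2x(target)
    return total


# A checkerboard error is visible at full resolution but averages out at the coarser scale.
checker = [[1.0, -1.0], [-1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(pyramid_loss(checker, zeros, levels=2))
```

Coarse levels penalize errors in large-scale structure while fine levels penalize errors in detail, which is one plausible reason such a loss suits a model that must super-resolve and forecast at the same time.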
Ultimately, our methodology establishes a scalable paradigm for solving inverse problems, demonstrating the potential of learned dynamics to unlock insights from resolution-limited scientific data.
[112] Limits of Imagery Reasoning in Frontier LLM Models
Sergio Y. Hayashi,Nina S. T. Hirata
Main category: cs.CV
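TL;DR: This paper examines whether equipping a large language model (LLM) with an external "imagery module" that can render and rotate 3D models improves spatial reasoning such as mental rotation. Even with the module, accuracy peaks at only 62.5%, indicating that current frontier models lack basic visual-spatial perception and reasoning abilities.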
Details
Motivation: LLMs reason well but perform poorly on spatial tasks that require mental simulation (such as mental rotation); the authors test whether an external imagery module can act as a "cognitive prosthetic" to compensate.
Method: A dual-module architecture is used in which an MLLM-based reasoning module works with an imagery module that renders and rotates 3D models on 3D rotation tasks; performance and failure causes are evaluated systematically.
Result: Accuracy on the 3D rotation tasks reaches at most 62.5%; analysis shows that the system still fails even when maintaining and manipulating the 3D state is outsourced to the imagery module, exposing a lack of low-level spatial signal sensitivity (depth, motion, short-horizon dynamic prediction) and of contemplative reasoning over images.
Conclusion: Current frontier models lack the foundational abilities for visual-spatial interaction; adding external tools alone does not address this, and spatial perception and dynamic image reasoning need to be strengthened within the model itself.
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" -- a tool capable of rendering and rotating 3D models -- can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.
[113] RatSeizure: A Benchmark and Saliency-Context Transformer for Rat Seizure Localization
Ting Yu Tsai,An Yu,Lucy Lee,Felix X. -F. Ye,Damian S. Shin,Tzu-Jen Kao,Xin Li,Ming-Ching Chang
Main category: cs.CV
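TL;DR: This paper presents RatSeizure, the first public benchmark dataset for fine-grained analysis of rat seizure behavior, proposes the RaSeformer model for temporal action localization, and establishes a standardized evaluation protocol.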
Details
Motivation: Existing animal behavior datasets commonly suffer from limited accessibility, coarse labels, and insufficient temporal localization of key clinical events, which holds back research on epileptogenesis and treatment response.
Method: The RatSeizure dataset is built with fine-grained action-unit and temporal-boundary annotations, and RaSeformer, a Transformer that combines saliency and context modeling, is proposed for temporal localization of seizure behavior.
Result: RaSeformer performs strongly on RatSeizure and serves as a strong baseline for the task; standard data splits and evaluation protocols are established to support reproducible benchmarking.
Conclusion: Together, RatSeizure and RaSeformer fill the gap in high-quality data and dedicated models for animal seizure behavior analysis, advancing standardization and deeper study in this area.
Abstract: Animal models, particularly rats, play a critical role in seizure research for studying epileptogenesis and treatment response. However, progress is limited by the lack of datasets with precise temporal annotations and standardized evaluation protocols. Existing animal behavior datasets often have limited accessibility, coarse labeling, and insufficient temporal localization of clinically meaningful events. To address these limitations, we introduce RatSeizure, the first publicly available benchmark for fine-grained seizure behavior analysis. The dataset consists of recorded clips annotated with seizure-related action units and temporal boundaries, enabling both behavior classification and temporal localization. We further propose RaSeformer, a saliency-context Transformer for temporal action localization that highlights behavior-relevant context while suppressing redundant cues. Experiments on RatSeizure show that RaSeformer achieves strong performance and provides a competitive reference model for this challenging task. We also establish standardized dataset splits and evaluation protocols to support reproducible benchmarking.
[114] Can We Change the Stroke Size for Easier Diffusion?
Yunwei Bai,Ying Kiat Tan,Yao Shu,Tsuhan Chen
Main category: cs.CV
TL;DR: This paper proposes stroke-size control as a method to address the low signal-to-noise challenge in diffusion models by adjusting the effective roughness of targets, predictions, and perturbations across timesteps.
Details
Motivation: Diffusion models struggle in low signal-to-noise regimes where pixel-level predictions must be made under high noise; the authors draw an analogy to using overly fine brush strokes in oil painting, suggesting a need for adaptive 'stroke size'.
Method: The authors introduce stroke-size control as a controlled intervention that modulates the effective roughness of supervised targets, model predictions, and perturbations across diffusion timesteps.
Result: Theoretical and empirical analyses demonstrate advantages and trade-offs of stroke-size control in easing the low signal-to-noise challenge.
Conclusion: Stroke-size control offers a principled way to adapt diffusion modeling to varying noise levels, improving robustness and performance in challenging regimes.
Abstract: Diffusion models can be challenged in the low signal-to-noise regime, where they have to make pixel-level predictions despite the presence of high noise. The geometric intuition is akin to using the finest stroke for oil painting throughout, which may be ineffective. We therefore study stroke-size control as a controlled intervention that changes the effective roughness of the supervised target, predictions and perturbations across timesteps, in an attempt to ease the low signal-to-noise challenge. We analyze the advantages and trade-offs of the intervention both theoretically and empirically. Code will be released.
[115] HighlightBench: Benchmarking Markup-Driven Table Reasoning in Scientific Documents
Lexin Wang,Shenghua Liu,Yiwei Wang,Yujun Cai,Yuyao Ge,Jiayu Yao,Jiafeng Guo,Xueqi Cheng
Main category: cs.CV
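TL;DR: This paper introduces HighlightBench, a diagnostic benchmark for evaluating table understanding driven by visual markup (such as highlights and underlines); it decomposes evaluation into five task families and provides an interpretable reference pipeline for locating model errors along the perception-to-execution chain.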
Details
Motivation: Multimodal large language models have made progress in document understanding, but their ability to treat visual markup as logical directives remains under-explored; current evaluations also cannot tell whether a model failed to see the markup or failed to use it, which leaves a blind spot.
Method: Builds the HighlightBench benchmark, covering five task families (Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, Consistency & Missingness), and designs a reference pipeline that makes intermediate decisions explicit, supporting reproducible baselines and fine-grained error attribution.
Result: Experiments show that even strong multimodal models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
Conclusion: HighlightBench fills the evaluation gap for markup-driven table understanding and shows that current MLLMs still face a key bottleneck in using visual markup along the perception-reasoning-execution chain.
Abstract: Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and finer-grained attribution of errors along the perception-to-execution chain. Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
[116] Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval
Xintao Zong,Xian Zhong,Wenxuan Liu,Jianhao Ding,Zhaofei Yu,Tiejun Huang
Main category: cs.CV
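TL;DR: This paper proposes a brain-inspired Cross-Modal Spike Fusion network (CMSF), the first application of a directly trained spiking neural network (SNN) to image-text retrieval (ITR); with only two time steps it achieves retrieval accuracy above mainstream artificial neural networks (ANNs), with very low energy consumption and high speed.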
Details
Motivation: Existing ANN-based image-text retrieval methods overlook cross-modal interaction, retrieval latency, and energy efficiency, while building directly trained, high-performance, low-power SNNs for multimodal tasks remains very challenging.
Method: Proposes the Cross-Modal Spike Fusion network (CMSF), which fuses unimodal visual and textual features at the spike level to produce enhanced multimodal representations; these act as soft supervisory signals that in turn refine the unimodal spike embeddings and mitigate semantic loss.
Result: With only two time steps, CMSF reaches state-of-the-art accuracy on image-text retrieval while substantially lowering energy consumption and raising retrieval speed, surpassing the best current ANN methods.
Conclusion: CMSF is an important step toward multimodal SNNs, offering a brain-inspired unified framework for spiking networks in multimodal learning that combines temporal dynamics with cross-modal alignment.
Abstract: Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at https://github.com/zxt6174/CMSF.
[117] Confidence Matters: Uncertainty Quantification and Precision Assessment of Deep Learning-based CMR Biomarker Estimates Using Scan-rescan Data
Dewmini Hasara Wickremasinghe,Michelle Gibogwe,Andrew Bell,Esther Puyol-Antón,Muhummad Sohaib Nazir,Reza Razavi,Bruno Paun,Paul Aljabar,Andrew P. King
Main category: cs.CV
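TL;DR: This paper studies how to assess the precision of deep learning methods for cardiovascular magnetic resonance (CMR) analysis, arguing that relying on accuracy alone (e.g., the Dice score) can mask insufficient precision; by applying uncertainty estimation techniques (deep ensembles, test-time augmentation, Monte Carlo dropout) and newly proposed distribution-based metrics, it finds that despite strong point estimates, scan-rescan confidence-interval overlap is low and statistical differences are significant, so conventional metrics do not reflect true repeatability.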
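The scan-rescan confidence-interval overlap reported for this entry can be illustrated with a minimal sketch. The digest does not give the paper's exact metric, so the normal-approximation interval, the "intersection over shorter interval" overlap definition, and all function names below are assumptions made for illustration only.

```python
from statistics import mean, stdev


def confidence_interval(samples, z=1.96):
    """Normal-approximation interval for the mean of repeated predictions
    (e.g. one biomarker value per ensemble member). Assumed form."""
    m = mean(samples)
    half_width = z * stdev(samples) / len(samples) ** 0.5
    return m - half_width, m + half_width


def overlap_fraction(ci_a, ci_b):
    """Length of the intersection divided by the length of the shorter
    interval (one plausible overlap definition, not the paper's)."""
    lo = max(ci_a[0], ci_b[0])
    hi = min(ci_a[1], ci_b[1])
    shorter = min(ci_a[1] - ci_a[0], ci_b[1] - ci_b[0])
    if shorter <= 0:
        return 0.0
    return max(0.0, hi - lo) / shorter


def share_above(interval_pairs, threshold=0.5):
    """Share of scan/rescan pairs whose overlap exceeds the threshold."""
    hits = sum(1 for a, b in interval_pairs if overlap_fraction(a, b) > threshold)
    return hits / len(interval_pairs)
```

Under these assumptions, a statement such as "overlap was >50% in less than 45% of the cases" corresponds to `share_above(pairs, 0.5) < 0.45` over all scan/rescan pairs.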
Details
Motivation: Existing deep learning methods for CMR analysis usually focus only on accuracy and neglect precision, particularly scan-rescan agreement, which can lead to overestimated reliability in clinical use.
Method: Applies three uncertainty estimation techniques (deep ensembles, test-time augmentation, Monte Carlo dropout) to a state-of-the-art DL pipeline for cardiac functional biomarker estimation, proposes new precision metrics based on the predictive distribution, and validates them on two external scan-rescan CMR datasets.
Result: Point estimates are excellent (average Dice of 87%), but distribution-based metrics show that scan/rescan confidence intervals overlap by more than 50% in fewer than 45% of cases, and scan and rescan biomarkers differ significantly in over 65% of cases.
Conclusion: Point-estimate metrics alone give a misleading picture of model precision; distribution-based metrics that reflect predictive uncertainty and repeatability are needed to gauge clinical suitability more realistically.
Abstract: The performance of deep learning (DL) methods for the analysis of cine cardiovascular magnetic resonance (CMR) is typically assessed in terms of accuracy, overlooking precision. In this work, uncertainty estimation techniques, namely deep ensemble, test-time augmentation, and Monte Carlo dropout, are applied to a state-of-the-art DL pipeline for cardiac functional biomarker estimation, and new distribution-based metrics are proposed for the assessment of biomarker precision. The model achieved high accuracy (average Dice 87%) and point estimate precision on two external validation scan-rescan CMR datasets. However, distribution-based metrics showed that the overlap between scan/rescan confidence intervals was >50% in less than 45% of the cases. Statistical similarity tests between scan and rescan biomarkers also resulted in significant differences for over 65% of the cases. We conclude that, while point estimate metrics might suggest good performance, distributional analyses reveal lower precision, highlighting the need to use more representative metrics to assess scan-rescan agreement.
[118] Elucidating the Design Space of Flow Matching for Cellular Microscopy
Charles Jones,Emmanuel Noutahi,Jason Hartford,Cian Eastwood
Main category: cs.CV
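TL;DR: This paper systematically analyzes the design space of flow-matching generative models for cell microscopy images, proposes a simple, stable, and scalable recipe, and builds a foundation model two orders of magnitude larger than prior methods that clearly outperforms them on FID and KID; fine-tuning with pre-trained molecular embeddings then gives state-of-the-art simulation of responses to unseen molecular perturbations.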
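For readers unfamiliar with flow matching, the following is a minimal sketch of the common linear-path training target. It is a generic textbook formulation, assumed here for illustration; the digest does not say which path, loss weighting, or parameterization the authors' recipe uses, and the function names are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Linear-path flow matching (assumed variant).

    x0: noise sample, x1: data sample, t in [0, 1].
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def velocity_mse(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)
```

In training, a network would receive `x_t` and `t` (plus any conditioning, such as a perturbation embedding) and be optimized so that its output minimizes `velocity_mse` against the target velocity.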
Details
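Motivation: Flow-matching generative models are increasingly important for simulating cell responses to biological perturbations, but their design space is large and has not been explored systematically.
Method: Systematically analyzes the design space of flow-matching models for cell microscopy images, drops redundant techniques, proposes a simple, stable, and scalable recipe, builds a large-scale foundation model, and then fine-tunes it with pre-trained molecular embeddings.
Result: The model is two orders of magnitude larger than prior methods and achieves a two-fold FID and ten-fold KID improvement; after fine-tuning it reaches SOTA performance on simulating responses to unseen molecular perturbations.
Conclusion: Many commonly used techniques are not only unnecessary for this task but hurt performance; a lean, scalable design markedly improves the quality and generalization of flow-matching models for cell image generation.
Abstract: Flow-matching generative models are increasingly used to simulate cell responses to biological perturbations. However, the design space for building such models is large and underexplored. We systematically analyse the design space of flow matching models for cell-microscopy images, finding that many popular techniques are unnecessary and can even hurt performance. We develop a simple, stable, and scalable recipe which we use to train our foundation model. We scale our model to two orders of magnitude larger than prior methods, achieving a two-fold FID and ten-fold KID improvement over prior methods. We then fine-tune our model with pre-trained molecular embeddings to achieve state-of-the-art performance simulating responses to unseen molecules. Code is available at https://github.com/valence-labs/microscopy-flow-matching
[119] PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI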
Hayder Saad Abdulbaqi,Mohammed Hadi Rahim,Mohammed Hassan Hadi,Haider Ali Aboud,Ali Hussein Allawi
Main category: cs.CV
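TL;DR: PhyDCM is an open-source framework for MRI brain tumor classification that combines a MedViT-based hybrid architecture, standardized DICOM processing, and an interactive visualization interface; it emphasizes modularity, reproducibility, and extensibility and reaches over 93% accuracy across multiple datasets.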
Details
Motivation: To address the closed nature, poor reproducibility, and limited extensibility of existing deep learning solutions, and to cope with the diagnostic challenges created by rapidly growing MRI data volumes.
Method: Proposes the open-source PhyDCM framework, which integrates a MedViT-based hybrid classification architecture, supports standardized DICOM preprocessing (intensity rescaling and limited data augmentation), uses a modular design that separates computational logic from the graphical interface, and provides multi-planar reconstruction and structured outputs.
Result: On BRISC2025 and several Kaggle MRI datasets (FigShare, SARTAJ, Br35H), it achieves stable accuracy above 93% across classification categories and supports standardized outputs and multi-modality extension.
Conclusion: Through open-source, modular, and standardized design, PhyDCM provides a practical, transparent, and extensible foundation for reproducible AI-driven medical image analysis.
Abstract: MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.
[120] Deep Learning Aided Vision System for Planetary Rovers
Lomash Relia,Jai G Singla,Amitabh,Nitant Dube
Main category: cs.CV
TL;DR: This paper proposes a vision system for planetary rovers that combines real-time perception (CLAHE-enhanced stereo, YOLOv11n detection, neural distance estimation) with offline terrain reconstruction (Depth Anything V2 + Open3D point cloud fusion), achieving high accuracy (2.26 cm median depth error) and efficiency on lunar imagery.
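The headline number for this entry is a median depth error within a fixed range. A minimal sketch of how such a figure could be computed is shown below; the function name, the choice to filter by the reference distance, and the use of absolute error are assumptions, since the digest does not describe the evaluation code.

```python
from statistics import median


def median_abs_depth_error_cm(predicted_m, reference_m, near_m=1.0, far_m=10.0):
    """Median absolute error in centimetres over points whose reference
    distance (in metres) lies inside [near_m, far_m]. Assumed protocol."""
    errors_cm = [
        abs(p - r) * 100.0
        for p, r in zip(predicted_m, reference_m)
        if near_m <= r <= far_m
    ]
    if not errors_cm:
        raise ValueError("no reference distances inside the evaluation range")
    return median(errors_cm)
```

The median is less sensitive than the mean to a few large outliers, which may be one reason to report it for distance estimates on sparse lunar scenes.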
Details
Motivation: To provide a scalable, compute-efficient vision solution for autonomous planetary exploration that bridges real-time perception and high-fidelity terrain reconstruction.
Method: Real-time module: CLAHE-enhanced stereo imagery, YOLOv11n for object detection, and a neural network for distance estimation. Offline module: Depth Anything V2 for monocular depth estimation, fused into dense point clouds using Open3D.
Result: Neural network achieves 2.26 cm median depth error (1–10 m range) on Chandrayaan-3 NavCam data; YOLOv11n shows balanced precision-recall on grayscale lunar scenes.
Conclusion: The integrated architecture delivers reliable metric context in real time and qualitative reconstructions offline, offering an effective, scalable vision solution for planetary rovers.
Abstract: This study presents a vision system for planetary rovers, combining real-time perception with offline terrain reconstruction. The real-time module integrates CLAHE enhanced stereo imagery, YOLOv11n based object detection, and a neural network to estimate object distances. The offline module uses the Depth Anything V2 metric monocular depth estimation model to generate depth maps from captured images, which are fused into dense point clouds using Open3D. Real world distance estimates from the real time pipeline provide reliable metric context alongside the qualitative reconstructions. Evaluation on Chandrayaan 3 NavCam stereo imagery, benchmarked against a CAHV based utility, shows that the neural network achieves a median depth error of 2.26 cm within a 1 to 10 meter range. The object detection model maintains a balanced precision recall tradeoff on grayscale lunar scenes. This architecture offers a scalable, compute-efficient vision solution for autonomous planetary exploration.
[121] The Language of Touch: Translating Vibrations into Text with Dual-Branch Learning
Jin Chen,Yifeng Lin,Chao Zeng,Si Wu,Tiesong Zhao
Main category: cs.CV
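TL;DR: This paper is the first to pose the task of captioning tactile signals (vibrotactile captioning) and proposes ViPAC, which uses a dual-branch structure to separate periodic and aperiodic components together with a dynamic fusion mechanism, an orthogonality constraint, and weighting regularization to improve feature complementarity and fusion consistency; it also builds LMT108-CAP, the first vibrotactile-text paired dataset, and clearly outperforms audio and image captioning baselines in experiments.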
Details
Motivation: Although IEEE P1918.1 has advanced the standardization of vibrotactile data, semantic interpretation and understanding of these signals remain unsolved; this paper makes the first attempt to generate natural language descriptions from vibrotactile signals (vibrotactile captioning).
Method: Proposes ViPAC, which uses a dual-branch strategy to model periodic and aperiodic components separately, adds a dynamic fusion mechanism that adaptively integrates features, and applies an orthogonality constraint and weighting regularization to keep features complementary and fusion consistent; it also builds LMT108-CAP, the first vibrotactile-text paired dataset, from LMT-108 by using GPT-4o to generate five constrained captions per surface image.
Result: ViPAC clearly outperforms baselines adapted from audio and image captioning on vibrotactile captioning, with higher lexical fidelity and better semantic alignment.
Conclusion: The paper defines and addresses the vibrotactile captioning problem for the first time; the ViPAC framework and the LMT108-CAP dataset provide a new approach and foundational resources for tactile semantic understanding and multimodal human-computer interaction.
Abstract: The standardization of vibrotactile data by IEEE P1918.1 workgroup has greatly advanced its applications in virtual reality, human-computer interaction and embodied artificial intelligence. Despite these efforts, the semantic interpretation and understanding of vibrotactile signals remain an unresolved challenge. In this paper, we make the first attempt to address vibrotactile captioning, i.e., generating natural language descriptions from vibrotactile signals. We propose Vibrotactile Periodic-Aperiodic Captioning (ViPAC), a method designed to handle the intrinsic properties of vibrotactile data, including hybrid periodic-aperiodic structures and the lack of spatial semantics. Specifically, ViPAC employs a dual-branch strategy to disentangle periodic and aperiodic components, combined with a dynamic fusion mechanism that adaptively integrates signal features. It also introduces an orthogonality constraint and weighting regularization to ensure feature complementarity and fusion consistency. Additionally, we construct LMT108-CAP, the first vibrotactile-text paired dataset, using GPT-4o to generate five constrained captions per surface image from the popular LMT-108 dataset. Experiments show that ViPAC significantly outperforms the baseline methods adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment.
[122] Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
Qi Zhang,Denis Rozumny,Francesco Girlanda,Sezer Karaoglu,Marc Pollefeys,Theo Gevers,Martin R. Oswald
Main category: cs.CV
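TL;DR: Unblur-SLAM is a new RGB SLAM method that produces sharp 3D reconstructions from blurry images; it adapts to both motion blur and defocus blur, and its staged strategy (deblur first and then optimize, falling back to modeling the blur process directly when deblurring fails) improves pose estimation and the quality of geometry and texture reconstruction.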
Details
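Motivation: Existing SLAM methods struggle with inputs that contain several kinds of blur (such as motion blur and defocus blur), which degrades reconstruction quality; a method is needed that adapts to different blur types while balancing accuracy and efficiency.
Method: A two-stage framework. In the first stage, a feed-forward deblurring network with a custom training scheme preprocesses the images; frames that are deblurred successfully go through local-global multi-view optimization and loop closure. Frames that fail skip deblurring and are modeled directly on the global 3D Gaussian Splatting (3DGS) representation with an additional blur network, which models the formation of blur from multiple sub-frames to recover sharp details and sub-frame poses. Computation is adjusted to the amount of blur in the input.
Result: On several real-world datasets, Unblur-SLAM improves on existing methods in both pose estimation accuracy and sharp reconstruction of geometry and texture, with consistent gains under motion blur and defocus blur.
Conclusion: By combining learned deblurring with blur-aware modeling on 3DGS, Unblur-SLAM achieves high-performance SLAM on blurry inputs and offers a new approach to robust 3D reconstruction in realistic blurry scenes.
Abstract: We propose Unblur-SLAM, a novel RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image. As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules. Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur formation process in 3D space, thereby learning sharp details and refined sub-frame poses. Experiments on several real-world datasets demonstrate consistent improvements in both pose estimation and sharp reconstruction results of geometry and texture.
[123] Implicit neural representations for larval zebrafish brain microscopy: a reproducible benchmark on the MapZebrain atlas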
Agnieszka Pregowska
Main category: cs.CV
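TL;DR: This paper presents a reproducible implicit neural representation (INR) benchmark for the high-resolution larval zebrafish brain atlas (MapZebrain), comparing four encodings (SIREN, Fourier features, Haar positional encoding, and a multi-resolution grid) on reconstruction accuracy and boundary preservation, and finds that Haar and Fourier encodings best preserve fine neural fiber structure.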
Details
Motivation: Implicit neural representations (INRs) lack a reproducible evaluation standard for high-resolution larval zebrafish microscopy, where accurately preserving neuropil boundaries and fine neuronal processes is critical for neuroanatomical research.
Method: Using a unified, seed-controlled protocol, compares SIREN, Fourier features, Haar positional encoding, and a multi-resolution grid on 950 grayscale microscopy images (atlas slices and single-neuron projections); images are normalized with per-image 1%-99% percentiles, and spatial generalization is tested with a deterministic 40% column-wise hold-out along the X-axis.
Result: Haar and Fourier encodings achieve the highest macro-averaged reconstruction fidelity on held-out columns (about 26 dB); SSIM and edge-focused error analysis further indicate better boundary preservation. SIREN is weaker in macro averages but remains competitive on area-weighted micro averages.
Conclusion: Explicit spectral and multiscale encodings (such as Haar and Fourier) capture high-frequency neuroanatomical detail better than smoother-bias methods (such as SIREN); Haar and Fourier suit boundary-sensitive tasks (atlas registration, label transfer, morphology-preserving sharing), while SIREN suits use as a lightweight baseline for background modelling or denoising.
Abstract: Implicit neural representations (INRs) offer continuous coordinate-based encodings for atlas registration, cross-modality resampling, sparse-view completion, and compact sharing of neuroanatomical data. Yet reproducible evaluation is lacking for high-resolution larval zebrafish microscopy, where preserving neuropil boundaries and fine neuronal processes is critical. We present a reproducible INR benchmark for the MapZebrain larval zebrafish brain atlas. Using a unified, seed-controlled protocol, we compare SIREN, Fourier features, Haar positional encoding, and a multi-resolution grid on 950 grayscale microscopy images, including atlas slices and single-neuron projections. Images are normalized with per-image (1,99) percentiles estimated from 10% of pixels in non-held-out columns, and spatial generalization is tested with a deterministic 40% column-wise hold-out along the X-axis. Haar and Fourier achieve the strongest macro-averaged reconstruction fidelity on held-out columns (about 26 dB), while the grid is moderately behind. SIREN performs worse in macro averages but remains competitive on area-weighted micro averages in the all-in-one regime. SSIM and edge-focused error further show that Haar and Fourier preserve boundaries more accurately. These results indicate that explicit spectral and multiscale encodings better capture high-frequency neuroanatomical detail than smoother-bias alternatives.
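The evaluation protocol in this abstract (percentile normalization from training pixels, a deterministic column hold-out, and a dB fidelity score) can be sketched as follows. The abstract does not say which columns are held out, how the 10% pixel sample is drawn, or whether values are clipped, so the contiguous right-hand block, the every-tenth-pixel sample, the clipping, and all function names are assumptions.

```python
import math


def held_out_columns(width, fraction=0.4):
    """Assumed rule: hold out a contiguous block of right-most columns."""
    count = round(width * fraction)
    return set(range(width - count, width))


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def normalize(image, held_out, p_lo=1, p_hi=99, stride=10):
    """Scale an image (list of rows) to [0, 1] using percentiles estimated
    from every `stride`-th pixel of the non-held-out columns."""
    train = [row[x] for row in image for x in range(len(row)) if x not in held_out]
    sample = train[::stride]  # roughly 10% of training pixels, deterministic
    low, high = percentile(sample, p_lo), percentile(sample, p_hi)
    scale = (high - low) or 1.0
    # Clipping to [0, 1] is an assumption, not stated in the abstract.
    return [[min(1.0, max(0.0, (v - low) / scale)) for v in row] for row in image]


def psnr_on_columns(pred, target, columns, peak=1.0):
    """PSNR in dB restricted to the given columns."""
    sq = [(p[x] - t[x]) ** 2 for p, t in zip(pred, target) for x in columns]
    mse = sum(sq) / len(sq)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

Estimating the percentiles only from non-held-out columns keeps the normalization from leaking information about the test region into training.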
For MapZebrain workflows, Haar and Fourier are best suited to boundary-sensitive tasks such as atlas registration, label transfer, and morphology-preserving sharing, while SIREN remains a lightweight baseline for background modelling or denoising.
[124] arg-VU: Affordance Reasoning with Physics-Aware 3D Geometry for Visual Understanding in Robotic Surgery
Nan Xiao,Yunxin Fan,Farong Wang,Fei Liu
Main category: cs.CV
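TL;DR: This paper proposes the arg-VU framework, which combines 3D Gaussian Splatting reconstruction with Extended Position-Based Dynamics (XPBD) modeling to achieve physics-aware affordance reasoning over deformable tissue in surgical scenes, giving more stable, physically consistent, and interpretable predictions.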
Details
Motivation: In surgical robotics, tissues are highly deformable, compliant, and dynamically coupled with tool motion; existing methods do not model physical constraints, which makes affordance reasoning unreliable.
Method: Reconstructs surgical scenes with 3D Gaussian Splatting (3DGS) and produces a temporally consistent surface representation; uses Extended Position-Based Dynamics (XPBD) to embed local deformation constraints and generate representative geometry points (RGPs), whose constraint sensitivities define anisotropic stiffness metrics; incorporates SE(3) tool poses to compute rigidly induced displacements, from which a physics-aware compliance energy and a positional agreement score are derived.
Result: On surgical video datasets, arg-VU gives more stable, physically consistent, and interpretable affordance predictions than kinematic baselines.
Conclusion: Physics-aware geometric representations support reliable affordance reasoning in deformable surgical environments and provide a basis for embodied robotic interaction.
Abstract: Affordance reasoning provides a principled link between perception and action, yet remains underexplored in surgical robotics, where tissues are highly deformable, compliant, and dynamically coupled with tool motion. We present arg-VU, a physics-aware affordance reasoning framework that integrates temporally consistent geometry tracking with constraint-induced mechanical modeling for surgical visual understanding. Surgical scenes are reconstructed using 3D Gaussian Splatting (3DGS) and converted into a temporally tracked surface representation. Extended Position-Based Dynamics (XPBD) embeds local deformation constraints and produces representative geometry points (RGPs) whose constraint sensitivities define anisotropic stiffness metrics capturing the local constraint-manifold geometry. Robotic tool poses in SE(3) are incorporated to compute rigidly induced displacements at RGPs, from which we derive two complementary measures: a physics-aware compliance energy that evaluates mechanical feasibility with respect to local deformation constraints, and a positional agreement score that captures motion alignment (as kinematic motion baseline). Experiments on surgical video datasets show that arg-VU yields more stable, physically consistent, and interpretable affordance predictions than kinematic baselines. These results demonstrate that physics-aware geometric representations enable reliable affordance reasoning for deformable surgical environments and support embodied robotic interaction.
[125] Envisioning global urban development with satellite imagery and generative AI
Kailai Sun,Yuebing Liang,Mingyi He,Yunhan Zheng,Alok Prakash,Shenhao Wang,Jinhua Zhao,Alex "Sandy" Pentland
Main category: cs.CV
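TL;DR: This study proposes a multimodal generative AI framework for envisioning sustainable urban development at a global scale; it generates high-fidelity, diverse, and realistic urban satellite imagery and supports goal-directed urban planning and cross-city knowledge transfer.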
Details
Motivation: Past studies have mostly treated urban development as a prediction task and have not reflected its generative nature; a generative approach that reflects how cities develop is needed to support sustainable planning.
Method: Designs a multimodal generative AI framework that combines text prompts with geospatial controls and is trained on remote sensing data from the 500 largest metropolitan areas worldwide; it supports conditional generation, environment-aware redevelopment, and cross-city style transfer, and extracts latent representations of urban form.
Result: The framework generates high-quality, controllable, and diverse urban satellite imagery; transfers styles across cities; provides latent representations that improve downstream tasks such as carbon emission prediction; and produces images that human experts rate as comparable to real ones.
Conclusion: The framework offers a new tool for accelerated and scenario-based planning for cities worldwide and advances the use of generative AI in sustainable urban development.
Abstract: Urban development has been a defining force in human history, shaping cities for centuries. However, past studies mostly analyze such development as predictive tasks, failing to reflect its generative nature. Therefore, this study designs a multimodal generative AI framework to envision sustainable urban development at a global scale. By integrating prompts and geospatial controls, our framework can generate high-fidelity, diverse, and realistic urban satellite imagery across the 500 largest metropolitan areas worldwide. It enables users to specify urban development goals, creating new images that align with them while offering diverse scenarios whose appearance can be controlled with text prompts and geospatial constraints. It also facilitates urban redevelopment practices by learning from the surrounding environment. Beyond visual synthesis, we find that it encodes and interprets latent representations of urban form for global cross-city learning, successfully transferring styles of urban environments across a global spatial network. The latent representations can also enhance downstream prediction tasks such as carbon emission prediction. Further, human expert evaluation confirms that our generated urban images are comparable to real urban images. Overall, this study presents innovative approaches for accelerated urban planning and supports scenario-based planning processes for worldwide cities.
[126] Dual-View Optical Flow for 4D Micro-Expression Recognition - A Multi-Stream Fusion Attention Approach
Luu Tu Nguyen,Thi Bich Phuong Man,Vu Tram Anh Khuong,Thanh Ha Le,Thi Duyen Ngo
Main category: cs.CV
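TL;DR: This paper proposes a dual-view optical flow method for 4D micro-expression recognition, which achieves high-accuracy classification through phase-aware optical flow extraction and the Triple-Stream MicroAttNet network and reaches SOTA performance on the 4DME dataset.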
Details
Motivation: Micro-expression recognition is very challenging because the facial motions are extremely brief and low in intensity and because 4D mesh data is high-dimensional and complex.
Method: Captures each micro-expression sequence from two synchronized viewpoints and computes optical flow to represent motion; applies view separation, sequence-wise face cropping, and automatic apex-frame detection; decomposes each sequence into onset-apex and apex-offset phases and extracts horizontal, vertical, and magnitude flow channels for each; feeds these into Triple-Stream MicroAttNet, which combines a fusion attention module with a squeeze-and-excitation block; trains with focal loss and the Adam optimizer.
Result: On the multi-label 4DME dataset (24 subjects, five emotion categories), the method reaches a macro-UF1 of 0.536, more than 50% above the official baseline, and takes first place in the 4DMR IJCAI Workshop Challenge 2025; ablations show that the fusion attention and SE modules each contribute up to 3.6 points of UF1.
Conclusion: Dual-view, phase-aware optical flow modeling combined with multi-stream feature fusion gives a robust and interpretable solution for 4D micro-expression recognition.
Abstract: Micro-expression recognition is vital for affective computing but remains challenging due to the extremely brief, low-intensity facial motions involved and the high-dimensional nature of 4D mesh data. To address these challenges, we introduce a dual-view optical flow approach that simplifies mesh processing by capturing each micro-expression sequence from two synchronized viewpoints and computing optical flow to represent motion. Our pipeline begins with view separation and sequence-wise face cropping to ensure spatial consistency, followed by automatic apex-frame detection based on peak motion intensity in both views. We decompose each sequence into onset-apex and apex-offset phases, extracting horizontal, vertical, and magnitude flow channels for each phase. These are fed into our Triple-Stream MicroAttNet, which employs a fusion attention module to adaptively weight modality-specific features and a squeeze-and-excitation block to enhance magnitude representations. Training uses focal loss to mitigate class imbalance and the Adam optimizer with early stopping. Evaluated on the multi-label 4DME dataset, comprising 24 subjects and five emotion categories, in the 4DMR IJCAI Workshop Challenge 2025, our method achieves a macro-UF1 score of 0.536, outperforming the official baseline by over 50% and securing first place. Ablation studies confirm that both the fusion attention and SE components each contribute up to 3.6 points of UF1 gain.
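The macro-UF1 score quoted above is an unweighted mean of per-class F1 scores. A minimal multi-label sketch follows; representing labels as sets of class indices and scoring an empty class as 0.0 are assumptions, and the challenge's official scorer may handle edge cases differently.

```python
def macro_uf1(true_labels, predicted_labels, num_classes):
    """Unweighted (macro) F1 over classes for multi-label data.

    true_labels / predicted_labels: one set of class indices per sample.
    """
    f1_scores = []
    for c in range(num_classes):
        tp = fp = fn = 0
        for truth, pred in zip(true_labels, predicted_labels):
            if c in truth and c in pred:
                tp += 1
            elif c in pred:
                fp += 1
            elif c in truth:
                fn += 1
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no support and no predictions scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes
```

Because every class counts equally, rare emotion categories influence macro-UF1 as much as frequent ones, which fits the class-imbalance concern raised in the abstract.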
These results demonstrate that dual-view, phase-aware optical flow combined with multi-stream fusion yields a robust and interpretable solution for 4D micro-expression recognition.
[127] Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation
Dongsheng Yang,Yinfeng Yu,Liejun Wang
Main category: cs.CV
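TL;DR: This paper proposes BTK, a vision-and-language navigation (VLN) framework that fuses environment-specific textual knowledge with generative image knowledge bases, markedly improving semantic understanding and cross-modal alignment and yielding gains on both the R2R and REVERIE datasets.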
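The gains for this entry are reported as success rate (SR) and success weighted by path length (SPL). The sketch below uses the standard SPL definition from the embodied navigation literature; the episode dictionary keys are hypothetical names chosen for illustration, and whether the authors' evaluation code matches this exactly is not stated in the digest.

```python
def success_rate_and_spl(episodes):
    """Compute SR and SPL over a list of navigation episodes.

    Each episode is a dict with hypothetical keys:
      "success":  bool, whether the agent stopped at the goal
      "shortest": shortest-path length from start to goal (> 0)
      "taken":    length of the path the agent actually followed
    SPL = mean over episodes of success * shortest / max(taken, shortest).
    """
    n = len(episodes)
    sr = sum(1 for e in episodes if e["success"]) / n
    spl = sum(
        e["shortest"] / max(e["taken"], e["shortest"])
        for e in episodes
        if e["success"]
    ) / n
    return sr, spl
```

SPL never exceeds SR: a successful episode contributes 1 only when the agent follows a shortest path, and failures contribute 0 to both.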
Details
Motivation: Existing VLN methods struggle to capture key semantic cues and align them accurately with visual observations, which limits navigation performance.
Method: Proposes the BTK framework, which uses Qwen3-4B to extract goal-related phrases, Flux-Schnell to build two image knowledge bases (R2R-GP and REVERIE-GP), and BLIP-2 to build a textual knowledge base from panoramic views, and fuses this multimodal knowledge through the Goal-Aware Augmentor and Knowledge Augmentor.
Result: On the test unseen splits of R2R and REVERIE, success rate (SR) rises by 5% and 2.07% respectively, and success weighted by path length (SPL) rises by 4% and 3.69% respectively.
Conclusion: By jointly integrating environment-specific textual knowledge and generative image knowledge, BTK strengthens semantic grounding and cross-modal alignment and offers a new approach to VLN.
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at https://github.com/yds3/IPM-BTK/.
[128] LACON: Training Text-to-Image Model from Uncurated Data
Zhiyang Liang,Ziyu Wan,Hongyu Liu,Dong Chen,Qiu Shen,Hao Zhu,Dongdong Chen
Main category: cs.CV
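TL;DR: This paper proposes the LACON framework, which treats quality signals (such as aesthetic scores and watermark probabilities) as explicit condition labels and trains on the full quality spectrum of uncurated data; it shows that low-quality data has latent value and outperforms baselines trained only on filtered data under the same compute budget.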
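The contrast between filtering on quality signals and conditioning on them can be sketched as below. Everything specific here is an assumption for illustration: the digest does not say how LACON encodes its labels, so the 0 to 10 aesthetic scale, the ten-bin discretization, the field names, and the filter threshold are all hypothetical.

```python
def quality_condition(aesthetic_score, watermark_prob, bins=10):
    """Turn raw quality signals into discrete condition labels.

    Assumes an aesthetic score on a 0-10 scale and a watermark
    probability in [0, 1]; both are bucketed into `bins` levels.
    """
    a_bin = min(bins - 1, max(0, int(aesthetic_score * bins / 10.0)))
    w_bin = min(bins - 1, max(0, int(watermark_prob * bins)))
    return {"aesthetic_bin": a_bin, "watermark_bin": w_bin}


def filter_first(samples, min_aesthetic=5.0):
    """Conventional curation: drop low-scoring samples before training."""
    return [s for s in samples if s["aesthetic"] >= min_aesthetic]


def label_and_condition(samples):
    """Keep every sample and attach its quality labels as a condition."""
    return [
        {"caption": s["caption"],
         "condition": quality_condition(s["aesthetic"], s["watermark"])}
        for s in samples
    ]
```

Under this reading, a model trained on the conditioned set sees both ends of the quality range, and at generation time one would request the high-quality end of the label space (for example the top aesthetic bin and the lowest watermark bin).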
Details
Motivation: Questions whether discarding low-quality data under the current filter-then-train paradigm for text-to-image generation is justified, and asks whether uncurated data holds untapped potential.
Method: Proposes the LACON (Labeling-and-Conditioning) training framework, which does not filter data; quality signals such as aesthetic scores and watermark probabilities serve as explicit, quantitative condition labels so that the generative model learns the full quality distribution from bad to good.
Result: Under the same compute budget, LACON generates clearly better images than baselines trained only on filtered data, which supports the value of uncurated data, including low-quality data.
Conclusion: Low-quality raw data is not useless; the quality distribution it carries can be exploited through explicit modeling, and discarding it wastes resources. LACON offers a new approach to more efficient and robust generative model training.
Abstract: The success of modern text-to-image generation is largely attributed to massive, high-quality datasets. Currently, these datasets are curated through a filter-first paradigm that aggressively discards low-quality raw data based on the assumption that it is detrimental to model performance. Is the discarded bad data truly useless, or does it hold untapped potential? In this work, we critically re-examine this question. We propose LACON (Labeling-and-Conditioning), a novel training framework that exploits the underlying uncurated data distribution. Instead of filtering, LACON re-purposes quality signals, such as aesthetic scores and watermark probabilities, as explicit, quantitative condition labels. The generative model is then trained to learn the full spectrum of data quality, from bad to good. By learning the explicit boundary between high- and low-quality content, LACON achieves superior generation quality compared to baselines trained only on filtered data using the same compute budget, proving the significant value of uncurated data.
[129] TTE-CAM: Built-in Class Activation Maps for Test-Time Explainability in Pretrained Black-Box CNNs
Kerol Djoumessi,Philipp Berens
Main category: cs.CV
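TL;DR: This paper proposes TTE-CAM, a framework that converts a pretrained black-box CNN into a self-explainable model at test time while keeping both predictive performance and explanation faithfulness.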
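Why a convolutional head initialized from the original classifier weights can preserve predictions is easiest to see through the classic class activation map identity: global average pooling followed by a linear layer gives the same logits as a 1x1 convolution followed by global average pooling. The sketch below demonstrates that identity in plain Python. It assumes the original head is exactly "global average pooling then linear", and it is not claimed to be the TTE-CAM implementation, whose details are not given in this digest.

```python
def gap_linear_logits(features, weights, bias):
    """Original head: global average pooling, then a linear layer.

    features: C x H x W nested lists; weights: K x C; bias: length K.
    """
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    return [sum(w * p for w, p in zip(wk, pooled)) + b for wk, b in zip(weights, bias)]


def conv1x1_class_maps(features, weights):
    """Replacement head: a 1x1 convolution that reuses the linear weights,
    producing one spatial evidence map per class (K x H x W)."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [[sum(wk[c] * features[c][i][j] for c in range(channels))
          for j in range(width)]
         for i in range(height)]
        for wk in weights
    ]


def logits_from_maps(class_maps, bias):
    """Average each class map over space and add the bias."""
    return [
        sum(sum(row) for row in m) / (len(m) * len(m[0])) + b
        for m, b in zip(class_maps, bias)
    ]
```

Because both operations are linear, the logits agree up to floating-point rounding, while the class maps expose where the evidence for each class comes from.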
Details
Motivation: CNNs perform very well in medical image analysis but lack interpretability, which limits their use in high-stakes clinical settings; existing methods struggle to deliver both faithful explanations and strong predictive performance.
Method: Proposes the TTE-CAM test-time framework, which replaces the original CNN classification head with a convolution-based head initialized from the original weights, giving the model built-in explainability.
Result: TTE-CAM preserves the predictive performance of the original black-box model while providing faithful explanations that are competitive with post-hoc methods, both qualitatively and quantitatively.
Conclusion: TTE-CAM narrows the trade-off between explainability and predictive performance and offers a new route to deploying trustworthy AI in clinical settings.
Abstract: Convolutional neural networks (CNNs) achieve state-of-the-art performance in medical image analysis yet remain opaque, limiting adoption in high-stakes clinical settings. Existing approaches face a fundamental trade-off: post-hoc methods provide unfaithful approximate explanations, while inherently interpretable architectures are faithful but often sacrifice predictive performance. We introduce TTE-CAM, a test-time framework that bridges this gap by converting pretrained black-box CNNs into self-explainable models via a convolution-based replacement of their classification head, initialized from the original weights. The resulting model preserves black-box predictive performance while delivering built-in faithful explanations competitive with post-hoc methods, both qualitatively and quantitatively. The code is available at https://github.com/kdjoumessi/Test-Time-Explainability
[130] Computer Vision with a Superpixelation Camera
Sasidharan Mahalingam,Rachel Brown,Atul Ingle
Main category: cs.CV
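TL;DR: This paper proposes SuperCam, a new camera design that performs superpixel segmentation on the fly during image capture to reduce redundant data, improving downstream vision tasks such as image segmentation, object detection, and monocular depth estimation on memory-constrained edge devices.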
Details
Motivation: Conventional cameras produce large amounts of redundant data that are hard to process efficiently on resource-constrained edge devices, while most downstream vision algorithms do not need the full-resolution pixel stream.
Method: Proposes the SuperCam architecture, which performs superpixel segmentation on the camera in real time to compress and process data adaptively.
Result: SuperCam outperforms existing superpixel algorithms under memory constraints and gives better results on image segmentation, object detection, and monocular depth estimation.
Conclusion: Superpixel segmentation is a key technique for deploying vision models on edge devices, and SuperCam offers a new approach to building efficient edge vision systems.
Abstract: Conventional cameras generate a lot of data that can be challenging to process in resource-constrained applications. Usually, cameras generate data streams on the order of the number of pixels in the image. However, most of this captured data is redundant for many downstream computer vision algorithms. We propose a novel camera design, which we call SuperCam, that adaptively processes captured data by performing superpixel segmentation on the fly. We show that SuperCam performs better than current state-of-the-art superpixel algorithms under memory-constrained situations. We also compare how well SuperCam performs when the compressed data is used for downstream computer vision tasks. Our results demonstrate that the proposed design provides superior output for image segmentation, object detection, and monocular depth estimation in situations where the available memory on the camera is limited. We posit that superpixel segmentation will play a crucial role as more computer vision inference models are deployed in edge devices. SuperCam would allow computer vision engineers to design more efficient systems for these applications.
[131] FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition
Jie Zhu,Xiao Guo,Yiyang Su,Anil Jain,Xiaoming Liu
Main category: cs.CV
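TL;DR: This paper proposes FusionAgent, a dynamic model-fusion framework built on a multimodal large language model. It uses reinforcement fine-tuning to learn sample-level adaptive selection of expert models and introduces ACT score fusion to address score misalignment and embedding heterogeneity; on whole-body biometric recognition it significantly outperforms existing methods while being more efficient.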
Details
Motivation: Existing score-fusion strategies are usually static and invoke every model for each test sample without considering sample quality or modality reliability, resulting in low efficiency and poor robustness. Method: Proposes the FusionAgent framework: 1) each expert model is treated as a tool, and a multimodal large language model (MLLM) acts as an agent that performs dynamic model selection; 2) the agent is trained with reinforcement fine-tuning (RFT) using a metric-based reward; 3) an Anchor-based Confidence Top-k (ACT) score-fusion method anchors on the most confident model and fuses complementary predictions in a confidence-aware manner. Result: On multiple whole-body biometric recognition benchmarks, FusionAgent significantly outperforms state-of-the-art (SoTA) methods while reducing the number of model invocations, improving inference efficiency, and offering explainability and robustness. Conclusion: Dynamic, explainable, and robust model fusion is critical for real-world recognition systems, and FusionAgent provides an effective and practical new paradigm for it. Abstract: Model fusion is a key strategy for robust recognition in unconstrained scenarios, as different models provide complementary strengths. This is especially important for whole-body human recognition, where biometric cues such as face, gait, and body shape vary across samples and are typically integrated via score-fusion. However, existing score-fusion strategies are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each expert model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address model score misalignment and embedding heterogeneity, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms SoTA methods while achieving higher efficiency through fewer model invocations, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. Project page: https://fusionagent.github.io/
[132] Live Interactive Training for Video Segmentation
Xinyu Yang,Haozheng Yu,Yihong Sun,Bharath Hariharan,Jennifer J. Sun
Main category: cs.CV
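TL;DR: This paper proposes Live Interactive Training (LIT), a framework that lets vision models learn online from user feedback at inference time, substantially reducing repeated manual corrections in video segmentation.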
Details
Motivation: Existing interactive video segmentation methods (e.g., SAM2) cannot keep learning from user corrections, so complex scenarios such as occlusion, object separation, and camouflage require many repeated interventions. Method: Proposes the LIT framework; its concrete instantiation, LIT-LoRA, absorbs user corrections by updating a lightweight LoRA module in real time during inference and generalizes that knowledge to subsequent video frames. Result: On challenging video segmentation benchmarks, it reduces total corrections by 18-34% on average, with a training overhead of only about 0.5 s per correction; it also transfers successfully to other segmentation models and to CLIP-based fine-grained image classification. Conclusion: LIT demonstrates the effectiveness of online adaptation at inference time and offers a new paradigm for reducing redundant human intervention in complex visual tasks. Abstract: Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: https://youngxinyu1802.github.io/projects/LIT/.
[133] Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark
Laura Pedrouzo-Rodriguez,Luis F. Gomez,Ruben Tolosana,Ruben Vera-Rodriguez,Roberto Daza,Aythami Morales,Julian Fierrez
Main category: cs.CV
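TL;DR: This paper introduces the AVAPrintDB database and a standardized benchmark for evaluating the robustness of avatar fingerprinting systems in multi-generator settings, and finds that existing methods are highly sensitive to changes in the synthesis pipeline and data domain.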
Details
Motivation: Public avatar fingerprinting databases are scarce and outdated and do not reflect today's realistic, diverse synthesis scenarios, so a more realistic multi-generator database and a reproducible benchmark are needed. Method: Builds the AVAPrintDB database from two audiovisual corpora and three state-of-the-art generators (GAGAvatar, LivePortrait, HunyuanPortrait), covering both self- and cross-reenactment; designs a standardized fingerprinting benchmark that integrates existing methods and explores new ones based on foundation models such as DINOv2 and CLIP; and carries out a comprehensive analysis under generator shift and dataset shift. Result: Experiments show that although identity-related motion cues persist in synthetic avatars, current fingerprinting systems are extremely sensitive to changes in the synthesis pipeline and source domain and generalize poorly. Conclusion: AVAPrintDB and its accompanying benchmark provide important infrastructure for avatar fingerprinting research, expose the limitations of existing methods in realistic multi-source scenarios, and encourage the development of more robust methods. Abstract: Recent advances in photorealistic avatar generation have enabled highly realistic talking-head avatars, raising security concerns regarding identity impersonation in AI-mediated communication. To address this challenge, the task of avatar fingerprinting aims to determine whether two avatar videos are driven by the same human operator. However, current public databases in the literature are scarce and based solely on outdated talking-head avatar generators, which do not represent realistic scenarios for the current task of avatar fingerprinting. To overcome this situation, the present article introduces AVAPrintDB, a new publicly available multi-generator talking-head avatar database for avatar fingerprinting. AVAPrintDB is constructed from two audiovisual corpora and three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait), representing different synthesis paradigms, and includes both self- and cross-reenactments to simulate legitimate usage and impersonation scenarios. Building on this database, we also define a standardized and reproducible benchmark for avatar fingerprinting, considering public state-of-the-art avatar fingerprinting systems and exploring novel methods based on Foundation Models (DINOv2 and CLIP). We also conduct a comprehensive analysis under generator and dataset shift. Our results show that, while identity-related motion cues persist across synthetic avatars, current avatar fingerprinting systems remain highly sensitive to changes in the synthesis pipeline and source domain. The AVAPrintDB, benchmark protocols, and avatar fingerprinting systems are publicly available to facilitate reproducible research.
[134] From 3D Pose to Prose: Biomechanics-Grounded Vision--Language Coaching
Yuyang Ji,Yixuan Shen,Shengjie Zhu,Yu Kong,Feng Liu
Main category: cs.CV
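TL;DR: BioCoach is a biomechanics-grounded vision-language framework that provides fitness coaching from streaming video, fusing visual appearance with 3D skeletal kinematics to deliver personalized, interpretable movement feedback.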
Details
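Motivation: Existing fitness coaching systems mostly rely on pattern matching and do not model individual biomechanical characteristics (such as morphometrics, motion constraints, and periodicity), making accurate, phase-aware, and interpretable feedback difficult. Method: Proposes a three-stage pipeline: 1) an exercise-specific degree-of-freedom selector that focuses on key joints; 2) a structured biomechanical context that combines individual morphometric data with cycle and constraint analysis; 3) a vision-biomechanics conditioned feedback module that uses cross-attention to generate precise textual feedback. Training is parameter-efficient, with the vision and language backbones frozen. Result: On the newly built QEVD-bio-fit-coach dataset and under a biomechanics-aware LLM judge metric, BioCoach improves clearly on lexical and judgment metrics while retaining temporal triggering ability; on the original QEVD-fit-coach it also improves text quality and correctness, with timing performance near parity. Conclusion: Explicitly modeling 3D kinematics and biomechanical constraints is key to accurate, phase-aware, interpretable fitness coaching, and BioCoach validates the combination of domain knowledge with multimodal large models. Abstract: We present BioCoach, a biomechanics-grounded vision-language framework for fitness coaching from streaming video. BioCoach fuses visual appearance and 3D skeletal kinematics through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision-biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.
[135] Real-time Appearance-based Gaze Estimation for Open Domains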
Zhenhao Li,Zheng Liu,Seunghyun Lee,Amin Fadaeinejad,Yuanhao Yu
Main category: cs.CV
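TL;DR: This paper proposes a robust appearance-based gaze estimation (AGE) framework that uses data augmentation and multi-task learning to improve generalization in unconstrained scenarios (e.g., eyeglasses, poor lighting), builds new benchmarks to validate robustness, and shows that its lightweight MobileNet model approaches SOTA performance with a very small parameter count.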
Details
Motivation: Existing AGE models generalize poorly to unconstrained real-world scenarios (e.g., facial wearables, low light), mainly because of limited training-image diversity and poor label consistency across datasets, especially for pitch. Method: 1) applies a range of image augmentation techniques (e.g., synthesized eyeglasses and masks, lighting changes) to expand the image manifold; 2) reformulates gaze regression as multi-task learning, with multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary tasks. Result: Effectiveness is validated on newly built robustness benchmarks; the lightweight MobileNet-based model has less than 1% of the parameters of UniGaze-H yet achieves comparable generalization performance, supporting accurate real-time gaze tracking on mobile devices. Conclusion: The proposed framework substantially improves the robustness and generalization of AGE models in complex real-world scenarios without additional human annotation, balancing efficiency and practicality. Abstract: Appearance-based gaze estimation (AGE) has achieved remarkable performance in constrained settings, yet we reveal a significant generalization gap where existing AGE models often fail in practical, unconstrained scenarios, particularly those involving facial wearables and poor lighting conditions. We attribute this failure to two core factors: limited image diversity and inconsistent label fidelity across different datasets, especially along the pitch axis. To address these, we propose a robust AGE framework that enhances generalization without requiring additional human-annotated data. First, we expand the image manifold via an ensemble of augmentation techniques, including synthesis of eyeglasses, masks, and varied lighting. Second, to mitigate the impact of anisotropic inter-dataset label deviation, we reformulate gaze regression as a multi-task learning problem, incorporating multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary objectives. To rigorously validate our approach, we curate new benchmark datasets designed to evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing evaluation protocols. Our MobileNet-based lightweight model achieves generalization performance competitive with the state-of-the-art (SOTA) UniGaze-H, while utilizing less than 1% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.
[136] Multimodal Deep Learning for Diabetic Foot Ulcer Staging Using Integrated RGB and Thermal Imaging
Gulengul Mermer,Mustafa Furkan Aksu,Gozde Ozsezer,Sevki Cetinkalp,Orhan Er,Mehmet Kemal Gullu
Main category: cs.CV
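TL;DR: This study develops a Raspberry Pi-based portable multimodal imaging system (RGB + thermal), builds a diabetic foot ulcer (DFU) dataset of 1,205 samples labeled into six stages, and shows that four-channel fused input (RGB + thermal) significantly improves deep learning classification performance, with VGG16 reaching 93.25% accuracy. Grad-CAM confirms that the thermal channel helps localize temperature anomalies while the RGB channel contributes complementary structural and textural information.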
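The four-channel input mentioned in this entry (thermal image added as a fourth channel next to R, G, B) can be illustrated with a minimal sketch. It assumes the two images are already registered to the same pixel grid and have the same resolution; the function name `stack_rgb_thermal` and the toy values are hypothetical, and the authors' actual preprocessing is not described in the digest.

```python
def stack_rgb_thermal(rgb, thermal):
    """Append a thermal map as a fourth channel to an RGB image.

    rgb:     H x W x 3 nested sequence (one (R, G, B) triple per pixel)
    thermal: H x W nested sequence, assumed registered to the RGB grid
    Returns an H x W x 4 nested list.
    """
    if len(rgb) != len(thermal):
        raise ValueError("RGB and thermal images must have the same height")
    fused = []
    for rgb_row, thermal_row in zip(rgb, thermal):
        if len(rgb_row) != len(thermal_row):
            raise ValueError("RGB and thermal images must have the same width")
        # Each output pixel is [R, G, B, T].
        fused.append([list(pixel) + [t] for pixel, t in zip(rgb_row, thermal_row)])
    return fused


# Toy 2 x 2 example (hypothetical values; thermal readings in degrees Celsius).
rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (10, 20, 30)]]
thermal = [[31.5, 32.0],
           [36.8, 33.1]]
fused = stack_rgb_thermal(rgb, thermal)
print(fused[1][0])  # [0, 0, 255, 36.8]
```

A network consuming such input needs a first layer that accepts four input channels instead of three; how the authors adapted the pretrained backbones for this is not stated here.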
Details
Motivation: Diabetic foot ulcers (DFU) can lead to amputation and high healthcare costs, so early diagnosis is urgently needed; single-modality images (such as RGB) carry limited information, and determining whether multimodal (RGB + thermal) images improve DFU stage classification has clear clinical value. Method: Designs a portable Raspberry Pi system that captures RGB and thermal images simultaneously; builds a real hospital dataset of 1,205 samples labeled by experts into six stages; prepares three training sets, RGB-only, thermal-only, and RGB+Thermal (thermal image as a fourth channel); trains and evaluates five models (DenseNet121, EfficientNetV2, InceptionV3, ResNet50, VGG16); and uses Grad-CAM to visualize the regions the models attend to. Result: The four-channel RGB+Thermal fusion outperforms single-modality input overall; VGG16 performs best under this setting, with 93.25% accuracy, 92.53% F1-score, and 91.03% MCC; Grad-CAM shows that the thermal channel highlights temperature anomalies in the ulcer region while the RGB channel provides structural and textural support. Conclusion: Fusing RGB and thermal images significantly improves DFU stage classification; using the thermal image as a fourth channel strengthens the model's ability to localize pathological regions and complements the structural information in RGB, offering a feasible technical route for portable intelligent screening devices. Abstract: Diabetic foot ulcers (DFU) are one of the serious complications of diabetes that can lead to amputations and high healthcare costs. Regular monitoring and early diagnosis are critical for reducing the clinical burden and the risk of amputation. The aim of this study is to investigate the impact of using multimodal images on deep learning models for the classification of DFU stages. To this end, we developed a Raspberry Pi-based portable imaging system capable of simultaneously capturing RGB and thermal images. Using this prototype, a dataset consisting of 1,205 samples was collected in a hospital setting. The dataset was labeled by experts into six distinct stages. To evaluate the models' performance, we prepared three different training sets: RGB-only, thermal-only, and RGB+Thermal (with the thermal image added as a fourth channel). We trained the DenseNet121, EfficientNetV2, InceptionV3, ResNet50, and VGG16 models on each of these training sets. The results show that the multimodal training dataset, in which RGB and thermal data are combined across four channels, outperforms single-modal approaches. The highest performance was observed in the VGG16 model trained on the RGB+Thermal dataset. The model achieved an accuracy of 93.25%, an F1-score of 92.53%, and an MCC of 91.03%. Grad-CAM heatmap visualizations demonstrated that the thermal channel helped the model focus on the correct location by highlighting temperature anomalies in the ulcer region, while the RGB channel supported the decision-making process with complementary structural and textural information.
[137] Beyond Mortality: Advancements in Post-Mortem Iris Recognition through Data Collection and Computer-Aided Forensic Examination
Rasel Ahmed Bhuiyan,Parisa Farmanifard,Renu Sharma,Andrey Kuehlkamp,Aidan Boyd,Patrick J Flynn,Kevin W Bowyer,Arun Ross,Dennis Chute,Adam Czajka
Main category: cs.CV
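TL;DR: This paper builds the largest dataset to date of near-infrared and visible-light post-mortem iris images (259 deceased subjects, maximum post-mortem interval of 1,674 hours), including for the first time iris images of the same person before and after death; systematically evaluates five iris recognition methods on data from 338 deceased subjects and analyzes the influence of demographic factors; proposes a detection model for post-mortem iris images (treated as presentation attack detection); and open-sources an explainable forensic iris recognition tool that integrates three methods.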
Details
Motivation: Addresses key obstacles in post-mortem iris recognition: scarce data, few purpose-built methods, and growing public concern about illicit impersonation. Method: 1) builds a large multimodal (NIR + visible-light) post-mortem iris dataset, including the first same-subject before/after-death comparison; 2) combines it with public data to benchmark five mainstream iris recognition algorithms on samples from 338 deceased subjects and analyzes the influence of demographic factors such as age and sex; 3) frames post-mortem iris detection as a presentation attack detection problem and trains a dedicated detection model; 4) develops an open-source forensic tool that integrates multiple algorithms with a visual explanation module. Result: Provides the first large-scale, multimodal post-mortem iris dataset with a before/after-death comparison; reveals that current algorithms degrade markedly as PMI grows and exhibit demographic bias; verifies that post-mortem iris images can be effectively detected as presentation attacks; the open-source tool improves the trustworthiness and interpretability of results in forensic use. Conclusion: Through advances in data, benchmarks, detection models, and tools, this work moves post-mortem iris recognition from research toward reliable forensic practice while providing technical defenses against potential misuse. Abstract: Post-mortem iris recognition brings both hope to the forensic community (a short-term but accurate and fast means of verifying identity) and concerns to society (its potential illicit use in post-mortem impersonation). These hopes and concerns have grown along with the volume of research in post-mortem iris recognition. Barriers to further progress in post-mortem iris recognition include the difficult nature of data collection, and the resulting small number of approaches designed specifically for comparing iris images of deceased subjects. This paper makes several unique contributions to mitigate these barriers. First, we have collected and we offer a new dataset of NIR (compliant with ISO/IEC 19794-6 where possible) and visible-light iris images collected after demise from 259 subjects, with the largest PMI (post-mortem interval) being 1,674 hours. For one subject, the data has been collected before and after death, the first such case ever published. Second, the collected dataset was combined with publicly-available post-mortem samples to assess the current state of the art in automatic forensic iris recognition with five iris recognition methods and data originating from 338 deceased subjects. These experiments include analyses of how selected demographic factors influence recognition performance. Third, this study implements a model for detecting post-mortem iris images, which can be considered as presentation attacks. Finally, we offer an open-source forensic tool integrating three post-mortem iris recognition methods with explainability elements added to make the comparison process more human-interpretable.
[138] A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
Mujtaba Hussain Mirza,Antonio D'Orazio,Odelia Melamed,Iacopo Masi
Main category: cs.CV
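TL;DR: This paper proposes ET3, a lightweight, training-free defense that improves robustness to adversarial perturbations by minimizing the energy of input samples, and validates it on classification, CLIP zero-shot classification, and LVLM tasks such as image captioning and visual question answering.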
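The idea of lowering an input's energy at test time can be sketched on a toy model. The digest does not give ET3's energy definition or update rule, so everything below is an assumption for illustration: the energy is taken to be the negative log-sum-exp of the classifier logits (a common choice in energy-based readings of classifiers), the classifier is a hypothetical linear map `W x + b`, and the input is updated by plain gradient descent with no perturbation budget.

```python
import math


def logits(W, b, x):
    """Logits of a toy linear classifier: z_k = W[k] . x + b[k]."""
    return [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]


def energy(W, b, x):
    """Assumed energy: E(x) = -log(sum_k exp(z_k)), computed stably."""
    z = logits(W, b, x)
    m = max(z)
    return -(m + math.log(sum(math.exp(v - m) for v in z)))


def energy_grad(W, b, x):
    """dE/dx_i = -sum_k softmax_k(z) * W[k][i]."""
    z = logits(W, b, x)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [-sum(p * row[i] for p, row in zip(probs, W)) for i in range(len(x))]


def minimize_energy(W, b, x, step=0.1, n_steps=10):
    """Take a few gradient-descent steps on the input to lower its energy."""
    x = list(x)
    for _ in range(n_steps):
        g = energy_grad(W, b, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical 3-class classifier on 2-D inputs.
W = [[2.0, -1.0], [-1.0, 1.5], [0.5, 0.5]]
b = [0.0, 0.1, -0.2]
x0 = [0.3, -0.2]
x1 = minimize_energy(W, b, x0)
print(energy(W, b, x0) > energy(W, b, x1))  # True
```

In this sketch the transformed input `x1` would then be passed to the classifier in place of `x0`; a practical defense would presumably also bound how far the input may move, which is omitted here.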
Details
Motivation: Multimodal models and large vision-language models (LVLMs) are vulnerable to adversarial perturbations, which threatens their reliability in practice; adversarial training requires retraining, whereas test-time transformations (TTT) offer a way to improve robustness at inference. Method: Proposes Energy-Guided Test-Time Transformation (ET3), a training-free input transformation strategy based on energy minimization, and proves theoretically that it guarantees successful classification under reasonable assumptions. Result: ET3 significantly improves adversarial robustness for classifiers, CLIP zero-shot classification, and LVLM image captioning and visual question answering, showing broad applicability and strong defense. Conclusion: ET3 is an efficient, general, training-free test-time defense that offers a new approach and a practical tool for improving the robustness of multimodal and vision-language models. Abstract: Despite the rapid progress in multimodal models and Large Vision-Language Models (LVLMs), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense.
[139] GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection
Jiaming Li,Zhijia Liang,Weikai Chen,Lin Ma,Guanbin Li
Main category: cs.CV
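TL;DR: This paper proposes the GUIDED framework, which improves fine-grained open-vocabulary object detection by disentangling the semantic entanglement between subjects and attributes.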
Details
Motivation: Existing open-vocabulary detectors perform poorly in fine-grained settings, mainly because subject and attribute semantics are entangled in pretrained vision-language model embeddings, which causes over-representation of attributes, mislocalization, and semantic drift in the embedding space. Method: GUIDED separates localization and fine-grained recognition into two independent pathways: 1) a language model extracts the coarse-grained subject and its attributes; 2) only the subject embedding guides localization; 3) helpful attribute information is fused in through an attention mechanism; 4) a region-level attribute discrimination module performs fine-grained classification with a refined VLM. Result: Achieves new SOTA performance on the FG-OVD and 3F-OVD benchmarks. Conclusion: Disentangled modeling and modular optimization effectively mitigate semantic entanglement and significantly improve fine-grained open-vocabulary detection. Abstract: Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings, leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited to it. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. Then the detector is guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or overrepresented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization. Our code will be released at https://github.com/lijm48/GUIDED.
[140] Generative Shape Reconstruction with Geometry-Guided Langevin Dynamics
Linus Härenstam-Nielsen,Dmitrii Pozdeev,Thomas Dagès,Nikita Araslanov,Daniel Cremers
Main category: cs.CV
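TL;DR: This paper proposes GG-Langevin, which combines a diffusion-model prior with consistency to observed data through geometry-guided Langevin dynamics to achieve accurate and robust 3D shape reconstruction.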
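The general pattern of Langevin sampling with measurement consistency enforced at every step can be shown on a toy problem. This is a sketch under strong assumptions, not the paper's algorithm: the diffusion-model score is replaced by the analytic score of a standard Gaussian prior, the "measurement" is a set of directly observed coordinates, and consistency is enforced by overwriting those coordinates after each update. All function names are hypothetical.

```python
import math
import random


def gaussian_score(x):
    """Score (gradient of the log-density) of a standard Gaussian prior."""
    return [-v for v in x]


def project_to_measurement(x, observed):
    """Overwrite observed coordinates so the sample agrees with the measurement."""
    x = list(x)
    for idx, value in observed.items():
        x[idx] = value
    return x


def guided_langevin(x0, observed, step=0.05, n_steps=200, seed=0):
    """Unadjusted Langevin dynamics with a consistency projection after each step.

    Update: x <- x + (step / 2) * score(x) + sqrt(step) * noise, then project.
    """
    rng = random.Random(seed)
    x = project_to_measurement(x0, observed)
    for _ in range(n_steps):
        s = gaussian_score(x)
        x = [xi + 0.5 * step * si + math.sqrt(step) * rng.gauss(0.0, 1.0)
             for xi, si in zip(x, s)]
        x = project_to_measurement(x, observed)
    return x


# Coordinate 0 is observed; coordinate 1 is filled in by the prior.
sample = guided_langevin([3.0, 3.0], observed={0: 1.25})
print(sample[0])  # 1.25
```

The observed coordinate stays pinned to the measurement while the unobserved one is drawn from the prior, which mirrors at toy scale the balance between measurement consistency and shape plausibility that the entry describes.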
Details
Motivation: Existing shape reconstruction methods perform well under ideal conditions but fail with incomplete or noisy observations, while generative models can produce realistic shapes but struggle to stay consistent with the observations. Geometric fidelity and prior plausibility need to be unified. Method: Proposes Geometry-Guided Langevin dynamics (GG-Langevin), which enforces measurement consistency at every step of the diffusion-model-guided sampling trajectory to achieve generative reconstruction. Result: Experiments show that GG-Langevin achieves higher geometric accuracy and greater robustness to missing data than existing methods on surface reconstruction. Conclusion: GG-Langevin combines optimization-based reconstruction with generative-model priors, providing a unified and effective probabilistic solution to ill-posed 3D reconstruction problems. Abstract: Reconstructing complete 3D shapes from incomplete or noisy observations is a fundamentally ill-posed problem that requires balancing measurement consistency with shape plausibility. Existing methods for shape reconstruction can achieve strong geometric fidelity in ideal conditions but fail under realistic conditions with incomplete measurements or noise. At the same time, recent generative models for 3D shapes can synthesize highly realistic and detailed shapes but fail to be consistent with observed measurements. In this work, we introduce GG-Langevin: Geometry-Guided Langevin dynamics, a probabilistic approach that unifies these complementary perspectives. By traversing the trajectories of Langevin dynamics induced by a diffusion model, while preserving measurement consistency at every step, we generatively reconstruct shapes that fit both the measurements and the data-informed prior. We demonstrate through extensive experiments that GG-Langevin achieves higher geometric accuracy and greater robustness to missing data than existing methods for surface reconstruction.
[141] YOLO Object Detectors for Robotics -- a Comparative Study
Patryk Niżeniec,Marcin Iwanowski,Marcin Gahbler
Main category: cs.CV
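TL;DR: This paper evaluates the suitability of YOLO-family object detectors for detecting objects in a robot workspace, using a custom dataset and COCO2017 with image distortions added to test robustness, and provides experimental evidence for choosing a YOLO version for robotic vision tasks.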
Details
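Motivation: To validate the applicability of YOLO-family models to object detection in a robot workspace; YOLO has become a key component of vision systems in many domains, but performance differences among versions and variants still need empirical evaluation. Method: Runs experiments on multiple YOLO versions and their variants using a custom dataset and the COCO2017 dataset under various training/testing configurations, and introduces image distortions to assess robustness. Result: Obtains a comparison of detection performance across YOLO versions under standard and distorted conditions, revealing differences in how the models perform and where they are suitable in robotic vision tasks. Conclusion: The results can guide the choice of a YOLO version for robotic vision applications, and model selection should account for task requirements and robustness to the environment. Abstract: YOLO object detectors recently became a key component of vision systems in many domains. The family of available YOLO models consists of multiple versions, each available in several variants. The research reported in this paper aims to validate the applicability of members of this family to detect objects located within the robot workspace. In our experiments, we used our custom dataset and the COCO2017 dataset. To test the robustness of the investigated detectors, the images of these datasets were subjected to distortions. The results of our experiments, including variations of training/testing configurations and models, may support the choice of the appropriate YOLO version for robotic vision tasks.
[142] RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs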
Logan Lawrence,Mustafa Chasmai,Rangel Daroya,Wuao Liu,Seoyun Jeong,Aaron Sun,Max Hamilton,Fabien Delattre,Oindrila Saha,Subhransu Maji,Grant Van Horn
Main category: cs.CV
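TL;DR: This paper proposes the RealBirdID benchmark, which requires a model doing fine-grained bird identification to judge whether an image is answerable and, when it is not, to give an evidence-based rationale (e.g., requires vocalization, low image quality, or obstructed view) rather than guess blindly. Experiments show that existing MLLMs have low accuracy on answerable samples (<13%), that abstention ability does not track classification ability, and that abstention rationales are often wrong.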
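The three quantities this entry evaluates (species accuracy on answerable images, how often a model abstains on unanswerable ones, and whether the stated rationale is right) can be computed as in the sketch below. The record layout, the function name `score_predictions`, and the exact metric definitions are assumptions made for illustration; the benchmark's official scoring is not specified in the digest.

```python
def score_predictions(records):
    """Score abstention-aware predictions.

    Each record is a dict with:
      'answerable'   : bool, ground truth
      'label'        : true species if answerable, else true abstention rationale
      'pred_abstain' : bool, whether the model abstained
      'pred_value'   : predicted species, or predicted rationale when abstaining
    Returns (accuracy on answerable, abstention rate on unanswerable,
             rationale accuracy among abstained unanswerable); None if undefined.
    """
    answerable = [r for r in records if r["answerable"]]
    unanswerable = [r for r in records if not r["answerable"]]
    abstained = [r for r in unanswerable if r["pred_abstain"]]

    def ratio(hits, total):
        return hits / total if total else None

    correct_species = sum(
        1 for r in answerable
        if not r["pred_abstain"] and r["pred_value"] == r["label"]
    )
    correct_rationale = sum(1 for r in abstained if r["pred_value"] == r["label"])
    return (
        ratio(correct_species, len(answerable)),
        ratio(len(abstained), len(unanswerable)),
        ratio(correct_rationale, len(abstained)),
    )


# Hypothetical predictions for five images.
records = [
    {"answerable": True, "label": "Song Sparrow",
     "pred_abstain": False, "pred_value": "Song Sparrow"},
    {"answerable": True, "label": "House Finch",
     "pred_abstain": False, "pred_value": "Purple Finch"},
    {"answerable": False, "label": "requires vocalization",
     "pred_abstain": True, "pred_value": "requires vocalization"},
    {"answerable": False, "label": "low quality image",
     "pred_abstain": True, "pred_value": "view obstructed"},
    {"answerable": False, "label": "view obstructed",
     "pred_abstain": False, "pred_value": "Willow Flycatcher"},
]
accuracy, abstention_rate, rationale_accuracy = score_predictions(records)
print(accuracy, rationale_accuracy)  # 0.5 0.5
```

Reporting the three numbers separately makes it visible when a model classifies well but abstains poorly, which is the mismatch the entry highlights.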
Details
Motivation: Existing fine-grained bird identification systems still force a prediction when a single image carries insufficient information (e.g., sound is needed, the bird is occluded, or the angle is poor) and lack a principled abstention mechanism; current multimodal evaluations favor answerable samples and cannot assess how models actually handle uncertainty. Method: Builds the RealBirdID benchmark, comprising a genus-organized set of answerable samples and an unanswerable validation set annotated with abstention rationales (such as "requires vocalization," "low quality image," and "view obstructed"); evaluation metrics cover species identification accuracy, abstention coverage, and rationale correctness. Result: (1) Mainstream MLLMs (such as GPT-5 and Gemini-2.5 Pro) score below 13% accuracy on the answerable subset; (2) models with stronger classification ability are not necessarily better at appropriate abstention; (3) even when MLLMs abstain, the rationales they give are generally wrong. Conclusion: RealBirdID provides a focused evaluation target and a reproducible evaluation recipe for abstention-aware fine-grained recognition, exposing serious weaknesses of current MLLMs in uncertainty modeling and evidence-based reasoning. Abstract: Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g. vocalization), or obscured due to occlusion, camera angle, or low resolution. Yet today's multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: "requires vocalization," "low quality image," or "view obstructed". For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) species identification on the answerable set is challenging for a variety of open-source and proprietary models (less than 13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily more calibrated to abstain from unanswerable examples, and (3) MLLMs generally fail at providing correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.
[143] Unified Number-Free Text-to-Motion Generation Via Flow Matching
Guanhe Huang,Oya Celiktutan
Main category: cs.CV
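TL;DR: This paper proposes Unified Motion Flow (UMF), comprising Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow), to address poor generalization to a variable number of agents in multi-agent motion generation and the inefficiency and error accumulation of autoregressive models; using a unified latent space, single-pass prior generation, and multi-pass reaction generation, it achieves efficient and robust text-driven multi-person motion synthesis.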
Details
Motivation: Existing generative models work well for motion synthesis with a fixed number of people but generalize poorly when the number varies; autoregressive methods trained on limited domain data are computationally inefficient and accumulate errors. Method: Proposes the UMF framework, consisting of P-Flow (hierarchical motion prior generation conditioned on multi-scale noise) and S-Flow (learning a joint probabilistic path that adaptively performs reaction transformation and context reconstruction); a unified latent space merges heterogeneous motion datasets and supports unified end-to-end training. Result: Experiments and user studies show that UMF significantly outperforms existing methods on text-driven multi-person motion generation, with strong generalization, efficiency, and robustness. Conclusion: UMF is a general, data-efficient multi-person motion generation model for a variable number of people, offering a new paradigm for multi-agent motion modeling. Abstract: Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize to a variable number of agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes the number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF's effectiveness as a generalist model for multi-person motion generation from text. Project page: https://githubhgh.github.io/umf/.
[144] MOOZY: A Patient-First Foundation Model for Computational Pathology
Yousef Kotp,Vincent Quoc-Huy Trinh,Christopher Pal,Mahdi S. Hosseini
Main category: cs.CV
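TL;DR: MOOZY is a patient-first pathology foundation model that uses a case transformer to model dependencies among multiple whole-slide images (WSIs) from the same patient. Pretrained in two stages with open self-supervision and low-cost multi-task supervision, it clearly outperforms existing models on multiple downstream tasks while having fewer parameters and being highly reproducible.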
Details
Motivation: Current WSI foundation models in computational pathology are mostly centered on single slides, rely on private data and expensive paired-report supervision, and do not explicitly model relationships among multiple slides from the same patient, which hinders generalization across clinical tasks. Method: Proposes MOOZY: Stage 1 pretrains a vision encoder with masked self-distillation on 77,134 public WSI feature grids; Stage 2 introduces a case transformer and performs multi-task alignment over 333 tasks (classification and survival analysis) from 56 public datasets; the patient case is the basic modeling unit throughout. Result: In five-fold frozen-feature probe evaluation on eight held-out tasks, MOOZY is best or tied-best on most metrics; macro averages of weighted F1, weighted ROC-AUC, and balanced accuracy improve by +7.37%/+5.50%/+7.83% over TITAN and by +8.83%/+10.70%/+9.78% over PRISM; the model has only 85.77M parameters, 1/14 the size of GigaPath. Conclusion: A patient-first pretraining paradigm built on public data yields highly transferable embeddings and offers a practical path toward scalable, reproducible patient-level histopathology foundation models. Abstract: Computational pathology needs whole-slide image (WSI) foundation models that transfer across diverse clinical tasks, yet current approaches remain largely slide-centric, often depend on private data and expensive paired-report supervision, and do not explicitly model relationships among multiple slides from the same patient. We present MOOZY, a patient-first pathology foundation model in which the patient case, not the individual slide, is the core unit of representation. MOOZY explicitly models dependencies across all slides from the same patient via a case transformer during pretraining, combining multi-stage open self-supervision with scaled low-cost task supervision. In Stage 1, we pretrain a vision-only slide encoder on 77,134 public slide feature grids using masked self-distillation. In Stage 2, we align these representations with clinical semantics using a case transformer and multi-task supervision over 333 tasks from 56 public datasets, including 205 classification and 128 survival tasks across four endpoints. Across eight held-out tasks with five-fold frozen-feature probe evaluation, MOOZY achieves best or tied-best performance on most metrics and improves macro averages over TITAN by +7.37%, +5.50%, and +7.83% and over PRISM by +8.83%, +10.70%, and +9.78% for weighted F1, weighted ROC-AUC, and balanced accuracy, respectively. MOOZY is also parameter efficient with 85.77M parameters, 14x smaller than GigaPath. These results demonstrate that open, reproducible patient-level pretraining yields transferable embeddings, providing a practical path toward scalable patient-first histopathology foundation models.
[145] Towards Intrinsic-Aware Monocular 3D Object Detection
Zhihao Zhang,Abhinav Kumar,Xiaoming Liu
Main category: cs.CV
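TL;DR: MonoIA is an intrinsic-aware monocular 3D object detection framework that uses language models to generate semantic embeddings of camera intrinsics and integrates them hierarchically into the detection network, improving generalization and detection accuracy across different camera intrinsic settings.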
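Why camera intrinsics matter for monocular 3D detection can be seen from the standard pinhole projection, u = fx * X / Z + cx and v = fy * Y / Z + cy. The sketch below is background illustration only and is not part of MonoIA; the function names and the numeric intrinsics are hypothetical. It shows that the same object at the same depth appears twice as wide in pixels when the focal length doubles, so apparent size alone does not determine depth.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point (X, Y, Z) to pixel coordinates (u, v)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (fx * X / Z + cx, fy * Y / Z + cy)


def apparent_width(left, right, fx, fy, cx, cy):
    """Pixel width spanned by two 3D points at the object's left and right edges."""
    u_left, _ = project_point(left, fx, fy, cx, cy)
    u_right, _ = project_point(right, fx, fy, cx, cy)
    return u_right - u_left


# A 2 m wide object centered on the optical axis, 20 m from the camera.
left_edge = (-1.0, 0.0, 20.0)
right_edge = (1.0, 0.0, 20.0)

# Two hypothetical cameras that differ only in focal length.
width_cam_a = apparent_width(left_edge, right_edge, fx=700.0, fy=700.0, cx=640.0, cy=360.0)
width_cam_b = apparent_width(left_edge, right_edge, fx=1400.0, fy=1400.0, cx=640.0, cy=360.0)
print(width_cam_a, width_cam_b)  # 70.0 140.0
```

A detector trained only on images from the first camera would tend to read the 140-pixel object from the second camera as being about half as far away, which is the kind of intrinsic sensitivity this entry sets out to address.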
Details
Motivation: Existing monocular 3D detection methods are sensitive to camera intrinsics and generalize poorly because they treat intrinsics as plain numbers rather than as geometric transformations that affect visual perception. Method: Proposes the MonoIA framework, which uses large language models and vision-language models to generate "intrinsic embeddings" that capture the visual and geometric meaning of camera intrinsics, and injects them hierarchically into the detection network through an Intrinsic Adaptation Module so that feature representations are modulated according to the intrinsics. Result: Achieves new SOTA on benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard), with further gains under multi-dataset joint training (e.g., +4.46% on KITTI Val). Conclusion: Shifting intrinsic modeling from numeric conditioning to semantic representation markedly improves the robustness and unity of monocular 3D detection across diverse camera configurations. Abstract: Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsics govern how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision-language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics. This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras. Extensive experiments show that MonoIA achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).
[146] VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
Jihwan Hong,Jaeyoung Do
Main category: cs.CV
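TL;DR: This paper proposes the VIRST framework, which unifies global video reasoning with pixel-level mask prediction to address the performance drop of fixed-keyframe methods on dynamic videos and multi-step reasoning tasks.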
Details
Motivation: Existing fixed-keyframe methods struggle to capture rapidly changing spatiotemporal dynamics and cannot effectively handle queries that require multi-step reasoning, causing sharp performance drops on motion-intensive and reasoning-oriented videos. Method: Proposes the end-to-end VIRST framework, which includes a Spatio-Temporal Fusion (STF) module that fuses segmentation-aware video features into the vision-language backbone, and a Temporal Dynamic Anchor Updater that maintains stable temporal cues under large motion, occlusion, and object reappearance. Result: Achieves state-of-the-art results on a range of RVOS benchmarks and, under realistic and challenging conditions in particular, generalizes well to both referring and reasoning tasks. Conclusion: By jointly modeling spatiotemporal reasoning and segmentation prediction, VIRST improves the robustness and generalization of RVOS systems in complex dynamic scenes. Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring- and reasoning-oriented settings. The code and checkpoints are available at https://github.com/AIDASLab/VIRST.
[147] ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
Jovana Kondic,Pengyuan Li,Dhiraj Joshi,Isaac Sanchez,Ben Wiesel,Shafiq Abedin,Amit Alfassy,Eli Schwartz,Daniel Caraballo,Yagmur Gizem Cinar,Florian Scheidegger,Steven I. Ross,Daniel Karl I. Weidele,Hang Hua,Ekaterina Arutyunova,Roei Herzig,Zexue He,Zihan Wang,Xinyue Yu,Yunfei Zhao,Sicong Jiang,Minghao Liu,Qunshu Lin,Peter Staar,Luis Lastras,Aude Oliva,Rogerio Feris
Main category: cs.CV
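TL;DR: ChartNet is a million-scale, high-quality multimodal dataset for chart understanding. A code-guided synthesis pipeline generates 1.5 million diverse chart samples covering 24 chart types and 6 plotting libraries, each with five aligned modalities (code, image, table, summary, and question answering with reasoning) and subject to strict quality filtering; fine-tuning on it significantly improves multimodal models on chart understanding tasks.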
Details
Motivation: Current vision-language models (VLMs) are limited in jointly reasoning over the geometric patterns of charts, structured numerical data, and natural language, so a high-quality, large-scale multimodal chart understanding dataset with fine-grained alignment is needed. Method: A code-guided synthesis pipeline automatically generates 1.5 million chart samples; each sample has five aligned components (plotting code, chart image, data table, natural language summary, and question answering with reasoning); specialized subsets cover human-annotated data, real-world data, safety, and grounding; and a strict quality-filtering process safeguards visual fidelity, semantic accuracy, and representational diversity. Result: Models fine-tuned on ChartNet improve consistently across multiple chart understanding benchmarks; ChartNet is currently the largest open-source chart understanding dataset and has been publicly released. Conclusion: ChartNet provides large-scale supervision for building visualization-understanding foundation models that are robust and generalize well, moving chart understanding from single-modality processing toward deeply coordinated cross-modal reasoning. Abstract: Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language, a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at https://huggingface.co/datasets/ibm-granite/ChartNet
[148] Structural Graph Probing of Vision-Language Models
Haoyu He,Yue Zhuo,Yu Zheng,Qi R. Wang
Main category: cs.CV
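TL;DR: This paper studies vision-language models (VLMs) through the lens of neural topology, modeling each layer as a within-layer correlation graph built from neuron co-activations. Cross-modal structure consolidates with depth around a compact set of recurrent hub neurons, and perturbing those neurons substantially changes model output, which indicates that neural topology is a useful intermediate scale for VLM interpretability.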
Details
Motivation: Existing VLMs perform well, but how computation is organized at the level of neuron populations remains unclear, and an interpretability method that reveals behaviorally relevant internal structure is needed. Method: Each VLM layer is represented as a correlation graph built from neuron co-activations (its neural topology); the structural evolution of these graphs is analyzed across modalities and depth, and targeted perturbation is used to identify causally important neurons. Result: Correlation topology carries recoverable behavioral signal; cross-modal structure consolidates with depth around a small set of recurrent hub neurons; perturbing these hub neurons substantially changes model output. Conclusion: Neural topology is a meaningful intermediate scale for VLM interpretability: richer than local attribution, more tractable than full circuit recovery, and empirically tied to multimodal behavior. Abstract: Vision-language models (VLMs) achieve strong multimodal performance, yet how computation is organized across populations of neurons remains poorly understood. In this work, we study VLMs through the lens of neural topology, representing each layer as a within-layer correlation graph derived from neuron-neuron co-activations. This view allows us to ask whether population-level structure is behaviorally meaningful, how it changes across modalities and depth, and whether it identifies causally influential internal components under intervention. We show that correlation topology carries recoverable behavioral signal; moreover, cross-modal structure progressively consolidates with depth around a compact set of recurrent hub neurons, whose targeted perturbation substantially alters model output. Neural topology thus emerges as a meaningful intermediate scale for VLM interpretability: richer than local attribution, more tractable than full circuit recovery, and empirically tied to multimodal behavior. Code is publicly available at https://github.com/he-h/vlm-graph-probing.
[149] LightCtrl: Training-free Controllable Video Relighting
Yizuo Peng,Xuelin Chen,Kai Zhang,Xiaodong Cun
Main category: cs.CV
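TL;DR: LightCtrl is the first training-free video relighting method that gives explicit control over video illumination through a user-specified light trajectory. It combines pre-trained diffusion models with new Light Map Injection and Geometry-Aware Relighting modules, substantially improving lighting controllability and temporal consistency.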
Details
Motivation: Existing video relighting methods lack explicit control over the illumination of the output. Method: LightCtrl combines a pre-trained image relighting diffusion model with a video diffusion prior, and introduces a Light Map Injection module (which samples light-trajectory-specific noise and injects it) and a Geometry-Aware Relighting module (which dynamically fuses RGB and normal-map latents in the frequency domain), achieving explicit illumination control without training. Result: LightCtrl generates high-quality videos with diverse illumination changes that closely follow the specified light trajectory, and outperforms baseline methods in controllability. Conclusion: LightCtrl is the first to achieve explicit video illumination control without training; its modular design improves illumination coherence and geometry awareness and offers a new paradigm for controllable video editing. Abstract: Recent diffusion models have achieved remarkable success in image relighting, and this success has quickly been extended to video relighting. However, existing methods offer limited explicit control over illumination in the relighted output. We present LightCtrl, the first controllable video relighting method that enables explicit control of video illumination through a user-supplied light trajectory in a training-free manner. Our approach combines pre-trained diffusion models: an image relighting model processes each frame individually, followed by a video diffusion prior to enhance temporal consistency. To achieve explicit control over dynamically varying lighting, we introduce two key components. First, a Light Map Injection module samples light trajectory-specific noise and injects it into the latent representation of the source video, improving illumination coherence with the conditional light trajectory. Second, a Geometry-Aware Relighting module dynamically combines RGB and normal map latents in the frequency domain to suppress the influence of the original lighting, further enhancing adherence to the input light trajectory. Experiments show that LightCtrl produces high-quality videos with diverse illumination changes that closely follow the specified light trajectory, demonstrating improved controllability over baseline methods. Code is available at: https://github.com/GVCLab/LightCtrl.
[150] SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views
Zijian He,enjie Liu,Yihao Wang,Weizhi Zhong,Huan Yuan,Kun Gai,Guangrun Wang,Guanbin Li
Main category: cs.CV
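TL;DR: This paper proposes SceneExpander, which uses test-time adaptation with two distillation signals (anchor distillation and inserted-view self-distillation) to expand a multi-view-reconstructed 3D scene under user control, mitigating the global inconsistency caused by geometrically misaligned inserted views.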
Details
Motivation: Real world-building workflows are iterative and require user-controlled expansion of an existing real scene, but existing methods struggle with the 3D misalignment between an inserted view and the original reconstruction, which causes geometry shifts, hallucinated content, and view-dependent artifacts. Method: SceneExpander applies test-time adaptation to a parametric feed-forward 3D reconstruction model with two complementary distillation signals: anchor distillation stabilizes geometry using the originally captured views, while inserted-view self-distillation preserves observation-supported predictions and adapts latent geometry and appearance to fit the misaligned inserted view. Result: Experiments on ETH scenes and online data show clearly improved expansion behavior and reconstruction quality under misalignment. Conclusion: Through its dual distillation mechanism, SceneExpander achieves robust 3D scene expansion and offers a new paradigm for user-centric, iterative world building. Abstract: World building with 3D scene representations is increasingly important for content creation, simulation, and interactive experiences, yet real workflows are inherently iterative: creators must repeatedly extend an existing scene under user control. Motivated by this research gap, we study 3D scene expansion in a user-centric workflow: starting from a real scene captured by multi-view images, we extend its coverage by inserting an additional view synthesized by a generative model. Unlike simple object editing or style transfer in a fixed scene, the inserted view is often 3D-misaligned with the original reconstruction, introducing geometry shifts, hallucinated content, or view-dependent artifacts that break global multi-view consistency. To address the challenge, we propose SceneExpander, which applies test-time adaptation to a parametric feed-forward 3D reconstruction model with two complementary distillation signals: anchor distillation stabilizes the original scene by distilling geometric cues from the captured views, while inserted-view self-distillation preserves observation-supported predictions yet adapts latent geometry and appearance to accommodate the misaligned inserted view. Experiments on ETH scenes and online data demonstrate improved expansion behavior and reconstruction quality under misalignment.
[151] EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow
Dogyun Park,Yanyu Li,Sergey Tulyakov,Anil Kag
Main category: cs.CV
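TL;DR: This paper proposes EFlow, an efficient few-step training framework whose innovations, including Gated Local-Global Attention and Path-Drop Guided training, substantially reduce the compute cost and inference latency of video diffusion transformers.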
Details
Motivation: Scaling video diffusion transformers is limited by the quadratic complexity of attention and the high cost of iterative sampling steps. Method: The EFlow framework comprises Gated Local-Global Attention (which reduces per-step compute) and Path-Drop Guided training (which lowers the cost of computing the guidance target), plus a Mean-Velocity Additivity regularizer that preserves generation quality at low step counts. Result: EFlow achieves up to 2.5x higher training throughput than standard solution-flow and 45.3x lower inference latency than standard iterative models, while remaining competitive on Kinetics and large-scale text-to-video datasets. Conclusion: EFlow offers a practical from-scratch training recipe for video diffusion models and effectively eases their compute bottlenecks. Abstract: Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the expensive quadratic complexity of attention per step, and the iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework that tackles these bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block which is efficient, expressive, and remains highly stable under aggressive random token-dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe. We propose Path-Drop Guided training to replace the expensive guidance target with a computationally cheap, weak path. Furthermore, we augment this with a Mean-Velocity Additivity regularizer to ensure high fidelity at extremely low step counts. Together, our EFlow enables a practical from-scratch training pipeline, achieving up to 2.5x higher training throughput over standard solution-flow, and 45.3x lower inference latency than standard iterative models with competitive performance on Kinetics and large-scale text-to-video datasets.
[152] PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
Gedeon Muhawenayo,Caleb Robinson,Subash Khanal,Zhanpei Fang,Isaac Corley,Alexander Wollam,Tianyi Gao,Leonard Strnad,Ryan Avery,Lyndon Estes,Ana M. Tárano,Nathan Jacobs,Hannah Kerner
Main category: cs.CV
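TL;DR: This paper systematically evaluates 18 segmentation and geospatial foundation models for global field boundary delineation on the FTW benchmark, and proposes a new segmentation approach combining a U-Net backbone, composite loss functions, and targeted data augmentations that improves IoU and object-F1 by 6% and 9% respectively; the models and field boundary datasets for five countries are released.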
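As a reminder of what the IoU figure measures, here is a minimal sketch of intersection over union on binary masks. It is a generic illustration, not the FTW benchmark's evaluation code: the function name `mask_iou`, the nested-list mask format, and the empty-mask convention are all assumptions made for this example.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks.

    Masks are nested lists of 0/1 values (an assumed format for this sketch).
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is field in both masks
            if p or t:
                union += 1  # pixel is field in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))
```

Object-level F1, the other metric quoted above, additionally requires matching predicted field instances to reference instances and is not covered by this sketch.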
Details
Motivation: Existing deep learning methods for extracting field boundaries from satellite imagery are sensitive to illumination, spatial scale, and changes in geographic location, and the field lacks both a systematic evaluation and a robust general-purpose recipe. Method: 18 models are evaluated under unified experimental settings, and a new segmentation approach is proposed that combines a U-Net backbone, composite loss functions, and targeted data augmentations. Result: The proposed approach reaches 76% IoU and 47% object-F1 on the FTW benchmark, improvements of 6% and 9% over the previous baseline; all models and field boundary datasets for five countries are released. Conclusion: U-Net-style semantic segmentation outperforms instance segmentation and geospatial foundation models, and the proposed approach provides a reliable, scalable, and reproducible practical framework for field boundary delineation. Abstract: Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping are sensitive to illumination, spatial scale, and changes in geographic location. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFMs) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models under unified experimental settings, showing that a U-Net semantic segmentation model outperforms instance-based and GFM alternatives on a suite of performance and deployment metrics. We propose a new segmentation approach that combines a U-Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real-world conditions. Our model achieves a 76% IoU and 47% object-F1 on FTW, an increase of 6% and 9% over the previous baseline. Our approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference. We release all models and model-derived field boundary datasets for five countries.
[153] LLM Enhanced Action Recognition via Hierarchical Global-Local Skeleton-Language Model
Ruosi Wang,Fangwei Zuo,Lei Li,Zhaoqiang Xia
Main category: cs.CV
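TL;DR: This paper proposes a hierarchical global-local skeleton-language model (HocSLM) that combines multi-scale spatio-temporal modeling with vision-language semantic alignment, significantly improving skeleton-based human action recognition.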
Details
Motivation: Existing GCN-based methods rely on short-range motion topologies and struggle to model long-range joint dependencies, complex temporal dynamics, and cross-modal semantic alignment, which limits how well they represent action semantics. Method: HocSLM has three parts: 1) a hierarchical global-local network (HGLNet) with a composite-topology spatial module and a dual-path hierarchical temporal module; 2) a large vision-language model (VLM) that generates textual action descriptions for each video; 3) a skeleton-language sequential fusion module that uses a skeleton-language model (SLM) to align skeleton features with the textual descriptions in a unified semantic space. Result: State-of-the-art performance on three mainstream datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA. Conclusion: By jointly modeling multi-scale spatio-temporal structure and cross-modal semantic alignment, HocSLM strengthens the semantic discrimination and understanding of skeleton action representations and offers a new paradigm for skeleton-based action recognition. Abstract: Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), enabling the large action model to be more representative of action semantics. First, we design a hierarchical global-local network (HGLNet) that consists of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior knowledge of human physical structure, significantly enhancing the model's representation of complex spatio-temporal relationships. Then, a large vision-language model (VLM) is employed to generate textual descriptions by passing the original RGB video sequences to this model, providing rich action semantics for further training the skeleton-language model. Furthermore, we introduce a skeleton-language sequential fusion module by combining the features from HGLNet and the generated descriptions, which utilizes a skeleton-language model (SLM) for aligning skeletal spatio-temporal features and textual action descriptions precisely within a unified semantic space. The SLM can significantly enhance HGLNet's semantic discrimination capabilities and cross-modal understanding abilities. Extensive experiments demonstrate that the proposed HocSLM achieves state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
[154] UniDAC: Universal Metric Depth Estimation for Any Camera
Girish Chandar Ganesan,Yuliang Guo,Liu Ren,Xiaoming Liu
Main category: cs.CV
TL;DR: 本文提出UniDAC框架,通过解耦相对深度预测和空间变化尺度估计,实现单模型在多种相机(如鱼眼、360°)下的通用单目度量深度估计,并引入轻量级深度引导尺度估计模块与畸变感知位置编码RoPE-φ,在跨相机泛化上达到SOTA。
Details
Motivation: Existing zero-shot monocular metric depth estimation methods generalize poorly to large field-of-view cameras such as fisheye and 360° cameras, and existing unified-representation approaches either depend on specific training data or need separate models per domain. Method: UniDAC 1) decouples metric depth into relative depth prediction plus spatially varying scale estimation; 2) uses a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map with the relative depth map as guidance; 3) introduces RoPE-φ, a distortion-aware positional embedding that applies latitude-aware weighting in equirectangular projection (ERP). Result: Cross-camera generalization surpasses prior methods on multiple datasets, reaching the state of the art (SoTA). Conclusion: UniDAC achieves universal robustness across diverse camera types with a single model, supporting the value of decoupled modeling and distortion-aware design for cross-domain depth estimation. Abstract: Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and 360° cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large-FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-φ, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.
[155] MotiMem: Motion-Aware Approximate Memory for Energy-Efficient Neural Perception in Autonomous Vehicles
Haohua Que,Mingkai Liu,Jiayue Xie,Haojia Gao,Jiajun Sun,Hongyi Xu,Handong Yao,Fei Qiao
Main category: cs.CV
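TL;DR: MotiMem is a hardware-software co-designed memory interface that exploits temporal coherence and sparsity-aware coding to cut dynamic memory energy substantially while retaining high detection accuracy.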
Details
Motivation: High-resolution sensors improve perception robustness but impose a severe memory wall on battery-constrained electric vehicles; data movement often costs more energy than computation; and traditional image compression is semantically blind and optimizes for storage rather than bus switching activity. Method: MotiMem 1) exploits temporal coherence, using lightweight 2D Motion Propagation to identify Regions of Interest (RoI) dynamically; 2) applies Hybrid Sparsity-Aware Coding, combining adaptive inversion and truncation to induce bit-level sparsity. Result: Experiments on nuScenes, Waymo, and KITTI with 16 detection models show that MotiMem reduces memory-interface dynamic energy by about 43% while retaining about 93% of object detection accuracy, clearly outperforming standard codecs such as JPEG and WebP. Conclusion: MotiMem establishes a new Pareto frontier between energy and accuracy and offers a new paradigm for efficient memory interfaces in autonomous driving perception. Abstract: High-resolution sensors are critical for robust autonomous perception but impose a severe memory wall on battery-constrained electric vehicles. In these systems, data movement energy often outweighs computation. Traditional image compression is ill-suited as it is semantically blind and optimizes for storage rather than bus switching activity. We propose MotiMem, a hardware-software co-designed interface. Exploiting temporal coherence, MotiMem uses lightweight 2D Motion Propagation to dynamically identify Regions of Interest (RoI). Complementing this, a Hybrid Sparsity-Aware Coding scheme leverages adaptive inversion and truncation to induce bit-level sparsity. Extensive experiments across nuScenes, Waymo, and KITTI with 16 detection models demonstrate that MotiMem reduces memory-interface dynamic energy by approximately 43 percent while retaining approximately 93 percent of the object detection accuracy, establishing a new Pareto frontier significantly superior to standard codecs like JPEG and WebP.
[156] RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
Sen Zhang,Runmei Li,Zhichao Zheng,Yuhe Zhang,Jiani Li,Kailun Zhang,Tao Zhang,Wenjun Wu,Qunbo Wang
Main category: cs.CV
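TL;DR: This paper proposes RailVQA-bench, the first visual question answering benchmark for Automatic Train Operation, and RailVQA-CoM, a collaborative large-small model framework, to improve the generalization, interpretability, and reasoning and planning ability of cab-view visual cognition in safety-critical scenarios while keeping latency low and reliability high.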
Details
Motivation: Existing ATO systems generalize poorly to rare but critical corner cases and lack high-level reasoning and planning; LMMs have cognitive potential but bring high computational cost and hallucination risk; and there is no dedicated benchmark for evaluating visual cognition. Method: RailVQA-bench is built with 20,000 single-frame and 1,168 video QA pairs, and the RailVQA-CoM framework combines a small efficient model with a large cognitive model through a transparent three-module structure and adaptive temporal sampling. Result: Substantially better performance, interpretability, and cross-domain generalization, lower inference latency, and support for plug-and-play deployment in autonomous driving systems. Conclusion: Together, RailVQA-bench and RailVQA-CoM provide a new benchmark and a workable architecture for safe and reliable intelligent train visual cognition, and advance the practical use of LMMs in rail-transit safety scenarios. Abstract: Automatic Train Operation (ATO) relies on low-latency, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, reduces inference latency, and strengthens cross-domain generalization, while enabling plug-and-play deployment in autonomous driving systems. Code and datasets will be available at https://github.com/Cybereye-bjtu/RailVQA.
[157] SJD-VP: Speculative Jacobi Decoding with Verification Prediction for Autoregressive Image Generation
Bingqi Shan,Baoquan Zhang,Xiaochen Qi,Xutao Li,Yunming Ye,Liqiang Nie
Main category: cs.CV
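TL;DR: This paper proposes SJD-VP, a new speculative Jacobi decoding method that uses the change in token probabilities across iterations to predict, and preferentially sample, the tokens most likely to pass verification, raising the acceptance rate and accelerating autoregressive image generation.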
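To make the "favor tokens whose probabilities increase" idea concrete, here is a toy sketch. The reweighting rule, the `strength` parameter, and both function names are hypothetical choices made for illustration; this entry does not state SJD-VP's actual sampling rule, so this should not be read as the authors' method.

```python
import random


def favor_rising_tokens(prev_probs, cur_probs, strength=1.0):
    """Reweight the current distribution toward tokens whose probability rose.

    Hypothetical rule: multiply each current probability by
    (1 + strength * positive probability gain), then renormalize.
    """
    boosted = [cur * (1.0 + strength * max(cur - prev, 0.0))
               for prev, cur in zip(prev_probs, cur_probs)]
    total = sum(boosted)
    return [b / total for b in boosted]


def sample_token(probs, seed=0):
    """Draw one token index from a categorical distribution."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Token 1 gained probability between two Jacobi iterations, token 0 lost some.
prev_probs = [0.5, 0.3, 0.2]
cur_probs = [0.4, 0.4, 0.2]
adjusted = favor_rising_tokens(prev_probs, cur_probs)
print(adjusted, sample_token(adjusted))
```

In this toy case the adjusted distribution shifts mass toward token 1, the only token whose probability increased, while still summing to one.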
Details
Motivation: Existing speculative Jacobi decoding (SJD) methods suffer from low acceptance rates for speculative tokens because of token selection ambiguity, and existing improvements do not fully exploit the iterative dynamics of decoding. Method: SJD-VP builds on the observation that tokens whose probabilities increase are more likely to be accepted by verification and to be correct; it uses the change in token probabilities across iterations to guide sampling, favoring tokens whose probabilities rise as a prediction of whether they will pass subsequent verification. The method is plug-and-play and integrates into existing SJD frameworks. Result: Extensive experiments on standard benchmarks show that SJD-VP consistently accelerates autoregressive decoding and improves image generation quality. Conclusion: By modeling how token probabilities evolve, SJD-VP mitigates the low acceptance rate of speculative decoding and improves inference efficiency without sacrificing generation quality, making it a practical and general optimization. Abstract: Speculative Jacobi Decoding (SJD) has emerged as a promising method for accelerating autoregressive image generation. Despite its potential, existing SJD approaches often suffer from the low acceptance rate issue of speculative tokens due to token selection ambiguity. Recent works attempt to mitigate this issue primarily from the relaxed token verification perspective but fail to fully exploit the iterative dynamics of decoding. In this paper, we conduct an in-depth analysis and make a novel observation that tokens whose probabilities increase are more likely to match the verification-accepted and correct token. Based on this, we propose a novel Speculative Jacobi Decoding with Verification Prediction (SJD-VP). The key idea is to leverage the change in token probabilities across iterations to guide sampling, favoring tokens whose probabilities increase. This effectively predicts which tokens are likely to pass subsequent verification, boosting the acceptance rate. In particular, our SJD-VP is plug-and-play and can be seamlessly integrated into existing SJD methods. Extensive experiments on standard benchmarks demonstrate that our SJD-VP method consistently accelerates autoregressive decoding while improving image generation quality.
[158] Follow Your Heart: Landmark-Guided Transducer Pose Scoring for Point-of-Care Echocardiography
Zaiyang Guo,Jessie N. Dong,Filippos Bellos,Jilei Hao,Emily J. MacKay,Trevor Chan,Shir Goldfinger,Sethu Reddy,Steven Vance,Jason J. Corso,Alison M. Pouch
Main category: cs.CV
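TL;DR: This paper proposes a multi-task network that helps operators acquire high-quality apical 4-chamber (A4CH) views in point-of-care transthoracic echocardiography (TTE) and automatically estimates left ventricular ejection fraction (LVEF), without complex transducer position tracking equipment.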
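For readers unfamiliar with the quantity being estimated, LVEF is conventionally defined as (EDV − ESV) / EDV × 100, where EDV and ESV are the end-diastolic and end-systolic left ventricular volumes. The sketch below only evaluates that standard definition; how the network derives volumes from its detected landmarks is not described in this entry, and the function name and input checks are assumptions for illustration.

```python
def ejection_fraction(edv_ml, esv_ml):
    """Left ventricular ejection fraction in percent.

    edv_ml: end-diastolic volume in millilitres (must be positive).
    esv_ml: end-systolic volume in millilitres (between 0 and edv_ml).
    """
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("end-systolic volume must lie between 0 and EDV")
    # Stroke volume (EDV - ESV) expressed as a percentage of EDV.
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Illustrative volumes only, not data from the paper.
print(ejection_fraction(100.0, 40.0))
```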
Details
Motivation: Acquiring the A4CH view is essential for clinical assessment (e.g., LVEF), but positioning the transducer is hard for novice operators, so image-driven real-time feedback and quantitative guidance are needed. Method: A cascaded multi-task network combines a transducer pose scoring module with an uncertainty-aware left ventricle landmark detector and integrates automated LVEF estimation; training and inference rely only on routine TTE images and need no external pose sensors. Result: Validated on point-of-care TTE data acquired with a dense "sweep" protocol, the network determines from images alone whether the transducer pose is on target, near target, or off target, and generates visual landmark cues that guide image interpretation and orientation. Conclusion: The method offers a feasible, low-cost strategy for assisting A4CH view acquisition in point-of-care TTE in resource-limited settings and may improve the quality and efficiency of frontline scanning. Abstract: Point-of-care transthoracic echocardiography (TTE) makes it possible to assess a patient's cardiac function in almost any setting. A critical step in the TTE exam is acquisition of the apical 4-chamber (A4CH) view, which is used to evaluate clinically impactful measurements such as left ventricular ejection fraction (LVEF). However, optimizing transducer pose for high-quality image acquisition and subsequent measurement is a challenging task, particularly for novice users. In this work, we present a multi-task network that provides feedback cues for A4CH view acquisition and automatically estimates LVEF in high-quality A4CH images. The network cascades a transducer pose scoring module and an uncertainty-aware LV landmark detector with automated LVEF estimation. A strength is that network training and inference do not require cumbersome or costly setups for transducer position tracking. We evaluate performance on point-of-care TTE data acquired with a spatially dense "sweep" protocol around the optimal A4CH view. The results demonstrate the network's ability to determine when the transducer pose is on target, close to target, or far from target based on the images alone, while generating visual landmark cues that guide anatomical interpretation and orientation. In conclusion, we demonstrate a promising strategy to provide guidance for A4CH view acquisition, which may be useful when deploying point-of-care TTE in limited resource settings.
[159] LightMover: Generative Light Movement with Color and Intensity Controls
Gengze Zhou,Tianyu Wang,Soo Ye Kim,Zhixin Shu,Xin Yu,Yannick Hold-Geoffroy,Sumit Chaturvedi,Qi Wu,Zhe Lin,Scott Cohen
Main category: cs.CV
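TL;DR: LightMover is a framework that uses video diffusion priors for controllable light editing in single images. It formulates light editing as sequence-to-sequence prediction in visual token space and introduces an adaptive token-pruning mechanism that improves efficiency and fidelity.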
Details
Motivation: Existing methods struggle to give physically plausible, fine-grained control over lighting in a single image (position, color, intensity, and the resulting reflections, shadows, and falloff) without re-rendering the scene, and they lack a unified treatment of spatial movement and appearance attributes. Method: Light editing is formulated as sequence-to-sequence prediction in visual token space, taking an image and light-control tokens as input and producing the adjusted lighting; an adaptive token-pruning mechanism compresses the control sequence; and a scalable rendering pipeline generates large-scale paired training data. Result: LightMover gives precise, independent control over light position, color, and intensity, performs well on PSNR and semantic consistency (DINO, CLIP), and shortens the control sequence by 41% while maintaining editing fidelity. Conclusion: By combining video diffusion priors with tokenized light control, LightMover offers an efficient, physically plausible, and semantically consistent new paradigm for single-image light editing. Abstract: We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.
[160] MEDIC-AD: Towards Medical Vision-Language Model's Clinical Intelligence
Woohyeon Park,Jaeik Kim,Sunghwan Steve Cho,Pa Hong,Wookyoung Jeong,Yoojin Nam,Namjoon Kim,Ginny Y. Wong,Ka Chun Cheung,Jaeyoung Do
Main category: cs.CV
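TL;DR: MEDIC-AD is a clinically oriented medical vision-language model that introduces anomaly-aware tokens, temporal difference tokens, and an explainable heatmap generation stage, significantly improving lesion detection, symptom tracking, and visual explanation, and reaching state of the art on multiple tasks.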
Details
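Motivation: Existing medical vision-language models lack effective mechanisms for turning general knowledge into clinically actionable outputs (such as lesion localization, tracking of disease changes, and interpretable decisions). Method: A three-stage framework is proposed: 1) learnable anomaly-aware tokens (
[161] ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning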
Huanxuan Liao,Zhongtao Jiang,Yupu Hao,Yuqiao Tan,Shizhu He,Jun Zhao,Kun Xu,Kang Liu
Main category: cs.CV
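TL;DR: This paper proposes ResAdapt, a framework that adaptively allocates visual budget on the input side to improve the video understanding of multimodal large language models (MLLMs) under a limited visual-token budget, with marked efficiency and accuracy gains on low-budget and reasoning-intensive tasks.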
Details
Motivation: High-resolution inputs inflate the number of visual tokens in existing MLLMs, making it hard to sustain both spatial resolution and long temporal context; the bottleneck is the volume of pixels the encoder receives rather than how representations are compressed after encoding. Method: ResAdapt couples a lightweight Allocator (modeled as a contextual bandit) with the unchanged MLLM backbone, and trains the Allocator with Cost-Aware Policy Optimization (CAPO), which turns sparse feedback into a stable learning signal. Result: On budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt clearly improves low-budget performance and often lies on the efficiency-accuracy frontier; it supports up to 16x more frames at the same visual budget with over 15% performance gain. Conclusion: Adaptive input-side allocation of visual resources is an effective way to relieve the visual-token bottleneck of MLLMs, and ResAdapt delivers efficient, scalable video understanding while leaving the backbone unchanged. Abstract: Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
[162] Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision
Yizhou Jin,Yuezhu Feng,Jinjin Zhang,Peng Wang,Qingjie Liu,Yunhong Wang
Main category: cs.CV
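TL;DR: This paper proposes ReAL, which uses the intrinsic reasoning ability of multimodal large language models (MLLMs) to perform anomaly detection, pixel-level localization, and interpretable reasoning from image-level supervision alone, with no extra modules or pixel-level labels. It extracts anomaly-related tokens from the reasoning process and aggregates their attention responses into an anomaly map, and adds CGRO, a consistency-guided reinforcement learning module that improves the agreement between reasoning and localization.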
Details
Motivation: Existing MLLM-based anomaly detection is confined to image-level detection and textual reasoning; pixel-level localization depends on external vision modules and dense annotations, so end-to-end, weakly supervised fine-grained localization and interpretability are missing. Method: Reasoning-Driven Anomaly Localization (ReAL) extracts anomaly-related tokens from the MLLM's autoregressive reasoning and aggregates their attention responses into pixel-level anomaly maps; a Consistency-Guided Reasoning Optimization (CGRO) module uses reinforcement learning to align reasoning tokens with visual attention. Result: On four public benchmarks the method significantly improves anomaly detection, localization accuracy, and interpretability; with image-level supervision only, it performs on par with MLLM methods that rely on pixel-level supervision. Conclusion: MLLMs have untapped potential for pixel-level localization; modeling the consistency between the reasoning process and visual attention enables high-performing, interpretable anomaly analysis under very weak supervision. Abstract: Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at https://github.com/YizhouJin313/ReADL.
[163] Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
Ankur Sikarwar,Debangan Mishra,Sudarshan Nikhil,Ponnurangam Kumaraguru,Aishwarya Agrawal
Main category: cs.CV
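TL;DR: This paper proposes COSMIC, a benchmark for evaluating whether multimodal large language models (MLLMs) can build shared spatial understanding collaboratively through natural-language dialogue. Experiments show that current MLLMs do reasonably well at identifying anchor objects but are limited in relational reasoning and in building globally consistent maps, falling far short of human performance.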
Details
Motivation: 探究MLLMs是否能像人类一样,通过交流视角依赖的局部观察,协作建立对共享环境的统一、非自我中心的空间心智模型。 Method: 构建COSMIC基准:两个静态MLLM代理从不同视角观察3D室内场景,通过自然语言对话协作回答空间查询;包含899个场景、1250个问答对及五类任务;同时采集250组人类对话作为对比。 Result: MLLMs能力呈明显层级:锚点物体识别最可靠,关系推理次之,全局地图构建接近随机水平;思维能力提升锚点定位但不足以支撑高层空间协作;Gemini-3-Pro-Thinking达72%准确率,远低于人类95%。 Conclusion: 当前MLLMs尚不具备稳健构建与维护共享空间心智模型的能力,尤其在对话收敛性与全局一致性方面存在根本局限,亟需新方法突破。 Abstract: Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy, MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance, even for the frontier models. Moreover, we find thinking capability yields consistent gains in anchor grounding, but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement for even the best performing model Gemini-3-Pro-Thinking which achieves 72% aggregate accuracy. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. 
Our code and data are available at https://github.com/ankursikarwar/Cosmic
[164] Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
Zhiyang Xu,Tian Qin,Bowen Jin,Zhengfeng Lai,Meng Cao,Lifu Huang,Peng Zhang
Main category: cs.CV
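TL;DR: This paper proposes Temporal Global Policy Optimization (TGPO), a reinforcement learning algorithm that improves the temporal awareness of multimodal large language models (MLLMs) in egocentric video understanding. By contrasting outputs on ordered versus shuffled video frames, it derives calibrated global reward signals that suppress spatial shortcut behavior and strengthen temporal reasoning.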
Details
Motivation: Existing MLLMs lack temporal awareness in egocentric visual understanding, mainly because their training objectives do not explicitly encourage temporal reasoning and instead rely on frame-level spatial shortcuts.
Method: Proposes TGPO, a reinforcement learning with verifiable rewards (RLVR) algorithm that contrasts model outputs on ordered and shuffled video frames to produce globally normalized reward signals. Combined with GRPO and GSPO, it enables cold-start training and suppresses spatial shortcuts.
Result: Experiments on five egocentric video benchmarks show that TGPO markedly improves temporal grounding and causal coherence, outperforming existing RL-based video reasoning methods.
Conclusion: TGPO offers a simple and scalable path toward MLLMs with robust temporal awareness.
Abstract: Multimodal large language models (MLLMs) have recently shown strong performance in visual understanding, yet they often lack temporal awareness, particularly in egocentric settings where reasoning depends on the correct ordering and evolution of events. This deficiency stems in part from training objectives that fail to explicitly reward temporal reasoning and instead rely on frame-level spatial shortcuts. To address this limitation, we propose Temporal Global Policy Optimization (TGPO), a reinforcement learning with verifiable rewards (RLVR) algorithm designed to incentivize temporal awareness in MLLMs. TGPO contrasts model outputs generated from temporally ordered versus shuffled video frames to derive calibrated, globally normalized reward signals that explicitly favor temporally coherent reasoning. Integrated with GRPO and GSPO, TGPO supports cold-start RL training and effectively suppresses spatial shortcut behaviors learned by existing MLLMs. Experiments across five egocentric video benchmarks demonstrate that TGPO consistently improves temporal grounding and causal coherence, outperforming prior RL-based video reasoning approaches. Our results suggest that TGPO offers a simple and scalable pathway toward temporally robust MLLMs for egocentric video understanding.
[165] MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation
Xiaofeng Tan,Wanjiang Weng,Hongsong Wang,Fang Zhao,Xin Geng,Liang Wang
Main category: cs.CV
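TL;DR: This paper proposes a reinforcement fine-tuning framework for text-to-motion generation, comprising the multi-dimensional reward model MotionReward and the efficient fine-tuning method EasyTune. It markedly improves semantic consistency, realism, and human-preference alignment while reducing memory and compute cost.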
Details
Motivation: After supervised pretraining, existing text-to-motion models still struggle to align with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods are tied to a single representation, optimize one aspect at a time, and carry high computational cost.
Method: Proposes a reinforcement fine-tuning framework with two parts: 1) MotionReward maps heterogeneous motion representations (joints, rotations, etc.) into a unified, text-anchored semantic space, supports multi-dimensional reward learning, and introduces self-refinement preference learning; 2) EasyTune targets the bottleneck of recursive gradient dependence across denoising steps by optimizing step-wise rather than over the full trajectory, giving fine-grained, memory-efficient updates.
Result: On the MLD model, FID reaches 0.132 at 22.10 GB peak memory, saving up to 15.22 GB over DRaFT. On ACMDM (joint representation), FID drops by 22.9%. On HY Motion (rotation representation), R-Precision improves by 12.6% and FID by 23.3%.
Conclusion: The framework enables cross-representation, multi-objective, efficient, and scalable post-training for text-to-motion generation, offering a new paradigm for jointly optimizing generation quality and efficiency.
Abstract: Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints; (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantic representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multidimensional reward learning; Self-refinement Preference Learning further enhances semantics without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck, and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework, achieving FID 0.132 at 22.10 GB peak memory for the MLD model and saving up to 15.22 GB over DRaFT.
It reduces FID by 22.9% on joint-based ACMDM, and achieves a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion. Our project page with code is publicly available.
[166] K$α$LOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
David Tschirschwitz,Volker Rodehorst
Main category: cs.CV
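TL;DR: This paper proposes the meta-algorithm KαLOS (KALOS), which standardizes dataset quality evaluation by resolving spatial correspondence before assessing agreement. This overcomes the inability of traditional statistical metrics to handle instance correspondence in vision tasks, and a new noise generator is introduced to validate the approach.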
Details
Motivation: Progress on object detection benchmarks has stalled, mainly because model improvements cannot be distinguished from label noise. Annotation consistency must be rigorously quantified to ensure reliable evaluation data, but existing statistical metrics cannot handle instance correspondence in vision tasks, and no objective ground truth for agreement exists against which to validate new metrics.
Method: Proposes the KαLOS meta-algorithm, based on the "Localization First" principle: spatial correspondence is resolved first, and nominal reliability matrices are then built. Localization parameters are calibrated in a data-driven way to fit different tasks (e.g., bounding boxes, volumetric segmentation, pose estimation), and a new controllable noise generator simulates complex, non-isotropic human annotation variability.
Result: KαLOS enables fine-grained diagnostics of dataset quality (e.g., annotator vitality, collaboration clustering, localization sensitivity) and shows good generalization and robustness across vision tasks. The new noise generator provides empirical support for the metric's properties.
Conclusion: KαLOS provides a reliable, standardized evaluation framework for distinguishing signal from noise in modern computer vision benchmarks, and may help restore trust in benchmark evaluation.
Abstract: Progress in object detection benchmarks is stagnating. It is limited not by architectures but by the inability to distinguish model improvements from label noise. To restore trust in benchmarking, the field requires rigorous quantification of annotation consistency to ensure the reliability of evaluation data. However, standard statistical metrics fail to handle the instance correspondence problem inherent to vision tasks. Furthermore, validating new agreement metrics remains circular because no objective ground truth for agreement exists. This forces reliance on unverifiable heuristics. We propose K$α$LOS (KALOS), a unified meta-algorithm that generalizes the "Localization First" principle to standardize dataset quality evaluation. By resolving spatial correspondence before assessing agreement, our framework transforms complex spatio-categorical problems into nominal reliability matrices. Unlike prior heuristic implementations, K$α$LOS employs a principled, data-driven configuration; by statistically calibrating the localization parameters to the inherent agreement distribution, it generalizes to diverse tasks ranging from bounding boxes to volumetric segmentation or pose estimation. This standardization enables granular diagnostics beyond a single score. These include annotator vitality, collaboration clustering, and localization sensitivity. To validate this approach, we introduce a novel and empirically derived noise generator. Where prior validations relied on uniform error assumptions, our controllable testbed models complex and non-isotropic human variability.
This provides evidence of the metric's properties and establishes K$α$LOS as a robust standard for distinguishing signal from noise in modern computer vision benchmarks.
[167] Let Triggers Control: Frequency-Aware Dropout for Effective Token Control
Junyoung Koh,Hoyeon Moon,Dongha Kim,Seungmin Lee,Sanghyun Park,Min Song
Main category: cs.CV
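TL;DR: This paper proposes Frequency-Aware Dropout (FAD), a regularization method that improves how reliably a trigger token controls a new concept in text-to-image generation models. Using co-occurrence analysis and curriculum-style scheduling, it markedly improves prompt fidelity and user-perceived quality without adding parameters or changing the architecture.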
Details
Motivation: Personalized text-to-image models based on LoRA and a single trigger token often suffer semantic entanglement, because the trigger token frequently co-occurs with its context during fine-tuning; this reduces controllability.
Method: Proposes Frequency-Aware Dropout (FAD), which has two core components, co-occurrence analysis and curriculum-style scheduling. It is a parameter-free regularization strategy applied to token-based diffusion models (e.g., SD 1.5, SDXL) and natural-language-driven backbones (e.g., FLUX, Qwen-Image).
Result: Across several mainstream models, FAD consistently improves prompt fidelity, stylistic precision, and user-perceived quality, with no new parameters or architectural changes and minimal computational overhead.
Conclusion: FAD is a simple, efficient, plug-and-play regularization technique that disentangles trigger-token and context representations, markedly improving controllability and personalization in text-to-image generation.
Abstract: Text-to-image models such as Stable Diffusion have achieved unprecedented levels of high-fidelity visual synthesis. As these models advance, personalization of generative models, commonly facilitated through Low-Rank Adaptation (LoRA) with a dedicated trigger token, has become a significant area of research. Previous works have naively assumed that fine-tuning with a single trigger token is sufficient to represent new concepts. However, this often results in poor controllability, where the trigger token alone fails to reliably evoke the intended concept. We attribute this issue to the frequent co-occurrence of the trigger token with the surrounding context during fine-tuning, which entangles their representations and compromises the token's semantic distinctiveness. To disentangle this, we propose Frequency-Aware Dropout (FAD), a novel regularization technique that improves prompt controllability without adding new parameters. FAD consists of two key components: co-occurrence analysis and curriculum-inspired scheduling. Qualitative and quantitative analyses across token-based diffusion models (SD 1.5 and SDXL) and natural-language-driven backbones (FLUX and Qwen-Image) demonstrate consistent gains in prompt fidelity, stylistic precision, and user-perceived quality. Our method provides a simple yet effective dropout strategy that enhances controllability and personalization in text-to-image generation.
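The abstract names FAD's two components (co-occurrence analysis and curriculum-inspired scheduling) but does not say how they are implemented. The following is only a rough sketch of one way such a scheme could look, under loudly stated assumptions: captions are whitespace-tokenized, a context token's drop probability is proportional to how often it co-occurs with the trigger, and the dropout strength ramps up linearly during training. The function names, the linear ramp, and the `max_scale` parameter are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def cooccurrence_freq(captions, trigger):
    """Fraction of trigger-containing captions in which each other token appears."""
    counts, n = Counter(), 0
    for caption in captions:
        tokens = caption.split()
        if trigger in tokens:
            n += 1
            counts.update(set(tokens) - {trigger})
    return {tok: c / n for tok, c in counts.items()} if n else {}


def curriculum_scale(step, total_steps, max_scale=0.5):
    """Linear ramp from 0 to max_scale (an assumed schedule, not the paper's)."""
    return max_scale * min(1.0, step / max(1, total_steps))


def frequency_aware_dropout(caption, trigger, freq, step, total_steps,
                            max_scale=0.5, rng=random):
    """Drop context tokens with probability scale * co-occurrence frequency.

    The trigger token itself is never dropped.
    """
    scale = curriculum_scale(step, total_steps, max_scale)
    kept = []
    for tok in caption.split():
        p_drop = 0.0 if tok == trigger else scale * freq.get(tok, 0.0)
        if rng.random() >= p_drop:
            kept.append(tok)
    return " ".join(kept)


# Toy captions; "sks" plays the role of the trigger token.
captions = ["sks dog on grass", "sks dog in park", "sks dog on beach", "a cat"]
freq = cooccurrence_freq(captions, "sks")
early = frequency_aware_dropout(captions[0], "sks", freq, step=0, total_steps=100)
late = frequency_aware_dropout(captions[0], "sks", freq, step=100, total_steps=100)
```

In this toy setup, `early` equals the original caption because the schedule starts at zero, while `late` may have lost frequently co-occurring words such as "dog". The intent is that the trigger token is increasingly seen without its habitual context as training proceeds.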
Notably, it achieves these improvements without introducing additional parameters or architectural modifications, making it readily applicable to existing models with minimal computational overhead.
[168] Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
Ji Ma,Wei Suo,Peng Wang,Yanning Zhang
Main category: cs.CV
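TL;DR: This paper studies the distinctive causes of hallucination in multimodal chain-of-thought (MCoT) models during visual reasoning. It finds that hallucinations mainly arise in associative reasoning steps (divergent thinking) and proposes a simple, effective decoding intervention that localizes and mitigates them, markedly improving performance while remaining compatible with other mitigation methods.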
Details
Motivation: MCoT models perform well on complex visual reasoning but hallucinate severely, and prior work attributes this mainly to visual attention decay. The authors ask whether hallucinations in MCoT models have causes distinct from those in traditional large vision-language models (LVLMs).
Method: Systematically analyzes hallucination patterns in MCoT models and finds that hallucinated text is produced mainly in associative reasoning (divergent thinking) steps. Based on this, designs a lightweight strategy that accurately localizes divergent thinking steps and intervenes during decoding.
Result: The method substantially outperforms existing hallucination mitigation methods on multiple benchmarks and can be combined with other methods to further improve them. The code is open source.
Conclusion: Hallucination in MCoT models has a distinctive mechanism (divergent thinking steps); intervening on that mechanism mitigates hallucination efficiently and is both compatible with other methods and practical.
Abstract: Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucinations? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that can localize divergent thinking steps and intervene in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can be conveniently integrated with other hallucination mitigation methods and further boost their performance. The code is publicly available at https://github.com/ASGO-MM/MCoT-hallucination.
[169] Make It Up: Fake Images, Real Gains in Generalized Few-shot Semantic Segmentation
Guohuan Xie,Xin He,Dingying Fan,Le Zhang,Ming-Ming Cheng,Yun Liu
Main category: cs.CV
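TL;DR: This paper proposes the Syn4Seg framework, which uses diffusion models to generate diverse, class-consistent images and combines two-stage pseudo-label refinement with boundary-constrained updates to improve novel-class coverage and mask quality in generalized few-shot semantic segmentation.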
Details
Motivation: Generalized few-shot semantic segmentation (GFSS) is limited by insufficient appearance coverage of novel classes under sparse annotation. Diffusion models can generate images, but incomplete coverage and unreliable masks often make the resulting supervision noisy.
Method: Syn4Seg has three parts: 1) it builds an embedding-deduplicated prompt bank to widen prompt-space coverage and generate diverse, class-consistent synthetic images; 2) it performs support-guided two-stage pseudo-label refinement, first filtering low-consistency regions to obtain high-precision seeds, then relabeling uncertain pixels with adaptive prototypes that fuse global (support set) and local (image) statistics; 3) it applies a constrained SAM-based update only to boundary-band and unlabeled pixels to improve contour fidelity.
Result: On PASCAL-$5^i$ and COCO-$20^i$, it yields consistent gains in both 1-shot and 5-shot settings, supporting the effectiveness and scalability of synthetic data for GFSS with reliable masks and precise boundaries.
Conclusion: Through generation-based augmentation and refined pseudo-labeling, Syn4Seg mitigates insufficient novel-class coverage and pseudo-label noise in GFSS, offering a new paradigm for few-shot segmentation that balances scale and precision.
Abstract: Generalized few-shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel-class appearances under scarce annotations. While diffusion models can synthesize novel-class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation-enhanced GFSS framework designed to expand novel-class coverage while improving pseudo-label quality. Syn4Seg first maximizes prompt-space coverage by constructing an embedding-deduplicated prompt bank for each novel class, yielding diverse yet class-consistent synthetic images. It then performs support-guided pseudo-label estimation via a two-stage refinement that i) filters low-consistency regions to obtain high-precision seeds and ii) relabels uncertain pixels with image-adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary-band and unlabeled pixels using a constrained SAM-based update to improve contour fidelity without overwriting high-confidence interiors. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate consistent improvements in both 1-shot and 5-shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
[170] HD-VGGT: High-Resolution Visual Geometry Transformer
Tianrun Chen,Yuanqi Hu,Yidong Han,Hanjie Xu,Deyi Ji,Qi Zhu,Chunan Yu,Xin Zhang,Cheng Chen,Chaotao Ding,Ying Zang,Xuanfu Li,Jin Ma,Lanyun Zhu
Main category: cs.CV
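TL;DR: This paper proposes HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts globally consistent coarse geometry, a high-resolution branch refines details through a learned feature upsampling module, and a feature modulation mechanism suppresses unreliable features, achieving state-of-the-art reconstruction quality without the high cost of a full-resolution transformer.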
Details
Motivation: High-resolution images are essential for accurate 3D reconstruction, but existing transformer-based feed-forward methods (such as VGGT) face sharply rising compute and memory costs when scaled to high resolution. In addition, visually ambiguous regions (repetitive textures, weak textures, specular reflections) produce unstable feature tokens that severely degrade geometric inference, especially at high resolution.
Method: Proposes the dual-branch HD-VGGT architecture: 1) a low-resolution branch produces globally consistent coarse geometry; 2) a high-resolution branch refines details via a learned feature upsampling module; 3) a Feature Modulation mechanism suppresses unreliable features early in the transformer.
Result: HD-VGGT uses high-resolution images and supervision while avoiding the high cost of a full-resolution transformer, and achieves state-of-the-art reconstruction quality.
Conclusion: By decoupling coarse and fine geometry modeling from feature-stability control, HD-VGGT addresses the efficiency and robustness bottlenecks of high-resolution 3D reconstruction and offers a feasible path to high-fidelity reconstruction of large-scale scenes.
Abstract: High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
[171] EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams
JaeSeong Kim,Chaehwan Lim,Sang Hyun Gil,Suan Lee
Main category: cs.CV
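TL;DR: EuraGovExam is a multilingual, multimodal benchmark built from real civil service exams in five Eurasian regions. It contains more than 8,000 scanned image questions and requires models to perform cross-lingual, layout-aware reasoning directly from images; existing VLMs reach only 86% accuracy, underscoring its difficulty and diagnostic value.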
Details
Motivation: Existing benchmarks do not reflect the visual complexity, cultural realism, and multilingual nature of real public-sector exams; an evaluation standard closer to real high-stakes settings is needed.
Method: Builds the EuraGovExam dataset from scanned images of real civil service exams in South Korea, Japan, Taiwan, India, and the European Union, with more than 8,000 multiple-choice questions across 17 domains. All question content (stem, options, figures) is embedded in a single high-resolution image, with only a standardized output instruction provided. The dataset emphasizes real layouts (tables, multilingual typography, form-like structure).
Result: State-of-the-art vision-language models (VLMs) reach only 86% accuracy on the benchmark, markedly lower than on conventional text or simplified-image benchmarks, which supports its difficulty and diagnostic power.
Conclusion: EuraGovExam sets a new standard for evaluating VLMs in high-stakes, multilingual, image-driven settings, and supports practical applications in e-governance, official document analysis, and equitable exam preparation.
Abstract: We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content (including problem statements, answer choices, and visual elements) within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.
[172] NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
Yanying Li,Jinyang Li,Shengfeng He,Yangyang Xu,Junyu Dong,Yong Du
Main category: cs.CV
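TL;DR: NimbusGS is a unified framework for reconstructing high-quality 3D scenes from multi-view images degraded by a variety of adverse weather conditions. It decomposes degradation into a global transmission field and per-view particulate residuals, and introduces a geometry-guided gradient scaling mechanism for robust self-supervised optimization of 3D Gaussian representations.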
Details
Motivation: Existing methods usually target one specific weather type and generalize poorly to mixed, diverse adverse weather. This paper addresses general weather degradation modeling together with high-quality geometry reconstruction.
Method: Decomposes weather degradation into 1) a global transmission field shared across views (capturing static atmospheric effects) and 2) per-view particulate residuals (capturing dynamic scattering and occlusion). It also proposes a geometry-guided gradient scaling mechanism to mitigate gradient imbalance in the self-supervised optimization of 3D Gaussians under severe visibility degradation.
Result: Across a range of complex adverse weather conditions, NimbusGS achieves better geometry reconstruction quality and outperforms task-specific methods.
Conclusion: Through physically inspired disentangled degradation modeling and a stable optimization mechanism, the method markedly improves the robustness and generalization of multi-view 3D reconstruction in adverse weather.
Abstract: We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient imbalance during the self-supervised optimization of 3D Gaussian representations. This physically grounded formulation allows NimbusGS to disentangle complex degradations while preserving scene structure, yielding superior geometry reconstruction and outperforming task-specific methods across diverse and challenging weather conditions. Code is available at https://github.com/lyy-ovo/NimbusGS.
[173] An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
Yi Feng,Junwu E,Zizhan Guo,Yu Ma,Hanli Wang,Rui Fan
Main category: cs.CV
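TL;DR: This paper presents a new benchmark for 3D panoptic occupancy prediction in autonomous driving, including ADMesh, the first unified high-quality 3D mesh library, and CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset, together with standardized evaluation and a model benchmarking platform.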
Details
Motivation: Research on 3D panoptic occupancy prediction is held back by the lack of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets, which leads to low geometric reconstruction accuracy, unreliable occlusion reasoning, and limited holistic 3D understanding.
Method: Builds ADMesh, the first unified high-quality 3D mesh library for autonomous driving (over 15K models with textures and semantic annotations). On top of it, generates the CarlaOcc dataset in CARLA (over 100K frames, 0.05 m voxel resolution, instance-level occupancy ground truth). Designs standardized evaluation metrics and systematically benchmarks mainstream models.
Result: Releases ADMesh and the CarlaOcc dataset (over 100K frames, 0.05 m voxel resolution, instance-level annotations), proposes new evaluation metrics, and completes a systematic evaluation of representative models on the benchmark.
Conclusion: The work provides a high-quality data foundation, unified evaluation standards, and an open platform for 3D panoptic occupancy prediction, advancing research on precise geometric reconstruction, reliable occlusion reasoning, and full-scene 3D perception.
Abstract: Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic annotations. Building upon ADMesh, we further construct CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset generated using the CARLA simulator. This dataset contains over 100K frames with fine-grained, instance-level occupancy ground truth at voxel resolutions as fine as 0.05 m. Furthermore, standardized evaluation metrics are introduced to quantify the quality of existing occupancy datasets. Finally, a systematic benchmark of representative models is established on the proposed dataset, which provides a unified platform for fair comparison and reproducible research in the field of 3D panoptic perception.
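To make "voxel-wise semantics and instance identities" concrete, here is a minimal sketch of a panoptic occupancy grid at the 0.05 m resolution quoted above. It is not the CarlaOcc generation pipeline (which the abstract says is built from meshes in the CARLA simulator). The assumed input is a list of already-labelled 3D points, and the sparse dictionary layout, the majority-vote rule, and all function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter

VOXEL_SIZE = 0.05  # metres; the finest resolution quoted in the abstract


def voxel_index(point, origin=(0.0, 0.0, 0.0), voxel_size=VOXEL_SIZE):
    """Map a 3D point (in metres) to integer voxel coordinates."""
    return tuple(math.floor((p - o) / voxel_size) for p, o in zip(point, origin))


def build_panoptic_grid(labelled_points, voxel_size=VOXEL_SIZE):
    """Build a sparse grid {voxel index: (semantic class, instance id)}.

    labelled_points: iterable of ((x, y, z), semantic_class, instance_id).
    Each occupied voxel takes the majority (class, instance) label of the
    points that fall inside it (an assumed rule, for illustration only).
    """
    votes = {}
    for point, semantic_class, instance_id in labelled_points:
        idx = voxel_index(point, voxel_size=voxel_size)
        votes.setdefault(idx, Counter())[(semantic_class, instance_id)] += 1
    return {idx: counter.most_common(1)[0][0] for idx, counter in votes.items()}


# Toy scene: two cars (instances 1 and 2) and one road point.
points = [
    ((0.12, 0.03, 0.26), "car", 1),
    ((0.13, 0.04, 0.27), "car", 1),
    ((0.14, 0.01, 0.28), "road", 0),
    ((1.01, 0.00, 0.00), "car", 2),
]
grid = build_panoptic_grid(points)
```

Here the first three points fall into the same voxel, which is labelled as car instance 1 by majority vote, while the fourth point occupies a separate voxel labelled as car instance 2. A panoptic prediction model would be expected to output both fields for every occupied voxel.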
Code and dataset are available at https://mias.group/CarlaOcc.
[174] Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
Jinhu Fu,Yihang Lou,Qingyi Si,Shudong Zhang,Yan Bai,Sen Su
Main category: cs.CV
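TL;DR: This paper proposes the CARE framework, which uses causal mediation analysis to identify unsafe channels in LVLMs and introduces a dual-modal safety subspace projection method that dynamically projects activations onto safety subspaces at inference time, improving safety without harming multimodal capability.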
Details
Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks, but their internal safety mechanisms are opaque and hard to control.
Method: First uses causal mediation analysis to locate the neurons and layers responsible for unsafe behavior. Then, using generalized eigen-decomposition, learns safety subspaces between benign and malicious activations separately for the visual and textual modalities. Finally, a hybrid fusion mechanism dynamically projects activations toward the safety subspaces at inference time.
Result: Markedly improves LVLM safety robustness on multiple safety benchmarks, outperforms existing activation steering and alignment methods, and generalizes its defense to unseen attacks.
Conclusion: The CARE framework provides interpretable diagnosis and effective repair of unsafe channels in LVLMs, preserving the model's multimodal understanding and reasoning while improving safety.
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present CARE, a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs. We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity. Extensive experiments on multiple safety benchmarks demonstrate that our causal-subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Additionally, our method exhibits good transferability, defending against unseen attacks.
[175] SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track
Dengxian Gong,Quanzhu Niu,Shihao Chen,Yuanzheng Wu,Yikang Zhou,Tao Zhang,Haobo Yuan,Lu Qi,Shunping Ji
Main category: cs.CV
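TL;DR: This paper proposes Still Awesome SaSaSa2VA (SaSaSaSa2VA), which adds a target existence-aware verification mechanism and performs strongly on the MeViS motion-centric expression video segmentation benchmark, with a final score of 89.19 and 2nd place in the 5th PVUW Challenge.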
Details
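Motivation: Existing RVOS methods rely mainly on static textual cues, whereas the MeViS benchmark introduces motion-centric expressions and no-target queries, which call for methods that handle dynamic understanding and target existence judgment.
Method: Builds on SaSaSa2VA, which strengthens the Sa2VA backbone with more input frames and [SEG] tokens, and adds a target existence-aware verification mechanism that explicitly models whether the target is present.
Result: Scores 89.19 on the MeViS-Text Track, ranking 2nd. Ablations indicate that the existence-aware verification strategy is sufficient to markedly improve performance on motion-centric referring tasks.
Conclusion: A simple target existence-aware verification mechanism effectively improves the model's understanding and segmentation of motion expressions without complex architectural changes, supporting its key role in RVOS, particularly in motion reasoning tasks.
Abstract: Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. The MeViS benchmark extends this by incorporating motion-centric expressions (referring and reasoning motion expressions) and introducing no-target queries. Extending SaSaSa2VA, where increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, leading to Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy is sufficient to unlock strong performance on motion-centric referring tasks.
[176] IP-SAM: Prompt-Space Conditioning for Prompt-Absent Camouflaged Object Detection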
Huiyao Zhang,Jin Bai,Rui Guo,JianWen Tan,HongFei Wang,Ye Li
Main category: cs.CV
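TL;DR: This paper proposes IP-SAM, which uses self-generated intrinsic prompts (SPG) and prompt-space gating (PSG) to perform fully automatic image segmentation without external prompts, achieving strong performance on both camouflaged object detection and medical polyp segmentation.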
Details
Motivation: Prompt-based segmentation models face a structural mismatch in real deployments where no prompt is available, and mainstream adaptation methods bypass the native prompt interface, weakening prompt-guided decoding.
Method: Proposes prompt-space conditioning: 1) a Self-Prompt Generator (SPG) distills image context into intrinsic prompts that act as regional anchors; 2) these prompts are projected through the frozen SAM2 prompt encoder, restoring prompt-guided decoding; 3) Prompt-Space Gating (PSG) uses the intrinsic background prompt as an asymmetric suppressive constraint to reduce background false positives.
Result: Reaches state of the art on four camouflaged object detection benchmarks (e.g., MAE = 0.017 on COD10K) with only 21.26M trainable parameters. When trained only on Kvasir-SEG, it transfers strongly in zero-shot fashion to the CVC-ClinicDB and ETIS medical datasets.
Conclusion: IP-SAM shows that adapting the model from a prompt-space perspective is effective, combining fully automatic segmentation with the strengths of the native prompt mechanism and showing potential for cross-domain generalization.
Abstract: Prompt-conditioned foundation segmenters have emerged as a dominant paradigm for image segmentation, where explicit spatial prompts (e.g., points, boxes, masks) guide mask decoding. However, many real-world deployments require fully automatic segmentation, creating a structural mismatch: the decoder expects prompts that are unavailable at inference. Existing adaptations typically modify intermediate features, inadvertently bypassing the model's native prompt interface and weakening prompt-conditioned decoding. We propose IP-SAM, which revisits adaptation from a prompt-space perspective through prompt-space conditioning. Specifically, a Self-Prompt Generator (SPG) distills image context into complementary intrinsic prompts that serve as coarse regional anchors. These cues are projected through SAM2's frozen prompt encoder, restoring prompt-guided decoding without external intervention. To suppress background-induced false positives, Prompt-Space Gating (PSG) leverages the intrinsic background prompt as an asymmetric suppressive constraint prior to decoding. Under a deterministic no-external-prompt protocol, IP-SAM achieves state-of-the-art performance across four camouflaged object detection benchmarks (e.g., MAE 0.017 on COD10K) with only 21.26M trainable parameters (optimizing SPG, PSG, and a task-specific mask decoder trained from scratch, alongside image-encoder LoRA while keeping the prompt encoder frozen).
Furthermore, the proposed conditioning strategy generalizes beyond COD to medical polyp segmentation, where a model trained solely on Kvasir-SEG exhibits strong zero-shot transfer to both CVC-ClinicDB and ETIS.
[177] Zero-shot Vision-Language Reranking for Cross-View Geolocalization
Yunus Talha Erzurumlu,John E. Anderson,William J. Shuart,Charles Toth,Alper Yilmaz
Main category: cs.CV
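TL;DR: This paper proposes a two-stage reranking framework based on zero-shot vision-language models (VLMs) to improve Top-1 accuracy in cross-view geolocalization (CVGL). Experiments show that pointwise scoring fails, whereas pairwise comparison with LLaVA markedly improves performance, indicating that VLMs are better at relative visual judgment than at absolute relevance scoring.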
Details
Motivation: 现有CVGL系统虽能召回高相关候选(高Recall@k),但难以精准选出唯一最优匹配(低Top-1精度),亟需更精细的重排序方法。 Method: 提出两阶段框架:先用SOTA方法进行初步检索,再利用零样本VLM进行重排序;系统对比点对点(Pointwise)与成对比较(Pairwise)两种策略。 Result: 在VIGOR数据集上,所有点对点方法导致性能灾难性下降或无改善;而基于LLaVA的成对比较策略显著提升Top-1准确率。 Conclusion: 零样本VLM不适用于绝对相关性打分,但在细粒度相对视觉判断上表现优异,因此成对重排序是提升CVGL精度的有效新方向。 Abstract: Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods cause a catastrophic drop in performance or no change at all. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that, these VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.[178] Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
Seng Nam Chen,Hao Chen,Chenglam Ho,Xinyu Mao,Jinping Wang,Yu Zhang,Chao Li
Main category: cs.CV
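TL;DR: This paper introduces the SceneBench benchmark to evaluate scene-level temporal understanding of long videos by vision-language models (VLMs) and finds that existing models forget long-range context to a significant degree. It then designs Scene-RAG, which adds dynamic scene-memory retrieval and improves performance by 2.50%.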
Details
Motivation: Existing long-video understanding benchmarks focus only on fine-grained perception or coarse summarization and do not evaluate scene-level temporal understanding that matches human perception.
Method: Defines a "scene" as a coherent video segment whose visual and semantic context stay consistent; builds the SceneBench benchmark; proposes Scene-RAG, which builds a dynamic scene memory by retrieving and integrating context across scenes.
Result: VLM accuracy on scene-level question answering drops sharply on SceneBench; Scene-RAG yields a +2.50% improvement.
Conclusion: Current VLMs still forget long-range context severely in scene-level long-video understanding, and more robust, human-like video comprehension needs to be developed.
Abstract: Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by +2.50%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
[179] MD-RWKV-UNet: Scale-Aware Anatomical Encoding with Cross-Stage Fusion for Multi-Organ Segmentation
Zhuoyi Fang
Main category: cs.CV
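TL;DR: This paper proposes MD-RWKV-UNet, which combines dynamic deformable spatial shifts with the RWKV mechanism, selective kernel attention, and cross-stage dual-attention fusion to better model scale variation, deformation, and boundary precision in multi-organ segmentation, reaching state-of-the-art results on the Synapse and ACDC datasets.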
Details
Motivation: Multi-organ segmentation faces large anatomical variability, complex inter-organ dependencies, and diverse scales and shapes. Conventional encoder-decoder models struggle to capture both local detail and long-range context, and perform poorly on small or easily deformed organs in particular.
Method: Proposes MD-RWKV-UNet. Its core is the MD-RWKV block, which fuses deformable spatial shifts with the Receptance Weighted Key Value mechanism. Selective Kernel Attention adaptively selects among multi-scale convolutional kernels, and a cross-stage dual-attention fusion strategy aggregates multi-level features.
Result: Achieves state-of-the-art performance on the Synapse and ACDC datasets, with marked gains in boundary precision and small-organ segmentation metrics.
Conclusion: MD-RWKV-UNet offers a lightweight yet expressive approach to dynamic organ modeling that overcomes the limits of static convolution and global attention, improving the robustness and accuracy of segmentation for organs of varying scale and shape.
Abstract: Multi-organ segmentation in medical imaging remains challenging due to large anatomical variability, complex inter-organ dependencies, and diverse organ scales and shapes. Conventional encoder-decoder architectures often struggle to capture both fine-grained local details and long-range context, which are crucial for accurate delineation, especially for small or deformable organs. To address these limitations, we propose MD-RWKV-UNet, a dynamic encoder network that enables scale-aware representation and spatially adaptive context modeling. At its core is the MD-RWKV block, a dual-path module that integrates deformable spatial shifts with the Receptance Weighted Key Value mechanism, allowing the receptive field to adapt dynamically to local structural cues. We further incorporate Selective Kernel Attention to enable adaptive selection of convolutional kernels with varying receptive fields, enhancing multi-scale interaction and improving robustness to organ size and shape variation. In parallel, a cross-stage dual-attention fusion strategy aggregates multi-level features across the encoder, preserving low-level structure while enhancing semantic consistency. Unlike methods that stack static convolutions or rely heavily on global attention, our approach provides a lightweight yet expressive solution for dynamic organ modeling. Experiments on Synapse and ACDC demonstrate state-of-the-art performance, particularly in boundary precision and small-organ segmentation.
[180] TrendGen: An Outfit Recommendation and Display System
Theodoros Koukopoulos,Dimos Klimenof,Ioannis Xarchakos
Main category: cs.CV
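TL;DR: This paper presents TrendGen, a fashion AI system that improves online shopping by generating trend-aligned outfit suggestions and high-quality lay-down views.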
Details
Motivation: Raw images suffer from inconsistent lighting, non-ideal garment angles, complex backgrounds, and occlusions. These must be addressed to build robust fashion AI systems for real-world use.
Method: TrendGen combines garment images with product attributes to generate cohesive outfit recommendations and uses generative AI to convert raw images into high-quality lay-down views.
Result: Evaluation on production data shows that TrendGen consistently produces high-quality outfits and lay-down images.
Conclusion: TrendGen marks a significant advance in practical AI applications for fashion retail.
Abstract: Recent advances in Computer Vision have significantly improved image understanding and generation, revolutionizing the fashion industry. However, challenges such as inconsistent lighting, non-ideal garment angles, complex backgrounds, and occlusions in raw images hinder their full potential. Overcoming these obstacles is crucial for developing robust fashion AI systems capable of real-world applications. In this paper, we introduce TrendGen, a Fashion AI system designed to enhance online shopping with intelligent outfit recommendations. Deployed on a major e-commerce platform, TrendGen leverages clothing images and product attributes to generate trend-aligned, cohesive outfit suggestions. Additionally, it employs Generative AI to transform raw images into high-quality lay-down views, offering a clear and structured presentation of garments. Our evaluation on production data demonstrates that TrendGen consistently produces high-quality outfits and lay-down images, marking a significant advancement in AI-driven solutions for fashion retail.
[181] TrackMAE: Video Representation Learning via Track Mask and Predict
Renaud Vandeghen,Fida Mohammad Thoker,Marc Van Droogenbroeck,Bernard Ghanem
Main category: cs.CV
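TL;DR: TrackMAE is a masked video modeling method that uses explicit motion trajectories as a reconstruction signal together with a motion-aware masking strategy, improving how learned video representations capture temporal dynamics and outperforming existing self-supervised methods on multiple downstream tasks.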
Details
Motivation: Existing masked video modeling (MVM) methods encode motion only implicitly, which makes them a poor fit for tasks that need fine-grained motion awareness.
Method: Uses an off-the-shelf point tracker to extract sparse motion trajectories, designs a motion-aware tube masking strategy, and performs reconstruction in both pixel and feature semantic spaces with motion trajectories as an additional supervision signal.
Result: Consistently outperforms state-of-the-art video self-supervised learning methods on six downstream datasets, learning more discriminative and generalizable video representations.
Conclusion: Using motion explicitly as a reconstruction target substantially improves self-supervised video pretraining, and TrackMAE offers a more effective paradigm for temporal modeling.
Abstract: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but it encodes motion information only implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines, learning more discriminative and generalizable representations. Code is available at https://github.com/rvandeghen/TrackMAE
[182] Human-Centric Perception for Child Sexual Abuse Imagery
Camila Laranjeira,João Macedo,Sandra Avila,Fabrício Benevenuto,Jefersson A. dos Santos
Main category: cs.CV
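TL;DR: This paper proposes an explainable, fine-grained human-centric perception approach to support Child Sexual Abuse Imagery (CSAI) classification. It builds the Body-Keypoint-Part Dataset (BKPD), covering people across age groups and levels of explicitness, and designs two methods for joint pose estimation and part detection, BKP-Association and YOLO-BKP, to obtain a structured per-individual representation.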
Details
Motivation: Existing automatic CSAI identification methods mostly rely on black-box, broad-concept models (such as pornography classification) and lack objective, explainable modeling of cues such as pose and attire, which makes them hard to use for the practical needs of law enforcement and NGOs.
Method: Builds BKPD, a dataset with hierarchical keypoint and body-part annotations; proposes BKP-Association (graph-matching-based keypoint-part association) and YOLO-BKP (a modified YOLO framework for end-to-end keypoint and part detection); and runs cross-domain evaluation and ablation studies on COCO and the newly built dataset.
Result: The methods reach SOTA-level results on COCO-Keypoints and COCO-HumanParts and show robustness to variation in age and explicitness on BKPD. Cross-domain experiments reveal a marked shift in pose and part distributions for CSAI-related data, underscoring the specificity of the domain.
Conclusion: This work is the first to systematically bring structured human perception (keypoints, parts, and hierarchical semantics) to CSAI analysis, laying a new paradigm and data foundation for explainable, auditable CSAI detection systems aimed at real-world use.
Abstract: Law enforcement agencies and non-governmental organizations handling reports of Child Sexual Abuse Imagery (CSAI) are overwhelmed by large volumes of data, requiring the aid of automation tools. However, defining sexual abuse in images of children is inherently challenging, encompassing sexually explicit activities and hints of sexuality conveyed by the individual's pose, or their attire. CSAI classification methods often rely on black-box approaches, targeting broad and abstract concepts such as pornography. Thus, our work is an in-depth exploration of tasks from the literature on Human-Centric Perception, across the domains of safe images, adult pornography, and CSAI, focusing on targets that enable more objective and explainable pipelines for CSAI classification in the future. We introduce the Body-Keypoint-Part Dataset (BKPD), gathering images of people from varying age groups and sexual explicitness to approximate the domain of CSAI, along with manually curated hierarchically structured labels for skeletal keypoints and bounding boxes for person and body parts, including head, chest, hip, and hands. We propose two methods, namely BKP-Association and YOLO-BKP, for simultaneous pose estimation and detection, with targets associated per individual for a comprehensive decomposed representation of each person. Our methods are benchmarked on COCO-Keypoints and COCO-HumanParts, as well as our human-centric dataset, achieving competitive results with models that jointly perform all tasks. Cross-domain ablation studies on BKPD and a case study on RCPD highlight the challenges posed by sexually explicit domains. Our study addresses previously unexplored targets in the CSAI domain, paving the way for novel research opportunities.
[183] Class-Distribution Guided Active Learning for 3D Occupancy Prediction in Autonomous Driving
Wonjune Kim,In-Jae Lee,Sihwan Hwang,Sanmin Kim,Dongsuk Kum
Main category: cs.CV
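TL;DR: This paper proposes a class-distribution guided active learning framework for 3D occupancy prediction that selects data to annotate using three criteria (inter-sample diversity, intra-set diversity, and frequency-weighted uncertainty), reaching near fully supervised performance (26.62 mIoU) with only 42.4% of the labeled data.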
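The frequency-weighted uncertainty criterion in this entry reweights voxel-level entropy by inverse per-sample class proportions. The Python sketch below shows one plausible reading of that idea, not the authors' implementation: the function names, the use of the argmax class to pick each voxel's weight, and the normalization by the weight sum are all assumptions made for illustration.

```python
import math


def voxel_entropy(probs):
    """Shannon entropy (in nats) of one voxel's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def frequency_weighted_uncertainty(voxel_probs, eps=1e-6):
    """Score one sample from its per-voxel class probabilities.

    Assumed form: each voxel's entropy is weighted by the inverse of the
    proportion of its predicted (argmax) class within this sample, so
    uncertain voxels of rare classes dominate the score.
    """
    num_classes = len(voxel_probs[0])
    preds = [max(range(num_classes), key=lambda c: p[c]) for p in voxel_probs]

    # Per-sample class proportions from the predicted labels.
    counts = [0] * num_classes
    for c in preds:
        counts[c] += 1
    proportions = [count / len(preds) for count in counts]

    weighted_sum = 0.0
    weight_total = 0.0
    for probs, c in zip(voxel_probs, preds):
        weight = 1.0 / (proportions[c] + eps)
        weighted_sum += weight * voxel_entropy(probs)
        weight_total += weight
    return weighted_sum / weight_total


# Two toy samples with identical mean entropy: in the first, the uncertain
# voxel belongs to the rare class; in the second, to the dominant class.
rare_uncertain = [[0.98, 0.02], [0.98, 0.02], [0.98, 0.02], [0.40, 0.60]]
common_uncertain = [[0.98, 0.02], [0.98, 0.02], [0.60, 0.40], [0.02, 0.98]]
print(frequency_weighted_uncertainty(rare_uncertain))
print(frequency_weighted_uncertainty(common_uncertain))
```

Under this weighting the first toy sample scores higher than the second even though their unweighted mean entropies match, which is the behavior the criterion is meant to produce.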
Details
Motivation: 3D occupancy prediction suffers from severe class imbalance (critical objects such as pedestrians and traffic cones occupy very few voxels), and voxel-level annotation is expensive. Spending large annotation effort on dominant background classes is inefficient.
Method: Proposes a class-distribution guided active learning framework that combines three complementary criteria: 1) inter-sample diversity (select samples whose predicted class distribution differs most from the labeled set); 2) intra-set diversity (avoid redundant picks within one acquisition round); 3) frequency-weighted uncertainty (weight voxel entropy by inverse class frequency to emphasize rare-class uncertainty). A geographically disjoint train/validation split is used to keep the evaluation valid.
Result: On Occ3D-nuScenes, reaches 26.62 mIoU with only 42.4% of the labeled data, comparable to full supervision and better than other active learning baselines. Cross-architecture validation on SemanticKITTI is consistently effective.
Conclusion: Class-distribution guided active learning mitigates both the high annotation cost and the class imbalance of 3D occupancy prediction, substantially improving annotation efficiency while preserving performance and generalizing across datasets and model architectures.
Abstract: 3D occupancy prediction provides dense spatial understanding critical for safe autonomous driving. However, this task suffers from a severe class imbalance due to its volumetric representation, where safety-critical objects (bicycles, traffic cones, pedestrians) occupy minimal voxels compared to dominant backgrounds. Additionally, voxel-level annotation is costly, yet dedicating effort to dominant classes is inefficient. To address these challenges, we propose a class-distribution guided active learning framework for selecting training samples to annotate in autonomous driving datasets. Our approach combines three complementary criteria to select the training samples. Inter-sample diversity prioritizes samples whose predicted class distributions differ from those of the labeled set, intra-set diversity prevents redundant sampling within each acquisition cycle, and frequency-weighted uncertainty emphasizes rare classes by reweighting voxel-level entropy with inverse per-sample class proportions. We ensure evaluation validity by using a geographically disjoint train/validation split of Occ3D-nuScenes, which reduces train-validation overlap and mitigates potential map memorization. With only 42.4% labeled data, our framework reaches 26.62 mIoU, comparable to full supervision and outperforming active learning baselines at the same budget. We further validate generality on SemanticKITTI using a different architecture, demonstrating consistent effectiveness across datasets.
[184] Complet4R: Geometric Complete 4D Reconstruction
Weibang Wang,Kenan Li,Zhuoguang Chen,Yijun Yuan,Hang Zhao
Main category: cs.CV
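TL;DR: Complet4R is an end-to-end framework for geometrically complete 4D reconstruction that uses a decoder-only Transformer to model temporal context globally from video sequences, reconstructing complete and consistent 3D geometry for every frame, including occluded regions.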
Details
Motivation: Existing methods rely on pairwise reconstruction or local motion estimation and struggle to achieve both temporal consistency and geometric completeness, especially in occluded regions.
Method: Proposes the Complet4R framework, which formalizes 4D reconstruction as a unified task of reconstruction and completion. A decoder-only Transformer processes the video sequence directly, accumulates context globally, and produces complete geometry for each frame, including occluded regions that are visible in other frames.
Result: Reaches SOTA performance on a newly proposed benchmark for geometrically complete 4D reconstruction and on the 3D point tracking task.
Conclusion: Through global temporal modeling, Complet4R achieves high-fidelity, geometrically complete, and temporally consistent 4D reconstruction of dynamic scenes, offering a new paradigm and open-source code for follow-up research.
Abstract: We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to process all context globally and directly from sequential video input, reconstructing a complete geometry for every single timestamp, including occluded regions visible in other frames. Our method achieves state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D Point Tracking task. Code will be released to support future research.
[185] Dual-Path Learning based on Frequency Structural Decoupling and Regional-Aware Fusion for Low-Light Image Super-Resolution
Ji-Xuan He,Jia-Cheng Zhao,Feng-Qi Cui,Jinyang Huang,Yang Liu,Sirui Zhao,Meng Li,Zhi Liu
Main category: cs.CV
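TL;DR: This paper proposes Decoupling then Perceive (DTP), a frequency-aware framework for low-light image super-resolution (LLISR) that decouples luminance and texture components and models them separately, substantially improving reconstruction quality.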
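To make the idea of splitting a signal into a low-frequency (luminance-like) part and a high-frequency (texture-like) part concrete, here is a minimal Python sketch on a 1D signal. It is not the FSD mechanism from this entry, which is described as adaptive; a fixed moving-average low-pass filter is swapped in purely for illustration, and the names `box_blur` and `decouple` are hypothetical.

```python
def box_blur(signal, radius):
    """Low-pass a 1D signal with a moving average (window clipped at the edges)."""
    n = len(signal)
    blurred = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        blurred.append(sum(window) / len(window))
    return blurred


def decouple(signal, radius=2):
    """Split a signal into a smooth low-frequency part and a residual
    high-frequency part such that low + high reproduces the input."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# A dim, slowly varying row of pixel intensities with a sharp edge in the middle.
row = [0.10, 0.11, 0.12, 0.12, 0.40, 0.41, 0.42, 0.42]
low, high = decouple(row, radius=1)
print(low)
print(high)
```

Because the high-frequency part is defined as the residual, the two branches can be processed separately and summed back without losing information, which is the property a dual-path design relies on.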
Details
Motivation: Existing LLISR methods mostly handle low-light enhancement and super-resolution serially, which tends to cause artifact amplification, texture suppression, and structural degradation.
Method: Proposes a Frequency-aware Structural Decoupling (FSD) mechanism to separate luminance (low frequency) from texture (high frequency), a Semantics-specific Dual-path Representation (SDR) learning strategy to enhance each separately, and a Cross-frequency Semantic Recomposition (CSR) module to ensure structural consistency and perceptual alignment.
Result: Outperforms the SOTA on mainstream LLISR benchmarks: PSNR improves by 1.6%, SSIM improves by 9.6%, and LPIPS drops by 48%.
Conclusion: Through explicit frequency decoupling and coordinated reconstruction, DTP substantially improves the perceptual quality and fidelity of low-light super-resolution while preserving structural integrity.
Abstract: Low-light image super-resolution (LLISR) is essential for restoring fine visual details and perceptual quality under insufficient illumination conditions with ubiquitous low-resolution devices. Although pioneering methods achieve high performance on single tasks, they solve both tasks in a serial manner, which inevitably leads to artifact amplification, texture suppression, and structural degradation. To address this, we propose Decoupling then Perceive (DTP), a novel frequency-aware framework that explicitly separates luminance and texture into semantically independent components, enabling specialized modeling and coherent reconstruction. Specifically, to adaptively separate the input into low-frequency luminance and high-frequency texture subspaces, we propose a Frequency-aware Structural Decoupling (FSD) mechanism, which lays a solid foundation for targeted representation learning and reconstruction. Based on the decoupled representation, a Semantics-specific Dual-path Representation (SDR) learning strategy that performs targeted enhancement and reconstruction for each frequency component is further designed, facilitating robust luminance adjustment and fine-grained texture recovery. To promote structural consistency and perceptual alignment in the reconstructed output, building upon this dual-path modeling, we further introduce a Cross-frequency Semantic Recomposition (CSR) module that selectively integrates the decoupled representations. Extensive experiments on the most widely used LLISR benchmarks demonstrate the superiority of our DTP framework, which improves PSNR by 1.6% and SSIM by 9.6% and reduces LPIPS by 48% compared to the leading state-of-the-art (SOTA) algorithm. Code is released at https://github.com/JXVision/DTP.
[186] Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models
Mehedi Hasan Tusar,Fateme Fayyazbakhsh,Igor Melnychuk,Ming C. Leu
Main category: cs.CV
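TL;DR: This paper proposes a YOLOv11-based dual-task deep learning model that simultaneously performs wound boundary segmentation (WBS) and classification (WC) of five clinical wound types (burn, pressure injury, diabetic foot ulcer, vascular ulcer, and surgical wound), achieving high F1-scores on an augmented, balanced dataset while supporting lightweight deployment.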
Details
Motivation: Existing AI models usually support only a single task (segmentation or classification) or cover a limited set of wound types, which limits their clinical applicability. A robust, practical model that handles multiple wound types at once is needed.
Method: Builds a dual-task model (WBS+WC) on YOLOv11; creates a wound-type balanced dataset of 2,963 annotated images; applies data augmentation with rotation, flipping, and brightness/saturation/exposure changes; and evaluates with stratified five-fold cross-validation.
Result: YOLOv11x reaches F1-scores of 0.9341 on WBS and 0.8736 on WC. Data augmentation significantly improves burn detection. The lightweight YOLOv11n keeps comparable accuracy at lower computational cost. The model is robust to complex backgrounds and intra-class variability.
Conclusion: The YOLOv11 architecture can effectively support multi-task, multi-class wound analysis and has potential for real-time deployment in clinical and remote care settings.
Abstract: Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. However, most existing AI models are limited, focusing on a narrow set of wound types or performing only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a deep learning model based on YOLOv11 that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). A wound-type balanced dataset of 2,963 annotated images was created to train the models for both tasks, with stratified five-fold cross-validation ensuring robust and unbiased evaluation. The models trained on the original non-augmented dataset achieved consistent performance across folds, though BI detection accuracy was relatively lower. Therefore, the dataset was augmented using rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more generalized and invariant features. This augmentation significantly improved model performance, particularly in detecting visually subtle BI cases. Among tested variants, YOLOv11x achieved the highest performance with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n provided comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model's robustness against complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in both clinical and remote care settings.
[187] Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models
Kaishen Wang,Heng Huang
Main category: cs.CV
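TL;DR: This paper proposes the RICE attack paradigm, showing that bidirectional interaction between understanding and generation in Unified Multimodal Models (UMMs) amplifies safety risks and that cross-functionality reciprocity is itself a structural source of vulnerability.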
Details
Motivation: Existing safety research mostly analyzes multimodal understanding and image generation in isolation, while the safety implications of their tight coupling in UMMs remain unclear.
Method: Proposes RICE (Reciprocal Interaction-based Cross-functionality Exploitation) and systematically evaluates two attack pathways, Generation-to-Understanding (G-U) and Understanding-to-Generation (U-G), verifying that unsafe intermediate signals propagate and amplify across modalities.
Result: Experiments show that RICE achieves high Attack Success Rates (ASR) in both directions, revealing inherent and previously overlooked safety weaknesses of UMMs.
Conclusion: Cross-functionality reciprocity between understanding and generation in UMMs is itself a structural safety vulnerability, so bidirectional safety constraints need to be considered at the architecture design stage.
Abstract: Recent advances in Large Language Models (LLMs) and Text-to-Image (T2I) models have led to the emergence of Unified Multimodal Models (UMMs), where multimodal understanding and image generation are tightly integrated within a shared architecture. Prior studies suggest that such reciprocity enhances cross-functionality performance through shared representations and joint optimization. However, the safety implications of this tight coupling remain largely unexplored, as existing safety research predominantly analyzes understanding and generation functionalities in isolation. In this work, we investigate whether cross-functionality reciprocity itself constitutes a structural source of vulnerability in UMMs. We propose RICE: Reciprocal Interaction-based Cross-functionality Exploitation, a novel attack paradigm that explicitly exploits bidirectional interactions between understanding and generation. Using this framework, we systematically evaluate Generation-to-Understanding (G-U) and Understanding-to-Generation (U-G) attack pathways, demonstrating that unsafe intermediate signals can propagate across modalities and amplify safety risks. Extensive experiments show high Attack Success Rates (ASR) in both directions, revealing previously overlooked safety weaknesses inherent to UMMs.
[188] EVA: Bridging Performance and Human Alignment in Hard-Attention Vision Models for Image Classification
Pengcheng Pan,Yonekura Shogo,Kuniyoshi Yasuo
Main category: cs.CV
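TL;DR: This paper proposes EVA, a neuroscience-inspired hard-attention testbed that uses a small number of sequential glimpses and a minimal fovea-periphery representation to make the trade-off between classification performance and human scanpath similarity explicit, without gaze supervision. EVA substantially improves scanpath alignment while keeping high classification accuracy on CIFAR-10 and ImageNet-100, and generalizes to COCO-Search18 natural scenes without using gaze annotations.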
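Scanpath alignment in this entry is reported with metrics such as DTW (dynamic time warping). As a reference point, the Python sketch below implements textbook DTW between two fixation sequences with Euclidean point cost. The exact DTW variant, coordinate normalization, and any path constraints used in the paper are not stated in this entry, so treat this as a generic illustration with a hypothetical function name.

```python
import math


def dtw_distance(path_a, path_b):
    """Dynamic time warping distance between two sequences of (x, y) fixations.

    cost[i][j] holds the minimal accumulated Euclidean cost of aligning the
    first i points of path_a with the first j points of path_b.
    """
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # path_a advances, path_b stays
                cost[i][j - 1],      # path_b advances, path_a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A model scanpath and a human scanpath over the same image (toy coordinates).
model_path = [(0.0, 0.0), (4.0, 3.0), (8.0, 3.0)]
human_path = [(0.0, 0.0), (0.0, 0.0), (4.0, 3.0), (8.0, 4.0)]
print(dtw_distance(model_path, human_path))
```

A lower value means the two scanpaths visit similar locations in a similar order; DTW tolerates differences in fixation count, which is why a repeated fixation in one path adds no cost on its own.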
Details
Motivation: Optimizing vision models purely for classification accuracy harms the similarity of their scanpaths to human ones and their interpretability, so the trade-off between performance and human alignment needs to be modeled explicitly.
Method: EVA uses a CNN-based feature extractor with variance control and adaptive gating, samples sequential glimpses under a minimal fovea-periphery representation, and is trained with only the standard classification loss, without gaze supervision.
Result: On CIFAR-10, improves scanpath alignment metrics such as DTW and NSS without losing accuracy. Ablations show that the CNN raises accuracy but weakens human-likeness, whereas variance control and gating restore alignment. EVA also generalizes well to ImageNet-100 and to COCO-Search18 without gaze supervision.
Conclusion: EVA provides a principled framework for trustworthy, human-interpretable active vision, enabling controllable joint optimization of classification performance and alignment with human scanning behavior.
Abstract: Optimizing vision models purely for classification accuracy can impose an alignment tax, degrading human-like scanpaths and limiting interpretability. We introduce EVA, a neuroscience-inspired hard-attention mechanistic testbed that makes the performance-human-likeness trade-off explicit and adjustable. EVA samples a small number of sequential glimpses using a minimal fovea-periphery representation with a CNN-based feature extractor and integrates variance control and adaptive gating to stabilize and regulate attention dynamics. EVA is trained with the standard classification objective without gaze supervision. On CIFAR-10 with dense human gaze annotations, EVA improves scanpath alignment under established metrics such as DTW and NSS while maintaining competitive accuracy. Ablations show that CNN-based feature extraction drives accuracy but suppresses human-likeness, whereas variance control and gating restore human-aligned trajectories with minimal performance loss. We further validate EVA's scalability on ImageNet-100 and evaluate scanpath alignment on COCO-Search18 without COCO-Search18 gaze supervision or finetuning, where EVA yields human-like scanpaths on natural scenes without additional training. Overall, EVA provides a principled framework for trustworthy, human-interpretable active vision.
[189] TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR
Ted Lentsch,Santiago Montiel-Marín,Holger Caesar,Dariu M. Gavrila
Main category: cs.CV
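TL;DR: This paper proposes TerraSeg, the first self-supervised, domain-agnostic LiDAR ground segmentation model, which uses the large-scale unified dataset OmniLiDAR and the annotation-free PseudoLabeler module to reach SOTA performance on multiple benchmarks.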
Details
Motivation: Existing LiDAR ground segmentation methods depend on handcrafted designs or costly per-point manual labels, which limits generalization and scalability.
Method: Proposes the TerraSeg model and the PseudoLabeler self-supervised label generation module, trained on OmniLiDAR, a large-scale unified dataset covering 15 sensor models and about 22 million scans.
Result: Achieves SOTA performance on nuScenes, SemanticKITTI, and Waymo Perception without manual labels, and supports real-time inference.
Conclusion: TerraSeg shows that combining self-supervision with large-scale, diverse data can substantially improve the generalization and practicality of LiDAR ground segmentation.
Abstract: LiDAR perception is fundamental to robotics, enabling machines to understand their environment in 3D. A crucial task for LiDAR-based scene understanding and navigation is ground segmentation. However, existing methods are either handcrafted for specific sensor configurations or rely on costly per-point manual labels, severely limiting their generalization and scalability. To overcome this, we introduce TerraSeg, the first self-supervised, domain-agnostic model for LiDAR ground segmentation. We train TerraSeg on OmniLiDAR, a unified large-scale dataset that aggregates and standardizes data from 12 major public benchmarks. Spanning almost 22 million raw scans across 15 distinct sensor models, OmniLiDAR provides unprecedented diversity for learning a highly generalizable ground model. To supervise training without human annotations, we propose PseudoLabeler, a novel module that generates high-quality ground and non-ground labels through self-supervised per-scan runtime optimization. Extensive evaluations demonstrate that, despite using no manual labels, TerraSeg achieves state-of-the-art results on nuScenes, SemanticKITTI, and Waymo Perception while delivering real-time performance. Our code and model weights are publicly available.
[190] Inference-Time Structural Reasoning for Compositional Vision-Language Understanding
Amartya Bhattacharya
Main category: cs.CV
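TL;DR: This paper proposes a unified evaluation and augmentation framework that uses scene-graph augmentation to improve the compositional reasoning of vision-language models (VLMs) on Winoground, with a particularly large gain for Qwen3-VL-8B-Thinking.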
Details
Motivation: Existing vision-language models are good at image-text retrieval but perform poorly on compositional reasoning tasks that require understanding the relational structure between words (such as Winoground), so structured semantic priors are needed.
Method: Builds a dependency-parsing-based TextSceneGraphParser to extract subject-relation-object triples and designs a Graph Asymmetry Scorer (based on optimal bipartite matching) to quantify and inject relational structure priors. Benchmarks CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking on Winoground, supplemented by caption ablation experiments and a multi-turn SG filtering strategy.
Result: Qwen3-VL-8B-Thinking scores 62.75 in the plain setting and rises to 66.0 after multi-turn SG filtering, surpassing the prior open-source SOTA. Scene-graph augmentation helps strong models and gives negligible or negative gains for weak ones.
Conclusion: Injecting structured relational priors at inference time can substantially improve VLM compositional reasoning, but the effect depends heavily on the model's baseline capability, which highlights the importance of designing capability and augmentation together.
Abstract: Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and augmentation framework benchmarking four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph-augmented regimes. We introduce a dependency-based TextSceneGraphParser (spaCy) extracting subject-relation-object triples, and a Graph Asymmetry Scorer using optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject-object masking and swapping) reveal that Qwen3-VL-8B-Thinking achieves a group score of 62.75, far above all encoder-based models, while a proposed multi-turn SG filtering strategy further lifts it to 66.0, surpassing prior open-source state-of-the-art. We analyze the capability-augmentation tradeoff and find that SG augmentation benefits already capable models while providing negligible or negative gains for weaker baselines. Code: https://github.com/amartyacodes/Inference-Time-Structural-Reasoning-for-Compositional-Vision-Language-Understanding
[191] HMPDM: A Diffusion Model for Driving Video Prediction with Historical Motion Priors
Ke Li,Tianjia Yang,Kaidi Liang,Xianbiao Hu,Ruwen Qin
Main category: cs.CV
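TL;DR: This paper proposes a historical motion priors-informed diffusion model (HMPDM) for video prediction in autonomous driving. By introducing a temporal-aware latent conditioning module, a motion-aware pyramid encoder, and a self-conditioning strategy, it significantly improves temporal consistency and visual quality on Cityscapes and KITTI.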
Details
Motivation: Existing video prediction models are constrained by multi-stage training pipelines and struggle to model the complex motion patterns of real driving scenes, which degrades temporal consistency and visual quality.
Method: Proposes HMPDM with three core designs: (i) a Temporal-aware Latent Conditioning (TaLC) module that implicitly injects historical motion information; (ii) a Motion-aware Pyramid Encoder (MaPE) for multi-scale motion representation; (iii) a Self-Conditioning (SC) strategy that stabilizes iterative denoising.
Result: Clearly outperforms existing methods on Cityscapes and KITTI, with a 28.2% improvement in FVD on Cityscapes under the same monocular RGB input setting.
Conclusion: HMPDM effectively uses historical motion priors to strengthen motion understanding and temporal coherence, offering an efficient, high-quality approach to driving video prediction.
Abstract: Video prediction is a useful function for autonomous driving, enabling intelligent vehicles to reliably anticipate how driving scenes will evolve and thereby supporting reasoning and safer planning. However, existing models are constrained by multi-stage training pipelines and remain insufficient in modeling the diverse motion patterns in real driving scenes, leading to degraded temporal consistency and visual quality. To address these challenges, this paper introduces the historical motion priors-informed diffusion model (HMPDM), a video prediction model that leverages historical motion priors to enhance motion understanding and temporal coherence. The proposed deep learning system introduces three key designs: (i) a Temporal-aware Latent Conditioning (TaLC) module for implicit historical motion injection; (ii) a Motion-aware Pyramid Encoder (MaPE) for multi-scale motion representation; (iii) a Self-Conditioning (SC) strategy for stable iterative denoising. Extensive experiments on the Cityscapes and KITTI benchmarks demonstrate that HMPDM efficiently outperforms state-of-the-art video prediction methods, achieving a 28.2% improvement in FVD on Cityscapes under the same monocular RGB input configuration. The implementation code is publicly available at https://github.com/KELISBU/HMPDM.
[192] Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models
Yuhang Han,Yuyang Wu,Zhengbo Jiao,Yiyu Wang,Xuyang Liu,Shaobo Wang,Hanlin Xu,Xuming Hu,Linfeng Zhang
Main category: cs.CV
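TL;DR: This paper proposes KAWHI, which explicitly incorporates structured visual information to address the visual representation bottleneck of LVLMs in reinforcement learning and improve multimodal reasoning performance.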
Details
Motivation: Existing RLVR methods for LVLMs are limited by a visual representation bottleneck: they lack explicit modeling and effective use of visual information, so visual representations are not tightly coupled with the reinforcement learning optimization process.
Method: Proposes KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-in reward reweighting mechanism that adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation so that spatial visual evidence aligns with semantically decisive reasoning steps.
Result: Extensive experiments on multiple reasoning benchmarks validate KAWHI as a general-purpose enhancement module that consistently improves various uniform reward optimization methods (such as GRPO and GSPO).
Conclusion: KAWHI brings structured visual information explicitly into the reinforcement learning optimization of LVLMs, breaking the visual representation bottleneck and substantially improving multimodal reasoning.
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models in abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck. Existing approaches generally lack explicit modeling and effective utilization of visual information, preventing visual representations from being tightly coupled with the reinforcement learning optimization process and thereby limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward policy optimization methods (e.g., GRPO and GSPO). The method adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps. Extensive empirical evaluations on diverse reasoning benchmarks substantiate KAWHI as a general-purpose enhancement module, consistently improving the performance of various uniform reward optimization methods. Project page: KAWHI (https://kawhiiiileo.github.io/KAWHI_PAGE/)
[193] Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
Nazia Tasnim,Shrimai Prabhumoye,Bryan A. Plummer
Main category: cs.CV
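TL;DR: This paper proposes CRISP, which factorizes pretrained weights into shared basis matrices and mixing projections to support both model compression (MC) and parameter-efficient fine-tuning (PEFT) in one framework, outperforming existing methods in dual-task and single-task settings.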
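The general idea of composing each layer's weight from a small set of basis matrices shared across layers, plus a few per-layer mixing coefficients, can be sketched in a few lines of Python. This is a simplified linear-combination illustration under assumed shapes, not CRISP's actual factorization (the coefficient gating and interpolation named in the method are not modeled), and `compose_weight` and `parameter_counts` are hypothetical helpers.

```python
def compose_weight(basis, mixer):
    """Build one layer's weight as a linear combination of shared basis matrices.

    basis: list of K matrices (each rows x cols), shared by all layers.
    mixer: list of K coefficients specific to this layer.
    """
    rows, cols = len(basis[0]), len(basis[0][0])
    weight = [[0.0] * cols for _ in range(rows)]
    for coeff, matrix in zip(mixer, basis):
        for r in range(rows):
            for c in range(cols):
                weight[r][c] += coeff * matrix[r][c]
    return weight


def parameter_counts(num_layers, rows, cols, num_basis):
    """Compare dense storage with shared-basis storage (basis + per-layer mixers)."""
    dense = num_layers * rows * cols
    shared = num_basis * rows * cols + num_layers * num_basis
    return dense, shared


shared_basis = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
layer_mixer = [2.0, 3.0]  # the only layer-specific (tunable) parameters here
print(compose_weight(shared_basis, layer_mixer))
print(parameter_counts(num_layers=12, rows=64, cols=64, num_basis=4))
```

In this toy setup, shrinking the number of basis matrices reduces total storage (the compression side), while adapting to a new task only needs to touch the small per-layer mixers (the fine-tuning side).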
Details
Motivation: Existing Parameter Recombination (PR) methods mostly target a single task (PEFT only or MC only) and cannot meet the dual need for model compression and fast adaptation in resource-constrained settings. PEFT modules in particular still carry many parameters, which hinders deployment on edge devices.
Method: CRISP factorizes pretrained weights into basis matrices shared across layers and lightweight mixing projections (mixer weights). Adjusting the size of the basis matrices achieves model compression, while the very small mixer (fewer than 200 parameters) supports efficient fine-tuning.
Result: CRISP outperforms prior dual-task-capable methods by 4-5% on dual-task (PEFT+MC) settings, exceeds the SOTA on pure PEFT by 1.5%, and improves PEFT+MC combinations by 1%.
Conclusion: CRISP is a general, flexible parameter recombination framework that integrates multiple PR tasks seamlessly and strikes a better balance between accuracy and parameter efficiency, making it suitable for resource-constrained settings such as edge devices.
Abstract: Parameter Recombination (PR) methods aim to efficiently compose the weights of a neural network for applications like Parameter-Efficient FineTuning (PEFT) and Model Compression (MC), among others. Most methods typically focus on one application of PR, which can make composing them challenging. For example, when deploying a large model you may wish to compress the model and also quickly adapt to new settings. However, PEFT methods can often still contain millions of parameters. This may be small compared to the original model size, but can be problematic in resource-constrained deployments like edge devices, where they take a larger portion of the compressed model's parameters. To address this, we present Coefficient-gated weight Recombination by Interpolated Shared basis Projections (CRISP), a general approach that seamlessly integrates multiple PR tasks within the same framework. CRISP accomplishes this by factorizing pretrained weights into basis matrices and their component mixing projections. Sharing basis matrices across layers and adjusting their size enables us to perform MC, whereas the mixer weights' small size (fewer than 200 in some experiments) enables CRISP to support PEFT. Experiments show CRISP outperforms methods from prior work capable of dual-task applications by 4-5% while also outperforming the state-of-the-art in PEFT by 1.5% and PEFT+MC combinations by 1%. Our code is available on the repository: https://github.com/appledora/CRISP-CVPR26.
[194] Mind the Shape Gap: A Benchmark and Baseline for Deformation-Aware 6D Pose Estimation of Agricultural Produce
Nikolas Chatzis,Angeliki Tsinouka,Katerina Papadimitriou,Niki Efthymiou,Marios Glytsos,George Retsinas,Paris Oikonomou,Gerasimos Potamianos,Petros Maragos,Panagiotis Paraskevas Filntisis
Main category: cs.CV
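TL;DR: This paper proposes the PEAR benchmark and the SEED method to address 6D pose estimation for agricultural harvesting under produce deformation and shape variability. PEAR provides joint 6D pose and per-instance 3D deformation ground truth for 8 produce categories. SEED is a unified RGB-only framework that jointly predicts pose and explicit lattice deformation, is trained on synthetic data, and outperforms MegaPose.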
Details
Motivation: Instance-level methods cannot obtain an exact 3D model for every piece of produce, and category-level methods depend on a fixed template whose accuracy drops sharply under the geometric deviations of real produce. Both lack robustness to deformation.
Method: Builds the PEAR benchmark (8 produce categories, with joint 6D pose and per-instance 3D deformation ground truth acquired with a robot for high accuracy) and proposes the SEED framework, which jointly predicts 6D pose and explicit lattice deformation from a single RGB image and is trained entirely on synthetic data with generative texture augmentation at the UV level.
Result: On PEAR, SOTA methods degrade by up to 6x. Under identical RGB-only conditions, SEED outperforms MegaPose on 6 of the 8 categories.
Conclusion: Explicit shape modeling (such as lattice deformation) is an important step toward reliable pose estimation for agricultural robots, and PEAR is the first benchmark in this area to support deformation-aware evaluation.
Abstract: Accurate 6D pose estimation for robotic harvesting is fundamentally hindered by the biological deformability and high intra-class shape variability of agricultural produce. Instance-level methods fail in this setting, as obtaining exact 3D models for every unique piece of produce is practically infeasible, while category-level approaches that rely on a fixed template suffer significant accuracy degradation when the prior deviates from the true instance geometry. To address this lack of robustness to deformation, we introduce PEAR (Pose and dEformation of Agricultural pRoduce), the first benchmark providing joint 6D pose and per-instance 3D deformation ground truth across 8 produce categories, acquired via a robotic manipulator for high annotation accuracy. Using PEAR, we show that state-of-the-art methods suffer up to 6x performance degradation when faced with the inherent geometric deviations of real-world produce. Motivated by this finding, we propose SEED (Simultaneous Estimation of posE and Deformation), a unified RGB-only framework that jointly predicts 6D pose and explicit lattice deformations from a single image across multiple produce categories. Trained entirely on synthetic data with generative texture augmentation applied at the UV level, SEED outperforms MegaPose on 6 out of 8 categories under identical RGB-only conditions, demonstrating that explicit shape modeling is a critical step toward reliable pose estimation in agricultural robotics.
[195] SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
Jiang Zhang,Shijie Zhou,Bangya Liu,Achuta Kadambi,Zhiwen Fan
Main category: cs.CV
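TL;DR: This paper proposes SpatialStack, a framework that fuses vision, geometry, and language representations in sync across multiple levels, substantially improving the 3D spatial reasoning of large vision-language models (VLMs). VLM-SpatialStack, built on this framework, reaches SOTA performance on multiple 3D spatial reasoning benchmarks.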
Details
Motivation: Existing large vision-language models (VLMs) cannot reliably perform 3D spatial reasoning because they fail to capture fine-grained 3D geometry and spatial relationships. Existing methods that add multi-view geometry transformers fuse only deep-layer features and ignore hierarchical signals, creating a bottleneck for spatial understanding.
Method: Proposes SpatialStack, a general hierarchical fusion framework that stacks multi-level geometric features layer by layer and aligns them in sync with the language backbone, progressively aligning vision, geometry, and language representations across the model hierarchy and moving beyond the conventional late-stage vision-geometry fusion paradigm.
Result: The resulting VLM-SpatialStack reaches SOTA performance on multiple 3D spatial reasoning benchmarks (such as ScanRefer and REVERIE). Ablations show that the multi-level fusion strategy consistently improves 3D understanding and generalizes well across spatial reasoning tasks.
Conclusion: SpatialStack is an effective and extensible paradigm for vision-language-geometry fusion, providing a solid foundation for multimodal spatial understanding in next-generation embodied and physical AI systems.
Abstract: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
[196] Evaluating Large and Lightweight Vision Models for Irregular Component Segmentation in E-Waste Disassembly
Xinyao Zhang,Chang Liu,Xiao Liang,Minghui Zheng,Sara Behdad
Main category: cs.CV
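TL;DR: This study compares SAM2 and YOLOv8 on precise segmentation of laptop components in electronic waste (e-waste) and finds that the lightweight YOLOv8 clearly outperforms SAM2 in accuracy and boundary precision. It also contributes a new dataset of 1,456 annotated images and a benchmarking framework to support robotic disassembly and circular manufacturing systems.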
Details
Motivation: In e-waste recycling, precise segmentation of irregular, densely arranged components is essential for robotic disassembly and material recovery. How well existing large models suit industrial settings remains unclear and calls for a targeted evaluation. Method: The study compares the transformer-based SAM2 with the lightweight YOLOv8 on a newly collected dataset of RGB images of laptop components (1,456 images covering logic boards, heat sinks, fans, and other parts under varied illumination and orientation), using data augmentation such as random rotation, flipping, and cropping to improve robustness. Result: YOLOv8 reaches mAP50 = 98.8% and mAP50-95 = 85% with strong boundary precision; SAM2 reaches only mAP50 = 8.4% and produces overlapping masks and inconsistent contours. Conclusion: Large pre-trained models such as SAM2 need task-specific optimization before they are practical for industrial use; lightweight, efficient models such as YOLOv8 have the advantage on specific industrial vision tasks; the dataset and benchmark provide a foundation for robotic e-waste disassembly. Abstract: Precise segmentation of irregular and densely arranged components is essential for robotic disassembly and material recovery in electronic waste (e-waste) recycling. This study evaluates the impact of model architecture and scale on segmentation performance by comparing SAM2, a transformer-based vision model, with the lightweight YOLOv8 network. Both models were trained and tested on a newly collected dataset of 1,456 annotated RGB images of laptop components including logic boards, heat sinks, and fans, captured under varying illumination and orientation conditions. Data augmentation techniques, such as random rotation, flipping, and cropping, were applied to improve model robustness. YOLOv8 achieved higher segmentation accuracy (mAP50 = 98.8%, mAP50-95 = 85%) and stronger boundary precision than SAM2 (mAP50 = 8.4%). SAM2 demonstrated flexibility in representing diverse object structures but often produced overlapping masks and inconsistent contours. These findings show that large pre-trained models require task-specific optimization for industrial applications. The resulting dataset and benchmarking framework provide a foundation for developing scalable vision algorithms for robotic e-waste disassembly and circular manufacturing systems.
[197] LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model
Quankai Gao,Jiawei Yang,Qiangeng Xu,Le Chen,Yue Wang
Main category: cs.CV
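TL;DR: LOME is an egocentric world model that generates realistic human-object interaction videos conditioned on an input image, a text prompt, and per-frame human actions (body poses and hand gestures), addressing the poor generalization and limited scalability of traditional physics-based animation.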
Details
Motivation: Traditional physics-based animation generalizes poorly when modeling human-object manipulation and is hard to adapt to diverse object morphologies or to scale to real-world environments. Method: LOME finetunes a pretrained video generative model on diverse egocentric human-object interaction videos and jointly estimates spatial human actions and environment context during training to provide strong, precise action guidance. Result: LOME shows high action-following accuracy, strong generalization to unseen scenarios, and realistic physical consequences of hand-object interactions (for example, pouring liquid); experiments show it significantly outperforms existing image-based and video-based action-conditioned methods and I/T2V models in temporal consistency and motion control. Conclusion: LOME opens a path to photorealistic AR/VR experiences and scalable robot training without relying on explicit 3D/4D modeling or being limited to simulated environments. Abstract: Learning human-object manipulation presents significant challenges due to the fine-grained and contact-rich nature of the motions involved. Traditional physics-based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real-world environments. To address these limitations, we introduce LOME, an egocentric world model that can generate realistic human-object interactions as videos conditioned on an input image, a text prompt, and per-frame human actions, including both body poses and hand gestures. LOME injects strong and precise action guidance into object manipulation by jointly estimating spatial human actions and the environment contexts during training. After finetuning a pretrained video generative model on videos of diverse egocentric human-object interactions, LOME demonstrates not only high action-following accuracy and strong generalization to unseen scenarios, but also realistic physical consequences of hand-object interactions, e.g., liquid flowing from a bottle into a mug after executing a "pouring" action. Extensive experiments demonstrate that our video-based framework significantly outperforms state-of-the-art image-based and video-based action-conditioned methods and Image/Text-to-Video (I/T2V) generative models in terms of both temporal consistency and motion control. LOME paves the way for photorealistic AR/VR experiences and scalable robotic training, without being limited to simulated environments or relying on explicit 3D/4D modeling.
[198] From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
Ranran Huang,Weixun Luo,Ye Mao,Krystian Mikolajczyk
Main category: cs.CV
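TL;DR: NAS3R is a self-supervised feed-forward framework that needs no ground-truth annotations or pretrained priors. It jointly learns explicit 3D geometry and camera parameters, reconstructing under 2D photometric supervision with self-predicted camera parameters, and achieves stable, scalable, geometry-aware 3D reconstruction from uncalibrated, unposed images.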
Details
Motivation: Existing self-supervised 3D reconstruction methods depend on pretrained models or ground-truth annotations and struggle to balance geometric accuracy with training stability. Method: NAS3R jointly optimizes 3D Gaussian reconstruction and camera parameter prediction within a shared transformer backbone regulated by masked attention, adopts a depth-based Gaussian formulation for better-conditioned optimization, and trains in a self-supervised way using only the photometric consistency of multi-view 2D images. Result: NAS3R outperforms other self-supervised methods on multiple benchmarks, shows stronger geometry awareness and generalization, is compatible with supervised architectures, and can incorporate priors when available. Conclusion: NAS3R offers a scalable, geometry-aware, fully self-supervised paradigm for 3D reconstruction from unconstrained data. Abstract: In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.
[199] Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
Zhongying Deng,Cheng Tang,Ziyan Huang,Jiashi Lin,Ying Chen,Junzhi Ning,Chenglong Ma,Jiyao Liu,Wei Li,Yinghao Zhu,Shujian Gao,Yanyan Huang,Sibo Ju,Yanzhou Su,Pengcheng Chen,Wenhao Tang,Tianbin Li,Haoyu Wang,Yuanfeng Ji,Hui Sun,Shaobo Min,Liang Peng,Feilong Tang,Haochen Xue,Rulin Zhou,Chaoyang Zhang,Wenjie Li,Shaohao Rui,Weijie Ma,Xingyue Zhao,Yibin Wang,Kun Yuan,Zhaohui Lu,Shujun Wang,Jinjie Wei,Lihao Liu,Dingkang Yang,Lin Wang,Yulong Li,Haolin Yang,Yiqing Shen,Lequan Yu,Xiaowei Hu,Yun Gu,Yicheng Wu,Benyou Wang,Minghui Zhang,Angelica I. Aviles-Rivero,Qi Gao,Hongming Shan,Xiaoyu Ren,Fang Yan,Hongyu Zhou,Haodong Duan,Maosong Cao,Shanshan Wang,Bin Fu,Xiaomeng Li,Zhi Hou,Chunfeng Song,Lei Bai,Yuan Cheng,Yuandong Pu,Xiang Li,Wenhai Wang,Hao Chen,Jiaxin Zhuang,Songyang Zhang,Huiguang He,Mengzhang Li,Bohan Zhuang,Zhian Bai,Rongshan Yu,Liansheng Wang,Yukun Zhou,Xiaosong Wang,Xin Guo,Guanbin Li,Xiangru Lin,Dakai Jin,Mianxin Liu,Wenlong Zhang,Qi Qin,Conghui He,Yuqiang Li,Ye Luo,Nanqing Dong,Jie Xu,Wenqi Shao,Bo Zhang,Qiujuan Yan,Yihao Liu,Jun Ma,Zhi Lu,Yuewen Cao,Zongwei Zhou,Jianming Liang,Shixiang Tang,Qi Duan,Dongzhan Zhou,Chen Jiang,Yuyin Zhou,Yanwu Xu,Jiancheng Yang,Shaoting Zhang,Xiaohong Liu,Siqi Luo,Yi Xin,Chaoyu Liu,Haochen Wen,Xin Chen,Alejandro Lozano,Min Woo Sun,Yuhui Zhang,Yue Yao,Xiaoxiao Sun,Serena Yeung-Levy,Xia Li,Jing Ke,Chunhui Zhang,Zongyuan Ge,Ming Hu,Jin Ye,Zhifeng Li,Yirong Chen,Yu Qiao,Junjun He
Main category: cs.CV
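TL;DR: This paper presents the largest systematic survey to date of open-access medical imaging datasets, covering more than 1,000 of them. It shows that the landscape is fragmented (small in scale, narrow in task scope, unevenly distributed), proposes a metadata-driven fusion paradigm (MDFP) for integrating datasets, and releases an interactive discovery portal and a unified structured table as data resources and methodology for building medical foundation models.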
Details
Motivation: Medical imaging lacks large-scale, unified, high-quality datasets, mainly because curation depends on clinical expertise and is bound by ethical and privacy constraints, and this hinders the development of medical foundation models. Method: The authors survey 1,000+ open-access medical imaging datasets and catalog them systematically by modality, task, anatomy, annotation, limitations, and potential for integration; propose a metadata-driven fusion paradigm (MDFP) that automatically integrates scattered datasets sharing a modality or task; and build an interactive discovery portal and a unified structured table. Result: The work delivers the largest survey of medical imaging datasets to date, identifies data fragmentation as the core problem, enables automated dataset integration based on MDFP, and provides an online portal plus a unified structured table covering the key characteristics and reference links of all surveyed datasets. Conclusion: Beyond characterizing the current medical imaging data ecosystem, the work offers a practical path to scaling data through MDFP and its supporting tools, which should speed up data discovery, make dataset creation more principled, and support more capable, general medical foundation models. Abstract: Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the growth of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
[200] Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning
Feiding,Yongkang Zhang,Yuhao Liao,Zijian Zeng,Chunzheng Zhu,Yaozong Zheng,Yafei Liu,Yeling Peng,Youwei Wang,Sibo Wang,Huiming Yang,Linglin Liao,Shunzhi Yang
Main category: cs.CV
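TL;DR: This paper proposes Differential Feedback, which builds token/step-level supervision masks by automatically repairing erroneous reasoning trajectories, enabling low-cost, process-level vision-language alignment and improving multimodal reasoning performance.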
Details
Motivation: Existing GRPO-based training of vision-language models relies only on terminal rewards, which yields sparse credit assignment in multi-step reasoning, weakens the link between visual evidence and intermediate steps, and causes unstable optimization and visual hallucinations. Method: Differential Feedback automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that need correction; it requires no large-scale step-by-step human annotation and integrates into GRPO-like frameworks. Result: On multimodal reasoning benchmarks including MMMStar and MathVista, the method improves results by 3% on average under matched compute budgets. Conclusion: Differential Feedback is an efficient, low-cost way to align the vision-reasoning process accurately. Abstract: Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the linkage between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token/step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotations, our method enables process-level visual alignment and can be seamlessly integrated into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost solution for accurate vision-reasoning process alignment.
[201] Estimating the Impact of COVID-19 on Travel Demand in Houston Area Using Deep Learning and Satellite Imagery
Alekhya Pachika,Lu Gao,Lingguang Song,Pan Lu,Xingju Wang
Main category: cs.CV
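TL;DR: This paper counts vehicles in high-resolution satellite imagery (GSD 15–30 cm) with a Faster R-CNN model built in Detectron2 to analyze how travel demand in the Houston metropolitan area changed before and during COVID-19. Vehicle counts at the selected locations fell by 30% on average in 2020, supporting satellite remote sensing combined with computer vision as a way to assess travel demand and economic activity.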
Details
Motivation: Traditional traffic data collection is limited in coverage and timeliness. High-resolution satellite imagery and modern computer vision make wide-area, contact-free, dynamic monitoring of infrastructure usage and travel demand possible, and alternative data sources are especially needed during sudden public events such as COVID-19. Method: The authors obtain high-resolution satellite imagery of the Houston area from Google Earth Engine, implement a Faster R-CNN object detector in the Detectron2 framework to detect and count vehicles at five types of location (university, shopping mall, community plaza, restaurant, supermarket), and compare vehicle counts for 2019 and 2020. Result: Vehicle counts at the monitored locations fell by about 30% on average in 2020 relative to 2019; the model extracts signals that reflect travel demand and regional economic activity, supporting the feasibility of combining satellite imagery with deep learning for large-scale traffic monitoring. Conclusion: High-resolution satellite imagery combined with advanced computer vision can serve as a reliable, scalable tool for assessing travel demand and economic activity and can give transportation agencies timely, objective decision support, particularly where traditional sensors are hard to deploy or during emergency response. Abstract: Considering recent advances in remote sensing satellite systems and computer vision algorithms, many satellite sensing platforms and sensors have been used to monitor the condition and usage of transportation infrastructure systems. The level of detail that can be detected increases significantly as the ground sample distance (GSD) becomes finer; for high-resolution satellite images it is around 15 cm - 30 cm. In this study, we analyzed data acquired from high-resolution satellite imagery to provide insights, predictive signals, and trends for travel demand estimation. More specifically, we estimate the impact of COVID-19 in the metropolitan area of Houston using satellite imagery from Google Earth Engine datasets. We developed a car-counting model through Detectron2 and Faster R-CNN to monitor the presence of cars within different locations (i.e., university, shopping mall, community plaza, restaurant, supermarket) before and during COVID-19. The results show that the number of cars detected at these selected locations fell by 30% on average in 2020 compared with the previous year, 2019. The results also show that satellite imagery provides rich information for travel demand and economic activity estimation. Together with advanced computer vision and deep learning algorithms, it can generate reliable and accurate information for transportation agency decision makers.
[202] Fully Spiking Neural Networks with Target Awareness for Energy-Efficient UAV Tracking
Pengzhi Zhong,Jiwei Mo,Dan Zeng,Feixiang He,Shuiwang Li
Main category: cs.CV
TL;DR: STATrack is the first fully spiking neural network framework for UAV visual tracking using only RGB inputs, improving target feature retention via mutual information maximization and achieving competitive performance with low energy consumption.
Details
Motivation: Existing SNN-based trackers rely on expensive event cameras, limiting their deployment on UAVs; there is a need for an efficient SNN tracker that works with standard RGB inputs. Method: Proposes STATrack, a fully spiking neural network framework for UAV visual tracking using RGB inputs only; introduces adaptive mutual information maximization between templates and features to mitigate background interference. Result: STATrack achieves competitive tracking performance on four UAV tracking benchmarks while maintaining low energy consumption. Conclusion: STATrack demonstrates that fully spiking neural networks can effectively perform UAV visual tracking with RGB inputs, offering an energy-efficient alternative without requiring event cameras. Abstract: Spiking Neural Networks (SNNs), characterized by their event-driven computation and low power consumption, have shown great potential for energy-efficient visual tracking on unmanned aerial vehicles (UAVs). However, existing efficient SNN-based trackers heavily rely on costly event cameras, limiting their deployment on UAVs. To address this limitation, we propose STATrack, an efficient fully spiking neural network framework for UAV visual tracking using RGB inputs only. To the best of our knowledge, this work is the first to investigate spiking neural networks for UAV visual tracking tasks. To mitigate the weakening of target features by background tokens, we propose adaptively maximizing the mutual information between templates and features. Extensive experiments on four widely used UAV tracking benchmarks demonstrate that STATrack achieves competitive tracking performance while maintaining low energy consumption.
[203] Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
Xuanpu Zhao,Zhentao Tan,Dianmo Sheng,Tianxiang Chen,Yao Liu,Yue Wu,Tao Gong,Qi Chu,Nenghai Yu
Main category: cs.CV
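TL;DR: This paper proposes a two-stage reinforcement learning framework that requires no trajectory supervision and uses an "Information Gap" mechanism and a grounding loss to improve how multimodal large language models attend to and reason over details in cropped image regions.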
Details
Motivation: Existing multimodal large language models that use cropping tools rely too heavily on the global image input and neglect the fine-grained details inside cropped regions, which limits their perception and reasoning in complex visual scenes. Method: A two-stage reinforcement learning framework. The first stage introduces an "Information Gap" mechanism that adjusts the granularity of the global image so the model learns to focus on the information gain from cropped regions; the second stage adds a grounding loss based on a small number of bounding box annotations to improve cropping precision. The framework requires no trajectory supervision. Result: The method reaches state-of-the-art performance on high-resolution visual question-answering benchmarks and significantly increases the model's attention to details in cropped regions. Conclusion: The method offers a more efficient way for multimodal large language models to perceive and reason about fine-grained visual information, making them more useful in complex visual scenes. Abstract: To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously use an image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation. We demonstrate the model's strong reliance on global input and its weak dependence on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.
[204] Streamlined Open-Vocabulary Human-Object Interaction Detection
Chang Sun,Dongliang Liao,Changxing Ding
Main category: cs.CV
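TL;DR: This paper proposes SL-HOI, a framework that performs open-vocabulary HOI detection using only a frozen DINOv3 model. It combines DINOv3's backbone (for precise localization) with its text-aligned vision head (for open-vocabulary classification), feeds interaction queries together with backbone image tokens into the vision head to bridge the representation gap, and achieves state-of-the-art results on SWiG-HOI and HICO-DET.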
Details
Motivation: Existing open-vocabulary HOI detection methods rely on a conventional HOI detector working with a vision-language model (VLM), but the two have very different representations, which makes feature fusion difficult. Method: SL-HOI is built purely on DINOv3. All DINOv3 parameters are frozen; the backbone handles fine-grained localization and the text-aligned vision head handles open-vocabulary classification. Interaction queries and backbone image tokens are fed into the vision head together to ease cross-attention alignment, and only a small number of learnable parameters are added. Result: SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks. Conclusion: A frozen DINOv3 with lightweight adaptation solves open-vocabulary HOI detection efficiently and effectively, supporting the value of the streamlined architecture and its representation-alignment strategy. Abstract: Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at https://github.com/MPI-Lab/SL-HOI.
[205] Transferring Physical Priors into Remote Sensing Segmentation via Large Language Models
Yuxi Lu,Kunqi Li,Zhidong Li,Xiaohan Su,Biao Wu,Chenya Huang,Bin Liang
Main category: cs.CV
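TL;DR: This paper proposes PriorSeg, which builds a Physical-Centric Knowledge Graph (PCKG) and the Phy-Sky-SA dataset to fuse multi-source remote sensing physical priors, improving semantic segmentation accuracy and physical plausibility without retraining the foundation model.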
Details
Motivation: Existing remote sensing semantic segmentation methods depend on spatially aligned multi-source data and need costly retraining when a new sensor is introduced; foundation models can exploit physical variables through pre-training but remain limited by alignment requirements and generalization. Method: The authors build a Physical-Centric Knowledge Graph (PCKG) that extracts physical priors from 1,763 vocabulary terms, use it to construct a heterogeneous, spatially aligned dataset, Phy-Sky-SA, and design PriorSeg, a physics-aware residual refinement model trained with a joint visual-physical strategy and a novel physics-consistency loss. Result: In heterogeneous remote sensing settings, PriorSeg improves segmentation accuracy and physical plausibility without retraining the foundation models; ablations confirm the contribution of the PCKG, Phy-Sky-SA, and the physics-consistency loss. Conclusion: Embedding domain physical priors into the segmentation pipeline through a knowledge graph and a consistency loss is an efficient, scalable, retraining-free paradigm for remote sensing semantic segmentation. Abstract: Semantic segmentation of remote sensing imagery is fundamental to Earth observation. Achieving accurate results requires integrating not only optical images but also physical variables such as the Digital Elevation Model (DEM), Synthetic Aperture Radar (SAR) and Normalized Difference Vegetation Index (NDVI). Recent foundation models (FMs) leverage pre-training to exploit these variables but still depend on spatially aligned data and costly retraining when involving new sensors. To overcome these limitations, we introduce a novel paradigm for integrating domain-specific physical priors into segmentation models. We first construct a Physical-Centric Knowledge Graph (PCKG) by prompting large language models to extract physical priors from 1,763 vocabularies, and use it to build a heterogeneous, spatially aligned dataset, Phy-Sky-SA. Building on this foundation, we develop PriorSeg, a physics-aware residual refinement model trained with a joint visual-physical strategy that incorporates a novel physics-consistency loss. Experiments on heterogeneous settings demonstrate that PriorSeg improves segmentation accuracy and physical plausibility without retraining the FMs. Ablation studies verify the effectiveness of the Phy-Sky-SA dataset, the PCKG, and the physics-consistency loss.
[206] Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Haifeng Huang,Yilun Chen,Zehan Wang,Jiangmiao Pang,Zhou Zhao
Main category: cs.CV
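TL;DR: Chat-Scene++ is a multimodal large language model framework that represents a 3D scene as a sequence of objects with rich contextual semantics, improving fine-grained object grounding and spatial reasoning. It achieves state-of-the-art performance on multiple 3D vision-language benchmarks and can be applied to real-world scenes using only 2D inputs.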
Details
Motivation: Existing 3D multimodal large models perform poorly at fine-grained object grounding and contextual reasoning, so they struggle to understand and interact with complex 3D environments. Method: Chat-Scene++ structures a 3D scene as a sequence of objects paired with identifier tokens, extracts context-rich object features with large-scale pre-trained 3D scene-level and 2D image-level encoders, and adds grounded chain-of-thought (G-CoT) reasoning that distinguishes objects at both the category and spatial levels during multi-step inference. Result: Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D) without task-specific heads or fine-tuning, and it works on real-world scenes using only 2D inputs. Conclusion: Through object-centric, context-aware sequence modeling and G-CoT reasoning, Chat-Scene++ improves 3D scene understanding, object grounding, and spatial reasoning while remaining practical. Abstract: Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.
[207] Understanding Semantic Perturbations on In-Processing Generative Image Watermarks
Anirudh Nakra,Min Wu
Main category: cs.CV
TL;DR: 本文提出了一种多阶段框架,用于系统性地压力测试生成式水印在语义漂移下的鲁棒性,发现现有水印方法在语义编辑下(如目标替换、重生成)检测率急剧下降,暴露了当前评估体系的重大缺陷。
Details
Motivation: 现有生成式水印虽对传统后处理(如滤波、几何变换)鲁棒,但对保持视觉质量却改变高层语义内容的操作(如语义编辑)的鲁棒性缺乏研究和理解。 Method: 提出一个基于现成模型(目标检测、掩码生成、语义引导修复/重生成)的多阶段框架,实现可控、语义显著但感知失真小的图像编辑,以压力测试水印在语义漂移下的可检测性。 Result: 实验表明,水印鲁棒性高度依赖语义纠缠程度:许多在常规扰动下表现良好的水印,在语义编辑下检测率骤降至接近零,而图像视觉质量仍保持较高水平。 Conclusion: 当前水印评估存在关键盲区;未来水印设计与基准测试必须显式纳入对语义操纵的鲁棒性考量。 Abstract: The widespread deployment of high-fidelity generative models has intensified the need for reliable mechanisms for provenance and content authentication. In-processing watermarking, embedding a signature into the generative model's synthesis procedure, has been advocated as a solution and is often reported to be robust to standard post-processing (such as geometric transforms and filtering). Yet robustness to semantic manipulations that alter high-level scene content while maintaining reasonable visual quality is not well studied or understood. We introduce a simple, multi-stage framework for systematically stress-testing in-processing generative watermarks under semantic drift. The framework utilizes off-the-shelf models for object detection, mask generation, and semantically guided inpainting or regeneration to produce controlled, meaning-altering edits with minimal perceptual degradation. Based on extensive experiments on representative schemes, we find that robustness varies significantly with the degree of semantic entanglement: methods by which watermarks remain detectable under a broad suite of conventional perturbations can fail under semantic edits, with watermark detectability in many cases dropping to near zero while image quality remains high. Overall, our results reveal a critical gap in current watermarking evaluations and suggest that watermark designs and benchmarking must explicitly account for robustness against semantic manipulation.[208] SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
Jiahao Niu,Rongjia Zheng,Wenju Xu,WeiShi Zheng,Qing Zhang
Main category: cs.CV
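TL;DR: SGS-Intrinsic is an inverse rendering framework for indoor scenes captured from sparse views. It builds a geometry-consistent Gaussian semantic field and combines a hybrid illumination model with a material prior to achieve high-quality geometry reconstruction and material-illumination disentanglement; an illumination-invariant material constraint and a deshadowing model make material recovery more robust.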
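Details
Motivation: Existing methods based on 3D Gaussian Splatting (3DGS) mostly target object-centric reconstruction and fail under sparse views, so they cannot deliver both good geometry reconstruction and accurate material-illumination disentanglement. Method: SGS-Intrinsic constructs a dense, geometry-consistent Gaussian semantic field guided by semantic and geometric priors, disentangles material and illumination with a hybrid illumination model and a material prior, and adds an illumination-invariant material constraint together with a deshadowing model to reduce the effect of cast shadows. Result: On multiple benchmark datasets the method outperforms existing 3DGS-based inverse rendering methods, with consistent gains in both reconstruction fidelity and inverse rendering quality. Conclusion: SGS-Intrinsic offers a reliable, robust, high-quality solution for sparse-view indoor inverse rendering and moves 3DGS closer to practical use in complex real scenes. Abstract: We present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse view settings, our method achieves high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material-illumination disentanglement by combining a hybrid illumination model and a material prior to effectively capture illumination-material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce an illumination-invariant material constraint together with a deshadowing model. Extensive experiments on benchmark datasets show that our method consistently improves both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches. Our code is available at https://github.com/GrumpySloths/SGS_Intrinsic.github.io.
[209] SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision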
Motivation: 现有基于3D高斯泼溅(3DGS)的方法多针对物体中心重建,在稀疏视角下失效,难以兼顾几何重建质量与材质-光照解耦精度。 Method: 构建由语义与几何先验引导的稠密且几何一致的高斯语义场;采用混合光照模型与材质先验进行材质-光照解耦;引入光照不变材质约束与去阴影模型缓解投影阴影影响。 Result: 在多个基准数据集上显著优于现有3DGS-based逆渲染方法,在重建保真度和逆渲染质量两方面均有持续提升。 Conclusion: SGS-Intrinsic为稀疏视角下的室内逆渲染提供了可靠、鲁棒且高质量的解决方案,推动了3DGS在复杂真实场景中的实用化进展。 Abstract: We present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse view settings, our method allows to achieve high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material-illumination disentanglement by combining a hybrid illumination model and material prior to effectively capture illumination-material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce illumination-invariant material constraint together with a deshadowing model. Extensive experiments on benchmark datasets show that our method consistently improves both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches. Our code is available at https://github.com/GrumpySloths/SGS_Intrinsic.github.io.[209] SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision
Shuai Xiang,Wei Guo,James Burridge,Shouyang Liu,Hao Lu,Tokihiro Fukatsu
Main category: cs.CV
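TL;DR: SPROUT is a vision foundation model for agriculture. It is pre-trained without supervision on 2.6 million field images using a pixel-space diffusion transformer, narrows the domain gap in agricultural vision tasks, and outperforms existing models on downstream tasks.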
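Details
Motivation: Existing vision foundation models show a large domain gap in agricultural applications and are not adapted to crop diversity, growth stages, or complex field environments. Method: SPROUT uses a VAE-free pixel-space diffusion transformer and is pre-trained without supervision on open-field, multi-crop data through diffusion denoising, with multiple downstream tasks in mind. Result: SPROUT consistently outperforms existing web-pretrained and agriculture-specific foundation models on a range of downstream agricultural vision tasks at substantially lower pre-training cost. Conclusion: SPROUT bridges the domain gap between general vision foundation models and real agricultural scenes and provides an efficient, scalable representation learning paradigm for agricultural AI. Abstract: Vision Foundation Models (VFM) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce SPROUT (Scalable Plant Representation model via Open-field Unsupervised Training), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free Pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising, enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
[210] TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets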
Motivation: 现有视觉基础模型在农业应用中存在显著领域差距,缺乏针对农作物多样性、生长阶段和复杂田间环境的适配能力。 Method: 提出SPROUT模型,采用VAE-free的像素空间扩散Transformer架构,基于扩散去噪机制进行多作物、多任务的开放田间无监督预训练。 Result: 在多种下游农业视觉任务上持续优于现有网络预训练及农业专用基础模型,且预训练成本大幅降低。 Conclusion: SPROUT有效弥合了通用视觉基础模型与农业实际场景之间的领域鸿沟,为农业AI提供了高效、可扩展的表征学习范式。 Abstract: Vision Foundation Models (VFM) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce $SPROUT$ ($S$calable $P$lant $R$epresentation model via $O$pen-field $U$nsupervised $T$raining), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free Pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising and enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.[210] TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets
Zhixuan Liu,Peter Schaldenbrand,Yijun Li,Long Mai,Aniruddha Mahapatra,Cusuh Ham,Jean Oh,Jui-Hsien Wang
Main category: cs.CV
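TL;DR: TokenDial is a framework for continuous, slider-style attribute control that does not fine-tune the backbone. It applies additive offsets of adjustable magnitude in the intermediate spatiotemporal visual token space of a video generation model, giving fine, stable control over attributes such as appearance and motion intensity.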
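Details
Motivation: Existing text-to-video models generate high-quality videos as a whole but cannot continuously and controllably adjust a specific attribute (such as effect intensity or motion magnitude) while preserving identity, background, and temporal coherence. Method: Building on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form semantic control directions, TokenDial learns attribute-specific token offsets from pretrained understanding signals (semantic direction matching for appearance, motion-magnitude scaling for motion) without retraining the backbone. Result: TokenDial is validated across diverse attributes and prompts; quantitative evaluation and human studies both show better controllability and edit quality than state-of-the-art baselines. Conclusion: TokenDial gives pretrained text-to-video models efficient, fine-grained continuous attribute control without retraining the backbone, preserving semantic fidelity and temporal consistency. Abstract: We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drifting identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
[211] OmniColor: A Unified Framework for Multi-modal Lineart Colorization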
Motivation: 现有文本到视频生成模型虽能生成高质量整体视频,但难以连续、可控地调节特定属性(如特效强度或运动幅度),同时保持身份、背景和时序一致性。 Method: 基于中间时空视觉patch-token空间中加性偏移构成语义控制方向的观察,利用预训练理解信号(语义方向匹配用于外观,运动幅度缩放用于运动)学习属性特异性token偏移,不重训主干模型。 Result: 在多种属性和提示下验证了TokenDial的有效性,定量评估与人工研究均表明其可控性与编辑质量优于当前最优基线。 Conclusion: TokenDial为预训练文本到视频模型提供了高效、免训练、细粒度的连续属性控制新范式,兼顾语义保真与动态一致性。 Abstract: We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drifting identity, background, or temporal coherence. TokenDial is built on the observation: additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.[211] OmniColor: A Unified Framework for Multi-modal Lineart Colorization
Xulu Zhang,Haoqian Du,Xiaoyong Wei,Qing Li
Main category: cs.CV
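TL;DR: OmniColor is a unified lineart colorization framework that supports multiple kinds of user constraints (spatially aligned and semantic-reference). It uses dual-path encoding, vision-language-model encoding, and adaptive gating to deliver precise, controllable, and stable colorization.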
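Details
Motivation: Existing lineart colorization methods struggle to be both precise and flexible under diverse user constraints. Method: OmniColor (1) divides control signals into spatially aligned and semantic-reference types; (2) handles the former with dual-path encoding and a Dense Feature Alignment loss; (3) handles the latter with VLM-only encoding and Temporal Redundancy Elimination; and (4) adds an Adaptive Spatial-Semantic Gating module to reconcile multi-modal inputs. Result: OmniColor outperforms existing methods in controllability, visual quality, and temporal stability. Conclusion: OmniColor provides a robust, practical, unified multi-modal solution for professional lineart colorization. Abstract: Lineart colorization is a critical stage in professional content creation, yet achieving precise and flexible results under diverse user constraints remains a significant challenge. To address this, we propose OmniColor, a unified framework for multi-modal lineart colorization that supports arbitrary combinations of control signals. Specifically, we systematically categorize guidance signals into two types: spatially-aligned conditions and semantic-reference conditions. For spatially-aligned inputs, we employ a dual-path encoding strategy paired with a Dense Feature Alignment loss to ensure rigorous boundary preservation and precise color restoration. For semantic-reference inputs, we utilize a VLM-only encoding scheme integrated with a Temporal Redundancy Elimination mechanism to filter repetitive information and enhance inference efficiency. To resolve potential input conflicts, we introduce an Adaptive Spatial-Semantic Gating module that dynamically balances multi-modal constraints. Experimental results demonstrate that OmniColor achieves superior controllability, visual quality, and temporal stability, providing a robust and practical solution for lineart colorization. The source code and dataset will be released at https://github.com/zhangxulu1996/OmniColor.
[212] Demo-Pose: Depth-Monocular Modality Fusion For Object Pose Estimation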
Motivation: 现有线稿上色方法难以在多样化用户约束下兼顾精确性与灵活性。 Method: 提出OmniColor框架:1)将控制信号分为空间对齐型与语义参考型;2)对前者采用双路径编码+密集特征对齐损失;3)对后者采用纯VLM编码+时间冗余消除;4)引入自适应空间-语义门控模块协调多模态输入。 Result: 在可控性、视觉质量与时间稳定性方面均优于现有方法。 Conclusion: OmniColor为专业线稿上色提供了鲁棒、实用的多模态统一解决方案。 Abstract: Lineart colorization is a critical stage in professional content creation, yet achieving precise and flexible results under diverse user constraints remains a significant challenge. To address this, we propose OmniColor, a unified framework for multi-modal lineart colorization that supports arbitrary combinations of control signals. Specifically, we systematically categorize guidance signals into two types: spatially-aligned conditions and semantic-reference conditions. For spatially-aligned inputs, we employ a dual-path encoding strategy paired with a Dense Feature Alignment loss to ensure rigorous boundary preservation and precise color restoration. For semantic-reference inputs, we utilize a VLM-only encoding scheme integrated with a Temporal Redundancy Elimination mechanism to filter repetitive information and enhance inference efficiency. To resolve potential input conflicts, we introduce an Adaptive Spatial-Semantic Gating module that dynamically balances multi-modal constraints. Experimental results demonstrate that OmniColor achieves superior controllability, visual quality, and temporal stability, providing a robust and practical solution for lineart colorization. The source code and dataset will be open at https://github.com/zhangxulu1996/OmniColor.[212] Demo-Pose: Depth-Monocular Modality Fusion For Object Pose Estimation
Rachit Agarwal,Abhishek Joshi,Sathish Chalasani,Woo Jin Kim
Main category: cs.CV
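TL;DR: This paper proposes DeMo-Pose, which fuses monocular semantic features with depth-based graph convolutional representations through a new multimodal fusion strategy and adds a Mesh-Point Loss (MPL) that incurs no inference overhead. It performs real-time, category-level 9-DoF (6D pose + 3D size) pose estimation from RGB-D input without CAD models and clearly outperforms prior methods on the REAL275 benchmark.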
Details
Motivation: Existing methods have two problems: depth-only methods ignore RGB semantic cues, and most RGB-D fusion methods align the modalities poorly and so fail to combine RGB semantics with 3D geometric representations. In addition, training losses that are both geometry-aware and free at inference time are lacking.
Method: DeMo-Pose is a hybrid architecture that (1) uses a new multimodal fusion strategy to align and fuse monocular RGB semantic features with depth-driven graph convolutional geometric representations, and (2) introduces the Mesh-Point Loss (MPL), which exploits mesh structure during training to strengthen geometric reasoning without adding inference cost.
Result: On REAL275, DeMo-Pose improves over the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy, runs in real time, and outperforms state-of-the-art methods across object categories.
Conclusion: Effective depth-RGB fusion combined with geometry-aware learning is key to robust, efficient category-level 3D pose estimation, and DeMo-Pose demonstrates that this approach is both effective and practical.
Abstract: Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multimodal fusion strategy. To further improve geometric reasoning, we introduce a novel Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and significantly improves over state-of-the-art methods across object categories, outperforming the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy on the REAL275 benchmark. The results highlight the effectiveness of depth-RGB fusion and geometry-aware learning, enabling robust category-level 3D pose estimation for real-world applications.
[213] LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Meituan LongCat Team,Bin Xiao,Chao Wang,Chengjiang Li,Chi Zhang,Chong Peng,Hang Yu,Hao Yang,Haonan Yan,Haoze Sun,Haozhe Zhao,Hong Liu,Hui Su,Jiaqi Zhang,Jiawei Wang,Jing Li,Kefeng Zhang,Manyuan Zhang,Minhao Jing,Peng Pei,Quan Chen,Taofeng Xue,Tongxin Pan,Xiaotong Li,Xiaoyang Li,Xiaoyu Zhao,Xing Hu,Xinyang Lin,Xunliang Cai,Yan Bai,Yan Feng,Yanjie Li,Yao Qiu,Yerui Sun,Yifan Lu,Ying Luo,Yipeng Mei,Yitian Chen,Yuchen Xie,Yufang Liu,Yufei Chen,Yulei Qian,Yuqi Peng,Zhihang Yu,Zhixiong Han,Changran Wang,Chen Chen,Dian Zheng,Fengjiao Chen,Ge Yang,Haowei Guo,Haozhe Wang,Hongyu Li,Huicheng Jiang,Jiale Hong,Jialv Zou,Jiamu Li,Jianping Lin,Jiaxing Liu,Jie Yang,Jing Jin,Jun Kuang,Juncheng She,Kunming Luo,Kuofeng Gao,Lin Qiu,Linsen Guo,Mianqiu Huang,Qi Li,Qian Wang,Rumei Li,Siyu Ren,Wei Wang,Wenlong He,Xi Chen,Xiao Liu,Xiaoyu Li,Xu Huang,Xuanyu Zhu,Xuezhi Cao,Yaoming Zhu,Yifei Cao,Yimeng Jia,Yizhen Jiang,Yufei Gao,Zeyang Hu,Zhenlong Yuan,Zijian Zhang,Ziwen Wang
Main category: cs.CV
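TL;DR: This paper proposes the DiNA framework and the LongCat-Next model, which natively model text, vision, and audio autoregressively in one unified discrete space, moving beyond the conventional language-centric paradigm and improving multimodal understanding and generation.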
Details
Motivation: Existing multimodal systems are language-centric and treat non-linguistic modalities as external attachments, which fragments the architecture and weakens integration; a unified, native multimodal modeling paradigm is needed.
Method: The authors propose the Discrete Native Autoregressive (DiNA) framework, whose core is the Discrete Native Any-resolution Visual Transformer (dNaViT), which discretizes visual signals at arbitrary resolutions. On top of it they build LongCat-Next, which jointly processes text, images, and audio under a single autoregressive objective.
Result: LongCat-Next performs strongly on a range of multimodal benchmarks, overcomes the performance ceiling of discrete vision modeling on understanding tasks, and reconciles understanding with generation.
Conclusion: DiNA and LongCat-Next offer a new paradigm for native multimodal modeling; the model and its tokenizers are open-sourced to support community development.
Abstract: The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community.
GitHub: https://github.com/meituan-longcat/LongCat-Next
[214] MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
Jongmin Lee,Seungyeop Kang,Sungjoo Yoo
Main category: cs.CV
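TL;DR: This paper proposes MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible target images, improving the density and accuracy of 3D reconstruction.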
Details
Motivation: Most existing matchers work pairwise, so matches chained across views are fragmented and geometrically inconsistent, which falls short of what 3D vision tasks such as SfM require.
Method: The model combines a multi-view encoder that uses pairwise matching results as a geometric prior with a multi-view matching refiner that refines correspondences through pixel-wise attention. A post-processing strategy converts the multi-view-consistent correspondences into high-quality tracks for SfM.
Result: On several challenging benchmarks, MV-RoMa produces more reliable correspondences than existing sparse and dense matchers and yields denser, more accurate 3D reconstructions.
Conclusion: By matching jointly across views and modeling geometric consistency, MV-RoMa markedly improves cross-view correspondence quality and SfM reconstruction.
Abstract: Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture that avoids the high computational cost of full cross-attention for multi-view feature interaction: (i) a multi-view encoder that leverages pair-wise matching results as a geometric prior, and (ii) a multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods. Project page: https://icetea-cv.github.io/mv-roma/.
[215] Annotation-Free Detection of Drivable Areas and Curbs Leveraging LiDAR Point Cloud Maps
Fulong Ma,Daojie Peng,Jun Ma
Main category: cs.CV
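TL;DR: This paper proposes a map-based automatic data labeler (MADL) that combines LiDAR mapping/localization with curb detection to generate training labels for drivable areas and curbs automatically. It avoids the occlusion and distant point-cloud sparsity of single-frame data, adds a data review agent to improve quality, and on several datasets performs competitively with manual labeling while outperforming existing self-supervised methods.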
Details
Motivation: Deep neural networks perform well at drivable-area and curb detection, but they depend heavily on large manually labeled datasets, which are costly, time-consuming, and expert-dependent, limiting real-world use.
Method: The authors propose a map-based automatic data labeler (MADL) that fuses LiDAR mapping/localization with curb detection to generate labels for both tasks automatically, and they design a data review agent that filters out low-quality samples.
Result: On the KITTI, KITTI-CARLA, and 3D-Curb datasets, data generated by MADL is more robust and accurate than that of traditional and state-of-the-art self-supervised methods and compares well with manual labeling.
Conclusion: MADL reduces the dependence on manual annotation and improves the accuracy and scale of automatic labeling, offering a low-cost way to generate high-quality training data for autonomous-driving perception models.
Abstract: Drivable areas and curbs are critical traffic elements for autonomous driving, forming essential components of the vehicle visual perception system and ensuring driving safety. Deep neural networks (DNNs) have significantly improved perception performance for drivable area and curb detection, but most DNN-based methods rely on large manually labeled datasets, which are costly, time-consuming, and expert-dependent, limiting their real-world application. Thus, we developed an automated training data generation module. Our previous work generated training labels using single-frame LiDAR and RGB data, suffering from occlusion and distant point cloud sparsity. In this paper, we propose a novel map-based automatic data labeler (MADL) module, combining LiDAR mapping/localization with curb detection to automatically generate training data for both tasks. MADL avoids occlusion and point cloud sparsity issues via LiDAR mapping, creating accurate large-scale datasets for DNN training. In addition, we construct a data review agent to filter the data generated by the MADL module, eliminating low-quality samples. Experiments on the KITTI, KITTI-CARLA and 3D-Curb datasets show that MADL achieves impressive performance compared to manual labeling, and outperforms traditional and state-of-the-art self-supervised methods in robustness and accuracy.
[216] PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal
Dinh-Khoi Vo,Van-Loc Nguyen,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
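TL;DR: This paper proposes PANDORA, a zero-shot object removal framework that needs no fine-tuning, prompt engineering, or inference-time optimization. Using Pixel-wise Attention Dissolution and Localized Attentional Disentanglement Guidance on a pre-trained text-to-image diffusion model, it removes multiple objects with high fidelity and semantic consistency.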
Details
Motivation: Existing object removal methods suffer from texture inconsistency, rigid artifacts, weak foreground-background disentanglement, and poor scalability to multiple objects, and their reliance on fine-tuning or prompt engineering limits practicality and generalization.
Method: PANDORA has two components: (1) Pixel-wise Attention Dissolution, which cuts the object out of the self-attention flow by suppressing the attention keys most correlated with the masked pixels; and (2) Localized Attentional Disentanglement Guidance, which steers denoising toward latent manifolds that favor clean removal. Together they enable multi-object removal in a single forward pass.
Result: PANDORA clearly outperforms state-of-the-art methods on several benchmarks, with higher visual fidelity and semantic plausibility, and supports prompt-free, zero-shot, non-rigid, scalable multi-object erasure.
Conclusion: PANDORA shows that manipulating the internal attention of a pre-trained diffusion model is enough for high-quality zero-shot object removal, pointing to a new approach to training-free editing.
Abstract: Removing objects from natural images is challenging due to the difficulty of synthesizing semantically coherent content while preserving background integrity. Existing methods often rely on fine-tuning, prompt engineering, or inference-time optimization, yet still suffer from texture inconsistency, rigid artifacts, weak foreground-background disentanglement, and poor scalability for multi-object removal. We propose a novel zero-shot object removal framework, namely PANDORA, that operates directly on pre-trained text-to-image diffusion models, requiring no fine-tuning, prompts, or optimization. We propose Pixel-wise Attention Dissolution to remove an object by nullifying the most correlated attention keys for masked pixels, effectively eliminating the object from self-attention flow and allowing background context to dominate reconstruction. We further introduce Localized Attentional Disentanglement Guidance to steer denoising toward latent manifolds favorable to clean object removal. Together, these components enable precise, non-rigid, prompt-free, and scalable multi-object erasure in a single pass. Experiments demonstrate superior visual fidelity and semantic plausibility compared to state-of-the-art methods. The project page is available at https://vdkhoi20.github.io/PANDORA.
[217] Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method
Xiaoran Xu,Xiaoshan Yang,Jiangang Yang,Yifan Xu,Jian Liu,Changsheng Xu
Main category: cs.CV
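TL;DR: This paper introduces a new task, DG-OVOD, shows that under domain shift OVOD suffers a collapse of the cross-modal space because the coupling between visual manifolds and text embeddings is fragile, and designs PICA, which improves generalization through progressive domain-invariant alignment.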
Details
Motivation: Existing open-vocabulary object detection (OVOD) assumes domain stationarity and is fragile under real distribution shifts; the root cause is the unstable coupling between the visual and textual modalities in latent space.
Method: Progressive Domain-invariant Cross-modal Alignment (PICA) uses a curriculum over multiple levels of ambiguity and signal strength, and builds pseudo-word prototypes that are adaptively refined by sample confidence and visual consistency, to align modalities across domains.
Result: Empirical analysis shows that visual domain shift collapses the latent cross-modal space; PICA substantially mitigates this and improves robust detection of unseen domains and novel categories on the DG-OVOD benchmark.
Conclusion: The domain generalization ability of OVOD depends fundamentally on the stability of latent cross-modal alignment. The paper defines the DG-OVOD task, builds a benchmark for it, and offers a new perspective on general vision systems for open, dynamic environments.
Abstract: Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level ambiguity and signal strength curriculum. It builds adaptive pseudo-word prototypes, refined via sample confidence and visual consistency, to enforce invariant cross-domain modality alignment. Our findings suggest that OVOD's robustness to domain shifts is intrinsically linked to the stability of the latent cross-modal alignment space. Our work provides both a challenging benchmark and a new perspective on building truly generalizable open-vocabulary systems that extend beyond static laboratory conditions.
[218] Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
Baoheng Zhang,Jiahui Liu,Gui Zhao,Weizhou Zhang,Yixuan Ma,Jun Jiang,Yingxian Chen,Wilton W. T. Fok,Xiaojuan Qi,Hayden Kwok-Hay So
Main category: cs.CV
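TL;DR: This paper proposes Event-MLLM, which fuses event streams with RGB frames and introduces an Illumination Indicator and an Illumination Correction Loss to achieve robust vision-language reasoning under extreme illumination.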
Details
Motivation: Existing MLLMs fail under extreme illumination because RGB information degrades severely; the illumination insensitivity of event cameras can be used to improve robustness.
Method: Event-MLLM has two core components: (1) an Illumination Indicator learned from a DINOv2 branch that dynamically modulates event-RGB fusion, and (2) an Illumination Correction Loss that aligns fused features with normal-light semantics in latent space. The authors also build the first multi-illumination event-instruction dataset for MLLMs.
Result: On the authors' extreme-illumination benchmark, Event-MLLM clearly outperforms general-purpose, illumination-adaptive, and event-only baselines, reaching state-of-the-art performance.
Conclusion: Event streams can compensate for the semantics that RGB loses under extreme illumination; combined with dynamic fusion and latent-space semantic alignment, they markedly improve MLLM vision-language reasoning in difficult lighting.
Abstract: Multimodal Large Language Models (MLLMs) perform strong vision-language reasoning under standard conditions but fail in extreme illumination, where RGB inputs irrevocably lose structure and semantics. We propose Event-MLLM, an event-enhanced model that performs all-light visual reasoning by dynamically fusing event streams with RGB frames. Two key components drive our approach: an Illumination Indicator - a learnable signal derived from a DINOv2 branch that represents exposure degradation and adaptively modulates event-RGB fusion - and an Illumination Correction Loss that aligns fused features with non-degraded (normal-light) semantics in the latent space, compensating for information lost in extreme lighting. We curate the first multi-illumination event-instruction corpus for MLLMs, with 2,241 event-RGB samples (around 6 QA pairs each) across diverse scenes and 17 brightness rates (0.05x - 20x), plus an instruction-following benchmark for reasoning, counting, and fine-grained recognition under extreme lighting. Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting a new state of the art in robust multimodal perception and reasoning under challenging illumination.
[219] Structured Observation Language for Efficient and Generalizable Vision-Language Navigation
Daojie Peng,Fulong Ma,Jun Ma
Main category: cs.CV
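TL;DR: SOL-Nav divides RGB-D images into a grid, extracts semantic, color, and depth information to produce structured text descriptions, and feeds them together with the language instruction into a pre-trained language model (PLM) for navigation, reducing reliance on visual pre-training and improving generalization.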
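The grid-to-text step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the color palette, the `describe_grid` name, and the output format are assumptions made for illustration, and the semantic label per cell is omitted because it would need a segmentation or detection model.

```python
import numpy as np

# Hypothetical color vocabulary; the vocabulary used by SOL-Nav is not given in the abstract.
PALETTE = {
    "red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255),
    "white": (255, 255, 255), "black": (0, 0, 0), "gray": (128, 128, 128),
}

def nearest_color_name(rgb):
    """Return the palette name closest (Euclidean distance) to an RGB value."""
    names = list(PALETTE)
    target = np.asarray(rgb, dtype=float)
    dists = [np.linalg.norm(target - np.array(PALETTE[n], dtype=float)) for n in names]
    return names[int(np.argmin(dists))]

def describe_grid(rgb, depth, n=3):
    """Split an HxWx3 image and an HxW depth map into an n x n grid
    and emit one line of text per cell (mean color, median depth)."""
    h, w = depth.shape
    rows = np.linspace(0, h, n + 1, dtype=int)
    cols = np.linspace(0, w, n + 1, dtype=int)
    lines = []
    for i in range(n):
        for j in range(n):
            cell_rgb = rgb[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].reshape(-1, 3)
            cell_depth = depth[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            color = nearest_color_name(cell_rgb.mean(axis=0))
            lines.append(f"cell({i},{j}): color={color}, depth={np.median(cell_depth):.1f}m")
    return "\n".join(lines)
```

The resulting text would then be concatenated with the navigation instruction and passed to the language model as ordinary input.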
Details
Motivation: Existing VLN methods depend on large-scale visual pre-training and generalize poorly when the environment changes (e.g., lighting, texture).
Method: RGB-D images are divided into an N×N grid; semantic, color, and depth information is extracted for each cell to build a structured text description, which is concatenated with the language instruction and fed to a pre-trained language model (PLM) for end-to-end navigation.
Result: On standard VLN benchmarks such as R2R and RxR and in real-world deployments, SOL-Nav substantially reduces model size and training-data dependency, exploits the reasoning ability of PLMs, and generalizes well to unseen environments.
Conclusion: Replacing raw visual input with structured language is an effective way to improve the generalization and efficiency of VLN, and SOL-Nav suggests a route to lightweight, general embodied navigation.
Abstract: Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB-D images into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as pure language input to a pre-trained language model (PLM). Experimental results on standard VLN benchmarks (R2R, RxR) and real-world deployments demonstrate that SOL-Nav significantly reduces the model size and training data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.
[220] A Robust Low-Rank Prior Model for Structured Cartoon-Texture Image Decomposition with Heavy-Tailed Noise
Weihao Tang,Hongjin He
Main category: cs.CV
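TL;DR: This paper proposes a cartoon-texture image decomposition model based on a robust low-rank prior. It replaces the conventional ℓ2 norm with the Huber loss to cope with heavy-tailed noise, uses the total variation norm and the nuclear norm to characterize the cartoon and texture components respectively, and designs two operator splitting algorithms for different degradation operators. Experiments confirm its superior performance under strong heavy-tailed noise.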
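To see why the Huber loss is less sensitive to heavy-tailed noise than a squared penalty, the standard definition can be written out. This is a generic sketch of the textbook Huber function, with `delta` as an assumed threshold parameter; it is not the paper's full decomposition model.

```python
import numpy as np

def huber(r, delta=1.0):
    """Elementwise Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = np.asarray(r, dtype=float)
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def squared(r):
    """Elementwise squared loss, for comparison."""
    r = np.asarray(r, dtype=float)
    return 0.5 * r ** 2
```

For a small residual the two penalties agree, while for a large outlier the Huber penalty grows only linearly, so a few extreme pixels dominate the data-fidelity term far less.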
Details
Motivation: Cartoon-texture image decomposition rarely gives robust, accurate results in the presence of heavy-tailed noise, so a more robust model is needed.
Method: The data-fidelity term uses the Huber loss instead of the ℓ2 norm, combined with the total variation norm (for the cartoon component) and the nuclear norm (for the texture component). Two implementable operator splitting algorithms are designed for different degradation operators.
Result: In image restoration tasks under high-intensity heavy-tailed noise, the proposed model outperforms existing methods.
Conclusion: Combining the Huber loss with low-rank and structural priors markedly improves the robustness and accuracy of cartoon-texture decomposition under heavy-tailed noise.
Abstract: Cartoon-texture image decomposition is a fundamental yet challenging problem in image processing. A significant hurdle in achieving accurate decomposition is the pervasive presence of noise in the observed images, which severely impedes robust results. To address the challenging problem of cartoon-texture decomposition in the presence of heavy-tailed noise, we propose a robust low-rank prior model. Our approach departs from conventional models by adopting the Huber loss function as the data-fidelity term, rather than the traditional $\ell_2$-norm, while retaining the total variation norm and nuclear norm to characterize the cartoon and texture components, respectively. Given the inherent structure, we employ two implementable operator splitting algorithms, tailored to different degradation operators. Extensive numerical experiments, particularly on image restoration tasks under high-intensity heavy-tailed noise, demonstrate the superior performance of our model.
[221] STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding
Junho Kim,Hosu Lee,James M. Rehg,Minsu Kim,Yong Man Ro
Main category: cs.CV
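TL;DR: This paper proposes STRIDE, which uses structured temporal modeling and iterative denoising to improve decisions about when to respond in streaming video.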
Details
Motivation: Real-world deployments need streaming perception and proactive interaction, whereas existing Video-LLMs mainly target offline reasoning over long videos and do not model the key question of when to respond.
Method: Proactive activation is cast as a structured sequence modeling problem: activation signals are modeled jointly over a sliding temporal window, and a lightweight masked diffusion module refines them iteratively.
Result: Experiments on several streaming video benchmarks and downstream models show that STRIDE markedly improves the reliability and temporal coherence of response-timing decisions.
Conclusion: By explicitly modeling span-level temporal structure and refining iteratively, STRIDE addresses the central challenge of predicting when to respond proactively in streaming video.
Abstract: Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE yields more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
[222] You Only Erase Once: Erasing Anything without Bringing Unexpected Content
Yixing Zhu,Qing Zhang,Wenju Xu,Wei-Shi Zheng
Main category: cs.CV
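TL;DR: YOEO is a diffusion-based object erasure method trained on unpaired real-world images. Without paired data, it produces high-quality, artifact-free erasure while keeping the result coherent with the surrounding context.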
Details
Motivation: Existing diffusion-based methods lack sufficient paired training data and explicit constraints on what is generated inside the masked region, so they struggle to erase the target object without generating unexpected content.
Method: YOEO trains an object erasure diffusion model on large-scale unpaired real-world images, supervised by a sundries detector and a context coherence loss built on an entity segmentation model, and uses diffusion distillation to enable few-step inference.
Result: Experiments show that YOEO outperforms state-of-the-art object erasure methods, producing high-quality, artifact-free results that stay coherent with the scene context.
Conclusion: YOEO addresses object erasure without paired data, balancing erasure quality and context fidelity, and shows practical potential.
Abstract: We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods, which struggle to erase target objects without generating unexpected content within the masked regions due to a lack of sufficient paired training data and explicit constraints on content generation, our method produces high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving overall context coherence with the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Code will be available at https://zyxunh.github.io/YOEO-ProjectPage/.
[223] Clore: Interactive Pathology Image Segmentation with Click-based Local Refinement
Tiantong Wang,Minfan Zhao,Jun Shi,Hannan Wang,Yue Dai
Main category: cs.CV
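TL;DR: This paper proposes Clore, a click-based local refinement method. Its hierarchical interaction paradigm (initial global segmentation followed by local detail refinement) improves the accuracy and efficiency of interactive pathology image segmentation, reducing the number of interactions and capturing fine-grained structures better.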
Details
Motivation: Existing click-based interactive segmentation methods rely on iterative global updates, which cause redundant re-prediction and make it hard to capture fine-grained structures or correct subtle local errors.
Method: The Click-based Local Refinement (Clore) pipeline uses a hierarchical interaction paradigm: initial clicks drive a coarse global segmentation, and subsequent clicks progressively refine boundaries locally.
Result: Experiments on four datasets show that Clore strikes the best balance between segmentation accuracy and interaction cost, reaching higher accuracy with fewer clicks.
Conclusion: Clore is a simple, efficient method that improves the accuracy, efficiency, and practicality of interactive pathology image segmentation.
Abstract: Recent advancements in deep learning-based interactive segmentation methods have significantly improved pathology image segmentation. Most existing approaches utilize user-provided positive and negative clicks to guide the segmentation process. However, these methods primarily rely on iterative global updates for refinement, which lead to redundant re-prediction and often fail to capture fine-grained structures or correct subtle errors during localized adjustments. To address this limitation, we propose the Click-based Local Refinement (Clore) pipeline, a simple yet efficient method designed to enhance interactive segmentation. The key innovation of Clore lies in its hierarchical interaction paradigm: the initial clicks drive global segmentation to rapidly outline large target regions, while subsequent clicks progressively refine local details to achieve precise boundaries. This approach not only improves the ability to handle fine-grained segmentation tasks but also achieves high-quality results with fewer interactions. Experimental results on four datasets demonstrate that Clore achieves the best balance between segmentation accuracy and interaction cost, making it an effective solution for efficient and accurate interactive pathology image segmentation.
[224] OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation
Sanghyeon Lee,Minwoo Lee,Euijin Shin,Kangyeol Kim,Seunghwan Choi,Jaegul Choo
Main category: cs.CV
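TL;DR: This paper proposes a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. It composes learnable, panel-specific orthogonal operators onto the frozen positional encodings, which preserves feature geometry and same-panel invariance, and it improves instructional editing across a variety of positional encoding designs.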
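The two properties named in this summary (isometry and same-panel invariance) follow directly from orthogonality, which a small numerical sketch can make concrete. The sketch below is an illustration under stated assumptions, not the authors' code: the operators are random orthogonal matrices obtained by QR decomposition rather than learned parameters, and `pos`, `panel`, and `Q` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # assumed positional-encoding dimension

def random_orthogonal(dim, rng):
    """Orthogonal matrix from the QR decomposition of a Gaussian matrix
    (a stand-in for a learned panel-specific operator)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

# One operator per panel; the positional encodings themselves stay frozen.
Q = {0: random_orthogonal(d, rng), 1: random_orthogonal(d, rng)}
pos = rng.standard_normal((4, d))      # frozen positional encodings of 4 tokens
panel = np.array([0, 0, 1, 1])         # panel index of each token

adapted = np.stack([Q[p] @ x for p, x in zip(panel, pos)])

# Isometry: norms are unchanged because Q^T Q = I.
norms_preserved = np.allclose(np.linalg.norm(adapted, axis=1),
                              np.linalg.norm(pos, axis=1))

# Same-panel invariance: (Q x)^T (Q y) = x^T y for tokens sharing a panel.
same_panel_preserved = np.isclose(adapted[0] @ adapted[1], pos[0] @ pos[1])
```

Inner products between tokens from different panels are governed by the relative operator `Q[0].T @ Q[1]`, which is where panel-relative information can enter.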
Details
Motivation: Existing methods for panel-aware in-context image generation lack an adaptation mechanism that is parameter-efficient and preserves the model's original behavior, so they struggle to model relative conditioning across panels while retaining intra-panel synthesis ability.
Method: Learnable, panel-specific orthogonal operators are composed onto the frozen positional encodings of a pre-trained diffusion transformer, guaranteeing isometry and same-panel invariance.
Result: The method generalizes well across positional encoding schemes and consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches.
Conclusion: The proposed orthogonal-operator adaptation is a general, parameter-efficient, geometry-preserving strategy for fine-tuning panel-aware diffusion models.
Abstract: We introduce a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. The key idea is to compose learnable, panel-specific orthogonal operators onto the backbone's frozen positional encodings. This design provides two desirable properties: (1) isometry, which preserves the geometry of internal features, and (2) same-panel invariance, which maintains the model's pre-trained intra-panel synthesis behavior. Through controlled experiments, we demonstrate that the effectiveness of our adaptation method is not tied to a specific positional encoding design but generalizes across diverse positional encoding regimes. By enabling effective panel-relative conditioning, the proposed method consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches.
[225] OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery
Qi Guo,Jue Wang,Yinhe Liu,Yanfei Zhong
Main category: cs.CV
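TL;DR: This paper proposes OpenDPR, which splits open-vocabulary change detection (OVCD) into two stages: visual foundation models first generate change proposals, and the proposals are then assigned categories. To address the two main bottlenecks, category identification and change localization, it introduces a training-free diffusion-guided prototype retrieval method and a weakly supervised spatial-to-change module (S2C), respectively, which together improve performance substantially.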
Details
Motivation: Open-vocabulary change detection (OVCD) must generalize to arbitrary change categories, but existing methods are limited in two respects: category identification (VLMs represent fine-grained land-cover categories poorly) and change localization (VFMs lack change priors).
Method: OpenDPR (1) uses diffusion models to build diverse visual prototypes for the target categories offline and, at inference time, retrieves the most similar prototypes for each change proposal in visual space; and (2) adds a weakly supervised spatial-to-change module, S2C, that adapts the spatial modeling ability of VFMs to change localization. Combining the two gives OpenDPR-W.
Result: On four benchmark datasets, OpenDPR and its weakly supervised variant OpenDPR-W both achieve state-of-the-art performance.
Conclusion: By decoupling change proposal from category identification, building visual prototypes with diffusion models, and adding a weakly supervised localization module, OpenDPR alleviates the core bottlenecks of OVCD in remote sensing imagery.
Abstract: Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and to perform similarity retrieval with change proposals in the visual space during inference. The secondary bottleneck lies in change localization, due to the inherent lack of change priors in VFMs. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for change localization. Integrating the pretrained S2C into OpenDPR leads to an optional weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes.
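The retrieval step described above (matching change proposals to class prototypes in visual feature space) can be sketched with plain cosine similarity. This is a hedged illustration only: the feature extractor, the diffusion-based prototype generation, and the actual scoring rule of OpenDPR are not specified here, and `classify_proposals` and its max-over-prototypes rule are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def classify_proposals(proposal_feats, prototypes):
    """Assign each proposal the class whose best-matching prototype is most similar.

    proposal_feats: array of shape (P, D), one visual feature per change proposal.
    prototypes: dict mapping class name -> array of shape (K, D).
    Returns a list of (class_name, cosine_similarity) pairs, one per proposal.
    """
    feats = l2_normalize(proposal_feats)
    names = list(prototypes)
    # For every class, keep the highest similarity over its prototypes.
    scores = np.stack(
        [(feats @ l2_normalize(prototypes[n]).T).max(axis=1) for n in names],
        axis=1,
    )
    best = scores.argmax(axis=1)
    return [(names[b], float(scores[i, b])) for i, b in enumerate(best)]
```

Because the comparison happens between visual features only, no text encoder is involved at this stage, which is the sense in which such retrieval is vision-centric.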
Code is available at https://github.com/guoqi2002/OpenDPR.
[226] V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models
Xinying Lin,Xuyang Liu,Yiyu Wang,Teng Ma,Wenqi Ren
Main category: cs.CV
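TL;DR: This paper proposes V-CAST, a training-free, plug-and-play method that combines curvature-guided temporal allocation with dual-anchor spatial selection to make long-video inference in VideoLLMs more efficient. It balances spatio-temporal coverage with positional alignment, lowering memory use and latency while retaining most of the original performance.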
Details
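Motivation: Long-context inference in VideoLLMs is dominated by redundant visual tokens, and token compression suffers from insufficient spatio-temporal coverage: coarse per-frame allocation or scene segmentation leaves coverage discontinuous, while token merging breaks the (t,h,w) coordinate alignment under MRoPE.
Method: V-CAST is a training-free, plug-and-play pruning policy that (1) casts token compression as a trajectory approximation problem; (2) uses a curvature-guided temporal allocation module to route per-frame token budgets to semantic turns and event boundaries; and (3) adopts a dual-anchor spatial selection mechanism that preserves high-entropy visual evidence without intervening in attention, keeping retained tokens at their original coordinates to maintain positional alignment.
Result: Across VideoLLMs of different architectures and scales, V-CAST retains 98.6% of the original performance and outperforms the second-best method by 1.1% on average; peak memory and total latency fall to 86.7% and 86.4% of the Qwen3-VL-8B-Instruct baseline, respectively.
Conclusion: V-CAST eases the tension between spatio-temporal coverage and coordinate alignment in token compression for long-video understanding, and is an efficient, general, deployment-friendly optimization for the prefill stage.
Abstract: Video large language models (VideoLLMs) show strong capability in video understanding, yet long-context inference is still dominated by massive redundant visual tokens in the prefill stage. We revisit token compression for VideoLLMs under a tight budget and identify a key bottleneck, namely insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t,h,w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token budgets to semantic turns and event boundaries. It further adopts a dual-anchor spatial selection mechanism that preserves high-entropy visual evidence without attention intervention, while keeping retained tokens at their original coordinates to maintain positional alignment.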
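One way to picture curvature-guided budget allocation is to treat per-frame features as points on a trajectory and give more tokens to frames where the trajectory bends. The sketch below is a hypothetical illustration, not V-CAST itself: the second-difference curvature proxy, the `floor` parameter, and both function names are assumptions, and it requires at least three frames.

```python
import numpy as np

def curvature_scores(frame_feats):
    """Discrete curvature proxy: norm of the second difference of frame features.
    Endpoint frames reuse the score of their nearest interior neighbor."""
    f = np.asarray(frame_feats, dtype=float)
    second = f[2:] - 2.0 * f[1:-1] + f[:-2]
    inner = np.linalg.norm(second, axis=1)
    return np.concatenate([inner[:1], inner, inner[-1:]])

def allocate_budget(scores, total, floor=1):
    """Give every frame `floor` tokens, then split the remainder in proportion
    to the scores, using largest-remainder rounding so the sum equals `total`."""
    scores = np.asarray(scores, dtype=float)
    t = len(scores)
    spare = total - floor * t
    if spare < 0:
        raise ValueError("total budget is smaller than floor * number of frames")
    weights = scores / scores.sum() if scores.sum() > 0 else np.full(t, 1.0 / t)
    raw = weights * spare
    extra = np.floor(raw).astype(int)
    leftover = int(spare - extra.sum())
    # Hand the remaining tokens to the frames with the largest fractional parts.
    order = np.argsort(-(raw - extra), kind="stable")
    extra[order[:leftover]] += 1
    return extra + floor
```

With features that move in a straight line and then turn, the frame at the turn receives most of the spare budget, while every frame still keeps a minimum number of tokens so temporal coverage stays continuous.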
Extensive experiments across multiple VideoLLMs of different architectures and scales demonstrate that V-CAST achieves 98.6% of the original performance, outperforms the second-best method by +1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4% of vanilla Qwen3-VL-8B-Instruct.
[227] Amped: Adaptive Multi-stage Non-edge Pruning for Edge Detection
Yuhan Gao,Xinqing Li,Xin He,Bing Li,Xinzhong Zhu,Ming-Ming Cheng,Yun Liu
Main category: cs.CV
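TL;DR: This paper proposes Amped, an adaptive multi-stage non-edge token pruning framework that makes Transformer-based edge detectors more efficient, together with SED, a new lightweight and efficient model. SED reaches a high ODS F-measure of 86.5%, and the pruning cuts GFLOPs by up to 40% at a cost of only 0.4% ODS.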
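The idea of dropping confidently non-edge tokens stage by stage can be sketched as repeated thresholding with index bookkeeping. This is a simplified illustration under assumptions, not the Amped implementation: the per-stage scoring functions, the fixed `threshold`, and the `multi_stage_prune` name are hypothetical, and a real detector would interleave Transformer blocks between stages.

```python
import numpy as np

def multi_stage_prune(tokens, stage_scorers, threshold=0.05):
    """Remove tokens whose predicted edge probability falls below `threshold`.

    tokens: array of shape (N, D).
    stage_scorers: list of callables mapping (M, D) tokens to (M,) edge probabilities.
    Returns the surviving tokens and their indices in the original sequence.
    """
    idx = np.arange(len(tokens))
    for scorer in stage_scorers:
        probs = np.asarray(scorer(tokens))
        keep = probs >= threshold          # low probability = confident non-edge
        tokens, idx = tokens[keep], idx[keep]
    return tokens, idx
```

Keeping the original indices matters because the pruned positions still need a prediction in the output map; under this sketch they would simply be filled in as non-edge.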
Details
Motivation: Existing Transformer-based edge detectors are accurate, but long-range modeling and high-resolution inputs make them computationally expensive and hard to deploy.
Method: The Amped framework identifies high-confidence non-edge tokens over multiple stages and prunes them early. The authors also design SED, a lightweight Transformer model with a simplified structure that is easier to use in practice.
Result: Across several detectors, Amped cuts GFLOPs by up to 40% with only a 0.4% drop in ODS; SED alone reaches a state-of-the-art ODS F-measure of 86.5%.
Conclusion: Together, Amped and SED balance accuracy and efficiency in edge detection and make Transformer models more practical for real-world use.
Abstract: Edge detection is a fundamental image analysis task that underpins numerous high-level vision applications. Recent advances in Transformer architectures have significantly improved edge quality by capturing long-range dependencies, but this often comes with computational overhead. Achieving higher pixel-level accuracy requires increased input resolution, further escalating computational cost and limiting practical deployment. Building on the strong representational capacity of recent Transformer-based edge detectors, we propose an Adaptive Multi-stage non-edge Pruning framework for Edge Detection (Amped). Amped identifies high-confidence non-edge tokens and removes them as early as possible to substantially reduce computation, thus retaining high accuracy while cutting GFLOPs and accelerating inference with minimal performance loss. Moreover, to mitigate the structural complexity of existing edge detection networks and facilitate their integration into real-world systems, we introduce a simple yet high-performance Transformer-based model, termed Streamline Edge Detector (SED). Applied to both existing detectors and our SED, the proposed pruning strategy provides a favorable balance between accuracy and efficiency, reducing GFLOPs by up to 40% with only a 0.4% drop in ODS F-measure. In addition, despite its simplicity, SED achieves a state-of-the-art ODS F-measure of 86.5%. The code will be released.
[228] A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos
David Miranda Paredes,Jose M. Saavedra,Marcelo Pizarro
Main category: cs.CV
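TL;DR: This paper systematically evaluates eight open-source video large language models (VidLLMs) on automatic news video captioning, proposes two new metrics, TFS and EFS, to cover what conventional metrics miss in thematic structure and entity coverage, and finds that Gemma 3 performs best overall.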
Details
Motivation: News videos are abundant, yet captioning them still relies on manual work; existing VidLLMs lack a comprehensive evaluation in the news domain; and conventional metrics suffer from surface-form dependence, static-frame insensitivity, and function-word inflation.
Method: Two news video benchmark datasets are used (Chilean TV news and BBC News). Eight open-source VidLLMs are compared with conventional metrics (METEOR, ROUGE-L, BERTScore, and others) and with the newly proposed Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS).
Result: Standard metrics have limited discriminative power, while TFS and EFS assess thematic structure preservation and named-entity coverage more effectively. Gemma 3 performs best on both datasets and on most dimensions, with Qwen-VL a consistent runner-up.
Conclusion: News video captioning needs evaluation metrics better matched to the domain. Gemma 3 is currently the strongest open-source VidLLM in this study, and TFS and EFS offer an evaluation approach that can carry over to future news understanding tasks.
Abstract: News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS). Our analysis reveals that standard metrics exhibit limited discriminative power for news video captioning due to surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing thematic structure preservation and named-entity coverage in the generated captions. Results show that Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
[229] LiDAR for Crowd Management: Applications, Benefits, and Future Directions
Abdullah Khanfor,Chaima Zaghouani,Hakim Ghazzai,Ahmad Alsharoa,Gianluca Setti
Main category: cs.CV
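TL;DR: This article surveys the use of LiDAR in crowd management, covering its advantages, a task taxonomy (detection, counting, tracking, behavior classification), challenges, and future research directions.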
Details
Motivation: Improve crowd management, with LiDAR compensating for the shortcomings of traditional monitoring technologies in privacy protection, robustness to adverse weather, and 3D spatial precision.
Method: Survey-style analysis: builds a taxonomy of four crowd management tasks, illustrates each with LiDAR application examples, and identifies current technical bottlenecks and research gaps.
Result: Systematically reviews LiDAR applications in crowd detection, counting, tracking, and behavior classification; identifies key challenges including dataset scarcity, multi-sensor fusion, AI integration, and point-cloud processing.
Conclusion: LiDAR is a highly promising technology for public-safety crowd management, and further progress in data, algorithms, and systems requires cross-disciplinary collaboration.
Abstract: Light Detection and Ranging (LiDAR) technology offers significant advantages for effective crowd management. This article presents LiDAR technology and highlights its primary advantages over other monitoring technologies, including enhanced privacy, performance in various weather conditions, and precise 3D mapping. We present a general taxonomy of four key tasks in crowd management: crowd detection, counting, tracking, and behavior classification, with illustrative examples of LiDAR applications for each task. We identify challenges and open research directions, including the scarcity of dedicated datasets, sensor fusion requirements, artificial intelligence integration, and processing needs for LiDAR point clouds. This article offers actionable insights for developing crowd management solutions tailored to public safety applications.
[230] Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling
Minh-Tuan Tran,Xuan-May Le,Quan Hung Tran,Mehrtash Harandi,Dinh Phung,Trung Le
Main category: cs.CV
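TL;DR: This paper proposes Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition: before generation, it produces input-conditioned parameter adaptations and injects them into the pretrained weights, giving per-input specialization without fine-tuning and substantially improving generation quality and context awareness.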
Details
Motivation: Inspired by humans' ability to flexibly adapt internal generative representations to each perceptual or imagined context, the work addresses the limitation that existing generative models (such as diffusion models and auto-regressive networks) rely on fixed pretrained parameters and lack dynamic adaptability.
Method: At test time, Composer generates input-conditioned parameter adaptations and injects them into the pretrained model's weights; adaptation is performed once and then reused across the multi-step generation process, with no fine-tuning or retraining.
Result: Substantially improves performance across diverse generative models and use cases (including lightweight/quantized models and test-time scaling) with minimal computational and memory overhead.
Conclusion: Through input-aware parameter composition, Composer establishes a paradigm in which generative models adapt dynamically to each input, moving beyond static parameterization.
Abstract: Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model's weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization.
[231] Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
Yuhe Liu,Zhenxiong Tan,Yujia Hu,Songhua Liu,Xinchao Wang
Main category: cs.CV
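TL;DR: This paper proposes a controllable diffusion generation framework for linear-attention architectures such as SANA; a unified gated conditioning module in a dual-path pipeline fuses heterogeneous condition inputs, targeting efficient, secure, high-fidelity, and controllable image generation on edge devices.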
Details
Motivation: Existing diffusion-based controllable generation methods are computationally expensive and usually deployed in the cloud, which raises user data privacy risks; linear-attention models suited to edge devices are hard to combine with existing controllable frameworks (such as ControlNet and OminiControl), which either lack flexibility or converge slowly.
Method: A controllable diffusion framework designed for linear-attention backbones, built around a unified gated conditioning module in a dual-path structure that handles heterogeneous condition inputs, both spatially aligned and non-aligned.
Result: Experiments on multiple tasks and benchmarks show state-of-the-art controllable generation among linear-attention models, surpassing existing methods in both fidelity and controllability.
Conclusion: The framework addresses condition fusion and training efficiency for controllable generation with linear-attention models, offering a practical and efficient route to privacy-preserving on-device controllable image generation.
Abstract: Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively integrates multi-type conditional inputs, such as spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability.
[232] Customized Visual Storytelling with Unified Multimodal LLMs
Wei-Hua Li,Cheng Sun,Chu-Song Chen
Main category: cs.CV
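TL;DR: This paper proposes VstoryGen, a multimodal story generation framework that produces coherent, cinematic story sequences from text descriptions, character identity images, and shot types; shot-type control is achieved with parameter-efficient prompt tuning, and two new benchmarks are built for evaluation.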
Details
Motivation: Most existing story generation methods rely on text-only input; a few add character identity cues but lack broader multimodal conditioning, making it hard to combine character and scene consistency with diverse cinematic expression.
Method: VstoryGen integrates text descriptions with character and background reference images; adds shot-type control via parameter-efficient prompt tuning on movie data; and establishes two new benchmarks assessing character and scene consistency, text-visual alignment, and shot-type control.
Result: On the two new benchmarks, VstoryGen outperforms existing methods in character and scene consistency, text-visual alignment, and shot-type control, showing stronger coherence and cinematic diversity.
Conclusion: VstoryGen shows that multimodal conditions, shot type in particular, matter for story generation quality and cinematic expressiveness, and offers an effective route to controllable, customized multimodal storytelling.
Abstract: Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.
[233] LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation
Shentong Mo,Sukmin Yun
Main category: cs.CV
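TL;DR: This paper proposes LVRPO, a framework that uses reinforcement learning with Group Relative Policy Optimization (GRPO) to explicitly align language and visual representations, without extra encoders or handcrafted cross-modal objectives, improving multimodal understanding, generation, and reasoning.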
Details
Motivation: Existing unified multimodal pretraining methods rely on implicit or indirect alignment signals, fall short on fine-grained language-visual reasoning and controllable generation, and struggle to support understanding and generation at the same time.
Method: LVRPO applies Group Relative Policy Optimization (GRPO) for language-visual reinforcement-based preference optimization, directly optimizing model behavior with preference-driven reinforcement signals to obtain semantically consistent language-visual interaction.
Result: Across a broad suite of benchmarks covering multimodal understanding, generation, and reasoning, LVRPO consistently outperforms strong unified-pretraining baselines.
Conclusion: LVRPO offers an effective multimodal alignment approach that needs no auxiliary encoders or handcrafted objectives, extends naturally to diverse multimodal capabilities, and improves performance across tasks.
Abstract: Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
[234] Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?
Samik Some,Vinay P. Namboodiri
Main category: cs.CV
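TL;DR: This paper examines how unannotated video frames and coarse annotations, combined with segmentation foundation models such as SAM and SAM 2, can cut annotation cost for video semantic segmentation; experiments suggest manual annotation can be reduced by about a third with similar performance, and that frame variety matters more than frame count.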
Details
Motivation: Deep networks for video semantic segmentation depend on large amounts of expensive fine-grained pixel-level annotation, whereas unannotated frames and coarse annotations are cheap and easy to obtain; the question is how to use these low-cost resources to reduce the annotation burden.
Method: Use the Segment Anything Model (SAM) and SAM 2 to generate masks automatically from unannotated frames and coarse annotations, reducing the need for manual fine annotation; also analyze how frame count and frame variety affect final performance.
Result: Used appropriately, SAM and SAM 2 reduce the need for manual annotation by about a third at similar video semantic segmentation performance; the variety of frames in the dataset matters more than the number of frames.
Conclusion: Automatic mask generation with foundation models is an effective way to lower annotation cost for video segmentation, and dataset construction should prioritize frame variety over frame count.
Abstract: Present-day deep neural networks for video semantic segmentation require a large number of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state-of-the-art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames as well as coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that if used appropriately, we can reduce the need for annotation by a third with similar performance for video semantic segmentation. More significantly, our analysis suggests that the variety of frames in the dataset is more important than the number of frames for obtaining the best performance.
[235] Ink Detection from Surface Topography of the Herculaneum Papyri
Giorgio Angelotti,Federica Nicolardi,Paul Henderson,W. Brent Seales
Main category: cs.CV
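TL;DR: This paper proposes a machine-learning method based on surface topography that uses high-resolution 3D optical profilometry to separate carbon-based ink from carbonized papyrus in the Herculaneum papyri, and identifies the spatial resolution the morphological signal requires, informing X-ray tomographic reading of closed scrolls.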
Details
Motivation: Because both the ink and the papyrus of the Herculaneum scrolls are carbonized, X-ray imaging offers little density or composition contrast, and conventional methods struggle to detect ink.
Method: Building on the morphological hypothesis, the authors acquire 3D optical profilometry of mechanically opened papyri and train machine learning models to separate inked from uninked regions; they also analyze how lateral sampling affects learnability and how a native-resolution model behaves on coarsened inputs.
Result: High-resolution surface topography alone carries a usable signal for ink detection; segmentation performance drops as lateral resolution decreases, which reveals the characteristic spatial scales that must be resolved to exploit the morphological signal.
Conclusion: The study sets spatial resolution targets for morphology-based reading of closed scrolls, informing future non-destructive X-ray tomography.
Abstract: Reading the Herculaneum papyri is challenging because both the scrolls and the ink, which is carbon-based, are carbonized. In X-ray radiography and tomography, ink detection typically relies on density- or composition-driven contrast, but carbon ink on carbonized papyrus provides little attenuation contrast. Building on the morphological hypothesis, we show that the surface morphology of written regions contains enough signal to distinguish ink from papyrus. To this end, we train machine learning models on three-dimensional optical profilometry from mechanically opened Herculaneum papyri to separate inked and uninked areas. We further quantify how lateral sampling governs learnability and how a native-resolution model behaves on coarsened inputs. We show that high-resolution topography alone contains a usable signal for ink detection. Diminishing segmentation performance with decreasing lateral resolution provides insight into the characteristic spatial scales that must be resolved on our dataset to exploit the morphological signal. These findings inform spatial resolution targets for morphology-based reading of closed scrolls through X-ray tomography.
[236] RAP: Retrieve, Adapt, and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation
Zhihao Mao,Bangpu Chen
Main category: cs.CV
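TL;DR: This paper proposes RAP, a training-free framework for few-shot medical image segmentation that retrieves supports, adapts them, and prompts SAM2, combining anatomical structure fitting with retrieval-augmented prompting.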
Details
Motivation: Existing few-shot medical image segmentation methods rely mainly on semantic correspondences from scarce annotations and under-use the repeatable high-frequency morphology of anatomical targets (such as boundary geometry and spatial layout) across patients and acquisitions.
Method: RAP has three steps: 1) retrieve morphologically compatible supports from an archive using DINOv3 features; 2) adapt the support mask to the query image by fitting boundary-aware structural cues, producing an anatomy-consistent pre-mask; 3) convert the pre-mask into prompts (positive points via Voronoi partitioning, negative points via sector-based sampling) and feed them to SAM2 for refinement without fine-tuning.
Result: On multiple medical segmentation benchmarks, RAP consistently surpasses prior few-shot segmentation baselines and achieves state-of-the-art performance.
Conclusion: Explicit structural fitting combined with retrieval-augmented prompting offers a simple and effective route to robust, training-free few-shot medical image segmentation.
Abstract: Few-shot medical image segmentation (FSMIS) has achieved notable progress, yet most existing methods mainly rely on semantic correspondences from scarce annotations while under-utilizing a key property of medical imagery: anatomical targets exhibit repeatable high-frequency morphology (e.g., boundary geometry and spatial layout) across patients and acquisitions. We propose RAP, a training-free framework that retrieves, adapts, and prompts Segment Anything Model 2 (SAM2) for FSMIS. First, RAP retrieves morphologically compatible supports from an archive using DINOv3 features to reduce brittleness in single-support choice. Second, it adapts the retrieved support mask to the query by fitting boundary-aware structural cues, yielding an anatomy-consistent pre-mask under domain shifts. Third, RAP converts the pre-mask into prompts by sampling positive points via Voronoi partitioning and negative points via sector-based sampling, and feeds them into SAM2 for final refinement without any fine-tuning. Extensive experiments on multiple medical segmentation benchmarks show that RAP consistently surpasses prior FSMIS baselines and achieves state-of-the-art performance. Overall, RAP demonstrates that explicit structural fitting combined with retrieval-augmented prompting offers a simple and effective route to robust training-free few-shot medical segmentation.
[237] Look, Compare and Draw: Differential Query Transformer for Automatic Oil Painting
Lingyu Liu,Yaxiong Wang,Li Zhu,Lizi Liao,Zhedong Zheng
Main category: cs.CV
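TL;DR: This paper proposes a neural oil painting model based on differential image analysis; with the new DQ-Transformer architecture and adversarial training, it produces more expressive, less repetitive brushstrokes and outperforms existing methods in visual realism and artistic authenticity.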
Details
Motivation: Address repetitive, inexpressive strokes in automatic oil painting by following the human "observe, compare, draw" process to make strokes more dynamic and artistic.
Method: Introduce differential image analysis; propose the Differential Query Transformer (DQ-Transformer), which combines differential image representations with positional encoding; and add adversarial training to improve stroke prediction.
Result: In qualitative evaluation and a user study, the method surpasses existing methods, achieving higher visual realism and artistic authenticity, typically with fewer strokes; stroke-by-stroke painting animations are provided.
Conclusion: The DQ-Transformer improves the expressiveness and aesthetic quality of automatic oil painting and offers a new approach to neural painting.
Abstract: This work introduces a new approach to automatic oil painting that emphasizes the creation of dynamic and expressive brushstrokes. A pivotal challenge lies in mitigating duplicate and commonplace strokes, which often lead to less aesthetic outcomes. Inspired by the human painting process, i.e., observing, comparing, and drawing, we incorporate differential image analysis into a neural oil painting model, allowing the model to effectively concentrate on the incremental impact of successive brushstrokes. To operationalize this concept, we propose the Differential Query Transformer (DQ-Transformer), a new architecture that leverages differentially derived image representations enriched with positional encoding to guide the stroke prediction process. This integration enables the model to maintain heightened sensitivity to local details, resulting in more refined and nuanced stroke generation. Furthermore, we incorporate adversarial training into our framework, enhancing the accuracy of stroke prediction and thereby improving the overall realism and fidelity of the synthesized paintings. Extensive qualitative evaluations, complemented by a controlled user study, validate that our DQ-Transformer surpasses existing methods in both visual realism and artistic authenticity, typically achieving these results with fewer strokes. The stroke-by-stroke painting animations are available on our project website.
[238] Synergizing Discriminative Exemplars and Self-Refined Experience for MLLM-based In-Context Learning in Medical Diagnosis
Wenkai Zhao,Zipei Wang,Mengjie Fang,Di Dong,Jie Tian,Lingwei Zhang
Main category: cs.CV
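TL;DR: This paper proposes the Clinician Mimetic Workflow, which requires no model weight updates and combines Discriminative Exemplar Coreset Selection (DECS) with Self-Refined Experience Summarization (SRES); on the 12 MedMNIST 2D datasets it reaches performance comparable to fully supervised and fine-tuned models, improving in-context medical diagnosis with multimodal large language models.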
Details
Motivation: General multimodal large language models (MLLMs) underperform in specialized domains such as medical diagnosis, while fully supervised fine-tuning is limited by the high cost of expert annotation and heavy computation.
Method: The Clinician Mimetic Workflow has two core modules: 1) Discriminative Exemplar Coreset Selection (DECS), which selects discriminative visual coresets from noisy data; 2) Self-Refined Experience Summarization (SRES), which distills diverse rollouts into a dynamic textual Experience Bank used for in-context learning.
Result: On all 12 MedMNIST 2D datasets, the method outperforms zero-shot general and medical MLLMs and reaches performance comparable to fully supervised vision models and domain-specific fine-tuned MLLMs.
Conclusion: The method provides an efficient in-context learning approach for medicine that needs no parameter updates, setting a new benchmark for parameter-efficient medical multimodal reasoning.
Abstract: General Multimodal Large Language Models (MLLMs) often underperform in capturing domain-specific nuances in medical diagnosis, trailing behind fully supervised baselines. Although fine-tuning provides a remedy, the high costs of expert annotation and massive computational overhead limit its scalability. To bridge this gap without updating the weights of the pre-trained backbone of the MLLM, we propose a Clinician Mimetic Workflow. This is a novel In-Context Learning (ICL) framework designed to synergize Discriminative Exemplar Coreset Selection (DECS) and Self-Refined Experience Summarization (SRES). Specifically, DECS simulates a clinician's ability to reference "anchor cases" by selecting discriminative visual coresets from noisy data at the computational level; meanwhile, SRES mimics the cognition and reflection in clinical diagnosis by distilling diverse rollouts into a dynamic textual Experience Bank. Extensive evaluation across all 12 datasets of the MedMNIST 2D benchmark demonstrates that our method outperforms zero-shot general and medical MLLMs. Simultaneously, it achieves performance levels comparable to fully supervised vision models and domain-specific fine-tuned MLLMs, setting a new benchmark for parameter-efficient medical in-context learning. Our code is available at an anonymous repository: https://anonymous.4open.science/r/Synergizing-Discriminative-Exemplars-and-Self-Refined-Experience-ED74.
[239] TIR-Agent: Training an Explorative and Efficient Agent for Image Restoration
Yisheng Zhang,Guoli Jia,Haote Hu,Shanxu Zhao,Kaikai Zhao,Long Sun,Xinwei Long,Kai Tian,Che Jiang,Zhaoxiang Liu,Kai Wang,Shiguo Lian,Kaiyan Zhang,Bowen Zhou
Main category: cs.CV
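TL;DR: This paper proposes TIR-Agent, a trainable vision-language image restoration agent; a two-stage training strategy of supervised fine-tuning followed by reinforcement learning yields degradation-aware task scheduling and tool composition, improving restoration quality and inference efficiency.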
Details
Motivation: Most existing vision-language image restoration agents are training-free and rely on heuristic task scheduling and exhaustive tool traversal, leading to sub-optimal restoration paths and high computational cost; the core bottleneck is the lack of a learned decision policy for degradation-aware task ordering and tool composition.
Method: TIR-Agent is trained in two stages, supervised fine-tuning (SFT) followed by reinforcement learning (RL); a random perturbation strategy applied to the SFT data broadens exploration, and a multi-dimensional adaptive reward mechanism dynamically re-weights multiple image quality metrics; a globally shared model-call pool supports high-throughput, asynchronous GPU tool invocation.
Result: Outperforms 12 baselines (6 all-in-one models, 3 training-free agents, 3 proprietary models) on both in-domain and out-of-domain degradations, with over 2.5× inference speedup.
Conclusion: A learned tool-calling policy is key to the performance and efficiency of vision-language image restoration agents; TIR-Agent demonstrates the effectiveness and generalization of a trainable agent on complex restoration tasks.
Abstract: Vision-language agents that orchestrate specialized tools for image restoration (IR) have emerged as a promising method, yet most existing frameworks operate in a training-free manner. They rely on heuristic task scheduling and exhaustive tool traversal, resulting in sub-optimal restoration paths and prohibitive computational cost. We argue that the core bottleneck lies in the absence of a learned policy to make decisions, as a vision-language model cannot efficiently handle degradation-aware task ordering and tool composition. To this end, we propose TIR-Agent, a trainable image restoration agent that performs a direct tool-calling policy through a two-stage training pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL). Two key designs underpin effective RL training: (i) a random perturbation strategy applied to the SFT data, which broadens the policy's exploration over task schedules and tool compositions, and (ii) a multi-dimensional adaptive reward mechanism that dynamically re-weights heterogeneous image quality metrics to mitigate reward hacking. To support high-throughput, asynchronous GPU-based tool invocation during training, we further develop a globally shared model-call pool.
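The abstract does not say how the adaptive reward re-weights its metrics. As a rough illustration of the general idea only, the sketch below weights each metric by its remaining headroom, so a metric the policy has already saturated contributes less to the reward. The headroom rule, the metric names `psnr` and `lpips`, and the assumption that every score is normalized to [0, 1] with higher meaning better are all hypothetical and not taken from the paper.

```python
from typing import Dict


def adaptive_weights(mean_scores: Dict[str, float], eps: float = 1e-6) -> Dict[str, float]:
    """Weight each metric by its headroom (1 - recent mean score).

    Hypothetical rule: metrics that are already near their maximum get a
    small weight, which limits how much reward a policy can collect by
    inflating one metric alone.
    """
    headroom = {name: max(1.0 - score, 0.0) + eps for name, score in mean_scores.items()}
    total = sum(headroom.values())
    return {name: value / total for name, value in headroom.items()}


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of normalized per-metric scores for one restored image."""
    return sum(weights[name] * scores[name] for name in scores)


# Recent batch means: the (hypothetical) "psnr" score is nearly saturated.
weights = adaptive_weights({"psnr": 0.9, "lpips": 0.5})
reward = combined_reward({"psnr": 0.95, "lpips": 0.40}, weights)
```

Under this rule the weights are recomputed from recent batch statistics, so the emphasis shifts as training progresses; other re-weighting schemes would fit the abstract's description equally well.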
Experiments on both in-domain and out-of-domain degradations show that TIR-Agent outperforms 12 baselines, including 6 all-in-one models, 3 training-free agents, and 3 proprietary models, and achieves over 2.5× inference speedup by eliminating redundant tool executions.
[240] Data Organization Matters in Multimodal Instruction Tuning: A Controlled Study of Capability Trade-offs
Guowei Tang
Main category: cs.CV
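TL;DR: This paper studies how the temporal organization of supervision data (the data scheduling strategy) during instruction tuning of multimodal large language models (MLLMs) affects the trade-off among general visual understanding, structured reasoning, and fine-grained OCR/document understanding. Holding all other training conditions fixed and changing only the temporal arrangement of post-alignment supervision, the author compares four strategies: direct mixture, curriculum training, balanced sampling, and reverse curriculum. Curriculum training gives the best overall performance and structured reasoning; balanced sampling helps OCR tasks but weakens the overall capability balance; reverse curriculum performs worst. The study argues that data scheduling is a key dimension of multimodal adaptation that should be designed explicitly.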
Details
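Motivation: MLLMs perform well on many vision-language tasks, but their abilities come from heterogeneous supervision with very different task structures and learning demands; how the temporal organization of these signals during training (data scheduling) affects capability trade-offs has not been studied systematically.
Method: A controlled three-stage training framework: the backbone, trainable modules, and optimization pipeline are fixed, and only the temporal order of post-alignment supervision changes; four data organization strategies are compared: direct mixture, curriculum training (general first, then OCR-intensive), balanced sampling, and reverse curriculum (OCR-intensive first, then general).
Result: Curriculum training gives the best overall trade-off and the strongest structured reasoning; balanced sampling improves OCR-oriented capability but weakens the broader capability balance; reverse curriculum performs worst and is unstable in optimization; training-dynamics analysis suggests that building general understanding and reasoning before introducing OCR-intensive supervision gives smoother optimization and faster convergence.
Conclusion: Data organization is a first-order design variable in multimodal instruction tuning; curriculum-style scheduling is an important strategy for overall capability and stability; data scheduling should be treated as an explicit design dimension.
Abstract: Recent multimodal large language models (MLLMs) perform strongly on general visual understanding, diagram and chart reasoning, and document-centric perception. However, these abilities are learned from heterogeneous supervision sources with very different task structures and learning demands, and the effect of their temporal organization during training remains underexplored. We study whether data organization affects the trade-off among general understanding, structured reasoning, and fine-grained OCR/document understanding in multimodal instruction tuning. To isolate this factor, we use a controlled three-stage training framework in which the backbone, trainable modules, and optimization pipeline are fixed across all runs, and only the temporal arrangement of post-alignment supervision is changed. We compare four strategies: direct mixture, curriculum training, balanced sampling, and reverse curriculum. Experiments on general visual instruction following, diagram reasoning, chart reasoning, scene-text question answering, and document question answering show that data organization is a first-order design variable in multimodal adaptation. Curriculum training gives the best overall trade-off and the strongest structured reasoning performance. Balanced sampling is better for OCR-oriented capability but weakens the broader capability balance. Reverse curriculum performs worst in both final performance and optimization stability.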
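To make the four data organization strategies concrete, the sketch below orders samples from two supervision sources under each one. It is a minimal illustration and not the paper's implementation: the source keys `general` and `ocr`, the function name `schedule`, and the interpretation of balanced sampling as alternating draws that oversample the smaller source are assumptions.

```python
import random
from typing import Dict, List


def schedule(sources: Dict[str, List[str]], strategy: str, seed: int = 0) -> List[str]:
    """Return one training order over the samples for the given strategy.

    `sources` must contain non-empty lists under the hypothetical keys
    "general" (general supervision) and "ocr" (OCR-intensive supervision).
    """
    rng = random.Random(seed)
    general, ocr = list(sources["general"]), list(sources["ocr"])
    if strategy == "direct_mixture":
        # Pool everything and shuffle once.
        mixed = general + ocr
        rng.shuffle(mixed)
        return mixed
    if strategy == "curriculum":
        # General supervision first, OCR-intensive supervision afterwards.
        return general + ocr
    if strategy == "reverse_curriculum":
        # OCR-intensive supervision first, general supervision afterwards.
        return ocr + general
    if strategy == "balanced_sampling":
        # Assumed reading: alternate sources, cycling through the smaller one.
        steps = max(len(general), len(ocr))
        order: List[str] = []
        for i in range(steps):
            order.append(general[i % len(general)])
            order.append(ocr[i % len(ocr)])
        return order
    raise ValueError(f"unknown strategy: {strategy}")


toy = {"general": ["g1", "g2", "g3"], "ocr": ["o1"]}
curriculum_order = schedule(toy, "curriculum")
balanced_order = schedule(toy, "balanced_sampling")
```

In the controlled setting the abstract describes, everything except this ordering stays fixed, so differences between runs can be attributed to the schedule.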
Training-dynamics analysis further suggests that building general understanding and reasoning before introducing OCR-intensive supervision leads to smoother optimization and faster convergence. These findings highlight data scheduling as an explicit design dimension for multimodal model adaptation.
[241] AI-Powered Facial Mask Removal Is Not Suitable For Biometric Identification
Emily A Cooper,Hany Farid
Main category: cs.CV
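TL;DR: This paper presents a large-scale analysis of the efficacy and risks of commercial AI-powered facial unmasking, assessing whether the generated faces can be reliably matched to true identities.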
Details
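Motivation: Social-media users have used generative AI to enhance low-quality visual evidence, leading to misidentifications, as in a high-profile case involving an image of a federal agent; the authors set out to evaluate systematically the reliability and potential harm of AI facial unmasking.
Method: A large-scale empirical analysis of commercial AI-powered facial unmasking tools, measuring how well the resulting faces match true identities.
Result: AI-generated "unmasked" faces generally cannot be reliably matched to true identities, creating serious risks of mismatches and misleading conclusions.
Conclusion: Current AI facial unmasking is not suitable for high-stakes settings such as judicial or public investigations, and calls for regulation and technical improvement.
Abstract: Recently, crowd-sourced online criminal investigations have used generative AI to enhance low-quality visual evidence. In one high-profile case, social-media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, fueling a widespread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.
[242] E-TIDE: Fast, Structure-Preserving Motion Forecasting from Event Sequences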
Biswadeep Sen,Benoit R. Cottereau,Nicolas Cuperlier,Terence Sim
Main category: cs.CV
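TL;DR: This paper proposes E-TIDE, a lightweight, end-to-end trainable architecture for event-tensor prediction that needs no large-scale pretraining; its TIDE module models spatiotemporal dependencies efficiently, keeping performance competitive while cutting computation and memory cost.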
Details
Motivation: Event camera data is sparse and temporally precise, but existing prediction methods depend on computationally heavy backbones or large-scale pretraining, which limits their use in resource-constrained scenarios.
Method: E-TIDE is built around the TIDE module (Temporal Interaction for Dynamic Events), which uses large-kernel mixing and activity-aware gating to model dependencies in sparse event tensors at low computational cost.
Result: On standard event-based datasets, E-TIDE achieves competitive prediction performance with a smaller model and lower training requirements, making it suitable for real-time deployment.
Conclusion: E-TIDE is an efficient, lightweight event prediction method that needs no pretraining and offers a practical option for downstream tasks under resource constraints, such as future semantic segmentation and object tracking.
Abstract: Event-based cameras capture visual information as asynchronous streams of per-pixel brightness changes, generating sparse, temporally precise data. Compared to conventional frame-based sensors, they offer significant advantages in capturing high-speed dynamics while consuming substantially less power. Predicting future event representations from past observations is an important problem, enabling downstream tasks such as future semantic segmentation or object tracking without requiring access to future sensor measurements. While recent state-of-the-art approaches achieve strong performance, they often rely on computationally heavy backbones and, in some cases, large-scale pretraining, limiting their applicability in resource-constrained scenarios. In this work, we introduce E-TIDE, a lightweight, end-to-end trainable architecture for event-tensor prediction that is designed to operate efficiently without large-scale pretraining. Our approach employs the TIDE module (Temporal Interaction for Dynamic Events), motivated by efficient spatiotemporal interaction design for sparse event tensors, to capture temporal dependencies via large-kernel mixing and activity-aware gating while maintaining low computational complexity. Experiments on standard event-based datasets demonstrate that our method achieves competitive performance with significantly reduced model size and training requirements, making it well-suited for real-time deployment under tight latency and memory budgets.
[243] RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
Junwei Zheng,Ruize Dai,Ruiping Liu,Zichao Zeng,Yufan Chen,Fangjinhua Wang,Kunyu Peng,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen
Main category: cs.CV
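TL;DR: This paper proposes RHO, a robust cross-view geo-localization method based on panoramic images and OpenStreetMap, builds the large-scale benchmark dataset CV-RHO, and reports gains of up to 20% over state-of-the-art baselines.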
Details
Motivation: Existing cross-view geo-localization mostly uses pinhole cameras and satellite images and copes poorly with adverse weather, lighting, and sensor noise; panoramas and OSM are more practical and robust, but benchmarks and methods for them are lacking.
Method: Build the large-scale CV-RHO dataset (over 2.7M images); propose the RHO model with a two-branch Pin-Pan architecture; introduce the Split-Undistort-Merge (SUM) module to correct panoramic distortion; design the Position-Orientation Fusion (POF) mechanism to fuse position and heading information.
Result: Experiments on CV-RHO show that RHO improves on state-of-the-art baselines by up to 20%, supporting both the dataset's value and the model's effectiveness.
Conclusion: Combining panoramas with OSM is a viable route to more robust cross-view geo-localization; the CV-RHO dataset and RHO model provide a foundation for this direction.
Abstract: Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address the panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance the localization accuracy. Extensive experiments prove the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain up to 20% compared with the state-of-the-art baselines. Project page: https://github.com/InSAI-Lab/RHO.
[244] When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
Chengyin Hu,Xuemeng Sun,Jiajun Han,Qike Zhang,Xiang Chen,Xin Wang,Yiwei Wei,Jiahua Long
Main category: cs.CV
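TL;DR: This paper proposes a parametric structural perturbation method, inspired by the mechanics of 3D fabric wrinkles, to evaluate and degrade the robustness of vision-language models (VLMs) to physically plausible non-rigid deformations such as surface wrinkles. The method builds multi-scale wrinkle fields, combines displacement-field distortion with surface-consistent appearance changes, and optimizes perturbations in a low-dimensional parameter space; it significantly degrades several state-of-the-art VLMs on zero-shot classification, image captioning, and visual question answering.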
Details
Motivation: VLMs perform well on cross-modal understanding tasks, but their robustness to physically plausible non-rigid deformations (such as wrinkles on flexible surfaces) is poorly understood, and systematic evaluation and attack methods are needed.
Method: A parametric structural perturbation method based on fabric wrinkle mechanics: construct multi-scale wrinkle displacement fields and integrate surface-consistent appearance variations; design a hierarchical fitness function in a low-dimensional parameter space and use an optimization-based search to balance visual naturalness and adversarial effectiveness.
Result: Perturbations optimized on a zero-shot classification proxy task transfer to generative tasks (image captioning and visual question answering), significantly degrading several state-of-the-art VLMs and consistently outperforming baselines.
Conclusion: VLMs are highly sensitive to physically realistic non-rigid deformations; the structured perturbation method exposes this robustness gap and suggests directions for modeling physical plausibility in cross-modal models.
Abstract: Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces, remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.
[245] RINO: Rotation-Invariant Non-Rigid Correspondences
Maolin Gao,Shao Jie Hu-Chen,Congyue Deng,Riccardo Marin,Leonidas Guibas,Daniel Cremers
Main category: cs.CV
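TL;DR: RINO is an unsupervised, rotation-invariant dense correspondence framework; its RINONet feature extractor unifies rigid and non-rigid shape matching without pre-alignment or handcrafted features and performs strongly across challenging scenarios.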
Details
Motivation: Existing deep learning methods rely on intermediate geometric features or handcrafted descriptors, which limits their effectiveness under non-isometric deformations, partial data, and non-manifold inputs.
Method: RINO is built around the RINONet feature extractor, which combines vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry.
Result: Reports unprecedented performance on challenging non-rigid matching tasks involving arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.
Conclusion: RINO provides unsupervised, end-to-end, data-driven dense correspondence, removing the dependence on pre-alignment and handcrafted features and improving matching robustness under complex geometric conditions.
Abstract: Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.
[246] GS3LAM: Gaussian Semantic Splatting SLAM
Linfei Li,Lin Zhang,Zhong Wang,Ying Shen
Main category: cs.CV
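TL;DR: This paper proposes GS3LAM, a framework that uses 3D Gaussian Splatting (3DGS) for real-time dense semantic SLAM over multimodal input (RGB, depth, semantics); Semantic Gaussian Field modeling, depth-adaptive scale regularization, and random sampling-based keyframe mapping improve tracking robustness, rendering quality, and semantic precision.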
Details
Motivation: Semantic SLAM systems with explicit representations are limited by resolution and cannot predict unknown areas, while implicit representations struggle to run in real time; a dense scene representation that is efficient, continuous, and scalable is needed.
Method: GS3LAM: 1) models the scene as a Semantic Gaussian Field (SG-Field) and jointly optimizes camera poses and the field; 2) introduces Depth-adaptive Scale Regularization (DSR) to resolve misalignment between scale-invariant Gaussians and geometric surfaces; 3) uses a Random Sampling-based Keyframe Mapping (RSKM) strategy to mitigate catastrophic forgetting.
Result: On benchmark datasets, GS3LAM improves tracking robustness, rendering quality, and semantic precision over state-of-the-art methods while running in real time.
Conclusion: 3D Gaussian Splatting is an efficient representation for real-time dense semantic SLAM; with multimodal joint optimization and targeted regularization and mapping strategies, GS3LAM addresses scale consistency, forgetting, and real-time constraints.
Abstract: Recently, the multi-modal fusion of RGB, depth, and semantics has shown great potential in dense Simultaneous Localization and Mapping (SLAM). However, a prerequisite for generating consistent semantic maps is the availability of dense, efficient, and scalable scene representations. Existing semantic SLAM systems based on explicit representations are often limited by resolution and an inability to predict unknown areas. Conversely, implicit representations typically rely on time-consuming ray tracing, failing to meet real-time requirements. Fortunately, 3D Gaussian Splatting (3DGS) has emerged as a promising representation that combines the efficiency of point-based methods with the continuity of geometric structures. To this end, we propose GS3LAM, a Gaussian Semantic Splatting SLAM framework that processes multimodal data to render consistent, dense semantic maps in real-time. GS3LAM models the scene as a Semantic Gaussian Field (SG-Field) and jointly optimizes camera poses and the field via multimodal error constraints. Furthermore, a Depth-adaptive Scale Regularization (DSR) scheme is introduced to resolve misalignments between scale-invariant Gaussians and geometric surfaces. To mitigate catastrophic forgetting, we propose a Random Sampling-based Keyframe Mapping (RSKM) strategy, which demonstrates superior performance over common local covisibility optimization methods. Extensive experiments on benchmark datasets show that GS3LAM achieves increased tracking robustness, superior rendering quality, and enhanced semantic precision compared to state-of-the-art methods.
Source code is available at https://github.com/lif314/GS3LAM.[247] Inference-time Trajectory Optimization for Manga Image Editing
Ryosuke Furuta
Main category: cs.CV
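TL;DR: Proposes an inference-time adaptation method that tailors a pretrained image editing model to each input manga image using only that image, improving performance on manga image editing.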
Details
Motivation: Existing pretrained image editing models are trained mainly on natural images and perform poorly on manga; re-training or fine-tuning large models is impractical because of both computational cost and copyright constraints. Method: Slightly correct the generation trajectory at inference time so that the model reconstructs the input manga image more faithfully under an empty prompt. Result: Experiments show that the method consistently outperforms existing baselines while adding only negligible computational overhead. Conclusion: Inference-time adaptation offers an efficient, low-overhead way to address the weak generalization of pretrained models to specific domains such as manga. Abstract: We present an inference-time adaptation method that tailors a pretrained image editing model to each input manga image using only the input image itself. Despite recent progress in pretrained image editing, such models often underperform on manga because they are trained predominantly on natural-image data. Re-training or fine-tuning large-scale models on manga is, however, generally impractical due to both computational cost and copyright constraints. To address this issue, our method slightly corrects the generation trajectory at inference time so that the input image can be reconstructed more faithfully under an empty prompt. Experimental results show that our method consistently outperforms existing baselines while incurring only negligible computational overhead.[248] Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images
Laura Rayón Ropero,Jasper De Laet,Filip Lemic,Pau Sabater Nácher,Nabeel Nisar Bhat,Sergi Abadal,Jeroen Famaey,Eduard Alarcón,Xavier Costa-Pérez
Main category: cs.CV
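TL;DR: This paper proposes a privacy-preserving paradigm for 3D facial emotion recognition (FER) based on High-Frequency Wireless Sensing (HFWS), in which wearable devices generate facial point clouds, and builds AffectNet3D, a 3D version of the AffectNet database. After fine-tuning on BU-3DFE the model exceeds 70% accuracy, and under simulated occlusion AffectNet3D-pretrained models still outperform models trained only on BU-3DFE, supporting feasibility in continuous, privacy-sensitive settings.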
Details
Motivation: Existing deep-learning FER methods based on 2D images carry privacy risks, struggle to satisfy increasingly strict data protection regulations, and are unsuitable for continuous real-time monitoring. 3D point-cloud sensing such as HFWS is privacy-friendly and suited to wearable deployment, but the lack of labeled 3D FER datasets holds it back. Method: A FLAME-based synthesis method converts the public 2D emotion dataset AffectNet into 3D point clouds, producing AffectNet3D; a point-cloud refinement pipeline isolates the facial region, and PointNet++ is trained and fine-tuned on the refined data; masking parts of the point clouds simulates constrained wearable sensing to assess robustness. Result: Fine-tuning on BU-3DFE yields classification accuracy above 70%, comparable to oracle-level performance; models pretrained on AffectNet3D and fine-tuned with only 25% of BU-3DFE outperform models trained solely on BU-3DFE; the masking experiments indicate robustness to partially missing point clouds. Conclusion: Combining synthetic data generation with point-cloud modeling, the HFWS-driven 3D FER framework eases the scarcity of 3D FER data and supports the feasibility of privacy-aware, continuous monitoring, pointing to a path for next-generation wearable affective computing systems. Abstract: Facial Emotion Recognition is a critical research area within Affective Computing due to its wide-ranging applications in Human Computer Interaction, mental health assessment and fatigue monitoring. Current FER methods predominantly rely on Deep Learning techniques trained on 2D image data, which pose significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, through the generation of detailed 3D facial pointclouds via on-person sensors embedded in wearables. We present arguments supporting the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly stringent data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. Towards addressing this issue, we introduce a FLAME-based method to generate 3D facial pointclouds from existing public 2D datasets. Using this approach, we create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy exceeding 70%, comparable to oracle-level performance.
To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experimental results show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.[249] Diversity Matters: Dataset Diversification and Dual-Branch Network for Generalized AI-Generated Image Detection
Nusrat Tasnim,Kutub Uddin,Khalid Malik
Main category: cs.CV
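TL;DR: This paper proposes Diversity Matters, a framework that improves the generalization and robustness of AI-generated image detection by emphasizing data diversity and feature-domain complementarity. It combines pixel-domain and frequency-domain CLIP features and introduces a feature-domain similarity filtering mechanism to diversify the training set.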
Details
Motivation: AI-generated images (from GANs, diffusion models, and other techniques) are proliferating and raise risks of misinformation, copyright violation, and digital insecurity, yet existing detectors generalize poorly across diverse generative models and data distributions. Method: A feature-domain similarity filtering mechanism discards highly similar samples across both inter-class and intra-class distributions to diversify the training data; a dual-branch network fuses CLIP features from the pixel domain and the frequency domain to jointly model semantic and structural cues. Result: Experiments on multiple benchmark datasets show significantly better cross-model and cross-dataset detection performance and greater robustness to unseen generative models and adversarial conditions. Conclusion: Data and feature diversity are critical to building reliable, robust detectors of AI-generated images, and Diversity Matters offers a new approach to the rapidly evolving landscape of synthetic content. Abstract: The rapid proliferation of AI-generated images, powered by generative adversarial networks (GANs), diffusion models, and other synthesis techniques, has raised serious concerns about misinformation, copyright violations, and digital security. However, detecting such images in a generalized and robust manner remains a major challenge due to the vast diversity of generative models and data distributions. In this work, we present Diversity Matters, a novel framework that emphasizes data diversity and feature domain complementarity for AI-generated image detection. The proposed method introduces a feature-domain similarity filtering mechanism that discards redundant or highly similar samples across both inter-class and intra-class distributions, ensuring a more diverse and representative training set. Furthermore, we propose a dual-branch network that combines CLIP features from the pixel domain and the frequency domain to jointly capture semantic and structural cues, leading to improved generalization against unseen generative models and adversarial conditions. Extensive experiments on benchmark datasets demonstrate that the proposed approach significantly improves cross-model and cross-dataset performance compared to existing methods. Diversity Matters highlights the critical role of data and feature diversity in building reliable and robust detectors against the rapidly evolving landscape of synthetic content.[250] Tracking without Seeing: Geospatial Inference using Encrypted Traffic from Distributed Nodes
Sadik Yagiz Yetim,Gaofeng Dong,Isaac-Neil Zanoria,Ronit Barman,Maggie Wigness,Tarek Abdelzaher,Mani Srivastava,Suhas Diggavi
Main category: cs.CV
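TL;DR: GraySense is a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video transmission traffic (such as packet sizes) without access to the raw video streams. It combines indirect packet-level information with optional direct sensor data and achieves a tracking error of about 2.33 meters without touching the raw signals.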
Details
Motivation: Observation of dynamic environments traditionally relies on fusing raw signals from multiple sensors; this work explores geospatial inference from encrypted network-layer packet information alone (such as packet sizes) and studies how fusing it with directly available sensor data extends inference capability. Method: The GraySense framework has two stages: a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted traffic, and a Tracker module, based on a Transformer encoder with a recurrent state, that fuses indirect packet-level inputs with optional direct camera inputs to estimate the object's position. Experiments use videos from the CARLA simulator and emulated networks. Result: Without raw signal access, GraySense achieves a tracking error of 2.33 meters (Euclidean distance), which is within the dimensions of the tracked objects (4.61m x 1.93m); to the authors' knowledge, this capability has not been demonstrated before. Conclusion: Encrypted network traffic carries enough latent information about scene dynamics to support accurate geospatial sensing, which expands the use of latent signals for sensing. Abstract: Accurate observation of dynamic environments traditionally relies on synthesizing raw, signal-level information from multiple distributed sensors. This work investigates an alternative approach: performing geospatial inference using only encrypted packet-level information, without access to the raw sensory data. We further explore how this indirect information can be fused with directly available sensory data to extend overall inference capabilities. We introduce GraySense, a learning-based framework that performs geospatial object tracking by analyzing encrypted wireless video transmission traffic, such as packet sizes, from cameras with inaccessible streams. GraySense leverages the inherent relationship between scene dynamics and transmitted packet sizes to infer object motion. The framework consists of two stages: (1) a Packet Grouping module that identifies frame boundaries and estimates frame sizes from encrypted network traffic, and (2) a Tracker module, based on a Transformer encoder with a recurrent state, which fuses indirect packet-based inputs with optional direct camera-based inputs to estimate the object's position. Extensive experiments with realistic videos from the CARLA simulator and emulated networks under varying conditions show that GraySense achieves 2.33 meters tracking error (Euclidean distance) without raw signal access, within the dimensions of tracked objects (4.61m x 1.93m).
To our knowledge, this capability has not been previously demonstrated, expanding the use of latent signals for sensing.[251] MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
Shijian Wang,Jiarui Jin,Runhao Fu,Zexuan Yan,Xingjian Wang,Mengkang Hu,Eric Wang,Xiaoxi Li,Kangning Zhang,Li Yao,Wenxiang Jiao,Xuelian Cheng,Yuan Lu,Zongyuan Ge
Main category: cs.CV
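TL;DR: This paper proposes MuSEAgent, a multimodal reasoning agent built on a stateful experience learning paradigm: interaction data is abstracted into atomic decision experiences and stored in a quality-filtered experience bank, and policy-driven experience retrieval at inference time improves multimodal decision-making.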
Details
Motivation: Existing research agents mostly rely on trajectory-level retrieval and make poor use of the fine-grained decision experience contained in past interactions; this work models stateful experiences to improve adaptive reasoning and decision-making in multimodal environments. Method: A stateful experience learning paradigm uses hindsight reasoning to abstract interaction data into atomic decision experiences; these are organized into a quality-filtered experience bank; complementary wide- and deep-search strategies provide adaptive experience retrieval, dynamically fetching multimodal guidance across diverse semantic viewpoints. Result: MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on fine-grained visual perception and complex multimodal reasoning tasks. Conclusion: Stateful experience modeling strengthens multimodal agent reasoning and gives research agents a more efficient, reusable way to exploit experience. Abstract: Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.[252] Towards Context-Aware Image Anonymization with Multi-Agent Reasoning
Robert Aufschläger,Jakob Folz,Gautam Savaliya,Manjitha D Vidanalage,Michael Heigl,Martin Schramm
Main category: cs.CV
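TL;DR: This paper proposes CAIAMAR, a context-aware method for anonymizing street-level imagery that combines multi-agent reasoning with diffusion models, balancing privacy protection, image quality, and regulatory compliance.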
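The abstract for this entry mentions IoU-based deduplication with a 30% threshold. As a rough illustration of what such a step could look like, the sketch below greedily drops any candidate whose IoU with an already kept candidate reaches the threshold. It assumes axis-aligned boxes in (x1, y1, x2, y2) form; the function names `box_iou` and `deduplicate` are hypothetical, and the paper may operate on masks or use a different rule.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def deduplicate(boxes, threshold=0.30):
    """Keep a box only if its IoU with every already kept box is below threshold.

    The 0.30 default mirrors the 30% figure in the abstract; the greedy,
    first-come ordering is an assumption made for this sketch.
    """
    kept = []
    for box in boxes:
        if all(box_iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept


candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
unique = deduplicate(candidates)  # the second box overlaps the first heavily
```

In this toy input the second box is discarded and the other two survive, so each region would be processed once.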
Details
Motivation: Street-level imagery contains context-dependent personally identifiable information (PII); existing methods either over-process images or miss indirect identifiers, and API-based solutions compromise data sovereignty. Method: CAIAMAR coordinates three agents in a Plan-Do-Check-Act (PDCA) cycle with round-robin speaker selection; predefined rules handle high-confidence PII, while multi-agent reasoning identifies indirect identifiers; spatially filtered coarse-to-fine detection combines a scout-and-zoom strategy, open-vocabulary segmentation, and IoU-based deduplication (30% threshold); modality-specific diffusion guidance with appearance decorrelation lowers re-identification risk. Result: On CUHK03-NP, person re-identification risk falls by 73% (R1 from 62.4% to 16.9%); on CityScapes, KID is 0.001 and FID is 9.1, significantly better than existing methods; the workflow detects non-direct PII across object categories and preserves downstream semantic segmentation performance; it runs entirely on-premise with open-source models and produces auditable logs. Conclusion: CAIAMAR delivers context-aware, high-quality, interpretable anonymization of street-level imagery that supports GDPR transparency requirements and keeps data under the operator's control, offering a new paradigm for privacy-enhancing vision systems. Abstract: Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework CAIAMAR (Context-Aware Image Anonymization with Multi-Agent Reasoning) for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection where a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and IoU-based deduplication (30% threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve KID: 0.001, and FID: 9.1, significantly outperforming existing anonymization methods.
The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting EU's GDPR transparency requirements while flagging failed cases for human review.[253] Benchmarking Multi-View BEV Object Detection with Mixed Pinhole and Fisheye Cameras
Xiangzhong Liu,Hao Shen
Main category: cs.CV
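TL;DR: This paper introduces FishBEVOD, a BEV 3D object detection benchmark for mixed pinhole and fisheye camera configurations, built from real data by converting KITTI-360 into nuScenes format. It studies three adaptation strategies (image rectification, distortion-aware view transformation modules based on the MEI camera model, and polar coordinate representations) and finds that projection-free architectures such as PETR are more robust to fisheye distortion.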
Details
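Motivation: Existing BEV 3D detection models are designed mainly for pinhole cameras and degrade markedly under the radial distortion of fisheye cameras, while real autonomous driving systems increasingly use mixed camera configurations for full-view perception; detection methods and evaluation benchmarks that handle fisheye distortion are needed. Method: (1) Build FishBEVOD, the first BEV 3D detection benchmark with real fisheye and pinhole images (KITTI-360 converted to nuScenes format); (2) study three adaptation strategies: image rectification for zero-shot evaluation and fine-tuning, distortion-aware view transformation modules (VTMs) based on the MEI camera model, and polar coordinate feature representations that align better with radial distortion; (3) evaluate them systematically on three mainstream architectures, BEVFormer, BEVDet, and PETR. Result: Projection-free architectures such as PETR are inherently more robust and perform better under fisheye distortion; rectification suits zero-shot transfer but depends on post-processing; polar representations improve how well features adapt to distortion; all strategies confirm the usefulness of the FishBEVOD benchmark. Conclusion: The work fills a gap in benchmarks and methods for BEV 3D detection with mixed fisheye and pinhole cameras, shows that architecture choice (in particular whether explicit projection is required) is key to distortion robustness, and provides practical guidelines for designing low-cost, robust 3D perception systems. Abstract: Modern autonomous driving systems increasingly rely on mixed camera configurations with pinhole and fisheye cameras for full view perception. However, Bird's-Eye View (BEV) 3D object detection models are predominantly designed for pinhole cameras, leading to performance degradation under fisheye distortion. To bridge this gap, we introduce a multi-view BEV detection benchmark with mixed cameras by converting KITTI-360 into nuScenes format. Our study encompasses three adaptations: rectification for zero-shot evaluation and fine-tuning of nuScenes-trained models, distortion-aware view transformation modules (VTMs) via the MEI camera model, and polar coordinate representations to better align with radial distortion. We systematically evaluate three representative BEV architectures, BEVFormer, BEVDet and PETR, across these strategies. We demonstrate that projection-free architectures are inherently more robust and effective against fisheye distortion than other VTMs. This work establishes the first real-data 3D detection benchmark with fisheye and pinhole images and provides systematic adaptation and practical guidelines for designing robust and cost-effective 3D perception systems. The code is available at https://github.com/CesarLiu/FishBEVOD.git.[254] 3-D Representations for Hyperspectral Flame Tomography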
Nicolas Tricard,Zituo Chen,Sili Deng
Main category: cs.CV
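TL;DR: This paper compares voxel-grid and continuous neural-network representations for flame tomography reconstruction and finds that a voxel grid with a total-variation regularizer is better in accuracy, memory use, and runtime.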
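For readers unfamiliar with the regularizer named in this entry, the following is a minimal sketch of one common form of total variation on a voxel grid: the sum of absolute differences between neighboring voxels along each axis (the anisotropic variant). The abstract does not state which TV variant or weighting the authors use, so this choice and the name `total_variation` are assumptions for illustration only.

```python
def total_variation(grid):
    """Anisotropic total variation of a 3-D grid stored as nested lists [z][y][x].

    Sums |v(neighbor) - v| over forward neighbors along z, y, and x.
    This is a generic definition, not necessarily the paper's regularizer.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    tv = 0.0
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                v = grid[z][y][x]
                if z + 1 < nz:
                    tv += abs(grid[z + 1][y][x] - v)
                if y + 1 < ny:
                    tv += abs(grid[z][y + 1][x] - v)
                if x + 1 < nx:
                    tv += abs(grid[z][y][x + 1] - v)
    return tv


# A constant field has zero TV; a single step along z is penalized.
flat = [[[300.0, 300.0], [300.0, 300.0]], [[300.0, 300.0], [300.0, 300.0]]]
step = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
tv_flat = total_variation(flat)
tv_step = total_variation(step)
```

Added to a data-fitting loss with some weight, a penalty of this kind favors piecewise-smooth reconstructions, which is the usual motivation for using it in tomography.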
Details
Motivation: Neural-network flame representations have shown advantages in reconstruction quality, but a rigorous quantitative comparison against a voxel-grid representation under the same algorithm has been missing. Method: Tomographic reconstruction of a simulated pool fire is performed with a voxel-grid representation under varying regularizers and with a continuous neural representation; ray tracing solves the radiative transfer equation, and the result is combined with an instrument lineshape function to compute the spectral intensity received by hyperspectral infrared cameras. Result: The voxel grid with a total-variation regularizer outperforms the neural representation in reconstruction accuracy, memory consumption, and runtime. Conclusion: In the current simulated setting, a well-regularized voxel grid remains the more efficient and reliable option for 3-D flame reconstruction; future work will extend to more representations and to real experimental configurations. Abstract: Flame tomography is a compelling approach for extracting large amounts of data from experiments via 3-D thermochemical reconstruction. Recent efforts employing neural-network flame representations have suggested improved reconstruction quality compared with classical tomography approaches, but a rigorous quantitative comparison with the same algorithm using a voxel-grid representation has not been conducted. Here, we compare a classical voxel-grid representation with varying regularizers to a continuous neural representation for tomographic reconstruction of a simulated pool fire. The representations are constructed to give temperature and composition as a function of location, and a subsequent ray-tracing step is used to solve the radiative transfer equation to determine the spectral intensity incident on hyperspectral infrared cameras, which is then convolved with an instrument lineshape function. We demonstrate that the voxel-grid approach with a total-variation regularizer reproduces the ground-truth synthetic flame with the highest accuracy for reduced memory intensity and runtime. Future work will explore more representations and experimental configurations.[255] Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning
Ming Liu,Yunbei Zhang,Shilong Liu,Liwen Wang,Wensheng Zhang
Main category: cs.CV
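TL;DR: This paper studies how verifiable reward design can improve the generalization of video generation models on spatial reasoning and multi-step planning tasks. Verifiable rewards grounded in objective metrics markedly improve RL fine-tuning and avoid the degenerate solutions caused by multimodal reward models.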
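This entry adapts Group Relative Policy Optimization (GRPO). The core idea of GRPO is that each sampled output is scored relative to the other outputs in its own group, which removes the need for a learned value baseline. The sketch below shows that normalization step only, under stated assumptions: the helper name `group_relative_advantages`, the use of the population standard deviation, and the small `eps` are choices made for this illustration, and the paper's reward functions and flow-model policy update are not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of scalar rewards to zero mean and unit scale.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are assumptions of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt: two solved the maze (reward 1), two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout earns the same reward, no rollout is preferred.
tied = group_relative_advantages([0.5, 0.5, 0.5])
```

A verifiable reward, such as an exact-match check on a maze path, slots in as the source of the `rewards` list; when all rollouts tie, the advantages vanish and the group contributes no learning signal.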
Details
Motivation: Video generation models perform poorly on tasks that require spatial reasoning and multi-step planning; the effectiveness of reinforcement learning depends heavily on reward design, yet this question has received little systematic study. Method: Group Relative Policy Optimization (GRPO) is adapted to flow-based video models, which are trained on maze-solving and robotic navigation tasks; a multi-component trajectory reward is proposed for structured game environments and an embedding-level verifiable reward for robotic navigation, replacing multimodal reward models that fail in this setting. Result: On complex 3D mazes, exact match accuracy improves by 29.1% over the supervised fine-tuning baseline, and on trap-avoidance tasks by 51.4%; experiments show that verifiable rewards are critical for stable training, whereas multimodal reward models tend to produce degenerate solutions. Conclusion: Verifiable reward design is a key enabler of robust video reasoning. Abstract: Video generation models produce visually coherent content but struggle with tasks requiring spatial reasoning and multi-step planning. Reinforcement learning (RL) offers a path to improve generalization, but its effectiveness in video reasoning hinges on reward design, a challenge that has received little systematic study. We investigate this problem by adapting Group Relative Policy Optimization (GRPO) to flow-based video models and training them on maze-solving and robotic navigation tasks. We first show that multimodal reward models fail catastrophically in this setting. To address this, we design verifiable reward functions grounded in objective task metrics. For structured game environments, we introduce a multi-component trajectory reward. For robotic navigation, we propose an embedding-level verifiable reward. Our experiments show that RL fine-tuning with verifiable rewards improves generalization. For example, on complex 3D mazes, our model improves exact match accuracy by 29.1% over the SFT baseline, and on trap-avoidance tasks by 51.4%. Our systematic reward analysis reveals that verifiable rewards are critical for stable training, while multimodal reward models could lead to degenerate solutions. These findings establish verifiable reward design as a key enabler for robust video reasoning. Code will be publicly available.[256] Poppy: Polarization-based Plug-and-Play Guidance for Enhancing Monocular Normal Estimation
Irene Kim,Sai Tanmay Reddy Chakkera,Alexandros Graikos,Dimitris Samaras,Akshat Dave
Main category: cs.CV
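TL;DR: Poppy is a training-free framework that uses single-shot polarization measurements at test time to refine the surface normals produced by any frozen RGB backbone. A differentiable rendering layer converts the refined normals into polarization predictions and compares them with the observed signal, markedly improving normal estimation on challenging reflective, textureless, and dark surfaces.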
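The results for this entry are reported as reductions in mean angular error. For reference, the sketch below computes that metric in its usual form: the angle between each predicted and ground-truth normal, averaged over pixels. The function names are hypothetical, and details such as masking invalid pixels, which a real benchmark would apply, are omitted.

```python
import math


def angular_error_deg(n_pred, n_true):
    """Angle in degrees between two 3-D vectors (they need not be unit length)."""
    dot = sum(p * t for p, t in zip(n_pred, n_true))
    norm = math.sqrt(sum(p * p for p in n_pred)) * math.sqrt(sum(t * t for t in n_true))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding drift
    return math.degrees(math.acos(cos))


def mean_angular_error_deg(preds, truths):
    """Average per-pixel angular error over paired lists of normals."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, truths)]
    return sum(errors) / len(errors)


preds = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
truths = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
mae = mean_angular_error_deg(preds, truths)  # one exact match, one 90 degree miss
```

A relative reduction such as the 23-26% quoted in the abstract compares this quantity before and after refinement.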
Details
Motivation: Monocular surface normal estimators perform poorly on reflective, textureless, and dark surfaces; polarization provides a physical cue to surface orientation that is independent of texture and albedo, but existing polarization methods depend on multi-view capture or specialized training data, which limits generalization. Method: At test time, Poppy uses single-shot polarization measurements to optimize per-pixel offsets to the input RGB image and the output normals of a frozen RGB backbone, jointly with a learned reflectance decomposition; a differentiable rendering layer converts the refined normals into polarization predictions, and the mismatch with the measured polarization signal guides the optimization. Result: Across seven benchmarks and three backbone architectures (diffusion, flow, and feed-forward), Poppy reduces mean angular error by 23-26% on synthetic data and 6-16% on real data. Conclusion: Introducing polarization cues at test time improves frozen RGB normal estimators on hard surfaces without retraining, giving strong generalization and practicality. Abstract: Monocular surface normal estimators trained on large-scale RGB-normal data often perform poorly in the edge cases of reflective, textureless, and dark surfaces. Polarization encodes surface orientation independently of texture and albedo, offering a physics-based complement for these cases. Existing polarization methods, however, require multi-view capture or specialized training data, limiting generalization. We introduce Poppy, a training-free framework that refines normals from any frozen RGB backbone using single-shot polarization measurements at test time. Keeping backbone weights frozen, Poppy optimizes per-pixel offsets to the input RGB and output normal along with a learned reflectance decomposition. A differentiable rendering layer converts the refined normals into polarization predictions and penalizes mismatches with the observed signal. Across seven benchmarks and three backbone architectures (diffusion, flow, and feed-forward), Poppy reduces mean angular error by 23-26% on synthetic data and 6-16% on real data. These results show that guiding learned RGB-based normal estimators with polarization cues at test time refines normals on challenging surfaces without retraining.[257] SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation
Tripti Shukla,Zsolt Kira
Main category: cs.CV
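TL;DR: This paper proposes SAGE, a framework that dynamically modulates self-attention during decoding, using attention sink tokens to monitor and mitigate hallucinations in large vision-language models (VLMs) in real time, without retraining or changes to the model architecture.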
Details
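Motivation: Existing methods mostly filter after generation or depend on additional training or external verification, so they cannot intervene while hallucinations arise during decoding; hallucinations are closely tied to attention concentrating on sink tokens with little semantic content (such as punctuation), which calls for an interpretable grounding mechanism that can act at decoding time. Method: SAGE uses attention sink tokens as trigger anchors; at each sink it extracts semantic concepts from the sequence generated so far, estimates their visual grounding from self-attention maps and gradient-based attribution, and measures their spatial agreement; based on this signal it adaptively sharpens or broadens the self-attention distribution to reinforce reliable regions and suppress unreliable ones. Result: On hallucination benchmarks including MSCOCO and AMBER, SAGE achieves an average relative improvement of 10.65% (MSCOCO) and 7.19% (AMBER) across diverse VLM architectures, outperforming existing decoding strategies while preserving descriptive coverage and requiring no retraining or architectural changes. Conclusion: By bringing sink awareness and real-time grounding assessment into decoding, SAGE provides a lightweight, general, plug-and-play approach to hallucination mitigation for controllable and trustworthy multimodal generation. Abstract: Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process when hallucinations arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during generation. Hallucinations are strongly correlated with attention sink tokens: punctuation or function tokens that accumulate disproportionate attention despite carrying limited semantic content. SAGE leverages these tokens as anchors to monitor grounding reliability in real time. At each sink trigger, the method extracts semantic concepts from the generated sequence, estimates their visual grounding using both self-attention maps and gradient-based attribution, and measures their spatial agreement. Based on this signal, self-attention distributions are adaptively sharpened or broadened to reinforce grounded regions or suppress unreliable ones. Extensive experiments across diverse hallucination benchmarks demonstrate that SAGE consistently outperforms existing decoding strategies, achieving substantial reductions in hallucination while preserving descriptive coverage, without requiring model retraining or architectural modifications.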
Our method achieves an average relative improvement of 10.65% on MSCOCO and 7.19% on AMBER across diverse VLM architectures, demonstrating consistent gains in hallucination mitigation.[258] Rényi Entropy: A New Token Pruning Metric for Vision Transformers
Wei-Yuan Su,Ruijie Zhang,Zheng Zhang
Main category: cs.CV
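TL;DR: This paper proposes Col-Ln, a training-free token importance metric derived from Rényi entropy that reliably identifies informative tokens from the first layer of a ViT, making early-layer token pruning more accurate and substantially speeding up inference on high-resolution inputs.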
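The Col-Ln metric is said to be derived from Rényi entropy, but the abstract does not give its formula. The sketch below therefore shows only the standard definition of Rényi entropy of order alpha for a discrete distribution, H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha), with the alpha = 1 case handled as the Shannon limit. The name `renyi_entropy` is hypothetical, and how the paper maps attention values to a per-token score is not reproduced here.

```python
import math


def renyi_entropy(probs, alpha):
    """Rényi entropy (in nats) of a discrete distribution for order alpha >= 0.

    Input weights are normalized to sum to 1 and zero-mass entries are dropped.
    As alpha approaches 1 the value converges to the Shannon entropy.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    total = sum(probs)
    p = [x / total for x in probs if x > 0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p)
    return math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)


uniform = renyi_entropy([0.25, 0.25, 0.25, 0.25], alpha=2.0)  # equals log(4)
peaked = renyi_entropy([1.0, 0.0, 0.0, 0.0], alpha=2.0)       # equals 0
```

Low entropy indicates mass concentrated on a few entries and high entropy indicates a spread-out distribution, which is the kind of signal an importance score could build on.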
Details
Motivation: Existing importance estimates based on the [CLS] token are unreliable in early ViT layers, where semantic representations are still immature, so early pruning is inaccurate and loses information. Method: Col-Ln, a training-free token importance metric based on Rényi entropy, identifies key tokens accurately from the first layer and so supports more reliable early token pruning. Result: Extensive experiments on ViTs and Large Vision-Language Models (LVLMs) show that the method consistently outperforms state-of-the-art pruning methods across multiple benchmarks. Conclusion: Col-Ln is a simple, efficient, and well-generalizing training-free token pruning strategy that eases the high computational cost of ViTs, especially for high-resolution vision tasks. Abstract: Vision Transformers (ViTs) achieve state-of-the-art performance but suffer from the $O(N^2)$ complexity of self-attention, making inference costly for high-resolution inputs. To address this bottleneck, token pruning has emerged as a critical technique to accelerate inference. Most existing methods rely on the [CLS] token to estimate patch importance. However, we argue that the [CLS] token can be unreliable in early layers where semantic representations are still immature. As a result, pruning in the early layer often leads to inaccurate importance estimation and unnecessary information loss. In this work, we propose a training-free token importance metric, namely Col-Ln, derived from Rényi entropy, which identifies informative tokens from the first layer of the network and thereby enables more reliable pruning in token reduction. Extensive experiments on ViTs and Large Vision-Language Models (LVLMs) demonstrate that our approach consistently outperforms state-of-the-art pruning methods across diverse benchmarks.[259] BINO: Encoder Centric Self Supervised Stereo With Native Pair Input
Haokun Zhou
Main category: cs.CV
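TL;DR: BINO is a compact binocular encoder that fuses the rectified image pair at the input stage and adds a row-aware positional encoding and a dedicated distillation strategy. It achieves strong stereo matching performance under low-resource pretraining, suggesting that cross-view reasoning can be learned inside a lightweight encoder without a separate linkage module.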
Details
Motivation: Existing self-supervised vision models transfer well but are not designed for fine-grained cross-view correspondence in stereo; geometry-focused methods often rely on a binocular decoder or an explicit linkage module, and BINO asks whether strong binocular structure can instead be learned inside a compact encoder. Method: BINO fuses the rectified stereo pair at the input stage, forms stereo micro-cell tokens, and uses a row-aware patch-phase positional encoding; training uses one-view masked-token-only distillation together with occlusion and view-specific appearance mismatch. Result: In a strict low-resource setting with pretraining only on KITTI Object, BINO, used as a frozen feature extractor with no linkage module, outperforms all compared baselines on proxy dense stereo, hard negative retrieval, and KITTI Stereo 2012 disparity; with the same lightweight stereo head, it stays close to CroCo v2 with a much smaller encoder; transfer experiments on KITTI Stereo 2015 show the same trend. Conclusion: Much of the cross-view reasoning usually attributed to a separate linkage module can be learned inside a compact, reusable encoder, pointing to a new approach to efficient binocular representation learning. Abstract: Stereo needs features that preserve fine cross view correspondence rather than only semantic similarity. Recent self supervised vision models transfer well, but they are not built for this goal, and geometry focused methods often rely on a binocular decoder or another explicit linkage module during pretraining. BINO asks whether strong binocular structure can instead be learned inside a compact encoder. It does this by fusing the rectified pair at the input stage, forming stereo micro cell tokens, and using a row aware patch phase positional encoding. Training uses one view masked token only distillation together with occlusion and view specific appearance mismatch. In a strict low resource setting with pretraining only on KITTI object, BINO gives the best frozen descriptor results under a no linkage probe among all compared baselines on proxy dense stereo, hard negative retrieval, and KITTI Stereo 2012 disparity. With the same lightweight stereo head for every encoder, it stays near CroCo v2 while using a much smaller encoder. Supplementary transfer experiments on KITTI Stereo 2015 show the same qualitative trend. These results suggest that much of the cross view reasoning often assigned to a separate linkage module can be learned inside a compact and reusable encoder.[260] Spatial Orthogonal Refinement for Robust RGB-Event Visual Object Tracking
Dexing Huang,Shiao Wang,Fan Zhang,Xiao Wang
Main category: cs.CV
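TL;DR: This paper proposes SOR-Track, a framework that uses the high temporal resolution of event cameras to compensate for RGB motion blur under high-speed motion. A Spatial Orthogonal Refinement (SOR) module dynamically extracts directional structural information from event streams and uses it to modulate and rectify RGB textures, giving robust multimodal tracking.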
Details
Motivation: Conventional RGB sensors suffer severe motion blur under high-speed motion, and existing RGB-Event fusion methods do not explicitly use the directional geometric priors encoded in event streams to rectify degraded RGB features. Method: A streamlined RGB-Event tracking framework based on Spatial Orthogonal Refinement (SOR); the SOR module applies a set of orthogonal directional filters, dynamically guided by local motion orientation, to extract sharp, motion-consistent structural responses from event streams, and uses them as geometric anchors to modulate and refine RGB textures through an asymmetric structural modulation mechanism. Result: Experiments on the large-scale FE108 benchmark show that SOR-Track consistently outperforms existing fusion-based trackers under motion blur and low-light conditions. Conclusion: SOR-Track offers a principled, physically interpretable approach to multimodal feature alignment and texture rectification that improves tracking robustness while keeping the design simple. Abstract: Robust visual object tracking (VOT) remains challenging in high-speed motion scenarios, where conventional RGB sensors suffer from severe motion blur and performance degradation. Event cameras, with microsecond temporal resolution and high dynamic range, provide complementary structural cues that can potentially compensate for these limitations. However, existing RGB-Event fusion methods typically treat event data as dense intensity representations and adopt black-box fusion strategies, failing to explicitly leverage the directional geometric priors inherently encoded in event streams to rectify degraded RGB features. To address this limitation, we propose SOR-Track, a streamlined framework for robust RGB-Event tracking based on Spatial Orthogonal Refinement (SOR). The core SOR module employs a set of orthogonal directional filters that are dynamically guided by local motion orientations to extract sharp and motion-consistent structural responses from event streams. These responses serve as geometric anchors to modulate and refine aliased RGB textures through an asymmetric structural modulation mechanism, thereby explicitly bridging structural discrepancies between two modalities. Extensive experiments on the large-scale FE108 benchmark demonstrate that SOR-Track consistently outperforms existing fusion-based trackers, particularly under motion blur and low-light conditions. Despite its simplicity, the proposed method offers a principled and physics-grounded approach to multi-modal feature alignment and texture rectification.
The source code of this paper will be released on https://github.com/Event-AHU/OpenEvTracking[261] FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation
Liuzhou Zhang,Zeyu Zhang,Biao Wu,Luyao Tang,Zirui Song,Hongyang He,Renda Han,Guangzhen Yao,Huacan Wang,Ronghao Chen,Xiuying Chen,Guan Huang,Zheng Zhu
Main category: cs.CV
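TL;DR: This paper proposes a pose-free framework for real-time sign language video generation that uses a diffusion model to generate sign language videos directly from text. A Trainable Sliding Tile Attention (T-STA) mechanism improves inference efficiency, giving a 3.07x speedup while maintaining quality.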
Details
Motivation: Existing sign language video generation models depend on complex intermediate representations such as poses, which limits flexibility and efficiency and makes real-time, lightweight deployment difficult. Method: A pose-free generative framework built on a diffusion model, together with a Trainable Sliding Tile Attention (T-STA) mechanism that integrates learnable sparsity into both training and inference and exploits spatio-temporal locality to accelerate inference. Result: Video generation is 3.07x faster with no loss of quality; real-time deployment becomes feasible; the code has been released. Conclusion: The work offers a new approach to high-quality, real-time, pose-free sign language synthesis and may advance inclusive communication tools. Abstract: Sign language plays a crucial role in bridging communication gaps between the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model based on a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free sparsity approaches, T-STA integrates trainable sparsity into both training and inference, ensuring consistency and eliminating the train-test gap. This approach significantly reduces computational overhead while maintaining high generation quality, making real-time deployment feasible. Our method increases video generation speed by 3.07x without compromising video quality. Our contributions open new avenues for real-time, high-quality, pose-free sign language synthesis, with potential applications in inclusive communication tools for diverse communities. Code: https://github.com/AIGeeksGroup/FlashSign.[262] ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments
Pragat Wagle,Zheng Chen,Lantao Liu
Main category: cs.CV
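TL;DR: This paper presents ForestSim, a high-fidelity synthetic dataset for semantic segmentation in unstructured environments such as forests. It contains 2094 images across 25 environments with pixel-level annotations for 20 classes and supports research on perception systems for intelligent off-road vehicles.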
Details
Motivation: Existing semantic segmentation datasets mostly target structured urban roads, while extremely unstructured wild environments such as forests lack high-quality, low-cost pixel-level annotations, which holds back perception systems for forestry automation, agricultural robotics, disaster response, and related fields. Method: A high-fidelity simulation environment built with Unreal Engine and Microsoft AirSim generates 2094 synthetic images covering multiple seasons, terrain types, and foliage densities, with pixel-accurate labels for 20 classes relevant to autonomous navigation. Result: Benchmarks with several state-of-the-art semantic segmentation models show strong performance on ForestSim despite the inherent challenges of unstructured scenes. Conclusion: ForestSim provides a scalable, accessible synthetic data foundation for perception research in unstructured off-road environments and supports the development of intelligent off-road vehicles. Abstract: Robust scene understanding is essential for intelligent vehicles operating in natural, unstructured environments. While semantic segmentation datasets for structured urban driving are abundant, the datasets for extremely unstructured wild environments remain scarce due to the difficulty and cost of generating pixel-accurate annotations. These limitations hinder the development of perception systems needed for intelligent ground vehicles tasked with forestry automation, agricultural robotics, disaster response, and all-terrain mobility. To address this gap, we present ForestSim, a high-fidelity synthetic dataset designed for training and evaluating semantic segmentation models for intelligent vehicles in forested off-road and no-road environments. ForestSim contains 2094 photorealistic images across 25 diverse environments, covering multiple seasons, terrain types, and foliage densities. Using Unreal Engine environments integrated with Microsoft AirSim, we generate consistent, pixel-accurate labels across 20 classes relevant to autonomous navigation. We benchmark ForestSim using state-of-the-art architectures and report strong performance despite the inherent challenges of unstructured scenes. ForestSim provides a scalable and accessible foundation for perception research supporting the next generation of intelligent off-road vehicles. The dataset and code are publicly available: Dataset: https://vailforestsim.github.io Code: https://github.com/pragatwagle/ForestSim[263] A Cross-Scale Decoder with Token Refinement for Off-Road Semantic Segmentation
Seongkyu Choi,Jhonghyun An
Main category: cs.CV
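TL;DR: This paper proposes a cross-scale decoder for off-road scenes. Through three mechanisms (global-local token refinement, a gated detail bridge, and uncertainty-guided class-aware point refinement), it improves the robustness of semantic segmentation to ambiguous annotations, unclear boundaries, and sparse structures while remaining computationally efficient.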
Details
Motivation: Off-road scenes have irregular terrain, cluttered vegetation, and ambiguous annotations, which lead to high inter-class similarity, uncertain transition regions, and sparse supervision for thin structures; existing decoders struggle to preserve detail while suppressing noise. Method: A cross-scale decoder is proposed with three components: 1) a global-local token refinement module (a compact bottleneck with boundary-aware regularization); 2) a gated detail bridge (a single cross-scale attention step that injects fine-scale structural cues); 3) uncertainty-guided class-aware point refinement (selectively updating the least reliable pixels). Result: The method consistently outperforms prior approaches on standard off-road benchmarks without dense feature fusion, balancing noise robustness, boundary coherence, and deployment efficiency. Conclusion: The framework addresses the core challenges that annotation ambiguity and sparse structures pose for off-road semantic segmentation, combining accuracy, robustness, and efficiency. Abstract: Off-road semantic segmentation is fundamentally challenged by irregular terrain, vegetation clutter, and inherent annotation ambiguity. Unlike urban scenes with crisp object boundaries, off-road environments exhibit strong class-level similarity among terrain categories, resulting in thick and uncertain transition regions that degrade boundary coherence and destabilize training. Rare or thin structures, such as narrow traversable gaps or isolated obstacles, further receive sparse and unreliable supervision and are easily overwhelmed by dominant background textures. Existing decoder designs either rely on low-scale bottlenecks that oversmooth fine structural details, or repeatedly fuse high-detail features, which tends to amplify annotation noise and incur substantial computational cost. We present a cross-scale decoder that explicitly addresses these challenges through three complementary mechanisms. First, a global-local token refinement module consolidates semantic context on a compact bottleneck lattice, guided by boundary-aware regularization to remain robust under ambiguous supervision. Second, a gated detail bridge selectively injects fine-scale structural cues only once through cross-scale attention, preserving boundary and texture information while avoiding noise accumulation. Third, an uncertainty-guided class-aware point refinement selectively updates the least reliable pixels, improving rare and ambiguous structures with minimal computational overhead.
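The third mechanism, picking out the least reliable pixels, can be illustrated with a small sketch. The abstract does not say how uncertainty is measured, so this assumes per-pixel softmax entropy as the uncertainty score. The function name `select_uncertain_pixels` and the `(C, H, W)` logit layout are hypothetical, and the class-aware weighting and the refinement update itself are left out.

```python
import numpy as np

def select_uncertain_pixels(logits, k):
    """Return the (row, col) indices of the k highest-entropy pixels.

    logits: array of shape (C, H, W) with per-class scores (assumed layout).
    """
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=0, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=0, keepdims=True)
    # Shannon entropy per pixel; higher entropy means a less reliable prediction.
    entropy = -(p * np.log(p + 1e-12)).sum(axis=0)
    # Indices of the k largest entropies in the flattened map.
    flat = np.argsort(entropy.ravel())[::-1][:k]
    rows, cols = np.unravel_index(flat, entropy.shape)
    return np.stack([rows, cols], axis=1), entropy

# Toy 2x2 map with 3 classes: two confident pixels, two fully ambiguous ones.
logits = np.zeros((3, 2, 2))
logits[0, 0, 0] = 10.0  # confident pixel at (0, 0)
logits[1, 1, 1] = 10.0  # confident pixel at (1, 1)
points, entropy = select_uncertain_pixels(logits, k=2)
```

Only the selected points would then be re-predicted by a refinement head, which is what keeps the added cost small compared with refining every pixel.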
The resulting framework achieves noise-robust and boundary-preserving segmentation tailored to off-road environments, recovering fine structural details while maintaining deployment-friendly efficiency. Experimental results on standard off-road benchmarks demonstrate consistent improvements over prior approaches without resorting to heavy dense feature fusion.
[264] JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding
Koki Maeda,Naoaki Okazaki
Main category: cs.CV
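TL;DR: This paper introduces JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. It covers three tasks (dense scene text VQA, receipt key information extraction, and handwriting OCR) and emphasizes challenges specific to Japanese, such as mixed scripts, vertical writing, and a large kanji inventory.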
Details
Motivation: Existing multilingual benchmarks do not adequately reflect the distinctive challenges of Japanese scene text (mixed scripts, vertical writing, a very large character set), and existing Japanese visual text datasets mostly focus on scanned documents and lack in-the-wild text, so a more representative evaluation resource is needed. Method: The JaWildText dataset is built with 3,241 instances from 2,961 images newly captured in Japan, annotated with 1.12 million characters spanning 3,643 unique character types. Three complementary tasks are defined: dense scene text VQA, receipt key information extraction (KIE), and handwriting OCR. Fourteen open-weight VLMs are evaluated systematically, with error analysis. Result: The best model achieves an average score of 0.64 across the three tasks; error analysis shows that character recognition, especially of kanji, remains the main bottleneck; JaWildText supports fine-grained, script-aware diagnosis of Japanese scene text capabilities. Conclusion: JaWildText fills a gap in evaluating Japanese in-the-wild scene text and provides a benchmark and diagnostic tool for advancing VLMs on complex East Asian scripts. Abstract: Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing style: (i) Dense Scene Text Visual Question Answering (STVQA), which requires reasoning over multiple pieces of visual text evidence; (ii) Receipt Key Information Extraction (KIE), which tests layout-aware structured extraction from mobile-captured receipts; and (iii) Handwriting OCR, which evaluates page-level transcription across various media and writing directions. We evaluate 14 open-weight VLMs and find that the best model achieves an average score of 0.64 across the three tasks. Error analyses show recognition remains the dominant bottleneck, especially for kanji.
JaWildText enables fine-grained, script-aware diagnosis of Japanese scene text capabilities, and will be released with evaluation code.
[265] RehearsalNeRF: Decoupling Intrinsic Neural Fields of Dynamic Illuminations for Scene Editing
Changyeon Won,Hyunjun Jung,Jungu Cho,Seonmi Park,Chi-Hoon Lee,Hae-Gon Jeon
Main category: cs.CV
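TL;DR: This paper proposes RehearsalNeRF, which uses scenes captured under stable lighting (such as rehearsal stages) to learn disentangled neural radiance fields under dynamic illumination changes. It separates scene radiance from lighting colors and supports modeling and editing of dynamic objects.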
Details
Motivation: Existing neural radiance field methods struggle to disentangle a subject's own radiance from lighting colors under dynamic illumination changes, which leads to inaccurate modeling. Method: RehearsalNeRF is proposed: 1) it uses rehearsal data captured under stable lighting to enforce geometric consistency; 2) it introduces a learnable temporal lighting vector to disentangle lighting colors; 3) it handles dynamic objects with interactive masks and a new optical-flow-based regularization. Result: The method shows robust performance on novel view synthesis and scene editing under dynamic illumination. Conclusion: RehearsalNeRF copes with severe dynamic illumination changes, learns disentangled radiance fields, and supports reconstruction and editing of dynamic objects. Abstract: Although there has been significant progress in neural radiance fields, the problem of dynamic illumination changes remains unsolved. Different from relevant works that parameterize time-variant/-invariant components in scenes, subjects' radiance is highly entangled with their own emitted radiance and lighting colors in the spatio-temporal domain. In this paper, we present a new effective method to learn disentangled neural fields under severe illumination changes, named RehearsalNeRF. Our key idea is to leverage scenes captured under stable lighting like rehearsal stages, easily taken before dynamic illumination occurs, to enforce geometric consistency between the different lighting conditions. In particular, RehearsalNeRF employs a learnable vector for lighting effects which represents illumination colors in a temporal dimension and is used to disentangle projected light colors from scene radiance. Furthermore, our RehearsalNeRF is also able to reconstruct the neural fields of dynamic objects by simply adopting off-the-shelf interactive masks. To decouple the dynamic objects, we propose a new regularization leveraging optical flow, which provides coarse supervision for the color disentanglement. We demonstrate the effectiveness of RehearsalNeRF by showing robust performance on novel view synthesis and scene editing under dynamic illumination conditions. Our source code and video datasets will be publicly available.
[266] MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation
Ruiyao Liu,Hui Shen,Ping Zhang,Yunta Hsieh,Yifan Zhang,Jing Xu,Sicheng Chen,Junchen Li,Jiawei Lu,Jianing Ma,Jiaqi Mo,Qi Han,Zhen Zhang,Zhongwei Wan,Jing Xiong,Xin Wang,Ziyuan Liu,Hangrui Cao,Ngai Wong
Main category: cs.CV
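TL;DR: This paper presents the MathGen benchmark, which evaluates generative models on mathematical visual generation tasks and finds that current text-to-image models perform poorly on mathematical fidelity.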
Details
Motivation: Modern generative models can solve mathematical problems, but their ability is unclear when answers must be expressed visually as diagrams, geometric constructions, and similar forms, so a dedicated benchmark is needed. Method: MathGen is built as a benchmark of 900 problems spanning seven domains, evaluated deterministically and objectively with executable verifiers under a Script-as-a-Judge protocol. Result: The best closed-source model reaches only 42.0% overall accuracy, open-source models reach only about 1-11%, and accuracy is often near 0% on structured tasks. Conclusion: Current text-to-image models remain far from practical competence on elementary mathematical visual generation. Abstract: Modern generative models have demonstrated the ability to solve challenging mathematical problems. In many real-world settings, however, mathematical solutions must be expressed visually through diagrams, plots, geometric constructions, and structured symbolic layouts, where correctness depends on precise visual composition. Can generative models still do so when the answer must be rendered visually rather than written in text? To study this problem, we introduce MathGen, a rigorous benchmark of 900 problems spanning seven core domains, each paired with an executable verifier under a Script-as-a-Judge protocol for deterministic and objective evaluation. Experiments on representative open-source and proprietary text-to-image models show that mathematical fidelity remains a major bottleneck: even the best closed-source model reaches only 42.0% overall accuracy, while open-source models achieve just ~1-11%, often near 0% on structured tasks. Overall, current T2I models remain far from competent at even elementary mathematical visual generation.
[267] ExFusion: Efficient Transformer Training via Multi-Experts Fusion
Jiacheng Ruan,Daize Dong,Xiaoye Qu,Tong Zhu,Ting Liu,Yuzhuo Fu,Yu Cheng,Suncheng Xiang
Main category: cs.CV
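TL;DR: ExFusion is a pre-training approach that improves training efficiency by fusing multiple experts (FFNs) in the Transformer. It draws on the multi-expert capability of MoE to improve performance without significantly increasing compute, storage, or deployment overhead.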
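The fusion idea summarized above, several expert FFNs collapsed into one equivalent FFN by learned weights, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the fusion is a softmax-normalized weighted sum of expert parameters, which the entry does not specify, and `fuse_experts`, `ffn_forward`, and the parameter names `w1`, `b1`, `w2`, `b2` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_experts(experts, logits):
    """Fuse expert FFN parameters into one FFN (assumed: convex combination)."""
    alpha = softmax(logits)  # fusion weights, one per expert, summing to 1
    fused = {}
    for name in experts[0]:
        fused[name] = sum(a * e[name] for a, e in zip(alpha, experts))
    return fused, alpha

def ffn_forward(x, params):
    """Standard two-layer FFN with ReLU, run once on the fused parameters."""
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 4, 8, 3
experts = [
    {
        "w1": rng.normal(size=(d_model, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(size=(d_hidden, d_model)),
        "b2": np.zeros(d_model),
    }
    for _ in range(n_experts)
]
logits = np.zeros(n_experts)  # would be learned; zeros give uniform weights
fused, alpha = fuse_experts(experts, logits)
y = ffn_forward(rng.normal(size=(2, d_model)), fused)
```

Because only the fused FFN runs in the forward pass, the per-step cost stays close to that of a dense model, and after training the same fusion leaves a single FFN to store and deploy.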
Details
Motivation: Training MoE models directly consumes considerable compute and adds parameter storage and deployment overhead, so a low-cost way to gain multi-expert benefits is needed. Method: At initialization, ExFusion upcycles the Transformer's FFN into a multi-expert configuration and assigns learnable weights; during training, the experts are fused by these weights into a single equivalent FFN used for the forward pass; after training, the learned weights integrate the experts back into a single FFN. Result: ExFusion is validated on a variety of CV and NLP tasks, achieving performance gains close to MoE while keeping compute, storage, and deployment costs close to those of a dense model. Conclusion: ExFusion brings the multi-expert advantages of MoE into standard Transformer training with almost no additional overhead, offering a new approach to scaling model capacity efficiently. Abstract: Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. Therefore, it is critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion, which improves the efficiency of Transformer training through multi-expert fusion. Specifically, during the initialization phase, ExFusion upcycles the feed-forward network (FFN) of the Transformer into a multi-expert configuration, where each expert is assigned a weight for later parameter fusion. During training, these weights allow multiple experts to be fused into a single unified expert equivalent to the original FFN, which is subsequently used for forward computation. As a result, ExFusion introduces multi-expert characteristics into the training process while incurring only marginal computational cost compared to standard dense training. After training, the learned weights are used to integrate multi-experts into a single unified expert, thereby eliminating additional overhead in storage and deployment. Extensive experiments on a variety of computer vision and natural language processing tasks demonstrate the effectiveness of the proposed method.
[268] Learning Multi-View Spatial Reasoning from Cross-View Relations
Suchae Jeong,Jaehwi Song,Haeone Lee,Hanna Kim,Jian Kim,Dongjun Lee,Dong Kyu Shin,Changyeon Kim,Dongyoon Hahm,Woogyeol Jin,Juheon Choi,Kimin Lee
Main category: cs.CV
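TL;DR: This paper introduces the Cross-View Relations (XVR) dataset for improving the multi-view spatial reasoning of vision-language models (VLMs), with a focus on embodied AI and robotic manipulation tasks.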
Details
Motivation: Existing VLMs perform well on single-view tasks but lack the multi-view spatial reasoning needed to understand 3D environments and manipulate objects across viewpoints. Method: A large-scale XVR dataset is built (100K samples derived from 18K 3D scenes and 70K robotic manipulation trajectories) covering three spatial reasoning tasks (correspondence, verification, and localization); VLMs are fine-tuned on it and then integrated as backbones in Vision-Language-Action models. Result: Fine-tuning on XVR substantially improves VLM performance on multi-view spatial reasoning benchmarks such as MindCube and RoboSpatial and raises manipulation success rates on RoboCasa. Conclusion: Explicitly modeling cross-view spatial relations strengthens the multi-view reasoning of VLMs and transfers to real-world robotic manipulation tasks. Abstract: Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as backbones in Vision-Language-Action models, XVR-trained representations improve success rates on RoboCasa. Our results demonstrate that explicit training on cross-view spatial relations significantly enhances multi-view reasoning and transfers effectively to real-world robotic manipulation.
[269] Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs
Pei An,Junfeng Ding,Jiaqi Yang,Yulong Wang,Jie Ma,Liangliang Nan
Main category: cs.CV
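TL;DR: This paper proposes Hg-I2P, a heterogeneous-graph-embedded method for image-to-point-cloud (I2P) registration. It builds a heterogeneous graph mapping 2D and 3D regions to strengthen cross-modal feature interaction and feature discriminability, and uses graph consistency to prune unreliable correspondences, significantly improving generalization and accuracy in cross-domain settings.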
Details
Motivation: The large modality gap between images and point clouds makes it hard to learn cross-modal features that are both discriminative and generalizable, and existing methods degrade significantly in unseen scenes. Method: A heterogeneous graph representing the mapping between segmented 2D regions and 3D point-cloud regions is learned by mining multi-path feature relationships; cross-modal features are adapted under the guidance of heterogeneous edges; and unreliable 2D-3D correspondences are pruned using graph-based projection consistency. Result: On six indoor and outdoor cross-domain benchmarks, Hg-I2P significantly outperforms existing methods in both generalization and registration accuracy. Conclusion: Heterogeneous graph modeling bridges the modality gap between images and point clouds and jointly refines feature representations and correspondences, making it an effective way to improve the robustness and generalization of I2P registration. Abstract: Image-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph that enables refining both cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-I2P. It learns a heterogeneous graph by mining multi-path feature relationships, adapts features under the guidance of heterogeneous edges, and prunes correspondences using graph-based projection consistency. Experiments on six indoor and outdoor benchmarks under cross-domain setups demonstrate that Hg-I2P significantly outperforms existing methods in both generalization and accuracy. Code is released at https://github.com/anpei96/hg-i2p-demo.
[270] AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
Nghia Vu,Tuong Do,Khang Nguyen,Baoru Huang,Nhat Le,Binh Xuan Nguyen,Erman Tjiputra,Quang D. Tran,Ravi Prakash,Te-Chuan Chiu,Anh Nguyen
Main category: cs.CV
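TL;DR: This paper presents the AffordBridge dataset and the AffordMatcher method for more precisely identifying functional interaction regions in point-cloud scenes, improving results by establishing semantic correspondences between image and point-cloud instances.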
Details
Motivation: Existing methods mainly focus on the geometric structures, visual knowledge, and affordance labels of objects, and are hard to extend to complex scenes that involve both object- and scene-level semantics. Method: A large-scale point-cloud dataset, AffordBridge, is built with linked RGB images, and AffordMatcher is proposed to match semantic keypoints between image and point-cloud instances, using visual signifiers to localize affordance regions precisely. Result: Experiments on AffordBridge show that AffordMatcher identifies affordance regions better than other methods. Conclusion: A framework that combines multiple modalities (point clouds and RGB) with semantic alignment improves scene-level affordance learning and offers a new direction for interaction understanding in embodied AI. Abstract: Affordance learning is a complex challenge in many applications, where existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward. In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within the scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling a more precise identification of affordance regions based on cues, so-called visual signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach compared to other methods.
[271] RetinexDualV2: Physically-Grounded Dual Retinex for Generalized UHD Image Restoration
Mohab Kishawy,Jun Chen
Main category: cs.CV
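TL;DR: RetinexDualV2 is a physically grounded dual-branch framework for diverse ultra-high-definition (UHD) image restoration tasks. Using a task-specific physical grounding module and physical-conditioned multi-head self-attention, it robustly corrects complex degradations such as raindrops and low light.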
Details
Motivation: Existing generic models struggle with diverse and complex UHD image degradations and do not explicitly model the physical properties of degradation, which limits generalization and robustness. Method: The RetinexDualV2 framework includes a Task-Specific Physical Grounding Module (TS-PGM) that extracts degradation-aware priors (such as rain masks and dark channels) and a Physical-conditioned Multi-head Self-Attention (PC-MSA) mechanism that guides a Retinex decomposition network to correct reflection and illumination. Result: The method placed 4th in the NTIRE 2026 raindrop removal challenge and 5th in the JNLLIE joint noise and low-light enhancement challenge; experiments indicate state-of-the-art performance and efficiency. Conclusion: A physically driven unified architecture improves generalization across complex degradations without task-specific structural changes, supporting the value of explicit physical priors in image restoration. Abstract: We propose RetinexDualV2, a unified, physically grounded dual-branch framework for diverse Ultra-High-Definition (UHD) image restoration. Unlike generic models, our method employs a Task-Specific Physical Grounding Module (TS-PGM) to extract degradation-aware priors (e.g., rain masks and dark channels). These explicitly guide a Retinex decomposition network via a novel Physical-conditioned Multi-head Self-Attention (PC-MSA) mechanism, enabling robust reflection and illumination correction. This physical conditioning allows a single architecture to handle various complex degradations seamlessly, without task-specific structural modifications. RetinexDualV2 demonstrates exceptional generalizability, securing 4th place in the NTIRE 2026 Day and Night Raindrop Removal Challenge and 5th place in the Joint Noise Low-light Enhancement (JNLLIE) Challenge. Extensive experiments confirm the state-of-the-art performance and efficiency of our physically motivated approach.
[272] CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
Kesheng Chen,Yamin Hu,Qi Zhou,Zhenqian Zhu,Wenjian Luo
Main category: cs.CV
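TL;DR: This paper identifies a phenomenon called commonsense-driven hallucination (CDH): when visual evidence conflicts with commonsense, vision-language models (VLMs) tend to ignore the visual information and rely on commonsense, producing wrong outputs. The authors build CDH-Bench, which probes CDH through three types of anomalies (counting, relational, and attribute), evaluate frontier VLMs on binary and multiple-choice QA, and find that they remain susceptible to prior knowledge.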
Details
Motivation: To examine how reliable vision-language models are when visual evidence conflicts with commonsense, and whether they rely on commonsense while ignoring what the image actually shows. Method: CDH-Bench is built with samples that explicitly set visual evidence against commonsense, covering counting, relational, and attribute anomalies; frontier VLMs are evaluated on binary and multiple-choice QA, and new metrics (CF-Acc, CS-Acc, CFAD, CCR, and RPD) quantify how much models depend on visual evidence versus commonsense. Result: Frontier VLMs show pronounced commonsense-driven hallucination on CDH-Bench: Counterfactual Accuracy generally drops and the Commonsense Collapse Rate is high, indicating that models still lean heavily on prior commonsense rather than visual evidence. Conclusion: CDH is a key reliability flaw in VLMs; CDH-Bench provides a controlled, interpretable tool for diagnosing visual fidelity and shows that current VLMs still fall short in visual-semantic alignment. Abstract: Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence-commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary Question Answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc), Commonsense Accuracy (CS-Acc), Counterfactual Accuracy Drop (CFAD), Commonsense Collapse Rate (CCR), and Relative Prior Dependency (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence-commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence-commonsense conflict.
[273] Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment
Tongfei Liu,Yufan Liu,Bing Li,Weiming Hu
Main category: cs.CV
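TL;DR: This paper proposes the DsCo framework, which synthesizes representative samples with a diffusion-based Noise-Optimization method and introduces a "Doping" strategy that mixes original and synthetic data. It achieves SOTA performance in both data-accessible and data-free scenarios and, at high data volumes, compresses the dataset substantially without loss of performance.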
Details
Motivation: Existing diffusion-based dataset distillation methods lack theoretical justification, scale poorly to high data volumes, and do not work in data-free scenarios. Method: A theoretical framework is established that proves dataset distillation is equivalent to distribution matching and reveals an inherent efficiency limit; the DsCo framework is then proposed, comprising a diffusion-based Noise-Optimization (NOpt) synthesis method and an optional "Doping" strategy that mixes original and synthetic samples. Result: DsCo achieves SOTA performance at low data volumes; at high data volumes it nearly halves the dataset size with no performance degradation; it supports both data-accessible and data-free scenarios. Conclusion: DsCo gives dataset distillation a theoretical basis and an efficient, practical solution, overcoming the efficiency, applicability, and scalability limits of existing methods. Abstract: The high cost and accessibility problem associated with large datasets hinder the development of large-scale visual recognition systems. Dataset Distillation addresses these problems by synthesizing compact surrogate datasets for efficient training, storage, transfer, and privacy preservation. The existing state-of-the-art diffusion-based dataset distillation methods face three issues: lack of theoretical justification, poor efficiency in scaling to high data volumes, and failure in data-free scenarios. To address these issues, we establish a theoretical framework that justifies the use of diffusion models by proving the equivalence between dataset distillation and distribution matching, and reveals an inherent efficiency limit in the dataset distillation paradigm. We then propose a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via "Doping", which mixes selected samples from the original dataset with the synthetic samples to overcome the efficiency limit of dataset distillation. DsCo is applicable in both data-accessible and data-free scenarios, achieving SOTA performance for low data volumes, and it extends well to high data volumes, where it nearly reduces the dataset size by half with no performance degradation.
[274] Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation
Jiachen Li,Hongyun Wang,Jinyu Xu,Wenbo Jiang,Yanchun Ma,Yongjian Liu,Qing Xie,Bolong Zheng
Main category: cs.CV
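TL;DR: This paper proposes the PPCR framework, which improves referring image segmentation through progressive prompt-guided cross-modal reasoning that moves from semantic understanding to spatial grounding to instance segmentation.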
Details
Motivation: Existing methods lack an explicit reasoning mechanism for grounding language descriptions to target image regions and handle complex attributes and inter-object relationships poorly. Method: PPCR uses multimodal large language models to generate Semantic Segmentation Prompts and then Spatial Segmentation Prompts, reasoning progressively from semantic understanding to spatial grounding, and integrates both kinds of prompts into the segmentation module. Result: PPCR consistently outperforms existing methods on standard referring image segmentation benchmarks. Conclusion: Through structured, progressive cross-modal reasoning, PPCR improves the accuracy and robustness of referring image segmentation. Abstract: Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate Semantic Segmentation Prompts that capture key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompts are further generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation Prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.
[275] UniDA3D: A Unified Domain-Adaptive Framework for Multi-View 3D Object Detection
Hongjing Wu,Cheng Chi,Jinlin Wu,Yanzhao Su,Zhen Lei,Wenqi Ren
Main category: cs.CV
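TL;DR: This paper proposes UniDA3D, a unified multi-target domain-adaptive multi-view 3D object detector. With a query-guided domain discrepancy mitigation (QDDM) module and a domain-adaptive teacher-student training strategy, it achieves robust all-weather 3D perception under adverse conditions such as night, rain, and fog. Using cameras only, it significantly outperforms existing methods on synthesized benchmarks.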
Details
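Motivation: Existing multi-view camera-based 3D detection methods degrade severely in complex conditions such as night, rain, and fog because their training data come mostly from ideal conditions and they generalize poorly across domains. Method: The UniDA3D framework: 1) formulates multiple adverse-weather scenes as a unified multi-target domain adaptation problem; 2) uses a query-guided domain discrepancy mitigation (QDDM) module that combines query-centric adversarial and contrastive learning to align object features between source and target domains at batch and global levels; 3) introduces a domain-adaptive teacher-student training pipeline with an exponential-moving-average teacher and dynamically updated high-quality pseudo labels to strengthen consistency learning and suppress background noise. Result: On the synthesized multi-view 3D benchmarks nuScenes-Night, nuScenes-Rain, and nuScenes-Haze built from nuScenes, UniDA3D significantly outperforms SOTA camera-only multi-view 3D detectors under extreme conditions, with substantial gains in mAP and NDS while maintaining real-time inference efficiency. Conclusion: UniDA3D covers multiple adverse conditions with a single unified training process, addressing the robustness bottleneck of multi-view camera 3D detection in complex real-world environments and offering a new approach to low-cost, all-weather perception for autonomous driving. Abstract: Camera-only 3D object detection is critical for autonomous driving, offering a cost-effective alternative to LiDAR-based methods. In particular, multi-view 3D object detection has emerged as a promising direction due to its balanced trade-off between performance and cost. However, existing methods often suffer significant performance degradation under complex environmental conditions such as nighttime, fog, and rain, primarily due to their reliance on training data collected mostly in ideal conditions. To address this challenge, we propose UniDA3D, a unified domain-adaptive multi-view 3D object detector designed for robust perception under diverse adverse conditions. UniDA3D formulates nighttime, rainy, and foggy scenes as a unified multi-target domain adaptation problem and leverages a novel query-guided domain discrepancy mitigation (QDDM) module to align object features between source and target domains at both batch and global levels via query-centric adversarial and contrastive learning. Furthermore, we introduce a domain-adaptive teacher-student training pipeline with an exponential-moving-average teacher and dynamically updated high-quality pseudo labels to enhance consistency learning and suppress background noise in unlabeled target domains. In contrast to prior approaches that require separate training for each condition, UniDA3D performs a single unified training process across multiple domains, enabling robust all-weather 3D perception.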
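The teacher-student pipeline described above rests on two standard ingredients that can be sketched briefly: an exponential-moving-average (EMA) teacher update and confidence filtering of the teacher's detections into pseudo labels. This is a generic sketch and not the UniDA3D code. The `decay` value and the fixed `threshold` are assumptions (the abstract says the pseudo labels are dynamically updated but gives no rule), and `ema_update` and `filter_pseudo_labels` are hypothetical names.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """New teacher parameters: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

def filter_pseudo_labels(scores, threshold):
    """Indices of teacher detections whose confidence reaches the threshold."""
    return np.flatnonzero(scores >= threshold)

# Toy parameters: the teacher drifts slowly toward the student.
teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 2.0])}
teacher = ema_update(teacher, student, decay=0.9)

# Toy detection confidences on an unlabeled target-domain frame.
kept = filter_pseudo_labels(np.array([0.95, 0.40, 0.80]), threshold=0.7)
```

The slow-moving teacher gives more stable targets than the student itself, and dropping low-confidence detections is one simple way to keep background noise out of the pseudo labels.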
On a synthesized multi-view 3D benchmark constructed by generating nighttime, rainy, and foggy counterparts from nuScenes (nuScenes-Night, nuScenes-Rain, and nuScenes-Haze), UniDA3D consistently outperforms state-of-the-art camera-only multi-view 3D detectors under extreme conditions, achieving substantial gains in mAP and NDS while maintaining real-time inference efficiency.
[276] CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition
Muhammad Osama Zeeshan,Masoumeh Sharafi,Benoît Savary,Alessandro Lameiras Koerich,Marco Pedersoli,Eric Granger
Main category: cs.CV
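TL;DR: This paper proposes two methods, CLIP-AU and CLIP-AUTT, which use action units (AUs) as structured text prompts to improve CLIP on fine-grained, personalized emotion recognition (ER). CLIP-AU learns generalizable representations through AU-image alignment without fine-tuning, and CLIP-AUTT further adapts to videos of new users at test time to improve personalization.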
Details
Motivation: Existing CLIP-based emotion recognition methods depend on contrastive pretraining or on LLM-generated text prompts, which are noisy, computationally expensive, and poor at capturing fine-grained expressions; they also cannot model subtle differences in expression between individuals. Method: CLIP-AU embeds action units (AUs) as structured, interpretable text prompts in the CLIP framework and aligns AU semantics with facial dynamics to learn generic representations; CLIP-AUTT then combines entropy-guided temporal window selection with prompt tuning at test time for video-level personalization to unseen subjects. Result: On three fine-grained video emotion recognition datasets (BioVid, StressID, and BAH), CLIP-AU and CLIP-AUTT outperform existing CLIP-based FER and test-time adaptation (TTA) methods, with clear gains in robustness and personalization. Conclusion: Structured AU prompts strengthen CLIP's semantic alignment for fine-grained emotion recognition, and test-time personalized prompt tuning is key to cross-subject generalization and individual adaptation, pointing to lightweight, interpretable, personalized visual emotion recognition. Abstract: Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP's contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency.
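Entropy-guided temporal window selection can be illustrated with a short sketch. The abstract does not state the selection rule, so this assumes that windows with the lowest mean prediction entropy (the most confident ones) are kept, as is common in entropy-based test-time adaptation. The function name `select_low_entropy_windows` and the `(T, C)` probability layout are hypothetical, and the prompt-tuning step is not shown.

```python
import numpy as np

def select_low_entropy_windows(frame_probs, window, top_k):
    """Return start indices of the top_k windows with lowest mean entropy.

    frame_probs: array of shape (T, C), per-frame class probabilities.
    """
    # Shannon entropy of each frame's prediction.
    entropy = -(frame_probs * np.log(frame_probs + 1e-12)).sum(axis=1)
    # Mean entropy of every contiguous window, via a moving average.
    kernel = np.ones(window) / window
    window_entropy = np.convolve(entropy, kernel, mode="valid")
    starts = np.argsort(window_entropy)[:top_k]
    return starts, window_entropy

# Toy video of 5 frames and 2 classes; frames 2 and 3 are confident.
frame_probs = np.array([
    [0.50, 0.50],
    [0.50, 0.50],
    [0.99, 0.01],
    [0.99, 0.01],
    [0.60, 0.40],
])
starts, window_entropy = select_low_entropy_windows(frame_probs, window=2, top_k=1)
```

In this toy case the window starting at frame 2 is selected, and only frames from such windows would drive the test-time prompt update for the new subject.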
Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.
[277] DipGuava: Disentangling Personalized Gaussian Features for 3D Head Avatars from Monocular Video
Jeonghaeng Lee,Seok Keun Choi,Zhixuan Li,Weisi Lin,Sanghoon Lee
Main category: cs.CV
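TL;DR: DipGuava is a 3D Gaussian head avatar creation method that learns facial appearance in two disentangled stages (a geometry-driven base appearance plus personalized residual details) to generate high-fidelity, identity-preserving, photorealistic avatars from monocular video.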
Details
Motivation: Existing 3D head avatar methods struggle to capture personalized details, which limits realism and expressiveness. Method: DipGuava is proposed: the first stage learns a geometry-driven base appearance (global structure and coarse expression-dependent variations); the second stage predicts personalized residual details (high-frequency, nonlinearly varying features such as wrinkles and subtle skin deformations); the two are aligned and combined through dynamic appearance fusion. Result: The method surpasses prior methods in both visual quality and quantitative metrics, producing identity-preserving, photorealistic 3D head avatars. Conclusion: Explicitly disentangling facial appearance and modeling it in stages reduces learning ambiguity and improves reconstruction fidelity, making it an effective approach to high-quality personalized 3D head avatars. Abstract: While recent 3D head avatar creation methods attempt to animate facial dynamics, they often fail to capture personalized details, limiting realism and expressiveness. To fill this gap, we present DipGuava (Disentangled and Personalized Gaussian UV Avatar), a novel 3D Gaussian head avatar creation method that successfully generates avatars with personalized attributes from monocular video. DipGuava is the first method to explicitly disentangle facial appearance into two complementary components, trained in a structured two-stage pipeline that significantly reduces learning ambiguity and enhances reconstruction fidelity. In the first stage, we learn a stable geometry-driven base appearance that captures global facial structure and coarse expression-dependent variations. In the second stage, the personalized residual details not captured in the first stage are predicted, including high-frequency components and nonlinearly varying features such as wrinkles and subtle skin deformations. These components are fused via dynamic appearance fusion that integrates residual details after deformation, ensuring spatial and semantic alignment. This disentangled design enables DipGuava to generate photorealistic, identity-preserving avatars, consistently outperforming prior methods in both visual quality and quantitative performance, as demonstrated in extensive experiments.
[278] Energy-Aware Imitation Learning for Steering Prediction Using Events and Frames
Hu Cao,Jiong Liu,Xingzhuo Yan,Rui Song,Yan Xia,Walter Zimmer,Guang Chen,Alois Knoll
Main category: cs.CV
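TL;DR: This paper proposes an energy-aware imitation learning framework for steering prediction in autonomous driving. It fuses event-camera and frame-camera data, uses an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to improve prediction reliability and safety, and outperforms existing SOTA methods on the DDD20 and DRFuser datasets.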
Details
Motivation: Conventional frame-based cameras can yield inaccurate perception under long exposure times, high-speed motion, and difficult lighting, so event cameras are introduced as a complementary modality to improve robustness. Method: An energy-aware imitation learning framework is proposed, comprising an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder, which jointly uses event streams and image frames for steering prediction. Result: On two public real-world datasets, DDD20 and DRFuser, the method outperforms current state-of-the-art (SOTA) methods. Conclusion: Jointly modeling events and frames with an energy-aware mechanism improves the reliability and safety of steering prediction in autonomous driving, supporting the effectiveness and practicality of the framework. Abstract: In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors like long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we introduce a bio-inspired vision sensor known as the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events that provide a complementary modality to mitigate these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The code and trained models will be released upon acceptance.
[279] Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
Huimin Zeng,Yue Bai,Hailing Wang,Yun Fu
Main category: cs.CV
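TL;DR: This paper proposes PhysHDR-GS, a physically inspired framework for high dynamic range novel view synthesis. It models intrinsic scene reflectance and adjustable ambient illumination, combines an image-exposure branch with a Gaussian-illumination branch, and introduces a cross-branch HDR consistency loss and an illumination-guided gradient scaling strategy, markedly improving HDR detail reconstruction while keeping real-time rendering speed.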
Details
Motivation: Existing HDR-NVS methods struggle to model appearance changes that depend on ambient illumination, and implicitly supervising tone-mapped results tends to produce abnormal HDR values and missing gradients in under/over-exposed regions. Method: The PhysHDR-GS framework is proposed, with complementary image-exposure (IE) and Gaussian-illumination (GI) branches; a cross-branch HDR consistency loss provides explicit HDR supervision, and an illumination-guided gradient scaling strategy mitigates the gradient starvation caused by exposure bias. Result: The method outperforms existing methods on both real and synthetic datasets, with a 2.04 dB PSNR gain in HDR detail reconstruction (over HDR-GS) and a rendering speed of up to 76 FPS. Conclusion: Through physical modeling and cooperative optimization of the two branches, PhysHDR-GS addresses illumination-dependent appearance modeling and gradient imbalance in HDR-NVS, combining high-quality reconstruction with real-time performance. Abstract: High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails to correct abnormal HDR values and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations. Experimental results across realistic and synthetic datasets demonstrate the superiority of our method in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS). Code and models are available at https://huimin-zeng.github.io/PhysHDR-GS/.
[280] SegRGB-X: General RGB-X Semantic Segmentation Model
Jiong Liu,Yingjie Xu,Xingcheng Zhou,Rui Song,Walter Zimmer,Alois Knoll,Hu Cao
Main category: cs.CV
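TL;DR: This paper proposes a universal arbitrary-modal semantic segmentation framework. Using a modality-aware CLIP, modality-aligned embeddings, and a domain-specific refinement module, it reaches 65.03% mIoU across datasets covering five different sensor modalities, achieving SOTA performance.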
Details
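Motivation: To address the challenges of semantic segmentation across arbitrary sensor modalities, which stem from widely differing sensor characteristics and the redundant development effort of traditional approaches. Method: Three core innovations are proposed: (1) Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) modality-aligned embeddings for capturing fine-grained features; (3) a Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Result: Evaluated on datasets covering five complementary modalities (event, thermal, depth, polarization, and light field), the model reaches 65.03% mIoU, surpassing specialized multi-modal methods and achieving state-of-the-art performance. Conclusion: The proposed universal arbitrary-modal segmentation framework effectively unifies multi-modal segmentation tasks, substantially reduces duplicated development, and shows strong generalization and practical potential. Abstract: Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and the traditional configurations for this task result in redundant development efforts. We address these challenges by introducing a universal arbitrary-modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality-aware CLIP (MA-CLIP), which provides modality-specific scene understanding guidance through LoRA fine-tuning; (2) Modality-aligned Embeddings for capturing fine-grained features; and (3) the Domain-specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi-modal methods and achieves state-of-the-art performance with an mIoU of 65.03%. The code will be released upon acceptance.
[281] Adapting SAM to Nuclei Instance Segmentation and Classification via Cooperative Fine-Grained Refinement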
Jingze Su,Tianle Zhu,Jiaxin Cai,Zhiyi Wang,Qi Li,Xiao Zhang,Tong Tong,Shu Wang,Wenxi Liu
Main category: cs.CV
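TL;DR: This paper proposes Cooperative Fine-Grained Refinement of SAM, a parameter-efficient fine-tuning framework that improves SAM on nuclei instance segmentation. By introducing a local-aware adapter, a hierarchical feature fusion module, and a boundary-guided mask refinement mechanism, it retains SAM's strong global modeling ability while improving its perception of local structures and its segmentation accuracy on medical images.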
Details
Motivation: SAM performs very well on natural images, but when applied directly to medical images (such as nuclei segmentation) it lacks perception of key local structural features, and full fine-tuning is computationally expensive. Method: A three-part cooperative framework is proposed: 1) a multi-scale adaptive local-aware adapter (lightweight convolutional kernels that dynamically enhance local features); 2) a hierarchical modulated fusion module (dynamically aggregating multi-level encoder features to preserve spatial detail); 3) boundary-guided mask refinement (fusing multi-context boundary cues with semantic features under explicit supervision). Result: The method achieves high-accuracy nuclei instance segmentation, with clearly sharper boundaries and better local structure perception, while remaining parameter-efficient. Conclusion: The proposed parameter-efficient fine-tuning framework transfers SAM's general prior knowledge to the medical imaging domain and compensates for its weak local perception without sacrificing global modeling ability, providing a practical and accurate segmentation solution for pathology image analysis. Abstract: Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. Recently, the Segment Anything Model has demonstrated exceptional performance in various segmentation tasks, leveraging its rich priors and powerful global context modeling capabilities derived from large-scale pre-training on natural images. However, directly applying SAM to the medical imaging domain faces significant limitations: it lacks sufficient perception of the local structural features that are crucial for nuclei segmentation, and full fine-tuning for downstream tasks requires substantial computational costs. To efficiently transfer SAM's robust prior knowledge to nuclei instance segmentation while supplementing its task-aware local perception, we propose a parameter-efficient fine-tuning framework, named Cooperative Fine-Grained Refinement of SAM, consisting of three core components: 1) a Multi-scale Adaptive Local-aware Adapter, which enables effective capability transfer by augmenting the frozen SAM backbone with minimal parameters and instilling a powerful perception of local structures through dynamically generated, multi-scale convolutional kernels; 2) a Hierarchical Modulated Fusion Module, which dynamically aggregates multi-level encoder features to preserve fine-grained spatial details; and 3) a Boundary-Guided Mask Refinement, which integrates multi-context boundary cues with semantic features through explicit supervision, producing a boundary-focused signal to refine initial mask predictions for sharper delineation. 
These three components work cooperatively to enhance local perception, preserve spatial details, and refine boundaries, enabling SAM to perform accurate nuclei instance segmentation directly.
[282] Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models
Arundhathi Dev,Justin Zhan
Main category: cs.CV
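TL;DR: This paper proposes a modular detection-and-correction OCR framework that decouples visual character detection from linguistic correction. It uses pretrained sequence models such as T5, ByT5, and BART for annotation-free domain adaptation, keeping near-SOTA accuracy while reducing compute by about 95%.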
Details
Motivation: Existing end-to-end Transformer OCR models are accurate, but domain adaptation takes hundreds of GPU hours; this cost limits their use by researchers with few resources, such as digital humanities scholars. Method: A lightweight, domain-agnostic visual character detection module is decoupled from a domain-specific linguistic correction module built on pretrained sequence models (T5, ByT5, BART); the correctors are trained only on synthetic noise and need no labeled target-domain images. Result: The approach is validated on modern handwriting, cursive script, and historical documents; T5-Base suits modern standard text, while ByT5-Base performs better on historical documents with archaic spellings; overall accuracy is close to SOTA with about 95% less compute. Conclusion: The decoupled paradigm offers a better trade-off between accuracy and efficiency and provides a feasible, efficient OCR alternative for resource-constrained settings. Abstract: Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-SOTA accuracy with single-GPU training. Our approach decouples lightweight visual character detection (domain-agnostic) from domain-specific linguistic correction using pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical "Pareto frontier" in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.
[283] Effort-Based Criticality Metrics for Evaluating 3D Perception Errors in Autonomous Driving
Sharang Kaul,Simon Bultmann,Mario Berk,Abhinav Valada
Main category: cs.CV
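TL;DR: This paper proposes three new effort-based criticality metrics (FSR, MDR, LEA) that assess the safety impact of autonomous-driving perception errors more precisely, avoiding the way traditional metrics conflate the consequences of false positives and false negatives. A reachability-based ellipsoidal collision filter selects dynamically plausible threats, and experiments show the metrics effectively identify critical perception failures.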
Details
Motivation: Existing criticality metrics (such as TTC) conflate the consequences of false positives (FP) and false negatives (FN) and cannot distinguish their actual safety impact, which leads to misjudging the severity of perception errors. Method: Three longitudinal and lateral effort-based metrics are proposed: False Speed Reduction (FSR), Maximum Deceleration Rate (MDR), and Lateral Evasion Acceleration (LEA); a reachability-based ellipsoidal collision filter selects dynamically plausible threats, and scoring combines frame-level matching with track-level aggregation. Result: Evaluation on nuScenes and Argoverse 2 shows that 65-93% of perception errors are non-critical; Spearman correlation analysis confirms that all three metrics capture safety-relevant information that established time-based, deceleration-based, or normalized criticality measures cannot. Conclusion: The proposed effort-based criticality metrics characterize the actual safety impact of perception errors more accurately and in a more differentiated way, supporting targeted mining and analysis of perception failures in safety-critical scenarios. Abstract: Criticality metrics such as time-to-collision (TTC) quantify collision urgency but conflate the consequences of false-positive (FP) and false-negative (FN) perception errors. We propose two novel effort-based metrics: False Speed Reduction (FSR), the cumulative velocity loss from persistent phantom detections, and Maximum Deceleration Rate (MDR), the peak braking demand from missed objects under a constant-acceleration model. These longitudinal metrics are complemented by Lateral Evasion Acceleration (LEA), adapted from prior lateral evasion kinematics and coupled with reachability-based collision timing to quantify the minimum steering effort to avoid a predicted collision. A reachability-based ellipsoidal collision filter ensures only dynamically plausible threats are scored, with frame-level matching and track-level aggregation. Evaluation of different perception pipelines on nuScenes and Argoverse 2 shows that 65-93% of errors are non-critical, and Spearman correlation analysis confirms that all three metrics capture safety-relevant information inaccessible to established time-based, deceleration-based, or normalized criticality measures, enabling targeted mining of the most critical perception failures.
[284] Event6D: Event-based Novel Object 6D Pose Tracking
Jae-Young Kang,Hoonhee Cho,Taeyeop Lee,Minjun Kang,Bowen Wen,Youngho Kim,Kuk-Jin Yoon
Main category: cs.CV
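TL;DR: This paper proposes EventTrack6D, an event-camera-based 6D object pose tracking framework that generalizes to novel objects without object-specific training. By reconstructing intensity and depth images at arbitrary timestamps between depth frames, it achieves high-frame-rate (>120 FPS), motion-robust tracking.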
Details
Motivation: Conventional RGB and depth cameras suffer from motion blur and large pixel displacements in fast, dynamic scenes, whereas event cameras offer microsecond latency; this advantage should be exploited for efficient 6D pose tracking. Method: The EventTrack6D framework is proposed. From sparse event streams, and conditioned on the most recent depth measurement, it jointly reconstructs dense photometric (intensity) and geometric (depth) information at arbitrary timestamps; it is trained purely on synthetic data, and a benchmark suite is built with a large-scale synthetic training set plus real and simulated test sets. Result: EventTrack6D runs at over 120 FPS and maintains temporal consistency under fast motion; trained only on synthetic data, it generalizes to real scenes without fine-tuning and tracks accurately across diverse objects and motion patterns. Conclusion: The work validates event cameras for 6D pose tracking without object-specific prior models and advances event-driven vision for real-time dynamic scenarios such as robotics and AR. Abstract: Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets validate the effectiveness of event cameras for event-based 6D pose tracking of novel objects. Code and datasets are publicly available at https://chohoonhee.github.io/Event6D.
[285] Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Zhen Zou,Xiaoxiao Ma,Mingde Yao,Jie Huang,LinJiang Huang,Feng Zhao
Main category: cs.CV
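TL;DR: This paper proposes Drift-AR, which uses the per-position prediction entropy of continuous-space autoregressive models as a single signal to accelerate both the autoregressive generation stage (through entropy-informed speculative decoding) and the diffusion vision decoding stage (by interpreting entropy as the physical variance of an anti-symmetric drifting field, enabling single-step decoding). It achieves a 3.8-5.5× speedup while matching or improving generation quality.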
Details
Motivation: Existing AR-diffusion hybrid paradigms have a dual speed bottleneck (sequential AR generation and multi-step diffusion denoising) and lack a unified acceleration principle; the authors observe that prediction entropy naturally encodes spatially varying generation uncertainty and can guide both AR draft quality and the strength of visual decoding correction, which had not been fully exploited. Method: Drift-AR is proposed: 1) entropy-informed speculative decoding, which aligns draft and target entropy distributions with a causal-normalized entropy loss to reduce draft rejection; 2) reinterpreting entropy as the physical variance of the initial state of an anti-symmetric drifting field, enabling single-step (1-NFE) visual decoding without iterative denoising or distillation; the entropy signal is computed once and shared at no extra cost. Result: A 3.8-5.5× speedup on MAR, TransDiff, and NextStep-1, with genuine 1-NFE decoding and generation quality matching or surpassing the original models. Conclusion: Prediction entropy is the key unifying signal linking the AR and diffusion stages; through entropy-driven joint optimization, Drift-AR achieves coordinated acceleration of both stages and strikes a new balance between efficiency and quality. Abstract: Autoregressive (AR)-Diffusion hybrid paradigms combine AR's structured semantic modeling with diffusion's high-fidelity synthesis, yet suffer from a dual speed bottleneck: the sequential AR stage and the iterative multi-step denoising of the diffusion vision decoding stage. Existing methods address each in isolation without a unified design principle. We observe that the per-position prediction entropy of continuous-space AR models naturally encodes spatially varying generation uncertainty, simultaneously governing draft prediction quality in the AR stage and reflecting the corrective effort required by the vision decoding stage, a property that has not been fully explored before. Since entropy is inherently tied to both bottlenecks, it serves as a natural unifying signal for joint acceleration. In this work, we propose Drift-AR, which leverages this entropy signal to accelerate both stages: 1) for AR acceleration, we introduce Entropy-Informed Speculative Decoding that aligns draft-target entropy distributions via a causal-normalized entropy loss, resolving the entropy mismatch that causes excessive draft rejection; 2) for visual decoder acceleration, we reinterpret entropy as the physical variance of the initial state for an anti-symmetric drifting field (high-entropy positions activate stronger drift toward the data manifold while low-entropy positions yield vanishing drift), enabling single-step (1-NFE) decoding without iterative denoising or distillation. Moreover, both stages share the same entropy signal, which is computed once with no extra cost. 
Experiments on MAR, TransDiff, and NextStep-1 demonstrate a 3.8-5.5× speedup with genuine 1-NFE decoding, matching or surpassing original quality. Code will be available at https://github.com/aSleepyTree/Drift-AR.
[286] Object Detection Based on Distributed Convolutional Neural Networks
Liang Sun
Main category: cs.CV
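TL;DR: This paper proposes a simple object detection method based on the Distributed Convolutional Neural Network (DisCNN). It detects features at multiple scales and overlaps high-scoring regions to form bounding boxes, needs only positive and negative sample images for training, supports parallel multi-class detection, and uses a lightweight, efficient model.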
Details
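Motivation: Existing object detection methods may rely on complex architectures or large amounts of annotated data; this paper aims for a simpler method that needs only positive and negative class labels and gains efficiency from multi-scale feature detection. Method: A Distributed Convolutional Neural Network (DisCNN) is built. Exploiting the positive monotonic relation between the positive-class modules of its output vector and the presence probabilities of positive features, the method locates high-scoring image patches at multiple scales and overlaps them to produce the final detection box; multiple classes can be detected in parallel. Result: With a lightweight model architecture, the method achieves fast single- and multi-object detection without precise location annotations, relying only on object-centered images with positive and negative class labels for training and detection. Conclusion: DisCNN offers a novel, efficient detection paradigm with low annotation requirements, demonstrates that detection based on the monotonicity of multi-scale feature responses is feasible, and suits real-time and resource-constrained scenarios. Abstract: Based on the Distributed Convolutional Neural Network (DisCNN), a straightforward object detection method is proposed. The modules of the output vector of a DisCNN with respect to a specific positive class are positively monotonic with the presence probabilities of the positive features. So, by identifying all high-scoring patches across all possible scales, the positive object can be detected by overlapping them to form a bounding box. The essential idea is that the object is detected by detecting its features on multiple scales, ranging from specific sub-features to abstract features composed of these sub-features. Training DisCNN requires only object-centered image data with positive and negative class labels. The detection process for multiple positive classes can be conducted in parallel, which significantly accelerates it, and single-object detection is also fast because of the lightweight model architecture.
[287] 4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction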
Renjie Wu,Hongdong Li,Jose M. Alvarez,Miaomiao Liu
Main category: cs.CV
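TL;DR: This paper proposes 4DSurf, a framework for generic dynamic scene surface reconstruction. Using Signed Distance Function flow regularization induced by Gaussian deformations and an overlapping segment partitioning strategy, it handles large deformations and temporal inconsistency, and markedly improves reconstruction accuracy and temporal consistency on the Hi4D and CMU Panoptic datasets.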
Details
Motivation: Existing Gaussian Splatting (GS)-based dynamic surface reconstruction methods are usually limited to a single object or small deformations and struggle to keep surfaces temporally consistent in scenes with large deformations. Method: The 4DSurf framework is proposed, with two core parts: 1) Signed Distance Function flow regularization induced by Gaussian deformations, which constrains Gaussian motion to align with the evolving surface; 2) an overlapping segment partitioning strategy, which divides the sequence into overlapping segments with small deformations and passes geometric information across segments through the shared timestep. Result: On Hi4D and CMU Panoptic, Chamfer distance improves over SOTA methods by 49% and 19% respectively, with better temporal consistency under sparse views. Conclusion: 4DSurf is a generic dynamic surface reconstruction framework that needs no preset number or type of objects and handles large deformations and temporal inconsistency, markedly improving reconstruction quality and temporal stability. Abstract: This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose "4DSurf", a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene and can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian-deformation-induced Signed Distance Function Flow Regularization, which constrains the motion of Gaussians to align with the evolving surface. To handle large deformations, we introduce an Overlapping Segment Partitioning strategy that divides the sequence into overlapping segments with small deformations and incrementally passes geometric information across segments through the shared overlapping timestep. Experiments on two challenging dynamic scene datasets, Hi4D and CMU Panoptic, demonstrate that our method outperforms state-of-the-art surface reconstruction methods by 49% and 19% in Chamfer distance, respectively, and achieves superior temporal consistency under sparse-view settings.
[288] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation
Zhaohe Liao,Kaixun Jiang,Zhihang Liu,Yujie Wei,Junqiu Yu,Quanhao Li,Hong-Tao Yu,Pandeng Li,Yuzheng Wang,Zhen Xing,Shiwei Zhang,Chen-Wei Xie,Yun Zheng,Xihui Liu
Main category: cs.CV
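TL;DR: This paper proposes AIBench, the first benchmark for evaluating the logical correctness and aesthetic quality of academic illustrations. It assesses logical consistency with VQA and aesthetics with VLMs, and finds that the performance gap between models on this task is far larger than on general tasks.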
Details
Motivation: It is unclear whether current image generation models can produce illustrations ready for use in academic papers, and evaluating them directly with a VLM is unreliable because it depends on understanding complex text and figures. Method: The AIBench benchmark is built with four levels of logic-oriented VQA questions (derived from a logic diagram of the paper's method section) to assess logical consistency between illustration and paper at different granularities; aesthetic quality is assessed with VLMs. Result: Experiments show that 1) the performance gap between models on AIBench is much larger than on general tasks; 2) logical correctness and aesthetic quality are hard to achieve together; 3) test-time scaling significantly improves both abilities. Conclusion: AIBench reveals that current image generation models still face a clear bottleneck in generating dense, logic-driven academic illustrations, and that jointly optimizing logic and aesthetics is the key challenge. Abstract: Although image generation has boosted various applications via its rapid evolution, whether state-of-the-art models can produce ready-to-use academic illustrations for papers is still largely unexplored. Directly comparing or evaluating the illustration with a VLM is naive and requires oracle-level multi-modal understanding ability, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark using VQA for evaluating the logical correctness of academic illustrations and VLMs for assessing aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method part of the paper, which query whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than on general tasks, reflecting their differing abilities in complex reasoning and high-density generation. Further, logic and aesthetics are hard to optimize simultaneously as in handcrafted illustrations. Additional experiments further show that test-time scaling on both abilities significantly boosts performance on this task.
[289] MolmoPoint: Better Pointing for VLMs with Grounding Tokens
Christopher Clark,Yue Yang,Jae Sung Park,Zixian Ma,Jieyu Zhang,Rohun Tripathi,Mohammadreza Salehi,Sangho Lee,Taira Anderson,Winson Han,Ranjay Krishna
Main category: cs.CV
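TL;DR: This paper proposes a novel pointing mechanism for vision-language models (VLMs) that replaces conventional coordinate generation with a step-by-step selection of visual tokens (coarse region, then subpatch, then location within it). It markedly improves pointing and tracking on images, GUIs, and video, with higher sample efficiency.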
Details
Motivation: Existing VLMs point by generating text coordinates, which requires learning a complicated coordinate system and costs many tokens; this paper aims for a more intuitive, efficient, fine-grained mechanism that selects visual tokens directly. Method: Special "pointing tokens" are introduced that use cross-modal attention to select, in turn, 1) the visual token covering the target region, 2) a subpatch within that region, and 3) a specific location within the subpatch; training is improved with sequential generation in a consistent order, relative position encoding, and a "no-more-points" termination class. Result: 70.7% on PointBench image pointing (SOTA); 61.1% on ScreenSpotPro GUI pointing (SOTA among fully open models); a 59.1% human preference win rate on video pointing and a +6.3% gain on Molmo2Track tracking; sample efficiency also improves substantially. Conclusion: Pointing by hierarchical visual-token selection is more intuitive, efficient, and scalable than coordinate generation, offers a new paradigm for spatial understanding in VLMs, and is validated across multiple tasks. Abstract: Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
[290] LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization
Chutian Meng,Fan Ma,Chi Zhang,Jiaxu Miao,Yi Yang,Yueting Zhuang
Main category: cs.CV
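TL;DR: This paper proposes LogiStory, a framework that uses a multi-agent system to explicitly model visual logic (the perceptual and causal coherence among characters, actions, and scenes), improving narrative logic and visual quality in multi-image story visualization, and builds the LogicTale benchmark for evaluation.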
Details
Motivation: Existing visual sequence generation models have improved image quality and knowledge integration but pay little attention to visual logic (the perceptual and causal coherence of characters, actions, and scenes over time), which leads to broken narratives and unclear logic. Method: LogiStory, a logic-aware multi-image story visualization framework, is proposed; it uses a multi-agent system for role grounding, causal chain extraction, and story-level consistency verification. The LogicTale benchmark is also built, with stories richly annotated for causal reasoning and with automatic and human evaluation protocols. Result: Experiments show that the method significantly improves the narrative logic of generated visual stories and outperforms baselines in both visual logic and perceptual quality. Conclusion: This work treats visual logic as an explicit modeling objective and provides a foundational idea and practical framework for modeling and enforcing logical consistency in image sequence and video generation. Abstract: Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate the generation capacity, we construct LogicTale, a benchmark comprising richly annotated stories that emphasizes causal reasoning and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. 
Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.
[291] GEMS: Agent-Native Multimodal Generation with Memory and Skills
Zefeng He,Siyuan Huang,Xiaoye Qu,Yafu Li,Tong Zhu,Yu Cheng,Yang Yang
Main category: cs.CV
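TL;DR: This paper proposes GEMS, a framework that uses a multi-agent loop, persistent memory, and extensible skill modules to significantly improve multimodal generation models on general and downstream tasks, even allowing a lightweight 6B model to surpass a SOTA large model.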
Details
Motivation: Existing multimodal generation models fall short on complex instructions and specialized downstream tasks, so the inherent limits of the base models need to be overcome. Method: The GEMS framework is proposed with three core components: Agent Loop (structured multi-agent closed-loop optimization), Agent Memory (a persistent trajectory-level memory that hierarchically stores factual states and experience summaries), and Agent Skill (a library of domain-specific skills loaded on demand). Result: Across five mainstream tasks and four downstream tasks, GEMS yields significant gains on multiple generative backends; the lightweight Z-Image-Turbo (6B) surpasses Nano Banana 2 on GenEval2. Conclusion: GEMS shows that the agent paradigm can extend the capability boundary of base models and is especially suited to high-performance multimodal generation under resource constraints. Abstract: Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of the agent harness in extending model capabilities beyond their original limits.
[292] To View Transform or Not to View Transform: NeRF-based Pre-training Perspective
Hyeonjun Jeong,Juyeb Shin,Dongsuk Kum
Main category: cs.CV
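TL;DR: This paper proposes NeRP3D, a NeRF-resembled point-based 3D detector. By keeping the pre-trained NeRF network and learning a continuous 3D representation, it avoids the blurry representations caused by the conflict between view transformation and radiance field priors, and markedly improves scene reconstruction and 3D detection on nuScenes.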
Details
Motivation: When applied to 3D perception, existing NeRF pre-training methods forcibly couple view transformation (discrete and rigid) with radiance fields (continuous and adaptive), which creates conflicting priors and blurry 3D representations; the pre-trained NeRF network is also discarded in downstream tasks, so the learned 3D representation is used inefficiently. Method: The NeRF-Resembled Point-based 3D detector (NeRP3D) is proposed; it learns a continuous 3D representation directly, bypassing view transformation, and keeps and reuses the pre-trained NeRF network throughout, inheriting its continuous 3D representation learning. Result: On nuScenes, NeRP3D significantly surpasses previous SOTA methods on both the pretext scene reconstruction task and the downstream 3D object detection task. Conclusion: With a unified continuous 3D representation learning framework, NeRP3D resolves the prior mismatch and model waste between NeRF pre-training and downstream 3D perception, offering a more efficient and consistent modeling paradigm for visual understanding in autonomous driving. Abstract: Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, which enhances 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pretraining to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation inherits conflicting priors; view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network for pre-training is discarded during downstream tasks, resulting in inefficient utilization of enhanced 3D representations through NeRF. In this paper, we propose a novel NeRF-Resembled Point-based 3D detector that can learn continuous 3D representation and thus avoid the misaligned priors from view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the tasks, inheriting the principle of continuous 3D representation learning and leading to greater potential for both scene reconstruction and detection tasks. 
Experiments on the nuScenes dataset demonstrate that our proposed approach significantly improves on previous state-of-the-art methods, outperforming them not only on the pretext scene reconstruction task but also on downstream detection tasks.
[293] SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
Alexander Prutsch,Christian Fruhwirth-Reisinger,David Schinagl,Horst Possegger
Main category: cs.CV
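TL;DR: This paper proposes a new streaming motion forecasting framework for dynamic traffic environments. Instance-aware context streaming and a dual training objective improve prediction robustness across observation lengths while keeping inference real-time.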
Details
Motivation: Existing streaming motion forecasting methods degrade when observation lengths are heterogeneous and adapt poorly to the constantly changing input conditions of dynamic traffic scenes. Method: A streaming forecasting framework that processes observation windows incrementally is proposed; instance-aware context streaming continuously updates latent agent states, and a dual training objective keeps predictions consistent across observation horizons. Result: Robustness is validated on Argoverse 2, nuScenes, and Argoverse 1; the model reaches SOTA on the Argoverse 2 multi-agent streaming benchmark with very low latency. Conclusion: The framework balances accuracy and real-time performance and is suitable for deployment in real-world autonomous driving systems. Abstract: In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also on the single-agent benchmarks. Our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark, while maintaining minimal latency, highlighting its suitability for real-world deployment.
[294] Octree-based Learned Point Cloud Geometry Compression: A Lossy Perspective
Kaiyu Zheng,Wei Gao,Huiming Zheng
Main category: cs.CV
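TL;DR: This paper proposes a new octree-based approach to lossy point cloud compression, with leaf-node lossy coding for object point clouds and a variable rate control strategy for LiDAR point clouds, which significantly improves compression performance.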
Details
Motivation: Traditional octree-based lossy compression tends to lose many points and introduce severe distortion when the quantization step is adjusted, and the potential of octree methods for lossy compression has not been fully explored. Method: For object point clouds, a leaf-node-level lossy compression method with bit-wise coding and binary prediction is proposed; for LiDAR point clouds, a simple but effective variable rate control method is designed. Result: The leaf-node method significantly outperforms the previous octree-based method on object point clouds; the rate control method achieves about 1% bit error on LiDAR point clouds without fine-tuning. Conclusion: Octree-based context learning has great potential for lossy point cloud compression, and different point cloud types call for dedicated lossy strategies. Abstract: Octree-based context learning has recently become a leading method in point cloud compression. However, its potential for lossy compression remains unexplored. The traditional lossy compression paradigm, which uses a lossless octree representation with quantization step adjustment, may result in severe distortions due to massive missing points in quantization. Therefore, we analyze the data characteristics of different point clouds and propose lossy approaches tailored to each. For object point clouds that suffer from quantization step adjustment, we propose a new leaf nodes lossy compression method, which achieves lossy compression by performing bit-wise coding and binary prediction on leaf nodes. For LiDAR point clouds, we explore variable rate approaches and propose a simple but effective rate control method. Experimental results demonstrate that the proposed leaf nodes lossy compression method significantly outperforms the previous octree-based method on object point clouds, and the proposed rate control method achieves about 1% bit error without finetuning on LiDAR point clouds.
[295] RAWIC: Bit-Depth Adaptive Lossless Raw Image Compression
Chunhang Zheng,Tongda Xu,Mingli Xie,Yan Wang,Dou Li
Main category: cs.CV
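TL;DR: This paper proposes RAWIC, a bit-depth-adaptive learned lossless compression framework for Bayer-pattern raw images. By feeding bit depth as an auxiliary input and designing a matching entropy model, it compresses raw images from multiple cameras and bit depths with one model and outperforms traditional codecs such as JPEG-XL.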
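The preprocessing named in this entry (packing a Bayer mosaic into four RGGB planes and attaching a bit depth to each patch) can be illustrated with a minimal pure-Python sketch. The function names `pack_rggb` and `patch_bit_depth` are hypothetical, the RGGB layout with red at the top-left is an assumption, and "bit depth" is assumed here to mean the number of bits needed for the largest sample in the patch; the abstract does not say how RAWIC computes it.

```python
def pack_rggb(raw):
    """Split a single-channel Bayer mosaic into four half-resolution planes.

    Assumes an RGGB layout: R at (even row, even col), G at (even, odd)
    and (odd, even), B at (odd, odd).
    """
    r = [row[0::2] for row in raw[0::2]]
    g1 = [row[1::2] for row in raw[0::2]]
    g2 = [row[0::2] for row in raw[1::2]]
    b = [row[1::2] for row in raw[1::2]]
    return [r, g1, g2, b]


def patch_bit_depth(planes, top, left, size):
    """Bits needed for the largest sample in one patch (assumed definition)."""
    peak = max(
        value
        for plane in planes
        for row in plane[top:top + size]
        for value in row[left:left + size]
    )
    return max(1, peak.bit_length())


# Toy 4x4 mosaic: a bright top-left region and darker regions elsewhere.
raw = [
    [1000, 200, 15, 7],
    [300, 4095, 3, 9],
    [12, 5, 1, 2],
    [6, 8, 0, 3],
]
planes = pack_rggb(raw)
# One bit depth per 1x1 patch of the packed (2x2) planes.
depths = {
    (top, left): patch_bit_depth(planes, top, left, 1)
    for top in range(2)
    for left in range(2)
}
```

In this sketch the per-patch value would be passed to the entropy model as side information, so that dark patches are not coded as if they used the full sensor range.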
Details
Motivation: Raw images have high bit depth and are sensor-dependent. Existing learned lossless compression methods mainly target 8-bit sRGB images, while raw reconstruction methods are inherently lossy and rely on camera-specific assumptions, so a general, lossless, bit-depth-adaptive raw compression scheme is needed. Method: Convert single-channel Bayer data into a four-channel RGGB format and split it into patches; compute each patch's bit depth and use it as an auxiliary input; design a bit-depth-adaptive entropy model that estimates each patch's probability distribution conditioned on its bit depth, so that a single model handles many cameras and bit depths. Result: RAWIC consistently surpasses traditional lossless codecs across multiple datasets, with an average bitrate 7.7% lower than JPEG-XL. Conclusion: RAWIC is an effective, general, and adaptive lossless raw image compression framework that addresses the bit-depth compatibility and camera generalization limits of existing methods. Abstract: Raw images preserve linear sensor measurements and high bit-depth information crucial for advanced vision tasks and photography applications, yet their storage remains challenging due to large file sizes, varying bit depths, and sensor-dependent characteristics. Existing learned lossless compression methods mainly target 8-bit sRGB images, while raw reconstruction approaches are inherently lossy and rely on camera-specific assumptions. To address these challenges, we introduce RAWIC, a bit-depth-adaptive learned lossless compression framework for Bayer-pattern raw images. We first convert single-channel Bayer data into a four-channel RGGB format and partition it into patches. For each patch, we compute its bit depth and use it as auxiliary input to guide compression. A bit-depth-adaptive entropy model is then designed to estimate patch distributions conditioned on their bit depths. This architecture enables a single model to handle raw images from diverse cameras and bit depths. Experiments show that RAWIC consistently surpasses traditional lossless codecs, achieving an average 7.7% bitrate reduction over JPEG-XL. Our code is available at https://github.com/chunbaobao/RAWIC.
[296] Contour-Guided Query-Based Feature Fusion for Boundary-Aware and Generalizable Cardiac Ultrasound Segmentation
Zahid Ullah,Sieun Choi,Jihie Kim
Main category: cs.CV
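TL;DR: This paper proposes CGQR-Net, a contour-guided query refinement network that fuses multi-resolution features with anatomical contour priors to improve boundary precision and robustness in cardiac ultrasound segmentation.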
Details
Motivation: Existing appearance-driven methods struggle to preserve boundary precision and structural consistency under low contrast, speckle noise, irregular boundaries, and domain shift across devices and patient populations. Method: An HRNet backbone extracts multi-scale, high-resolution features; anatomical contours are extracted from a coarse segmentation and encoded as learnable query embeddings; the contour-guided queries interact with the fused features through cross-attention for structure-aware refinement; dual-head supervision jointly optimizes segmentation and boundary prediction. Result: Validated on the CAMUS and CardiacNet datasets, the method improves segmentation accuracy and boundary precision and generalizes well across imaging conditions. Conclusion: Fusing contour-level structural information with feature-level representations effectively improves the reliability and robustness of cardiac ultrasound segmentation. Abstract: Accurate cardiac ultrasound segmentation is essential for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are challenging due to low contrast, speckle noise, irregular boundaries, and domain shifts across devices and patient populations. Existing methods, largely based on appearance-driven learning, often fail to preserve boundary precision and structural consistency under these conditions. To address these issues, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The framework integrates multi-resolution feature representations with contour-derived structural priors. An HRNet backbone preserves high-resolution spatial details while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps via cross-attention, enabling structure-aware refinement that improves boundary delineation and reduces noise artifacts. A dual-head supervision strategy jointly optimizes segmentation and boundary prediction to enforce structural consistency. The proposed method is evaluated on the CAMUS dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate improved segmentation accuracy, enhanced boundary precision, and robust performance across varying imaging conditions.
These results highlight the effectiveness of integrating contour-level structural information with feature-level representations for reliable cardiac ultrasound segmentation.
[297] Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention
Seunghun Oh,Unsang Park
Main category: cs.CV
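TL;DR: This paper proposes Attention Frequency Modulation (AFM), an inference-time method that edits cross-attention in diffusion models in the Fourier domain. By reweighting low- and high-frequency components, it continuously controls the spatial scale of text-token competition without retraining or prompt changes.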
Details
Motivation: Prior work has not characterized the multi-resolution, step-wise dynamics of cross-attention in diffusion models in depth, which leaves no principled, training-free means of control. Method: Cross-attention is modeled as a spatiotemporal signal on the latent grid; token-agnostic attention concentration maps are extracted and their radially binned Fourier power spectra analyzed, revealing a stable coarse-to-fine spectral progression in encoder cross-attention. AFM builds on this by reweighting the low- and high-frequency components of token-wise pre-softmax logits in the Fourier domain on a progress-aligned schedule, adaptively gated by token-allocation entropy. Result: On Stable Diffusion, AFM reliably redistributes attention spectra and yields substantial, controllable visual edits while preserving semantic consistency; entropy mainly acts as an adaptive gain rather than an independent control axis. Conclusion: AFM offers a training-free, plug-and-play frequency-domain intervention, exposes the time-frequency structure of cross-attention, and opens a path to fine-grained, continuous generation control. Abstract: Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics remain under-characterized, limiting principled training-free control. We cast diffusion cross-attention as a spatiotemporal signal on the latent grid by summarizing token-softmax weights into token-agnostic concentration maps and tracking their radially binned Fourier power over denoising. Across prompts and seeds, encoder cross-attention exhibits a consistent coarse-to-fine spectral progression, yielding a stable time-frequency fingerprint of token competition. Building on this structure, we introduce Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention that edits token-wise pre-softmax cross-attention logits in the Fourier domain: low- and high-frequency bands are reweighted with a progress-aligned schedule and can be adaptively gated by token-allocation entropy, before the token softmax. AFM provides a continuous handle to bias the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes attention spectra and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy mainly acts as an adaptive gain on the same frequency-based edit rather than an independent control axis.
[298] MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding
Guangjing Yang,Ziyuan Qin,Chaoran Zhang,Chenlin Du,Jinlin Wang,Wanran Sun,Zhenyu Zhang,Bing Ji,Qicheng Lao
Main category: cs.CV
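TL;DR: This paper proposes MedLoc-R1, a framework that uses performance-aware, progressive reward scheduling to ease the training difficulty caused by sparse rewards in medical visual grounding, significantly improving localization accuracy and training stability.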
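As a rough illustration of performance-aware reward scheduling, the sketch below keeps a sliding window of recent hits and tightens an IoU threshold only when several conditions hold at once. The class name `RewardScheduler`, the threshold ladder, the window size, and the promotion rate are all hypothetical; the entry does not give MedLoc-R1's actual update rule, so this should be read as one plausible shape of such a scheduler and not as the authors' method.

```python
from collections import deque


class RewardScheduler:
    """Binary IoU reward whose threshold tightens as recent performance improves."""

    def __init__(self, thresholds, window, promote_at):
        self.thresholds = thresholds          # increasing IoU thresholds (assumed)
        self.level = 0                        # index of the active threshold
        self.history = deque(maxlen=window)   # sliding window of recent hits
        self.promote_at = promote_at          # hit rate required to tighten

    @property
    def threshold(self):
        return self.thresholds[self.level]

    def reward(self, iou):
        """Score one prediction and record whether it cleared the threshold."""
        hit = iou >= self.threshold
        self.history.append(hit)
        return 1.0 if hit else 0.0

    def maybe_tighten(self):
        """Tighten only if the window is full, the hit rate is high enough,
        and a stricter level exists."""
        window_full = len(self.history) == self.history.maxlen
        rate_ok = window_full and sum(self.history) / len(self.history) >= self.promote_at
        has_next = self.level + 1 < len(self.thresholds)
        if rate_ok and has_next:
            self.level += 1
            self.history.clear()  # restart tracking under the stricter criterion
            return True
        return False


scheduler = RewardScheduler(thresholds=[0.3, 0.5, 0.7], window=4, promote_at=0.75)
rewards = []
for iou in [0.4, 0.35, 0.2, 0.6]:
    rewards.append(scheduler.reward(iou))
    scheduler.maybe_tighten()
```

Because the scheduler only changes the scalar reward, it adds no network or gradient path, which matches the lightweight design the entry describes.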
Details
Motivation: Existing reinforcement-learning-based medical visual grounding methods such as GRPO suffer from severe reward sparsity on small or ambiguous lesion regions, and fixed IoU rewards lead to vanishing gradients and stalled early training. Method: MedLoc-R1 is a performance-aware reward scheduling framework that needs no extra networks or gradient paths; it combines a sliding-window performance tracker with a multi-condition update rule to move dynamically from dense rewards to strict localization requirements. Result: On three medical visual grounding benchmarks, MedLoc-R1 consistently outperforms GRPO baselines in localization accuracy and training stability. Conclusion: MedLoc-R1 is a general, lightweight, and effective solution for RL-based medical visual grounding in high-stakes clinical settings. Abstract: Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at https://github.com/MembrAI/MedLoc-R1.
[299] SVGS: Single-View to 3D Object Editing via Gaussian Splatting
Pengcheng Xue,Yan Tian,Qiutao Song,Ziyi Wang,Linyang He,Weiping Ding,Mahmoud Hassaballah,Karen Egiazarian,Wei-Fa Yang,Leszek Rutkowski
Main category: cs.CV
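TL;DR: This paper proposes SVGS, a single-view, text-driven editing technique for 3D Gaussian Splatting. A single-view editing strategy guided by multi-view diffusion models, combined with a sparse 3D Gaussian representation, significantly improves editing consistency and efficiency.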
Details
Motivation: Existing methods based on implicit representations (such as NeRF) or Gaussian editing are slow, offer weak regional control, and produce inconsistent multi-view edits, so they struggle to balance editing consistency and efficiency. Method: SVGS is a single-view, text-driven editing method built on 3D Gaussian Splatting; it uses a single-view editing strategy guided by multi-view diffusion models, keeps only the views whose edits are consistent for 3D reconstruction, and uses a sparse 3D Gaussian representation for efficiency. Result: Comparative experiments across varied scenes show that SVGS outperforms baselines such as Instruct-NeRF2NeRF and GaussianEditor in both editing capability and processing speed. Conclusion: SVGS addresses the multi-view inconsistency and efficiency bottlenecks and marks notable progress in text-driven 3D scene editing. Abstract: Text-driven 3D scene editing has attracted considerable interest due to its convenience and user-friendliness. However, methods that rely on implicit 3D representations, such as Neural Radiance Fields (NeRF), while effective in rendering complex scenes, are hindered by slow processing speeds and limited control over specific regions of the scene. Moreover, existing approaches, including Instruct-NeRF2NeRF and GaussianEditor, which utilize multi-view editing strategies, frequently produce inconsistent results across different views when executing text instructions. This inconsistency can adversely affect the overall performance of the model, complicating the task of balancing the consistency of editing results with editing efficiency. To address these challenges, we propose a novel method termed Single-View to 3D Object Editing via Gaussian Splatting (SVGS), which is a single-view text-driven editing technique based on 3D Gaussian Splatting (3DGS). Specifically, in response to text instructions, we introduce a single-view editing strategy grounded in multi-view diffusion models, which reconstructs 3D scenes by leveraging only those views that yield consistent editing results. Additionally, we employ sparse 3D Gaussian Splatting as the 3D representation, which significantly enhances editing efficiency. We conducted a comparative analysis of SVGS against existing baseline methods across various scene settings, and the results indicate that SVGS outperforms its counterparts in both editing capability and processing speed, representing a significant advancement in 3D editing technology.
For further details, please visit our project page at: https://amateurc.github.io/svgs.github.io.
[300] MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
Zhang Li,Zhibo Lin,Qiang Liu,Ziyang Zhang,Shuo Zhang,Zidun Guo,Jiajun Song,Jiarui Zhang,Xiang Bai,Yuliang Liu
Main category: cs.CV
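TL;DR: This paper introduces MDPBench, the first benchmark for parsing multilingual digital and photographed documents. It contains 3,400 document images covering 17 languages, multiple scripts, and real photographic conditions, and it shows that current models, open-source ones in particular, degrade markedly on non-Latin scripts and photographed documents.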
Details
Motivation: Document parsing research has focused on clean, digital, well-formatted documents in a few dominant languages, and no benchmark systematically evaluates generalization across languages, scripts, and real photographic conditions. Method: The authors build MDPBench, a multilingual document parsing benchmark of 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions; high-quality annotations come from a rigorous pipeline of expert-model labeling, manual correction, and human verification; public and private evaluation splits support fair comparison. Result: Closed-source models (such as Gemini3-Pro) are relatively robust, while open-source models drop by an average of 14.0% on non-Latin scripts and 17.8% on photographed documents. Conclusion: Current document parsing models show a significant performance imbalance across languages and conditions, and more inclusive, deployment-ready multilingual parsing systems are needed. Abstract: We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.
[301] Robust Remote Sensing Image-Text Retrieval with Noisy Correspondence
Qiya Song,Yiqiang Xie,Yuan Sun,Renwei Dian,Xudong Kang
Main category: cs.CV
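TL;DR: This paper proposes a Robust Remote Sensing Image-Text Retrieval (RRSITR) paradigm to handle the noisy image-text pairing (Noisy Correspondence, NC) that is common in remote sensing data. It combines a self-paced learning strategy, sample reliability weighting, and a robust triplet loss to learn progressively from easy to hard multi-modal samples.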
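The loss-based three-way split and per-pair reliability weight mentioned in this entry can be sketched as follows. The two thresholds `low` and `high`, the function names, and the linear weight are illustrative assumptions; the entry says only that pairs are categorized and weighted by loss magnitude, not how. In particular, the zero weight for noisy pairs is a simplification, since the paper handles those pairs with a separate robust triplet loss.

```python
def categorize(loss, low, high):
    """Assign a training pair to a category by its loss (thresholds assumed)."""
    if loss <= low:
        return "clean"
    if loss <= high:
        return "ambiguous"
    return "noisy"


def reliability_weight(loss, low, high):
    """Weight 1 for clean pairs, linear decay across the ambiguous band,
    and 0 for noisy pairs (a hypothetical weighting, not the paper's)."""
    if loss <= low:
        return 1.0
    if loss >= high:
        return 0.0
    return (high - loss) / (high - low)


# Per-pair losses from some retrieval objective (toy values).
losses = [0.1, 0.5, 0.9]
low, high = 0.25, 0.75
categories = [categorize(value, low, high) for value in losses]
weights = [reliability_weight(value, low, high) for value in losses]
```

A self-paced schedule in this spirit would typically start with small thresholds, so that only easy pairs contribute, and relax them as training progresses.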
Details
Motivation: Existing remote sensing image-text retrieval (RSITR) methods generally assume that images and texts match exactly, but well-aligned data is costly to obtain and real remote sensing datasets (such as RSITMD) contain many inaccurate or mismatched descriptions; this "noisy correspondence" (NC) problem has long been overlooked. Method: The RRSITR paradigm (1) divides training pairs into clean, ambiguous, and noisy categories by loss magnitude; (2) assigns each pair a reliability weight; (3) designs a multi-modal self-paced function that dynamically regulates training order and weights; and (4) introduces, for noisy pairs, a robust triplet loss whose soft margin is adjusted dynamically by semantic similarity. Result: The method significantly outperforms state-of-the-art methods on three mainstream remote sensing benchmarks, with the largest gains at high noise rates. Conclusion: Noisy correspondence is a key but unexplored problem in RSITR; by mimicking human cognitive learning, RRSITR improves robustness and retrieval performance on noisy multi-modal remote sensing data. Abstract: As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. In addition, we also notice that the remote sensing datasets (e.g., RSITMD) truly contain some inaccurate or mismatched image text descriptions. Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image-Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard from multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we respectively estimate the reliability of each training pair by assigning a weight to each pair based on the values of the loss. Further, we respectively design a new multi-modal self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process.
Finally, for noisy sample pairs, we present a robust triplet loss to dynamically adjust the soft margin based on semantic similarity, thereby enhancing the robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially in high noise rates. The code is available at: https://github.com/MSFLabX/RRSITR
[302] Intelligent Road Condition Monitoring using 3D In-Air SONAR Sensing
Amber Cassimon,Robin Kerstens,Walter Daems,Jan Steckel
Main category: cs.CV
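TL;DR: This paper studies in-air 3D SONAR sensors for road surface condition monitoring, focusing on road material classification and on road damage detection and classification. Experiments show an F1 score near 90% for material classification but only about 75% for damage detection, so the modality is promising but needs further research to improve accuracy.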
Details
Motivation: Cameras, LiDAR, and similar sensors degrade in harsh conditions such as rain, fog, and smoke, whereas SONAR is robust to this interference and can support opportunistic sensing of roads from vehicles such as garbage trucks and mail vans. Method: Using a single dataset annotated with several damage types and with road material labels, the authors train and evaluate models on SONAR data for two tasks: road material classification (asphalt, concrete, element roads) and damage detection and classification (independent of material type and without localization). Result: The F1 score approaches 90% for road material classification and is about 75% for road damage detection and classification. Conclusion: SONAR is a promising sensing modality for opportunistic-sensing-based pavement management systems, but damage detection accuracy still needs to improve and should be the focus of follow-up work. Abstract: In this paper, we investigate the capabilities of in-air 3D SONAR sensors for the monitoring of road surface conditions. Concretely, we consider two applications: Road material classification and Road damage detection and classification. While such tasks can be performed with other sensor modalities, such as camera sensors and LiDAR sensors, these sensor modalities tend to fail in harsh sensing conditions, such as heavy rain, smoke or fog. By using a sensing modality that is robust to such interference, we enable the creation of opportunistic sensing applications, where vehicles performing other tasks (garbage collection, mail delivery, etc.) can also be used to monitor the condition of the road. For these tasks, we use a single dataset, in which different types of damages are annotated, with labels including the material of the road surface. In the material classification task, we differentiate between three different road materials: Asphalt, Concrete and Element roads. In the damage detection and classification task, we determine if there is damage, and what type of damage (independent of material type), without localizing the damage. We are successful in determining the road surface type from SONAR sensor data, with F1 scores approaching 90% on the test set, but find that for the detection of damages performance lags, with F1 score around 75%.
From this, we conclude that SONAR sensing is a promising modality to include in opportunistic sensing-based pavement management systems, but that further research is needed to reach the desired accuracy.
[303] RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation
Chanseul Cho,Seokju Yun,Jeaseong Jeon,Seungjae Moon,Youngmin Ro
Main category: cs.CV
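TL;DR: This paper proposes RecycleLoRA, which uses Rank-Revealing QR decomposition to exploit the subspace structure of vision foundation models. A main and a sub LoRA adapter learn features along minor and major directions respectively, improving domain generalized semantic segmentation without extra regularization or inference cost.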
Details
Motivation: Existing domain generalized semantic segmentation methods under-exploit the rich subspace structure inside vision foundation models (VFMs), and their LoRA components suffer from low representational diversity and inefficient parameter use. Method: RecycleLoRA (1) uses Rank-Revealing QR decomposition (RRQR) to identify major and minor subspace directions in the VFM; (2) has the main adapter learn diverse, independent features along minor directions; and (3) has the sub adapter make small refinements along major directions; the two adapters work together without additional regularization losses. Result: The method reaches state-of-the-art performance on both synthetic-to-real and real-to-real domain generalized semantic segmentation without complex structures or extra inference latency. Conclusion: Systematically exploiting the subspace structure of pre-trained models through RRQR-based initialization significantly improves domain generalization, and RecycleLoRA shows that subspace-aware adapter design is both effective and efficient. Abstract: Domain Generalized Semantic Segmentation (DGSS) aims to maintain robust performance across unseen target domains. Vision Foundation Models (VFMs) offer rich multi-domain knowledge that can enhance generalization. However, strategies for actively exploiting the rich subspace structures within VFMs remain under-explored, with many existing methods focusing primarily on preserving pre-trained knowledge. Furthermore, their LoRA components often suffer from limited representational diversity and inefficient parameter utilization. We propose RecycleLoRA, which addresses both challenges by employing Rank-Revealing QR Decomposition (RRQR) to systematically exploit VFM's subspace structures and enhance LoRA's representational richness. Our main adapter leverages minor subspace directions identified by RRQR to learn diverse and independent features, achieving competitive performance even when used alone. We further introduce a sub adapter that carefully refines major directions with minimal adjustments, providing complementary improvements to the main adapter's strong baseline performance. This design enables the dual adapters to learn distinct representations without requiring additional regularization losses. Our systematic exploitation of pre-trained subspace structures through RRQR-based initialization leads to superior domain generalization performance. RecycleLoRA achieves state-of-the-art performance on both synthetic-to-real generalization and real-to-real generalization tasks without complex architectures or additional inference latency.
[304] BlankSkip: Early-exit Object Detection onboard Nano-drones
Carlo Marra,Beatrice Alessandra Motetti,Alessio Burrello,Enrico Macii,Massimo Poncino,Daniele Jahier Pagliari
Main category: cs.CV
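TL;DR: This paper proposes BlankSkip, an adaptive deep neural network for real-time object detection on nano-drones. An auxiliary classification task that identifies frames with no objects triggers an early exit, which significantly raises average throughput at the cost of a slight accuracy drop.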
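The early-exit control flow described in this entry can be sketched in a few lines: shared early layers run on every frame, a cheap auxiliary head estimates whether the frame is blank, and the full detector runs only when it is not. The function `detect_with_early_exit`, its parameters, and the toy stand-in callables are all hypothetical; the real network layers and the exit threshold are not given in the entry.

```python
def detect_with_early_exit(frame, stem, blank_head, detector_tail, blank_threshold=0.5):
    """Return (detections, exited_early) for one frame.

    stem: shared early layers; blank_head: auxiliary classifier returning
    P(frame has no object of interest); detector_tail: remaining detector.
    """
    features = stem(frame)
    if blank_head(features) >= blank_threshold:
        return [], True                      # skip the expensive detector
    return detector_tail(features), False


# Toy stand-ins so the control flow can be exercised without a real network.
calls = {"tail": 0}


def toy_stem(frame):
    return sum(frame) / len(frame)           # mean intensity as a fake feature


def toy_blank_head(mean):
    return 1.0 - min(mean, 1.0)              # dark frames look "blank"


def toy_tail(mean):
    calls["tail"] += 1                       # count how often the detector runs
    return [("object", round(mean, 2))]


empty = detect_with_early_exit([0.0, 0.1, 0.05], toy_stem, toy_blank_head, toy_tail)
busy = detect_with_early_exit([0.9, 0.8, 1.0], toy_stem, toy_blank_head, toy_tail)
```

The throughput gain of such a scheme depends on how often frames are blank and on how cheap the stem and auxiliary head are relative to the detector tail.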
Details
Motivation: Nano-drones have extremely limited compute (about 10 MiB of memory and a 1 W power budget), and conventional object detectors struggle to run in real time; early exit has been studied for classification but is hard to apply to dense tasks such as object detection. Method: BlankSkip uses a simple auxiliary binary classification task (whether the input frame contains objects of interest) to trigger an early exit, skipping full detection inference on blank frames; it is deployed and validated on a real nano-drone platform, the Crazyflie 2.1. Result: On a state-of-the-art nano-drone object detection dataset, average throughput improves by up to 24% over a static MobileNet-SSD detector, with an mAP drop of only 0.015. Conclusion: BlankSkip shows that an early-exit strategy driven by a lightweight auxiliary classifier suits object detection under severe resource constraints and balances accuracy and efficiency well. Abstract: Deploying tiny computer vision Deep Neural Networks (DNNs) on-board nano-sized drones is key for achieving autonomy, but is complicated by the extremely tight constraints of their computational platforms (approximately 10 MiB memory, 1 W power budget). Early-exit adaptive DNNs that dial down the computational effort for "easy-to-process" input frames represent a promising way to reduce the average inference latency. However, while this approach is extensively studied for classification, its application to dense tasks like object detection (OD) is not straightforward. In this paper, we propose BlankSkip, an adaptive network for on-device OD that leverages a simple auxiliary classification task for early exit, i.e., identifying frames with no objects of interest. With experiments using a real-world nano-drone platform, the Bitcraze Crazyflie 2.1, we achieve up to 24% average throughput improvement with a limited 0.015 mean Average Precision (mAP) drop compared to a static MobileNet-SSD detector, on a state-of-the-art nano-drones OD dataset.
[305] ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models
Yuhuan Xie,Aoxuan Pan,Yi-Hua Huang,Chirui Chang,Peng Dai,Xin Yu,Xiaojuan Qi
Main category: cs.CV
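TL;DR: ObjectMorpher is a unified, interactive image editing framework that lifts ambiguous 2D edits into geometry-aware 3D operations, using 3D Gaussian Splatting and graph-based non-rigid deformation for fine-grained, realistic, and efficient object-level editing.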
Details
Motivation: Existing 2D image editing methods lack 3D awareness and often give ambiguous or implausible results, while 3D-aware methods depend on heavy optimization or incomplete monocular reconstruction and struggle to combine precision, controllability, and efficiency. Method: ObjectMorpher first lifts the target instance into editable 3D Gaussian Splatting (3DGS) with an image-to-3D generator, then applies graph-based non-rigid deformation with ARAP constraints in response to user-dragged control points, and finally uses a composite diffusion module to harmonize lighting, color, and boundaries for seamless blending. Result: It outperforms 2D drag and existing 3D-aware baselines on KID, LPIPS, SIFID, and user preference, and supports fine-grained, photorealistic edits across categories. Conclusion: ObjectMorpher delivers efficient, controllable, geometry-consistent object-level image editing, bridging the gap between 2D interactivity and 3D fidelity. Abstract: Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.
[306] Event-Based Method for High-Speed 3D Deformation Measurement under Extreme Illumination Conditions
Banglei Guan,Yifei Bian,Zibin Liu,Haoyang Li,Xuanyu Bai,Taihang Lei,Bin Li,Yang Shang,Qifeng Yu
Main category: cs.CV
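TL;DR: This paper proposes a high-dynamic-range, low-latency 3D deformation monitoring method based on a multi-event camera array, aimed at high-speed deformation measurement of large engineering structures (such as launch towers and suspension bridges) under extreme illumination such as strong light. The method covers feature extraction from asynchronous event streams, calibration that combines the Kruppa equations with parameter optimization, and 3D measurement through a unified coordinate transformation and linear intersection. Experiments show a relative error below 0.08%, confirming its accuracy and robustness.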
Details
Motivation: Traditional cameras overexpose under extreme illumination and have limited dynamic range, so they cannot accurately measure the high-speed 3D deformation of large structures in strong light, which poses a safety risk. Method: Marker center points are extracted by combining the asynchronous event stream of event cameras with temporal correlation analysis; rapid calibration uses the Kruppa equations together with a parameter optimization framework; 3D deformation is measured through a unified coordinate transformation and linear intersection. Result: The relative measurement error is below 0.08%; field experiments under extreme illumination, including self-calibration and 3D deformation measurement, confirm the method's effectiveness and robustness. Conclusion: The method overcomes the limits of traditional cameras for high-speed 3D deformation measurement under extreme illumination; compared with other methods it stays accurate in harsh lighting (relative error below 0.1%) and suits practical engineering safety monitoring. Abstract: Background: Large engineering structures, such as space launch towers and suspension bridges, are subjected to extreme forces that cause high-speed 3D deformation and compromise safety. These structures typically operate under extreme illumination conditions. Traditional cameras often struggle to handle strong light intensity, leading to overexposure due to their limited dynamic range. Objective: Event cameras have emerged as a compelling alternative to traditional cameras in high dynamic range and low-latency applications. This paper presents an integrated method, from calibration to measurement, using a multi-event camera array for high-speed 3D deformation monitoring of structures in extreme illumination conditions. Methods: Firstly, the proposed method combines the characteristics of the asynchronous event stream and temporal correlation analysis to extract the corresponding marker center point. Subsequently, the method achieves rapid calibration by solving the Kruppa equations in conjunction with a parameter optimization framework. Finally, by employing a unified coordinate transformation and linear intersection, the method enables the measurement of 3D deformation of the target structure. Results: Experiments confirmed that the relative measurement error is below 0.08%. Field experiments under extreme illumination conditions, including self-calibration of a multi-event camera array and 3D deformation measurement, verified the performance of the proposed method. Conclusions: This paper addressed the critical limitation of traditional cameras in measuring high-speed 3D deformations under extreme illumination conditions.
The experimental results demonstrate that, compared to other methods, the proposed method can accurately measure 3D deformations of structures under harsh lighting conditions, and the relative error of the measured deformation is less than 0.1%.
[307] ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization
Bingchen Li,Zhixin Wang,Fan Li,Jiaqi Xu,Jiaming Guo,Renjing Pei,Xin Li,Zhibo Chen
Main category: cs.CV
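TL;DR: This paper proposes a new old-photo colorization framework built on the FLUX diffusion model. Structure-color decoupling, progressive Direct Preference Optimization (Pro-DPO), and visual semantic prompts significantly improve colorization accuracy and structural consistency.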
Details
Motivation: Existing old-photo restoration models handle denoising and scratch removal well, but the fading and color shifts specific to old photos create a large domain gap from modern image distributions, which leads to inaccurate colorization. Method: A colorization framework based on the FLUX generative diffusion model; a structure-color decoupling strategy; progressive Direct Preference Optimization (Pro-DPO) to model coarse-to-fine color preferences; and visual semantic prompts that replace text prompts, extracting fine-grained semantic information from the old photo to remove its inherent color bias. Result: On both synthetic and real datasets, the method surpasses existing state-of-the-art colorization methods, including closed-source commercial models, and produces high-quality, vivid, natural colorization. Conclusion: Structure-color decoupling, Pro-DPO, and visual semantic prompts together bridge the domain gap between old photos and modern images and give a more robust and accurate colorization solution for historical image restoration. Abstract: Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.
[308] ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining
Yucheng Huang,Luping Ji,Xiangwei Jiang,Wen Li,Mao Ye
Main category: cs.CV
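TL;DR: This paper proposes Topological Layout Learning (ToLL), a pretraining framework for 3D scene graphs (3DSG). Anchor-conditioned topological geometry reasoning and structural multi-view augmentation enable label-free self-supervised learning and significantly improve 3DSG representation quality.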
Details
Motivation: Existing 3DSG generation methods are limited by data scarcity; they either depend on predicate annotations or let strong object priors bypass predicate learning, and they lack a robust, label-free self-supervised pretraining task. Method: The ToLL framework (1) designs anchor-conditioned topological geometry reasoning, in which a GNN recovers the global layout of zero-centered subgraphs from the spatial priors of sparse anchors, strictly modulated by predicate features; and (2) builds structural multi-view augmentation with self-distillation to avoid semantic corruption and strengthen representations. Result: Extensive experiments on the 3DSSG dataset show that ToLL significantly improves representation quality and outperforms state-of-the-art baselines. Conclusion: ToLL offers an effective unsupervised/self-supervised pretraining paradigm that addresses weak predicate relation learning and annotation dependence in 3DSG and improves generalization. Abstract: 3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pre-training. The former relies heavily on predicate annotations, while the latter's predicate learning may be bypassed due to strong object priors. Consequently, they could not often provide a label-free and robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose a Topological Layout Learning (ToLL) for 3DSG pretraining framework. In detail, we design an Anchor-Conditioned Topological Geometry Reasoning, with a GNN to recover the global layout of zero-centered subgraphs by the spatial priors from sparse anchors. This process is strictly modulated by predicate features, thereby enforcing the predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption, and enhancing representations via self-distillation. The extensive experiments on 3DSSG dataset demonstrate that our ToLL could improve representation quality, outperforming state-of-the-art baselines.
[309] A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
Xuanlong Yu,Youyang Sha,Longfei Liu,Xi Shen,Di Yang
Main category: cs.CV
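TL;DR: This paper proposes a hybrid ensemble decoder and a progressive fine-tuning framework that improve the generalization and optimization stability of few-shot object detection (FSOD). Without extra parameters or complex data augmentation, it significantly outperforms existing methods on several benchmarks.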
Details
Motivation: Few-shot object detection (FSOD) suffers from unstable optimization and poor generalization because training samples are scarce. Method: A hybrid ensemble decoder with a shared hierarchical layer followed by multiple parallel decoder branches, each using denoising queries that are either inherited or newly initialized to increase prediction diversity; combined with a unified progressive fine-tuning framework and a plateau-aware learning rate schedule. Result: In the 10-shot setting on RF100-VL the method reaches an average performance of 41.9, significantly above SAM3 (35.7); it also shows stronger OOD robustness on a cross-domain test set built from CD-FSOD. Conclusion: The method is effective, generalizes well, and is robust, without adding parameters or requiring complex data augmentation or hyperparameter tuning. Abstract: Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvement gains. These results highlight the effectiveness, generalization, and robustness of the proposed method. Code is available at: https://github.com/Intellindust-AI-Lab/FT-FSOD.
[310] Explaining CLIP Zero-shot Predictions Through Concepts
Onat Ozdemir,Anders Christensen,Stephan Alaniz,Zeynep Akata,Emre Akbas
Main category: cs.CV
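TL;DR: This paper proposes EZPC, which explains CLIP's zero-shot predictions by projecting CLIP's image-text embeddings into a concept space learned from language descriptions, requiring no additional supervision while balancing accuracy and interpretability.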
Details
Motivation: Large models such as CLIP perform well on zero-shot recognition, but their predictions lack interpretability; Concept Bottleneck Models are interpretable but rely on concept annotations and generalize poorly. This work aims to combine the strengths of both. Method: EZPC projects CLIP's joint image-text embeddings into a concept space learned from language descriptions and trains the projection without supervision using alignment and reconstruction objectives, so that concept activations preserve CLIP's semantic structure while remaining interpretable. Result: On five benchmark datasets (CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k), EZPC maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. Conclusion: EZPC offers an interpretable and trustworthy path for open-vocabulary vision-language models grounded in explicit semantic concepts, and is an important step toward interpretable AI. Abstract: Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC, which bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.
[311] Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
Kazuma Ikeda,Ryosei Hara,Rokuto Nagata,Ozora Sako,Zihao Ding,Takahiro Kado,Ibuki Fujioka,Taro Beppu,Mariko Isogawa,Kentaro Yoshioka
Main category: cs.CV
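TL;DR: This paper presents Ghost-FWL, the first large-scale mobile full-waveform LiDAR (FWL) dataset for ghost point detection and removal, and builds on it the self-supervised model FWL-MAE, substantially improving ghost removal accuracy as well as downstream SLAM and 3D detection performance.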
Details
Motivation: Existing ghost removal methods based on geometric consistency fail on sparse, dynamic mobile LiDAR data, whereas full-waveform LiDAR (FWL) provides richer temporal intensity information and may better separate genuine reflections from ghosts. Method: The authors build Ghost-FWL, the first large-scale annotated mobile FWL ghost dataset (24K frames, 7.5 billion peak-level annotations), propose the masked autoencoder FWL-MAE for self-supervised representation learning, and establish an FWL-based baseline model for ghost detection. Result: The baseline outperforms existing methods in ghost removal accuracy; after ghost removal, LiDAR SLAM trajectory error drops by 66% and 3D object detection false positives drop by 50x. Conclusion: FWL offers a new paradigm for ghost removal in mobile scenarios; the Ghost-FWL dataset and FWL-MAE model lay a foundation for this direction and substantially improve downstream perception and localization tasks. Abstract: LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghosts), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal relies on geometric consistency in dense point clouds, failing on mobile LiDAR's sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100x larger than existing annotated FWL datasets. Benefiting from this large-scale dataset, we establish an FWL-based baseline model for ghost detection and propose FWL-MAE, a masked autoencoder for efficient self-supervised representation learning on FWL data. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhances downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50x false positive reduction). The dataset and code are publicly available and can be accessed via the project page: https://keio-csg.github.io/Ghost-FWL
[312] TwinMixing: A Shuffle-Aware Feature Interaction Model for Multi-Task Segmentation
Minh-Khoi Do,Huy Che,Dinh-Duy Phan,Duc-Khai Lam,Duc-Lung Vu
Main category: cs.CV
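TL;DR: This paper proposes TwinMixing, a lightweight multi-task segmentation model for drivable-area and lane segmentation. It uses a shared encoder with task-specific decoders, an Efficient Pyramid Mixing (EPM) module, and Dual-Branch Upsampling (DBU) blocks to improve accuracy at low computational cost. On BDD100K, the base version reaches 92.0% mIoU (drivable area) and 32.3% IoU (lane) with 0.43M parameters and 3.95 GFLOPs, outperforming existing methods and suiting real-time embedded deployment.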
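The IoU and mIoU figures quoted for this entry follow the usual intersection-over-union definition. The sketch below is a generic illustration of that definition only; the function names `iou` and `mean_iou`, the flat label lists, and the toy masks are assumptions, and the exact evaluation protocol used on BDD100K (which classes are averaged, how images are aggregated) is not stated in this digest.

```python
def iou(pred, target, cls):
    """Intersection over union for one class, over flat lists of labels."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    # A class that appears in neither mask has no defined IoU.
    return inter / union if union else float("nan")


def mean_iou(pred, target, classes):
    """Average the per-class IoU, skipping classes with undefined IoU."""
    scores = [iou(pred, target, c) for c in classes]
    valid = [s for s in scores if s == s]  # NaN is the only value != itself
    return sum(valid) / len(valid) if valid else float("nan")


# Hypothetical 4-pixel masks: 0 = background, 1 = drivable area.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(iou(pred, target, 1), mean_iou(pred, target, [0, 1]))
```

In this toy case class 1 has two overlapping pixels out of three in the union, and class 0 has one out of two, so the mean sits between the two per-class values.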
Details
Motivation: Achieving accurate, real-time drivable-area and lane segmentation on low-cost hardware remains challenging and requires balancing accuracy and efficiency. Method: TwinMixing uses a shared encoder and task-specific decoders. The encoder introduces an Efficient Pyramid Mixing (EPM) module that combines grouped convolutions, depthwise dilated convolutions, and channel shuffle to extract multi-scale features efficiently; each decoder uses a Dual-Branch Upsampling (DBU) block that combines a learnable transposed convolution (fine-grained) with parameter-free bilinear interpolation (coarse-grained) for feature reconstruction. Result: On BDD100K, TwinMixing-base reaches 92.0% mIoU (drivable area) and 32.3% IoU (lane) with only 0.43M parameters and 3.95 GFLOPs, outperforming existing methods and showing potential for real-time deployment. Conclusion: With its compact, modular design, TwinMixing strikes a strong balance between accuracy and efficiency and offers a practical multi-task segmentation solution for embedded perception systems in autonomous driving. Abstract: Accurate and efficient perception is essential for autonomous driving, where segmentation tasks such as drivable-area and lane segmentation provide critical cues for motion planning and control. However, achieving high segmentation accuracy while maintaining real-time performance on low-cost hardware remains a challenging problem. To address this issue, we introduce TwinMixing, a lightweight multi-task segmentation model designed explicitly for drivable-area and lane segmentation. The proposed network features a shared encoder and task-specific decoders, enabling both feature sharing and task specialization. Within the encoder, we propose an Efficient Pyramid Mixing (EPM) module that enhances multi-scale feature extraction through a combination of grouped convolutions, depthwise dilated convolutions and channel shuffle operations, effectively expanding the receptive field while minimizing computational cost. Each decoder adopts a Dual-Branch Upsampling (DBU) Block composed of a learnable transposed convolution-based Fine detailed branch and a parameter-free bilinear interpolation-based Coarse grained branch, achieving detailed yet spatially consistent feature reconstruction. Extensive experiments on the BDD100K dataset validate the effectiveness of TwinMixing across three configurations - tiny, base, and large. 
Among them, the base configuration achieves the best trade-off between accuracy and computational efficiency, reaching 92.0% mIoU for drivable-area segmentation and 32.3% IoU for lane segmentation with only 0.43M parameters and 3.95 GFLOPs. Moreover, TwinMixing consistently outperforms existing segmentation models on the same tasks, as illustrated in Fig. 1. Thanks to its compact and modular design, TwinMixing demonstrates strong potential for real-time deployment in autonomous driving and embedded perception systems. The source code: https://github.com/Jun0se7en/TwinMixing.
[313] DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning
Weimin Liu,Qingkun Li,Jiyuan Qiu,Wenjun Wang,Joshua H. Meng
Main category: cs.CV
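TL;DR: This paper proposes DiffAttn, a diffusion-based framework for predicting drivers' visual attention. It combines a Swin Transformer encoder, a multi-scale feature-fusion decoder, and a large language model (LLM) layer to model local and global scene context and safety-critical cues, and reaches SOTA performance on several public datasets.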
Details
Motivation: Drivers' visual attention is a key perceptual signal for anticipating latent hazards and supporting decision-making and control, and its absence threatens traffic safety; existing methods still fall short in the accuracy of modeling attention distributions, fine-grained scene understanding, and semantic reasoning. Method: DiffAttn formulates attention prediction as a conditional diffusion-denoising process; uses a Swin Transformer encoder to extract multi-scale scene features; designs a decoder that combines a Feature Fusion Pyramid (FFP) with dense multi-scale conditional diffusion; and adds a large language model (LLM) layer to strengthen top-down semantic reasoning and sensitivity to safety-critical cues. Result: On four public datasets it surpasses most video-based, top-down-feature-driven, and LLM-enhanced baselines, reaching state-of-the-art (SoTA) performance; it also supports interpretable driver-centric scene understanding. Conclusion: DiffAttn gives intelligent connected vehicles a more robust, interpretable, and semantically enhanced visual attention prediction capability, with potential to improve in-cabin human-machine interaction, risk perception, and driver state measurement. Abstract: Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt Swin Transformer as the encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers' state measurement in intelligent vehicles.
[314] TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K
Mattia D'Urso,Yuxi Hu,Christian Sormann,Mattia Rossi,Friedrich Fraundorfer
Main category: cs.CV
TL;DR: TerraSky3D is a new high-resolution, large-scale 3D reconstruction dataset with 50,000 images across 150 diverse scenes (ground, aerial, mixed), focused on European landmarks and including calibration data, camera poses, and depth maps.
Details
Motivation: The scarcity of suitable public 3D datasets, which are often low-resolution, limited in scale or scene variety, or inconsistent in image quality, motivates the creation of TerraSky3D. Method: The authors captured and curated TerraSky3D: a dataset of 50,000 high-resolution images from 150 ground, aerial, and mixed scenes of European landmarks, accompanied by accurate calibration parameters, camera poses, and depth maps. Result: TerraSky3D provides a challenging, large-scale, high-quality 3D reconstruction dataset designed to support training and evaluation of modern 3D reconstruction pipelines. Conclusion: TerraSky3D fills a critical gap in the availability of robust, diverse, and well-annotated 3D reconstruction datasets, enabling advancement in related algorithms and benchmarks. Abstract: Despite the growing data needs of increasingly sophisticated 3D reconstruction pipelines, we can still observe a scarcity of suitable public datasets. Existing 3D datasets are either low resolution, limited to a small number of scenes, based on images of varying quality because they were retrieved from the internet, or limited to specific capturing scenarios. Motivated by this lack of suitable 3D datasets, we captured TerraSky3D, a high-resolution large-scale 3D reconstruction dataset comprising 50,000 images divided into 150 ground, aerial, and mixed scenes. The dataset focuses on European landmarks and comes with curated calibration data, camera poses, and depth maps. TerraSky3D tries to answer the need for a challenging dataset that can be used to train and evaluate 3D reconstruction-related pipelines.
[315] DinoDental: Benchmarking DINOv3 as a Unified Vision Encoder for Dental Image Analysis
Kun Tang,Xinquan Yang,Mianjie Zheng,Xuefen Liu,Xuguang Li,Xiaoqi Guo,Ruihan Chen,Linlin Shen,He Meng
Main category: cs.CV
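TL;DR: This paper introduces the DinoDental benchmark, which systematically evaluates how well the self-supervised vision foundation model DINOv3 transfers to dental image analysis, validates it as an off-the-shelf encoder, and compares the performance of different adaptation strategies.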
Details
Motivation: Expert annotations in dental imaging are scarce and expensive, and the reliability of general-purpose self-supervised vision models such as DINOv3 in the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. Method: The authors build a unified benchmark, DinoDental, covering classification, detection, and instance segmentation on panoramic radiographs and intraoral photographs; they analyze DINOv3's transfer performance by varying model size and input resolution and by comparing adaptation strategies including frozen features, full fine-tuning, and LoRA. Result: DINOv3 can serve as a strong unified encoder for dental image analysis, remaining competitive across tasks, with particularly clear advantages for intraoral image understanding and boundary-sensitive dense prediction. Conclusion: DinoDental provides a systematic framework for evaluating the suitability of DINOv3 in dentistry and establishes a foundational benchmark to guide efficient and effective model selection and adaptation in the dental AI community. Abstract: The scarcity and high cost of expert annotations in dental imaging present a significant challenge for the development of AI in dentistry. DINOv3, a state-of-the-art, self-supervised vision foundation model pre-trained on 1.7 billion images, offers a promising pathway to mitigate this issue. However, its reliability when transferred to the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. To address this, we introduce DinoDental, a unified benchmark designed to systematically evaluate whether DINOv3 can serve as a reliable, off-the-shelf encoder for comprehensive dental image analysis without requiring domain-specific pre-training. Constructed from multiple public datasets, DinoDental covers a wide range of tasks, including classification, detection, and instance segmentation on both panoramic radiographs and intraoral photographs. We further analyze the model's transfer performance by scaling its size and input resolution, and by comparing different adaptation strategies, including frozen features, full fine-tuning, and the parameter-efficient Low-Rank Adaptation (LoRA) method. Our experiments show that DINOv3 can serve as a strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks while showing particularly clear advantages for intraoral image understanding and boundary-sensitive dense prediction. 
Collectively, DinoDental provides a systematic framework for comprehensively evaluating DINOv3 in dental analysis, establishing a foundational benchmark to guide efficient and effective model selection and adaptation for the dental AI community.
[316] Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification
Yangmei Chen,Zhongyuan Zhang,Xikun Zhang,Xinyu Hao,Mingliang Hou,Renqiang Luo,Ziqi Xu
Main category: cs.CV
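TL;DR: This paper proposes the PEMV-thyroid framework, which uses multi-view learning and a prototype-enhancement mechanism to improve the robustness and generalization of thyroid ultrasound nodule classification across devices and centers.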
Details
Motivation: Because thyroid ultrasound images are highly heterogeneous, existing deep learning classifiers tend to capture spurious correlations, which leads to poor generalization when deployed across devices or centers. Method: The Prototype-Enhanced Multi-View (PEMV-thyroid) framework fuses multi-view feature representations and introduces a prototype-based correction mechanism that uses mixed prototype information to refine decision boundaries. Result: Experiments on multiple thyroid ultrasound datasets show that the method consistently outperforms SOTA methods in cross-device and cross-domain evaluation, improving diagnostic accuracy and clinical generalization. Conclusion: By combining multi-view learning with prototype-level guidance, PEMV-thyroid learns more stable representations of heterogeneous ultrasound images and offers a new approach to robust medical image analysis. Abstract: Thyroid nodule classification using ultrasound imaging is essential for early diagnosis and clinical decision-making; however, despite promising performance on in-distribution data, existing deep learning methods often exhibit limited robustness and generalisation when deployed across different ultrasound devices or clinical environments. This limitation is mainly attributed to the pronounced heterogeneity of thyroid ultrasound images, which can lead models to capture spurious correlations rather than reliable diagnostic cues. To address this challenge, we propose PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework that accounts for data heterogeneity by learning complementary representations from multiple feature perspectives and refining decision boundaries through a prototype-based correction mechanism with mixed prototype information. By integrating multi-view representations with prototype-level guidance, the proposed approach enables more stable representation learning under heterogeneous imaging conditions. Extensive experiments on multiple thyroid ultrasound datasets demonstrate that PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain evaluation scenarios, leading to improved diagnostic accuracy and generalisation performance in real-world clinical settings. The source code is available at https://github.com/chenyangmeii/Prototype-Enhanced-Multi-View-Learning.
[317] Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
Luke Palmer,Petar Palasek,Hazem Abdelkawy
Main category: cs.CV
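TL;DR: This paper formulates gaze modelling as an autoregressive dynamical system. Using the Affinity Relation Transformer (ART) and the Object Density Network (ODN), it explicitly models the dynamic interaction between gaze and environment in driving scenes, and it releases a new dataset, Focus100; the resulting gaze trajectories, scanpaths, and saliency maps are more natural than those of existing models.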
Details
Motivation: Existing methods typically collapse gaze into saliency maps or scanpaths and handle gaze dynamics only implicitly, yet accurately modelling human gaze is essential for vision applications such as automotive safety. Method: Gaze is modelled as an autoregressive dynamical system that explicitly unrolls raw gaze trajectories; driving scenes are represented as gaze-centric graphs, and a heterogeneous graph transformer (ART) models the interactions between gaze, traffic objects, and road structure; the Object Density Network (ODN) predicts next-step gaze distributions; a new dataset, Focus100, is released and the model is trained directly on raw gaze data. Result: The method produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models and offers new insights into the temporal modelling of human attention in dynamic environments. Conclusion: Explicitly modelling gaze dynamics and their interaction with the environment is key to better modelling of human attention in driving scenes; the joint ART and ODN framework and the Focus100 dataset support this direction. Abstract: Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.
[318] SFDemorpher: Generalizable Face Demorphing for Operational Morphing Attack Detection
Raul Ismayilov,Luuk Spreeuwers
Main category: cs.CV
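TL;DR: This paper proposes the SFDemorpher framework, which performs identity disentanglement jointly in the StyleGAN latent space and a high-dimensional feature space to improve the generalizability and deployability of Differential Morphing Attack Detection (D-MAD).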
Details
Motivation: Existing face morphing attack detection methods lack the generalizability needed for real deployment because training data is limited and they assume all inputs are morphs. Method: SFDemorpher uses a dual-pass training strategy and a hybrid dataset of predominantly synthetic identities to perform identity disentanglement jointly in the StyleGAN latent space and a high-dimensional feature space; it handles morphed and bona fide document images in a unified way. Result: It achieves SOTA generalization across unseen identities, diverse capture conditions, and 13 morphing techniques; it widens the margin between the score distributions of bona fide and morphed samples and provides high-fidelity visual reconstructions that aid explainability. Conclusion: SFDemorpher improves the robustness and practicality of D-MAD in real scenarios such as border verification and document enrollment, offering a new paradigm for operational deployment of face morphing defenses. Abstract: Face morphing attacks compromise biometric security by creating document images that verify against multiple identities, posing significant risks from document issuance to border control. Differential Morphing Attack Detection (D-MAD) offers an effective countermeasure, particularly when employing face demorphing to disentangle identities blended in the morph. However, existing methods lack operational generalizability due to limited training data and the assumption that all document inputs are morphs. This paper presents SFDemorpher, a framework designed for the operational deployment of face demorphing for D-MAD that performs identity disentanglement within joint StyleGAN latent and high-dimensional feature spaces. We introduce a dual-pass training strategy handling both morphed and bona fide documents, leveraging a hybrid corpus with predominantly synthetic identities to enhance robustness against unseen distributions. Extensive evaluation confirms state-of-the-art generalizability across unseen identities, diverse capture conditions, and 13 morphing techniques, spanning both border verification and the challenging document enrollment stage. Our framework achieves superior D-MAD performance by widening the margin between the score distributions of bona fide and morphed samples while providing high-fidelity visual reconstructions facilitating explainability.
[319] Integrating Multimodal Large Language Model Knowledge into Amodal Completion
Heecheol Yun,Eunho Yang
Main category: cs.CV
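TL;DR: This paper proposes AmodalCG, a framework that uses the physical commonsense knowledge of multimodal large language models (MLLMs) to guide amodal completion. Under heavy occlusion it calls an MLLM to reason about the extent and content of the missing regions, then iteratively refines the completion with a visual generative model, substantially improving performance.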
Details
Motivation: Existing amodal completion methods either rely only on visual generative models, which lack physical commonsense, or use MLLM knowledge only in the segmentation stage, so it cannot explicitly guide completion; humans infer occluded regions from prior experience and common sense, so real-world knowledge should be integrated explicitly into the completion process. Method: The AmodalCG framework (1) assesses the degree of occlusion and selectively invokes an MLLM only when the target is heavily occluded; (2) uses the MLLM to reason separately about the extent and the content of the missing regions; (3) has a visual generative model integrate the MLLM guidance and iteratively refine the completion. Result: Experiments on various real-world image datasets show that AmodalCG clearly outperforms all existing methods. Conclusion: The real-world knowledge in multimodal large language models can effectively improve amodal completion and is a promising direction for this challenging task. Abstract: With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates this guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.
[320] VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning
Li-Heng Chen,Ke Cheng,Yahui Liu,Lei Shi,Shi-Sheng Huang,Hongbo Fu
Main category: cs.CV
TL;DR: VistaGEN is a new driving video generation method that enables fine-grained, object-level controllability (e.g., 3D objects, images, text) while ensuring spatiotemporal consistency in long videos, via a closed-loop generation-evaluation-regeneration framework with multiview vision-language reasoning and an object-level refinement module.
Details
Motivation: Existing driving video generation methods lack fine-grained object-level controllability and struggle to maintain spatiotemporal consistency—especially in long videos. Method: Introduces VistaGEN, incorporating multiview visual-language reasoning into a video generator and proposing a multiview vision-language evaluator (MV-VLM) to assess spatiotemporal consistency; establishes a closed-loop generation-evaluation-regeneration mechanism with an object-level refinement module. Result: VistaGEN achieves superior fine-grained controllability (especially for long-tail objects) and significantly better spatiotemporal consistency than prior methods, as validated by extensive evaluation. Conclusion: The proposed closed-loop framework with multiview vision-language modeling effectively bridges the gap between controllability and consistency in long driving video generation. Abstract: Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos, while preserving the spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into the long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. 
This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Besides, within the closed-loop generation, we introduce an object-level refinement module to refine the unsatisfactory results identified by the MV-VLM and then feed them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.
[321] Optimized Weighted Voting System for Brain Tumor Classification Using MRI Images
Ha Anh Vu
Main category: cs.CV
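TL;DR: This paper proposes a weighted ensemble learning method that combines deep learning and traditional machine learning models with image enhancement techniques, achieving SOTA accuracy on brain tumor MRI classification.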
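The weighted voting idea in this entry can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `weighted_vote`, the use of each model's accuracy directly as its weight, the tie-breaking rule, and the example numbers are all assumptions.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per model.
    weights: one non-negative weight per model (assumed here to be each
             model's validation accuracy; the paper's exact weighting
             rule is not given in the abstract).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Highest total wins; ties go to the label that sorts first.
    return min(totals, key=lambda label: (-totals[label], label))


# Hypothetical example: three models vote on one MRI scan.
labels = ["glioma", "meningioma", "meningioma"]
accuracies = [0.95, 0.40, 0.45]
print(weighted_vote(labels, accuracies))
```

With these made-up weights the single strong model outvotes the two weak ones, which is the behavior that separates weighted voting from a plain majority vote.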
Details
Motivation: Accurately classifying brain tumors in MRI images is essential for clinical diagnosis and treatment planning, but single models have limited performance, so robustness and accuracy need to improve. Method: A weighted multi-model ensemble is built from ResNet101, DenseNet121, Xception, CNN-MRI, ResNet50 on edge-enhanced images, SVM, and KNN (HOG features), with preprocessing by balance contrast enhancement, K-means clustering, and Canny edge detection. Result: The method reaches state-of-the-art classification accuracy on the Figshare and Kaggle MRI datasets, outperforming existing methods. Conclusion: Weighted ensemble learning effectively combines the strengths of heterogeneous models to improve brain tumor classification and offers a reliable, scalable framework for medical image analysis. Abstract: The accurate classification of brain tumors from MRI scans is essential for effective diagnosis and treatment planning. This paper presents a weighted ensemble learning approach that combines deep learning and traditional machine learning models to improve classification performance. The proposed system integrates multiple classifiers, including ResNet101, DenseNet121, Xception, CNN-MRI, and ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. A weighted voting mechanism assigns higher influence to models with better individual accuracy, ensuring robust decision-making. Image processing techniques such as Balance Contrast Enhancement, K-means clustering, and Canny edge detection are applied to enhance feature extraction. Experimental evaluations on the Figshare and Kaggle MRI datasets demonstrate that the proposed method achieves state-of-the-art accuracy, outperforming existing models. These findings highlight the potential of ensemble-based learning for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.
[322] SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
Jiho Park,Sieun Choi,Jaeyoon Seo,Minho Sohn,Yeana Kim,Jihie Kim
Main category: cs.CV
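TL;DR: This paper proposes SEA, a reference-free sketch evaluation metric that quantifies how efficiently a sketch achieves semantic abstraction, and builds CommonSketch, the first semantically annotated sketch dataset, to support evaluation.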
Details
Motivation: Existing sketch evaluation methods cannot measure semantic abstraction efficiency, the core property of sketches, because they rely on reference images, low-level visual features, or recognition accuracy and ignore abstraction itself. Method: The SEA (Sketch Evaluation metric for Abstraction efficiency) metric extracts the key semantic elements of each sketch class from commonsense knowledge and uses a visual question answering model to judge whether each element is present, assessing visual economy while preserving semantic recognizability; the CommonSketch dataset is also built, with 23,100 human-drawn sketches across 300 classes, each with a caption and element-level annotations. Result: Experiments show that SEA agrees closely with human judgments and reliably distinguishes levels of abstraction efficiency; CommonSketch provides a systematic benchmark for element-level sketch understanding. Conclusion: SEA is an effective, reference-free metric for sketch abstraction efficiency, and CommonSketch provides the data to support it; together they advance research on sketch understanding and generation. Abstract: A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. 
Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.
[323] AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
Milton Zhou,Sizhong Qin,Yongzhi Li,Quan Chen,Peng Jiang
Main category: cs.CV
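TL;DR: AutoCut is an end-to-end framework for editing short advertisement videos. Through multimodal discretization and controllable editing, it unifies video, audio, and text representations and uses a multimodal large language model for video selection and ordering, script generation, and background music selection, reducing production cost and iteration time.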
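This entry discretizes audio and video features with residual vector quantization (RVQ). The sketch below shows the generic RVQ idea only: each stage picks the nearest codebook entry and passes the leftover residual to the next stage. The function names, the two tiny hand-written codebooks, and the 2-dimensional example vector are all assumptions; in practice the codebooks are learned, and AutoCut's encoders and codebook sizes are not described in this digest.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared L2 distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Hypothetical codebooks: a coarse stage followed by a fine stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
fine = [[-0.25, 0.0], [0.0, 0.0], [0.25, 0.0]]
tokens = rvq_encode([1.2, 1.0], [coarse, fine])
print(tokens, rvq_decode(tokens, [coarse, fine]))
```

The token list is what a language model would consume; adding stages shrinks the reconstruction error at the cost of more tokens per feature vector.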
Details
Motivation: Current short-video advertising workflows and AI tools are disjoint and modality-specific, leading to high cost and low efficiency. Method: The AutoCut framework (1) uses dedicated encoders to extract audio and video features; (2) applies residual vector quantization to discretize them into unified tokens aligned with text; (3) builds a multimodal large language model on a foundation model, combining multimodal alignment and supervised fine-tuning; (4) provides a complete production pipeline that converts token sequences into deployable long videos. Result: Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability. Conclusion: AutoCut offers an efficient, unified, and controllable approach to scalable short-video content creation. Abstract: Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
[324] Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models
Tao Xia,Jiawei Liu,Yukun Zhang,Ting Liu,Wei Wang,Lei Zhang
Main category: cs.CV
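TL;DR: This paper proposes a new text-guided image editing framework based on visual autoregressive (VAR) models. It combines coarse-to-fine localization of editable tokens, injection of structure-related features, and a reinforcement-learning-driven adaptive injection strategy to improve editing fidelity and structural consistency.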
Details
Motivation: Existing VAR-based image editing methods fall short in accurately localizing editable tokens and in keeping edited results structurally consistent. Method: A three-part approach: (1) a coarse-to-fine token localization strategy; (2) a structure feature injection mechanism designed from an analysis of intermediate-layer features; (3) reinforcement-learning-driven adaptive injection of features across scales and layers. Result: The method outperforms state-of-the-art methods on both local and global editing tasks, with better structural consistency and editing quality. Conclusion: By analyzing the intermediate feature distributions of VAR models and introducing an adaptive feature injection mechanism, the work addresses the two core problems of VAR editing: inaccurate localization and structural distortion. Abstract: Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
[325] SVH-BD : Synthetic Vegetation Hyperspectral Benchmark Dataset for Emulation of Remote Sensing Images
Chedly Ben Azizi,Claire Guilloteau,Gilles Roussel,Matthieu Puigt
Main category: cs.CV
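TL;DR: This paper presents a dataset of 10,915 synthetic hyperspectral image cubes with corresponding pixel-level vegetation trait maps, intended to support research on radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification.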
Details
Motivation: To provide a large-scale, high-quality, physically consistent synthetic hyperspectral dataset that supports research on radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification. Method: A lookup table built with the PROSAIL model is used to invert Sentinel-2 L2A surface reflectance and obtain vegetation traits; forward PROSAIL simulations then generate physically consistent hyperspectral reflectance cubes. The data cover four ecologically diverse regions and come with uncertainty maps and Sentinel-2 scene classification layers. Result: A dataset of 10,915 hyperspectral image cubes (211 bands, 400–2500 nm, 64×64 pixels) with corresponding vegetation trait maps, accompanied by uncertainty estimates and classification information. Conclusion: The dataset offers a controlled yet realistic benchmark resource for developing radiative transfer surrogate models, evaluating vegetation trait retrieval methods, and studying spectral-biophysical relationships. Abstract: This dataset provides a large collection of 10,915 synthetic hyperspectral image cubes paired with pixel-level vegetation trait maps, designed to support research in radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification. Each hyperspectral cube contains 211 bands spanning 400--2500 nm at 10 nm resolution and a fixed spatial layout of $64 \times 64$ pixels, offering continuous simulated surface reflectance spectra suitable for emulator development and machine-learning tasks requiring high spectral detail. Vegetation traits were derived by inverting Sentinel-2 Level-2A surface reflectance using a PROSAIL-based lookup-table approach, followed by forward PROSAIL simulations to generate hyperspectral reflectance under physically consistent canopy and illumination conditions. The dataset covers four ecologically diverse regions -- East Africa, Northern France, Eastern India, and Southern Spain -- and includes 5th and 95th percentile uncertainty maps as well as Sentinel-2 scene classification layers. This resource enables benchmarking of inversion methods, development of fast radiative transfer emulators, and studies of spectral--biophysical relationships under controlled yet realistic environmental variability.
[326] EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation
Sravanth Kodavanti,Manjunath Arveti,Sowmya Vajrala,Srinivas Miriyala,Vikram N R
Main category: cs.CV
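TL;DR: This paper proposes EdgeDiT, a lightweight diffusion transformer architecture designed for mobile NPUs (such as the Qualcomm Hexagon and Apple ANE) that substantially reduces parameter count, compute, and latency while preserving generation quality and scalability.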
Details
Motivation: Diffusion Transformers (DiT) reach state-of-the-art image generation, but their high compute and memory costs hinder local deployment on resource-constrained edge devices. Method: A hardware-aware optimization framework systematically identifies and prunes structural redundancies in the DiT backbone that are costly for mobile data flows, yielding a family of lightweight models for mobile NPUs. Result: Parameters drop by 20–30%, FLOPs by 36–46%, and on-device latency by a factor of 1.65; the models offer a better Pareto trade-off between FID and inference latency than optimized mobile U-Nets and vanilla DiT. Conclusion: EdgeDiT provides a scalable hardware co-design blueprint for moving large generative foundation models from high-end GPUs to end-user devices, enabling responsive, private, and offline generative AI. Abstract: Diffusion Transformers (DiT) have established a new state-of-the-art in high-fidelity image synthesis; however, their massive computational complexity and memory requirements hinder local deployment on resource-constrained edge devices. In this paper, we introduce EdgeDiT, a family of hardware-efficient generative transformers specifically engineered for mobile Neural Processing Units (NPUs), such as the Qualcomm Hexagon and Apple Neural Engine (ANE). By leveraging a hardware-aware optimization framework, we systematically identify and prune structural redundancies within the DiT backbone that are particularly taxing for mobile data-flows. Our approach yields a series of lightweight models that achieve a 20-30% reduction in parameters, a 36-46% decrease in FLOPs, and a 1.65-fold reduction in on-device latency without sacrificing the scaling advantages or the expressive capacity of the original transformer architecture. Extensive benchmarking demonstrates that EdgeDiT offers a superior Pareto-optimal trade-off between Fréchet Inception Distance (FID) and inference latency compared to both optimized mobile U-Nets and vanilla DiT variants. By enabling responsive, private, and offline generative AI directly on-device, EdgeDiT provides a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to the palm of the user.
[327] Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation
Weichao Cai,Weiliang Huang,Biao Xue,Chao Huang,Fei Yuan,Bob Zhang
Main category: cs.CV
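TL;DR: This paper proposes a Multi-task Complementary Learning Framework (MCLF) for maritime scenes which, together with the newly built Infrared-Visible Maritime Ship Dataset (IVMSD), jointly performs image restoration, multimodal fusion, and semantic segmentation, markedly improving perception robustness and segmentation accuracy under complex sea conditions.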
Details
Motivation: Existing methods struggle with coupled degradations in maritime environments such as fog and strong reflections, and lack mechanisms for end-to-end joint optimization of structural recovery and semantic effectiveness; there is also no public infrared-visible dataset that faithfully reflects maritime degradation characteristics. Method: The authors build the IVMSD dataset and propose MCLF, which comprises a Frequency-Spatial Enhancement Complementary (FSEC) module, a Semantic-Visual Consistency Attention (SVCA) module, and a cross-modality guided attention mechanism, modeling image restoration, multimodal fusion, and semantic segmentation in a unified way. Result: Experiments on IVMSD show state-of-the-art semantic segmentation performance and markedly better robustness and perceptual quality under complex maritime conditions. Conclusion: Through multi-task collaboration and cross-modal complementary design, MCLF mitigates the impact of maritime image degradation on semantic understanding, supporting the effectiveness and necessity of end-to-end joint optimization for maritime scene understanding. Abstract: Marine scene understanding and segmentation play a vital role in maritime monitoring and navigation safety. However, prevalent factors like fog and strong reflections in maritime environments cause severe image degradation, significantly compromising the stability of semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus solely on visual quality, lacking end-to-end collaborative mechanisms that simultaneously improve structural recovery and semantic effectiveness. Moreover, publicly available infrared-visible datasets are predominantly collected from urban scenes, failing to capture the authentic characteristics of coupled degradations in marine environments. To address these challenges, the Infrared-Visible Maritime Ship Dataset (IVMSD) is proposed to cover various maritime scenarios under diverse weather and illumination conditions. Building upon this dataset, a Multi-task Complementary Learning Framework (MCLF) is proposed to collaboratively perform image restoration, multimodal fusion, and semantic segmentation within a unified architecture. The framework includes a Frequency-Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, a Semantic-Visual Consistency Attention (SVCA) module for semantic-consistent guidance, and a cross-modality guided attention mechanism for selective fusion.
Experimental results on IVMSD demonstrate that the proposed method achieves state-of-the-art segmentation performance, significantly enhancing robustness and perceptual quality under complex maritime conditions.
[328] From Pixels to Reality: Physical-Digital Patch Attacks on Real-World Camera
Victoria Leonenkova,Ekaterina Shumitskaya,Dmitriy Vatolin,Anastasia Antsiferova
Main category: cs.CV
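TL;DR: This paper proposes DiPA, a new digital-physical adversarial attack that targets camera-based authentication systems by displaying adversarial patches on a smartphone screen, offering high transferability, easy deployment, and no need for printing.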
Details
Motivation: To address security vulnerabilities in camera-based authentication systems, and in particular the reliance of physical adversarial attacks on printed patches with their slow deployment and poor transferability, by proposing a more practical and efficient digital-physical attack. Method: DiPA displays the adversarial patch digitally on a phone screen without printing; it trains against an ensemble of state-of-the-art face-recognition models (ArcFace, MagFace, CosFace) to improve transfer to unseen commercial systems; and it supports adjusting the patch dynamically in real time while observing the attack's effect. Result: Experiments show that DiPA outperforms existing physical attacks in success rate, feature-space distortion, and confidence reduction, and a real-time dodging attack is demonstrated successfully. Conclusion: DiPA exposes critical security vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructure, offering new perspectives and challenges for improving system robustness. Abstract: This demonstration presents Digital-Physical Adversarial Attacks (DiPA), a new class of practical adversarial attacks against pervasive camera-based authentication systems, where an attacker displays an adversarial patch directly on a smartphone screen instead of relying on printed artifacts. This digital-only physical presentation enables rapid deployment, removes the need for total-variation regularization, and improves patch transferability in black-box conditions. DiPA leverages an ensemble of state-of-the-art face-recognition models (ArcFace, MagFace, CosFace) to enhance transfer across unseen commercial systems. Our interactive demo shows a real-time dodging attack against a deployed face-recognition camera, preventing authorized users from being recognized while participants dynamically adjust patch patterns and observe immediate effects on the sensing pipeline. We further demonstrate DiPA's superiority over existing physical attacks in terms of success rate, feature-space distortion, and reductions in detection confidence, highlighting critical vulnerabilities at the intersection of mobile devices, pervasive vision, and sensor-driven authentication infrastructures.
[329] GeoHCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting
Xuan Deng,Xiandong Meng,Hengyu Man,Qiang Zhu,Tiange Zhang,Debin Zhao,Xiaopeng Fan
Main category: cs.CV
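TL;DR: This paper proposes GeoHCC, a geometry-aware compression framework for 3D Gaussian Splatting that introduces Neighborhood-Aware Anchor Pruning (NAAP) and hierarchical geometry-guided entropy coding to substantially cut storage cost while preserving geometric integrity and rendering quality.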
Details
Motivation: 3D Gaussian Splatting (3DGS) supports high-fidelity real-time rendering but has a large storage footprint; existing anchor-based compression methods ignore explicit geometric dependencies, leading to structural degradation and poor rate-distortion performance. Method: The GeoHCC framework comprises (1) Neighborhood-Aware Anchor Pruning (NAAP), which scores anchor importance via weighted neighborhood feature aggregation and merges redundant anchors; and (2) hierarchical entropy coding on the optimized anchor structure, using a lightweight Geometry-Guided Convolution (GG-Conv) to model coarse-to-fine priors for spatially adaptive context modeling. Result: Experiments show that GeoHCC surpasses state-of-the-art anchor-based compression methods in structure preservation, geometric integrity, and rendering fidelity. Conclusion: Explicitly modeling geometric correlations between anchors mitigates structural degradation in 3DGS compression and improves rate-distortion performance, opening a new path toward practical deployment. Abstract: Although 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, its prohibitive storage overhead severely hinders practical deployment. Recent anchor-based 3DGS compression schemes reduce redundancy through context modeling, yet overlook explicit geometric dependencies, leading to structural degradation and suboptimal rate-distortion performance. In this paper, we propose GeoHCC, a geometry-aware 3DGS compression framework that incorporates inter-anchor geometric correlations into anchor pruning and entropy coding for compact representation. We first introduce Neighborhood-Aware Anchor Pruning (NAAP), which evaluates anchor importance via weighted neighborhood feature aggregation and merges redundant anchors into salient neighbors, yielding a compact yet geometry-consistent anchor set. Building upon this optimized structure, we further develop a hierarchical entropy coding scheme, in which coarse-to-fine priors are exploited through a lightweight Geometry-Guided Convolution (GG-Conv) operator to enable spatially adaptive context modeling and rate-distortion optimization. Extensive experiments demonstrate that GeoHCC effectively resolves the structure preservation bottleneck, maintaining superior geometric integrity and rendering fidelity over state-of-the-art anchor-based approaches.
[330] $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation
Linqian Fan,Peiqin Sun,Tiancheng Wen,Shun Lu,Chengru Song
Main category: cs.CV
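TL;DR: This paper proposes a new paradigm that recasts distribution matching as a reward ($R_{dm}$), unifying Diffusion Matching Distillation (DMD) with reinforcement learning (RL), and designs Group Normalized Distribution Matching (GNDM) to improve stability, flexibility, and sampling efficiency, substantially outperforming existing methods on metrics such as FID and HPS.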
Details
Motivation: Conventional diffusion distillation is limited by objectives anchored solely to the teacher model; existing methods that add reinforcement learning mostly use a simple weighted sum of objectives, which makes optimization unstable and reward integration inflexible. Method: Distribution matching is modeled as an optimizable reward $R_{dm}$; Group Normalized Distribution Matching (GNDM) is proposed to stabilize $R_{dm}$ estimation, and the formulation supports adaptive multi-reward fusion and importance sampling (IS). Result: GNDM lowers FID by 1.87; the multi-reward variant GNDMR reaches an HPS of 30.37 and an FID-SD of 12.21, balancing aesthetic quality and fidelity. Conclusion: The $R_{dm}$ paradigm offers a more flexible, stable, and efficient unified framework for real-time high-fidelity generation. Abstract: Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student's performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21.
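The group normalization that GNDM adapts from RL can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so this assumes the common form in which each reward in a group is centered on the group mean and scaled by the group standard deviation; the function name `group_normalize`, the `eps` constant, and the sample reward values are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Center rewards on the group mean and scale by the group std.

    Assumed form of RL-style group normalization; the paper's exact
    estimator for R_dm may differ.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical distribution-matching rewards for samples drawn from one group.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_normalize(group_rewards)

# Samples above the group mean get a positive signal, those below a negative one.
for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:.2f} normalized={advantage:+.3f}")
```

Because the normalized values sum to roughly zero within a group, the update direction depends on how a sample compares with its peers rather than on the absolute scale of the reward.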
Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.
[331] Decoupling Wavelet Sub-bands for Single Source Domain Generalization in Fundus Image Segmentation
Shramana Dey,Varun Ajith,Abhirup Banerjee,Sushmita Mitra
Main category: cs.CV
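TL;DR: This paper proposes WaveSDG, a wavelet-guided segmentation network for single-source domain generalization that decouples anatomical structure from domain-specific appearance via wavelet sub-band decomposition; its WISER module strengthens low-frequency global anatomical information and refines high-frequency directional edges, markedly improving the robustness and stability of fundus image segmentation across devices and settings.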
Details
Motivation: Domain generalization for fundus images faces large variation across devices and clinical environments, and annotated data are expensive and restricted by privacy; existing single-source domain generalization (SDG) methods struggle to model anatomical topology or to separate appearance from anatomical features. Method: WaveSDG is built around the Wavelet-based Invariant Structure Extraction and Refinement (WISER) module, which uses wavelet sub-band decomposition to process encoder features according to their semantic roles: low-frequency components are reinforced to anchor global anatomy, while directional edges in the high-frequency sub-bands are selectively enhanced and noise is suppressed. Result: On optic cup and optic disc segmentation with one source domain and five unseen target domains, WaveSDG outperforms seven state-of-the-art methods in balanced Dice score and 95% Hausdorff distance, with lower variance, indicating better accuracy, robustness, and cross-domain stability. Conclusion: Through wavelet-guided feature decoupling and structure refinement, WaveSDG eases the conflict between anatomical consistency and appearance variation in single-source domain generalization, offering a new direction for unsupervised cross-domain adaptation in medical imaging. Abstract: Domain generalization in fundus imaging is challenging due to variations in acquisition conditions across devices and clinical settings. The inability to adapt to these variations causes performance degradation on unseen domains for deep learning models. Besides, obtaining annotated data across domains is often expensive and privacy constraints restrict their availability. Although single-source domain generalization (SDG) offers a realistic solution to this problem, the existing approaches frequently fail to capture anatomical topology or decouple appearance from anatomical features. This research introduces WaveSDG, a new wavelet-guided segmentation network for SDG. It decouples anatomical structure from domain-specific appearance through a wavelet sub-band decomposition. A novel Wavelet-based Invariant Structure Extraction and Refinement (WISER) module is proposed to process encoder features by leveraging distinct semantic roles of each wavelet sub-band. The module refines low-frequency components to anchor global anatomy, while selectively enhancing directional edges and suppressing noise within the high-frequency sub-bands. Extensive ablation studies validate the effectiveness of the WISER module and its decoupling strategy. Our evaluations on optic cup and optic disc segmentation across one source and five unseen target datasets show that WaveSDG consistently outperforms seven state-of-the-art methods.
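To make the sub-band split concrete, here is a minimal Python sketch of a single-level 2D wavelet decomposition. It assumes an orthonormal Haar wavelet, which the abstract does not specify, and it covers only the decomposition step, not the WISER module's refinement; the function name `haar_dwt2` and the toy patch are hypothetical.

```python
def haar_dwt2(img):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four sub-bands: a low-frequency approximation and three
    high-frequency detail bands (horizontal, vertical, diagonal).
    """
    approx, horiz, vert, diag = [], [], [], []
    for i in range(0, len(img), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            a_row.append((a + b + c + d) / 2)  # local average: coarse structure
            h_row.append((a + b - c - d) / 2)  # top minus bottom: horizontal edges
            v_row.append((a - b + c - d) / 2)  # left minus right: vertical edges
            d_row.append((a - b - c + d) / 2)  # diagonal detail
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag


# Toy 2x2 patch containing a vertical edge (dark left column, bright right column).
patch = [[0, 1], [0, 1]]
approx, horiz, vert, diag = haar_dwt2(patch)
print(approx, horiz, vert, diag)
```

In this toy case only the approximation and the vertical-edge band are non-zero, which is the kind of separation between coarse structure and directional detail that a sub-band-specific module can exploit.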
Notably, it achieves the best balanced Dice score and lowest 95th percentile Hausdorff distance with reduced variance, indicating improved accuracy, robustness, and cross-domain stability.
[332] Post-hoc Self-explanation of CNNs
Ahcène Boubekki,Line H. Clemmensen
Main category: cs.CV
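TL;DR: This paper proposes a $k$-means-based post-hoc explanation method that replaces the last layer of a CNN with a $k$-means classifier and uses intermediate features (especially from shallower layers) to generate concept-level, spatially consistent explanation maps, improving interpretability while preserving performance.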
Details
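Motivation: Standard CNNs can be mathematically reinterpreted as self-explainable models, but their built-in prototypes do not accurately represent the data; interpretability needs to be improved without sacrificing performance. Method: The final linear layer of the CNN is replaced with a $k$-means classifier; $k$-means-based post-hoc explanations are formalized in a common way for the classifier, the encoder's final output (B4), and intermediate features (such as B234); the spatial consistency of convolutional receptive fields is combined with gradient-free feature attribution to produce concept-based explanation maps. Result: Validated on ResNet34: using shallower features (such as B234) improves semantic fidelity at the cost of a slight drop in predictive performance; overall the method strikes a good balance between interpretability and performance. Conclusion: Replacing the classifier with $k$-means and extending the explanations to intermediate features gives CNNs gradient-free, spatially meaningful, and semantically clearer self-explanation, a practical and theoretically consistent route to explainable AI. Abstract: Although standard Convolutional Neural Networks (CNNs) can be mathematically reinterpreted as Self-Explainable Models (SEMs), their built-in prototypes do not on their own accurately represent the data. Replacing the final linear layer with a $k$-means-based classifier addresses this limitation without compromising performance. This work introduces a common formalization of $k$-means-based post-hoc explanations for the classifier, the encoder's final output (B4), and combinations of intermediate feature activations. The latter approach leverages the spatial consistency of convolutional receptive fields to generate concept-based explanation maps, which are supported by gradient-free feature attribution maps. Empirical evaluation with a ResNet34 shows that using shallower, less compressed feature activations, such as those from the last three blocks (B234), results in a trade-off between semantic fidelity and a slight reduction in predictive performance.
[333] CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains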
Wenhan Wang,Zhixiang Zhou,Zhongtian Ma,Yanzhu Chen,Ziyu Lin,Hao Sheng,Pengfei Liu,Honglin Ma,Wenqi Shao,Qiaosheng Zhang,Yu Qiao
Main category: cs.CV
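TL;DR: This paper proposes CiQi-Agent, a domain-specific agent for the connoisseurship of antique porcelain that supports multi-image input, vision tool invocation, and multimodal retrieval-augmented generation, performs fine-grained analysis across six attributes, and is accompanied by the large-scale expert-annotated dataset CiQi-VQA and the benchmark CiQi-Bench.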
Details
Motivation: Porcelain connoisseurship depends heavily on specialist knowledge and experience and is hard for non-experts to follow; an intelligent analysis system for this domain is needed to spread cultural-heritage knowledge and assist experts. Method: The authors build the large-scale expert-annotated dataset CiQi-VQA (29,596 porcelain specimens, 51,553 images, 557,940 VQA pairs) and the six-attribute benchmark CiQi-Bench; they design a tool-augmented reasoning framework combining a vision tool with multimodal retrieval tools, and train CiQi-Agent (7B) with supervised fine-tuning and reinforcement learning. Result: CiQi-Agent (7B) outperforms all compared open- and closed-source models on all six CiQi-Bench attributes, with average accuracy 12.2% higher than GPT-5; the model and dataset have been publicly released. Conclusion: CiQi-Agent shows that tool-augmented multimodal large models are effective for specialized cultural-relic connoisseurship, providing a scalable technical paradigm for intelligent cultural heritage preservation. Abstract: The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent -- a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question--answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5.
The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.
[334] INSID3: Training-Free In-Context Segmentation with DINOv3
Claudia Cuttano,Gabriele Trivigno,Christoph Reich,Daniel Cremers,Carlo Masone,Stefan Roth
Main category: cs.CV
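TL;DR: This paper proposes INSID3, a training-free in-context segmentation method that uses only frozen DINOv3 self-supervised features to segment concepts at multiple granularities, achieving state-of-the-art results on several tasks with fewer parameters and no supervision.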
Details
Motivation: Existing in-context segmentation methods either rely on fine-tuning, which harms generalization, or combine multiple frozen models, which leads to complex architectures and fixed granularity; the authors explore whether a single self-supervised backbone can support both semantic matching and segmentation. Method: Building on the strong spatial structure and semantic correspondence of DINOv3's large-scale dense self-supervised features, the training-free INSID3 segments at multiple granularities from frozen DINOv3 features alone, given a single annotated example. Result: INSID3 reaches state-of-the-art results on one-shot semantic, part, and personalized segmentation, improving mIoU by 7.5%, using 3x fewer parameters, and requiring no mask or category-level supervision. Conclusion: A single frozen self-supervised vision backbone (such as DINOv3) is enough for high-quality, multi-granularity in-context segmentation without any fine-tuning or auxiliary models, underscoring the potential of self-supervised representations for open-world segmentation. Abstract: In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work either (i) relies on fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3 .
[335] ConceptWeaver: Weaving Disentangled Concepts with Flow
Jintao Chen,Aiming Hao,Xiaoqing Chen,Chengyu Bai,Chubin Chen,Yanxun Li,Jiahong Wu,Xiangxiang Chu,Shanghang Zhang
Main category: cs.CV
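TL;DR: This paper proposes the ConceptWeaver framework, which builds on the finding that generation in flow models proceeds in three stages (blueprint, instantiation, refinement) to disentangle and manipulate concepts from a single image, injecting semantic offsets precisely at inference through the ConceptWeaver Guidance mechanism.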
Details
Motivation: Pre-trained flow-based generative models can synthesize complex scenes but lack a mechanism for directly disentangling and customizing underlying concepts from a single real image. Method: A differential probing technique analyzes the influence of concept tokens on the velocity field and reveals that generation unfolds in three stages; on this basis, ConceptWeaver uses stage-aware optimization to learn concept-specific semantic offsets from a single image and injects them at the matching stage through ConceptWeaver Guidance (CWG). Result: ConceptWeaver achieves high-fidelity, compositional synthesis and editing, validated across multiple experiments. Conclusion: Understanding and exploiting the intrinsic staged generation mechanism of flow models is key to precise, multi-granularity content manipulation. Abstract: Pre-trained flow-based models excel at synthesizing complex scenes yet lack a direct mechanism for disentangling and customizing their underlying concepts from one-shot real-world sources. To demystify this process, we first introduce a novel differential probing technique to isolate and analyze the influence of individual concept tokens on the velocity field over time. This investigation yields a critical insight: the generative process is not monolithic but unfolds in three distinct stages. An initial Blueprint Stage establishes low-frequency structure, followed by a pivotal Instantiation Stage where content concepts emerge with peak intensity and become naturally disentangled, creating an optimal window for manipulation. A final concept-insensitive refinement stage then synthesizes fine-grained details. Guided by this discovery, we propose ConceptWeaver, a framework for one-shot concept disentanglement. ConceptWeaver learns concept-specific semantic offsets from a single reference image using a stage-aware optimization strategy that aligns with the three-stage framework. These learned offsets are then deployed during inference via our novel ConceptWeaver Guidance (CWG) mechanism, which strategically injects them at the appropriate generative stage. Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating that understanding and leveraging the intrinsic, staged nature of flow models is key to unlocking precise, multi-granularity content manipulation.
[336] Bridging the Geometry Mismatch: Frequency-Aware Anisotropic Serialization for Thin-Structure SSMs
Jin Bai,Huiyao Zhang,Qi Wen,Ningyang Li,Shengyang Li,Atta ur Rahman,Xiaolin Tian
Main category: cs.CV
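TL;DR: This paper proposes FGOS-Net, which uses frequency-geometric disentanglement to achieve topology-preserving segmentation of thin linear structures, markedly improving connectivity and accuracy.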
Details
Motivation: Segmentation of thin linear structures is topology-sensitive, and conventional SSMs suffer a geometry mismatch from isotropic scanning, making it hard to model along the structure's direction. Method: The frequency-geometric disentanglement framework FGOS-Net (1) decomposes features into a topology carrier and directional high-frequency components to correct spatial misalignment; (2) introduces frequency-aligned scanning for geometry-conditioned, direction-consistent serialization; and (3) adds an active probing strategy that selectively injects high-frequency details and suppresses texture ambiguity. Result: FGOS-Net outperforms strong baselines on four benchmarks; on DeepCrack it reaches 91.3% mIoU and 97.1% clDice at 80 FPS with only 7.87 GFLOPs. Conclusion: Frequency-geometric disentanglement mitigates the geometry mismatch of SSMs, combining high accuracy, high efficiency, and strong topology preservation. Abstract: The segmentation of thin linear structures is inherently topology-critical, where minor local errors can sever long-range connectivity. While recent State-Space Models (SSMs) offer efficient long-range modeling, their isotropic serialization (e.g., raster scanning) creates a geometry mismatch for anisotropic targets, causing state propagation across rather than along the structure trajectories. To address this, we propose FGOS-Net, a framework based on frequency-geometric disentanglement. We first decompose features into a stable topology carrier and directional high-frequency bands, leveraging the latter to explicitly correct spatial misalignments induced by downsampling. Building on this calibrated topology, we introduce frequency-aligned scanning that elevates serialization to a geometry-conditioned decision, preserving direction-consistent traces. Coupled with an active probing strategy to selectively inject high-frequency details and suppress texture ambiguity, FGOS-Net consistently outperforms strong baselines across four challenging benchmarks. Notably, it achieves 91.3% mIoU and 97.1% clDice on DeepCrack while running at 80 FPS with only 7.87 GFLOPs.
[337] Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree
Fei Wu,Guanghao Ding,Zijian Niu,Zhenrui Wang,Lei Yang,Zhuosheng Zhang,Shilin Wang
Main category: cs.CV
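TL;DR: This paper proposes a new AI-generated image detection framework that combines lightweight artifact-aware detectors with multimodal large language models (MLLMs), adaptively fusing semantic and perceptual cues through a fuzzy decision tree and reaching state-of-the-art accuracy and generalization.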
Details
Motivation: Existing detectors based on low-level artifacts generalize poorly, while MLLM-based methods lack perceptual sensitivity to subtle generation artifacts; a framework that combines the strengths of both is needed. Method: The framework integrates lightweight artifact-aware detectors with MLLMs, using a fuzzy decision tree that treats the basic detectors' outputs as fuzzy membership values to adaptively fuse semantic and perceptual cues. Result: State-of-the-art detection accuracy and strong generalization across diverse generative models. Conclusion: Combining semantic reasoning with fine-grained perception improves the robustness and generalization of AI-generated image detection, and a fuzzy decision tree is an effective mechanism for fusing the two. Abstract: The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.
[338] GEditBench v2: A Human-Aligned Benchmark for General Image Editing
Zhangqi Jiang,Zheng Sun,Xianfang Zeng,Yufeng Yang,Xuanyang Zhang,Yongliang Wu,Wei Cheng,Gang Yu,Xu Yang,Bihan Wen
Main category: cs.CV
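TL;DR: This paper proposes the GEditBench v2 benchmark and the PVC-Judge evaluation model to address the narrow task coverage and inadequate visual-consistency metrics of existing image editing evaluation.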
Details
Motivation: Existing image editing evaluation frameworks cover a narrow range of tasks, and standard metrics fail to measure visual consistency accurately (that is, preservation of identity, structure, and semantic coherence). Method: The authors build GEditBench v2, a benchmark of 1,200 real user queries covering 23 task types (including an open-set category); propose PVC-Judge, an open-source pairwise assessment model trained on data from region-decoupled preference synthesis; and construct the expert-annotated VCReward-Bench to verify its agreement with human judgments. Result: PVC-Judge achieves state-of-the-art performance among open-source models and surpasses GPT-5.1 on average; benchmarking 16 frontier editing models shows that GEditBench v2 tracks human judgment more closely and reveals key shortcomings of current models. Conclusion: Together, GEditBench v2 and PVC-Judge provide a more reliable, more human-aligned evaluation foundation for precise image editing. Abstract: Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. Besides, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models, and providing a reliable foundation for advancing precise image editing.
[339] Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow
Quan Meng,Yujin Chen,Lei Li,Matthias Nießner,Angela Dai
Main category: cs.CV
TL;DR: Seen2Scene is the first flow matching-based method for 3D scene completion and generation trained directly on incomplete, real-world 3D scans, using visibility-guided flow matching and sparse TSDF representations to handle partial observations effectively.
Details
Motivation: Prior methods rely on complete, synthetic 3D data, limiting realism and generalization to real-world, partial scans; there's a need for models that learn directly from incomplete, real-world 3D observations. Method: Introduces visibility-guided flow matching to mask unknown regions in real scans; represents scenes as TSDF volumes in sparse grids; uses a sparse transformer for efficient modeling; conditions on 3D layout boxes (and flexibly supports text or partial scans). Result: Outperforms baselines in completion accuracy and generation quality; produces coherent, complete, and realistic 3D scenes for complex, cluttered real environments. Conclusion: Seen2Scene demonstrates that direct learning from incomplete, real-world 3D scans is feasible and effective for realistic scene completion and generation, advancing practical applicability in real environments. Abstract: We present Seen2Scene, the first flow matching-based approach that trains directly on incomplete, real-world 3D scans for scene completion and generation. Unlike prior methods that rely on complete and hence synthetic 3D data, our approach introduces visibility-guided flow matching, which explicitly masks out unknown regions in real scans, enabling effective learning from real-world, partial observations. We represent 3D scenes using truncated signed distance field (TSDF) volumes encoded in sparse grids and employ a sparse transformer to efficiently model complex scene structures while masking unknown regions. We employ 3D layout boxes as an input conditioning signal, and our approach is flexibly adapted to various other inputs such as text or partial scans. By learning directly from real-world, incomplete 3D scans, Seen2Scene enables realistic 3D scene completion for complex, cluttered real environments. 
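The idea of masking out unknown regions during flow matching can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the common linear-path flow matching target (data minus noise), which the abstract does not confirm, and a flat list of values stands in for a sparse TSDF grid. The function name `masked_flow_matching_loss` and all sample values are hypothetical.

```python
def masked_flow_matching_loss(pred_velocity, noise, target, visible):
    """Mean squared velocity error computed over observed entries only.

    Assumes a linear interpolation path, so the regression target for the
    velocity is (target - noise). Entries flagged as not visible contribute
    nothing to the loss.
    """
    total, count = 0.0, 0
    for v_hat, x0, x1, seen in zip(pred_velocity, noise, target, visible):
        if seen:
            total += (v_hat - (x1 - x0)) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical two-voxel example: the second voxel was never observed by the
# scanner, so its (unreliable) target value must not influence training.
noise = [0.0, 0.0]
target = [1.0, 99.0]
visible = [True, False]

perfect = masked_flow_matching_loss([1.0, 5.0], noise, target, visible)
off_by_one = masked_flow_matching_loss([2.0, 5.0], noise, target, visible)
print(perfect, off_by_one)
```

The design point is that the mask is applied to the loss rather than to the data, so the model is never penalized for what it predicts in regions the scan did not cover.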
Experiments demonstrate that our model produces coherent, complete, and realistic 3D scenes, outperforming baselines in completion accuracy and generation quality.
[340] MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
Tim Strohmeyer,Lucas Morin,Gerhard Ingmar Meijer,Valéry Weber,Ahmed Nassar,Peter Staar
Main category: cs.CV
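TL;DR: This paper proposes MarkushGrapher-2, an end-to-end method for multimodal recognition of chemical structures (Markush structures in particular) that combines OCR, joint vision-text-layout encoding, and a two-stage training strategy, substantially outperforming existing methods on a self-built large-scale dataset and the new benchmark IP5-M.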
Details
Motivation: Existing methods are not precise enough at recognizing Markush chemical structures described multimodally in the literature, which prevents large-scale automated processing. Method: A dedicated OCR model extracts text from chemical images; a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder jointly encode image, text, and layout information; a two-stage training strategy fuses the representations, which are then used to generate the Markush structure representation autoregressively; the authors also build a large-scale training dataset with an automatic pipeline and the manually annotated benchmark IP5-M. Result: The approach substantially outperforms state-of-the-art models on multimodal Markush structure recognition while keeping strong performance on conventional molecule structure recognition. Conclusion: MarkushGrapher-2 offers an efficient and reliable end-to-end solution for automatically parsing complex Markush structures in chemical literature and advances the data and benchmarks available for this task. Abstract: Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition.
Code, models, and datasets are released publicly.
[341] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Athos Georgiou
Main category: cs.CV
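TL;DR: Hydra is a dual-head vision-language model (VLM) that serves both retrieval and generation by toggling a single LoRA adapter. It provides ColBERT-style retrieval and autoregressive generation from one model, which substantially reduces memory use while preserving generation quality.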
Details
Motivation: Visual document understanding usually needs separate retrieval and generation models, which raises memory use and system complexity.
Method: Hydra shares one VLM backbone and trains a single LoRA adapter for retrieval only. At inference the adapter is switched on for multi-vector retrieval and off for native generation. Three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) must be met to keep generation consistent.
Result: On ViDoRe V1, Hydra (4B) is within 1 percentage point of a single-head baseline, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks. Generation matches an independent base-model pipeline (byte-identical outputs in 100% of 10,500 samples, max delta-ANLS = 0.0044). Peak GPU memory falls by 41%. GritLM-style joint training gives no benefit under LoRA (r=16). The mechanism extends to Qwen2.5-Omni-3B for audio and video modalities.
Conclusion: A single VLM can unify retrieval and generation through lightweight adapter switching, which supports the feasibility and generality of the dual-head design and offers a way to simplify multimodal systems.
Abstract: Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime.
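The adapter-toggling idea in the Hydra abstract can be illustrated with a toy linear layer. This is a minimal sketch and not the authors' implementation: `LoRALinear`, `matvec`, the dimensions, and the random weights are all hypothetical, and the sketch covers only weight recovery, not the three engineering requirements the abstract lists.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

class LoRALinear:
    """Frozen base weight W plus a low-rank update B @ A that can be switched off."""

    def __init__(self, W, A, B, scale=1.0):
        self.W, self.A, self.B = W, A, B
        self.scale = scale
        self.enabled = True

    def forward(self, x):
        base = matvec(self.W, x)
        if not self.enabled:
            # Adapter off: the base path is returned untouched.
            return base
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

# Hypothetical sizes: 8 inputs, 6 outputs, rank-2 adapter.
rng = random.Random(0)
d_in, d_out, r = 8, 6, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

layer = LoRALinear(W, A, B)
retrieval_out = layer.forward(x)   # adapter on: "retrieval" path
layer.enabled = False
generation_out = layer.forward(x)  # adapter off: base-model path
```

Because the adapter is added on top of the frozen weight rather than merged into it, switching it off reproduces the base output exactly in this toy setting. The abstract's point is that in a real VLM this alone is not sufficient.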
A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
[342] Domain-Invariant Prompt Learning for Vision-Language Models
Arsham Gholamzadeh Khoee,Yinan Yu,Robert Feldt
Main category: cs.CV
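TL;DR: This paper proposes DiCoOp, an improved soft-prompting method that learns domain-invariant prompts through adversarial training, improving how vision-language models such as CLIP generalize to unseen domains.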
Details
Motivation: Existing soft-prompting methods such as CoOp have no explicit mechanism for handling distribution shift across domains, so they struggle with generalization to unseen domains.
Method: Domain-invariant Context Optimization (DiCoOp) extends CoOp with adversarial training, which forces the prompt vectors to learn domain-invariant features while preserving discriminative power for classification.
Result: Experiments show that DiCoOp consistently outperforms CoOp on domain generalization tasks across diverse visual domains.
Conclusion: DiCoOp improves the zero-shot transfer and generalization of pre-trained vision-language models in cross-domain settings.
Abstract: Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization (CoOp), effectively adapts these models for downstream recognition tasks by learning a set of context vectors. However, CoOp lacks explicit mechanisms for handling domain shifts across unseen distributions. To address this, we propose Domain-invariant Context Optimization (DiCoOp), an extension of CoOp optimized for domain generalization. By employing an adversarial training approach, DiCoOp forces the model to learn domain-invariant prompts while preserving discriminative power for classification. Experimental results show that DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains.
[343] Curriculum-Guided Myocardial Scar Segmentation for Ischemic and Non-ischemic Cardiomyopathy
Nivetha Jayakumar,Jonathan Pan,Shuo Wang,Bishow Paudel,Nisha Hosadurg,Cristiane C. Singulane,Sivam Bhatt,Amit R. Patel,Miaomiao Zhang
Main category: cs.CV
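TL;DR: This paper proposes a curriculum-learning framework that improves the accuracy and robustness of myocardial scar segmentation in late gadolinium enhancement cardiac magnetic resonance (LGE-CMR) images, particularly in challenging cases such as low scar burden and diffuse scar.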
Details
Motivation: Accurate identification and quantification of myocardial scar matters for the diagnosis and prognosis of cardiovascular disease, but existing methods are limited by variations in contrast enhancement, suboptimal imaging conditions (such as post-contrast washout), and inconsistent annotation of diffuse scar caused by inter-observer variability.
Method: A curriculum learning strategy trains the model progressively from high-confidence to low-confidence samples. It first learns clearly defined scar regions and then moves to low-confidence or visually ambiguous samples with limited scar burden, which builds robustness to uncertain labels and subtle scar appearances.
Result: Experiments show improved segmentation accuracy and consistency, with gains over standard training baselines especially in cases of minimal or diffuse scar.
Conclusion: The curriculum learning framework offers a principled way to use imperfectly annotated data for clinically usable myocardial scar quantification.
Abstract: Identification and quantification of myocardial scar is important for diagnosis and prognosis of cardiovascular diseases. However, reliable scar segmentation from Late Gadolinium Enhancement Cardiac Magnetic Resonance (LGE-CMR) images remains a challenge due to variations in contrast enhancement across patients, suboptimal imaging conditions such as post-contrast washout, and inconsistencies in ground truth annotations on diffuse scars caused by inter-observer variability. In this work, we propose a curriculum learning-based framework designed to improve segmentation performance under these challenging conditions. The method introduces a progressive training strategy that guides the model from high-confidence, clearly defined scar regions to low-confidence or visually ambiguous samples with limited scar burden. By structuring the learning process in this manner, the network develops robustness to uncertain labels and subtle scar appearances that are often underrepresented in conventional training pipelines. Experimental results show that the proposed approach enhances segmentation accuracy and consistency, particularly for cases with minimal or diffuse scar, outperforming standard training baselines. This strategy provides a principled way to leverage imperfect data for improved myocardial scar quantification in clinical applications. Our code is publicly available on GitHub.
[344] XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
Chengyin Hu,Jiaju Han,Xuemeng Sun,Qike Zhang,Yiwei Wei,Ang Li,Chunlei Meng,Xiang Chen,Jiahuan Long
Main category: cs.CV
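TL;DR: This paper proposes the X-shaped Sparse Pixel Attack (XSPA), which perturbs only a small fraction of pixels (about 1.76%) along two diagonal lines of an image. While remaining visually imperceptible, it induces joint semantic failures in vision-language models (VLMs) across zero-shot classification, image captioning, and visual question answering, exposing a serious robustness weakness of current VLMs under structured sparse perturbations.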
Details
Motivation: VLMs rely on a shared visual-textual representation space. This helps cross-task generalization, but small visual perturbations may propagate through that space and cause correlated semantic failures across tasks. Their robustness to highly constrained, sparse, and geometrically fixed perturbations remains unclear, which matters especially in interactive and decision-support settings.
Method: XSPA restricts perturbations strictly to two intersecting diagonal lines. At a very low pixel modification rate (~1.76%), it jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, producing transferable misclassification and semantic drift.
Result: On COCO, XSPA substantially degrades several VLMs. Zero-shot accuracy drops by 52.33 points on CLIP ViT-L/14 and by 67.00 points on OpenCLIP ViT-B/16; GPT-4-evaluated caption consistency drops by up to 58.60 points and VQA correctness by up to 44.38 points.
Conclusion: Even very sparse, nearly invisible perturbations with a fixed geometric prior can severely disrupt the cross-task semantic consistency of VLMs, exposing a key gap in the structured robustness of current multimodal systems.
Abstract: Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks.
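To give a feel for how small the support of an X-shaped perturbation is, the sketch below builds a binary mask over the two diagonals of an image grid and measures the fraction of pixels it covers. The function names, the 224 x 224 resolution, and the `thickness` parameter are assumptions made for illustration; the abstract does not state the resolution or line width behind its 1.76% figure, so this is not a reproduction of the authors' setting.

```python
def x_mask(height, width, thickness=1):
    """Binary mask that is 1 on the two diagonals of a height x width grid."""
    mask = [[0] * width for _ in range(height)]
    for i in range(height):
        # Columns where the main and anti diagonals cross row i.
        main = round(i * (width - 1) / (height - 1)) if height > 1 else 0
        anti = (width - 1) - main
        for c in (main, anti):
            for dc in range(thickness):
                j = c + dc - thickness // 2
                if 0 <= j < width:
                    mask[i][j] = 1
    return mask

def perturbed_fraction(mask):
    """Share of pixels the mask allows the attack to modify."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total

# Assumed resolution: 224 x 224, a common VLM input size.
thin = perturbed_fraction(x_mask(224, 224, thickness=1))
thick = perturbed_fraction(x_mask(224, 224, thickness=2))
```

For a square image of side N, one-pixel-wide diagonals cover roughly 2/N of the pixels, so the budget shrinks as resolution grows. An attack would then restrict its perturbation to positions where the mask is 1.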
Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
[345] Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering
Yanjie Zhang,Yafei Li,Rui Sheng,Zixin Chen,Yanna Lin,Huamin Qu,Lei Chen,Yushi Sun
Main category: cs.CV
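TL;DR: ChartCynics is a dual-path agentic framework that decouples perception from verification, combines structural diagnosis with OCR-based data extraction, and uses an agentic summarizer optimized in two stages. It substantially improves accuracy on misleading charts and outperforms existing large models.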
Details
Motivation: Vision-Language Models perform poorly on misleading charts, whose deceptive visual structures and distorted data representations pose a significant challenge.
Method: ChartCynics has three parts: (1) a Diagnostic Vision Path that uses ROI cropping to detect structural anomalies such as inverted axes; (2) an OCR-Driven Data Path that ensures numerical grounding; (3) an Agentic Summarizer optimized in two stages, Oracle-Informed SFT followed by Deception-Aware GRPO, to resolve cross-modal conflicts and penalize visual traps.
Result: ChartCynics reaches 74.43% and 64.55% accuracy on two benchmarks, about 29 percentage points above the Qwen3-VL-8B backbone, and outperforms state-of-the-art proprietary models.
Conclusion: Specialized agentic workflows can make smaller open-source models more robust, laying a foundation for trustworthy chart interpretation.
Abstract: Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a "skeptical" reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a Diagnostic Vision Path captures structural anomalies (e.g., inverted axes) through strategic ROI cropping, while an OCR-Driven Data Path ensures numerical grounding. To resolve cross-modal conflicts, we introduce an Agentic Summarizer optimized via a two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment. This pipeline effectively penalizes visual traps and enforces logical consistency. Evaluations on two benchmarks show that ChartCynics achieves 74.43% and 64.55% accuracy, providing an absolute performance boost of ~29% over the Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models. Our results demonstrate that specialized agentic workflows can grant smaller open-source models superior robustness, establishing a new foundation for trustworthy chart interpretation.
[346] ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection
Haojing Chen,Yutong Li,Zhihang Liu,Tao Tan,Haoyu Bian,Qiuju Ma
Main category: cs.CV
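TL;DR: This paper proposes ORSIFlow, a saliency-guided rectified flow framework that formulates salient object detection in optical remote sensing images (ORSI-SOD) as a deterministic latent flow generation problem, achieving strong performance with much faster inference.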
Details
Motivation: ORSI-SOD is difficult because of complex backgrounds, low contrast, irregular shapes, and large scale variations. Existing discriminative methods are limited in expressive power, while diffusion-based generative methods are limited in efficiency and stability.
Method: ORSIFlow (1) uses a frozen variational autoencoder to build a compact latent space, enabling deterministic flow generation in a few steps; (2) adds a Salient Feature Discriminator for global semantic discrimination; (3) adds a Salient Feature Calibrator for precise boundary refinement.
Result: ORSIFlow reaches state-of-the-art performance on multiple public benchmarks with significantly improved inference efficiency. The code is open source.
Conclusion: By recasting ORSI-SOD as deterministic latent flow generation and adding two saliency-enhancement modules, ORSIFlow balances detection accuracy and computational cost.
Abstract: Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) remains challenging due to complex backgrounds, low contrast, irregular object shapes, and large variations in object scale. Existing discriminative methods directly regress saliency maps, while recent diffusion-based generative approaches suffer from stochastic sampling and high computational cost. In this paper, we propose ORSIFlow, a saliency-guided rectified flow framework that reformulates ORSI-SOD as a deterministic latent flow generation problem. ORSIFlow performs saliency mask generation in a compact latent space constructed by a frozen variational autoencoder, enabling efficient inference with only a few steps. To enhance saliency awareness, we design a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement. Extensive experiments on multiple public benchmarks show that ORSIFlow achieves state-of-the-art performance with significantly improved efficiency. Codes are available at: https://github.com/Ch3nSir/ORSIFlow.
[347] Detection of Adversarial Attacks in Robotic Perception
Ziad Sharawy,Mohammad Nakshbandi,Sorin Mihai Grigorescu
Main category: cs.CV
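TL;DR: This paper discusses the vulnerability of deep neural networks to adversarial attacks in semantic segmentation for robotic perception and argues that robotic settings need dedicated robust architectures and detection strategies.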
Details
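Motivation: Deep neural networks perform well in semantic segmentation but are vulnerable to adversarial attacks in safety-critical robotic applications, and robustness research on image classification does not transfer directly to robotic semantic segmentation.
Method: The abstract does not describe a specific method. It implies that dedicated robust architectures and adversarial detection strategies are needed for robotic semantic segmentation.
Result: The abstract reports no experimental results. It stresses that robustness for semantic segmentation in robotic settings needs a research path distinct from image classification.
Conclusion: Adversarial robustness of semantic segmentation in robotic perception needs targeted research and cannot simply reuse approaches from image classification.
Abstract: Deep Neural Networks (DNNs) achieve strong performance in semantic segmentation for robotic perception but remain vulnerable to adversarial attacks, threatening safety-critical applications. While robustness has been studied for image classification, semantic segmentation in robotic contexts requires specialized architectures and detection strategies.
[348] ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains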
Pavel Suma,Giorgos Kordopatis-Zilos,Yannis Kalantidis,Giorgos Tolias
Main category: cs.CV
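TL;DR: This paper proposes ELViS, an image-to-image similarity model that operates in similarity space rather than representation space. Using local descriptor matching, an optimal transport refinement, and voting-based aggregation, it generalizes strongly to unseen domains and substantially outperforms existing methods on multi-domain retrieval benchmarks.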
Details
Motivation: Large-scale instance-level training data is scarce, so models are usually trained on domain-specific datasets, yet real-world retrieval spans diverse domains. Cross-domain generalization is therefore critical.
Method: ELViS operates in similarity space. It uses local descriptor correspondences, suppresses uninformative descriptors through an optimal transport step with data-dependent gains, and aggregates strong correspondences through a voting process into an image-level similarity.
Result: On a benchmark of eight datasets covering landmarks, artworks, products, and multi-domain collections, ELViS used as a re-ranking method outperforms existing methods by a large margin in out-of-domain scenarios and on average, at a fraction of their computational cost.
Conclusion: The similarity-space design, strong inductive biases, and efficient, interpretable structure of ELViS improve the generalization and practicality of cross-domain image retrieval.
Abstract: Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost. Code available at: https://github.com/pavelsuma/ELViS/
[349] Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
Mih Dinh,SouYoung Jin
Main category: cs.CV
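TL;DR: This paper proposes Unsafe2Safe, a fully automated image privacy-protection pipeline that uses multimodally guided diffusion editing to identify and rewrite sensitive regions in images, protecting privacy while preserving image structure and downstream task performance.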
Details
Motivation: Large-scale image datasets often contain identifiable or sensitive content that models may memorize and leak during training, so an automated solution that balances privacy protection and data utility is needed.
Method: The pipeline has two stages. Stage 1 uses a vision-language model to detect privacy risks and generate paired private and public captions, and a large language model to produce identity-neutral edit instructions. Stage 2 uses instruction-driven diffusion editors that rewrite sensitive regions based on the dual textual prompts. The authors also build a unified evaluation suite covering Quality, Cheating, Privacy, and Utility.
Result: On MS-COCO, Caltech101, and MIT Indoor67, the method greatly reduces face similarity, text similarity, and demographic predictability while keeping downstream model accuracy close to that of training on raw data. Fine-tuning the diffusion editors further improves privacy protection and semantic fidelity.
Conclusion: Unsafe2Safe offers a scalable, principled solution for building large privacy-safe datasets, balancing visual consistency and downstream utility.
Abstract: Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity.
Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.
[350] TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark
Hannes Mareen,Dimitrios Karageorgiou,Paschalis Giakoumoglou,Peter Lambert,Symeon Papadopoulos,Glenn Van Wallendael
Main category: cs.CV
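TL;DR: This paper introduces the TGIF2 dataset for evaluating how modern text-guided inpainting affects media forensics. Existing methods degrade on inpainted images generated by FLUX.1 models and on random non-semantic masks, and the study reveals object bias and shows that generative super-resolution weakens forensic traces.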
Details
Motivation: Existing benchmarks such as TGIF show that manipulation localization in fully regenerated (FR) images remains an open problem, and new text-guided inpainting models keep emerging, so updated datasets and benchmarks are needed to analyze forensic robustness.
Method: The authors build the extended TGIF2 dataset, which adds edits generated by FLUX.1 models and random non-semantic masks. They run a forensic evaluation covering image forgery localization (IFL) and synthetic image detection (SID), including fine-tuning IFL methods on FR images and generative super-resolution attacks.
Result: IFL and SID methods degrade clearly on FLUX.1 manipulations. Fine-tuning improves localization on FR images, but random non-semantic masks reveal object bias. Generative super-resolution significantly weakens forensic traces.
Conclusion: TGIF2 provides an updated benchmark for the forensic challenges posed by modern text-guided inpainting and AI-based image enhancement, exposing the limited generalization and potential biases of current methods.
Abstract: Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but struggle in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF, that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines.
In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at https://github.com/IDLabMedia/tgif-dataset.
[351] Divide and Restore: A Modular Task-Decoupled Framework for Universal Image Restoration
Joanna Wiekiera,Martyna Zur
Main category: cs.CV
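TL;DR: This paper proposes a modular, task-decoupled image restoration framework in which a lightweight CNN classifier dynamically routes each input to a specialized restoration node. The design avoids task interference, supports model-agnostic extension, lowers training overhead, and improves scalability and efficiency in multi-degradation settings.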
Details
Motivation: Complex monolithic all-in-one restoration models suffer from negative task interference, high training cost, and poor extensibility, so a more flexible, efficient, and easily deployed alternative is needed.
Method: The framework is built on an explicit diagnostic routing mechanism. A lightweight CNN classifier evaluates the input image and routes it to the matching U-Net expert. Each expert specializes in one degradation type and the reconstruction paths are isolated. Any restoration method can be plugged in.
Result: Experiments indicate efficient multi-degradation restoration on standard local hardware. Adding a degradation type only requires training one expert and updating the router rather than retraining the full system, which reduces training overhead.
Conclusion: The task-decoupled, routing-driven modular framework gives image restoration a more scalable, accessible, and computationally light design that balances performance and practicality.
Abstract: Restoring images affected by various types of degradation, such as noise, blur, or improper exposure, remains a significant challenge in computer vision. While recent trends favor complex monolithic all-in-one architectures, these models often suffer from negative task interference and require extensive joint training cycles on high-end computing clusters. In this paper, we propose a modular, task-decoupled image restoration framework based on an explicit diagnostic routing mechanism. The architecture consists of a lightweight Convolutional Neural Network (CNN) classifier that evaluates the input image and dynamically directs it to a specialized restoration node. A key advantage of this framework is its model-agnostic extensibility: while we demonstrate it using three independent U-Net experts, the system allows for the integration of any restoration method tailored to specific tasks. By isolating reconstruction paths, the framework prevents feature conflicts and significantly reduces training overhead. Unlike monolithic models, adding new degradation types in our framework only requires training a single expert and updating the router, rather than a full system retraining. Experimental results demonstrate that this computationally accessible approach offers a scalable and efficient solution for multi-degradation restoration on standard local hardware. The code will be published upon paper acceptance.
[352] Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and Cross-Paradigm Benchmark for Industrial Infrastructure
Chao Yin,Hongzhe Yue,Qing Han,Difeng Hu,Zhenyu Liang,Fangzhou Lin,Bing Sun,Boyu Wang,Mingkai Li,Wei Yao,Jack C. P. Cheng
Main category: cs.CV
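TL;DR: This paper presents Industrial3D, a large, high-precision terrestrial LiDAR point cloud dataset for industrial mechanical, electrical, and plumbing (MEP) facilities, together with a cross-paradigm benchmark. The benchmark shows that current methods hit a severe performance bottleneck in industrial scenes because of statistical rarity and geometric ambiguity.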
Details
Motivation: Architectural point cloud benchmarks such as S3DIS and ScanNet do not reflect the extreme geometric ambiguity, severe occlusion, and extreme class imbalance of TLS data from industrial MEP facilities, which holds back automated semantic understanding for key industrial applications such as Scan-to-BIM.
Method: The authors build Industrial3D, with 612 million expert-labelled points at 6 mm resolution from 13 water treatment facilities, and design a unified evaluation protocol that assesses nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings.
Result: The best fully supervised method reaches 55.74% mIoU, while zero-shot Point-SAM reaches only 15.79%, a gap of 39.95 percentage points. The analysis attributes the bottleneck to a 215:1 class imbalance (3.5x more severe than S3DIS) and to geometric ambiguity, since tail classes share cylindrical geometry with head-class pipes.
Conclusion: Industrial3D fills a gap in data and benchmarks for industrial 3D scene understanding, shows that frequency-based re-weighting alone cannot resolve the core challenges, and gives later work a reproducible, multi-paradigm, realistic standard.
Abstract: Automated semantic understanding of dense point clouds is a prerequisite for Scan-to-BIM pipelines, digital twin construction, and as-built verification--core tasks in the digital transformation of the construction industry. Yet for industrial mechanical, electrical, and plumbing (MEP) facilities, this challenge remains largely unsolved: TLS acquisitions of water treatment plants, chiller halls, and pumping stations exhibit extreme geometric ambiguity, severe occlusion, and extreme class imbalance that architectural benchmarks (e.g., S3DIS or ScanNet) cannot adequately represent. We present Industrial3D, a terrestrial LiDAR dataset comprising 612 million expertly labelled points at 6 mm resolution from 13 water treatment facilities. At 6.6x the scale of the closest comparable MEP dataset, Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding to date. We further establish the first industrial cross-paradigm benchmark, evaluating nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings under a unified benchmark protocol. The best supervised method achieves 55.74% mIoU, whereas zero-shot Point-SAM reaches only 15.79%--a 39.95 percentage-point gap that quantifies the unresolved domain-transfer challenge for industrial TLS data.
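The mIoU figures above are averages of per-class intersection-over-union, which is why a rare class with poor predictions pulls the score down far more than it would affect overall accuracy. The sketch below is a standard mIoU computation from a confusion matrix. The `toy` counts are invented for illustration and are not Industrial3D data.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[t][p] counts points of true class t predicted as class p.
    Classes absent from both truth and prediction are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[t][c] for t in range(n)) - tp
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Invented 3-class example: a dominant head class and a rare tail class.
toy = [
    [900, 50, 50],
    [30, 60, 10],
    [8, 0, 2],
]
toy_miou = mean_iou(toy)
toy_accuracy = sum(toy[c][c] for c in range(3)) / sum(map(sum, toy))

# The gap quoted in the abstract, in percentage points.
gap = round(55.74 - 15.79, 2)
```

In the toy matrix, overall accuracy stays high because the head class dominates, while mIoU is much lower because each class counts equally. That is the same mechanism that makes class imbalance so costly on this kind of benchmark.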
Systematic analysis reveals that this gap originates from a dual crisis: statistical rarity (215:1 imbalance, 3.5x more severe than S3DIS) and geometric ambiguity (tail-class points share cylindrical primitives with head-class pipes) that frequency-based re-weighting alone cannot resolve. Industrial3D, along with benchmark code and pre-trained models, will be publicly available at https://github.com/pointcloudyc/Industrial3D.
[353] Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim
Martina Hutter-Mironovova
Main category: cs.CV
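TL;DR: This study examines how effective synthetic data is for sim-to-real transfer in object detection under limited data and embedded deployment requirements. A hybrid training strategy (synthetic plus a small amount of real data) comes close to real-only training while greatly reducing manual annotation, improves robustness under domain shift, and the models run in real time on a Jetson Orin NX.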
Details
Motivation: Labelled real data for object detection is scarce and costly to annotate, and embedded deployment demands lightweight models with real-time inference.
Method: Synthetic fruit images are generated in NVIDIA Isaac Sim and combined with a small set of real fruit images. YOLO models are trained under real-only, synthetic-only, and hybrid regimes, evaluated on an in-domain test set and a domain-shift test set, then optimized with TensorRT and deployed on a Jetson Orin NX to verify real-time performance.
Result: Real-only training gives the highest accuracy. Synthetic-only training performs worse because of the domain gap. Hybrid training clearly beats synthetic-only, approaches real-only performance, and is more robust under domain shift. The trained models were deployed on the Jetson Orin NX with real-time inference.
Conclusion: Synthetic data is most effective when combined with a small amount of real data, and practical applications must weigh detection accuracy against embedded deployment constraints.
Abstract: This study investigates the effectiveness of synthetic data for sim-to-real transfer in object detection under constrained data conditions and embedded deployment requirements. Synthetic datasets were generated in NVIDIA Isaac Sim and combined with limited real-world fruit images to train YOLO-based detection models under real-only, synthetic-only, and hybrid regimes. Performance was evaluated on two test datasets: an in-domain dataset with conditions matching the training data and a domain shift dataset containing real fruit and different background conditions. Results show that models trained exclusively on real data achieve the highest accuracy, while synthetic-only models exhibit reduced performance due to a domain gap. Hybrid training strategies significantly improve performance compared to synthetic-only approaches and achieve results close to real-only training while reducing the need for manual annotation. Under domain shift conditions, all models show performance degradation, with hybrid models providing improved robustness. The trained models were successfully deployed on a Jetson Orin NX using TensorRT optimization, achieving real-time inference performance. The findings highlight that synthetic data is most effective when used in combination with real data and that deployment constraints must be considered alongside detection accuracy.
[354] Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems
Khalid Adnan Alsayed
Main category: cs.CV
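TL;DR: This paper argues that the high overall accuracy of facial recognition systems in law enforcement and security settings hides significant performance differences between demographic groups. It calls for fairness evaluation based on subgroup error rates (such as FPR and FNR) in place of a single aggregate accuracy metric.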
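The argument about aggregate accuracy can be made concrete with a small numerical sketch. The code below computes accuracy, false positive rate (FPR), and false negative rate (FNR) per subgroup. The group labels and all counts are invented for illustration and do not come from the paper.

```python
from collections import defaultdict

def subgroup_rates(records):
    """records: iterable of (group, y_true, y_pred) with binary 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        # "t"/"f" says whether the prediction was right, "p"/"n" what it was.
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred else "n")
        counts[group][key] += 1
    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        rates[group] = {
            "accuracy": (c["tp"] + c["tn"]) / sum(c.values()),
            "fpr": c["fp"] / negatives if negatives else 0.0,
            "fnr": c["fn"] / positives if positives else 0.0,
        }
    return rates

def make(group, tp, fn, tn, fp):
    """Expand confusion counts into individual (group, y_true, y_pred) records."""
    return ([(group, 1, 1)] * tp + [(group, 1, 0)] * fn +
            [(group, 0, 0)] * tn + [(group, 0, 1)] * fp)

# Invented data: a large group A and a small group B with more errors.
records = make("A", 441, 9, 441, 9) + make("B", 45, 5, 40, 10)
rates = subgroup_rates(records)
aggregate_accuracy = sum(t == p for _, t, p in records) / len(records)
```

With these invented counts the aggregate accuracy looks strong, yet group B's false positive rate is ten times that of group A. This is the kind of disparity a single aggregate number hides.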
Details
Motivation: Facial recognition systems report high overall accuracy but perform unevenly across demographic groups, which can cause real social harm. Relying on aggregate accuracy alone overlooks key fairness problems.
Method: The paper analyzes subgroup-level error distributions, including false positive rate (FPR) and false negative rate (FNR), compares systems with similar overall accuracy but very different fairness profiles, and examines the operational risks of accuracy-centric evaluation in law enforcement.
Result: Systems with similar overall accuracy differ substantially in subgroup error rates. Evaluating with aggregate accuracy alone hides unfairness and carries real risks of wrongful suspicion or missed identification.
Conclusion: Accuracy should not be the primary metric. Fairness-aware evaluation and model-agnostic auditing strategies are needed for a more comprehensive and responsible AI evaluation framework.
Abstract: Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.
[355] AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Haozhe Qi,Kevin Qu,Mahdi Rad,Rui Wang,Alexander Mathis,Marc Pollefeys
Main category: cs.CV
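TL;DR: AdaptToken is a training-free framework for long-video understanding. It uses an MLLM's self-uncertainty (response entropy) as a global control signal for token selection across clips and for early stopping, which improves long-video accuracy and lowers inference cost.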
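The entropy-driven control signal can be sketched in a few lines. The abstract does not give the exact rule that maps entropy to a token budget, so the softmax over negative entropy, the `temperature` parameter, the stopping threshold, and the example distributions below are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(group_entropies, total_tokens, temperature=1.0):
    """Split total_tokens across groups, favouring low-entropy (confident) groups.

    Assumed rule: weights are a softmax over negative entropy.
    """
    weights = [math.exp(-h / temperature) for h in group_entropies]
    z = sum(weights)
    budget = [int(total_tokens * w / z) for w in weights]
    # Tokens lost to rounding down go to the most confident group.
    budget[weights.index(max(weights))] += total_tokens - sum(budget)
    return budget

def process(groups, stop_entropy):
    """Visit groups in order and stop once a group's entropy is low enough."""
    seen = []
    for probs in groups:
        h = entropy(probs)
        seen.append(h)
        if h < stop_entropy:
            break
    return seen

# Hypothetical answer distributions obtained from four video groups.
groups = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
]
seen = process(groups, stop_entropy=0.2)
budget = allocate_budget(seen, total_tokens=1000)
```

In this toy run the third group is confident enough to trigger early stopping, so the fourth group is never processed, and the most confident group receives the largest share of the token budget.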
Details
Motivation: Existing methods struggle to compare the relevance of frames or tokens across distant clips in long videos and lack a stopping mechanism based on whether enough evidence has been gathered. They are also constrained by the context length and memory cost of MLLMs.
Method: The video is split into groups. Cross-modal attention ranks tokens within each group, and the model's response entropy estimates each group's prompt relevance, which drives a global token budget allocation. The same entropy signal triggers early stopping (AdaptToken-Lite).
Result: Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, MLVU) and several MLLMs (7B-72B), accuracy improves consistently (for example, +6.7 on average over Qwen2.5-VL 7B), with inputs of up to 10K frames. AdaptToken-Lite roughly halves inference time with comparable performance.
Conclusion: By turning model self-uncertainty into a control signal, AdaptToken offers an efficient, general, training-free approach to long-video understanding.
Abstract: Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
[356] DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing
Kailai Feng,Yuxiang Wei,Bo Chen,Yang Pan,Hu Ye,Songwei Liu,Chenqian Yan,Yuan Gao
Main category: cs.CV
TL;DR: DreamLite is a compact (0.39B), unified on-device diffusion model supporting both text-to-image generation and text-guided image editing, achieving high performance and real-time inference (under 1s for 1024×1024) on smartphones via architectural design, task-progressive pretraining, and step distillation.
Details
Motivation: Existing diffusion models are too large for on-device deployment; current on-device variants support only text-to-image generation, lacking unified image editing capability. Method: DreamLite uses a pruned mobile U-Net backbone and unifies generation/editing via in-context spatial concatenation in latent space ((target|blank) for generation, (target|source) for editing); employs task-progressive joint pretraining, SFT, RL, and step distillation to 4 denoising steps. Result: Achieves GenEval 0.72 and ImgEdit 4.11—outperforming existing on-device models and competitive with some server-side models; generates/edits 1024×1024 images in <1s on Xiaomi 14. Conclusion: DreamLite is the first unified on-device diffusion model enabling efficient, high-quality text-to-image generation and text-guided image editing in resource-constrained settings. Abstract: Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. 
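The in-context spatial concatenation described above can be sketched with plain nested lists standing in for latent tensors. This is an illustrative sketch only: `hconcat` and `build_input` are hypothetical names, a single-channel grid replaces the real multi-channel latent, and an all-zero grid is an assumed stand-in for the "blank" half.

```python
def hconcat(left, right):
    """Concatenate two H x W grids side by side into an H x 2W grid."""
    assert len(left) == len(right), "both halves need the same height"
    return [l_row + r_row for l_row, r_row in zip(left, right)]

def build_input(target, source=None):
    """Return (target | blank) for generation or (target | source) for editing."""
    height, width = len(target), len(target[0])
    if source is None:
        # Assumption: the "blank" half is an all-zero latent.
        source = [[0.0] * width for _ in range(height)]
    return hconcat(target, source)

# Toy 2 x 3 "latents".
noisy_target = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
source_image = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]

generation_input = build_input(noisy_target)             # (target | blank)
editing_input = build_input(noisy_target, source_image)  # (target | source)
```

Both tasks produce an input of the same shape, which is what lets a single network handle generation and editing without task-specific conditioning branches.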
After high-quality SFT and reinforcement learning, DreamLite achieves GenEval (0.72) for image generation and ImgEdit (4.11) for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising to just 4 steps, enabling DreamLite to generate or edit a 1024 × 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
[357] SonoWorld: From One Image to a 3D Audio-Visual Scene
Derong Jin,Xiyi Chen,Ming C. Lin,Ruohan Gao
Main category: cs.CV
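TL;DR: This paper introduces Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and presents SonoWorld, the first framework for it. SonoWorld combines panorama outpainting, 3D scene construction, language-guided sound anchor placement, and spatial audio rendering to deliver an immersive audio-visual experience aligned with scene geometry and semantics.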
Details
Motivation: Existing visual scene generation can turn a single image into an explorable 3D world, but without sound the immersion is incomplete, so joint audio-visual 3D scene generation is needed. Method: The SonoWorld framework (1) outpaints a 360° panorama from a single image; (2) lifts it into a navigable 3D scene; (3) places sound anchors under language guidance; and (4) renders ambisonics spatial audio for point, areal, and ambient sources. Result: The method is validated on a newly curated real-world dataset and in a user study, and is extended to applications such as one-shot acoustic learning and audio-visual spatial source separation. Conclusion: SonoWorld is the first to generate geometrically and semantically consistent 3D audio-visual scenes from a single image, markedly improving immersion and showing potential for diverse downstream applications. Abstract: Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
[358] FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement
Sadra Safadoust,Fabio Tosi,Matteo Poggi,Fatma Güney
Main category: cs.CV
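TL;DR: FlowIt is a new optical flow estimation architecture that uses a hierarchical transformer to capture global context and formulates flow initialization as an optimal transport problem, yielding a robust initial flow field along with occlusion and confidence maps; a guided refinement stage then propagates motion estimates from high-confidence regions.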
Details
Motivation: Conventional optical flow methods degrade under large pixel displacements because of the limits of local matching; the goal is to improve long-range correspondence modeling and cross-dataset generalization. Method: Proposes a hierarchical transformer architecture; formulates flow initialization as an optimal transport problem to obtain an initial flow field, an occlusion map, and a confidence map; designs a guided refinement module that propagates information from high-confidence regions to refine low-confidence regions. Result: Achieves state-of-the-art results on the Sintel and KITTI benchmarks, and sets a new state of the art in cross-dataset zero-shot generalization on Sintel, Spring, and LayeredFlow. Conclusion: Through global modeling, optimal transport initialization, and confidence-guided refinement, FlowIt markedly improves the robustness and generalization of large-displacement optical flow estimation, offering a new paradigm for optical flow estimation. Abstract: We present FlowIt, a novel architecture for optical flow estimation designed to robustly handle large pixel displacements. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the efficacy of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel and KITTI benchmarks, while simultaneously establishing new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow.
[359] SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
Patrick Rim,Kevin Harris,Braden Copple,Shangchen Han,Xu Xie,Ivan Shugurov,Sizhe An,He Wen,Alex Wong,Tomas Hodan,Kun He
Main category: cs.CV
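TL;DR: This paper proposes a lightweight, back-mounted multi-camera system that, combined with a VR headset, enables high-precision 3D hand-object annotation in genuinely in-the-wild environments, and releases SHOW3D, the first large-scale in-the-wild 3D hand-object interaction dataset.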
Details
Motivation: Existing hand-object interaction datasets are mostly captured in controlled studios and lack environmental diversity, so models trained on them generalize poorly. Method: Designs a synchronized, back-mounted multi-camera system jointly calibrated with a VR headset; develops an ego-exo pipeline for 3D tracking and annotation of hands and objects. Result: Builds the SHOW3D dataset, the first large-scale 3D hand-object interaction dataset covering diverse real-world scenes (including outdoor ones), and validates on downstream tasks that it significantly eases the trade-off between realism and annotation accuracy. Conclusion: This work overcomes the technical bottleneck of high-fidelity 3D annotation in the wild, providing a new benchmark and a practical tool for understanding hand-object interaction in real scenes. Abstract: Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. show3d-dataset.github.io
[360] On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Omer Dahary,Benaya Koren,Daniel Garibi,Daniel Cohen-Or
Main category: cs.CV
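TL;DR: This paper proposes a new framework that applies repulsion in the Contextual Space to improve the diversity of Diffusion Transformer models in text-to-image generation while preserving visual fidelity and semantic consistency.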
Details
Motivation: Modern text-to-image diffusion models achieve excellent semantic alignment but lack generative diversity (typicality bias), which limits creative applications; existing diversity-enhancing methods are either inefficient or disrupt image structure. Method: Within the multimodal attention channels of the Diffusion Transformer, on-the-fly Contextual Space repulsion is injected between intermediate blocks where text conditioning has already been enriched with image structure but the composition is not yet fixed. Result: Significantly improves generative diversity without harming visual quality or semantic accuracy; the computational overhead is small, and the method works with efficient models such as "Turbo" and distilled variants, where traditional trajectory-based interventions often fail. Conclusion: Contextual Space repulsion is an efficient, robust, and structure-aware diversity-enhancement mechanism that offers a new paradigm for controllable generation with diffusion models. Abstract: Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.
[361] PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models
Lorenza Prospero,Orest Kupyn,Ostap Viniavskyi,João F. Henriques,Christian Rupprecht
Main category: cs.CV
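TL;DR: This paper proposes PoseDreamer, which uses diffusion models to generate large-scale synthetic datasets with 3D mesh annotations, combining controllable image generation, Direct Preference Optimization, curriculum-based hard sample mining, and multi-stage quality filtering to substantially improve image quality and model performance.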
Details
Motivation: Labeled datasets for 3D human mesh estimation are hard to obtain: real data is small in scale and costly to annotate, while traditional synthetic data suffers from limited photorealism, low diversity, and high production costs. Method: Proposes PoseDreamer, a generative data construction pipeline that combines controllable diffusion-based image generation, Direct Preference Optimization (DPO) for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering, so that generated images stay in correspondence with their 3D labels and the pipeline focuses on high-value samples. Result: Generates more than 500,000 high-quality synthetic samples, with a 76% improvement in image-quality metrics over rendering-based datasets; models trained on it match or exceed those trained on real or traditional synthetic data; combining it with traditional synthetic data outperforms the real-plus-synthetic combination. Conclusion: Generated data is a viable new route to high-quality datasets for 3D human estimation, and PoseDreamer provides an efficient, scalable, and practical paradigm for it. Abstract: Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.
[362] HandX: Scaling Bimanual Motion and Interaction Generation
Zimu Zhang,Yucheng Zhang,Xiyan Xu,Ziyin Wang,Sirui Xu,Kai Zhou,Bing Zhou,Chuan Guo,Jian Wang,Yu-Xiong Wang,Liang-Yan Gui
Main category: cs.CV
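TL;DR: This paper presents HandX, a unified foundation spanning data, annotation, and evaluation, aimed at the shortcomings of existing whole-body motion synthesis models in modeling fine-grained hand motion and bimanual interaction. It builds a high-quality dataset by consolidating and filtering existing data, collecting high-quality bimanual interaction motion-capture data, and using large language models to help generate semantically rich, fine-grained annotations. On this basis it benchmarks diffusion and autoregressive models and proposes new hand-focused evaluation metrics, confirming that model scale and data quality both improve generation quality.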
Details
Motivation: Existing whole-body motion synthesis models struggle to model fine-grained hand motion accurately (e.g., finger articulation, contact timing, and inter-hand coordination), and high-fidelity bimanual interaction data with detailed finger dynamics is lacking. Method: Proposes the unified HandX framework: (1) consolidates and filters existing data and collects new high-fidelity bimanual motion-capture data; (2) designs a decoupled annotation strategy that first extracts key motion features such as contact events and finger flexion, then uses large language models to generate semantically rich, fine-grained descriptions; (3) benchmarks diffusion and autoregressive models on the new data under multiple conditioning modes and introduces hand-focused evaluation metrics. Result: Experiments show that the approach generates high-quality dexterous hand motion; the newly proposed metrics effectively support evaluation; and larger models trained on larger, higher-quality data yield markedly more semantically coherent bimanual interaction. Conclusion: HandX fills the data and methodology gap in high-fidelity bimanual interaction motion modeling, providing a solid foundation and open resources for future research on dexterous hand motion synthesis. Abstract: Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
[363] Gen-Searcher: Reinforcing Agentic Search for Image Generation
Kaituo Feng,Manyuan Zhang,Shuang Chen,Yunlong Lin,Kaixuan Fan,Yilei Jiang,Hongyu Li,Dian Zheng,Chenyang Wang,Xiangyu Yue
Main category: cs.CV
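TL;DR: This paper proposes Gen-Searcher, the first search-augmented image generation agent, which improves generation accuracy through multi-hop reasoning and external knowledge retrieval, and builds a dedicated dataset and the KnowGen evaluation benchmark. It is trained with SFT followed by reinforcement learning with dual (text and image) rewards, significantly improving existing models on knowledge-intensive image generation tasks.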