Table of Contents
cs.CL
[1] Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
Canxiang Yan,Chunxiang Jin,Dawei Huang,Haibing Yu,Han Peng,Hui Zhan,Jie Gao,Jing Peng,Jingdong Chen,Jun Zhou,Kaimeng Ren,Ming Yang,Mingxue Yang,Qiang Xu,Qin Zhao,Ruijie Xiong,Shaoxiong Lin,Xuezhi Wang,Yi Yuan,Yifei Wu,Yongjie Lyu,Zhengyu He,Zhihao Qiu,Zhiqiang Fang,Ziyuan Huang
Main category: cs.CL
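TL;DR: Proposes MingTok-Audio, a unified continuous speech tokenizer, and Ming-UniAudio, a speech language model, to unify speech understanding, generation, and free-form editing, and releases Ming-Freeform-Audio-Edit, the first benchmark for free-form speech editing driven by natural-language instructions.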
Details
Motivation: Existing speech models face a representation conflict between understanding and generation tasks, which makes instruction-based free-form speech editing hard to support.
Method: Design MingTok-Audio, a unified continuous speech tokenizer that integrates semantic and acoustic features; build on it the speech language model Ming-UniAudio, which handles both understanding and generation; and train a dedicated editing model, Ming-UniAudio-Edit, that supports editing driven by natural-language instructions without timestamps.
Result: Ming-UniAudio reaches SOTA on 8 of 12 metrics on the ContextASR benchmark; Chinese voice cloning achieves a Seed-TTS-WER of 0.95; the first free-form speech editing benchmark, Ming-Freeform-Audio-Edit, is released.
Conclusion: The framework unifies speech understanding, generation, and editing, advancing instruction-driven free-form editing of speech content.
Abstract: Existing speech models suffer from competing requirements on token representations by understanding and generation tasks. This discrepancy in representation prevents speech language models from performing instruction-based free-form editing. To solve this challenge, we introduce a novel framework that unifies speech understanding, generation, and editing. The core of our unified model is a unified continuous speech tokenizer MingTok-Audio, the first continuous tokenizer to effectively integrate semantic and acoustic features, which makes it suitable for both understanding and generation tasks. Based on this unified continuous audio tokenizer, we developed the speech language model Ming-UniAudio, which achieved a balance between generation and understanding capabilities. Ming-UniAudio sets new state-of-the-art (SOTA) records on 8 out of 12 metrics on the ContextASR benchmark. Notably, for Chinese voice cloning, it achieves a highly competitive Seed-TTS-WER of 0.95. Leveraging this foundational model, we further trained a dedicated speech editing model Ming-UniAudio-Edit, the first speech language model that enables universal, free-form speech editing guided solely by natural language instructions, handling both semantic and acoustic modifications without timestamp conditions. To rigorously assess the editing capability and establish a foundation for future research, we introduce Ming-Freeform-Audio-Edit, the first comprehensive benchmark tailored for instruction-based free-form speech editing, featuring diverse scenarios and evaluation dimensions spanning semantic correctness, acoustic quality, and instruction alignment. We open-sourced the continuous audio tokenizer, the unified foundational model, and the free-form instruction-based editing model to facilitate the development of unified audio understanding, generation, and manipulation.
[2] Retracing the Past: LLMs Emit Training Data When They Get Lost
Myeongseob Ko,Nikhil Reddy Billa,Adam Nguyen,Charles Fleming,Ming Jin,Ruoxi Jia
Main category: cs.CL
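TL;DR: Proposes Confusion-Inducing Attacks (CIA), a framework that extracts training data memorized by large language models by systematically increasing model uncertainty, and introduces Mismatched SFT to strengthen attacks on aligned models; experiments show the method outperforms existing baselines.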
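The sustained high-entropy signal that CIA builds on can be made concrete with a small sketch. The code below is not the attack itself: it only computes token-level entropy from next-token logits and measures the longest consecutive run above a threshold. The function names, the toy logits, and the threshold value are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def longest_high_entropy_run(entropies, threshold):
    """Length of the longest run of consecutive steps with entropy above threshold."""
    best = run = 0
    for h in entropies:
        run = run + 1 if h > threshold else 0
        best = max(best, run)
    return best


# Toy logits for four decoding steps over a 4-token vocabulary (made up).
steps = [
    [8.0, 0.0, 0.0, 0.0],  # confident step
    [0.1, 0.0, 0.2, 0.1],  # near-uniform step
    [0.0, 0.0, 0.0, 0.0],  # uniform step
    [9.0, 0.0, 0.0, 0.0],  # confident step
]
entropies = [token_entropy(s) for s in steps]
print(longest_high_entropy_run(entropies, threshold=1.0))
```

An attack in the spirit of the summary would search for inputs that lengthen this run; how that optimization is done is not specified in the abstract and is left out here.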
Details
Motivation: Memorization of training data by large language models raises privacy and copyright concerns; existing data extraction methods have limited success and offer little insight into the mechanisms of memorization leakage.
Method: Propose the CIA framework, which exploits the observation that a sustained high-entropy state precedes the emission of memorized content and optimizes inputs to induce that state; for aligned models, apply Mismatched Supervised Fine-tuning to weaken alignment and induce confusion.
Result: Experiments on a range of aligned and unaligned LLMs show that CIA extracts verbatim and near-verbatim training data more effectively than existing methods, without prior knowledge of the training data.
Conclusion: CIA offers a more systematic way to assess memorization risk in LLMs and shows that memorization leakage is widespread in current models.
Abstract: The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer limited insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing susceptibility to our attacks. Experiments on various unaligned and aligned LLMs demonstrate that our proposed attacks outperform existing baselines in extracting verbatim and near-verbatim training data without requiring prior knowledge of the training data. Our findings highlight persistent memorization risks across various LLMs and offer a more systematic method for assessing these vulnerabilities.
[3] Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning
Rufan Zhang,Lin Zhang,Xianghang Mi
Main category: cs.CL
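TL;DR: Proposes a unified framework based on in-context learning (ICL) that uses foundation models to adaptively detect toxicity, spam, and negative sentiment, supporting lightweight personalized content moderation without retraining.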
Details
Motivation: Existing content moderation systems are centralized, task-specific, and opaque, and cannot meet diverse user preferences or the needs of privacy-sensitive or decentralized environments.
Method: Use the in-context learning ability of foundation models to unify harmful-content detection across binary, multi-class, and multi-label settings, with per-user personalization through simple prompt interventions and no model fine-tuning.
Result: Experiments on public benchmarks and a newly annotated Mastodon dataset show that foundation models generalize strongly across tasks, that a single example is enough for effective personalization, and that adding label definitions or rationales improves robustness to noisy data.
Conclusion: The work demonstrates the potential of ICL for building practical, privacy-preserving, highly adaptive, user-centric content safety systems, moving moderation from a one-size-fits-all model toward personalized approaches.
Abstract: The proliferation of harmful online content--e.g., toxicity, spam, and negative sentiment--demands robust and adaptable moderation systems. However, prevailing moderation systems are centralized and task-specific, offering limited transparency and neglecting diverse user preferences--an approach ill-suited for privacy-sensitive or decentralized environments. We propose a novel framework that leverages in-context learning (ICL) with foundation models to unify the detection of toxicity, spam, and negative sentiment across binary, multi-class, and multi-label settings. Crucially, our approach enables lightweight personalization, allowing users to easily block new categories, unblock existing ones, or extend detection to semantic variations through simple prompt-based interventions--all without model retraining. Extensive experiments on public benchmarks (TextDetox, UCI SMS, SST2) and a new, annotated Mastodon dataset reveal that: (i) foundation models achieve strong cross-task generalization, often matching or surpassing task-specific fine-tuned models; (ii) effective personalization is achievable with as few as one user-provided example or definition; and (iii) augmenting prompts with label definitions or rationales significantly enhances robustness to noisy, real-world data. Our work demonstrates a definitive shift beyond one-size-fits-all moderation, establishing ICL as a practical, privacy-preserving, and highly adaptable pathway for the next generation of user-centric content safety systems. To foster reproducibility and facilitate future research, we publicly release our code on GitHub and the annotated Mastodon dataset on Hugging Face.
[4] MCP4IFC: IFC-Based Building Design Using Large Language Models
Bharathi Kannan Nithyanantham,Tobias Sesterhenn,Ashwin Nedungadi,Sergio Peral Garijo,Janis Zenkner,Christian Bartelt,Stefan Lüdtke
Main category: cs.CL
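TL;DR: MCP4IFC is an open-source framework that lets large language models manipulate IFC data directly through the Model Context Protocol, translating natural-language instructions into BIM operations.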
Details
Motivation: Bringing generative AI into architecture, engineering, and construction requires systems that translate natural-language instructions into operations on standardized data models.
Method: Present the MCP4IFC framework, which integrates a set of BIM tools: scene querying, predefined modeling functions, and a dynamic code-generation system that combines in-context learning with retrieval-augmented generation.
Result: Experiments show that an LLM using the framework can complete complex tasks, from building a simple house to querying and editing existing IFC data.
Conclusion: MCP4IFC provides an open foundation for LLM-driven BIM design and advances AI-assisted modeling workflows.
Abstract: Bringing generative AI into the architecture, engineering and construction (AEC) field requires systems that can translate natural language instructions into actions on standardized data models. We present MCP4IFC, a comprehensive open-source framework that enables Large Language Models (LLMs) to directly manipulate Industry Foundation Classes (IFC) data through the Model Context Protocol (MCP). The framework provides a set of BIM tools, including scene querying tools for information retrieval, predefined functions for creating and modifying common building elements, and a dynamic code-generation system that combines in-context learning with retrieval-augmented generation (RAG) to handle tasks beyond the predefined toolset. Experiments demonstrate that an LLM using our framework can successfully perform complex tasks, from building a simple house to querying and editing existing IFC data. Our framework is released as open-source to encourage research in LLM-driven BIM design and provide a foundation for AI-assisted modeling workflows. Our code is available at https://show2instruct.github.io/mcp4ifc/.
[5] FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
Kunxi Li,Yufan Xiong,Zhonghua Jiang,Yiyun Zhou,Zhaode Wang,Chengfei Lv,Shengyu Zhang
Main category: cs.CL
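TL;DR: FlowMM is an adaptive multimodal KV cache merging framework that uses cross-modal information flow and sensitivity-aware token matching to substantially cut memory use and decoding latency while preserving task performance.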
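For readers unfamiliar with KV merging, the general idea of folding evicted entries into similar retained ones can be sketched as follows. This is a generic similarity-based merge, not FlowMM: it has no layer-specific strategy, no cross-modal information flow, and no sensitivity term. The function names and the plain averaging rule are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_evicted(retained, evicted):
    """Average each evicted vector into its most similar retained vector.

    Returns one merged vector per retained entry, so the cache size
    stays at len(retained) instead of len(retained) + len(evicted).
    """
    groups = [[r] for r in retained]
    for e in evicted:
        best = max(range(len(retained)), key=lambda i: cosine(retained[i], e))
        groups[best].append(e)
    # Element-wise mean over each group.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


# Toy 2-dimensional key vectors (made up).
retained = [[1.0, 0.0], [0.0, 1.0]]
evicted = [[0.9, 0.1]]
print(merge_evicted(retained, evicted))
```

A sensitivity-aware variant would refuse to merge into (or out of) entries flagged as task-critical; how FlowMM scores sensitivity is not described in the abstract.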
Details
Motivation: Traditional KV cache eviction strategies tend to cause context loss or hallucinations in multimodal settings, and existing KV merging methods are limited by distributional biases across modalities and by attentional biases.
Method: Propose the FlowMM framework, which dynamically applies layer-specific merging strategies based on cross-modal information flow and introduces a sensitivity-adaptive token matching mechanism that combines token similarity with task-critical sensitivity for low-risk merging.
Result: Experiments on several leading MLLMs show that FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x while maintaining competitive task performance.
Conclusion: FlowMM improves both the efficiency of KV cache management and generation quality in multimodal large models, offering a new approach to multimodal context compression.
Abstract: Traditional KV cache eviction strategies, which discard less critical KV-pairs based on attention scores, often degrade generation quality, causing context loss or hallucinations. Recent efforts shift toward KV merging, merging eviction tokens with retention tokens based on similarity. However, in multimodal scenarios, distributional biases across modality tokens and attentional biases in cross-modal interactions limit its effectiveness. This work introduces FlowMM, an adaptive framework for cross-modal information flow-guided multimodal KV cache merging. FlowMM leverages cross-modal information flow to dynamically apply layer-specific merging strategies, capturing modality-specific patterns while preserving contextual integrity. Furthermore, we introduce a sensitivity-adaptive token matching mechanism that jointly evaluates token similarity and task-critical sensitivity, merging low-risk tokens while safeguarding high-sensitivity ones. Extensive experiments across diverse leading MLLMs show that FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x, while maintaining competitive task performance.
[6] Future of AI Models: A Computational perspective on Model collapse
Trivikram Satharasi,S Sitharama Iyengar
Main category: cs.CL
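TL;DR: By analyzing semantic similarity in English Wikipedia from 2013 to 2025, this study quantifies and forecasts the risk of model collapse caused by the spread of AI-generated content, finding that semantic similarity rose markedly after public LLM adoption, threatening data diversity and model generalization.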
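The measurement behind this kind of study, average cosine similarity between document embeddings within each year, can be sketched in a few lines. The sketch assumes embeddings are already available as plain lists of floats; the embedding model, the sampling scheme, and the toy vectors below are placeholders, not the authors' data or pipeline.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Mean cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def similarity_by_year(embeddings_by_year):
    """Map each year to the mean pairwise similarity of its embeddings."""
    return {
        year: mean_pairwise_similarity(vecs)
        for year, vecs in sorted(embeddings_by_year.items())
    }


# Made-up 2-dimensional "embeddings" for two years, for illustration only.
toy = {
    2013: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    2025: [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]],
}
for year, sim in similarity_by_year(toy).items():
    print(year, round(sim, 3))
```

A rising value over successive years would indicate that sampled documents are becoming more alike in embedding space, which is the homogenization signal the summary describes.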
Details
Motivation: As AI-generated content spreads rapidly across the internet, the share of synthetic content in training data rises, which may push later model training into "model collapse", a degradation of semantic and linguistic diversity that harms performance and generalization. Quantifying this trend is therefore needed to assess its impact.
Method: Use Transformer embeddings and cosine similarity to analyze semantic similarity in year-by-year slices of English Wikipedia (filtered Common Crawl) from 2013 to 2025, examining trends before and after widespread LLM adoption.
Result: Semantic similarity was already rising slowly before public LLM adoption, possibly because of text normalization by early RNN/LSTM systems; after widespread adoption it grows exponentially, with fluctuations attributed to linguistic diversity, varying data size, and sampling error.
Conclusion: Recursive contamination by AI-generated content is accelerating semantic homogenization; the risk of model collapse is already visible, and its long-term effect on training quality and the data ecosystem warrants caution.
Abstract: Artificial Intelligence, especially Large Language Models (LLMs), has transformed domains such as software engineering, journalism, creative writing, academia, and media (Naveed et al. 2025; arXiv:2307.06435). Diffusion models like Stable Diffusion generate high-quality images and videos from text. Evidence shows rapid expansion: 74.2% of newly published webpages now contain AI-generated material (Ryan Law 2025), 30-40% of the active web corpus is synthetic (Spennemann 2025; arXiv:2504.08755), 52% of U.S. adults use LLMs for writing, coding, or research (Staff 2025), and audits find AI involvement in 18% of financial complaints and 24% of press releases (Liang et al. 2025). The underlying neural architectures, including Transformers (Vaswani et al. 2023; arXiv:1706.03762), RNNs, LSTMs, GANs, and diffusion networks, depend on large, diverse, human-authored datasets (Shi & Iyengar 2019). As synthetic content dominates, recursive training risks eroding linguistic and semantic diversity, producing Model Collapse (Shumailov et al. 2024; arXiv:2307.15043; Dohmatob et al. 2024; arXiv:2402.07712). This study quantifies and forecasts collapse onset by examining year-wise semantic similarity in English-language Wikipedia (filtered Common Crawl) from 2013 to 2025 using Transformer embeddings and cosine similarity metrics. Results reveal a steady rise in similarity before public LLM adoption, likely driven by early RNN/LSTM translation and text-normalization pipelines, though modest due to a smaller scale. Similarity rises exponentially after the public adoption of LLMs, while the observed fluctuations reflect irreducible linguistic diversity, variable corpus size across years, and finite sampling error. These findings provide a data-driven estimate of when recursive AI contamination may significantly threaten data richness and model generalization.
[7] Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla,Alex Oesterling,Claudio Mayrink Verdun,Himabindu Lakkaraju,Flavio P. Calmon
Main category: cs.CL
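TL;DR: Proposes Temporal Sparse Autoencoders (T-SAEs), which add a contrastive loss to capture the long-range dependencies of semantic features, disentangling semantic from syntactic features in language models without supervision.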
Details
Motivation: Existing sparse autoencoders are biased when discovering the deeper semantic concepts behind language understanding: they tend to capture shallow, local, or noisy features and ignore the rich structure of language, such as the long-range dependencies of semantics.
Method: Propose T-SAEs, which introduce a new contrastive loss that encourages high-level semantic features to activate consistently across adjacent tokens, separating semantic from syntactic features without supervision.
Result: Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality, and show clear semantic structure even without an explicit semantic signal.
Conclusion: By incorporating a prior about linguistic structure, T-SAEs improve dictionary learning methods for language model interpretability and offer a new path for unsupervised interpretability.
Abstract: Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as "the phrase 'The' at the start of sentences". In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
[8] UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Preston Firestone,Shubham Ugare,Gagandeep Singh,Sasa Misailovic
Main category: cs.CL
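TL;DR: Formalizes subword tokenization using monoid theory, proves that vocabularies containing ill-formed UTF-8 tokens can produce ill-formed UTF-8 sequences, shows that incremental and whole-sequence decoding can give different results, and confirms related bugs in real systems.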
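The mismatch between incremental and whole-sequence decoding is easy to reproduce with the Python standard library. In the sketch below, the byte-level "tokens" are hand-picked to split the three-byte encoding of one character across two tokens; they are illustrative and do not come from any real tokenizer. The stateful decoder at the end is one standard-library way to buffer partial sequences; whether the paper evaluates this particular mitigation is not stated in the abstract.

```python
import codecs


def decode_per_token(tokens):
    """Decode each token's bytes independently, then concatenate the strings."""
    return "".join(t.decode("utf-8", errors="replace") for t in tokens)


def decode_whole(tokens):
    """Concatenate all bytes first, then decode once."""
    return b"".join(tokens).decode("utf-8", errors="replace")


def decode_stateful(tokens):
    """Decode token by token while buffering incomplete byte sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(t) for t in tokens]
    pieces.append(decoder.decode(b"", final=True))  # flush any leftover bytes
    return "".join(pieces)


# U+4F60 is E4 BD A0 and U+597D is E5 A5 BD in UTF-8.
# The first character is split across two hypothetical tokens.
tokens = [b"\xe4\xbd", b"\xa0", b"\xe5\xa5\xbd"]

print(ascii(decode_whole(tokens)))      # both characters recovered
print(ascii(decode_per_token(tokens)))  # contains U+FFFD replacement characters
print(decode_stateful(tokens) == decode_whole(tokens))
```

Note that buffering only helps when the full sequence is well-formed; if the model emits a truly ill-formed byte sequence, every strategy must still replace, reject, or otherwise handle the bad bytes.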
Details
Motivation: Address the problem that byte-based subword tokenization can produce ill-formed UTF-8 sequences, to avoid decoding errors and security vulnerabilities in language model applications.
Method: Model tokenization formally using monoid theory, prove that the presence of ill-formed UTF-8 tokens leads to ill-formed output sequences, and analyze the difference between incremental and whole-sequence decoding.
Result: Proves that tokenizers with ill-formed UTF-8 tokens can always produce ill-formed UTF-8 sequences, and documents related problems in major foundation models, serving engines, and constrained generation systems.
Conclusion: Language model systems must handle the interaction between tokenization and UTF-8 decoding carefully; decoding the full sequence or ensuring that the vocabulary consists of valid UTF-8 is recommended to avoid errors.
Abstract: Subword tokenization segments input text according to a pre-defined vocabulary to feed it into a language model; the language model, in turn, generates a sequence made from this same vocabulary. The members of the vocabulary can be built of code points or bytes. Using code points means that all members of the vocabulary are valid UTF-8 characters. However, it also requires thousands of initial members to achieve acceptable coverage of inputs. Beginning with bytes, on the contrary, avoids out-of-vocabulary errors with only 256 initial members of the vocabulary, but the members of the vocabulary and sequences of them are not guaranteed to be valid UTF-8. Sequences that are not valid UTF-8 break code that assumes its input to be valid UTF-8. Applications of language models must account for the breakage thereby introduced. In this paper, we formalize tokenization using monoid theory and prove that tokenizers whose vocabularies contain tokens that are ill-formed UTF-8 can always produce sequences that are ill-formed UTF-8. We demonstrate formally that attempting to incrementally convert tokens back to a string and interpret the results as UTF-8 gives different results than converting the whole sequence of tokens at once. This formal result predicts real-world bugs: we evaluate mitigations for the problem identified and provide case studies of major foundation models, serving engines, and constrained generation systems.
[9] Optimizing Diversity and Quality through Base-Aligned Model Collaboration
Yichen Wang,Chenghao Yang,Tenghao Huang,Muhao Chen,Jonathan May,Mina Lee
Main category: cs.CL
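TL;DR: Proposes BACo, a framework in which a base model and its aligned counterpart collaborate at the token level during inference, dynamically combining their strengths to substantially increase diversity while preserving generation quality, balancing the two in a single decoding pass.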
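Token-level routing between two models can be illustrated with a toy uncertainty-based router. Everything specific here is an assumption: the summary does not say which model's uncertainty is used, in which direction the decision goes, or how semantic role enters. The rule below (decode from the base model when its next-token entropy exceeds a threshold, otherwise from the aligned model) is only one plausible instance, and greedy selection is used to keep the example deterministic.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a token -> probability mapping."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def route(base_probs, aligned_probs, threshold):
    """Hypothetical router: 'base' when the base model is uncertain, else 'aligned'.

    aligned_probs is accepted for symmetry; a richer router could use it too.
    """
    return "base" if entropy(base_probs) > threshold else "aligned"


def next_token(base_probs, aligned_probs, threshold):
    """Pick the most likely token from whichever model the router selects."""
    choice = route(base_probs, aligned_probs, threshold)
    chosen = base_probs if choice == "base" else aligned_probs
    return max(chosen, key=chosen.get)


# Made-up next-token distributions for a single decoding step.
flat_base = {"sun": 0.25, "moon": 0.25, "rain": 0.25, "wind": 0.25}
peaked_base = {"the": 0.97, "a": 0.03}
aligned = {"the": 0.9, "a": 0.1}

print(route(flat_base, aligned, threshold=1.0))
print(route(peaked_base, aligned, threshold=1.0))
```

In this toy setup the threshold acts as a single control knob: lowering it sends more steps to the base model, raising it sends more to the aligned model.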
Details
Motivation: Alignment improves the output quality of large language models but sacrifices diversity, so generations converge; existing diversity-promoting methods often degrade quality or add compute cost, and an effective way to balance the two is lacking.
Method: Propose the Base-Aligned Model Collaboration (BACo) framework, with routing strategies based on prediction uncertainty and semantic role that decide, at each token, whether to decode from the base model or the aligned model, giving token-level model collaboration at inference time.
Result: Across three open-ended generation tasks and 13 metrics, BACo consistently outperforms state-of-the-art inference-time baselines; the best router achieves a 21.3% joint improvement in diversity and quality, and human evaluation confirms the gains.
Conclusion: Collaboration between base and aligned models at inference time can optimize and control the diversity and quality of generations; BACo offers an efficient, controllable way to handle this trade-off.
Abstract: Alignment has greatly improved large language models (LLMs)' output quality at the cost of diversity, yielding highly similar outputs across generations. We propose Base-Aligned Model Collaboration (BACo), an inference-time token-level model collaboration framework that dynamically combines a base LLM with its aligned counterpart to optimize diversity and quality. Inspired by prior work (Fei et al., 2025), BACo employs routing strategies that determine, at each token, from which model to decode based on next-token prediction uncertainty and predicted contents' semantic role. Prior diversity-promoting methods, such as retraining, prompt engineering, and multi-sampling methods, improve diversity but often degrade quality or require costly decoding or post-training. In contrast, BACo achieves both high diversity and quality post hoc within a single pass, while offering strong controllability. We explore a family of routing strategies; across three open-ended generation tasks and 13 metrics covering diversity and quality, BACo consistently surpasses state-of-the-art inference-time baselines. With our best router, BACo achieves a 21.3% joint improvement in diversity and quality. Human evaluations also mirror these improvements. The results suggest that collaboration between base and aligned models can optimize and control diversity and quality.
[10] OckBench: Measuring the Efficiency of LLM Reasoning
Zheng Du,Hao Kang,Song Han,Tushar Krishna,Ligeng Zhu
Main category: cs.CL
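TL;DR: Presents OckBench, a model- and hardware-agnostic benchmark that evaluates both the accuracy and the decoding token efficiency of large language models on reasoning and code generation tasks, arguing that token consumption should be a key evaluation dimension.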
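The accuracy-versus-tokens trade-off can be summarized with a Pareto frontier: a model is kept if no other model is at least as accurate while using no more tokens, with a strict improvement on at least one axis. The sketch below uses invented model names and numbers purely for illustration; it is not OckBench's code or data.

```python
def pareto_frontier(results):
    """Return names of models not dominated on (higher accuracy, fewer tokens).

    results: list of (name, accuracy, avg_tokens) tuples.
    """
    frontier = []
    for name, acc, tok in results:
        dominated = any(
            (a >= acc and t <= tok) and (a > acc or t < tok)
            for _, a, t in results
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical benchmark results: (model, accuracy, average decoded tokens).
results = [
    ("model_a", 0.80, 1200),
    ("model_b", 0.80, 9000),  # same accuracy as model_a, far more tokens
    ("model_c", 0.85, 4000),
    ("model_d", 0.70, 5000),
]
print(pareto_frontier(results))
```

Here model_b and model_d are dominated by model_a, while model_c stays on the frontier because its extra tokens buy higher accuracy.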
Details
Motivation: Existing benchmarks focus on accuracy and output quality and overlook how decoding token efficiency affects latency, cost, and energy use in real systems.
Method: Design OckBench, a benchmark that measures both accuracy and token count, and run experiments on multiple open- and closed-source models to analyze the trade-off between accuracy and token consumption.
Result: Many models with similar accuracy differ widely in token consumption, showing that efficiency is a neglected but important axis of differentiation; the paper also presents accuracy-efficiency Pareto frontiers.
Conclusion: Evaluation should stop treating tokens as a "free" resource; OckBench provides a unified platform for research on efficient reasoning.
Abstract: Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks mainly focus on accuracy and output quality, and they ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we introduce OckBench, a model-agnostic and hardware-agnostic benchmark that evaluates both accuracy and token count for reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we uncover that many models with comparable accuracy differ wildly in token consumption, revealing that efficiency variance is a neglected but significant axis of differentiation. We further demonstrate Pareto frontiers over the accuracy-efficiency plane and argue for an evaluation paradigm shift: we should no longer treat tokens as "free" to multiply. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning. Our benchmarks are available at https://ockbench.github.io/.
[11] In-Context Learning Without Copying
Kerem Sahin,Sheridan Feucht,Adam Belfki,Jannik Brinkmann,Aaron Mueller,David Bau,Chris Wendler
Main category: cs.CL
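TL;DR: Proposes Hapax, which omits the loss contribution of tokens that induction heads can predict, to study how inductive copying affects in-context learning (ICL); performance on abstractive ICL tasks is maintained or even improved with less inductive copying, indicating that inductive copying is not necessary for abstractive ICL.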
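The idea of dropping copy-predictable tokens from the loss can be sketched with a simple proxy. The rule used below is an assumption, not the paper's criterion: a token counts as "predictable by induction" if the most recent earlier occurrence of the preceding token was followed by the same token. Function names and the toy sequence are illustrative.

```python
def induction_mask(tokens):
    """Return a 0/1 list; 0 marks tokens a simple copy rule would predict.

    Proxy rule (assumed): position i is masked when the previous token
    appeared earlier and its most recent occurrence was followed by tokens[i].
    """
    keep = [1] * len(tokens)
    last_follower = {}  # token -> token that followed its latest occurrence
    for i in range(1, len(tokens)):
        prev, cur = tokens[i - 1], tokens[i]
        if last_follower.get(prev) == cur:
            keep[i] = 0
        last_follower[prev] = cur
    return keep


def masked_mean_loss(losses, keep):
    """Average per-token loss over the positions that are kept."""
    kept = [loss for loss, k in zip(losses, keep) if k]
    return sum(kept) / len(kept) if kept else 0.0


tokens = list("ABCABDAB")
keep = induction_mask(tokens)
print(keep)

per_token_loss = [2.0, 2.0, 2.0, 2.0, 0.1, 2.0, 2.0, 0.1]  # made-up values
print(masked_mean_loss(per_token_loss, keep))
```

In the toy sequence, the second and third occurrences of "B" after "A" are masked, so the cheap-to-copy positions no longer contribute to the training signal.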
Details
Motivation: Investigate whether inductive copying driven by induction heads is the key factor behind a model's in-context learning ability, and in particular whether a model can still develop complex ICL capabilities when this mechanism is suppressed.
Method: Propose the Hapax setting, in which training ignores the loss contribution of tokens that induction heads can predict correctly, thereby suppressing inductive copying, and evaluate model performance and mechanistic changes on a range of abstractive ICL tasks.
Result: Hapax substantially reduces inductive copying (31.7% of tokens are omitted from the loss) yet outperforms the baseline model on 13 of 21 tasks and achieves lower loss at non-induction positions; the model develops fewer and weaker induction heads while retaining ICL capabilities.
Conclusion: Inductive copying is not necessary for learning abstractive ICL mechanisms; models can develop effective ICL capabilities without strong induction heads.
Abstract: Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they often experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may serve as a prerequisite for more complex in-context learning (ICL) capabilities. In this work, we ask whether transformers can still acquire ICL capabilities when inductive copying is suppressed. We propose Hapax, a setting where we omit the loss contribution of any token that can be correctly predicted by induction heads. Despite a significant reduction in inductive copying, performance on abstractive ICL tasks (i.e., tasks where the answer is not contained in the input context) remains comparable and surpasses the vanilla model on 13 of 21 tasks, even though 31.7% of tokens are omitted from the loss. Furthermore, our model achieves lower loss values on token positions that cannot be predicted correctly by induction heads. Mechanistic analysis further shows that models trained with Hapax develop fewer and weaker induction heads but still preserve ICL capabilities. Taken together, our findings indicate that inductive copying is not essential for learning abstractive ICL mechanisms.
[12] Multi-Scale Feature Fusion and Graph Neural Network Integration for Text Classification with Large Language Models
Xiangchen Song,Yulin Huang,Jinxu Guo,Yuchen Liu,Yaxuan Luan
Main category: cs.CL
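TL;DR: Proposes a hybrid text classification method that combines large language models, feature pyramids, and graph neural networks, substantially improving classification performance in complex semantic settings.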
Details
Motivation: Improve text classification performance in complex semantic contexts and address the shortcomings of existing methods in multi-scale semantic fusion and structured modeling.
Method: Extract deep semantic features with a large language model, fuse features at multiple scales with a feature pyramid, convert the fused features into a graph structure, model latent relations among semantic units with a graph neural network, and produce the classification through a readout layer.
Result: The method outperforms existing models on ACC, F1-Score, AUC, and Precision, supporting its effectiveness and robustness.
Conclusion: The framework balances global and local information as well as semantic and structural modeling, offering a new approach to multi-scale fusion and structured semantic modeling in text classification.
Abstract: This study investigates a hybrid method for text classification that integrates deep feature extraction from large language models, multi-scale fusion through feature pyramids, and structured modeling with graph neural networks to enhance performance in complex semantic contexts. First, the large language model captures contextual dependencies and deep semantic representations of the input text, providing a rich feature foundation for subsequent modeling. Then, based on multi-level feature representations, the feature pyramid mechanism effectively integrates semantic features of different scales, balancing global information and local details to construct hierarchical semantic expressions. Furthermore, the fused features are transformed into graph representations, and graph neural networks are employed to capture latent semantic relations and logical dependencies in the text, enabling comprehensive modeling of complex interactions among semantic units. On this basis, the readout and classification modules generate the final category predictions. The proposed method demonstrates significant advantages in robustness alignment experiments, outperforming existing models on ACC, F1-Score, AUC, and Precision, which verifies the effectiveness and stability of the framework. This study not only constructs an integrated framework that balances global and local information as well as semantics and structure, but also provides a new perspective for multi-scale feature fusion and structured semantic modeling in text classification tasks.
[13] Language Generation: Complexity Barriers and Implications for Learning
Marcelo Arenas,Pablo Barceló,Luis Cofré,Alexander Kozachinskiy
Main category: cs.CL
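TL;DR: Shows that although Kleinberg and Mullainathan proved language generation is possible in principle, for simple language families such as regular and context-free languages the number of samples needed for successful generation can be enormous, in some cases not bounded by any computable function, revealing a large gap between theoretical possibility and practical learnability.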
Details
Motivation: Explain why modern language models generate language successfully in practice even though theory requires very many samples to guarantee generation, bridging the gap between theory and practice.
Method: Analyze classical language families such as regular and context-free languages to study lower bounds on the number of samples required for language generation and whether that number can be bounded by a computable function.
Result: For many simple language families the number of samples required for successful generation is extremely large, and in some cases has no computable upper bound, so generability in principle does not imply practical feasibility.
Conclusion: The success of modern language models cannot be explained by large-scale data alone; structural properties of natural language that make efficient generation possible with limited data must be taken into account.
Abstract: Kleinberg and Mullainathan showed that, in principle, language generation is always possible: with sufficiently many positive examples, a learner can eventually produce sentences indistinguishable from those of a target language. However, the existence of such a guarantee does not speak to its practical feasibility. In this work, we show that even for simple and well-studied language families -- such as regular and context-free languages -- the number of examples required for successful generation can be extraordinarily large, and in some cases not bounded by any computable function. These results reveal a substantial gap between theoretical possibility and efficient learnability. They suggest that explaining the empirical success of modern language models requires a refined perspective -- one that takes into account structural properties of natural language that make effective generation possible in practice.
[14] DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
Yaxuan Wang,Chris Yuhao Liu,Quan Liu,Jinglong Pang,Wei Wei,Yujia Bao,Yang Liu
Main category: cs.CL
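TL;DR: Proposes DRAGON, a framework that uses reasoning-augmented in-context instructions to achieve efficient unlearning in large language models without retain data, with good scalability and practical applicability.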
Details
Motivation: Existing unlearning methods usually need access to retain data to balance unlearning against model performance, but retain data is often unavailable in practice, which limits their use.
Method: Propose the DRAGON framework, which uses in-context chain-of-thought (CoT) instructions and a lightweight detection module to identify prompts that should be forgotten, then applies in-context intervention through a dedicated CoT guard model, without modifying the base model.
Result: Experiments on three representative unlearning tasks show strong unlearning effectiveness, continual unlearning ability, and scalability, with applicability to data-limited real-world settings.
Conclusion: DRAGON provides an efficient, practical unlearning mechanism for large language models that does not depend on retain data and balances safety with general model capability.
Abstract: Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real-world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning performance, we introduce novel metrics for unlearning performance and the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
[15] Quantifying Edits Decay in Fine-tuned LLMs
Yinjie Cheng,Paul Youssef,Christin Seifert,Jörg Schlötterer,Zhixue Zhao
Main category: cs.CL
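TL;DR: Systematically evaluates how well knowledge edits persist after fine-tuning, finding that most edits decay and that the decay depends on the editing method, the fine-tuning strategy, and the model architecture.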
Details
Motivation: Investigate whether knowledge edits remain effective under subsequent fine-tuning, motivated by the practical needs of removing malicious edits and preserving beneficial ones.
Method: Evaluate two state-of-the-art editing methods (MEMIT, AlphaEdit) and three fine-tuning approaches (full-parameter, LoRA, DoRA) across five large language models and three datasets, 232 configurations in total, and analyze how edits decay.
Result: Edits generally decay after fine-tuning, and AlphaEdit edits decay more than MEMIT edits; fine-tuning only the edited layers effectively removes edits, whereas fine-tuning only the non-edited layers damages edits more than full fine-tuning.
Conclusion: Combining knowledge editing with fine-tuning requires careful design; the proposed selective-layer fine-tuning strategy offers a workable option in practice, and evaluations of editing methods should consider the full application pipeline.
Abstract: Knowledge editing has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits as shown in Figure 1, current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edits decay after fine-tuning, investigating how fine-tuning affects knowledge editing. We evaluate two state-of-the-art editing methods (MEMIT, AlphaEdit) and three fine-tuning approaches (full-parameter, LoRA, DoRA) across five LLMs and three datasets, yielding 232 experimental configurations. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we propose selective-layer fine-tuning and find that fine-tuning edited layers only can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
[16] Retrieval-Augmented Generation in Medicine: A Scoping Review of Technical Implementations, Clinical Applications, and Ethical Considerations
Rui Yang,Matthew Yu Heng Wong,Huitao Li,Xin Li,Wentao Zhu,Jingchi Liao,Kunyu Yu,Jonathan Chong Kai Liew,Weihao Xuan,Yingjian Chen,Yuhe Ke,Jasmine Chiat Ling Ong,Douglas Teodoro,Chuan Hong,Daniel Shi Wei Ting,Nan Liu
Main category: cs.CL
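TL;DR: Reviews the current state of retrieval-augmented generation (RAG) in medicine, finding that it remains at an early stage, relies mainly on public data and general-purpose large language models, and pays too little attention to bias and safety in evaluation; future work needs stronger clinical validation, multilingual adaptation, and support for low-resource settings.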
Details
Motivation: Medical knowledge is growing rapidly and clinical practice is increasingly complex; large language models (LLMs) show promise but have limitations, so RAG is needed to improve their applicability and reliability in clinical use.
Method: Conduct a literature review that systematically analyzes RAG applications in medicine, covering data sources, retrieval models, generation models, evaluation methods, and application scenarios.
Result: RAG in medicine is mainly applied to question answering, report generation, text summarization, and information extraction; most studies use public data and English-centric embedding models, with little support for private data or non-English settings; evaluation relies mostly on automated metrics and limited human assessment, with insufficient attention to bias and safety.
Conclusion: Medical RAG is still at an early stage and needs further progress in clinical validation, cross-linguistic adaptation, and support for low-resource settings to enable trustworthy and responsible global use.
Abstract: The rapid growth of medical knowledge and increasing complexity of clinical practice pose challenges. In this context, large language models (LLMs) have demonstrated value; however, inherent limitations remain. Retrieval-augmented generation (RAG) technologies show potential to enhance their clinical applicability. This study reviewed RAG applications in medicine. We found that research primarily relied on publicly available data, with limited application in private data. For retrieval, approaches commonly relied on English-centric embedding models, while LLMs were mostly generic, with limited use of medical-specific LLMs. For evaluation, automated metrics evaluated generation quality and task performance, whereas human evaluation focused on accuracy, completeness, relevance, and fluency, with insufficient attention to bias and safety. RAG applications were concentrated on question answering, report generation, text summarization, and information extraction. Overall, medical RAG remains at an early stage, requiring advances in clinical validation, cross-linguistic adaptation, and support for low-resource settings to enable trustworthy and responsible global use.
[17] NILC: Discovering New Intents with LLM-assisted Clustering
Hongtao Wang,Renchi Yang,Wenqing Lin
Main category: cs.CL
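TL;DR: This paper proposes NILC, a new intent discovery (NID) framework that uses large language models (LLMs) to iteratively refine cluster centroids and text embeddings, improving recognition of both unknown and known intents.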
Details
Motivation: Existing NID methods use a cascaded architecture with no feedback between stages, and embedding-only clustering overlooks fine-grained semantics, which limits performance. Method: NILC follows an iterative workflow: LLMs generate semantically enriched cluster centroids and rewrite hard samples (ambiguous or terse utterances) to improve clustering; in the semi-supervised setting, seeding and soft must-links are introduced as supervision signals. Result: On six benchmark datasets from diverse domains, NILC significantly outperforms multiple recent baselines in both unsupervised and semi-supervised settings. Conclusion: By combining the semantic understanding of LLMs with iterative refinement, NILC improves the accuracy and robustness of new intent discovery. Abstract: New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, which finds prevalent use in practical dialogue systems. Existing works towards NID mainly adopt a cascaded architecture, wherein the first stage focuses on encoding the utterances into informative text embeddings beforehand, while the latter is to group similar embeddings into clusters (i.e., intents), typically by K-Means. However, such a cascaded pipeline fails to leverage the feedback from both steps for mutual refinement, and, meanwhile, the embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance. To bridge this gap, this paper proposes NILC, a novel clustering framework specially catered for effective NID. Particularly, NILC follows an iterative workflow, in which clustering assignments are judiciously updated by carefully refining cluster centroids and text embeddings of uncertain utterances with the aid of large language models (LLMs). Specifically, NILC first taps into LLMs to create additional semantic centroids for clusters, thereby enriching the contextual semantics of the Euclidean centroids of embeddings. Moreover, LLMs are then harnessed to augment hard samples (ambiguous or terse utterances) identified from clusters via rewriting for subsequent cluster correction. Further, we inject supervision signals through the non-trivial techniques of seeding and soft must-links for more accurate NID in the semi-supervised setting. Extensive experiments comparing NILC against multiple recent baselines under both unsupervised and semi-supervised settings showcase that NILC can consistently achieve significant performance improvements over six benchmark datasets of diverse domains.
[18] IDALC: A Semi-Supervised Framework for Intent Detection and Active Learning based Correction
Ankan Mullick,Sukannya Purkayastha,Saransh Sharma,Pawan Goyal,Niloy Ganguly
Main category: cs.CL
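TL;DR: This paper proposes IDALC, a semi-supervised framework that detects user intents and corrects system-rejected utterances, substantially reducing manual annotation cost while outperforming baselines on accuracy and macro-F1.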
Details
Motivation: Voice-controlled dialog systems reject user input when confidence is low or the intent is new, producing large volumes of data that require manual annotation and making retraining expensive; an efficient way to reduce annotation cost is therefore needed. Method: IDALC combines intent detection with active learning, training iteratively on a small amount of labeled data and a large amount of unlabeled data to identify new intents and correct system errors automatically. Result: On several benchmark datasets, IDALC achieves 5-10% higher accuracy and a 4-8% improvement in macro-F1 over baselines, with annotation cost at only 6-10% of the unlabeled data. Conclusion: IDALC reduces the manual annotation overhead of intent recognition in voice systems while maintaining high performance, and suits continual learning and real-world deployment. Abstract: Voice-controlled dialog systems have become immensely popular due to their ability to perform a wide range of actions in response to diverse user queries. These agents possess a predefined set of skills or intents to fulfill specific user tasks. But every system has its own limitations. There are instances where, even for known intents, if any model exhibits low confidence, it results in rejection of utterances that necessitate manual annotation. Additionally, as time progresses, there may be a need to retrain these agents with new intents from the system-rejected queries to carry out additional tasks. Labeling all these emerging intents and rejected utterances over time is impractical, thus calling for an efficient mechanism to reduce annotation costs. In this paper, we introduce IDALC (Intent Detection and Active Learning based Correction), a semi-supervised framework designed to detect user intents and rectify system-rejected utterances while minimizing the need for human annotation. Empirical findings on various benchmark datasets demonstrate that our system surpasses baseline methods, achieving a 5-10% higher accuracy and a 4-8% improvement in macro-F1. Remarkably, we maintain the overall annotation cost at just 6-10% of the unlabelled data available to the system. The overall framework of IDALC is shown in Fig. 1.
[19] Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs
Renfei Zhang,Manasa Kaniselvan,Niloofar Mireshghallah
Main category: cs.CL
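TL;DR: RL-enhanced language models perform better on knowledge recall tasks, especially those involving hierarchical knowledge structures; the gain is attributed mainly to improved knowledge traversal rather than to newly acquired knowledge.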
Details
Motivation: To challenge the common view that RL training harms a language model's memorized knowledge, and to examine its effect on recall of structured knowledge. Method: Compares RL and SFT models on knowledge recall tasks, combining structured prompting with layer-wise activation analysis to study changes in knowledge representation and retrieval. Result: RL models outperform SFT models on hierarchical knowledge tasks; structured prompting substantially narrows the performance gap; RL mainly changes how knowledge is traversed at query time rather than the knowledge representation itself. Conclusion: RL improves the model's ability to search and navigate existing knowledge rather than introducing new memorized content. Abstract: Reinforcement learning (RL) is often credited with improving language model reasoning and generalization at the expense of degrading memorized knowledge. We challenge this narrative by observing that RL-enhanced models consistently outperform their base and supervised fine-tuned (SFT) counterparts on pure knowledge recall tasks, particularly those requiring traversal of hierarchical, structured knowledge (e.g., medical codes). We hypothesize these gains stem not from newly acquired data, but from improved procedural skills in navigating and searching existing knowledge hierarchies within the model parameters. To support this hypothesis, we show that structured prompting, which explicitly guides SFTed models through hierarchical traversal, recovers most of the performance gap (reducing 24pp to 7pp on MedConceptsQA for DeepSeek-V3/R1). We further find that while prompting improves final-answer accuracy, RL-enhanced models retain superior ability to recall correct procedural paths on deep-retrieval tasks. Finally, our layer-wise internal activation analysis reveals that while factual representations (e.g., activations for the statement "code 57.95 refers to urinary infection") maintain high cosine similarity between SFT and RL models, query representations (e.g., "what is code 57.95") diverge noticeably, indicating that RL primarily transforms how models traverse knowledge rather than the knowledge representation itself.
[20] Interpretable Recognition of Cognitive Distortions in Natural Language Texts
Anton Kolonin,Anna Arinicheva
Main category: cs.CL
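TL;DR: Proposes a new approach to multi-factor classification of natural language texts based on weighted structured patterns (such as N-grams) and the heterarchical relationships between them, applied to automatic detection of cognitive distortions in psychological care; the approach is interpretable, robust, and transparent, and significantly improves F1 scores on public datasets.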
Details
Motivation: To improve the accuracy and interpretability of automatic cognitive distortion detection in psychological care, and to address the weak performance of existing models on multi-factor classification. Method: Uses weighted structured patterns (such as N-grams) and their heterarchical relationships, with new recognition and learning algorithms trained under optimal hyper-parameters. Result: The method is validated on two public datasets, with F1 scores significantly better than those reported in the literature. Conclusion: The proposed method outperforms current mainstream approaches to cognitive distortion detection and is interpretable and practical; code and models are available for future research. Abstract: We propose a new approach to multi-factor classification of natural language texts based on weighted structured patterns such as N-grams, taking into account the heterarchical relationships between them, applied to solve such a socially impactful problem as the automation of detection of specific cognitive distortions in psychological care, relying on an interpretable, robust and transparent artificial intelligence model. The proposed recognition and learning algorithms improve the current state of the art in this field. The improvement is tested on two publicly available datasets, with significant improvements over literature-known F1 scores for the task, with optimal hyper-parameters determined, having code and models available for future use by the community.
[21] Revisiting Entropy in Reinforcement Learning for Large Reasoning Models
Renren Jin,Pengzhi Gao,Yuqi Ren,Zhuowen Han,Tongxuan Zhang,Wuwei Huang,Wei Liu,Jian Luan,Deyi Xiong
Main category: cs.CL
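TL;DR: This paper studies entropy collapse of large language models (LLMs) under reinforcement learning with verifiable rewards (RLVR). It finds that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are the key factors affecting entropy, and proposes regulating model entropy by adjusting the loss weights of tokens with positive and negative advantages.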
Details
Motivation: During RLVR training, LLM entropy tends to collapse, causing premature convergence to suboptimal solutions and limiting performance gains, yet the phenomenon has not been studied systematically. Method: Runs large-scale experiments on the entropy dynamics of LLMs trained with RLVR, examines how entropy relates to response diversity, calibration, and performance, and combines theoretical and empirical analysis to identify the key factors and mechanisms for regulating entropy. Result: The number of off-policy updates, training data diversity, and clipping thresholds significantly affect model entropy; tokens with positive advantages are the main source of entropy collapse; adjusting the loss weights of positive- and negative-advantage tokens controls entropy effectively. Conclusion: The study identifies the main causes of entropy collapse in RLVR and proposes an effective way to regulate entropy, which helps improve LLM reasoning ability and training stability. Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a predominant approach for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, causing premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To address this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR. Moreover, we theoretically and empirically demonstrate that tokens with positive advantages are the primary contributors to entropy collapse, and that model entropy can be effectively regulated by adjusting the relative loss weights of tokens with positive and negative advantages during training.
[22] LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis
Favour Yahdii Aghaebe,Tanefa Apekey,Elizabeth Williams,Nafise Sadat Moosavi
Main category: cs.CL
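TL;DR: The study evaluates how well state-of-the-art language models retain age-related information when summarizing biomedical studies. It finds systematic disparities across age groups, with lower fidelity for adult-focused summaries and for under-represented populations, pointing to the need for fairer evaluation frameworks.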
Details
Motivation: Language models are increasingly used in biomedical evidence synthesis, but it is unclear whether they accurately preserve key demographic information such as age, which is critical for clinical interventions. Method: Builds DemogSummary, an age-stratified dataset covering child, adult, and older adult populations, and evaluates summaries generated by three prominent language models (Qwen, Longformer, and GPT-4.1 Nano) using standard metrics and a newly proposed Demographic Salience Score (DSS) that measures retention and hallucination of age-related information. Result: The models show systematic disparities across age groups: demographic fidelity is lowest for adult-focused summaries, and under-represented populations are more prone to hallucinated information. Conclusion: Current language models are limited in generating faithful and unbiased biomedical summaries, and fairness-aware evaluation frameworks and summarization pipelines are needed. Abstract: Clinical interventions often hinge on age: medications and procedures safe for adults may be harmful to children or ineffective for older adults. However, as language models are increasingly integrated into biomedical evidence synthesis workflows, it remains uncertain whether these systems preserve such crucial demographic distinctions. To address this gap, we evaluate how well state-of-the-art language models retain age-related information when generating abstractive summaries of biomedical studies. We construct DemogSummary, a novel age-stratified dataset of systematic review primary studies, covering child, adult, and older adult populations. We evaluate three prominent summarisation-capable LLMs, Qwen (open-source), Longformer (open-source) and GPT-4.1 Nano (proprietary), using both standard metrics and a newly proposed Demographic Salience Score (DSS), which quantifies age-related entity retention and hallucination. Our results reveal systematic disparities across models and age groups: demographic fidelity is lowest for adult-focused summaries, and under-represented populations are more prone to hallucinations. These findings highlight the limitations of current LLMs in faithful and bias-free summarisation and point to the need for fairness-aware evaluation frameworks and summarisation pipelines in biomedical NLP.
[23] Multi-Reward GRPO Fine-Tuning for De-biasing Large Language Models: A Study Based on Chinese-Context Discrimination Data
Deng Yixuan,Ji Xiaoqiang
Main category: cs.CL
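TL;DR: Proposes a GRPO framework driven by multiple reward signals to reduce culturally specific and multi-dimensional discrimination in large language models. It builds a bias dataset grounded in the Chinese context and trains a multi-dimensional reward model on DeBERTa-v3, lowering bias intensity without harming generation quality.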
Details
Motivation: Existing alignment techniques such as RLHF and DPO are limited in handling culturally specific and multi-dimensional discrimination, and LLMs still show implicit biases that reflect social stereotypes. Method: Builds a synthetic English-language dataset derived from Chinese-context discrimination categories, covering regional, ethnic, and occupational biases; pairs neutral and biased responses to train a DeBERTa-v3-based multi-dimensional reward model (fairness, neutrality, linguistic quality); and uses that reward model to guide GRPO fine-tuning. Result: Experiments show significantly reduced bias intensity and better alignment with non-discriminatory standards, while fluency and informativeness are preserved. Conclusion: GRPO-based multi-reward optimization is effective for de-biasing LLMs, and the framework is replicable for ethical alignment across cultural contexts. Abstract: Large Language Models (LLMs) often exhibit implicit biases and discriminatory tendencies that reflect underlying social stereotypes. While recent alignment techniques such as RLHF and DPO have mitigated some of these issues, they remain limited in addressing culturally specific and multi-dimensional forms of discrimination. This paper proposes a Multi-Reward Group Relative Policy Optimization (GRPO) framework to fine-tune LLMs toward ethical and bias-free behavior. Our approach constructs a synthetic English-language dataset derived from Chinese-context discrimination categories, including regional, ethnic, and occupational biases. Each instance is paired with both neutral and biased responses to train a reward model based on DeBERTa-v3, which provides multi-dimensional reward signals capturing fairness, neutrality, and linguistic quality. The trained reward model then guides GRPO fine-tuning to optimize model outputs along these ethical dimensions. Experimental results demonstrate significant reductions in bias intensity and improved alignment with non-discriminatory standards without compromising fluency or informativeness. This study highlights the effectiveness of GRPO-based multi-reward optimization for de-biasing LLMs and offers a replicable framework for cultural-contextual ethical alignment.
[24] Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
Xinyuan Yan,Shusen Liu,Kowshik Thopalli,Bei Wang
Main category: cs.CL
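TL;DR: Proposes an interactive visualization framework, based on topology and dimensionality reduction, for focused exploration of interpretable features in sparse autoencoders.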
Details
Motivation: Sparse autoencoders extract a very large number of feature directions, and conventional visualization methods suffer from compression artifacts, overplotting, and neighborhood distortions, making comprehensive exploration difficult. Method: Combines topology-based visual encoding with dimensionality reduction in an interactive visualization system that focuses on curated concepts and their corresponding SAE features. Result: The system represents both local and global relationships among selected features more faithfully and supports deeper, more nuanced analysis of concept representation in latent space. Conclusion: By selectively visualizing key features, the framework improves understanding of interpretable features in large models. Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for uncovering interpretable features in large language models (LLMs) through the sparse directions they learn. However, the sheer number of extracted directions makes comprehensive exploration intractable. While conventional embedding techniques such as UMAP can reveal global structure, they suffer from limitations including high-dimensional compression artifacts, overplotting, and misleading neighborhood distortions. In this work, we propose a focused exploration framework that prioritizes curated concepts and their corresponding SAE features over attempts to visualize all available features simultaneously. We present an interactive visualization system that combines topology-based visual encoding with dimensionality reduction to faithfully represent both local and global relationships among selected features. This hybrid approach enables users to investigate SAE behavior through targeted, interpretable subsets, facilitating deeper and more nuanced analysis of concept representation in latent space.
[25] Efficient Hate Speech Detection: A Three-Layer LoRA-Tuned BERTweet Framework
Mahmoud El-Bahnasawi
Main category: cs.CL
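TL;DR: Proposes a three-layer framework that combines rule-based pre-filtering, a parameter-efficient LoRA-tuned BERTweet model, and continuous learning, achieving hate speech detection performance close to that of large models with a much smaller model.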
Details
Motivation: To build a computationally efficient hate speech detection system suitable for real-time deployment, addressing the high compute cost of existing large models and the difficulty of using them in resource-constrained environments. Method: Uses rule-based pre-filtering and a LoRA-tuned BERTweet model, improves performance through dataset unification and optimized fine-tuning, and supports continuous learning. Result: Achieves a macro F1 score of 0.85, which is 94% of the performance of state-of-the-art large models such as SafePhi, with a base model 100x smaller (134M vs 14B parameters), only 1.87M trainable parameters, and about 2 hours of training on a single T4 GPU. Conclusion: The method maintains competitive performance while sharply reducing compute requirements, making hate speech detection more practical to deploy in resource-constrained environments. Abstract: This paper addresses the critical challenge of developing computationally efficient hate speech detection systems that maintain competitive performance while being practical for real-time deployment. We propose a novel three-layer framework that combines rule-based pre-filtering with a parameter-efficient LoRA-tuned BERTweet model and continuous learning capabilities. Our approach achieves 0.85 macro F1 score - representing 94% of the performance of state-of-the-art large language models like SafePhi (Phi-4 based) while using a base model that is 100x smaller (134M vs 14B parameters). Compared to traditional BERT-based approaches with similar computational requirements, our method demonstrates superior performance through strategic dataset unification and optimized fine-tuning. The system requires only 1.87M trainable parameters (1.37% of full fine-tuning) and trains in approximately 2 hours on a single T4 GPU, making robust hate speech detection accessible in resource-constrained environments while maintaining competitive accuracy for real-world deployment.
[26] ReMoD: Rethinking Modality Contribution in Multimodal Stance Detection via Dual Reasoning
Bingbing Wang,Zhengda Jin,Bin Liang,Jing Li,Ruifeng Xu
Main category: cs.CL
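TL;DR: Proposes ReMoD, a framework that uses a dual-reasoning mechanism to dynamically weight the contribution of each modality, improving multimodal stance detection.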
Details
Motivation: Existing methods simply fuse multimodal information and ignore that different modalities contribute differently to stance expression, which introduces misunderstanding noise. Method: Inspired by the dual-process theory of human cognition, ReMoD has two stages, intuitive experience-driven reasoning and reflective reasoning: the intuitive stage forms an initial stance hypothesis from modality and semantic experience pools; the reflective stage uses two reasoning chains, Modality-CoT and Semantic-CoT, to update the experience pools and refine the modality fusion strategy and semantic understanding, respectively. Result: Experiments on the MMSD benchmark show that ReMoD significantly outperforms most baseline models and generalizes well. Conclusion: By dynamically adjusting modality contribution weights, ReMoD reduces noise and improves the accuracy and robustness of multimodal stance detection. Abstract: Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing work simply fuses information from various modalities to learn stance representations, overlooking the varying contributions of stance expression from different modalities. Therefore, stance misunderstanding noises may be drawn into the stance learning process due to the risk of learning errors by rough modality combination. To address this, we get inspiration from the dual-process theory of human cognition and propose ReMoD, a framework that Rethinks Modality contribution of stance expression through a Dual-reasoning paradigm. ReMoD integrates experience-driven intuitive reasoning to capture initial stance cues with deliberate reflective reasoning to adjust for modality biases, refine stance judgments, and thereby dynamically weight modality contributions based on their actual expressive power for the target stance. Specifically, the intuitive stage queries the Modality Experience Pool (MEP) and Semantic Experience Pool (SEP) to form an initial stance hypothesis, prioritizing historically impactful modalities. This hypothesis is then refined in the reflective stage via two reasoning chains: Modality-CoT updates MEP with adaptive fusion strategies to amplify relevant modalities, while Semantic-CoT refines SEP with deeper contextual insights of stance semantics. These dual experience structures are continuously refined during training and recalled at inference to guide robust and context-aware stance decisions. Extensive experiments on the public MMSD benchmark demonstrate that our ReMoD significantly outperforms most baseline models and exhibits strong generalization capabilities.
[27] Automating Hardware Design and Verification from Architectural Papers via a Neural-Symbolic Graph Framework
Haoyue Yang,Xuanle Zhao,Yujie Liu,Zhuojun Zou,Kailin Lyu,Changchun Zhou,Yao Zhu,Jie Hao
Main category: cs.CL
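TL;DR: This paper proposes ArchCraft, a framework that automatically converts hardware architecture descriptions in academic papers into synthesizable Verilog RTL designs with a verifiable test environment, and builds ArchSynthBench, the first benchmark for this task.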
Details
Motivation: Source code for hardware architectures in academic papers is rarely public and hardware description languages are complex, which makes architectures hard to reproduce; an automated framework is needed that generates verifiable, synthesizable hardware designs from unstructured text. Method: ArchCraft uses formal graphs to represent the architectural blueprint and symbols to define the functional specification, generates RTL code and testbenches in a decoupled manner, and is evaluated systematically on the ArchSynthBench benchmark. Result: Experiments on ArchSynthBench show that ArchCraft outperforms direct generation methods and the VerilogCoder framework in paper understanding and code generation; the generated RTL passes synthesis and physical implementation, meets timing constraints, and matches the performance metrics reported in the original papers. Conclusion: ArchCraft automates the conversion from academic descriptions to verifiable, synthesizable hardware designs, advancing the reproducibility of hardware architectures and providing a basis for future AI-driven hardware design. Abstract: The reproduction of hardware architectures from academic papers remains a significant challenge due to the lack of publicly available source code and the complexity of hardware description languages (HDLs). To this end, we propose ArchCraft, a framework that converts abstract architectural descriptions from academic papers into synthesizable Verilog projects with register-transfer level (RTL) verification. ArchCraft introduces a structured workflow, which uses formal graphs to capture the Architectural Blueprint and symbols to define the Functional Specification, translating unstructured academic papers into verifiable, hardware-aware designs. The framework then generates RTL and testbench (TB) code decoupled via these symbols to facilitate verification and debugging, ultimately reporting the circuit's Power, Area, and Performance (PPA). Moreover, we propose the first benchmark, ArchSynthBench, for synthesizing hardware from architectural descriptions, with a complete set of evaluation indicators, 50 project-level circuits, and around 600 circuit blocks. We systematically assess ArchCraft on ArchSynthBench, where the experiment results demonstrate the superiority of our proposed method, surpassing direct generation methods and the VerilogCoder framework in both paper understanding and code completion. Furthermore, evaluation and physical implementation of the generated executable RTL code show that these implementations meet all timing constraints without violations, and their performance metrics are consistent with those reported in the original papers.
[28] Stemming Hallucination in Language Models Using a Licensing Oracle
Simeon Emanuilov,Richard Ackermann
Main category: cs.CL
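TL;DR: This paper proposes the Licensing Oracle, a new architecture that stems language model hallucinations by adding a formal validation step against structured knowledge graphs to the generation process. Experiments show that it substantially outperforms existing methods at ensuring factual accuracy and avoiding false answers.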
Details
Motivation: Language models generate fluent text but often produce factual errors ("hallucinations"), which limits their reliability in critical settings. Existing methods such as fine-tuning and retrieval-augmented generation (RAG) cannot fully eliminate hallucinations, so a solution with strong factual guarantees is needed. Method: The Licensing Oracle embeds a deterministic validation module in the generation process: each generation step is formally validated against a structured knowledge graph, and only statements consistent with the facts are allowed through, constraining generated content to be factual. Result: Compared with baseline generation, fine-tuning, and RAG, the Licensing Oracle achieves perfect abstention precision (AP = 1.0) and a zero false answer rate (FAR-NE = 0.0), with 89.1% accuracy on factual responses. Conclusion: Through architectural innovation, the Licensing Oracle offers a necessary and sufficient solution to hallucination in domains with structured knowledge representations, and opens a path toward trustworthy, epistemically grounded AI systems. Abstract: Language models exhibit remarkable natural language generation capabilities but remain prone to hallucinations, generating factually incorrect information despite producing syntactically coherent responses. This study introduces the Licensing Oracle, an architectural solution designed to stem hallucinations in LMs by enforcing truth constraints through formal validation against structured knowledge graphs. Unlike statistical approaches that rely on data scaling or fine-tuning, the Licensing Oracle embeds a deterministic validation step into the model's generative process, ensuring that only factually accurate claims are made. We evaluated the effectiveness of the Licensing Oracle through experiments comparing it with several state-of-the-art methods, including baseline language model generation, fine-tuning for factual recall, fine-tuning for abstention behavior, and retrieval-augmented generation (RAG). Our results demonstrate that although RAG and fine-tuning improve performance, they fail to eliminate hallucinations. In contrast, the Licensing Oracle achieved perfect abstention precision (AP = 1.0) and zero false answers (FAR-NE = 0.0), ensuring that only valid claims were generated with 89.1% accuracy in factual responses. This work shows that architectural innovations, such as the Licensing Oracle, offer a necessary and sufficient solution for hallucinations in domains with structured knowledge representations, offering guarantees that statistical methods cannot match. Although the Licensing Oracle is specifically designed to address hallucinations in fact-based domains, its framework lays the groundwork for truth-constrained generation in future AI systems, providing a new path toward reliable, epistemically grounded models.
[29] MuonAll: Muon Variant for Efficient Finetuning of Large Language Models
Saurabh Page,Advait Joshi,S. S. Sonawane
Main category: cs.CL
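TL;DR: Proposes the MuonAll optimizer, which brings all parameters under Muon by transforming them into 2D matrices and matches AdamW when fine-tuning large language models.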
Details
Motivation: To explore how Muon performs in fine-tuning, beyond pretraining, and to extend it to optimize all parameters. Method: MuonAll transforms all model parameters into 2D matrices so that the Muon optimizer handles them uniformly. Result: Muon and MuonAll are validated on several public language models and perform on par with AdamW. Conclusion: Muon and MuonAll are viable alternative optimizers for fine-tuning, and their implementations are open source. Abstract: The Muon optimizer has demonstrated robust results in pretraining of language models, but its performance in finetuning of existing public pretrained models is not yet explored. Currently, Muon is used along with AdamW, leaving scope for improvement by bringing all parameters inside Muon. We introduce MuonAll, which incorporates all the parameters inside Muon by transforming them into 2D matrices. We conduct extensive finetuning experiments across publicly available language models with model sizes up to half a billion parameters. Muon and MuonAll perform on par with AdamW across major benchmarks, highlighting their effectiveness as alternative optimizers. We open-source the distributed implementations of Muon and MuonAll, available at https://github.com/Saurabh750/optimizer
[30] Evaluation of retrieval-based QA on QUEST-LOFT
Nathan Scales,Nathanael Schärli,Olivier Bousquet
Main category: cs.CL
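TL;DR: This paper analyzes why RAG performs poorly on the QUEST-LOFT benchmark and shows that a structured output format and answer verification substantially improve it, to the point of outperforming long-context language models.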
Details
Motivation: Existing RAG methods perform poorly on questions whose information is spread across many documents or that require complex reasoning, and long-context language models share similar limitations, so better approaches are needed. Method: Optimizes RAG with a structured output format (containing reasoning and evidence) and optional answer re-verification, and updates the QUEST-LOFT performance numbers based on human evaluation. Result: The optimized RAG method significantly outperforms long-context language models on the QUEST-LOFT benchmark. Conclusion: With structured reasoning and verification, RAG can surpass long-context models on complex question answering. Abstract: Despite the popularity of retrieval-augmented generation (RAG) as a solution for grounded QA in both academia and industry, current RAG methods struggle with questions where the necessary information is distributed across many documents or where retrieval needs to be combined with complex reasoning. Recently, the LOFT study has shown that this limitation also applies to approaches based on long-context language models, with the QUEST benchmark exhibiting particularly large headroom. In this paper, we provide an in-depth analysis of the factors contributing to the poor performance on QUEST-LOFT, publish updated numbers based on a thorough human evaluation, and demonstrate that RAG can be optimized to significantly outperform long-context approaches when combined with a structured output format containing reasoning and evidence, optionally followed by answer re-verification.
[31] Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
Akshar Tumu,Varad Shinde,Parisa Kordjamshidi
Main category: cs.CL
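TL;DR: This paper proposes using the Referring Expression Comprehension task to evaluate the spatial reasoning of vision-language models, revealing their difficulties with object detection ambiguity, complex spatial expressions, and negation.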
Details
Motivation: Existing vision-language models show weaknesses in spatial reasoning, and current studies focus mostly on image captioning and visual question answering, without deeper analysis of spatial comprehension and grounding. The authors therefore use Referring Expression Comprehension as a new evaluation platform. Method: The study uses task-specific architectures and large vision-language models, analyzes their behavior systematically under object detection ambiguity, complex spatial structures, and spatial expressions with negation, and compares results across categories of spatial semantics (topological, directional, proximal, etc.). Result: All models find the task challenging; performance varies with model architecture and spatial semantic category, and drops notably on negation and complex spatial relations. Conclusion: Referring Expression Comprehension offers a finer-grained testbed for the spatial reasoning of vision-language models; the results expose weaknesses in existing models and suggest directions for improvement. Abstract: Spatial reasoning is an important component of human cognition and is an area in which the latest vision-language models (VLMs) show signs of difficulty. Current analyses use image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
[32] BookAsSumQA: An Evaluation Framework for Aspect-Based Book Summarization via Question Answering
Ryuhei Miyazato,Ting-Ruen Wei,Xuyang Wu,Hsin-Tai Wu,Kei Harada
Main category: cs.CL
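TL;DR: Proposes BookAsSumQA, a QA-based evaluation framework for aspect-based book summarization, and finds that RAG-based methods outperform LLM-based methods on long texts.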
Details
Motivation: Because reference summaries are hard to construct for long texts, aspect-based summarization has not yet been explored for books. Method: Automatically generates aspect-specific QA pairs from a narrative knowledge graph and uses question-answering performance to evaluate summary quality. Result: Experiments show that LLM-based methods do better on shorter texts, while RAG-based methods become more effective as document length increases. Conclusion: RAG-based methods are more efficient and practical for aspect-based book summarization, especially on long texts. Abstract: Aspect-based summarization aims to generate summaries that highlight specific aspects of a text, enabling more personalized and targeted summaries. However, its application to books remains unexplored due to the difficulty of constructing reference summaries for long text. To address this challenge, we propose BookAsSumQA, a QA-based evaluation framework for aspect-based book summarization. BookAsSumQA automatically generates aspect-specific QA pairs from a narrative knowledge graph to evaluate summary quality based on its question-answering performance. Our experiments using BookAsSumQA revealed that while LLM-based approaches showed higher accuracy on shorter texts, RAG-based methods become more effective as document length increases, making them more efficient and practical for aspect-based book summarization.
[33] Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning
Sangmook Lee,Dohyung Kim,Hyukhun Koh,Nakyeong Yang,Kyomin Jung
Main category: cs.CL
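TL;DR: This paper proposes STEER, a domain-agnostic model routing framework that uses the small model's confidence scores to decide dynamically during reasoning whether to invoke the large model, achieving competitive or higher accuracy at lower compute cost on several complex tasks.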
Details
Motivation: Existing model routing methods depend on trained routers or expensive data synthesis and are not robust under domain shift, so an efficient routing mechanism is needed that requires no extra training and works across domains. Method: STEER uses the confidence derived from the small model's logits before each reasoning step is generated to decide whether to invoke the large model, giving fine-grained, step-level routing without an external router or labeled data. Result: On challenging benchmarks spanning mathematical reasoning, multi-hop QA, and planning, STEER gains up to 20% accuracy with 48% fewer FLOPs compared with using only the large model, and outperforms baselines that rely on external modules. Conclusion: Model-internal confidence is a robust, domain-agnostic routing signal, and STEER offers a scalable path to efficient LLM deployment. Abstract: Recent advances in Large Language Models (LLMs) - particularly model scaling and test-time techniques - have greatly enhanced the reasoning capabilities of language models at the expense of higher inference costs. To lower inference costs, prior works train router models or deferral mechanisms that allocate easy queries to a small, efficient model, while forwarding harder queries to larger, more expensive models. However, these trained router models often lack robustness under domain shifts and require expensive data synthesis techniques such as Monte Carlo rollouts to obtain sufficient ground-truth routing labels for training. In this work, we propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning (STEER), a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger LLMs without utilizing external models. STEER leverages confidence scores from the smaller model's logits prior to generating a reasoning step, so that the large model is invoked only when necessary. Extensive evaluations using different LLMs on a diverse set of challenging benchmarks across multiple domains such as Mathematical Reasoning, Multi-Hop QA, and Planning tasks indicate that STEER achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% fewer FLOPs compared to solely using the larger model on AIME), outperforming baselines that rely on trained external modules. Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing, offering a scalable pathway for efficient LLM deployment.
[34] Explicit Knowledge-Guided In-Context Learning for Early Detection of Alzheimer's Disease
Puzhen Su,Yongzhu Miao,Chunxi Guo,Jintao Tang,Shasha Li,Ting Wang
Main category: cs.CL
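TL;DR: Proposes EK-ICL, a new framework that integrates structured explicit knowledge to strengthen in-context learning of large language models for Alzheimer's Disease (AD) detection, significantly outperforming existing methods.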
Details
Motivation: Existing in-context learning methods in clinical domains such as AD detection suffer from task recognition failure, poor demonstration selection, and misaligned label semantics, and perform even worse under data-scarce and out-of-distribution conditions. Method: EK-ICL combines three knowledge components: confidence scores from small language models, parsing feature scores to improve demonstration selection, and label word replacement to align semantics; it also uses a parsing-based retrieval strategy and ensemble prediction. Result: Experiments on three AD datasets show that EK-ICL significantly outperforms state-of-the-art fine-tuning and in-context learning baselines, particularly under low-resource and out-of-distribution conditions. Conclusion: Introducing explicit knowledge improves the stability and task alignment of in-context learning on clinical text, which matters for medical reasoning under low-resource conditions. Abstract: Detecting Alzheimer's Disease (AD) from narrative transcripts remains a challenging task for large language models (LLMs), particularly under out-of-distribution (OOD) and data-scarce conditions. While in-context learning (ICL) provides a parameter-efficient alternative to fine-tuning, existing ICL approaches often suffer from task recognition failure, suboptimal demonstration selection, and misalignment between label words and task objectives, issues that are amplified in clinical domains like AD detection. We propose Explicit Knowledge In-Context Learners (EK-ICL), a novel framework that integrates structured explicit knowledge to enhance reasoning stability and task alignment in ICL. EK-ICL incorporates three knowledge components: confidence scores derived from small language models (SLMs) to ground predictions in task-relevant patterns, parsing feature scores to capture structural differences and improve demo selection, and label word replacement to resolve semantic misalignment with LLM priors. In addition, EK-ICL employs a parsing-based retrieval strategy and ensemble prediction to mitigate the effects of semantic homogeneity in AD transcripts. Extensive experiments across three AD datasets demonstrate that EK-ICL significantly outperforms state-of-the-art fine-tuning and ICL baselines. Further analysis reveals that ICL performance in AD detection is highly sensitive to the alignment of label semantics and task-specific context, underscoring the importance of explicit knowledge in clinical reasoning under low-resource conditions.
[35] SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization
Yue Huang,Xiangqi Wang,Xiangliang Zhang
Main category: cs.CL
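TL;DR: Proposes priority alignment, a new alignment paradigm that puts "trustworthy before helpful" in high-stakes scenarios, together with Self-Priority Alignment (SPA), a fully unsupervised framework that builds lexicographically ordered preference pairs through self-generation, self-evaluation, and dual-criterion denoising, then fine-tunes with an uncertainty-weighted loss to improve helpfulness while preserving safety.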
Details
Motivation: In high-stakes scenarios, large language models must be both trustworthy and helpful, but the two goals often conflict and existing methods struggle to satisfy both, so a new alignment mechanism is needed that guarantees trustworthiness first. Method: Proposes the Self-Priority Alignment (SPA) framework: the model generates diverse responses, evaluates and refines them itself, and applies dual-criterion denoising to remove inconsistency; it then constructs lexicographically ordered preference pairs and fine-tunes with an uncertainty-weighted alignment loss that prioritizes high-confidence, high-gap samples. Result: Across multiple benchmarks, SPA significantly improves helpfulness without sacrificing safety, outperforming strong baselines while preserving general capabilities. Conclusion: SPA offers a scalable and interpretable alignment strategy for LLMs in critical applications and effectively achieves the "trustworthy first" goal. Abstract: In high-stakes scenarios (such as self-harm, legal, or medical queries), LLMs must be both trustworthy and helpful. However, these goals often conflict. We propose priority alignment, a new alignment paradigm that enforces a strict "trustworthy-before-helpful" ordering: optimization of helpfulness is conditioned on first meeting trustworthy thresholds (e.g., harmlessness or honesty). To realize this, we introduce Self-Priority Alignment (SPA), a fully unsupervised framework in which the model generates diverse responses, self-evaluates and refines them, and applies dual-criterion denoising to remove inconsistency and control variance. From this, SPA constructs lexicographically ordered preference pairs and fine-tunes the model using an uncertainty-weighted alignment loss that emphasizes high-confidence, high-gap decisions. Experiments across multiple benchmarks show that SPA improves helpfulness without compromising safety, outperforming strong baselines while preserving general capabilities. Our results demonstrate that SPA provides a scalable and interpretable alignment strategy for critical LLM applications.
[36] Overview of CHIP 2025 Shared Task 2: Discharge Medication Recommendation for Metabolic Diseases Based on Chinese Electronic Health Records
Juntao Li,Haobin Yuan,Ling Luo,Tengxiao Lv,Yan Jiang,Fan Wang,Ping Zhang,Huiyi Lv,Jian Wang,Yuanyuan Sun,Hongfei Lin
Main category: cs.CL
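TL;DR: Presents the CHIP 2025 Shared Task 2 competition, which aims to automatically recommend discharge medications from real-world Chinese electronic health record (EHR) data. The organizers built CDrugRed, a high-quality dataset of 5,894 hospitalization records, and the task attracted a large number of teams. The best-performing team used a large language model (LLM)-based ensemble system and achieved a Jaccard score of 0.5102 and an F1 score of 0.6267 on the test set, showing both the potential and the challenges of LLMs for medication recommendation in Chinese EHRs.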
Details
Motivation: Discharge medication recommendation is critical for treatment continuity, readmission prevention, and long-term management of patients with chronic metabolic diseases. Existing methods face multi-label prediction, heterogeneous clinical text, and patient-specific treatment variability when processing Chinese EHR data, so advanced automated recommendation techniques are urgently needed. Method: The authors organized the CHIP 2025 Shared Task 2 competition and built CDrugRed, a high-quality Chinese discharge medication dataset of 5,894 de-identified hospitalization records. Participating teams submitted machine learning methods, in particular LLM-based ones, for multi-label medication recommendation, evaluated by Jaccard and F1 scores. Result: A total of 526 teams registered, of which 167 and 95 submitted valid results in Phase A and Phase B, respectively. The best team reached a Jaccard score of 0.5102 and an F1 score of 0.6267 on the final test set, indicating the strong performance of LLM-based ensemble systems. Conclusion: The competition advanced automatic medication recommendation from Chinese EHRs, confirmed the potential of large language models on this task, and exposed the challenges that remain for real clinical use, pointing to directions for future research. Abstract: Discharge medication recommendation plays a critical role in ensuring treatment continuity, preventing readmission, and improving long-term management for patients with chronic metabolic diseases. This paper presents an overview of the CHIP 2025 Shared Task 2 competition, which aimed to develop state-of-the-art approaches for automatically recommending appropriate discharge medications using real-world Chinese EHR data. For this task, we constructed CDrugRed, a high-quality dataset consisting of 5,894 de-identified hospitalization records from 3,190 patients in China. This task is challenging due to the multi-label nature of medication recommendation, heterogeneous clinical text, and patient-specific variability in treatment plans. A total of 526 teams registered, with 167 and 95 teams submitting valid results to the Phase A and Phase B leaderboards, respectively. The top-performing team achieved the highest overall performance on the final test set, with a Jaccard score of 0.5102 and an F1 score of 0.6267, demonstrating the potential of advanced large language model (LLM)-based ensemble systems. These results highlight both the promise and remaining challenges of applying LLMs to medication recommendation in Chinese EHRs. The post-evaluation phase remains open at https://tianchi.aliyun.com/competition/entrance/532411/.
[37] Analyzing and Mitigating Negation Artifacts using Data Augmentation for Improving ELECTRA-Small Model Accuracy
Mojtaba Noghabaei
Main category: cs.CL
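TL;DR: This project studies how an ELECTRA-small model fine-tuned on the SNLI dataset handles negation and finds that it performs poorly on examples containing negation. Augmenting the training data with contrast sets and adversarial examples that emphasize negation improves accuracy on those examples without hurting overall performance.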
Details
Motivation: Pre-trained models often rely on spurious correlations in datasets rather than genuinely understanding linguistic phenomena such as negation, which inflates benchmark performance; the model's handling of negation therefore needs to be examined and improved. Method: An ELECTRA-small model is fine-tuned on the SNLI dataset, and the training data is augmented with contrast sets and adversarial examples that emphasize negation. Result: Classification accuracy on negation-containing examples improves while overall performance stays stable, mitigating the problem caused by spurious dataset correlations. Conclusion: Data augmentation targeted at a specific linguistic phenomenon (such as negation) can improve model robustness, reduce reliance on dataset artifacts, and strengthen genuine language understanding. Abstract: Pre-trained models for natural language inference (NLI) often achieve high performance on benchmark datasets by using spurious correlations, or dataset artifacts, rather than understanding linguistic phenomena such as negation. In this project, we investigate the performance of an ELECTRA-small model fine-tuned on the Stanford Natural Language Inference (SNLI) dataset, focusing on its handling of negation. Through analysis, we identify that the model struggles with correctly classifying examples containing negation. To address this, we augment the training data with contrast sets and adversarial examples emphasizing negation. Our results demonstrate that this targeted data augmentation improves the model's accuracy on negation-containing examples without adversely affecting overall performance, thereby mitigating the identified dataset artifact.
[38] TimeSense: Making Large Language Models Proficient in Time-Series Analysis
Zhirui Zhang,Changhua Pei,Tianyi Gao,Zhe Xie,Yibo Hao,Zhaoyang Yu,Longlong Xu,Tong Xiao,Jing Han,Dan Pei
Main category: cs.CL
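TL;DR: Proposes TimeSense, a multimodal framework that improves LLM performance on time-series analysis by balancing textual reasoning with temporal sense, and builds the EvalTS benchmark to evaluate models on complex time-series tasks.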
Details
Motivation: Existing methods rely on text labels for supervised training, which biases models toward textual cues and away from the full temporal features of the series, and can produce outputs that contradict the actual time-series context. Method: Proposes the TimeSense framework, which includes a Temporal Sense module that reconstructs the input time series within the model's context so that textual reasoning is grounded in the real temporal dynamics; it also introduces coordinate-based positional embeddings to strengthen spatial understanding of time-series data. Result: Experiments show that TimeSense reaches state-of-the-art performance on multiple tasks and clearly outperforms existing methods on complex multi-dimensional time-series reasoning. Conclusion: TimeSense effectively combines textual reasoning with temporal sense, improving the accuracy and robustness of large language models on diverse, realistic time-series understanding tasks. Abstract: In the time-series domain, an increasing number of works combine text with temporal data to leverage the reasoning capabilities of large language models (LLMs) for various downstream time-series understanding tasks. This enables a single model to flexibly perform tasks that previously required specialized models for each domain. However, these methods typically rely on text labels for supervision during training, biasing the model toward textual cues while potentially neglecting the full temporal features. Such a bias can lead to outputs that contradict the underlying time-series context. To address this issue, we construct the EvalTS benchmark, comprising 10 tasks across three difficulty levels, from fundamental temporal pattern recognition to complex real-world reasoning, to evaluate models under more challenging and realistic scenarios. We also propose TimeSense, a multimodal framework that makes LLMs proficient in time-series analysis by balancing textual reasoning with a preserved temporal sense. TimeSense incorporates a Temporal Sense module that reconstructs the input time-series within the model's context, ensuring that textual reasoning is grounded in the time-series dynamics. Moreover, to enhance spatial understanding of time-series data, we explicitly incorporate coordinate-based positional embeddings, which provide each time point with spatial context and enable the model to capture structural dependencies more effectively.
Experimental results demonstrate that TimeSense achieves state-of-the-art performance across multiple tasks, and it particularly outperforms existing methods on complex multi-dimensional time-series reasoning tasks.
[39] HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection
Irina Proskurina,Marc-Antoine Carpentier,Julien Velcin
Main category: cs.CL
TL;DR: Studies explicit and implicit hate in hate speech detection, proposes HatePrototypes for cross-task transfer, and validates a fine-tuning-free early-exit mechanism.
Details
Motivation: Existing hate speech datasets mainly cover explicit hate toward protected groups and overlook implicit or indirect hate (such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language), which is equally harmful and requires deeper semantic understanding. Method: Class-level vector representations (HatePrototypes) are derived from language models optimized for hate detection, with prototypes built from only 50 examples per class; the study examines their cross-task transfer across benchmarks and a parameter-free early-exit mechanism. Result: HatePrototypes enable cross-task transfer between explicit and implicit hate and can be used interchangeably across benchmarks; prototype-based, parameter-free early exiting is effective for both types of hate. Conclusion: Without repeated fine-tuning, HatePrototypes offer an efficient and transferable approach to hate speech detection that helps improve the recognition of implicit hate. Abstract: Optimization of offensive content moderation models for different types of hateful messages is typically achieved through continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm. While explicit hate can often be captured through surface features, implicit hate requires deeper, full-model semantic processing. In this work, we question the need for repeated fine-tuning and analyze the role of HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. We find that these prototypes, built from as few as 50 examples per class, enable cross-task transfer between explicit and implicit hate, with interchangeable prototypes across benchmarks. Moreover, we show that parameter-free early exiting with prototypes is effective for both hate types. We release the code, prototype resources, and evaluation scripts to support future research on efficient and transferable hate speech detection.
[40] SugarTextNet: A Transformer-Based Framework for Detecting Sugar Dating-Related Content on Social Media with Context-Aware Focal Loss
Lionel Z. Wang,Shihan Ben,Yulu Huang,Simeng Qing
Main category: cs.CL
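TL;DR: Proposes SugarTextNet, a transformer-based framework for detecting sugar dating-related content on social media, paired with a context-aware focal loss to handle class imbalance, and validates its superior performance on a real-world Chinese dataset.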
Details
Motivation: Detecting the growing volume of sugar dating-related content on social media is highly challenging because of euphemistic language, ambiguous linguistic cues, and severe class imbalance in real-world data. Method: Proposes the SugarTextNet framework, which combines a pretrained transformer encoder, an attention-based cue extractor, and a contextual phrase encoder, and introduces a context-aware focal loss to improve detection of the minority class. Result: Evaluated on 3,067 manually annotated Chinese social media posts from Sina Weibo, the method substantially outperforms traditional machine learning models, deep learning baselines, and large language models across multiple metrics; ablation studies confirm that each component is necessary. Conclusion: The study underlines the need for domain-specific, context-aware modeling in sensitive content detection and provides an effective solution for content moderation in complex real-world scenarios. Abstract: Sugar dating-related content has rapidly proliferated on mainstream social media platforms, giving rise to serious societal and regulatory concerns, including commercialization of intimate relationships and the normalization of transactional relationships. Detecting such content is highly challenging due to the prevalence of subtle euphemisms, ambiguous linguistic cues, and extreme class imbalance in real-world data. In this work, we present SugarTextNet, a novel transformer-based framework specifically designed to identify sugar dating-related posts on social media. SugarTextNet integrates a pretrained transformer encoder, an attention-based cue extractor, and a contextual phrase encoder to capture both salient and nuanced features in user-generated text. To address class imbalance and enhance minority-class detection, we introduce Context-Aware Focal Loss, a tailored loss function that combines focal loss scaling with contextual weighting. We evaluate SugarTextNet on a newly curated, manually annotated dataset of 3,067 Chinese social media posts from Sina Weibo, demonstrating that our approach substantially outperforms traditional machine learning models, deep learning baselines, and large language models across multiple metrics. Comprehensive ablation studies confirm the indispensable role of each component. Our findings highlight the importance of domain-specific, context-aware modeling for sensitive content detection, and provide a robust solution for content moderation in complex, real-world scenarios.
[41] How Well Do LLMs Understand Drug Mechanisms? A Knowledge + Reasoning Evaluation Dataset
Sunil Mohan,Theofanis Karaletsos
Main category: cs.CL
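TL;DR: Introduces a new dataset for evaluating large language models on knowledge of drug mechanisms and on reasoning about them, particularly under counterfactual conditions. The study shows that o4-mini outperforms the other OpenAI models tested, while Qwen3-4B-thinking closely matches o4-mini and even surpasses it in some cases. It also finds that open-world reasoning tasks are harder than closed-world ones, and that counterfactuals affecting internal links of the reasoning chain are harder to handle.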
Details
Motivation: To assess the practical potential of large language models in drug development and personalized medicine, one must test both their factual knowledge of drug mechanisms of action and their ability to reason in novel situations, especially counterfactual ones not seen during training. Method: The authors build a dataset of known drug mechanisms of action and counterfactual variants of them, evaluate several large language models in open-world and closed-world settings, and compare their accuracy on factual recall and causal reasoning tasks. Result: o4-mini performs best overall, with Qwen3-4B-thinking close behind and ahead of o4-mini in some cases; open-world tasks are harder than closed-world ones, and counterfactuals that perturb internal links of the reasoning chain are harder than those that perturb the link from the starting drug. Conclusion: Large language models show promise for reasoning about drug mechanisms but still struggle with counterfactual reasoning that requires recalling knowledge unaided and handling complex causal structure; future models need stronger knowledge integration and deeper reasoning in the open-world setting. Abstract: Two scientific fields showing increasing interest in pre-trained large language models (LLMs) are drug development / repurposing, and personalized medicine. For both, LLMs have to demonstrate factual knowledge as well as a deep understanding of drug mechanisms, so they can recall and reason about relevant knowledge in novel situations. Drug mechanisms of action are described as a series of interactions between biomedical entities, which interlink into one or more chains directed from the drug to the targeted disease. Composing the effects of the interactions in a candidate chain leads to an inference about whether the drug might be useful or not for that disease. We introduce a dataset that evaluates LLMs on both factual knowledge of known mechanisms, and their ability to reason about them under novel situations, presented as counterfactuals that the models are unlikely to have seen during training. Using this dataset, we show that o4-mini outperforms the 4o, o3, and o3-mini models from OpenAI, and the recent small Qwen3-4B-thinking model closely matches o4-mini's performance, even outperforming it in some cases. We demonstrate that the open world setting for reasoning tasks, which requires the model to recall relevant knowledge, is more challenging than the closed world setting where the needed factual knowledge is provided.
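The abstract above describes a mechanism of action as a chain of interactions whose effects are composed into one inference. A minimal sketch of that idea follows, under a strong simplifying assumption that is not taken from the paper: each interaction is reduced to a sign (+1 for "increases", -1 for "decreases"), and the net effect is the product of the signs. The entity names, the `net_effect` and `flip_link` helpers, and the sign encoding are all hypothetical.

```python
from math import prod

# Hypothetical toy encoding (an assumption, not the dataset's format):
# each link is (source, sign, target), where sign = +1 means the source
# increases the target and sign = -1 means it decreases the target.
Chain = list[tuple[str, int, str]]


def net_effect(chain: Chain) -> int:
    """Compose the signed interactions along a chain into one net sign."""
    return prod(sign for _, sign, _ in chain)


def flip_link(chain: Chain, index: int) -> Chain:
    """Build a counterfactual chain in which one link has the opposite sign."""
    source, sign, target = chain[index]
    return chain[:index] + [(source, -sign, target)] + chain[index + 1:]


# Made-up chain: DrugX decreases EnzymeA, EnzymeA increases PathwayB,
# and PathwayB increases DiseaseY.
factual = [
    ("DrugX", -1, "EnzymeA"),
    ("EnzymeA", +1, "PathwayB"),
    ("PathwayB", +1, "DiseaseY"),
]
factual_effect = net_effect(factual)  # -1: the drug decreases the disease

# Counterfactual on an internal link: suppose EnzymeA decreased PathwayB.
counterfactual = flip_link(factual, 1)
counterfactual_effect = net_effect(counterfactual)  # +1: the inference flips
```

In this toy encoding a net sign of -1 would suggest the drug may be useful and +1 that it may not. Real mechanisms carry far more structure than a sign, so this only illustrates why changing a single link can reverse the conclusion.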
We also show that counterfactuals affecting internal links in the reasoning chain present a much harder task than those affecting a link from the drug mentioned in the prompt.
[42] Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop
Lifeng Han,David Lindevelt,Sander Puts,Erik van Mulligen,Suzan Verberne
Main category: cs.CL
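TL;DR: Focuses on metaphorical language used by Dutch-speaking cancer patients; using interview and online forum data, it combines large language models with human verification to build the HealthQuote.NL corpus, with the aim of improving clinician-patient communication and patient health literacy.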
Details
Motivation: Metaphors play an important role in healthcare communication; identifying the metaphors patients use helps clinicians understand their psychological state and perception of the disease, and so improves the quality of care. Method: Two data sources are used: storytelling interviews with cancer patients and online forum content. Large language models extract metaphors under prompting strategies such as chain of thought, few-shot learning, and self-prompting; the results are verified with a human in the loop and compiled into the HealthQuote.NL corpus. Result: Metaphorical expressions were extracted from Dutch cancer patient data and verified, yielding the high-quality HealthQuote.NL corpus; the prompt templates and related resources are released. Conclusion: The extracted metaphors can support shared decision making between clinicians and patients, improve communication, raise patient health literacy, and inform the design of personalized care pathways. Abstract: Metaphors and metaphorical language (MLs) play an important role in healthcare communication between clinicians, patients, and patients' family members. In this work, we focus on Dutch language data from cancer patients. We extract metaphors used by patients using two data sources: (1) cancer patient storytelling interview data and (2) online forum data, including patients' posts, comments, and questions to professionals. We investigate how current state-of-the-art large language models (LLMs) perform on this task by exploring different prompting strategies such as chain of thought reasoning, few-shot learning, and self-prompting. With a human-in-the-loop setup, we verify the extracted metaphors and compile the outputs into a corpus named HealthQuote.NL. We believe the extracted metaphors can support better patient care, for example shared decision making, improved communication between patients and clinicians, and enhanced patient health literacy. They can also inform the design of personalized care pathways. We share prompts and related resources at https://github.com/aaronlifenghan/HealthQuote.NL
[43] Towards Resource-Efficient Multimodal Intelligence: Learned Routing among Specialized Expert Models
Mayank Saini,Arit Kumar Bishwas
Main category: cs.CL
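TL;DR: Proposes a unified, modular framework in which a learned routing network assigns each query to the most suitable expert model, substantially lowering cost while maintaining performance.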
Details
Motivation: Large language models are powerful but expensive at inference time, while small open-source models are cheap but weak on complex or multimodal tasks; a solution that balances cost and performance is needed. Method: A learned routing network assigns textual, multimodal, or complex queries to the most suitable expert model; vision tasks use a two-stage open-source pipeline that reuses efficient classical vision components. Result: On benchmarks such as MMLU and VQA, the framework matches or exceeds the performance of a single premium LLM while reducing reliance on costly models by over 67%. Conclusion: Through multi-model cooperation and efficient routing, the framework delivers scalable, resource-efficient AI deployment without sacrificing quality. Abstract: As AI moves beyond text, large language models (LLMs) increasingly power vision, audio, and document understanding; however, their high inference costs hinder real-time, scalable deployment. Conversely, smaller open-source models offer cost advantages but struggle with complex or multimodal queries. We introduce a unified, modular framework that intelligently routes each query (textual, multimodal, or complex) to the most fitting expert model, using a learned routing network that balances cost and quality. For vision tasks, we employ a two-stage open-source pipeline that is optimized for efficiency and revives efficient classical vision components where they remain SOTA for sub-tasks. On benchmarks such as Massive Multitask Language Understanding (MMLU) and Visual Question Answering (VQA), we match or exceed the performance of always-premium LLM baselines (monolithic systems with one model serving all query types), yet reduce the reliance on costly models by over 67%. With its extensible, multi-agent orchestration, we deliver high-quality, resource-efficient AI at scale.
[44] SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention
Bohan Yu,Wei Huang,Kang Liu
Main category: cs.CL
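TL;DR: Proposes SR-KI, a new method for integrating large-scale structured knowledge bases (KBs) into large language models (LLMs) in real time: knowledge encoded as key-value pairs is injected into the model's KV cache, and a two-stage training scheme enables efficient, compressible knowledge retrieval in latent space with support for dynamic updates.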
Details
Motivation: Conventional retrieval-augmented generation depends on external retrievers and complex multi-stage pipelines and is hard to update in real time, which limits the efficiency and flexibility of knowledge-augmented language models; an end-to-end method is needed that retrieves knowledge inside the model and supports efficient compression and updates. Method: A pretrained encoder first encodes the knowledge base into key-value pairs that are injected into the LLM's KV cache. Training then proceeds in two stages: the first locates the layer of the LLM best suited for retrieval, and the second applies an attention-based loss at that layer to supervise the model's attention toward relevant knowledge entries. Result: SR-KI integrates up to 40K KB entries into a 7B LLM on a single A100 40GB GPU, with Recall@10 above 88% on average across tasks and above 98% on the best task; it performs well on question answering and KB ID generation while compressing the injected knowledge by up to 99.75%. Conclusion: SR-KI performs knowledge retrieval and fusion end to end inside the LLM without an external retrieval module, improving the efficiency, compressibility, and updatability of knowledge-augmented inference and offering a practical route to knowledge-intensive applications that can be kept up to date. Abstract: This paper proposes SR-KI, a novel approach for integrating real-time and large-scale structured knowledge bases (KBs) into large language models (LLMs). SR-KI begins by encoding KBs into key-value pairs using a pretrained encoder, and injects them into LLMs' KV cache. Building on this representation, we employ a two-stage training paradigm: first locating a dedicated retrieval layer within the LLM, and then applying an attention-based loss at this layer to explicitly supervise attention toward relevant KB entries. Unlike traditional retrieval-augmented generation methods that rely heavily on the performance of external retrievers and multi-stage pipelines, SR-KI supports end-to-end inference by performing retrieval entirely within the model's latent space. This design enables efficient compression of injected knowledge and facilitates dynamic knowledge updates. Comprehensive experiments demonstrate that SR-KI enables the integration of up to 40K KBs into a 7B LLM on a single A100 40GB GPU, and achieves strong retrieval performance, maintaining over 98% Recall@10 on the best-performing task and exceeding 88% on average across all tasks. Task performance on question answering and KB ID generation also demonstrates that SR-KI maintains strong performance while achieving up to 99.75% compression of the injected KBs.
[45] Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages
Quang Phuoc Nguyen,David Anugraha,Felix Gaschi,Jun Bin Cheng,En-Shiun Annie Lee
Main category: cs.CL
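TL;DR: Shows that realignment on a carefully selected subset of languages can yield cross-lingual transfer in multilingual language models that matches or even exceeds realignment on all languages, which particularly benefits low-resource languages.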
Details
Motivation: Existing cross-lingual realignment methods depend on many languages and high-quality parallel data, yet their results are unstable for typologically distant or low-resource languages, where data is also scarce; it is therefore worth asking whether a selected subset of languages can improve results and reduce data requirements. Method: An extensive empirical study compares realignment with all languages against realignment with strategically selected subsets, using controlled experiments to assess the effect on cross-lingual transfer, especially for low-resource languages. Result: For low-resource languages, realignment on a carefully selected subset matches realignment on all languages and even outperforms it on unseen low-resource languages. Conclusion: Effective realignment does not need to cover every language; a subset chosen on principles such as linguistic diversity remains efficient and robust while lowering data collection costs. Abstract: Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for typologically distant or low-resource languages (LRLs) compared to English. Moreover, word realignment tools often rely on high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or if strategically selected subsets can offer comparable or even improved cross-lingual transfer, and study the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that using carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it for unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data collection overhead, while remaining both efficient and robust when guided by informed language selection.
[46] You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
Amit LeVi,Raz Lapid,Rom Himelstein,Yaniv Nemcovsky,Ravid Shwartz Ziv,Avi Mendelson
Main category: cs.CL
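TL;DR: Proposes two task-aware post-training quantization methods, TAQ and TAQO, that use task-relevant information in hidden layers to guide quantization, substantially improving efficiency while maintaining high accuracy.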
Details
Motivation: Large language models are capable but resource-hungry; most existing quantization methods are task-agnostic and do not exploit how task-specific signals are distributed across layers. Method: Proposes two methods, TAQ and TAQO: TAQ allocates bitwidths from task-conditioned activation statistics, while TAQO allocates precision by directly testing each layer's sensitivity to the task. Both use a small calibration set to identify critical layers and preserve their precision. Result: On Phi-4, Llama-3.1, Qwen3, and Qwen2.5, TAQ and TAQO outperform the baselines; for example, TAQ reaches 42.33 EM / 50.81 F1 on Phi-4, far above AWQ, with an accuracy loss below 1.0%. Conclusion: Task-aware quantization identifies and protects critical layers, yielding efficient, specialized model compression with strong results across several models. Abstract: Large Language Models (LLMs) excel across diverse tasks, yet many applications require only limited capabilities, making large variants inefficient in memory and latency. Existing approaches often combine distillation and quantization, but most post-training quantization (PTQ) methods are task-agnostic, ignoring how task-specific signals are distributed across layers. In this work, we propose to use hidden representations that encode task-salient signals as a guideline for quantization. To exploit this idea, we compare two new task-aware PTQ methods: Task-Aware Quantization (TAQ), which allocates bitwidths using task-conditioned statistics from hidden activations, and TAQO, which allocates precision based on direct layer sensitivity tests. From a small calibration set, these approaches identify task-relevant layers, preserving their precision while aggressively quantizing the rest. This yields stable task sensitivity profiles and efficient task-specialized models. Across models, TAQ and TAQO outperform the baselines; TAQ leads on Phi-4, while TAQO leads on Llama-3.1, Qwen3, and Qwen2.5. For instance, on Phi-4 TAQ achieves 42.33 EM / 50.81 F1, far surpassing Activation-aware Weight Quantization (AWQ) (2.25 / 7.07), while remaining within < 1.0% of the original accuracy at lower average precision.
[47] Better Datasets Start From RefineLab: Automatic Optimization for High-Quality Dataset Refinement
Xiaonan Luo,Yue Huang,Ping He,Xiangliang Zhang
Main category: cs.CL
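TL;DR: RefineLab is the first LLM-driven framework that automatically refines the quality of QA datasets under a controllable token budget, improving domain coverage, difficulty balance, and factual accuracy.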
Details
Motivation: Existing high-quality QA datasets have incomplete domain coverage, poorly balanced difficulty distributions, and factual errors, and datasets built with generative models have made these quality problems worse. Method: Proposes the RefineLab framework: given target quality attributes and a token budget, it refines QA samples through selective edit operations (such as rephrasing and distractor replacement), with an assignment module choosing the strategy that maximizes dataset quality under the resource limit. Result: Experiments show that RefineLab markedly narrows the gap to expert datasets in coverage, difficulty alignment, factual fidelity, and distractor quality. Conclusion: RefineLab offers a scalable, customizable path to dataset refinement, which matters for reliable evaluation of large language models. Abstract: High-quality Question-Answer (QA) datasets are foundational for reliable Large Language Model (LLM) evaluation, yet even expert-crafted datasets exhibit persistent gaps in domain coverage, misaligned difficulty distributions, and factual inconsistencies. The recent surge in generative model-powered datasets has compounded these quality challenges. In this work, we introduce RefineLab, the first LLM-driven framework that automatically refines raw QA textual data into high-quality datasets under a controllable token-budget constraint. RefineLab takes a set of target quality attributes (such as coverage and difficulty balance) as refinement objectives, and performs selective edits within a predefined token budget to ensure practicality and efficiency. In essence, RefineLab addresses a constrained optimization problem: improving the quality of QA samples as much as possible while respecting resource limitations. With a set of available refinement operations (e.g., rephrasing, distractor replacement), RefineLab takes as input the original dataset, a specified set of target quality dimensions, and a token budget, and determines which refinement operations should be applied to each QA sample. This process is guided by an assignment module that selects optimal refinement strategies to maximize overall dataset quality while adhering to the budget constraint. Experiments demonstrate that RefineLab consistently narrows divergence from expert datasets across coverage, difficulty alignment, factual fidelity, and distractor quality.
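The budget-constrained assignment described above can be pictured with a small sketch. This is not RefineLab's assignment module, whose algorithm the abstract does not specify; it swaps in a simple greedy heuristic (highest estimated quality gain per token first) purely for illustration. The `Candidate` type, the `assign_refinements` function, and all costs and gains are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical (sample, operation) option with made-up estimates."""
    sample_id: str
    operation: str
    token_cost: int      # tokens the edit is assumed to consume (must be > 0)
    quality_gain: float  # assumed improvement in dataset quality


def assign_refinements(candidates, token_budget):
    """Greedy stand-in for an assignment module: pick at most one operation
    per sample, in order of gain per token, while the budget allows."""
    ranked = sorted(candidates,
                    key=lambda c: c.quality_gain / c.token_cost,
                    reverse=True)
    chosen = {}
    remaining = token_budget
    for cand in ranked:
        if cand.sample_id in chosen:
            continue  # this sample already has an operation assigned
        if cand.token_cost <= remaining:
            chosen[cand.sample_id] = cand
            remaining -= cand.token_cost
    return chosen, remaining


candidates = [
    Candidate("q1", "rephrasing", 120, 0.30),
    Candidate("q1", "distractor_replacement", 200, 0.40),
    Candidate("q2", "rephrasing", 100, 0.10),
    Candidate("q3", "distractor_replacement", 150, 0.45),
]
chosen, remaining = assign_refinements(candidates, token_budget=300)
for sample_id, cand in sorted(chosen.items()):
    print(sample_id, cand.operation, cand.token_cost)
print("tokens left:", remaining)
```

A greedy ratio rule is easy to read but is not guaranteed to be optimal for this kind of knapsack-style problem; an exact solver or a learned policy could replace it without changing the interface.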
RefineLab pioneers a scalable, customizable path to reproducible dataset design, with broad implications for LLM evaluation.
[48] Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria's Minority Languages
Oluwadara Kalejaiye,Luel Hagos Beyene,David Ifeoluwa Adelani,Mmekut-Mfon Gabriel Edet,Aniefon Daniel Akpan,Eno-Abasi Urua,Anietie Andy
Main category: cs.CL
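TL;DR: Introduces ibom, a machine translation and topic classification dataset for four minority languages of coastal Nigeria (Anaang, Efik, Ibibio, and Oro), filling a gap left by existing benchmarks and evaluating how current large models perform on these languages.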
Details
Motivation: Nigeria is extremely linguistically diverse, yet NLP research concentrates on a handful of its languages, largely because of insufficient textual data. This work aims to advance research on underrepresented languages. Method: The authors build the ibom dataset by extending the Flores-200 benchmark and aligning the translated texts with SIB-200 topic labels, for use in machine translation and topic classification. Result: Experiments show that current large language models perform poorly on zero-shot and few-shot machine translation, whereas topic classification improves steadily as the number of shots increases. Conclusion: The ibom dataset is an important resource for NLP research on endangered languages; it exposes the limits of current models on low-resource languages and shows the potential of few-shot learning for classification. Abstract: Nigeria is the most populous country in Africa with a population of more than 200 million people. More than 500 languages are spoken in Nigeria and it is one of the most linguistically diverse countries in the world. Despite this, natural language processing (NLP) research has mostly focused on the following four languages: Hausa, Igbo, Nigerian-Pidgin, and Yoruba (i.e., <1% of the languages spoken in Nigeria). This is in part due to the unavailability of textual data in these languages to train and apply NLP algorithms. In this work, we introduce ibom, a dataset for machine translation and topic classification in four Coastal Nigerian languages from the Akwa Ibom State region: Anaang, Efik, Ibibio, and Oro. These languages are not represented in Google Translate or in major benchmarks such as Flores-200 or SIB-200. We focus on extending the Flores-200 benchmark to these languages, and further align the translated texts with topic labels based on the SIB-200 classification dataset. Our evaluation shows that current LLMs perform poorly on machine translation for these languages in both zero- and few-shot settings. However, we find that few-shot samples steadily improve topic classification as the number of shots grows.
[49] Rep2Text: Decoding Full Text from a Single LLM Token Representation
Haiyan Zhao,Zirui He,Fan Yang,Ali Payani,Mengnan Du
Main category: cs.CL
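TL;DR: Proposes Rep2Text, a framework that reconstructs the input text from the representation of a single last token in an LLM, recovering on average more than half of the information in 16-token sequences while preserving semantic coherence.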
Details
Motivation: To probe the interpretability of internal representations in large language models, in particular whether a single last token carries enough information to recover the original input. Method: A trainable adapter maps the target model's internal representations into the embedding space of a decoding language model, which then reconstructs the input text autoregressively. Result: Experiments on several model combinations show that, on average, more than 50% of the information in 16-token sequences can be recovered; longer sequences hit an information bottleneck but retain good semantic integrity; the framework generalizes well to medical-domain data. Conclusion: The last token of an LLM contains rich information about the input, and Rep2Text offers a new tool for understanding model internals and information compression. Abstract: Large language models (LLMs) have achieved remarkable progress across diverse tasks, yet their internal mechanisms remain largely opaque. In this work, we address a fundamental question: to what extent can the original input text be recovered from a single last-token representation within an LLM? We propose Rep2Text, a novel framework for decoding full text from last-token representations. Rep2Text employs a trainable adapter that projects a target model's internal representations into the embedding space of a decoding language model, which then autoregressively reconstructs the input text. Experiments on various model combinations (Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B) demonstrate that, on average, over half of the information in 16-token sequences can be recovered from this compressed representation while maintaining strong semantic integrity and coherence. Furthermore, our analysis reveals an information bottleneck effect: longer sequences exhibit decreased token-level recovery while preserving strong semantic integrity. Our framework also demonstrates robust generalization to out-of-distribution medical data.
[50] TabRAG: Tabular Document Retrieval via Structured Language Representations
Jacob Si,Mike Qu,Michelle Lee,Yingzhen Li
Main category: cs.CL
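TL;DR: Proposes TabRAG, a parsing-based RAG pipeline that handles table-heavy documents through structured language representations and outperforms existing parsing methods.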
Details
Motivation: Existing parsing methods perform poorly when extracting tabular data, while fine-tuning the embedding model is computationally expensive. Method: The authors design TabRAG, a parsing-based RAG pipeline that uses structured language representations to process table-heavy documents. Result: TabRAG outperforms existing popular parsing methods on generation and retrieval tasks. Conclusion: TabRAG improves the handling of table-heavy documents and is an efficient, practical RAG solution. Abstract: Ingesting data for Retrieval-Augmented Generation (RAG) involves either fine-tuning the embedding model directly on the target corpus or parsing documents for embedding model encoding. The former, while accurate, incurs high computational hardware requirements, while the latter suffers from suboptimal performance when extracting tabular data. In this work, we address the latter by presenting TabRAG, a parsing-based RAG pipeline designed to tackle table-heavy documents via structured language representations. TabRAG outperforms existing popular parsing-based methods for generation and retrieval. Code is available at https://github.com/jacobyhsi/TabRAG.
[51] MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Zhi Rui Tam,Yun-Nung Chen
Main category: cs.CL
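TL;DR: Audio large language models show marked modality bias and demographic bias in clinical decision-making, especially age-related disparities, which could lead to healthcare inequities.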
Details
Motivation: As large language models move from text to audio interaction, the study examines the biases that paralinguistic cues in speech (such as age, gender, and emotion) may introduce in clinical settings. Method: Models are evaluated on 170 clinical cases, each synthesized into speech with 36 voice profiles varying in age, gender, and emotion; surgical recommendations for audio and text inputs are compared, and the effect of chain-of-thought prompting on bias is analyzed. Result: Audio input changed surgical recommendations by as much as 35%, and one model gave 80% fewer recommendations; age disparities reached 12% and persisted in most models despite chain-of-thought prompting; explicit reasoning eliminated gender bias, while the impact of emotion could not be detected because emotion recognition was poor. Conclusion: Audio LLMs are prone to deviating from medical evidence because of a patient's voice characteristics, and bias-aware architectures are urgently needed to avoid worsening healthcare disparities. Abstract: As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, the impact of emotion was not detected due to poor recognition performance. These results demonstrate that audio LLMs are susceptible to making clinical decisions based on a patient's voice characteristics rather than medical evidence, a flaw that risks perpetuating healthcare disparities. We conclude that bias-aware architectures are essential and urgently needed before the clinical deployment of these models.
[52] Duality-based Mode Operations and Pyramid Multilayer Mapping for Rhetorical Modes
Zi-Niu Wu
Main category: cs.CL
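TL;DR: Proposes duality-based rhetorical mode operations and a multilayer pyramid mapping framework that expand the set of rhetorical modes while reducing cognitive complexity; expressive diversity and complexity are quantified with combinatorics and Shannon entropy, moving static rhetorical taxonomies toward dynamic, measurable discourse systems.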
Details
Motivation: To build a conceptual bridge among linguistics, computational modeling, and educational research, and to extend rhetorical modes so as to increase epistemic diversity across applications. Method: Duality operations (such as split-unite and forward-backward) generate new rhetorical modes, and a multilayer pyramid mapping framework runs from the rhetorical model layer to the cognitive layer to the epistemic layer; binomial combinatorics and Shannon entropy analysis provide the quantitative evaluation. Result: The paper defines the Marginal Rhetorical Bit (MRB) as a scalable parameter measuring the speed of expressive growth, finds that hierarchical selection markedly reduces choice uncertainty, and makes rhetorical systems more dynamic and measurable. Conclusion: The work suggests a path for AI systems to move from processing language tokens to layered rhetorical reasoning, bringing together linguistic, pedagogical, academic, and computational research. Abstract: Rhetorical modes are useful in both academic and non-academic writing, and can be studied within linguistic research and computational modeling. Establishing a conceptual bridge among these domains could enable each to benefit from the others. This paper proposes duality-based mode operations (split-unite, forward-backward, expansion-reduction and orthogonal dualities) to expand the set of rhetorical modes, introducing generated modes like combination and generalization, thereby enhancing epistemic diversity across multiple applications. It further presents a pyramid multilayer mapping framework (e.g., three layers, from the rhetorical model layer to the cognitive layer to the epistemic layer) that reduces the resulting cognitive complexity. The degrees of expressive diversity and complexity reduction are quantified through binomial combinatorics and Shannon entropy analysis. A Marginal Rhetorical Bit (MRB) is identified, permitting the definition of a rhetorical-scalable parameter that measures expressive growth speed in bits per stage. A direct entropy measure shows that hierarchical selection over smaller subsets markedly reduces choice uncertainty compared with flat selection across all modes. These considerations appear to transform static and non-measurable rhetorical taxonomies into more dynamic and more measurable systems for discourse design.
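The entropy comparison in the abstract above can be made concrete with a small worked sketch. It assumes uniform choice probabilities and equal-sized groups, and the counts (16 modes in 4 groups) are hypothetical rather than taken from the paper; the paper's own entropy measure may be defined differently.

```python
import math


def uniform_entropy_bits(n_options: int) -> float:
    """Shannon entropy, in bits, of a uniform choice among n_options."""
    return math.log2(n_options)


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
n_modes = 16   # assumed total number of rhetorical modes
n_groups = 4   # assumed number of equal-sized groups in the hierarchy

# Flat selection: one decision over all modes at once.
flat_bits = uniform_entropy_bits(n_modes)

# Hierarchical selection: first pick a group, then a mode inside that group.
group_stage_bits = uniform_entropy_bits(n_groups)
mode_stage_bits = uniform_entropy_bits(n_modes // n_groups)

print(f"flat: {flat_bits} bits in a single decision")
print(f"hierarchical: {group_stage_bits} + {mode_stage_bits} bits over two decisions")
```

Under these assumptions the total information is the same either way (the chain rule of entropy), but each hierarchical decision faces only half the uncertainty of the flat one. That per-decision reduction is one way to read the claim about choice uncertainty.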
From this work, it would be possible to identify a pathway for future AI systems to operate not only on language tokens but on layered rhetorical reasoning structures, bridging linguistic, pedagogical, academic, and computational research.
[53] How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models
Subhojit Ghimire
Main category: cs.CL
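TL;DR: This paper studies systematic bias against African-American English (AAE) in AI-driven content moderation. It finds that a widely used toxicity model scores AAE text as more offensive, and it presents an interactive teaching tool showing that algorithmic scores and human-set moderation policies jointly produce discriminatory outcomes.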
Details
Motivation: As AI content moderation spreads, public concern about algorithmic bias is growing. The paper aims to quantify the unfairness of AI across language varieties and to improve public understanding of how algorithmic discrimination works. Method: A quantitative benchmark measures how the toxic-bert model behaves on African-American English (AAE) versus Standard American English (SAE), and an interactive teaching tool with an adjustable sensitivity threshold visualizes the impact of the bias. Result: On average the model scores AAE text as 1.8 times more toxic and 8.8 times higher for "identity hate"; the interactive tool shows that, beyond the biased scores themselves, a uniform human-set moderation threshold further amplifies the discriminatory effect. Conclusion: AI content moderation is systematically biased against particular language communities. The real risk lies not only in the algorithm's output but in how seemingly neutral human policy institutionalizes that bias, so technical evaluation must be combined with public education. Abstract: Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flagged as "inappropriate" was not simply the victim of a biased algorithm? This paper investigates this problem using a dual approach. First, I conduct a quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) to measure performance disparity between text in African-American English (AAE) and Standard American English (SAE). The benchmark reveals a clear, systematic bias: on average, the model scores AAE text as 1.8 times more toxic and 8.8 times higher for "identity hate". Second, I introduce an interactive pedagogical tool that makes these abstract biases tangible. The tool's core mechanic, a user-controlled "sensitivity threshold," demonstrates that the biased score itself is not the only harm; instead, the more-concerning harm is the human-set, seemingly neutral policy that ultimately operationalises discrimination. This work provides both statistical evidence of disparate impact and a public-facing tool designed to foster critical AI literacy.
[54] Steering LLMs toward Korean Local Speech: Iterative Refinement Framework for Faithful Dialect Translation
Keunhyeung Park,Seunguk Yu,Youngbin Kim
Main category: cs.CL
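TL;DR: Proposes DIA-REFINE, a dialect refinement framework that improves the fidelity of standard-to-dialect translation by LLMs through an iterative loop of translation, verification, and feedback, and introduces two new metrics, DFS and TDR, to overcome the limitations of conventional n-gram metrics.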
Details
Motivation: Current LLMs show a marked dialect gap in standard-to-dialect translation, and conventional n-gram metrics tend to overrate outputs that simply copy the source text, so they cannot measure genuine dialect translation quality. Method: The DIA-REFINE framework uses external dialect classifiers to drive iterative translation correction and feedback; two new metrics, the dialect fidelity score (DFS) and the target dialect ratio (TDR), evaluate the translations more accurately. Result: On Korean dialect translation, DIA-REFINE consistently improves dialect fidelity in both zero-shot and in-context learning settings; the new metrics separate False Success from True Attempt cases and expose the models' real performance. Conclusion: DIA-REFINE offers a reliable framework for goal-directed, inclusive dialect translation, and the proposed metrics allow dialect conversion to be measured more precisely. Abstract: Standard-to-dialect machine translation remains challenging due to a persistent dialect gap in large language models and evaluation distortions inherent in n-gram metrics, which favor source copying over authentic dialect translation. In this paper, we propose the dialect refinement (DIA-REFINE) framework, which guides LLMs toward faithful target dialect outputs through an iterative loop of translation, verification, and feedback using external dialect classifiers. To address the limitations of n-gram-based metrics, we introduce the dialect fidelity score (DFS) to quantify linguistic shift and the target dialect ratio (TDR) to measure the success of dialect translation. Experiments on Korean dialects across zero-shot and in-context learning baselines demonstrate that DIA-REFINE consistently enhances dialect fidelity. The proposed metrics distinguish between False Success cases, where high n-gram scores obscure failures in dialectal translation, and True Attempt cases, where genuine attempts at dialectal translation yield low n-gram scores. We also observed that models exhibit varying degrees of responsiveness to the framework, and that integrating in-context examples further improves the translation of dialectal expressions. Our work establishes a robust framework for goal-directed, inclusive dialect translation, providing both rigorous evaluation and critical insights into model performance.
[55] Textual Self-attention Network: Test-Time Preference Optimization through Textual Gradient-based Attention
Shibing Mo,Haoyang Ruan,Kai Wu,Jing Liu
Main category: cs.CL
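TL;DR: This paper proposes the Textual Self-Attention Network (TSAN), a method for test-time preference optimization that needs no parameter updates. TSAN emulates self-attention in natural language to analyze, weigh, and merge the strengths of multiple candidate responses, and performs interpretable iterative optimization in a textual gradient space, improving the alignment of LLM outputs with human preferences.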
Details
Motivation: Existing test-time alignment methods usually revise a single candidate response and lack a systematic way to analyze and merge the strengths of several good responses. Because different responses may excel in clarity, factual accuracy, or tone, a method that combines the best of multiple candidates is needed. Method: TSAN formats multiple candidate responses as textual keys and values, uses an LLM to carry out text-based attention that weighs each candidate's relevance, and synthesizes a new aligned response under the guidance of the learned textual attention. The whole process runs in a textual gradient space, needs no parameter updates, and supports iterative optimization. Result: With only three test-time iterations, TSAN outperforms supervised fine-tuned models such as Llama-3.1-70B-Instruct and the current state-of-the-art test-time alignment method, making effective use of multiple candidate solutions. Conclusion: TSAN offers a training-free, interpretable, and efficient paradigm for test-time preference optimization of LLMs, improving output quality and alignment with human preferences by combining candidate strengths at the natural-language level. Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities, but aligning their outputs with human preferences typically requires expensive supervised fine-tuning. Recent test-time methods leverage textual feedback to overcome this, but they often critique and revise a single candidate response, lacking a principled mechanism to systematically analyze, weigh, and synthesize the strengths of multiple promising candidates. Such a mechanism is crucial because different responses may excel in distinct aspects (e.g., clarity, factual accuracy, or tone), and combining their best elements may produce a far superior outcome. This paper proposes the Textual Self-Attention Network (TSAN), a new paradigm for test-time preference optimization that requires no parameter updates. TSAN emulates self-attention entirely in natural language to overcome this gap: it analyzes multiple candidates by formatting them into textual keys and values, weighs their relevance using an LLM-based attention module, and synthesizes their strengths into a new, preference-aligned response under the guidance of the learned textual attention. This entire process operates in a textual gradient space, enabling iterative and interpretable optimization.
Empirical evaluations demonstrate that with just three test-time iterations on a base SFT model, TSAN outperforms supervised models like Llama-3.1-70B-Instruct and surpasses the current state-of-the-art test-time alignment method by effectively leveraging multiple candidate solutions.
[56] Sentiment Analysis On YouTube Comments Using Machine Learning Techniques Based On Video Games Content
Adi Danish Bin Muhammad Amin,Mohaiminul Islam Bhuiyan,Nur Shazwani Kamarudin,Zulfahmi Toh,Nur Syafiqah Nafis
Main category: cs.CL
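TL;DR: This study performs sentiment analysis of video games based on YouTube comments, classifying user sentiment with TextBlob and several machine learning algorithms (such as SVM). SVM achieves the highest accuracy, and the results can help game developers refine their designs.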
Details
Motivation: As the gaming industry grows rapidly, understanding the sentiment players express on social media is essential for improving game design and user experience. Method: Comments on game-related videos are collected through the YouTube API, analyzed for sentiment with TextBlob, and classified with machine learning algorithms including Naïve Bayes, Logistic Regression, and Support Vector Machine (SVM). Result: SVM achieves the highest classification accuracy across several datasets, and the analysis reveals trends in user preferences and criticism. Conclusion: Advanced sentiment analysis can capture the nuanced emotions in player comments and give game developers useful feedback; future work may add more sophisticated natural language processing techniques to improve the analysis further. Abstract: The rapid evolution of the gaming industry, driven by technological advancements and a burgeoning community, necessitates a deeper understanding of user sentiments, especially as expressed on popular social media platforms like YouTube. This study presents a sentiment analysis on video games based on YouTube comments, aiming to understand user sentiments within the gaming community. Utilizing the YouTube API, comments related to various video games were collected and analyzed using the TextBlob sentiment analysis tool. The pre-processed data underwent classification using machine learning algorithms, including Naïve Bayes, Logistic Regression, and Support Vector Machine (SVM). Among these, SVM demonstrated superior performance, achieving the highest classification accuracy across different datasets. The analysis spanned multiple popular gaming videos, revealing trends and insights into user preferences and critiques. The findings underscore the importance of advanced sentiment analysis in capturing the nuanced emotions expressed in user comments, providing valuable feedback for game developers to enhance game design and user experience. Future research will focus on integrating more sophisticated natural language processing techniques and exploring additional data sources to further refine sentiment analysis in the gaming domain.
[57] Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights
Hyunjae Kim,Jiwoong Sohn,Aidan Gilson,Nicholas Cochran-Caggiano,Serina Applebaum,Heeju Jin,Seihee Park,Yujin Park,Jiyeong Park,Seoyoung Choi,Brittany Alexandra Herrera Contreras,Thomas Huang,Jaehoon Yun,Ethan F. Wei,Roy Jiang,Leah Colucci,Eric Lai,Amisha Dave,Tuo Guo,Maxwell B. Singer,Yonghoe Koo,Ron A. Adelman,James Zou,Andrew Taylor,Arman Cohan,Hua Xu,Qingyu Chen
Main category: cs.CL
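TL;DR: This study presents the most comprehensive expert evaluation to date of retrieval-augmented generation (RAG) in medicine. Standard RAG performs poorly at retrieving and using evidence and can reduce the factuality and completeness of model outputs, but simple improvement strategies raise performance substantially.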
Details
Motivation: LLMs in medicine struggle to keep up with evolving knowledge and to provide verifiable reasoning; the study assesses whether RAG actually addresses these problems. Method: Eighteen medical experts produced more than 80,000 annotations on 800 outputs from GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries, with the RAG pipeline decomposed into evidence retrieval, evidence selection, and response generation. Result: Only 22% of the top-16 retrieved passages were relevant, evidence selection reached 41-43% precision and 27-49% recall, and factuality and completeness dropped by up to 6% and 5%, respectively; simple strategies such as evidence filtering and query reformulation improved performance on MedMCQA and MedXpertQA by up to 12% and 8.2%. Conclusion: Standard RAG can be counterproductive in medical settings; its role should be re-examined, with emphasis on stage-aware evaluation and deliberate system design. Abstract: Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively.
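The evidence-selection stage is reported in terms of precision and recall. A minimal sketch of how such figures might be computed for one response is given below; it assumes that passages are identified by string IDs and that experts have marked which retrieved passages are relevant. The function name and the example IDs are hypothetical, and the paper's actual annotation protocol may differ.

```python
def selection_precision_recall(used_ids, relevant_ids):
    """Precision and recall of the passages a response actually used.

    used_ids: passages the model drew on in its answer (assumed input).
    relevant_ids: passages experts judged relevant (assumed input).
    """
    used, relevant = set(used_ids), set(relevant_ids)
    true_positives = len(used & relevant)
    precision = true_positives / len(used) if used else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative IDs only: the model used 4 passages, 2 of which were relevant,
# while experts marked 5 passages as relevant overall.
precision, recall = selection_precision_recall(
    used_ids=["p1", "p2", "p3", "p4"],
    relevant_ids=["p1", "p2", "p5", "p6", "p7"],
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Reporting the two numbers per stage, rather than a single end-to-end accuracy, is what makes it possible to see whether a failure comes from retrieval or from how the retrieved evidence is used.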
These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
[58] Sensitivity of Small Language Models to Fine-tuning Data Contamination
Nicy Scaria,Silvester John Joseph Kennedy,Deepak Subramani
Main category: cs.CL
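TL;DR: This study systematically evaluates how sensitive 23 small language models (SLMs) are to data contamination during instruction tuning. Syntactic contamination (such as character or word reversal) causes sharp performance drops, while semantic contamination is tolerated better. The study also reveals a "capability curse": larger models are more susceptible to semantic contamination, and alignment training does not consistently improve robustness.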
Details
Motivation: Small language models (SLMs) are widely used in resource-constrained environments, yet their robustness to data contamination during instruction tuning is poorly understood and needs systematic evaluation to guide deployment. Method: 23 SLMs from 270M to 4B parameters are tested at four contamination levels (25% to 100%), analyzing the effect of syntactic transformations (character and word reversal) and semantic transformations (irrelevant and counterfactual responses), and comparing model families, sizes, and base versus instruction-tuned variants. Result: (1) Syntactic contamination, especially character reversal, causes near-complete failure in all models; (2) semantic contamination shows threshold effects, with core linguistic capabilities proving more resilient; (3) larger, more capable models learn semantic corruptions more readily (the "capability curse"); (4) instruction tuning gives inconsistent robustness gains and sometimes reduces resilience. Conclusion: Current robustness assumptions may not hold for small language models, particularly across different contamination types; the study calls for contamination-aware training protocols so that SLMs can be deployed reliably. Abstract: Small Language Models (SLMs) are increasingly being deployed in resource-constrained environments, yet their behavioral robustness to data contamination during instruction tuning remains poorly understood. We systematically investigate the contamination sensitivity of 23 SLMs (270M to 4B parameters) across multiple model families by measuring susceptibility to syntactic and semantic transformation types during instruction tuning: syntactic transformations (character and word reversal) and semantic transformations (irrelevant and counterfactual responses), each applied at contamination levels of 25%, 50%, 75%, and 100%. Our results reveal fundamental asymmetries in vulnerability patterns: syntactic transformations cause catastrophic performance degradation, with character reversal producing near-complete failure across all models regardless of size or family, while semantic transformations demonstrate distinct threshold behaviors and greater resilience in core linguistic capabilities. Critically, we discover a "capability curse" where larger, more capable models become more susceptible to learning semantic corruptions, effectively following harmful instructions more readily, while our analysis of base versus instruction-tuned variants reveals that alignment provides inconsistent robustness benefits, sometimes even reducing resilience.
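The syntactic transformations and contamination levels described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes each training example is a dictionary with a `response` field, that the transformation is applied to the response, and that contaminated examples are sampled uniformly at random. The names `contaminate`, `reverse_characters`, and `reverse_words` are hypothetical.

```python
import random


def reverse_characters(text: str) -> str:
    """Character reversal: 'the cat' -> 'tac eht'."""
    return text[::-1]


def reverse_words(text: str) -> str:
    """Word reversal: 'the cat sat' -> 'sat cat the'."""
    return " ".join(reversed(text.split()))


def contaminate(examples, transform, level, seed=0):
    """Apply `transform` to the response of a `level` fraction of examples.

    Assumption: corrupted examples are chosen uniformly at random.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0 and 1")
    rng = random.Random(seed)
    n_corrupt = round(level * len(examples))
    corrupt_idx = set(rng.sample(range(len(examples)), n_corrupt))
    return [
        {**ex, "response": transform(ex["response"])} if i in corrupt_idx else dict(ex)
        for i, ex in enumerate(examples)
    ]


# Toy instruction-tuning data (made up for illustration).
clean = [
    {"instruction": "Name the capital of France.", "response": "Paris is the capital."},
    {"instruction": "Add 2 and 3.", "response": "The sum is five."},
    {"instruction": "Give a synonym for big.", "response": "A synonym is large."},
    {"instruction": "Name a primary color.", "response": "Red is one."},
]
half_reversed = contaminate(clean, reverse_characters, level=0.5)
```

Running the same function at levels 0.25, 0.5, 0.75, and 1.0 with each transformation would give the kind of grid of fine-tuning sets that the abstract describes.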
Our work establishes three core contributions: (1) empirical evidence of SLMs' disproportionate vulnerability to syntactic pattern contamination, (2) identification of asymmetric sensitivity patterns between syntactic and semantic transformations, and (3) systematic evaluation protocols for contamination robustness assessment. These findings have immediate deployment implications, suggesting that current robustness assumptions may not hold for smaller models and highlighting the need for contamination-aware training protocols.
[59] SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
Ruiheng Liu,XiaoBing Chen,Jinyu Zhang,Qiongwen Zhang,Yu Zhang,Bailong Yang
Main category: cs.CL
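TL;DR: Proposes SafeNlidb, a privacy-security alignment framework that strengthens the safety of LLM-based natural language database interfaces. It automatically generates hybrid chain-of-thought data and uses an improved optimization method to achieve security-aware SQL generation without human annotation.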
Details
Motivation: Using LLMs in natural language database interfaces creates privacy and security risks; existing methods struggle with complex inference-based attacks and have high false positive rates, so a more reliable solution is needed. Method: An automated pipeline generates hybrid chain-of-thought data that combines security reasoning with SQL generation; reasoning warm-up and alternating preference optimization address the multi-preference oscillation of DPO, enabling fine-grained, security-aware SQL generation. Result: Experiments show that the method outperforms larger language models and ideal-setting baselines, improving security while preserving high utility. Conclusion: SafeNlidb balances security against query utility in NLIDB systems and offers a new approach to privacy protection without human annotation. Abstract: The rapid advancement of Large Language Models (LLMs) has driven significant progress in Natural Language Interface to Database (NLIDB). However, the widespread adoption of LLMs has raised critical privacy and security concerns. During interactions, LLMs may unintentionally expose confidential database contents or be manipulated by attackers to exfiltrate data through seemingly benign queries. While current efforts typically rely on rule-based heuristics or LLM agents to mitigate this leakage risk, these methods still struggle with complex inference-based attacks, suffer from high false positive rates, and often compromise the reliability of SQL queries. To address these challenges, we propose SafeNlidb, a novel privacy-security alignment framework for LLM-based NLIDB. The framework features an automated pipeline that generates hybrid chain-of-thought interaction data from scratch, seamlessly combining implicit security reasoning with SQL generation. Additionally, we introduce reasoning warm-up and alternating preference optimization to overcome the multi-preference oscillations of Direct Preference Optimization (DPO), enabling LLMs to produce security-aware SQL through fine-grained reasoning without the need for human-annotated preference data. Extensive experiments demonstrate that our method outperforms both larger-scale LLMs and ideal-setting baselines, achieving significant security improvements while preserving high utility. WARNING: This work may contain content that is offensive and harmful!
[60] Learning to Focus: Focal Attention for Selective and Scalable Transformers
Dhananjay Ram,Wei Xia,Stefano Soatto
Main category: cs.CL
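TL;DR: Proposes Focal Attention, a modification that sharpens the attention distribution by adjusting the softmax temperature, improving Transformer performance on long contexts and across model scales.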
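The sharpening effect of a lower softmax temperature can be seen in a small sketch. It assumes the common formulation in which attention scores are divided by a temperature before the softmax, so that a temperature below 1 concentrates probability on the highest-scoring tokens; the exact parameterization used by Focal Attention is not given in this summary, and the scores below are made up for illustration.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Numerically stable softmax over scores divided by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical attention scores of one query against four keys.
scores = [2.0, 1.0, 0.5, 0.0]
standard = softmax_with_temperature(scores, temperature=1.0)
sharpened = softmax_with_temperature(scores, temperature=0.5)

print("standard :", [round(p, 3) for p in standard])
print("sharpened:", [round(p, 3) for p in sharpened])
print("entropy drop (bits):", round(entropy_bits(standard) - entropy_bits(sharpened), 3))
```

In a full model the temperature would either be fixed as a hyperparameter or trained with the other weights, which is the choice the abstract describes.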
Details
Motivation: Standard softmax attention tends to produce noisy probability distributions that impair feature selection at every layer, and the problem worsens for long contexts. Method: Focal Attention controls the softmax temperature (as a fixed hyperparameter or a learnable parameter) to sharpen the attention distribution, so that the model focuses on relevant tokens and suppresses irrelevant ones. Result: Focal Attention scales better with model size, training data, and context length; across benchmarks it reaches the same accuracy with up to 42% fewer parameters or 33% less training data, and on long-context tasks it gives relative improvements of 17% to 82%. Conclusion: Focal Attention is a simple, effective method that markedly improves the long-context performance and resource efficiency of the Transformer architecture. Abstract: Attention is a core component of the transformer architecture, whether in encoder-only, decoder-only, or encoder-decoder models. However, the standard softmax attention often produces a noisy probability distribution, which can impair effective feature selection at every layer of these models, particularly for long contexts. We propose Focal Attention, a simple yet effective modification that sharpens the attention distribution by controlling the softmax temperature, either as a fixed hyperparameter or as a learnable parameter during training. This sharpening enables the model to concentrate on the most relevant tokens while suppressing irrelevant ones. Empirically, Focal Attention scales more favorably than the standard transformer with respect to model size, training data, and context length. Across diverse benchmarks, it achieves the same accuracy with up to 42% fewer parameters or 33% less training data. On long-context tasks, it delivers substantial relative improvements ranging from 17% to 82%, demonstrating its effectiveness in real-world applications.
[61] Beyond Plain Demos: A Demo-centric Anchoring Paradigm for In-Context Learning in Alzheimer's Disease Detection
Puzhen Su,Haoran Yin,Yongzhu Miao,Jintao Tang,Shasha Li,Ting Wang
Main category: cs.CL
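TL;DR: This paper proposes DA4ICL, a demo-centric anchoring framework that improves the contextual perception of LLMs for Alzheimer's disease (AD) detection. Diverse and Contrastive Retrieval (DCR) widens the context and Projected Vector Anchoring (PVA) at every Transformer layer deepens it; on three AD benchmarks DA4ICL clearly outperforms conventional in-context learning (ICL) and task vector methods.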
Details
Motivation: LLMs face difficulties in detecting Alzheimer's disease from narrative transcripts: pre-training data rarely covers this out-of-distribution task, and the demonstration contexts are highly homogeneous, which limits both task cognition and contextual perception. Because task cognition is fixed after pre-training, the key to improvement is stronger contextual perception. Method: The DA4ICL framework has two core components: (1) Diverse and Contrastive Retrieval (DCR), which increases the diversity of the demo set (context width); and (2) Projected Vector Anchoring (PVA), which injects key signals into the hidden states at every Transformer layer to strengthen fine-grained features (context depth). The method does not rely on coarse-grained task vector injection and instead focuses on optimizing demo content. Result: On three AD detection benchmarks, DA4ICL consistently and substantially outperforms standard ICL and existing task vector methods, showing stronger fine-grained discrimination and better adaptation to out-of-distribution, low-resource settings. Conclusion: Through demo-centric context enhancement, DA4ICL addresses the performance bottleneck that homogeneous demos and weak signals cause for LLMs on OOD tasks, offering a new paradigm for low-resource, fine-grained LLM adaptation. Abstract: Detecting Alzheimer's disease (AD) from narrative transcripts challenges large language models (LLMs): pre-training rarely covers this out-of-distribution task, and all transcript demos describe the same scene, producing highly homogeneous contexts. These factors cripple both the model's built-in task knowledge (task cognition) and its ability to surface subtle, class-discriminative cues (contextual perception). Because cognition is fixed after pre-training, improving in-context learning (ICL) for AD detection hinges on enriching perception through better demonstration (demo) sets. We demonstrate that standard ICL quickly saturates, that its demos lack diversity (context width) and fail to convey fine-grained signals (context depth), and that, while recent task vector (TV) approaches improve broad task adaptation by injecting TVs into the LLMs' hidden states (HSs), they are ill-suited for AD detection due to the mismatch of injection granularity, strength and position. To address these bottlenecks, we introduce DA4ICL, a demo-centric anchoring framework that jointly expands context width via Diverse and Contrastive Retrieval (DCR) and deepens each demo's signal via Projected Vector Anchoring (PVA) at every Transformer layer.
Across three AD benchmarks, DA4ICL achieves large, stable gains over both ICL and TV baselines, charting a new paradigm for fine-grained, OOD and low-resource LLM adaptation.
[62] CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition
Hung-Yang Sung,Chien-Chun Wang,Kuan-Tang Huang,Tien-Hong Lo,Yu-Sheng Tsao,Yung-Chang Hsu,Berlin Chen
Main category: cs.CL
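TL;DR: Proposes CLiFT-ASR, a cross-lingual fine-tuning framework for Taiwanese Hokkien speech recognition that markedly lowers the character error rate through two-stage training on Tai-lo romanization and Han-character text.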
Details
Motivation: Low-resource languages such as Taiwanese Hokkien lack annotated data. Fine-tuning directly on Han-character transcriptions fails to capture phonetic and tonal detail, training only on romanization gives insufficient lexical coverage, and prior work has rarely explored staged strategies that combine both annotation types. Method: Starting from Mandarin HuBERT models, a two-stage cross-lingual fine-tuning process first learns acoustic and tonal representations from Tai-lo phonetic annotations and then learns vocabulary and syntax from Han-character text. Result: On the TAT-MOE corpus, CLiFT-ASR achieves a 24.88% relative reduction in character error rate (CER) compared with strong baselines. Conclusion: CLiFT-ASR is an effective, parameter-efficient approach to Taiwanese Hokkien ASR and may extend to other low-resource languages. Abstract: Automatic speech recognition (ASR) for low-resource languages such as Taiwanese Hokkien is difficult due to the scarcity of annotated data. However, direct fine-tuning on Han-character transcriptions often fails to capture detailed phonetic and tonal cues, while training only on romanization lacks lexical and syntactic coverage. In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. The framework employs a two-stage process in which it first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions. This progressive adaptation enables effective alignment between speech sounds and orthographic structures. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate (CER) compared with strong baselines. The results indicate that CLiFT-ASR provides an effective and parameter-efficient solution for Taiwanese Hokkien ASR and that it has potential to benefit other low-resource language scenarios.
[63] Inclusion of Role into Named Entity Recognition and Ranking
Neelesh Kumar Shukla,Sanasam Ranbir Singh
Main category: cs.CL
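TL;DR: This paper studies entity role detection, modeling it as named entity recognition (NER) and as an entity retrieval/ranking task, and proposes a domain-agnostic approach that automatically learns representations of roles and entities from small datasets.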
Details
Motivation: Entities play different roles in different contexts, and conventional methods have difficulty identifying these roles, especially when no large domain-specific dataset is available. Method: Role detection is modeled as an NER task (using sequence tagging) and as an entity retrieval task (roles as queries, entities as the collection); automated methods learn representative words and phrases to build representations of roles and entities, and both sentence-level and document-level contexts are explored. Result: The work yields an entity role detection approach that applies to small, domain-agnostic datasets, matches roles and entities from indirect descriptions, and models roles in different contexts. Conclusion: Combining NER with information retrieval can address entity role detection without large amounts of annotated data and has good potential for cross-domain use. Abstract: Most Natural Language Processing systems involve entity-based processing for tasks such as Information Extraction, Question-Answering, Text-Summarization and so on. A new challenge arises when entities play roles according to their acts or attributes in a certain context. Entity Role Detection is the task of assigning such roles to entities. Usually, real-world entities are of types such as person, location and organization. Roles can be considered domain-dependent subtypes of these types. Cases where a subset of entities must be retrieved based on their roles pose the problem of defining the roles and the entities that hold them. This paper presents a study of solving the Entity Role Detection problem by modeling it as a Named Entity Recognition (NER) and Entity Retrieval/Ranking task. In NER, these roles can be considered mutually exclusive classes, and standard NER methods like sequence tagging can be used. For Entity Retrieval, roles can be formulated as the query and entities as the collection on which the query is executed. What distinguishes the Entity Retrieval task from document retrieval is that the entities, and the roles against which they need to be retrieved, are described only indirectly. We have formulated automated ways of learning representative words and phrases and building representations of roles and entities using them. We have also explored different contexts like sentence and document.
Since roles depend on context, it is not always possible to have large domain-specific datasets or knowledge bases for learning, so we have tried to exploit the information from small datasets in a domain-agnostic way.
[64] EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang,Mingzi Zhang,Xuanyu Yin,Sheng Jin,Suyu Lu,Zuocan Ying,Zengyi Yu,Xiangjie Kong
Main category: cs.CL
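TL;DR: Proposes EduGuardBench, a dual-component benchmark that evaluates the professional fidelity and educational safety of LLMs in teacher role-play, revealing polarized performance in teaching ability and safety as well as an "Educational Transformation Effect".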
Details
Motivation: Existing benchmarks cannot fully measure the professional fidelity of LLMs playing teachers or the ethical risks specific to educational settings, so a holistic evaluation framework is needed. Method: EduGuardBench combines a Role-playing Fidelity Score (RFS) with a diagnostic module for teaching-related harms, and uses persona-based adversarial prompts to probe safety vulnerabilities related to general harms and academic misconduct, evaluated with Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Result: Experiments on 14 leading models show polarized fidelity and safety; reasoning-oriented models have higher fidelity, but incompetence remains common; a "scaling paradox" emerges in which mid-sized models can be the most vulnerable; and an "Educational Transformation Effect" is identified, in which the safest models turn harmful requests into teaching opportunities. Conclusion: EduGuardBench provides a reproducible, education-oriented evaluation framework that goes beyond isolated knowledge tests, exposes the complex dynamics of AI in education, and points the way toward deploying trustworthy AI. Abstract: Large Language Models for Simulating Professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, ensuring their professional competence and ethical safety is a critical challenge, as existing benchmarks fail to measure role-playing fidelity or address the unique teaching harms inherent in educational scenarios. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, particularly, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive experiments on 14 leading models reveal a stark polarization in performance. While reasoning-oriented models generally show superior fidelity, incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox, where mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. Critically, we identified a powerful Educational Transformation Effect: the safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals. This capacity is strongly negatively correlated with ASR, revealing a new dimension of advanced AI safety.
EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for deploying trustworthy AI in education. See https://github.com/YL1N/EduGuardBench for materials.
[65] RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Haofeng Wang,Yu Zhang
Main category: cs.CL
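TL;DR: Proposes RPTS, a tree-structured metric for evaluating reasoning processes, and builds the RPTS-Eval benchmark of 374 images and 390 reasoning instances to assess large vision-language models on multimodal reasoning, exposing limitations of current models in reasoning processes and intermodal interaction.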
Details
Motivation: Existing benchmarks mostly check answer correctness and neglect the reasoning process, and they do not adequately consider how intermodal relationships affect reasoning, so they miss cases where flawed reasoning yields a correct answer. An evaluation method is needed that assesses the reasoning process systematically and accounts for intermodal interaction. Method: The Reasoning Process Tree Score (RPTS) organizes reasoning steps into a reasoning tree, uses the hierarchy to assign weighted faithfulness scores to each step, and adjusts the weights dynamically to locate where a model's reasoning fails. The RPTS-Eval benchmark contains reliable visual-textual clues as leaf nodes and defines three types of intermodal relationships to study their effect on reasoning. Result: Evaluations of representative LVLMs such as GPT4o and Llava-Next show that RPTS identifies problems in the reasoning process, reveals differences in multimodal reasoning between open-source and closed-source models, and finds current models weak at understanding intermodal relationships and at reasoning consistently. Conclusion: RPTS offers a finer-grained, interpretable way to evaluate reasoning processes, and RPTS-Eval can help advance multimodal reasoning research by pushing models toward better reasoning quality and intermodal understanding on complex tasks. Abstract: Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks. However, most of these benchmarks evaluate models primarily through multiple-choice or short-answer formats, which do not take the reasoning process into account. Although some benchmarks assess the reasoning process, their methods are often overly simplistic and only examine reasoning when answers are incorrect. This approach overlooks scenarios where flawed reasoning leads to correct answers. In addition, these benchmarks do not consider the impact of intermodal relationships on reasoning. To address this issue, we propose the Reasoning Process Tree Score (RPTS), a tree structure-based metric to assess reasoning processes. Specifically, we organize the reasoning steps into a reasoning tree and leverage its hierarchical information to assign weighted faithfulness scores to each reasoning step. By dynamically adjusting these weights, RPTS not only evaluates the overall correctness of the reasoning, but also pinpoints where the model fails in the reasoning. To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances. Each instance includes reliable visual-textual clues that serve as leaf nodes of the reasoning tree. Furthermore, we define three types of intermodal relationships to investigate how intermodal interactions influence the reasoning process.
We evaluated representative LVLMs (e.g., GPT4o, Llava-Next), uncovering their limitations in multimodal reasoning and highlighting the differences between open-source and closed-source commercial LVLMs. We believe that this benchmark will contribute to the advancement of research in the field of multimodal reasoning.
[66] HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection
Fangqi Dai,Xingjian Jiang,Zizhuang Deng
Main category: cs.CL
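TL;DR: Proposes Human Language Preference Detection (HLPD), which uses a reward-based alignment mechanism to improve the identification of machine-revised text and clearly outperforms existing methods on text produced by a range of advanced LLMs.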
Details
Motivation: Existing LLM text detectors perform poorly on text generated by advanced models or revised through adversarial multi-task machine revision, especially in the black-box setting. Method: HLPD rests on the hypothesis that human writing has distinctive stylistic patterns. A reward-based alignment process, Human Language Preference Optimization (HLPO), shifts the scoring model's token distribution toward human-like writing, making it more sensitive to human text and therefore better at identifying machine-revised text. Result: On texts revised by GPT-series models, HLPD achieves a 15.11% relative AUROC improvement over ImBD and surpasses Fast-DetectGPT by 45.56%; on texts generated by advanced LLMs it achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Conclusion: HLPD improves the detection of complex, adversarial machine-revised text and shows strong robustness and generalization, particularly in the black-box setting. Abstract: To prevent misinformation and social issues arising from trustworthy-looking content generated by LLMs, it is crucial to develop efficient and reliable methods for identifying the source of texts. Previous approaches have demonstrated exceptional performance in detecting texts fully generated by LLMs. However, these methods struggle when confronting more advanced LLM output or text with adversarial multi-task machine revision, especially in the black-box setting, where the generating model is unknown. To address this challenge, grounded in the hypothesis that human writing possesses distinctive stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD employs a reward-based alignment process, Human Language Preference Optimization (HLPO), to shift the scoring model's token distribution toward human-like writing, making the model more sensitive to human writing, therefore enhancing the identification of machine-revised text. We test HLPD in an adversarial multi-task evaluation framework that leverages a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios. When detecting texts revised by GPT-series models, HLPD achieves a 15.11% relative improvement in AUROC over ImBD, surpassing Fast-DetectGPT by 45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Code will be made available at https://github.com/dfq2021/HLPD.
[67] SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs
Zhenliang Zhang,Xinyu Hu,Xiaojun Wan
Main category: cs.CL
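TL;DR: This paper proposes SCOPE, which mitigates copyright infringement by LLMs at inference time by controlling copyright-sensitive regions in semantic space, with no parameter updates or external filters.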
Details
Motivation: LLMs may inadvertently reproduce copyrighted content; existing defences mostly rely on surface matching and external filters, which complicate deployment and miss leakage at the semantic level. Method: A sparse autoencoder (SAE) maps hidden states into a high-dimensional, near-monosemantic space, where a copyright-sensitive subspace is identified and its activations are suppressed during decoding. Result: Experiments on several benchmarks show that SCOPE reduces copyright infringement while preserving overall model performance; interpretability analyses indicate that the subspace captures high-level semantics. Conclusion: SCOPE provides an intrinsically controllable, inference-time copyright defence that needs no external filtering and balances safety with model utility. Abstract: Large language models sometimes inadvertently reproduce passages that are copyrighted, exposing downstream applications to legal risk. Most existing studies for inference-time defences focus on surface-level token matching and rely on external blocklists or filters, which add deployment complexity and may overlook semantically paraphrased leakage. In this work, we reframe copyright infringement mitigation as intrinsic semantic-space control and introduce SCOPE, an inference-time method that requires no parameter updates or auxiliary filters. Specifically, the sparse autoencoder (SAE) projects hidden states into a high-dimensional, near-monosemantic space; benefiting from this representation, we identify a copyright-sensitive subspace and clamp its activations during decoding. Experiments on widely recognized benchmarks show that SCOPE mitigates copyright infringement without degrading general utility. Further interpretability analyses confirm that the isolated subspace captures high-level semantics.
[68] Automated Circuit Interpretation via Probe Prompting
Giuseppe Birardi
Main category: cs.CL
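TL;DR: Proposes probe prompting, an automated method that turns attribution graphs into compact, interpretable subgraphs built from concept-aligned supernodes, speeding up mechanistic interpretability analysis of neural networks.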
Details
Motivation: Attribution graphs reveal feature pathways in neural networks, but analyzing them by hand is very slow (about 2 hours for a single prompt), so automated tooling is needed for efficiency and scalability.
Method: Starting from a seed prompt and target logit, the pipeline selects high-influence features, generates concept-targeted but context-varying probe prompts, and groups features by cross-prompt activation patterns into three supernode types: Semantic, Relationship, and Say-X.
Result: Across five prompts, probe-prompted subgraphs retain high explanatory coverage (Completeness 0.83) while markedly compressing complexity. Compared with geometric clustering baselines, concept-aligned groups show higher behavioral coherence (2.3x higher peak-token consistency, 5.8x higher activation-pattern similarity). Entity-swap tests show that early-layer features transfer robustly, while late-layer Say-X features specialize in output promotion.
Conclusion: Probe prompting compresses attribution graphs efficiently and interpretably, supports a backbone-and-specialization view of transformer computation, and ships with open-source code and tools for community reproduction and use.
Abstract: Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis -- a single prompt can take approximately 2 hours for an experienced circuit tracer. We present probe prompting, an automated pipeline that transforms attribution graphs into compact, interpretable subgraphs built from concept-aligned supernodes. Starting from a seed prompt and target logit, we select high-influence features, generate concept-targeted yet context-varying probes, and group features by cross-prompt activation signatures into Semantic, Relationship, and Say-X categories using transparent decision rules. Across five prompts including classic "capitals" circuits, probe-prompted subgraphs preserve high explanatory coverage while compressing complexity (Completeness 0.83, mean across circuits; Replacement 0.54). Compared to geometric clustering baselines, concept-aligned groups exhibit higher behavioral coherence: 2.3x higher peak-token consistency (0.425 vs 0.183) and 5.8x higher activation-pattern similarity (0.762 vs 0.130), despite lower geometric compactness. Entity-swap tests reveal a layerwise hierarchy: early-layer features transfer robustly (64% transfer rate, mean layer 6.3), while late-layer Say-X features specialize for output promotion (mean layer 16.4), supporting a backbone-and-specialization view of transformer computation.
We release code (https://github.com/peppinob-ol/attribution-graph-probing), an interactive demo (https://huggingface.co/spaces/Peppinob/attribution-graph-probing), and minimal artifacts enabling immediate reproduction and community adoption.
[69] Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs
Yingfeng Luo,Ziqiang Xu,Yuxuan Ouyang,Murun Yang,Dingyang Lin,Kaiyan Chang,Tong Zheng,Bei Li,Peinan Feng,Quan Du,Tong Xiao,Jingbo Zhu
Main category: cs.CL
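TL;DR: Proposes LMT, a family of multilingual translation models centered on both Chinese and English, covering 60 languages and 234 translation directions. By identifying "directional degeneration" and addressing it with Strategic Downsampling and Parallel Multilingual Prompting (PMP), LMT achieves state-of-the-art performance among models of comparable language coverage, even surpassing the larger Aya-101-13B and NLLB-54B. The models are released in four sizes to support future research.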
Details
Motivation: Existing multilingual machine translation models face challenges in language coverage, consistency of translation quality, and English-centric bias. In particular, they pay too little attention to Chinese, and training suffers from directional degeneration that hurts translation quality.
Method: Proposes the LMT model family. Strategic Downsampling mitigates the over-emphasis on reverse directions caused by symmetric multi-way fine-tuning data, and Parallel Multilingual Prompting (PMP) uses typologically related auxiliary languages to strengthen cross-lingual transfer. Rigorous data curation and refined adaptation strategies are also applied.
Result: LMT performs strongly across 234 translation directions. Its 4B version (LMT-60-4B) substantially outperforms the much larger Aya-101-13B and NLLB-54B and reaches SOTA in multilingual translation.
Conclusion: By addressing directional degeneration and introducing a new prompting mechanism, LMT delivers high-quality, broad-coverage multilingual translation centered on both Chinese and English, offering an effective path and strong baselines for non-English-centric, scalable, and inclusive translation systems.
Abstract: Large language models have significantly advanced Multilingual Machine Translation (MMT), yet the broad language coverage, consistent translation quality, and English-centric bias remain open challenges. To address these challenges, we introduce LMT, a suite of Large-scale Multilingual Translation models centered on both Chinese and English, covering 60 languages and 234 translation directions. During development, we identify a previously overlooked phenomenon of directional degeneration, where symmetric multi-way fine-tuning data overemphasize reverse directions (X → En/Zh), leading to excessive many-to-one mappings and degraded translation quality. We propose Strategic Downsampling, a simple yet effective method to mitigate this degeneration. In addition, we design Parallel Multilingual Prompting (PMP), which leverages typologically related auxiliary languages to enhance cross-lingual transfer. Through rigorous data curation and refined adaptation strategies, LMT achieves SOTA performance among models of comparable language coverage, with our 4B model (LMT-60-4B) surpassing the much larger Aya-101-13B and NLLB-54B models by a substantial margin. We release LMT in four sizes (0.6B/1.7B/4B/8B) to catalyze future research and provide strong baselines for inclusive, scalable, and high-quality MMT (https://github.com/NiuTrans/LMT).
[70] A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation
Siddharth Betala,Kushan Raj,Vipul Betala,Rohan Saswade
Main category: cs.CL
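TL;DR: Presents a two-stage system for the English-to-Indic multimodal translation task that automatically detects and corrects errors in the training data, then improves translation quality with the parameter-efficient fine-tuning method LoRA.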
Details
Motivation: Addresses poor training-data quality in text-only translation for English-Indic language pairs in order to improve translation accuracy.
Method: Proposes a vision-augmented judge-corrector pipeline in which a multimodal language model classifies each translation as correct, visually ambiguous, or mistranslated. GPT-4o-mini corrects the visually ambiguous cases and IndicTrans2 retranslates the mistranslated ones. IndicTrans2 is then fine-tuned with LoRA on both the original and the corrected data.
Result: The pipeline corrects an average of 17.1% of training captions per language across four Indic languages. Training on corrected data improves BLEU, including +1.30 for English-Bengali (evaluation set), +0.60 for English-Odia (evaluation set), and +0.10 for English-Hindi (challenge set).
Conclusion: Combining automated data cleaning with parameter-efficient fine-tuning improves translation quality for low-resource Indic languages, especially where the data contains visual ambiguity or translation errors.
Abstract: In this paper, we describe our system under the team name BLEU Monday for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies translations into three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions requiring visual disambiguation, while IndicTrans2 retranslates cases with pure translation quality issues. This automated pipeline processes 28,928 training examples across four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both original and corrected datasets.
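The routing step of the judge-corrector pipeline described in this abstract can be sketched in a few lines. This is only a minimal illustration, not the authors' implementation: the `Example` fields, the `route` and `correction_rate` helpers, and all three stub functions are hypothetical stand-ins, and no real model is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# The three judge categories named in the abstract.
CORRECT = "correct"
VISUALLY_AMBIGUOUS = "visually_ambiguous"
MISTRANSLATED = "mistranslated"


@dataclass
class Example:
    source: str       # English caption
    translation: str  # existing target-language caption
    image_id: str     # identifier of the paired image (hypothetical field)


def route(example: Example,
          judge: Callable[[Example], str],
          visual_corrector: Callable[[Example], str],
          text_corrector: Callable[[Example], str]) -> Tuple[str, str]:
    """Return (final translation, judge label) for one training example."""
    label = judge(example)
    if label == VISUALLY_AMBIGUOUS:
        # Needs image context; per the abstract this role is played by GPT-4o-mini.
        return visual_corrector(example), label
    if label == MISTRANSLATED:
        # Pure translation-quality issue; per the abstract IndicTrans2 retranslates.
        return text_corrector(example), label
    return example.translation, label


def correction_rate(labels: List[str]) -> float:
    """Fraction of examples that were sent to either corrector."""
    if not labels:
        return 0.0
    return sum(label != CORRECT for label in labels) / len(labels)


# Toy stand-ins so the sketch runs without any model; all three are hypothetical.
def stub_judge(ex: Example) -> str:
    if not ex.translation:
        return MISTRANSLATED
    if "bat" in ex.source.split():
        return VISUALLY_AMBIGUOUS
    return CORRECT


def stub_visual(ex: Example) -> str:
    return f"<caption regenerated with image {ex.image_id}>"


def stub_text(ex: Example) -> str:
    return f"<retranslation of: {ex.source}>"


examples = [
    Example("a man holding a bat", "t1", "img1"),
    Example("a red car", "", "img2"),
    Example("two dogs running", "t3", "img3"),
]
results = [route(ex, stub_judge, stub_visual, stub_text) for ex in examples]
labels = [label for _, label in results]
```

Passing the judge and correctors as callables keeps the routing logic independent of any particular model, so the same function could wrap real model calls. A statistic like the reported 17.1% average would correspond to `correction_rate` computed per language.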
Training on corrected data yields consistent improvements, with BLEU score gains of +1.30 for English-Bengali on the evaluation set (42.00 -> 43.30) and +0.70 on the challenge set (44.90 -> 45.60), +0.60 for English-Odia on the evaluation set (41.00 -> 41.60), and +0.10 for English-Hindi on the challenge set (53.90 -> 54.00).
[71] Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity
Anastasiia Tokareva,Judith Dineley,Zoe Firth,Pauline Conde,Faith Matcham,Sara Siddi,Femke Lamers,Ewan Carr,Carolin Oetzmann,Daniel Leightley,Yuezhou Zhang,Amos A. Folarin,Josep Maria Haro,Brenda W. J. H. Penninx,Raquel Bailon,Srinivasan Vairavan,Til Wykes,Richard J. B. Dobson,Vaibhav A. Narayan,Matthew Hotopf,Nicholas Cummins,The RADAR-CNS Consortium
Main category: cs.CL
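TL;DR: Examines associations between interpretable lexical features and major depressive disorder (MDD) symptom severity in longitudinal data. A small number of significant features were found in English and Dutch data, but predictive performance was near chance level, indicating that larger samples and improved methods are needed.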
Details
Motivation: Prior work has mostly used non-clinical, cross-sectional written-language data with complex, hard-to-interpret machine learning models. This study uses real-world longitudinal speech data to identify interpretable lexical features associated with MDD symptom severity and to assess how well they carry across languages.
Method: Linear mixed-effects models are applied to 5,836 recordings with PHQ-8 scores from the UK, the Netherlands, and Spain, using interpretable lexical features. Interpretable features and high-dimensional vector embeddings are also used to test the prediction performance of four machine learning regression models.
Result: In English data, 7 lexical features were associated with MDD symptom severity (for example lexical diversity measures and absolutist language). In Dutch, associations were observed with words per sentence and positive word frequency. No significant associations were found in the Spanish data. Across all languages, the predictive power of lexical features and vector embeddings was near chance level.
Conclusion: To establish the value of lexical markers in clinical research and practice, studies in larger multilingual samples are needed, with improved data collection protocols and machine learning models that account for within- and between-individual variation in language.
Abstract: Background: Captured between clinical appointments using mobile devices, spoken language has potential for objective, more regular assessment of symptom severity and earlier detection of relapse in major depressive disorder. However, research to date has largely been in non-clinical cross-sectional samples of written language using complex machine learning (ML) approaches with limited interpretability. Methods: We describe an initial exploratory analysis of longitudinal speech data and PHQ-8 assessments from 5,836 recordings of 586 participants in the UK, Netherlands, and Spain, collected in the RADAR-MDD study. We sought to identify interpretable lexical features associated with MDD symptom severity with linear mixed-effects modelling. Interpretable features and high-dimensional vector embeddings were also used to test the prediction performance of four regressor ML models. Results: In English data, MDD symptom severity was associated with 7 features including lexical diversity measures and absolutist language. In Dutch, associations were observed with words per sentence and positive word frequency; no associations were observed in recordings collected in Spain. The predictive power of lexical features and vector embeddings was near chance level across all languages. Limitations: Smaller samples in non-English speech and methodological choices, such as the elicitation prompt, may have also limited the effect sizes observable. A lack of NLP tools in languages other than English restricted our feature choice.
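To make the kinds of interpretable lexical features named in the Results more concrete, here is a minimal sketch that computes three of them from a transcript. It rests on several assumptions: type-token ratio is used as one common lexical diversity measure, the tokenizer and sentence splitter are deliberately naive, and `ABSOLUTIST_WORDS` is a small illustrative list, not the lexicon used in the study.

```python
import re

# Illustrative absolutist terms only; NOT the lexicon used in the study.
ABSOLUTIST_WORDS = {"always", "never", "completely", "nothing", "everything", "totally"}


def tokenize(text):
    """Lowercase word tokens (naive; real transcripts need more care)."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Split on ., ! or ? and drop empty fragments."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def lexical_features(text):
    """Return a dict of simple, interpretable lexical features."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    n = len(tokens)
    if n == 0:
        return {"type_token_ratio": 0.0, "words_per_sentence": 0.0, "absolutist_rate": 0.0}
    return {
        # Unique words divided by total words: a basic lexical diversity measure.
        "type_token_ratio": len(set(tokens)) / n,
        # Mean sentence length in words.
        "words_per_sentence": n / max(len(sentences), 1),
        # Share of tokens that appear in the illustrative absolutist list.
        "absolutist_rate": sum(t in ABSOLUTIST_WORDS for t in tokens) / n,
    }


features = lexical_features("I always feel tired. Nothing helps.")
```

Features like these are easy to inspect one at a time, which is what makes them usable as fixed effects in a mixed-effects model of the kind the abstract describes.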
Conclusion: To understand the value of lexical markers in clinical research and practice, further research is needed in larger samples across several languages using improved protocols, and ML models that account for within- and between-individual variations in language.
[72] Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Yauhen Babakhin,Radek Osmulski,Ronay Ak,Gabriel Moreira,Mengyao Xu,Benedikt Schifferer,Bo Liu,Even Oldridge
Main category: cs.CL
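TL;DR: Introduces llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art multilingual performance on the MMTEB leaderboard. It is trained on a mix of 16.1 million query-document pairs, and the model weights, training details, and ablation studies are publicly released.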
Details
Motivation: High-performing embedding models often lack transparency, with training data and methods not fully disclosed. This work aims to build a fully open-source, reproducible, high-performance multilingual text embedding model that improves community access and research transparency.
Method: Trains with a contrastive loss on 16.1 million query-document pairs (7.7 million from public datasets and 8.4 million generated by various open-weight LLMs). Detailed ablations assess contrastive loss implementations, synthetic data generation strategies, and model merging. The resulting embedding model is instruction-aware.
Result: Achieves leading performance on the MMTEB benchmark, performs strongly on retrieval, classification, and semantic textual similarity, and shows strong capability in low-resource-language and cross-lingual settings. The ablations support the effectiveness of synthetic data generation and model merging.
Conclusion: llama-embed-nemotron-8b is a high-performance, open, instruction-aware universal text embedding model with strong multilingual capability. With its weights and ablation analyses released and the curated training datasets planned for release, it gives the community a transparent and extensible embedding solution.
Abstract: We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks -- including retrieval, classification and semantic textual similarity (STS) -- and excels in challenging multilingual scenarios, such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. The llama-embed-nemotron-8b is an instruction-aware model, supporting user-defined instructions to enhance performance for specific use-cases.
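The abstract does not say which contrastive loss the model uses, so the following is only a generic sketch of one widely used choice, InfoNCE with in-batch negatives, to show what training on query-document pairs with a contrastive objective looks like. The function names and the temperature value are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    documents[i] is the positive for queries[i]; every other document in the
    batch serves as a negative. The temperature is an assumed value.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in documents]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true document.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d embeddings: aligned pairs should score far lower than swapped pairs.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss shrinks as each query moves closer to its own document than to the other documents in the batch. Variants differ mainly in how negatives are chosen and how similarities are scaled.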
This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.
[73] Evaluating LLMs for Anxiety, Depression, and Stress Detection Evaluating Large Language Models for Anxiety, Depression, and Stress Detection: Insights into Prompting Strategies and Synthetic Data
Mihael Arcan,David-Paul Niland
Main category: cs.CL
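TL;DR: Compares multiple approaches for detecting anxiety, depression, and stress from text, finding that transformer-based models (such as Distil-RoBERTa and XLNet) perform best and that synthetic data improves recall and generalization.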
Details
Motivation: Symptoms of mental health conditions are expressed subtly and variably in text, which makes automatic detection difficult, so the classification performance of different models on clinical interview text needs to be evaluated.
Method: Using the DAIC-WOZ clinical interview dataset, the study compares models including Llama, GPT, BERT, XLNet, and Distil-RoBERTa, fine-tunes models to classify anxiety, depression, and stress, and applies synthetic data generation to mitigate class imbalance.
Result: Distil-RoBERTa achieved an F1 of 0.883 on GAD-2, XLNet reached an F1 of up to 0.891 on PHQ tasks, and a zero-shot synthetic approach reached an F1 of 0.884 and a ROC AUC of 0.886 for stress detection.
Conclusion: Transformer-based models perform well for mental health detection from text and synthetic data can improve performance, but careful calibration is needed to avoid precision loss.
Abstract: Mental health disorders affect over one-fifth of adults globally, yet detecting such conditions from text remains challenging due to the subtle and varied nature of symptom expression. This study evaluates multiple approaches for mental health detection, comparing Large Language Models (LLMs) such as Llama and GPT with classical machine learning and transformer-based architectures including BERT, XLNet, and Distil-RoBERTa. Using the DAIC-WOZ dataset of clinical interviews, we fine-tuned models for anxiety, depression, and stress classification and applied synthetic data generation to mitigate class imbalance. Results show that Distil-RoBERTa achieved the highest F1 score (0.883) for GAD-2, while XLNet outperformed others on PHQ tasks (F1 up to 0.891). For stress detection, a zero-shot synthetic approach (SD+Zero-Shot-Basic) reached an F1 of 0.884 and ROC AUC of 0.886. Findings demonstrate the effectiveness of transformer-based models and highlight the value of synthetic data in improving recall and generalization. However, careful calibration is required to prevent precision loss. Overall, this work emphasizes the potential of combining advanced language models and data augmentation to enhance automated mental health assessment from text.
[74] When Sufficient is not Enough: Utilizing the Rashomon Effect for Complete Evidence Extraction
Katharina Beckh,Stefan Rüping
Main category: cs.CL
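TL;DR: Studies the importance of identifying complete evidence behind model decisions, especially in domains such as medicine that require comprehensive feature attribution. Aggregating evidence from multiple models substantially improves recall of complete evidence.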
Details
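Motivation: Conventional feature attribution methods provide only minimal sufficient evidence, but applications such as compliance and cataloging require identifying all contributing features (complete evidence), which calls for more comprehensive attribution.
Method: A case study on a medical dataset with human-annotated complete evidence evaluates how well single models and multi-model ensembles recall complete evidence, and analyzes the effects of training with evidence, dynamic ensembles, and certainty thresholds.
Result: Single models typically recover only part of the complete evidence, whereas aggregating several models raises evidence recall from about 0.60 to about 0.86. The recall-precision trade-off and the behavior of dynamic ensembles are also analyzed.
Conclusion: Obtaining the complete basis for a model decision calls for a multi-model ensemble strategy, which clearly outperforms a single model and matters for applications that need high coverage.
Abstract: Feature attribution methods typically provide minimal sufficient evidence justifying a model decision. However, in many applications this is inadequate. For compliance and cataloging, the full set of contributing features must be identified - complete evidence. We perform a case study on a medical dataset which contains human-annotated complete evidence. We show that individual models typically recover only subsets of complete evidence and that aggregating evidence from several models improves evidence recall from ~0.60 (single best model) to ~0.86 (ensemble). We analyze the recall-precision trade-off, the role of training with evidence, dynamic ensembles with certainty thresholds, and discuss implications.
[75] Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection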
Brage Eilertsen,Røskva Bjørgfinsdóttir,Francielle Vargas,Ali Ramezani-Kebrya
Main category: cs.CL
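TL;DR: Proposes Supervised Rational Attention (SRA), a framework that aligns model attention with human-annotated rationales to improve interpretability and fairness in hate speech detection.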
Details
Motivation: Deep learning models for hate speech detection lack transparency, which makes ethical deployment difficult, so their interpretability and fairness need to improve.
Method: Adds a supervised attention mechanism to transformer-based classifiers and optimizes a joint objective that combines a classification loss with an attention alignment loss, so that model attention agrees with human-annotated rationales.
Result: On English and Portuguese datasets, SRA achieves 2.4x better explainability than current baselines and produces token-level explanations that are more faithful and human-aligned. Its fairness is competitive across all measures, with second-best performance at detecting toxic posts that target identity groups and comparable results on other metrics.
Conclusion: Incorporating human rationales into attention mechanisms improves interpretability and faithfulness without compromising fairness.
Abstract: The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers, optimizing a joint objective that combines standard classification loss with an alignment loss term that minimizes the discrepancy between attention weights and human-annotated rationales. We evaluated SRA on hate speech benchmarks in English (HateXplain) and Portuguese (HateBRXplain) with rationale annotations. Empirically, SRA achieves 2.4x better explainability compared to current baselines, and produces token-level explanations that are more faithful and human-aligned. In terms of fairness, SRA achieves competitive fairness across all measures, with second-best performance in detecting toxic posts targeting identity groups, while maintaining comparable results on other metrics. These findings demonstrate that incorporating human rationales into attention mechanisms can enhance interpretability and faithfulness without compromising fairness.
[76] Importance-Aware Data Selection for Efficient LLM Instruction Tuning
Tingyu Jiang,Shen Li,Yiyao Song,Lan Zhang,Hualei Zhu,Yuan Zhao,Xiaohang Xu,Kenjiro Taura,Hao Henry Wang
Main category: cs.CL
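TL;DR: Proposes MIWV, a new metric that quantifies how important instruction data is for improving an LLM's capabilities. Training on only the top 1% of data ranked by MIWV can outperform training on the full dataset.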
Details
Motivation: Prior work mostly scores the quality of instruction data, but lacks methods for selecting the high-quality data that most improves a specific LLM.
Method: Based on discrepancies in the model's responses under In-Context Learning (ICL), proposes the Model Instruction Weakness Value (MIWV) metric to measure the importance of instruction data and guide data selection.
Result: Experiments show that training on only the top 1% of data ranked by MIWV can outperform training on the full dataset.
Conclusion: MIWV is an effective metric for selecting instruction data. It improves the efficiency and performance of instruction tuning and goes beyond approaches based on data quality scoring.
Abstract: Instruction tuning plays a critical role in enhancing the performance and efficiency of Large Language Models (LLMs). Its success depends not only on the quality of the instruction data but also on the inherent capabilities of the LLM itself. Some studies suggest that even a small amount of high-quality data can achieve instruction fine-tuning results that are on par with, or even exceed, those from using a full-scale dataset. However, rather than focusing solely on calculating data quality scores to evaluate instruction data, there is a growing need to select high-quality data that maximally enhances the performance of instruction tuning for a given LLM. In this paper, we propose the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the importance of instruction data in enhancing a model's capabilities. The MIWV metric is derived from the discrepancies in the model's responses when using In-Context Learning (ICL), helping identify the most beneficial data for enhancing instruction tuning performance. Our experimental results demonstrate that selecting only the top 1% of data based on MIWV can outperform training on the full dataset. Furthermore, this approach extends beyond existing research that focuses on data quality scoring for data selection, offering strong empirical evidence supporting the effectiveness of our proposed method.
[77] EmoBang: Detecting Emotion From Bengali Texts
Abdullah Al Maruf,Aditi Golder,Zakaria Masud Jiyad,Abdullah Al Numan,Tarannum Shaila Zaman
Main category: cs.CL
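TL;DR: Introduces a new Bengali emotion detection dataset and two deep learning models (EmoBangHybrid and EmoBangEnsemble) for automatic emotion recognition, and evaluates multiple baselines and large language models, establishing the first comprehensive benchmark for Bengali emotion analysis.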
Details
Motivation: Although Bengali is the world's fourth most spoken language, emotion detection for it is held back by the lack of large, standardized datasets, and existing studies mostly rely on classical machine learning with limited performance. More advanced models and high-quality data are needed.
Method: Presents a newly annotated Bengali dataset with eight emotion categories and two models: a hybrid convolutional recurrent neural network (EmoBangHybrid) and an AdaBoost-BERT ensemble (EmoBangEnsemble). Six baseline models with five feature engineering techniques are also evaluated, along with zero-shot and few-shot large language models.
Result: EmoBangHybrid and EmoBangEnsemble reach accuracies of 92.86% and 93.69%, respectively, outperforming existing methods and setting strong new baselines. Comparative results across models and initial LLM results on the task are also reported.
Conclusion: The study provides the first comprehensive benchmark for Bengali emotion detection, shows that deep learning models are effective for a low-resource language, and lays a foundation for future work.
Abstract: Emotion detection from text seeks to identify an individual's emotional or mental state - positive, negative, or neutral - based on linguistic cues. While significant progress has been made for English and other high-resource languages, Bengali remains underexplored despite being the world's fourth most spoken language. The lack of large, standardized datasets classifies Bengali as a low-resource language for emotion detection. Existing studies mainly employ classical machine learning models with traditional feature engineering, yielding limited performance. In this paper, we introduce a new Bengali emotion dataset annotated across eight emotion categories and propose two models for automatic emotion detection: (i) a hybrid Convolutional Recurrent Neural Network (CRNN) model (EmoBangHybrid) and (ii) an AdaBoost-Bidirectional Encoder Representations from Transformers (BERT) ensemble model (EmoBangEnsemble). Additionally, we evaluate six baseline models with five feature engineering techniques and assess zero-shot and few-shot large language models (LLMs) on the dataset. To the best of our knowledge, this is the first comprehensive benchmark for Bengali emotion detection. Experimental results show that EmoBangH and EmoBangE achieve accuracies of 92.86% and 93.69%, respectively, outperforming existing methods and establishing strong baselines for future research.
[78] Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
Khalil Hennara,Ahmad Bastati,Muhammad Hreden,Mohamed Motasim Hamed,Zeina Aldallal,Sara Chrouf,Safwan AlModhayan
Main category: cs.CL
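TL;DR: Presents Wasm, a processing pipeline that builds a high-quality Arabic multimodal dataset from Common Crawl. It outputs Markdown that preserves web page structure and supports both text-only and multimodal pre-training.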
Details
Motivation: The lack of high-quality Arabic multimodal datasets that preserve document structure has limited progress on Arabic large models. This work aims to fill that gap.
Method: Proposes the Wasm pipeline, which processes Common Crawl data, extracts natural documents with interleaved images and text, produces structure-preserving Markdown, and is compared against the processing pipelines of existing datasets.
Result: Produces a structurally intact Arabic dataset that supports multimodal pre-training, and releases a representative data sample together with the open-source processing pipeline.
Conclusion: The Wasm pipeline improves the quality and usability of Arabic multimodal data and supports further research on Arabic large models.
Abstract: The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. Recent research shows that large multimodal models trained on natural documents where images and text are interleaved outperform those trained only on image-text pairs across a wide range of benchmarks, leveraging advanced pre-trained models to enforce semantic alignment, image-sequence consistency, and textual coherence. For Arabic, however, the lack of high-quality multimodal datasets that preserve document structure has limited progress. In this paper, we present our pipeline Wasm for processing the Common Crawl dataset to create a new Arabic multimodal dataset that uniquely provides markdown output. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content while maintaining flexibility for both text-only and multimodal pre-training scenarios. We provide a comprehensive comparative analysis of our data processing pipeline against those used for major existing datasets, highlighting the convergences in filtering strategies and justifying our specific design choices. To support future research, we publicly release a representative dataset dump along with the multimodal processing pipeline for Arabic.
[79] More Agents Helps but Adversarial Robustness Gap Persists
Khashayar Alavi,Zhastay Yeltay,Lucie Flek,Akbar Karimi
Main category: cs.CL
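TL;DR: Investigates how robust multi-agent LLM systems are when answering math questions under adversarial inputs, finding that collaboration improves accuracy but does not close the adversarial robustness gap.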
Details
Motivation: Examines whether collaboration among LLM agents improves robustness to adversarial inputs.
Method: Using the unified sampling-and-voting framework Agent Forest, evaluates six open-source models on four benchmarks under several noise types and real-world typos.
Result: The harm from punctuation noise grows with its intensity, and human typos are the main bottleneck. Collaboration reliably improves accuracy, but the adversarial robustness gap remains.
Conclusion: Multi-agent collaboration improves accuracy on math questions, but robustness to adversarial inputs stays limited, especially for human-like typos.
Abstract: When LLM agents work together, they seem to be more powerful than a single LLM in mathematical question answering. However, are they also more robust to adversarial inputs? We investigate this question using adversarially perturbed math questions. These perturbations include punctuation noise with three intensities (10, 30, and 50 percent), plus real-world and human-like typos (WikiTypo, R2ATA). Using a unified sampling-and-voting framework (Agent Forest), we evaluate six open-source models (Qwen3-4B/14B, Llama3.1-8B, Mistral-7B, Gemma3-4B/12B) across four benchmarks (GSM8K, MATH, MMLU-Math, MultiArith), with various numbers of agents n from one to 25 (1, 2, 5, 10, 15, 20, 25). Our findings show that (1) Noise type matters: punctuation noise harm scales with its severity, and the human typos remain the dominant bottleneck, yielding the largest gaps to Clean accuracy and the highest ASR even with a large number of agents. And (2) Collaboration reliably improves accuracy as the number of agents, n, increases, with the largest gains from one to five agents and diminishing returns beyond 10 agents. However, the adversarial robustness gap persists regardless of the agent count.
[80] Think Consistently, Reason Efficiently: Energy-Based Calibration for Implicit Chain-of-Thought
Zhikang Chen,Sen Cui,Deheng Ye,Yu Zhang,Yatao Bian,Tingting Zhu
Main category: cs.CL
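TL;DR: Proposes EBM-CoT, an energy-based Chain-of-Thought calibration framework that refines reasoning trajectories in latent space to improve the consistency and accuracy of LLM reasoning.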
Details
Motivation: Explicit Chain-of-Thought methods suffer from error propagation and limited expressiveness, while implicit continuous reasoning lacks a mechanism to keep reasoning steps consistent, which leads to divergent reasoning paths.
Method: Proposes the EBM-CoT framework, which uses an energy-based model to calibrate latent thought representations and dynamically moves reasoning trajectories toward low-energy, high-consistency regions without modifying the base language model.
Result: Experiments on mathematical, commonsense, and symbolic reasoning benchmarks show that the method significantly improves the consistency and efficiency of multi-step reasoning.
Conclusion: EBM-CoT improves the accuracy and stability of continuous reasoning in LLMs and gives implicit reasoning a controllable consistency mechanism.
Abstract: Large Language Models (LLMs) have demonstrated strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which enables step-by-step intermediate reasoning. However, explicit CoT methods rely on discrete token-level reasoning processes that are prone to error propagation and limited by vocabulary expressiveness, often resulting in rigid and inconsistent reasoning trajectories. Recent research has explored implicit or continuous reasoning in latent spaces, allowing models to perform internal reasoning before generating explicit output. Although such approaches alleviate some limitations of discrete CoT, they generally lack explicit mechanisms to enforce consistency among reasoning steps, leading to divergent reasoning paths and unstable outcomes. To address this issue, we propose EBM-CoT, an Energy-Based Chain-of-Thought Calibration framework that refines latent thought representations through an energy-based model (EBM). Our method dynamically adjusts latent reasoning trajectories toward lower-energy, high-consistency regions in the embedding space, improving both reasoning accuracy and consistency without modifying the base language model. Extensive experiments across mathematical, commonsense, and symbolic reasoning benchmarks demonstrate that the proposed framework significantly enhances the consistency and efficiency of multi-step reasoning in LLMs.
[81] LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Seungeon Lee,Soumi Das,Manish Gupta,Krishna P. Gummadi
Main category: cs.CL
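TL;DR: Proposes LoRA on the Go (LoGo), a training-free framework for dynamic adapter selection and merging that adaptively combines multiple LoRAs at the instance level, suited to multi-task and cross-domain settings.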
Details
Motivation: Conventional LoRA adapters are usually trained for a single task and cope poorly with the diverse, unpredictable tasks seen in practice. Existing methods for combining multiple LoRAs often need extra labeled data or task-specific training, which is costly.
Method: LoGo extracts signals from a single forward pass through the LoRA adapters, judges each adapter's relevance dynamically, and sets the merging weights on the fly, giving training-free, instance-level adapter selection and combination.
Result: Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based methods by up to 3.6% on some tasks, remains competitive on others, and maintains inference throughput.
Conclusion: LoGo is an efficient, practical framework for merging LoRAs across tasks. It improves cross-task performance without additional training and scales well.
Abstract: Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters, to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks up to a margin of 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
[82] TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine
Zihao Cheng,Yuheng Lu,Huaiqian Ye,Zeming Liu,Minqi Wang,Jingjing Liu,Zihan Li,Wei Fan,Yuanfang Guo,Ruiji Fu,Shifeng She,Gang Wang,Yunhong Wang
Main category: cs.CL
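TL;DR: Introduces TCM-Eval, the first dynamic and extensible benchmark for Traditional Chinese Medicine (TCM), builds a large-scale training corpus, proposes Self-Iterative Chain-of-Thought Enhancement (SI-CoTE), and develops ZhiMingTang (ZMT), a TCM LLM that significantly exceeds the passing threshold for human practitioners.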
Details
Motivation: The absence of standardized benchmarks and the scarcity of high-quality training data severely limit the use of large language models in TCM, so evaluation and data resources tailored to TCM are needed.
Method: Builds the TCM-Eval benchmark from national medical licensing examination questions, validated by TCM experts. Constructs a large-scale training corpus and proposes SI-CoTE, which uses rejection sampling to iteratively enrich question-answer pairs with validated reasoning chains so that data and model improve together.
Result: Develops the ZhiMingTang (ZMT) model, which performs strongly on TCM-Eval and significantly exceeds the passing threshold for human practitioners. A public leaderboard is released to support community progress.
Conclusion: The work provides an evaluation standard, high-quality data, and a strong model for TCM-oriented LLM development, advancing the integration of AI with traditional medicine.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in modern medicine, yet their application in Traditional Chinese Medicine (TCM) remains severely limited by the absence of standardized benchmarks and the scarcity of high-quality training data. To address these challenges, we introduce TCM-Eval, the first dynamic and extensible benchmark for TCM, meticulously curated from national medical licensing examinations and validated by TCM experts. Furthermore, we construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich question-answer pairs with validated reasoning chains through rejection sampling, establishing a virtuous cycle of data and model co-evolution. Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM, which significantly exceeds the passing threshold for human practitioners. To encourage future research and development, we release a public leaderboard, fostering community engagement and continuous improvement.
[83] Categorical Emotions or Appraisals - Which Emotion Model Explains Argument Convincingness Better?
Lynn Greschner,Meike Bauer,Sabine Weber,Roman Klinger
Main category: cs.CL
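TL;DR: Examines the role of pathos (emotion), one of the three rhetorical appeals, in argument convincingness. The paper argues that emotion analysis based on appraisal theories predicts subjective convincingness better than categorical emotions, and supports this with zero-shot experiments.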
Details
Motivation: Prior work focuses on the category and intensity of emotions in arguments but overlooks how subjective emotional responses are. The authors argue that a recipient's cognitive evaluation (for example of importance and impact) shapes the emotional response, so a finer-grained framework such as appraisal theory is needed to model this subjectivity.
Method: Using the emotion and convincingness annotations in the recently published ContArgA corpus, zero-shot prompting experiments compare how well categorical emotions and appraisals predict subjective convincingness.
Result: Categorical emotion information improves convincingness prediction, but adding appraisals yields a more pronounced improvement.
Conclusion: Appraisal theories suit emotion analysis for argument convincingness better than categorical emotion models, offering a new theoretical perspective and practical direction for computational argumentation.
Abstract: The convincingness of an argument does not only depend on its structure (logos), the person who makes the argument (ethos), but also on the emotion that it causes in the recipient (pathos). While the overall intensity and categorical values of emotions in arguments have received considerable attention in the research community, we argue that the emotion an argument evokes in a recipient is subjective. It depends on the recipient's goals, standards, prior knowledge, and stance. Appraisal theories lend themselves as a link between the subjective cognitive assessment of events and emotions. They have been used in event-centric emotion analysis, but their suitability for assessing argument convincingness remains unexplored. In this paper, we evaluate whether appraisal theories are suitable for emotion analysis in arguments by considering subjective cognitive evaluations of the importance and impact of an argument on its receiver. Based on the annotations in the recently published ContArgA corpus, we perform zero-shot prompting experiments to evaluate the importance of gold-annotated and predicted emotions and appraisals for the assessment of the subjective convincingness labels. We find that, while categorical emotion information does improve convincingness prediction, the improvement is more pronounced with appraisals.
This work presents the first systematic comparison between emotion models for convincingness prediction, demonstrating the advantage of appraisals, providing insights for theoretical and practical applications in computational argumentation.
[84] AdaRec: Adaptive Recommendation with LLMs via Narrative Profiling and Dual-Channel Reasoning
Meiyun Wang,Charin Polpanumas
Main category: cs.CL
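TL;DR: AdaRec is an LLM-based few-shot in-context learning framework that delivers adaptive personalized recommendation through narrative user profiling and a dual-channel architecture, significantly outperforming existing methods in both few-shot and zero-shot settings.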
Details
Motivation: Existing recommender systems depend on manual feature engineering, adapt poorly to new tasks, and lack a deep causal understanding of user behavior, performing especially badly when interaction data is scarce. Method: Propose AdaRec, which uses narrative profiling to turn user-item interactions into natural language representations, and a dual-channel architecture that combines horizontal behavioral alignment (discovering peer patterns) with vertical causal attribution (identifying the factors that determine preferences), enabling unified task handling and rapid cross-task adaptation. Result: On real e-commerce datasets, AdaRec outperforms machine learning and LLM baselines by up to 8% in few-shot settings and improves on expert-crafted profiling by up to 19% in zero-shot scenarios; lightweight fine-tuning on the synthetic data it generates matches fully fine-tuned models. Conclusion: Through semantic representations and dual-channel reasoning, AdaRec improves the readability, adaptability, and long-tail personalization of recommender systems while reducing reliance on manual features and large amounts of labeled data. Abstract: We propose AdaRec, a few-shot in-context learning framework that leverages large language models for adaptive personalized recommendation. AdaRec introduces narrative profiling, transforming user-item interactions into natural language representations to enable unified task handling and enhance human readability. Centered on a bivariate reasoning paradigm, AdaRec employs a dual-channel architecture that integrates horizontal behavioral alignment, discovering peer-driven patterns, with vertical causal attribution, highlighting decisive factors behind user preferences. Unlike existing LLM-based approaches, AdaRec eliminates manual feature engineering through semantic representations and supports rapid cross-task adaptation with minimal supervision. Experiments on real e-commerce datasets demonstrate that AdaRec outperforms both machine learning models and LLM-based baselines by up to eight percent in few-shot settings. In zero-shot scenarios, it achieves up to a nineteen percent improvement over expert-crafted profiling, showing effectiveness for long-tail personalization with minimal interaction data. Furthermore, lightweight fine-tuning on synthetic data generated by AdaRec matches the performance of fully fine-tuned models, highlighting its efficiency and generalization across diverse tasks.
[85] EMODIS: A Benchmark for Context-Dependent Emoji Disambiguation in Large Language Models
Jiacheng Huang,Ning Yu,Xiaoyin Yi
Main category: cs.CL
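TL;DR: EMODIS is a new benchmark that evaluates how well large language models resolve ambiguous emoji under minimal but contrastive textual contexts, exposing the limits of their understanding when contextual cues are subtle.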
Details
Motivation: Study how well large language models resolve context-dependent ambiguity in real-world communication, in particular their limited semantic understanding of emoji use. Method: Build the EMODIS benchmark, in which each instance contains a sentence with an ambiguous emoji, two disambiguating contexts that lead to different interpretations, and a specific question requiring contextual reasoning; evaluate both open-source and API-based LLMs. Result: Even the strongest models often fail to distinguish meanings when only subtle contextual cues are present, showing a systematic preference for dominant interpretations and limited sensitivity to pragmatic contrast. Conclusion: EMODIS provides a rigorous testbed and highlights the gap in semantic reasoning between current LLMs and humans. Abstract: Large language models (LLMs) are increasingly deployed in real-world communication settings, yet their ability to resolve context-dependent ambiguity remains underexplored. In this work, we present EMODIS, a new benchmark for evaluating LLMs' capacity to interpret ambiguous emoji expressions under minimal but contrastive textual contexts. Each instance in EMODIS comprises an ambiguous sentence containing an emoji, two distinct disambiguating contexts that lead to divergent interpretations, and a specific question that requires contextual reasoning. We evaluate both open-source and API-based LLMs, and find that even the strongest models frequently fail to distinguish meanings when only subtle contextual cues are present. Further analysis reveals systematic biases toward dominant interpretations and limited sensitivity to pragmatic contrast. EMODIS provides a rigorous testbed for assessing contextual disambiguation, and highlights the gap in semantic reasoning between humans and LLMs.
[86] Discourse Graph Guided Document Translation with Large Language Models
Viet-Thanh Pham,Minghan Wang,Hao-Han Liao,Thuy-Trang Vu
Main category: cs.CL
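TL;DR: Proposes TransGraph, a framework that explicitly models relationships between text chunks with a structured discourse graph to enable efficient, coherent long-document translation.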
Details
Motivation: Address the difficulty large language models have in capturing long-range dependencies and preserving discourse coherence in long-document translation, while reducing the computational cost and the sensitivity to memory retrieval strategies. Method: Introduce TransGraph, which models semantic relationships between chunks with a discourse graph and conditions the translation of each segment on its relevant graph neighbourhood instead of on full context or sequential processing. Result: On three document-level MT benchmarks covering six languages and multiple domains, TransGraph surpasses strong baselines in translation quality and terminology consistency with significantly lower token overhead. Conclusion: Structured discourse modeling improves long-document translation, keeping output coherent while raising efficiency and lowering resource consumption. Abstract: Adapting large language models to full document translation remains challenging due to the difficulty of capturing long-range dependencies and preserving discourse coherence throughout extended texts. While recent agentic machine translation systems mitigate context window constraints through multi-agent orchestration and persistent memory, they require substantial computational resources and are sensitive to memory retrieval strategies. We introduce TransGraph, a discourse-guided framework that explicitly models inter-chunk relationships through structured discourse graphs and selectively conditions each translation segment on relevant graph neighbourhoods rather than relying on sequential or exhaustive context. Across three document-level MT benchmarks spanning six languages and diverse domains, TransGraph consistently surpasses strong baselines in translation quality and terminology consistency while incurring significantly lower token overhead.
[87] Who Is the Story About? Protagonist Entity Recognition in News
Jorge Gabín,M. Eduardo Ares,Javier Parapar
Main category: cs.CL
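TL;DR: Introduces the Protagonist Entity Recognition (PER) task, which identifies the organizations that play the leading role in a news story, and validates it by comparing large language models (LLMs) against expert annotations. The results indicate that PER is a meaningful extension of narrative-centered information extraction and that LLMs can approximate human judgments of narrative importance at scale.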
Details
Motivation: Traditional Named Entity Recognition (NER) treats all mentioned entities equally and cannot show which entities actually drive a news narrative, which limits the understanding of event salience, influence, and narrative focus; a method for identifying the key entities is therefore needed. Method: Define the Protagonist Entity Recognition (PER) task; use LLMs with NER-guided prompting to label large-scale news data automatically; compare against the annotations of four experts on a gold corpus to measure inter-annotator consistency and human-LLM agreement; and test whether other LLMs can infer the correct protagonists with limited context. Result: The expert annotations are consistent with one another and LLM output agrees well with human judgment; guided prompting lets LLMs produce high-quality PER labels; and other LLMs still infer protagonists fairly accurately with reduced context and no candidate hints, indicating that PER is feasible and meaningful. Conclusion: PER is a feasible and meaningful task that extends traditional NER for narrative understanding; combined with LLMs, it enables efficient, scalable identification of the key organizations in news narratives and richer semantic support for downstream tasks. Abstract: News articles often reference numerous organizations, but traditional Named Entity Recognition (NER) treats all mentions equally, obscuring which entities genuinely drive the narrative. This limits downstream tasks that rely on understanding event salience, influence, or narrative focus. We introduce Protagonist Entity Recognition (PER), a task that identifies the organizations that anchor a news story and shape its main developments. To validate PER, we compare the predictions of Large Language Models (LLMs) against annotations from four expert annotators over a gold corpus, establishing both inter-annotator consistency and human-LLM agreement. Leveraging these findings, we use state-of-the-art LLMs to automatically label large-scale news collections through NER-guided prompting, generating scalable, high-quality supervision. We then evaluate whether other LLMs, given reduced context and without explicit candidate guidance, can still infer the correct protagonists. Our results demonstrate that PER is a feasible and meaningful extension to narrative-centered information extraction, and that guided LLMs can approximate human judgments of narrative importance at scale.
[88] Retriv at BLP-2025 Task 1: A Transformer Ensemble and Multi-Task Learning Approach for Bangla Hate Speech Identification
Sourav Saha,K M Nafi Asib,Mohammed Moshiul Hoque
Main category: cs.CL
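TL;DR: Studies Bangla hate speech identification in the multi-task shared task at the BLP Workshop 2025, applying transformer-based ensembles across all three subtasks and achieving competitive results.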
Details
Motivation: Bangla hate speech identification matters socially, but scarce and complex language resources challenge existing methods, so effective identification systems are needed. Method: For subtasks 1A and 1B, a soft-voting ensemble of BanglaBERT, MuRIL, and IndicBERTv2; for subtask 1C, three multitask models combined through a weighted voting ensemble. Result: Micro-F1 scores of 72.75%, 72.69%, and 72.62% on subtasks 1A, 1B, and 1C, ranking 9th, 10th, and 7th respectively. Conclusion: Transformer ensembles and weighted multitask frameworks show promise for Bangla hate speech detection in low-resource settings; the code is publicly available. Abstract: This paper addresses the problem of Bangla hate speech identification, a socially impactful yet linguistically challenging task. As part of the "Bangla Multi-task Hate Speech Identification" shared task at the BLP Workshop, IJCNLP-AACL 2025, our team "Retriv" participated in all three subtasks: (1A) hate type classification, (1B) target group identification, and (1C) joint detection of type, severity, and target. For subtasks 1A and 1B, we employed a soft-voting ensemble of transformer models (BanglaBERT, MuRIL, IndicBERTv2). For subtask 1C, we trained three multitask variants and aggregated their predictions through a weighted voting ensemble. Our systems achieved micro-F1 scores of 72.75% (1A) and 72.69% (1B), and a weighted micro-F1 score of 72.62% (1C). On the shared task leaderboard, these corresponded to 9th, 10th, and 7th positions, respectively. These results highlight the promise of transformer ensembles and weighted multitask frameworks for advancing Bangla hate speech detection in low-resource contexts. We made experimental scripts publicly available for the community.
[89] ACE-ICD: Acronym Expansion As Data Augmentation For Automated ICD Coding
Tuan-Dung Le,Shohreh Haddadan,Thanh Q. Thieu
Main category: cs.CL
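TL;DR: Proposes a data augmentation technique that uses large language models to expand medical acronyms, combined with consistency training, achieving new state-of-the-art automated ICD coding performance on MIMIC-III.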
Details
Motivation: Existing methods usually ignore the medical acronyms that pervade clinical notes, even though they are key to inferring ICD codes, so a method that handles acronyms effectively is needed. Method: Propose ACE-ICD, which uses large language models to expand medical acronyms into their full forms as data augmentation, and adds consistency training so that predictions on the original and augmented text agree. Result: On MIMIC-III, ACE-ICD achieves state-of-the-art performance across multiple settings, including common codes, rare codes, and full-code assignment. Conclusion: Expanding medical acronyms and adding consistency training significantly improves automated ICD coding accuracy, confirming the importance of handling acronyms. Abstract: Automatic ICD coding, the task of assigning disease and procedure codes to electronic medical records, is crucial for clinical documentation and billing. While existing methods primarily enhance model understanding of code hierarchies and synonyms, they often overlook the pervasive use of medical acronyms in clinical notes, a key factor in ICD code inference. To address this gap, we propose a novel and effective data augmentation technique that leverages large language models to expand medical acronyms, allowing models to be trained on their full form representations. Moreover, we incorporate consistency training to regularize predictions by enforcing agreement between the original and augmented documents. Extensive experiments on the MIMIC-III dataset demonstrate that our approach, ACE-ICD, establishes new state-of-the-art performance across multiple settings, including common codes, rare codes, and full-code assignments. Our code is publicly available.
[90] RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
Zhiyuan Zeng,Hamish Ivison,Yiping Wang,Lifan Yuan,Shuyue Stella Li,Zhuorui Ye,Siting Li,Jacqueline He,Runlong Zhou,Tong Chen,Chenyang Zhao,Yulia Tsvetkov,Simon Shaolei Du,Natasha Jaques,Hao Peng,Pang Wei Koh,Hannaneh Hajishirzi
Main category: cs.CL
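TL;DR: Proposes RLVE, a reinforcement learning approach built on verifiable environments that dynamically adjust problem difficulty to improve language model reasoning; on the large-scale RLVE-Gym suite it yields clearly larger gains than continuing the original training.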
Details
Motivation: Static data distributions in conventional reinforcement learning tend to produce vanishing learning signals when problems are too easy or too hard, making it difficult to improve language model reasoning; a training environment that adapts to the model's ability is needed. Method: Propose the RLVE framework, which uses verifiable environments with algorithmically verifiable rewards and adjusts problem difficulty dynamically during training; build RLVE-Gym, a suite of 400 environments, and train jointly across all of them. Result: Across six reasoning benchmarks, RLVE gives a 3.37% absolute average improvement, whereas continuing the original RL training gives only 0.49% despite using over 3x more compute. Conclusion: Dynamic adaptation and environment scaling significantly improve generalizable reasoning in language models, confirming the key role of verifiable environments and environment scale in RL training. Abstract: We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.
[91] When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs
Shaowen Wang,Yiqi Dong,Ruinian Chang,Tansheng Zhu,Yuebo Sun,Kaifeng Lyu,Jian Li
Main category: cs.CL
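TL;DR: Shows that spurious correlations in training data cause LLM hallucinations that are generated with high confidence, are hard to catch with existing detection methods, and persist after model scaling and refusal fine-tuning.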
Details
Motivation: Despite substantial progress, large language models still hallucinate, producing incorrect but plausible answers. This paper focuses on a previously underexplored type of hallucination caused by spurious statistical associations between features and attributes. Method: Combine systematically controlled synthetic experiments, empirical evaluation of state-of-the-art open-source and proprietary LLMs (including GPT-5), and theoretical analysis to study how spurious correlations affect hallucination and its detection. Result: Existing detection methods such as confidence-based filtering and inner-state probing fundamentally fail under spurious correlations; the theoretical analysis shows that these statistical biases intrinsically undermine confidence-based detection. Conclusion: New detection and mitigation methods designed specifically for hallucinations caused by spurious correlations are needed to overcome the limits of current techniques. Abstract: Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations -- superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.
[92] FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation
Song Jin,Shuqi Li,Shukun Zhang,Rui Yan
Main category: cs.CL
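TL;DR: Formulates the Equity Research Report (ERR) generation task for the first time, builds the open-source benchmark FinRpt with a high-quality dataset and an 11-metric evaluation system, and proposes the multi-agent framework FinRpt-Gen, whose effectiveness for ERR generation is confirmed experimentally.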
Details
Motivation: Although large language models perform well on financial tasks, their use for fully automated Equity Research Report generation is unexplored, and the lack of standard datasets and evaluation metrics holds the field back. Method: Formulate the ERR generation task; design a dataset construction pipeline that integrates 7 financial data types to produce a high-quality ERR dataset automatically; establish an evaluation system with 11 metrics; and build the multi-agent framework FinRpt-Gen, trained with supervised fine-tuning and reinforcement learning. Result: The open-source benchmark FinRpt and the framework FinRpt-Gen are released; experiments indicate high data quality, effective metrics, and strong FinRpt-Gen performance on ERR generation. Conclusion: FinRpt offers a reliable benchmark for ERR generation and FinRpt-Gen shows what LLMs can do on this task, advancing automated financial report generation. Abstract: While LLMs have shown great success in financial tasks like stock prediction and question answering, their application in fully automating Equity Research Report generation remains uncharted territory. In this paper, we formulate the Equity Research Report (ERR) Generation task for the first time. To address the data scarcity and the absence of evaluation metrics, we present an open-source evaluation benchmark for ERR generation - FinRpt. We frame a Dataset Construction Pipeline that integrates 7 financial data types and produces a high-quality ERR dataset automatically, which could be used for model training and evaluation. We also introduce a comprehensive evaluation system including 11 metrics to assess the generated ERRs. Moreover, we propose a multi-agent framework specifically tailored to address this task, named FinRpt-Gen, and train several LLM-based agents on the proposed datasets using Supervised Fine-Tuning and Reinforcement Learning. Experimental results indicate the data quality and metrics effectiveness of the benchmark FinRpt and the strong performance of FinRpt-Gen, showcasing their potential to drive innovation in the ERR generation field. All code and datasets are publicly available.
[93] Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
Pingjie Wang,Hongcheng Liu,Yusheng Liao,Ziqing Fan,Yaxin Du,Shuo Tang,Yanfeng Wang,Yu Wang
Main category: cs.CL
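TL;DR: Proposes NTK-Selector, a neural tangent kernel (NTK) based framework for selecting auxiliary data to improve LLMs in low-resource domains. A Jacobian-free approximation addresses the theoretical and computational obstacles of applying NTK to large models, and fine-tuning results improve substantially in the medical, financial, legal, and psychological domains.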
Details
Motivation: In-domain data in low-resource domains is scarce and prone to overfitting, traditional data selection methods are hard to apply, and abundant general-domain data goes underused; a method is needed that selects valuable auxiliary data efficiently without large validation sets. Method: Propose NTK-Selector, which uses the neural tangent kernel to estimate the effect of general-domain data on domain-specific fine-tuning; show empirically that LLMs exhibit stable NTK-like behavior during LoRA fine-tuning; and design a Jacobian-free approximation to cut computational cost. Result: Across four low-resource domains, adding 9,000 auxiliary samples selected by NTK-Selector to 1,000 in-domain samples improves Llama3-8B-Instruct and Qwen3-8B by 8.7 and 5.1 points, which is 10.9x and 5.7x the gain from in-domain data alone. Conclusion: NTK-Selector offers an efficient and practical way to select auxiliary data for fine-tuning large models in low-resource domains, with substantial performance gains. Abstract: Large language models (LLMs) have achieved remarkable success across widespread tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, there exist vast amounts of similar general-domain data, and our initial findings reveal that they could potentially serve as auxiliary supervision for domain enhancement. This observation leads us to our central research question: how to effectively select the most valuable auxiliary data to maximize domain-specific performance, particularly when traditional methods are inapplicable due to a lack of large in-domain data pools or validation sets. To address this, we propose NTK-Selector, a principled and efficient framework for selecting general-domain auxiliary data to enhance domain-specific performance via neural tangent kernels (NTK). Our method tackles two challenges of directly applying NTK to LLMs, theoretical assumptions and prohibitive computational cost, by empirically demonstrating a stable NTK-like behavior in LLMs during LoRA fine-tuning and proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone only yielded +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching with 9,000 auxiliary samples selected by NTK-Selector led to substantial gains of +8.7 and +5.1 points, which corresponds to a 10.9x and 5.7x improvement over the domain-only setting.
[94] Retriv at BLP-2025 Task 2: Test-Driven Feedback-Guided Framework for Bangla-to-Python Code Generation
K M Nafi Asib,Sourav Saha,Mohammed Moshiul Hoque
Main category: cs.CL
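TL;DR: Proposes combining instruction prompting with test-driven, feedback-guided iterative refinement, using a fine-tuned Qwen2.5-14B model to generate code from Bangla instructions; the system placed 2nd in the BLP 2025 shared task (Pass@1: 0.934).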
Details
Motivation: Low-resource languages such as Bangla lack sufficient instruction-to-code datasets and evaluation benchmarks, which limits code generation performance. Method: Use a fine-tuned Qwen2.5-14B model with instruction prompting and a test-driven, feedback-guided refinement process: the model generates code, the code is checked against unit tests, and failing outputs are revised using test feedback over three evaluation passes. Result: 2nd place in the BLP Workshop 2025 shared task with a Pass@1 score of 0.934; the experiments expose challenges in Bangla instruction understanding and Python code generation. Conclusion: Code generation for low-resource languages needs purpose-built methods; test-feedback-driven iterative refinement improves generation accuracy and may generalize to other settings. Abstract: Large Language Models (LLMs) have advanced the automated generation of code from natural language prompts. However, low-resource languages (LRLs) like Bangla remain underrepresented due to the limited availability of instruction-to-code datasets and evaluation benchmarks. To address this, the BLP Workshop at IJCNLP-AACL 2025 introduced a shared task on "Code Generation in Bangla". In this work, we propose a method that combines instruction prompting with a test-driven, feedback-guided iterative refinement process using a fine-tuned Qwen2.5-14B model. The model generates code from Bangla instructions, tests it against unit tests, and iteratively refines any failing outputs through three evaluation passes, using test feedback to guide each step. This approach helped our team "Retriv" to secure 2nd place in the shared task with a Pass@1 score of 0.934. The analysis highlights challenges in Bangla instruction understanding and Python code generation, emphasizing the need for targeted methods in LRLs. We made experimental scripts publicly available for the community.
[95] Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence
Sean McLeish,Ang Li,John Kirchenbauer,Dayal Singh Kalra,Brian R. Bartoldson,Bhavya Kailkhura,Avi Schwarzschild,Jonas Geiping,Tom Goldstein,Micah Goldblum
Main category: cs.CL
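TL;DR: Converts pretrained non-recurrent language models into depth-recurrent models, using a recurrence curriculum to increase effective depth; this preserves performance at lower compute cost and outperforms directly post-training the original model on mathematics tasks.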
Details
Motivation: Study how to convert existing pretrained non-recurrent language models into depth-recurrent models, so that train-time compute and parameter count are decoupled from test-time compute. Method: Apply a curriculum of recurrences that gradually increases the number of recurrent iterations during training, raising the model's effective depth. Result: On mathematics tasks, the converted recurrent models outperform the directly post-trained non-recurrent originals at the same compute budget. Conclusion: Converting pretrained models to a recurrent structure uses compute more efficiently and improves performance. Abstract: Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments on mathematics, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.
[96] Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction
Hyeryun Park,Byung Mo Gu,Jun Hee Lee,Byeong Hyeon Choi,Sekeun Kim,Hyun Koo Kim,Kyungsang Kim
Main category: cs.CL
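TL;DR: Proposes a voice-directed, LLM-based Surgical Agent Orchestrator Platform (SAOP) that lets surgeons access and manipulate multimodal patient data by voice during da Vinci robotic surgery, improving surgical flow and safety.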
Details
Motivation: In da Vinci robotic surgery, the surgeon's hands and eyes are fully occupied, making it hard to access and manipulate multimodal patient data without interrupting the procedure; a hands-free, efficient mode of interaction is needed. Method: Build the Surgical Agent Orchestrator Platform (SAOP) on a hierarchical multi-agent framework with one orchestration agent and three task-specific agents, all driven by large language models; voice commands trigger tasks such as clinical information retrieval, CT scan manipulation, and 3D anatomical model navigation; a Multi-level Orchestration Evaluation Metric (MOEM) assesses performance at the command and category levels. Result: SAOP achieves high accuracy and success rates on 240 voice commands; the LLM-driven agents cope with speech recognition errors and with diverse or ambiguous free-form commands, showing good robustness. Conclusion: SAOP supports voice interaction in da Vinci robotic surgery and reduces interruptions, with strong potential for clinical use. Abstract: In da Vinci robotic surgery, surgeons' hands and eyes are fully engaged in the procedure, making it difficult to access and manipulate multimodal patient data without interruption. We propose a voice-directed Surgical Agent Orchestrator Platform (SAOP) built on a hierarchical multi-agent framework, consisting of an orchestration agent and three task-specific agents driven by Large Language Models (LLMs). These LLM-based agents autonomously plan, refine, validate, and reason to map voice commands into specific tasks such as retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models on the surgical video. We also introduce a Multi-level Orchestration Evaluation Metric (MOEM) to comprehensively assess the performance and robustness from command-level and category-level perspectives. The SAOP achieves high accuracy and success rates across 240 voice commands, while LLM-based agents improve robustness against speech recognition errors and diverse or ambiguous free-form commands, demonstrating strong potential to support minimally invasive da Vinci robotic surgery.
[97] ConvFill: Model Collaboration for Responsive Conversational Voice Agents
Vidya Srinivas,Zachary Englhardt,Maximus Powers,Shwetak Patel,Vikram Iyer
Main category: cs.CL
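TL;DR: Proposes conversational infill, in which a lightweight on-device model incorporates streaming knowledge from a large cloud model to provide low-latency, context-aware voice conversation, and presents ConvFill, which clearly improves accuracy while keeping response latency under 200 ms.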
Details
Motivation: Cloud-hosted large language models reason well, but their latency disrupts natural conversation, while on-device models respond quickly but lack sophistication; a solution that offers both responsiveness and capability is needed. Method: Define the conversational infill task, in which a lightweight on-device model generates dialogue while incorporating streaming knowledge from a backend large model in real time, decoupling response latency from model capability; train ConvFill, a 360M-parameter model, on synthetic multi-domain conversations. Result: Conversational infill proves learnable across multiple backend models; ConvFill improves accuracy by 36-42% over standalone small models of the same size while consistently keeping response latency under 200 ms. Conclusion: The approach is a promising route to on-device conversational agents that respond immediately and remain knowledgeable, supporting efficient voice interaction in practice. Abstract: Deploying conversational voice agents with large language models faces a critical challenge: cloud-based foundation models provide deep reasoning and domain knowledge but introduce latency that disrupts natural conversation, while on-device models respond immediately but lack sophistication. We propose conversational infill, a task where a lightweight on-device model generates contextually appropriate dialogue while seamlessly incorporating streaming knowledge from a powerful backend model. This approach decouples response latency from model capability, enabling systems that feel responsive while accessing the full power of large-scale models. We present ConvFill, a 360M parameter model trained on synthetic multi-domain conversations. Evaluation across multiple backend models shows that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of 36-42% over standalone small models of the same size while consistently retaining sub-200ms response latencies. Our results demonstrate the promise of this approach for building on-device conversational agents that are both immediately responsive and knowledgeable.
[98] SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
Manon Berriche,Célia Nouri,Chloé Clavel,Jean-Philippe Cointet
Main category: cs.CL
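TL;DR: Introduces SPOT, the first annotated corpus to turn the sociological concept of a stopping point into a reproducible NLP task: identifying critical comments that pause or redirect online discussions through irony, subtle doubt, and similar forms. Experiments show that fine-tuned encoder models outperform prompted large language models.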
Details
Motivation: Capture stopping points, the subtle but critical interventions that pause or redirect online discussions and that frameworks such as counterspeech or social correction often overlook, and advance research on non-English social media content. Method: Build an annotated corpus of 43,305 French Facebook comments; define a binary classification task; fine-tune encoder models such as CamemBERT; compare them with instruction-tuned LLMs under different prompting strategies; and add contextual metadata to improve performance. Result: Fine-tuned encoders beat prompted LLMs by more than 10 percentage points in F1; adding contextual metadata raises encoder F1 from 0.75 to 0.78. Conclusion: Supervised learning remains essential for emerging non-English social media tasks, and context helps; the authors release the data, annotation guidelines, and code to support reproducible research. Abstract: We introduce SPOT (Stopping Points in Online Threads), the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. Stopping points are ordinary critical interventions that pause or redirect online discussions through a range of forms (irony, subtle doubt or fragmentary arguments) that frameworks like counterspeech or social correction often overlook. We operationalize this concept as a binary classification task and provide reliable annotation guidelines. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users, enriched with contextual metadata (article, post, parent comment, page or group, and source). We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies. Results show that fine-tuned encoders outperform prompted LLMs in F1 score by more than 10 percentage points, confirming the importance of supervised learning for emerging non-English social media tasks. Incorporating contextual metadata further improves encoder models' F1 scores from 0.75 to 0.78. We release the anonymized dataset, along with the annotation guidelines and code in our code repository, to foster transparency and reproducible research.
cs.CV [Back]
[99] Randomized-MLP Regularization Improves Domain Adaptation and Interpretability in DINOv2
Joel Valdivia Ortega,Lorenz Lamm,Franziska Eckardt,Benedikt Schworm,Marion Jasnin,Tingying Peng
Main category: cs.CV
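TL;DR: Proposes RMLP regularization, a contrastive learning-based method that improves the interpretability and performance of Vision Transformers on medical and natural images.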
Details
Motivation: Vision Transformers perform well across domains, but in medical imaging domain shift degrades both performance and interpretability, notably through low-information patch tokens in the attention mechanism. Method: Introduce Randomized-MLP (RMLP) regularization, a contrastive learning-based method applied when fine-tuning DINOv2, to encourage more semantically aligned representations. Result: Across several image modalities, RMLP maintains or improves downstream performance while producing more interpretable attention maps; a mathematical analysis of RMLP is also provided. Conclusion: RMLP regularization improves the interpretability and robustness of ViT models and deepens the understanding of contrastive learning in vision transformers. Abstract: Vision Transformers (ViTs), such as DINOv2, achieve strong performance across domains but often repurpose low-informative patch tokens in ways that reduce the interpretability of attention and feature maps. This challenge is especially evident in medical imaging, where domain shifts can degrade both performance and transparency. In this paper, we introduce Randomized-MLP (RMLP) regularization, a contrastive learning-based method that encourages more semantically aligned representations. We use RMLPs when fine-tuning DINOv2 on both medical and natural image modalities, showing that it improves or maintains downstream performance while producing more interpretable attention maps. We also provide a mathematical analysis of RMLPs, offering insights into their role in enhancing ViT-based models and advancing our understanding of contrastive learning.
[100] Token Is All You Need: Cognitive Planning through Sparse Intent Alignment
Shiyao Sang
Main category: cs.CV
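TL;DR: Proposes an end-to-end autonomous driving method based on sparse semantic tokens that plans effectively without full scene reconstruction and outperforms conventional world-model approaches on the nuPlan benchmark.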
Details
Motivation: Challenge the conventional assumption that autonomous driving requires exhaustive scene modeling, and explore a more efficient, cognitively inspired way of planning. Method: Use perception-informed BEV representations to extract semantically rich sparse tokens for trajectory decoding, add future token prediction, and drop the explicit reconstruction loss. Result: 0.479 m ADE on nuPlan, a 12.6% improvement over the baseline; the model also shows "temporal fuzziness", adaptively attending to task-relevant semantics. Conclusion: The "token is all you need" paradigm can replace conventional world modeling and lays a foundation for driving systems that plan through imagination instead of reaction. Abstract: We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Unlike world-model approaches that rely on computationally intensive future scene generation or vision-language-action (VLA) systems constrained by Markov assumptions, we show that a minimal set of semantically rich tokens is sufficient for effective planning. Experiments on the nuPlan benchmark (720 scenarios, over 11,000 samples) using perception-informed BEV representations yield three key findings: (1) even without future prediction, our sparse representation achieves 0.548 m ADE, comparable to or surpassing prior methods reporting around 0.75 m on nuScenes; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.479 m, a 12.6% improvement over current-state baselines; and (3) explicit reconstruction loss offers no benefit and may degrade performance under reliable perception inputs. Notably, we observe the emergence of temporal fuzziness, where the model adaptively attends to task-relevant semantics rather than aligning rigidly to fixed timestamps, providing a cognitive advantage for planning under uncertainty. Our "token is all you need" principle marks a paradigm shift from reconstructing the world to understanding it, laying a foundation for cognitively inspired systems that plan through imagination rather than reaction.
[101] Automated Invoice Data Extraction: Using LLM and OCR
Advait Thakur,Khushi Khanchandani,Akshita Shetty,Chaitravi Reddy,Ritisa Behera
Main category: cs.CV
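TL;DR: Proposes an integrated AI platform combining OCR, deep learning, large language models, and graph analytics to overcome the recognition difficulties that traditional OCR faces with varied invoice layouts, handwriting, and low-quality scans, improving the accuracy and consistency of information extraction.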
Details
Motivation: Traditional OCR systems depend on templates and struggle with varied invoice layouts, handwritten text, and low-quality scans, so a more flexible and accurate extraction method is needed. Method: Combine OCR, convolutional neural networks (CNNs), Transformers, large language models (LLMs), and graph analytics in a hybrid-architecture AI platform that joins visual named entity recognition (Visual NER) with semantic understanding. Result: The platform is reported to extract information with higher accuracy and contextual sensitivity across document types than traditional methods, with high scalability and little need for human intervention. Conclusion: By fusing several techniques, the platform aims for higher quality and consistency in extracting information from complex documents such as invoices, extending current best practice. Abstract: Conventional Optical Character Recognition (OCR) systems are challenged by variant invoice layouts, handwritten text, and low-quality scans, which are often caused by strong template dependencies that restrict their flexibility across different document structures and layouts. Newer solutions utilize advanced deep learning models such as Convolutional Neural Networks (CNN) as well as Transformers, and domain-specific models for better layout analysis and accuracy across various sections over varied document types. Large Language Models (LLMs) have revolutionized extraction pipelines at their core with sophisticated entity recognition and semantic comprehension to support complex contextual relationship mapping without direct programming specification. Visual Named Entity Recognition (NER) capabilities permit extraction from invoice images with greater contextual sensitivity and much higher accuracy rates than older approaches. Existing industry best practices utilize hybrid architectures that blend OCR technology and LLM for maximum scalability and minimal human intervention. This work introduces a holistic Artificial Intelligence (AI) platform combining OCR, deep learning, LLMs, and graph analytics to achieve unprecedented extraction quality and consistency.
[102] MCFCN: Multi-View Clustering via a Fusion-Consensus Graph Convolutional Network
Chenping Pei,Fadi Dornaika,Jingjun Bi
Main category: cs.CV
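TL;DR: Proposes a multi-view clustering method based on a Fusion-Consensus Graph Convolutional Network (MCFCN), which learns a consensus graph end to end and refines view-specific graphs with a Unified Graph Structure Adapter to improve clustering performance.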
Details
Motivation: Existing MVC methods overlook the topological structure of the data, the input graphs of GNNs are susceptible to noise, and MGRC methods fall short on cross-view consistency, handling of hard-to-distinguish samples, and optimization of graph construction. Method: Designs the MCFCN framework, which combines a view feature fusion model with a Unified Graph Structure Adapter (UGA), introduces a Similarity Matrix Alignment Loss (SMAL) and a Feature Representation Alignment Loss (FRAL), and uses a GCN for consensus representation learning and graph optimization. Result: Achieves SOTA performance on eight benchmark datasets, with effectiveness verified by extensive qualitative and quantitative experiments. Conclusion: MCFCN effectively improves multi-view clustering, balancing cross-view consistency with graph structure optimization, and shows strong robustness and application potential. Abstract: Existing Multi-view Clustering (MVC) methods based on subspace learning focus on consensus representation learning while neglecting the inherent topological structure of data. Despite the integration of Graph Neural Networks (GNNs) into MVC, their input graph structures remain susceptible to noise interference. Methods based on Multi-view Graph Refinement (MGRC) also have limitations such as insufficient consideration of cross-view consistency, difficulty in handling hard-to-distinguish samples in the feature space, and disjointed optimization processes caused by graph construction algorithms. To address these issues, a Multi-View Clustering method via a Fusion-Consensus Graph Convolutional Network (MCFCN) is proposed. The network learns the consensus graph of multi-view data in an end-to-end manner and learns effective consensus representations through a view feature fusion model and a Unified Graph Structure Adapter (UGA). It designs Similarity Matrix Alignment Loss (SMAL) and Feature Representation Alignment Loss (FRAL). With the guidance of consensus, it optimizes view-specific graphs, preserves cross-view topological consistency, promotes the construction of intra-class edges, and realizes effective consensus representation learning with the help of GCN to improve clustering performance. MCFCN demonstrates state-of-the-art performance on eight multi-view benchmark datasets, and its effectiveness is verified by extensive qualitative and quantitative implementations. The code will be provided at https://github.com/texttao/MCFCN.
[103] Compressing Multi-Task Model for Autonomous Driving via Pruning and Knowledge Distillation
Jiayuan Wang,Q. M. Jonathan Wu,Ning Zhang,Katsuya Suto,Lei Zhong
Main category: cs.CV
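TL;DR: Proposes a multi-task model compression framework that combines task-aware safe pruning with feature-level knowledge distillation for panoptic perception in autonomous driving, substantially reducing parameter count while maintaining high performance.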
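As a rough illustration of the safe-pruning idea in this entry, the sketch below scores channels with a first-order Taylor term and subtracts a penalty when task gradients for a channel point in opposite directions. It is a minimal sketch under stated assumptions, not the authors' implementation: the penalty form (negative cosine similarity between task gradients), the weight lam, and all function names are hypothetical, and inputs are simplified to per-sample, per-channel values.

```python
import math

def taylor_importance(acts, grads):
    """One common first-order Taylor score per channel: |mean(activation * gradient)|."""
    n, c = len(acts), len(acts[0])
    return [abs(sum(acts[s][ch] * grads[s][ch] for s in range(n)) / n) for ch in range(c)]

def conflict_penalty(grads_by_task):
    """Hypothetical penalty: per channel, sum of negative cosine similarities between task gradients."""
    n, c = len(grads_by_task[0]), len(grads_by_task[0][0])
    penalty = [0.0] * c
    for ch in range(c):
        cols = [[g[s][ch] for s in range(n)] for g in grads_by_task]
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                dot = sum(a * b for a, b in zip(cols[i], cols[j]))
                norm = math.hypot(*cols[i]) * math.hypot(*cols[j])
                if norm > 0:
                    penalty[ch] += max(0.0, -dot / norm)  # only opposing gradients count
    return penalty

def channels_to_keep(acts, grads_by_task, keep_count, lam=1.0):
    """Keep the keep_count channels with the highest penalized score."""
    c = len(acts[0])
    score = [0.0] * c
    for grads in grads_by_task:
        for ch, value in enumerate(taylor_importance(acts, grads)):
            score[ch] += value
    mean_score = sum(score) / c
    penalty = conflict_penalty(grads_by_task)
    score = [s - lam * mean_score * p for s, p in zip(score, penalty)]
    ranked = sorted(range(c), key=lambda ch: score[ch])
    return sorted(ranked[-keep_count:])

# Toy data: 2 samples, 3 channels, 2 tasks. Channel 2 has the largest raw
# importance, but the two tasks pull it in opposite directions.
acts = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
grads_task_a = [[0.5, 0.1, 1.0], [0.5, 0.1, 1.0]]
grads_task_b = [[0.5, 0.1, -1.0], [0.5, 0.1, -1.0]]
kept = channels_to_keep(acts, [grads_task_a, grads_task_b], keep_count=2, lam=2.0)
print(kept)  # [0, 1]
```

In this toy case the conflicting channel is pruned even though its unpenalized Taylor score is the highest, which is the behaviour a conflict-aware criterion is meant to produce.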
Details
Motivation: Multi-task learning is effective, but the growth in parameters and complexity makes models hard to deploy on on-board devices. Method: Combines a safe pruning strategy based on Taylor channel importance and a gradient conflict penalty with a task-head-agnostic knowledge distillation method that transfers the teacher's intermediate features to the student. Result: On BDD100K, parameters are reduced by 32.7% with almost no loss in segmentation performance and a slight drop in detection (Recall -1.2%, mAP50 -1.8%), while the model still runs in real time at 32.7 FPS. Conclusion: Combining pruning with knowledge distillation provides an effective compression solution for multi-task panoptic perception. Abstract: Autonomous driving systems rely on panoptic perception to jointly handle object detection, drivable area segmentation, and lane line segmentation. Although multi-task learning is an effective way to integrate these tasks, its increasing model parameters and complexity make deployment on on-board devices difficult. To address this challenge, we propose a multi-task model compression framework that combines task-aware safe pruning with feature-level knowledge distillation. Our safe pruning strategy integrates Taylor-based channel importance with gradient conflict penalty to keep important channels while removing redundant and conflicting channels. To mitigate performance degradation after pruning, we further design a task head-agnostic distillation method that transfers intermediate backbone and encoder features from a teacher to a student model as guidance. Experiments on the BDD100K dataset demonstrate that our compressed model achieves a 32.7% reduction in parameters while segmentation performance shows negligible accuracy loss and only a minor decrease in detection (-1.2% for Recall and -1.8% for mAP50) compared to the teacher. The compressed model still runs in real time at 32.7 FPS. These results show that combining pruning and knowledge distillation provides an effective compression solution for multi-task panoptic perception.
[104] FilletRec: A Lightweight Graph Neural Network with Intrinsic Features for Automated Fillet Recognition
Jiali Gao,Taoran Liu,Hongfei Ye,Jianjun Chen
Main category: cs.CV
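TL;DR: This paper proposes an end-to-end, data-driven framework for recognizing and simplifying fillet features in CAD models. It builds a large-scale benchmark dataset and introduces FilletRec, a lightweight graph neural network that uses pose-invariant intrinsic geometric features for accurate, well-generalizing, and efficient fillet recognition, and couples it with a geometric simplification algorithm to complete the automated workflow.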
Details
Motivation: Conventional rule-based methods lack robustness, and existing deep learning models perform poorly on complex fillets because of their generic design and insufficient training data, making high accuracy and good generalization hard to achieve. Method: Builds and releases a large-scale, diverse benchmark dataset for fillet recognition; proposes FilletRec, a lightweight graph neural network that uses pose-invariant intrinsic geometric features such as curvature to learn fundamental geometric patterns; and integrates an effective geometric simplification algorithm for an automated workflow from recognition to simplification. Result: Experiments show that FilletRec outperforms state-of-the-art methods in accuracy and generalization while using only 0.2%-5.4% of the parameters of baseline models, demonstrating high model efficiency. Conclusion: With a dedicated dataset and a lightweight model built on intrinsic geometric features, the framework markedly improves the accuracy, generalization, and efficiency of fillet recognition in CAD models and provides a complete automated recognition-and-simplification workflow. Abstract: Automated recognition and simplification of fillet features in CAD models is critical for CAE analysis, yet it remains an open challenge. Traditional rule-based methods lack robustness, while existing deep learning models suffer from poor generalization and low accuracy on complex fillets due to their generic design and inadequate training data. To address these issues, this paper proposes an end-to-end, data-driven framework specifically for fillet features. We first construct and release a large-scale, diverse benchmark dataset for fillet recognition to address the inadequacy of existing data. Based on it, we propose FilletRec, a lightweight graph neural network. The core innovation of this network is its use of pose-invariant intrinsic geometric features, such as curvature, enabling it to learn more fundamental geometric patterns and thereby achieve high-precision recognition of complex geometric topologies. Experiments show that FilletRec surpasses state-of-the-art methods in both accuracy and generalization, while using only 0.2%-5.4% of the parameters of baseline models, demonstrating high model efficiency. Finally, the framework completes the automated workflow from recognition to simplification by integrating an effective geometric simplification algorithm.
[105] Efficient Online Continual Learning in Sensor-Based Human Activity Recognition
Yao Zhang,Souza Leite Clayton,Yu Xiao
Main category: cs.CV
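TL;DR: This paper proposes PTRN-HAR, the first work to successfully apply pre-trained model-based online continual learning (PTM-based OCL) to sensor-based human activity recognition (HAR). The method pre-trains a feature extractor with a contrastive loss on a small amount of data and then freezes it, and replaces the fully connected classification layer with a relation module network, significantly reducing resource consumption and labeled-data requirements while outperforming existing methods on three public datasets.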
Details
Motivation: Existing online continual learning (OCL) methods for sensor-based HAR are computationally expensive and depend on large amounts of labeled data. Pre-trained models perform well in computer vision but are hard to apply here because HAR data are heterogeneous and labels are scarce, so an efficient, low-resource OCL method that needs few labeled samples is required. Method: Proposes PTRN-HAR: (1) pre-train the feature extractor with a contrastive loss on limited data and freeze it during the streaming stage; (2) replace the conventional dense classification layer with a relation module network for more efficient continual learning. Result: Experiments on three public HAR datasets show that PTRN-HAR outperforms state-of-the-art methods while significantly reducing training resource consumption and the amount of labeled data required. Conclusion: PTRN-HAR is the first method to successfully apply PTM-based OCL to sensor-based HAR; by freezing a pre-trained feature extractor and introducing a relation network, it achieves high performance, low resource consumption, and high data efficiency, offering a practical option for continual learning in real deployments. Abstract: Machine learning models for sensor-based human activity recognition (HAR) are expected to adapt post-deployment to recognize new activities and different ways of performing existing ones. To address this need, Online Continual Learning (OCL) mechanisms have been proposed, allowing models to update their knowledge incrementally as new data become available while preserving previously acquired information. However, existing OCL approaches for sensor-based HAR are computationally intensive and require extensive labeled samples to represent new changes. Recently, pre-trained model-based (PTM-based) OCL approaches have shown significant improvements in performance and efficiency for computer vision applications. These methods achieve strong generalization capabilities by pre-training complex models on large datasets, followed by fine-tuning on downstream tasks for continual learning. However, applying PTM-based OCL approaches to sensor-based HAR poses significant challenges due to the inherent heterogeneity of HAR datasets and the scarcity of labeled data in post-deployment scenarios. This paper introduces PTRN-HAR, the first successful application of PTM-based OCL to sensor-based HAR. Unlike prior PTM-based OCL approaches, PTRN-HAR pre-trains the feature extractor using contrastive loss with a limited amount of data. This extractor is then frozen during the streaming stage. Furthermore, it replaces the conventional dense classification layer with a relation module network.
Our design not only significantly reduces the resource consumption required for model training while maintaining high performance, but also improves data efficiency by reducing the amount of labeled data needed for effective continual learning, as demonstrated through experiments on three public datasets, outperforming the state-of-the-art. The code can be found here: https://anonymous.4open.science/r/PTRN-HAR-AF60/
[106] Do Street View Imagery and Public Participation GIS align: Comparative Analysis of Urban Attractiveness
Milad Malekzadeh,Elias Willberg,Jussi Torkko,Silviya Korpilo,Kamyar Hasanzadeh,Olle Järv,Tuuli Toivonen
Main category: cs.CV
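TL;DR: This study compares Street View Imagery (SVI) and Public Participation GIS (PPGIS) as means of capturing perceptions of urban environments and finds only partial agreement between them, indicating that SVI, although scalable, cannot fully replace the experiential richness captured by PPGIS.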
Details
Motivation: As digital tools become more widely used in spatial planning, it is important to understand how different data sources reflect people's subjective experience of urban environments; in particular, the comparability of SVI and PPGIS remains unclear. Method: Using PPGIS survey data from Helsinki and participant-rated street view images, the authors train a machine learning model with semantic image segmentation to predict visual attractiveness, compare the predictions with locations marked as attractive or unattractive in PPGIS, and compute agreement under strict and moderate criteria. Result: Under the moderate criterion, agreement is 67% for attractive and 77% for unattractive places, dropping to 27% and 29% under the strict criterion; the analysis shows that non-visual factors such as noise, traffic, population presence, and land use significantly shape perception and cannot be captured by SVI. Conclusion: SVI can serve as a scalable visual proxy for urban perception but does not adequately reflect experiential dimensions shaped by non-visual factors such as activity levels and environmental stressors, so it cannot fully replace PPGIS; the authors recommend integrating the two methods for a fuller understanding of urban perception. Abstract: As digital tools increasingly shape spatial planning practices, understanding how different data sources reflect human experiences of urban environments is essential. Street View Imagery (SVI) and Public Participation GIS (PPGIS) represent two prominent approaches for capturing place-based perceptions that can support urban planning decisions, yet their comparability remains underexplored. This study investigates the alignment between SVI-based perceived attractiveness and residents' reported experiences gathered via a city-wide PPGIS survey in Helsinki, Finland. Using participant-rated SVI data and semantic image segmentation, we trained a machine learning model to predict perceived attractiveness based on visual features. We compared these predictions to PPGIS-identified locations marked as attractive or unattractive, calculating agreement using two sets of strict and moderate criteria. Our findings reveal only partial alignment between the two datasets. While agreement (with a moderate threshold) reached 67% for attractive and 77% for unattractive places, agreement (with a strict threshold) dropped to 27% and 29%, respectively. By analysing a range of contextual variables, including noise, traffic, population presence, and land use, we found that non-visual cues significantly contributed to mismatches. The model failed to account for experiential dimensions such as activity levels and environmental stressors that shape perceptions but are not visible in images.
These results suggest that while SVI offers a scalable and visual proxy for urban perception, it cannot fully substitute the experiential richness captured through PPGIS. We argue that both methods are valuable but serve different purposes; therefore, a more integrated approach is needed to holistically capture how people perceive urban environments.
[107] C3-Diff: Super-resolving Spatial Transcriptomics via Cross-modal Cross-content Contrastive Diffusion Modelling
Xiaofei Wang,Stephen Price,Chao Li
Main category: cs.CV
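TL;DR: This paper proposes C3-Diff, a cross-modal contrastive diffusion framework that enhances spatial transcriptomics (ST) maps under the guidance of histology images, significantly improving ST resolution and performing strongly on several downstream tasks.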
Details
Motivation: Current ST technologies have low resolution and insufficient sequencing sensitivity, which limits in-depth understanding of spatial gene expression; an effective way to fuse histology images with gene expression data is needed to improve ST quality. Method: Proposes the C3-Diff framework, which refines conventional contrastive learning to extract modal-invariant and content-invariant features, performs noise-based information augmentation on the feature hypersphere, and adopts a dynamic cross-modal imputation strategy to mitigate data scarcity. Result: Significantly outperforms existing methods on four public datasets and performs well on downstream tasks including cell type localization, gene expression correlation analysis, and single-cell-level gene expression prediction. Conclusion: C3-Diff effectively improves the resolution and quality of spatial transcriptomics data and advances the use of AI in biomedical research and clinical applications. Abstract: The rapid advancement of spatial transcriptomics (ST), i.e., spatial gene expressions, has made it possible to measure gene expression within original tissue, enabling us to discover molecular mechanisms. However, current ST platforms frequently suffer from low resolution, limiting the in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, it remains a challenge to model the interactions between histology images and gene expressions for effective ST enhancement. This study presents a cross-modal cross-content contrastive diffusion framework, called C3-Diff, for ST enhancement with histology images as guidance. In C3-Diff, we first analyze the deficiency of the traditional contrastive learning paradigm, which is then refined to extract both modal-invariant and content-invariant features of ST maps and histology images. Further, to overcome the problem of low sequencing sensitivity in ST maps, we perform noising-based information augmentation on the surface of the feature unit hypersphere. Finally, we propose a dynamic cross-modal imputation-based training strategy to mitigate ST data scarcity. We tested C3-Diff by benchmarking its performance on four public datasets, where it achieves significant improvements over competing methods. Moreover, we evaluate C3-Diff on downstream tasks of cell type localization, gene expression correlation and single-cell-level gene expression prediction, promoting AI-enhanced biotechnology for biomedical research and clinical applications.
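One way to picture augmentation on the surface of a unit hypersphere is to normalize a feature vector, add small Gaussian noise, and project the result back onto the sphere. The sketch below shows only that generic idea; the noise scale sigma, the function names, and the use of plain Gaussian noise are assumptions for illustration and are not taken from the C3-Diff code.

```python
import math
import random

def normalize(vector):
    """Scale a vector to unit L2 norm so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def noise_on_hypersphere(feature, sigma=0.1, rng=random):
    """Perturb a normalized feature with Gaussian noise, then project back onto the sphere."""
    unit = normalize(feature)
    noisy = [x + rng.gauss(0.0, sigma) for x in unit]
    return normalize(noisy)

rng = random.Random(0)
feature = [rng.gauss(0.0, 1.0) for _ in range(8)]       # stand-in for a learned feature
augmented = noise_on_hypersphere(feature, sigma=0.1, rng=rng)

norm = math.sqrt(sum(x * x for x in augmented))          # stays at 1 up to rounding
cosine = sum(a * b for a, b in zip(normalize(feature), augmented))
print(round(norm, 6), cosine > 0.0)
```

Because the augmented vector is re-normalized, similarity measures based on cosine distance remain well defined, and sigma controls how far the augmented sample drifts from the original direction.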
Codes are available at https://github.com/XiaofeiWang2018/C3-Diff.
[108] Video Text Preservation with Synthetic Text-Rich Videos
Ziyang Liu,Kevin Valencia,Justin Cui
Main category: cs.CV
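TL;DR: This paper proposes a lightweight method that uses synthetic supervision to improve text legibility in text-to-video (T2V) diffusion models: videos are generated from text-rich images and used to fine-tune a pre-trained model, which noticeably improves the clarity of short text and yields structural priors for longer text without any architectural changes.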
Details
Motivation: Existing T2V models struggle to generate clear, coherent text in videos, and existing solutions are computationally expensive and impractical. Method: Generates text-rich images with a text-to-image diffusion model, animates them with a text-agnostic image-to-video model to build synthetic video-prompt pairs, and uses these to fine-tune the pre-trained T2V model Wan2.1 without changing its architecture. Result: Improves short-text legibility and temporal consistency, with early signs of structural priors for longer text. Conclusion: Carefully curated synthetic data and weak supervision are an effective and practical route to better textual fidelity in T2V generation. Abstract: While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to render even short phrases or words correctly, and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improve T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2V) model. These synthetic video-prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improvement in short-text legibility and temporal consistency with emerging structural priors for longer text. These findings suggest that curated synthetic data and weak supervision offer a practical path toward improving textual fidelity in T2V generation.
[109] Elements of Active Continuous Learning and Uncertainty Self-Awareness: a Narrow Implementation for Face and Facial Expression Recognition
Stanislav Selitskiy
Main category: cs.CV
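TL;DR: Proposes a mechanism that emulates self-awareness with a supervising neural network, which detects the prediction uncertainty of an underlying CNN in face and facial expression recognition and triggers an active learning mode that asks for human help when uncertainty is high.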
Details
Motivation: To realize an intelligence-like capacity for reflection, so that machine learning models can assess the trustworthiness of their own predictions and correct themselves, as a step toward artificial general intelligence. Method: A supervising artificial neural network (ANN) monitors the activation patterns of an underlying convolutional neural network (CNN) to identify high uncertainty; the supervising ANN has a memory and its parameters are optimized during training, allowing it to judge the trustworthiness of predictions. Result: The approach detects the prediction uncertainty of the underlying CNN and triggers the active learning mechanism under high uncertainty, so the model can request human intervention. Conclusion: Emulating self-awareness increases the autonomy and reliability of narrow machine learning systems and offers a feasible path toward more adaptive and intelligent systems. Abstract: Reflection on one's thought process and making corrections to it if there exists dissatisfaction in its performance is, perhaps, one of the essential traits of intelligence. However, such high-level abstract concepts mandatory for Artificial General Intelligence can be modelled even at the low level of narrow Machine Learning algorithms. Here, we present the self-awareness mechanism emulation in the form of a supervising artificial neural network (ANN) observing patterns in activations of another underlying ANN in a search for indications of the high uncertainty of the underlying ANN and, therefore, the trustworthiness of its predictions. The underlying ANN is a convolutional neural network (CNN) ensemble employed for face recognition and facial expression tasks. The self-awareness ANN has a memory region where its past performance information is stored, and its learnable parameters are adjusted during the training to optimize the performance. The trustworthiness verdict triggers the active learning mode, giving elements of agency to the machine learning algorithm that asks for human help in high uncertainty and confusion conditions.
[110] DiffSwap++: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping
Weston Bondurant,Arkaprava Sinha,Hieu Le,Srijan Das,Stephanie Schuckers
Main category: cs.CV
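TL;DR: Proposes DiffSwap++, a diffusion-based face-swapping method that incorporates 3D facial latent features to improve identity preservation and geometric consistency, outperforming existing methods on several datasets.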
Details
Motivation: Existing face-swapping methods often produce fine-grained artifacts and preserve identity poorly under challenging poses and expressions, mainly because they do not make sufficient use of 3D facial structure to disentangle identity from pose and expression. Method: Proposes DiffSwap++, which incorporates 3D facial latent features during diffusion model training and conditions the denoising process on identity embeddings and facial landmarks, strengthening geometric consistency and identity-appearance disentanglement. Result: Experiments on CelebA, FFHQ, and CelebV-Text show that DiffSwap++ preserves source identity while retaining target pose and expression better than prior methods; a biometric-style evaluation and a user study further validate the realism of the results. Conclusion: Exploiting 3D facial structure effectively improves identity preservation and generation quality in diffusion-based face swapping, and DiffSwap++ offers a new, effective approach to high-fidelity face swapping. Abstract: Diffusion-based approaches have recently achieved strong results in face swapping, offering improved visual quality over traditional GAN-based methods. However, even state-of-the-art models often suffer from fine-grained artifacts and poor identity preservation, particularly under challenging poses and expressions. A key limitation of existing approaches is their failure to meaningfully leverage 3D facial structure, which is crucial for disentangling identity from pose and expression. In this work, we propose DiffSwap++, a novel diffusion-based face-swapping pipeline that incorporates 3D facial latent features during training. By guiding the generation process with 3D-aware representations, our method enhances geometric consistency and improves the disentanglement of facial identity from appearance attributes. We further design a diffusion architecture that conditions the denoising process on both identity embeddings and facial landmarks, enabling high-fidelity and identity-preserving face swaps. Extensive experiments on CelebA, FFHQ, and CelebV-Text demonstrate that DiffSwap++ outperforms prior methods in preserving source identity while maintaining target pose and expression. Additionally, we introduce a biometric-style evaluation and conduct a user study to further validate the realism and effectiveness of our approach. Code will be made publicly available at https://github.com/WestonBond/DiffSwapPP
[111] Beyond Softmax: Dual-Branch Sigmoid Architecture for Accurate Class Activation Maps
Yoojin Oh,Junhyug Noh
Main category: cs.CV
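TL;DR: Proposes a dual-branch sigmoid head that decouples localization from classification, improving the explanation fidelity and localization performance of CAM-style methods without sacrificing classification accuracy.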
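The additive logit shift named in this entry rests on a basic property of softmax: adding the same constant to every logit leaves the softmax output unchanged, so the logits behind a softmax classifier are only determined up to a shift, while per-class sigmoid outputs do change with such a shift. The short sketch below checks that property numerically; the logit values and the shift are made-up illustration data, and this is not the paper's dual-branch head.

```python
import math

def softmax(logits):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent per-class sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, -1.0, 0.5]
shifted = [z + 3.0 for z in logits]   # same constant added to every logit

softmax_unchanged = all(
    math.isclose(p, q) for p, q in zip(softmax(logits), softmax(shifted))
)
sigmoid_unchanged = all(
    math.isclose(sigmoid(a), sigmoid(b)) for a, b in zip(logits, shifted)
)
print(softmax_unchanged, sigmoid_unchanged)  # True False
```

Since the softmax probabilities cannot distinguish the two sets of logits, an importance map read off the raw logits inherits that arbitrary offset, which is one motivation for reading evidence from sigmoid outputs instead.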
Details
Motivation: Existing CAM methods rely on a final softmax classifier and suffer from additive logit shifts and sign collapse, which bias importance scores and conflate excitatory and inhibitory features. Method: Designs an architecture-agnostic dual-branch sigmoid head: the original classification head is cloned into a parallel sigmoid branch, the original softmax head is frozen, and only the sigmoid branch is fine-tuned with class-balanced binary supervision; at inference, softmax preserves classification accuracy while the sigmoid branch produces class evidence maps that retain the magnitude and sign of contributions. Result: Validated on fine-grained tasks (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages30K), with clear gains in explanation fidelity and Top-1 localization and no drop in classification accuracy. Conclusion: The dual-branch structure overcomes the fundamental distortions introduced by softmax, is compatible with most CAM variants, and delivers more faithful visual explanations at very low overhead. Abstract: Class Activation Mapping (CAM) and its extensions have become indispensable tools for visualizing the evidence behind deep network predictions. However, by relying on a final softmax classifier, these methods suffer from two fundamental distortions: additive logit shifts that arbitrarily bias importance scores, and sign collapse that conflates excitatory and inhibitory features. We propose a simple, architecture-agnostic dual-branch sigmoid head that decouples localization from classification. Given any pretrained model, we clone its classification head into a parallel branch ending in per-class sigmoid outputs, freeze the original softmax head, and fine-tune only the sigmoid branch with class-balanced binary supervision. At inference, softmax retains recognition accuracy, while class evidence maps are generated from the sigmoid branch -- preserving both magnitude and sign of feature contributions. Our method integrates seamlessly with most CAM variants and incurs negligible overhead. Extensive evaluations on fine-grained tasks (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages30K) show improved explanation fidelity and consistent Top-1 Localization gains -- without any drop in classification accuracy. Code is available at https://github.com/finallyupper/beyond-softmax.
[112] Google-MedGemma Based Abnormality Detection in Musculoskeletal radiographs
Soumyajit Maity,Pranjal Kamboj,Sneha Maity,Rajat Singh,Sankhadeep Chatterjee
Main category: cs.CV
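TL;DR: This paper proposes a MedGemma-based framework for abnormality detection in musculoskeletal radiographs that leverages MedGemma's vision encoder and transfer learning capabilities and outperforms conventional methods on the binary classification task.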
Details
Motivation: To improve the accuracy and generalization of abnormality detection in medical imaging and overcome the limitations of conventional autoencoders and convolutional networks in feature engineering and scalability. Method: Uses the MedGemma foundation model with its SigLIP-derived vision encoder to encode preprocessed X-ray images, followed by a lightweight multilayer perceptron for binary classification. Result: In the experimental evaluation the method performs strongly, surpassing conventional convolutional neural network and autoencoder models, with better generalization and efficient domain adaptation. Conclusion: MedGemma-based classification systems can advance clinical radiograph triage and have potential for broad use in automated medical image analysis. Abstract: This paper proposes a MedGemma-based framework for automatic abnormality detection in musculoskeletal radiographs. Departing from conventional autoencoder and neural network pipelines, the proposed method leverages the MedGemma foundation model, incorporating a SigLIP-derived vision encoder pretrained on diverse medical imaging modalities. Preprocessed X-ray images are encoded into high-dimensional embeddings using the MedGemma vision backbone, which are subsequently passed through a lightweight multilayer perceptron for binary classification. Experimental assessment reveals that the MedGemma-driven classifier exhibits strong performance, exceeding conventional convolutional and autoencoder-based metrics. Additionally, the model leverages MedGemma's transfer learning capabilities, enhancing generalization and optimizing feature engineering. The integration of a modern medical foundation model not only enhances representation learning but also facilitates modular training strategies such as selective encoder block unfreezing for efficient domain adaptation. The findings suggest that MedGemma-powered classification systems can advance clinical radiograph triage by providing scalable and accurate abnormality detection, with potential for broader applications in automated medical image analysis. Keywords: Google MedGemma, MURA, Medical Image, Classification.
[113] In-process 3D Deviation Mapping and Defect Monitoring (3D-DM2) in High Production-rate Robotic Additive Manufacturing
Subash Gautam,Alejandro Vargas-Uscategui,Peter King,Hans Lohr,Alireza Bab-Hadiashar,Ivan Cole,Ehsan Asadi
Main category: cs.CV
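TL;DR: This study proposes a real-time monitoring system for high deposition rate robotic additive manufacturing that compares the growing part with a near-net reference model to detect and track shape deviations during manufacturing, improving the consistency of part quality.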
Details
Motivation: Because of process instabilities in current open-loop systems, maintaining shape accuracy remains a key challenge in high deposition rate additive manufacturing, and deviations need to be found and corrected during the build. Method: Develops a real-time monitoring system that acquires and reconstructs the geometry of the growing part and compares it directly with a near-net reference model to detect, segment, and continuously track deviations. Result: The system identifies shape inconsistencies early and segments and tracks deviation regions, providing a basis for real-time intervention and compensation. Conclusion: The proposed real-time monitoring approach helps prevent error accumulation, ensure part quality, and reduce post-processing, moving high deposition rate additive manufacturing toward closed-loop control. Abstract: Additive manufacturing (AM) is an emerging digital manufacturing technology to produce complex and freeform objects through a layer-wise deposition. High deposition rate robotic AM (HDRRAM) processes, such as cold spray additive manufacturing (CSAM), offer significantly increased build speeds by delivering large volumes of material per unit time. However, maintaining shape accuracy remains a critical challenge, particularly due to process instabilities in current open-loop systems. Detecting these deviations as they occur is essential to prevent error propagation, ensure part quality, and minimize post-processing requirements. This study presents a real-time monitoring system to acquire and reconstruct the growing part and directly compares it with a near-net reference model to detect the shape deviation during the manufacturing process. The early identification of shape inconsistencies, followed by segmenting and tracking each deviation region, paves the way for timely intervention and compensation to achieve consistent part quality.
[114] Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation
Ziying Li,Xuequan Lu,Xinkui Zhao,Guanjie Cheng,Shuiguang Deng,Jianwei Yin
Main category: cs.CV
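TL;DR: This paper proposes TraCe, a new text-to-3D generation framework that models generation as an optimal transport trajectory from the current rendering to the target distribution, overcoming the over-saturation and over-smoothing of conventional Score Distillation Sampling (SDS).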
Details
Motivation: Existing optimization-based text-to-3D methods rely on pre-trained text-to-image diffusion models, but the commonly used SDS technique tends to introduce artifacts that degrade generation quality. Method: The authors first establish theoretically that SDS is a simplified instance of the Schrödinger Bridge framework and prove that it uses the framework's reverse process; building on this, they propose Trajectory-Centric Distillation (TraCe), which uses a Schrödinger Bridge to construct a diffusion bridge from the current rendering to the target distribution and trains a LoRA-adapted model on the score dynamics of that trajectory. Result: Experiments show that TraCe surpasses state-of-the-art methods in quality and fidelity and achieves high-quality generation at lower Classifier-free Guidance (CFG) values. Conclusion: By introducing trajectory-centric distillation, TraCe improves the quality and stability of text-to-3D generation and offers a new theoretical perspective and practical framework for future work. Abstract: Recent advancements in optimization-based text-to-3D generation heavily rely on distilling knowledge from pre-trained text-to-image diffusion models using techniques like Score Distillation Sampling (SDS), which often introduce artifacts such as over-saturation and over-smoothing into the generated 3D assets. In this paper, we address this essential problem by formulating the generation process as learning an optimal, direct transport trajectory between the distribution of the current rendering and the desired target distribution, thereby enabling high-quality generation with smaller Classifier-free Guidance (CFG) values. First, we theoretically establish SDS as a simplified instance of the Schrödinger Bridge framework. We prove that SDS employs the reverse process of a Schrödinger Bridge, which, under specific conditions (e.g., a Gaussian noise as one end), collapses to SDS's score function of the pre-trained diffusion model. Based upon this, we introduce Trajectory-Centric Distillation (TraCe), a novel text-to-3D generation framework, which reformulates the mathematically tractable framework of Schrödinger Bridge to explicitly construct a diffusion bridge from the current rendering to its text-conditioned, denoised target, and trains a LoRA-adapted model on this trajectory's score dynamics for robust 3D optimization. Comprehensive experiments demonstrate that TraCe consistently achieves superior quality and fidelity compared with state-of-the-art techniques.
[115] Pose-Aware Multi-Level Motion Parsing for Action Quality Assessment
Shuaikang Zhu,Yang Yang,Chen Sun
Main category: cs.CV
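TL;DR: Proposes a multi-level motion parsing framework based on enhanced spatial-temporal pose features for action quality assessment, achieving state-of-the-art action segmentation and scoring performance on diving datasets.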
Details
Motivation: In high-level competitions, subtle spatial-temporal variations in pose are often decisive for scoring, and existing methods struggle to capture these differences. Method: Designs three parsers: an Action-Unit Parser for precise action segmentation and local-global pose representation; a Motion Parser that learns spatial-temporal features to capture pose changes and appearance details; and a Condition Parser that handles factors unrelated to the body (such as water splash in diving). A Weight-Adjust Scoring Module adapts to the differing requirements of different action types. Result: Experiments on large-scale diving datasets show that the framework achieves state-of-the-art performance in both action segmentation and action scoring. Conclusion: The proposed multi-level motion parsing framework improves the accuracy of action quality assessment and is particularly suited to high-level competitive settings that require fine-grained pose analysis. Abstract: Human pose serves as a cornerstone of action quality assessment (AQA), where subtle spatial-temporal variations in pose often distinguish excellence from mediocrity. In high-level competitions, these nuanced differences become decisive factors in scoring. In this paper, we propose a novel multi-level motion parsing framework for AQA based on enhanced spatial-temporal pose features. On the first level, the Action-Unit Parser is designed with the help of pose extraction to achieve precise action segmentation and comprehensive local-global pose representations. On the second level, the Motion Parser uses spatial-temporal feature learning to capture pose changes and appearance details for each action-unit. Meanwhile, some conditions unrelated to the body, such as water splash in diving, also affect action scoring. In this work, we design an additional Condition Parser to offer users more flexibility in their choices. Finally, a Weight-Adjust Scoring Module is introduced to better accommodate the diverse requirements of various action types and the multi-scale nature of action-units. Extensive evaluations on large-scale diving sports datasets demonstrate that our multi-level motion parsing framework achieves state-of-the-art performance in both action segmentation and action scoring tasks.
[116] Personalized Image Editing in Text-to-Image Diffusion Models via Collaborative Direct Preference Optimization
Connor Dunlop,Matthew Zheng,Kavana Venkatesh,Pinar Yanardag
Main category: cs.CV
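TL;DR: Proposes C-DPO, a personalized image editing framework built on collaborative signals, which aligns edits with individual user preferences through a dynamic preference graph and a graph neural network.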
Details
Motivation: Existing text-to-image diffusion models cannot adapt to the aesthetic preferences of individual users and so fall short of personalized editing needs. Method: Builds a dynamic user preference graph, learns user embeddings with a lightweight graph neural network, and integrates them into a new DPO objective that jointly optimizes individual alignment and neighborhood coherence. Result: In user studies and quantitative benchmarks, the method significantly outperforms baselines at producing edits that match user preferences. Conclusion: C-DPO offers an effective solution for personalized image editing in diffusion models, balancing individual preferences with collaborative signals and improving the personalization and quality of edits. Abstract: Text-to-image (T2I) diffusion models have made remarkable strides in generating and editing high-fidelity images from text. Yet, these models remain fundamentally generic, failing to adapt to the nuanced aesthetic preferences of individual users. In this work, we present the first framework for personalized image editing in diffusion models, introducing Collaborative Direct Preference Optimization (C-DPO), a novel method that aligns image edits with user-specific preferences while leveraging collaborative signals from like-minded individuals. Our approach encodes each user as a node in a dynamic preference graph and learns embeddings via a lightweight graph neural network, enabling information sharing across users with overlapping visual tastes. We enhance a diffusion model's editing capabilities by integrating these personalized embeddings into a novel DPO objective, which jointly optimizes for individual alignment and neighborhood coherence. Comprehensive experiments, including user studies and quantitative benchmarks, demonstrate that our method consistently outperforms baselines in generating edits that are aligned with user preferences.
[117] Convolutional Fully-Connected Capsule Network (CFC-CapsNet): A Novel and Fast Capsule Network
Pouya Shiri,Amirali Baniasadi
Main category: cs.CV
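TL;DR: Proposes a new Convolutional Fully-Connected Capsule Network (CFC-CapsNet) that introduces a CFC layer to generate fewer but more powerful capsules, improving accuracy and training and inference speed while reducing the number of parameters.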
Details
Motivation: CapsNet performs poorly on complex datasets, trains slowly, and has many parameters, which limits its practical use. Method: Proposes the CFC layer as a new way to generate capsules and builds the CFC-CapsNet architecture on it. Result: Achieves higher accuracy, faster training and inference, and fewer parameters on CIFAR-10, SVHN, and Fashion-MNIST. Conclusion: CFC-CapsNet addresses the limitations of conventional CapsNet and is a more efficient and practical capsule network architecture. Abstract: A Capsule Network (CapsNet) is a relatively new classifier and one of the possible successors of Convolutional Neural Networks (CNNs). CapsNet maintains the spatial hierarchies between the features and outperforms CNNs at classifying images including overlapping categories. Even though CapsNet works well on small-scale datasets such as MNIST, it fails to achieve a similar level of performance on more complicated datasets and real applications. In addition, CapsNet is slow compared to CNNs when performing the same task and relies on a higher number of parameters. In this work, we introduce Convolutional Fully-Connected Capsule Network (CFC-CapsNet) to address the shortcomings of CapsNet by creating capsules using a different method. We introduce a new layer (CFC layer) as an alternative solution to creating capsules. CFC-CapsNet produces fewer, yet more powerful capsules resulting in higher network accuracy. Our experiments show that, compared to conventional CapsNet, CFC-CapsNet achieves competitive accuracy and faster training and inference, and uses fewer parameters, on the CIFAR-10, SVHN and Fashion-MNIST datasets.
[118] Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition
Nicholas Babey,Tiffany Gu,Yiheng Li,Cristian Meo,Kevin Zhu
Main category: cs.CV
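TL;DR: Proposes an action recognition model that fuses V-JEPA 2's predictive world dynamics with CoMotion's occlusion-tolerant human pose data, improving performance in complex, occluded scenes.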
Details
Motivation: Existing RGB-video action recognition models rely on superficial statistical correlations and struggle to capture human actions and physical interaction dynamics in complex scenes, performing poorly under occlusion in particular. Method: Fuses V-JEPA 2's contextual, predictive world-dynamics representation with the explicit, occlusion-tolerant human pose data provided by CoMotion to build an action recognition model grounded in physical space. Result: The model is validated on the InHARD and UCF-19-Y-OCC benchmarks and outperforms three baselines, especially in high-occlusion scenes. Conclusion: Action recognition should rest on an understanding of spatial and physical relationships rather than on statistical pattern recognition alone; spatial awareness is essential for action understanding in complex scenes. Abstract: For embodied agents to effectively understand and interact within the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, often relying on RGB video, learn superficial correlations between patterns and action labels, so they struggle to capture underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2's contextual, predictive world dynamics and CoMotion's explicit, occlusion-tolerant human pose data. Our model is validated on both the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. Our model outperforms three other baselines, especially within complex, occlusive scenes. Our findings emphasize a need for action recognition to be supported by spatial understanding instead of statistical pattern recognition.
[119] Registration-Free Monitoring of Unstructured Point Cloud Data via Intrinsic Geometrical Properties
Mariafrancesca Patalano,Giovanna Capizzi,Kamran Paynabar
Main category: cs.CV
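TL;DR: Proposes a new method for monitoring point cloud data that needs neither registration nor mesh reconstruction. It extracts intrinsic geometric features of the shape via the Laplacian and geodesic distances, and uses thresholding to select the features most sensitive to out-of-control conditions. Experiments show it identifies several types of defects.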
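To make the "intrinsic geometry" idea concrete, here is a minimal sketch of why geodesic distances need no registration. It is not the authors' feature learning method: it assumes a plain k-nearest-neighbour graph with Dijkstra shortest paths as a stand-in for geodesic distance, and the helper names (`knn_graph`, `geodesic_from`, `rotate_z`) and the toy half-circle shape are hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (symmetric, Euclidean edge weights)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_from(graph, source):
    """Dijkstra shortest-path lengths from `source`, used as approximate geodesic distances."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def rotate_z(points, angle):
    """Rotate 3D points about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


# Toy shape (assumption): 21 points sampled along a half circle of radius 1.
arc = [(math.cos(math.pi * i / 20), math.sin(math.pi * i / 20), 0.0) for i in range(21)]
# The same shape after an arbitrary rigid motion, i.e. an unregistered scan.
moved = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in rotate_z(arc, 0.7)]

d_orig = geodesic_from(knn_graph(arc, 2), 0)
d_moved = geodesic_from(knn_graph(moved, 2), 0)
```

The distance between the two ends of the arc comes out close to the arc length rather than the straight-line distance of 2, and it is the same for the original and the moved cloud. Features built from such distances can therefore be compared across scans without aligning them first.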
Details
Motivation: Conventional point cloud analysis requires time-consuming, error-prone preprocessing steps such as registration and mesh reconstruction, which may introduce artifacts and affect monitoring outcomes, so a more robust and efficient registration-free monitoring method is needed. Method: Proposes two feature learning methods based on intrinsic geometric properties (the Laplacian and geodesic distances), combined with thresholding to select the most indicative features, forming a common monitoring scheme that needs neither registration nor mesh reconstruction. Result: Numerical experiments and case studies show that the approach identifies different types of defects effectively. Conclusion: The registration-free approach can monitor the geometric accuracy of point clouds of complex shapes while avoiding the problems introduced by conventional preprocessing, and has practical potential. Abstract: Modern sensing technologies have enabled the collection of unstructured point cloud data (PCD) of varying sizes, which are used to monitor the geometric accuracy of 3D objects. PCD are widely applied in advanced manufacturing processes, including additive, subtractive, and hybrid manufacturing. To ensure the consistency of analysis and avoid false alarms, preprocessing steps such as registration and mesh reconstruction are commonly applied prior to monitoring. However, these steps are error-prone, time-consuming and may introduce artifacts, potentially affecting monitoring outcomes. In this paper, we present a novel registration-free approach for monitoring PCD of complex shapes, eliminating the need for both registration and mesh reconstruction. Our proposal consists of two alternative feature learning methods and a common monitoring scheme. Feature learning methods leverage intrinsic geometric properties of the shape, captured via the Laplacian and geodesic distances. In the monitoring scheme, thresholding techniques are used to further select intrinsic features most indicative of potential out-of-control conditions. Numerical experiments and case studies highlight the effectiveness of the proposed approach in identifying different types of defects.
[120] Culture in Action: Evaluating Text-to-Image Models through Social Activities
Sina Malakouti,Boqing Gong,Adriana Kovashka
Main category: cs.CV
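TL;DR: Introduces CULTIVate, a benchmark for evaluating the cultural faithfulness of text-to-image models when generating cross-cultural activities. It covers 16 countries with 576 prompts and more than 19,000 images, and proposes four metrics measuring cultural alignment, hallucination, exaggerated elements, and diversity.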
Details
Motivation: Existing T2I models exhibit cultural bias when generating images, fail to depict the cultural activities of underrepresented regions faithfully, and lack benchmarks and metrics for cultural faithfulness. Method: Builds the CULTIVate benchmark covering cross-cultural activities such as greetings, dining, games, traditional dances, and cultural celebrations, with 576 prompts and more than 19,000 images, and proposes an explainable descriptor-based evaluation framework with four quantitative metrics. Result: Models perform better for global north countries than for the global south, with distinct failure modes across T2I systems; human studies show that the new metrics correlate more strongly with human judgments than existing text-image metrics. Conclusion: CULTIVate is an effective tool for evaluating the cultural faithfulness of T2I models, exposes the cultural bias of current models, and supports the development of fairer, more diverse generative models. Abstract: Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for global north countries than for the global south, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics.
[121] VMDT: Decoding the Trustworthiness of Video Foundation Models
Yujin Potter,Zhun Wang,Nicholas Crispino,Kyle Montgomery,Alexander Xiong,Ethan Y. Chang,Francesco Pinto,Yuqi Chen,Rahul Gupta,Morteza Ziyadi,Christos Christodoulopoulos,Bo Li,Chenguang Wang,Dawn Song
Main category: cs.CV
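TL;DR: Introduces VMDT, the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. The evaluation exposes key trustworthiness problems in current video foundation models and underlines the need for more reliable models.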
Details
Motivation: As foundation models advance, the video modality still lacks a comprehensive trustworthiness benchmark, so a systematic framework is needed to measure the trustworthiness of T2V and V2T models. Method: Builds the VMDT platform and evaluates 7 T2V models and 19 V2T models across five trustworthiness dimensions. Result: All evaluated open-source T2V models fail to recognize harmful queries and often generate harmful content, with higher unfairness than image models; in V2T models, unfairness and privacy risks rise with scale while hallucination and adversarial robustness improve; safety shows no correlation with model size. Conclusion: Current video foundation models have serious trustworthiness shortcomings; progress requires looking beyond scale to factors such as safety that scale does not govern, and VMDT provides a systematic evaluation framework for the field. Abstract: As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve, though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal. The code is available at https://sunblaze-ucb.github.io/VMDT-page/.
[122] Pedicle Screw Pairing and Registration for Screw Pose Estimation from Dual C-arm Images Using CAD Models
Yehyun Suh,Lin Li,Aric Plumley,Chaochao Zhou,Daniel Moyer,Kongbin Kang
Main category: cs.CV
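TL;DR: Proposes a method for pedicle screw pairing and pose estimation from dual C-arm images. By comparing screw combinations and applying 2D-3D registration, it significantly reduces projection error and improves the accuracy of screw localization in spinal surgery.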
Details
Motivation: In spinal surgery, accurately matching pedicle screws between anteroposterior (AP) and lateral (LAT) images is critical for decompression and stabilization, but establishing screw correspondence, especially in LAT views, remains challenging. Method: Compares different screw combinations and performs 2D-3D alignment with 3D CAD models of the screws to pair screws and estimate their pose from the two views. Result: The correct screw combination outperforms incorrect pairings in all test cases, and registration further reduces projection error and improves alignment between projections and images. Conclusion: The method provides reliable feedback on screw positioning and shows promise for improving clinical outcomes in spinal surgery. Abstract: Accurate matching of pedicle screws in both anteroposterior (AP) and lateral (LAT) images is critical for successful spinal decompression and stabilization during surgery. However, establishing screw correspondence, especially in LAT views, remains a significant clinical challenge. This paper introduces a method to address pedicle screw correspondence and pose estimation from dual C-arm images. By comparing screw combinations, the approach demonstrates consistent accuracy in both pairing and registration tasks. The method also employs 2D-3D alignment with screw CAD 3D models to accurately pair and estimate screw pose from dual views. Our results show that the correct screw combination consistently outperforms incorrect pairings across all test cases, even prior to registration. After registration, the correct combination further enhances alignment between projections and images, significantly reducing projection error. This approach shows promise for improving surgical outcomes in spinal procedures by providing reliable feedback on screw positioning.
[123] Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
David Acuna,Chao-Han Huck Yang,Yuntian Deng,Jaehun Jung,Ximing Lu,Prithviraj Ammanabrolu,Hyunwoo Kim,Yuan-Hong Liao,Yejin Choi
Main category: cs.CV
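TL;DR: Introduces a new framework for generating vision-centric reasoning data, comprising over 1 million high-quality synthetic questions, and demonstrates strong performance across tasks in several modalities along with cross-modality transfer.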
Details
Motivation: Existing visual reasoning datasets largely depend on undisclosed data or proprietary synthesis recipes, and there is no systematic method for building large-scale, vision-centric reasoning datasets that go beyond visual math. Method: Uses a two-stage synthesis framework that first scales up and then increases complexity; VLMs and reasoning LLMs generate chain-of-thought (CoT) traces with rich cognitive behaviors, and the dataset includes preference data and instruction prompts supporting both offline and online RL. Result: Finetuning Qwen2.5-VL-7B on the data outperforms all open-data baselines on the evaluated vision-centric benchmarks and even surpasses closed-data models such as MiMo-VL-7B-RL; it also yields notable gains on tasks outside the training target, including text reasoning (MMLU-Pro), audio reasoning (MMAU), and embodied QA (NiEH). Conclusion: SFT on high-quality, non-linear reasoning traces is essential for effective online RL; staged offline RL matches online RL's performance at lower compute cost; and careful SFT substantially improves out-of-domain, cross-modality transfer. Abstract: Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.
[124] Towards Better Ultrasound Video Segmentation Foundation Model: An Empirical study on SAM2 Finetuning from Data Perspective
Xing Yao,Ahana Gangopadhyay,Hsi-Ming Chang,Ravi Soni
Main category: cs.CV
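TL;DR: Presents a systematic, data-centric analysis of adapting SAM2 to ultrasound video segmentation, examining how training-set size, video duration, and augmentation strategy affect performance under several adaptation paradigms.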
Details
Motivation: Although foundation models such as SAM2 perform well on general image segmentation, their performance drops markedly on medical imaging, particularly ultrasound video segmentation; existing work focuses on architectural changes and lacks a systematic evaluation of data characteristics and training strategies. Method: Runs experiments across five SAM2 variants and multiple prompting modes, comparing three paradigms (task-specific fine-tuning, intermediate adaptation, and multi-task joint training), and designs six ultrasound-specific augmentations to assess the effect of data scale, temporal length, and augmentation strategy. Result: Data scale and temporal context affect performance more than model architecture or initialization; joint training offers an efficient compromise between modality alignment and task specialization; ultrasound-specific augmentations outperform generic ones. Conclusion: Data quality and training strategy are decisive when transferring SAM2 to ultrasound video segmentation, so data-centric adaptation design should take priority. Abstract: Ultrasound (US) video segmentation remains a challenging problem due to strong inter- and intra-dataset variability, motion artifacts, and limited annotated data. Although foundation models such as Segment Anything Model 2 (SAM2) demonstrate strong zero-shot and prompt-guided segmentation capabilities, their performance deteriorates substantially when transferred to medical imaging domains. Current adaptation studies mainly emphasize architectural modifications, while the influence of data characteristics and training regimes has not been systematically examined. In this study, we present a comprehensive, data-centric investigation of SAM2 adaptation for ultrasound video segmentation. We analyze how training-set size, video duration, and augmentation schemes affect adaptation performance under three paradigms: task-specific fine-tuning, intermediate adaptation, and multi-task joint training, across five SAM2 variants and multiple prompting modes. We further design six ultrasound-specific augmentations, assessing their effect relative to generic strategies. Experiments on three representative ultrasound datasets reveal that data scale and temporal context play a more decisive role than model architecture or initialization. Moreover, joint training offers an efficient compromise between modality alignment and task specialization. This work aims to provide empirical insights for developing efficient, data-aware adaptation pipelines for SAM2 in ultrasound video analysis.
[125] A Second-Order Attention Mechanism For Prostate Cancer Segmentation and Detection in Bi-Parametric MRI
Mateo Ortiz,Juan Olmos,Fabio Martínez
Main category: cs.CV
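TL;DR: Proposes a second-order geometric attention (SOGA) mechanism, modeled with symmetric positive definite matrices on a Riemannian manifold, to guide segmentation of clinically significant prostate cancer (csPCa) lesions in bi-parametric MRI, improving model performance and generalization.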
Details
Motivation: Existing deep learning methods depend on large, finely annotated datasets and struggle with the high variability of lesions across prostate zones, which limits the accuracy and generalization of automatic csPCa detection. Method: Proposes the SOGA attention mechanism, modeled on a Riemannian manifold using symmetric positive definite (SPD) representations, and integrates it through skip connections into U-Net and nnU-Net to strengthen the transfer of key features. Result: Reaches 0.37 AP and 0.83 AUC-ROC on the PI-CAI dataset, outperforming baseline and existing attention methods; on the independent Prostate158 test set it reaches 0.37 AP and 0.75 AUC-ROC, supporting good generalization. Conclusion: SOGA captures higher-order geometric features of lesions and improves the accuracy and robustness of csPCa segmentation in bp-MRI, with potential for clinical use. Abstract: The detection of clinically significant prostate cancer lesions (csPCa) from biparametric magnetic resonance imaging (bp-MRI) has emerged as a noninvasive imaging technique for improving accurate diagnosis. Nevertheless, the analysis of such images remains highly dependent on the subjective expert interpretation. Deep learning approaches have been proposed for csPCa lesions detection and segmentation, but they remain limited due to their reliance on extensively annotated datasets. Moreover, the high lesion variability across prostate zones poses additional challenges, even for expert radiologists. This work introduces a second-order geometric attention (SOGA) mechanism that guides a dedicated segmentation network, through skip connections, to detect csPCa lesions. The proposed attention is modeled on the Riemannian manifold, learning from symmetric positive definite (SPD) representations. The proposed mechanism was integrated into standard U-Net and nnU-Net backbones, and was validated on the publicly available PI-CAI dataset, achieving an Average Precision (AP) of 0.37 and an Area Under the ROC Curve (AUC-ROC) of 0.83, outperforming baseline networks and attention-based methods. Furthermore, the approach was evaluated on the Prostate158 dataset as an independent test cohort, achieving an AP of 0.37 and an AUC-ROC of 0.75, confirming robust generalization and suggesting discriminative learned representations.
[126] Sign language recognition from skeletal data using graph and recurrent neural networks
B. Mederos,J. Mejía,A. Medina-Reyes,Y. Espinosa-Almeyda,J. D. Díaz-Roman,I. Rodríguez-Mederos,M. Mejía-Carreon,F. Gonzalez-Lopez
Main category: cs.CV
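TL;DR: Proposes a Graph-GRU temporal network over skeleton-based pose data for recognizing isolated sign language gestures, achieving high accuracy on the AUTSL dataset.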
Details
Motivation: Improving sign language recognition accuracy requires modeling both the spatial structure and the temporal dynamics of gestures. Method: Applies a Graph-GRU network to skeleton pose data extracted from video sequences, using a graph structure to represent spatial relations between joints and a GRU to capture temporal dependencies. Result: Experiments on the AUTSL dataset show high accuracy on isolated gesture recognition, supporting the combination of graph structure with temporal modeling. Conclusion: The pose-based Graph-GRU model fuses spatial and temporal information effectively and provides a scalable, efficient framework for sign language recognition. Abstract: This work presents an approach for recognizing isolated sign language gestures using skeleton-based pose data extracted from video sequences. A Graph-GRU temporal network is proposed to model both spatial and temporal dependencies between frames, enabling accurate classification. The model is trained and evaluated on the AUTSL (Ankara University Turkish Sign Language) dataset, achieving high accuracy. Experimental results demonstrate the effectiveness of integrating graph-based spatial representations with temporal modeling, providing a scalable framework for sign language recognition. The results of this approach highlight the potential of pose-driven methods for sign language understanding.
[127] TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation
Lalit Maurya,Honghai Liu,Reyer Zwiggelaar
Main category: cs.CV
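TL;DR: Proposes TCSA-UDA, a text-driven cross-semantic alignment framework for unsupervised domain adaptation in medical image segmentation. With a vision-language covariance cosine loss and a prototype alignment module, it significantly outperforms existing methods on cross-modality cardiac, abdominal, and brain tumor segmentation.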
Details
Motivation: Existing unsupervised domain adaptation methods perform poorly under the large domain shift between modalities such as CT and MRI, and the potential of vision-language representation learning for this task remains underexplored. Method: Proposes the TCSA-UDA framework, which uses textual class descriptions to guide visual representation learning; it introduces a vision-language covariance cosine loss to align image features with textual semantic relations and a prototype alignment module to align class-level feature distributions across domains. Result: Experiments on several cross-modality medical image segmentation benchmarks show that the method significantly reduces domain shift and consistently outperforms state-of-the-art methods. Conclusion: By incorporating language-driven semantics, TCSA-UDA establishes a new paradigm for unsupervised domain adaptation in medical imaging and improves the robustness and consistency of cross-modality segmentation. Abstract: Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.
[128] Position-Prior-Guided Network for System Matrix Super-Resolution in Magnetic Particle Imaging
Xuqing Geng,Lei Su,Zhongwei Bian,Zewen Sun,Jiaxuan Wen,Jie Tian,Yang Du
Main category: cs.CV
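TL;DR: Proposes a deep learning method that incorporates positional priors to accelerate system matrix calibration in magnetic particle imaging.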
Details
Motivation: Conventional system matrix calibration is time-consuming and requires repeated measurements, and existing deep learning methods do not fully exploit physical prior knowledge such as symmetric positional priors. Method: Integrates positional prior information into existing super-resolution deep learning frameworks, supported by theoretical justification and validated experimentally on 2D and 3D system matrix super-resolution tasks. Result: Experiments show that incorporating positional priors is effective for system matrix super-resolution in both 2D and 3D settings. Conclusion: Deep learning methods that incorporate physical prior knowledge can improve MPI system matrix calibration and have practical potential. Abstract: Magnetic Particle Imaging (MPI) is a novel medical imaging modality. One of the established methods for MPI reconstruction is based on the System Matrix (SM). However, the calibration of the SM is often time-consuming and requires repeated measurements whenever the system parameters change. Current methodologies utilize deep learning-based super-resolution (SR) techniques to expedite SM calibration; nevertheless, these strategies do not fully exploit physical prior knowledge associated with the SM, such as symmetric positional priors. Consequently, we integrated positional priors into existing frameworks for SM calibration. Underpinned by theoretical justification, we empirically validated the efficacy of incorporating positional priors through experiments involving both 2D and 3D SM SR methods.
[129] MACMD: Multi-dilated Contextual Attention and Channel Mixer Decoding for Medical Image Segmentation
Lalit Maurya,Honghai Liu,Reyer Zwiggelaar
Main category: cs.CV
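TL;DR: Proposes a MACMD-based decoder that, through enhanced attention mechanisms and cross-stage channel mixing, fuses local detail with global context to improve the accuracy and efficiency of medical image segmentation.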
Details
Motivation: Existing models tend to lose shallow-layer detail as data propagates through deeper layers, and local and global information is not integrated well between encoder and decoder, which limits segmentation performance. Method: Designs the MACMD decoder, which combines hierarchical dilated convolutions, attention-driven modulation, and a cross channel-mixing module, and strengthens channel mixing between encoder and decoder via skip connections, capturing long-range dependencies while preserving local context. Result: Outperforms existing methods on binary and multi-organ segmentation tasks, with higher Dice scores and better computational efficiency. Conclusion: The MACMD decoder balances local detail preservation with global context modeling and improves the accuracy and robustness of medical image segmentation. Abstract: Medical image segmentation faces challenges due to variations in anatomical structures. While convolutional neural networks (CNNs) effectively capture local features, they struggle with modeling long-range dependencies. Transformers mitigate this issue with self-attention mechanisms but lack the ability to preserve local contextual information. State-of-the-art models primarily follow an encoder-decoder architecture, achieving notable success. However, two key limitations remain: (1) Shallow layers, which are closer to the input, capture fine-grained details but suffer from information loss as data propagates through deeper layers. (2) Inefficient integration of local details and global context between the encoder and decoder stages. To address these challenges, we propose the MACMD-based decoder, which enhances attention mechanisms and facilitates channel mixing between encoder and decoder stages via skip connections. This design leverages hierarchical dilated convolutions, attention-driven modulation, and a cross channel-mixing module to capture long-range dependencies while preserving local contextual details, essential for precise medical image segmentation. We evaluated our approach using multiple transformer encoders on both binary and multi-organ segmentation tasks. The results demonstrate that our method outperforms state-of-the-art approaches in terms of Dice score and computational efficiency, highlighting its effectiveness in achieving accurate and robust segmentation performance. The code is available at https://github.com/lalitmaurya47/MACMD
[130] LRANet++: Low-Rank Approximation Network for Accurate and Efficient Text Spotting
Yuchen Su,Zhineng Chen,Yongkun Du,Zuxuan Wu,Hongtao Xie,Yu-Gang Jiang
Main category: cs.CV
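TL;DR: Proposes LRANet++, an end-to-end text spotting framework built on low-rank approximation and a triple assignment detection head, which detects and recognizes arbitrary-shaped text accurately and efficiently.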
Details
Motivation: Existing end-to-end text spotting methods are limited by the accuracy and efficiency of their detection module when handling arbitrary-shaped text; the main bottleneck is the lack of a reliable and efficient text detection method. Method: Proposes a data-driven low-rank approximation representation of text shape that uses an ℓ1-norm recovery formulation to reconstruct text contours robustly, and designs a triple assignment detection head in which deep sparse, lightweight sparse, and dense branches are trained together to improve detection accuracy and inference speed. Result: Experiments on several challenging benchmarks show that LRANet++ outperforms state-of-the-art methods in both accuracy and efficiency. Conclusion: By improving the text shape representation and the detection head design, LRANet++ addresses the detection difficulty in arbitrary-shaped text spotting and achieves better end-to-end performance. Abstract: End-to-end text spotting aims to jointly optimize text detection and recognition within a unified framework. Despite significant progress, designing an accurate and efficient end-to-end text spotter for arbitrary-shaped text remains largely unsolved. We identify the primary bottleneck as the lack of a reliable and efficient text detection method. To address this, we propose a novel parameterized text shape method based on low-rank approximation for precise detection and a triple assignment detection head to enable fast inference. Specifically, unlike other shape representation methods that employ data-irrelevant parameterization, our data-driven approach derives a low-rank subspace directly from labeled text boundaries. To ensure this process is robust against the inherent annotation noise in this data, we utilize a specialized recovery method based on an $\ell_1$-norm formulation, which accurately reconstructs the text shape with only a few key orthogonal vectors. By exploiting the inherent shape correlation among different text contours, our method achieves consistency and compactness in shape representation. Next, the triple assignment scheme introduces a novel architecture where a deep sparse branch (for stabilized training) is used to guide the learning of an ultra-lightweight sparse branch (for accelerated inference), while a dense branch provides rich parallel supervision. Building upon these advancements, we integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++, capable of accurately and efficiently spotting arbitrary-shaped text. Extensive experiments on several challenging benchmarks demonstrate the superiority of LRANet++ compared to state-of-the-art methods. Code will be available at: https://github.com/ychensu/LRANet-PP.git
[131] Hilbert-Guided Block-Sparse Local Attention
Yunge Li,Lanyu Xu
Main category: cs.CV
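TL;DR: Proposes constructing windows and neighborhoods along a Hilbert curve: reordering image tokens increases block sparsity and significantly accelerates 2D local attention.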
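The reordering step can be illustrated with a short sketch. This is not the authors' released code: it uses the textbook bit-manipulation formula for the Hilbert index on a square grid whose side is a power of two, and the names `hilbert_index`, `hilbert_order`, and `window_extent` are chosen here for illustration. It shows that fixed-length chunks of the reordered 1D sequence correspond to compact square windows of the 2D grid, which is the contiguity that block-sparse kernels can exploit.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the next level is traversed consistently.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """All grid cells (image tokens) sorted by their position on the Hilbert curve."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))


def window_extent(cells):
    """Width and height of the bounding box covered by a group of cells."""
    xs = [c[0] for c in cells]
    ys = [c[1] for c in cells]
    return (max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


# 8 x 8 token grid, windows of 16 consecutive tokens in Hilbert order.
order = hilbert_order(8)
windows = [order[i:i + 16] for i in range(0, len(order), 16)]
extents = [window_extent(w) for w in windows]
```

Each 16-token window here covers exactly one 4 x 4 patch of the grid, so a window is both spatially local and contiguous in memory. With row-major ordering, the same 4 x 4 patch would be spread over four separate runs of the sequence.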
Details
Motivation: The quadratic compute and memory cost of global self-attention limits its use on high-resolution images, and conventional local attention patterns are hard to accelerate because the tokens in a window are not contiguous in the sequence. Method: Reorders image tokens along a Hilbert curve, builds windows and neighborhoods on the reordered 1D sequence, and combines this with existing block-sparse kernels to make local attention more efficient. Result: The proposed Hilbert Window Attention and Hilbert Slide Attention are about 4x and 18x faster than window attention and slide attention, respectively, and the instantiated models achieve end-to-end speedups with minimal accuracy loss. Conclusion: Hilbert-guided local attention combined with block-sparse kernels is a general and practical way to improve the efficiency of local attention on images. Abstract: The quadratic compute and memory costs of global self-attention severely limit its use in high-resolution images. Local attention reduces complexity by restricting attention to neighborhoods. Block-sparse kernels can further improve the efficiency of local attention, but conventional local attention patterns often fail to deliver significant speedups because tokens within a window are not contiguous in the 1D sequence. This work proposes a novel method for constructing windows and neighborhoods based on the Hilbert curve. Image tokens are first reordered along a Hilbert curve, and windows and neighborhoods are then formed on the reordered 1D sequence. From a block-sparse perspective, this strategy significantly increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention. Experiments show that the proposed Hilbert Window Attention and Hilbert Slide Attention can accelerate window attention and slide attention by about $4\times$ and $18\times$, respectively. To assess practicality, the strategy is instantiated as the Hilbert Window Transformer and the Hilbert Neighborhood Transformer, both of which achieve end-to-end speedups with minimal accuracy loss. Overall, combining Hilbert-guided local attention with block-sparse kernels offers a general and practical approach to enhancing the efficiency of 2D local attention for images. The code is available at https://github.com/Yunge6666/Hilbert-Local-Attention.
[132] TYrPPG: Uncomplicated and Enhanced Learning Capability rPPG for Remote Heart Rate Estimation
Taixi Chen,Yiu-ming Cheung
Main category: cs.CV
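TL;DR: Proposes TYrPPG, a new remote photoplethysmography (rPPG) algorithm based on the Mambaout structure for efficiently extracting heart rate signals from RGB video. It uses a gated video understanding block (GVB) that combines 2D-CNN and 3D-CNN, together with a comprehensive supervised loss (CSL), and achieves state-of-the-art performance on several commonly used datasets.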
Details
Motivation: Existing transformer-based rPPG models are computationally inefficient, and although Mamba is efficient in NLP, its core SSM module has been shown to be unnecessary for vision tasks. The paper therefore explores whether the SSM-free Mambaout structure is feasible for rPPG, aiming to improve both efficiency and performance. Method: Proposes the TYrPPG algorithm, with a gated video understanding block (GVB) based on the Mambaout structure that fuses 2D-CNN and 3D-CNN for video feature extraction, and introduces a comprehensive supervised loss (CSL) and its weakly supervised variants to strengthen learning. Result: Experiments show that TYrPPG reaches state-of-the-art performance on several commonly used rPPG datasets. Conclusion: By simplifying the Mamba architecture and combining it with CNNs and a new loss function, TYrPPG shows the potential of non-transformer architectures for rPPG and points toward lightweight, efficient physiological signal monitoring. Abstract: Remote photoplethysmography (rPPG) can remotely extract physiological signals from RGB video, which has many advantages in detecting heart rate, such as low cost and being non-invasive for patients. The existing rPPG model is usually based on the transformer module, which has low computation efficiency. Recently, the Mamba model has garnered increasing attention due to its efficient performance in natural language processing tasks, demonstrating potential as a substitute for transformer-based algorithms. However, the Mambaout model and its variants prove that the SSM module, which is the core component of the Mamba model, is unnecessary for the vision task. Therefore, we hope to prove the feasibility of using the Mambaout-based module to remotely learn the heart rate. Specifically, we propose a novel rPPG algorithm called uncomplicated and enhanced learning capability rPPG (TYrPPG). This paper introduces an innovative gated video understanding block (GVB) designed for efficient analysis of RGB videos. Based on the Mambaout structure, this block integrates 2D-CNN and 3D-CNN to enhance video understanding for analysis. In addition, we propose a comprehensive supervised loss function (CSL) to improve the model's learning capability, along with its weakly supervised variants. The experiments show that our TYrPPG can achieve state-of-the-art performance on commonly used datasets, indicating its prospects and superiority in remote heart rate estimation. The source code is available at https://github.com/Taixi-CHEN/TYrPPG.
[133] Understanding Cross Task Generalization in Handwriting-Based Alzheimer's Screening via Vision Language Adaptation
Changqing Gong,Huafeng Qin,Mounim A. El-Yacoubi
Main category: cs.CV
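TL;DR: Proposes a lightweight CLIP-based Cross-Layer Fusion Adapter (CLFA) for early Alzheimer's screening from handwriting, enabling prompt-free zero-shot inference, and systematically studies generalization across handwriting tasks.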
Details
Motivation: Existing handwriting-based Alzheimer's studies mostly rely on hand-crafted features and task-specific data and lack a systematic analysis of how task type affects performance and cross-task generalization; meanwhile, large-scale vision language models adapt well to medical domains but remain largely unexplored for handwriting-based disease detection. Method: Proposes the Cross-Layer Fusion Adapter (CLFA) framework, which embeds multi-level fusion adapters in CLIP's visual encoder to progressively align representations toward medical cues in handwriting, enables prompt-free zero-shot inference, and evaluates generalization across handwriting tasks. Result: CLFA shows strong zero-shot diagnostic performance across handwriting tasks and reveals that certain stroke patterns and task types discriminate Alzheimer's disease more effectively; cross-task training and testing show that some tasks generalize better than others. Conclusion: CLFA offers an effective framework for handwriting-based assessment of cognitive impairment, supports the potential of zero-shot vision language models for screening neurodegenerative disease, and establishes a benchmark for generalization across handwriting tasks. Abstract: Alzheimer's disease is a prevalent neurodegenerative disorder for which early detection is critical. Handwriting, often disrupted in prodromal AD, provides a non-invasive and cost-effective window into subtle motor and cognitive decline. Existing handwriting-based AD studies, mostly relying on online trajectories and hand-crafted features, have not systematically examined how task type influences diagnostic performance and cross-task generalization. Meanwhile, large-scale vision language models have demonstrated remarkable zero or few-shot anomaly detection in natural images and strong adaptability across medical modalities such as chest X-ray and brain MRI. However, handwriting-based disease detection remains largely unexplored within this paradigm. To close this gap, we introduce a lightweight Cross-Layer Fusion Adapter framework that repurposes CLIP for handwriting-based AD screening. CLFA implants multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Using this framework, we systematically investigate cross-task generalization (training on a specific handwriting task and evaluating on unseen ones) to reveal which task types and writing patterns most effectively discriminate AD. Extensive analyses further highlight characteristic stroke patterns and task-level factors that contribute to early AD identification, offering both diagnostic insights and a benchmark for handwriting-based cognitive assessment.
[134] Enhancing Diffusion Model Guidance through Calibration and Regularization
Seyed Alireza Javid,Amirhossein Bagheri,Nuria González-Prelcic
Main category: cs.CV
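TL;DR: Proposes a differentiable calibration objective based on the Smooth Expected Calibration Error (Smooth ECE) and several enhanced sampling guidance methods, addressing the problem that classifier-guided diffusion models make overconfident predictions in early denoising steps, which causes the guidance gradient to vanish. Experiments on ImageNet 128x128 show a marked FID improvement without retraining the diffusion model.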
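For readers unfamiliar with calibration error, the following sketch computes the standard binned Expected Calibration Error. Note the assumption: this is the ordinary, non-differentiable ECE, shown only to illustrate what "overconfident" means quantitatively. The Smooth ECE objective used in the paper is a differentiable variant and is not reproduced here. The function name and example numbers are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# An overconfident classifier: claims 95% confidence but is right half the time.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])

# A calibrated classifier: claims 75% confidence and is right 3 times out of 4.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

In this toy example the overconfident classifier has an ECE of about 0.45 and the calibrated one an ECE of 0. A classifier whose predictions saturate near 1.0 yields very small gradients of log p(y|x), which is the vanishing-guidance problem the entry describes.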
Details
Motivation: Classifier-guided diffusion models often make overconfident predictions in early denoising steps, which makes the classifier guidance signal ineffective and hurts generation quality, so more reliable calibration and sampling strategies are needed. Method: (1) Proposes a differentiable calibration objective based on Smooth ECE to improve classifier calibration; (2) designs enhanced sampling methods that require no retraining, including tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling, and a novel f-divergence-based sampling strategy. Result: On ImageNet 128x128, achieves an FID of 2.13 with a ResNet-101 classifier, improving on existing classifier-guided methods without retraining the diffusion model. Conclusion: Principled classifier calibration and divergence-aware sampling improve the generation quality and stability of classifier-guided diffusion models and are of practical value. Abstract: Classifier-guided diffusion models have emerged as a powerful approach for conditional image generation, but they suffer from overconfident predictions during early denoising steps, causing the guidance gradient to vanish. This paper introduces two complementary contributions to address this issue. First, we propose a differentiable calibration objective based on the Smooth Expected Calibration Error (Smooth ECE), which improves classifier calibration with minimal fine-tuning and yields measurable improvements in Frechet Inception Distance (FID). Second, we develop enhanced sampling guidance methods that operate on off-the-shelf classifiers without requiring retraining. These include tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling to preserve diversity, and a novel f-divergence-based sampling strategy that strengthens class-consistent guidance while maintaining mode coverage. Experiments on ImageNet 128x128 demonstrate that our divergence-regularized guidance achieves an FID of 2.13 using a ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining. The results show that principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion.
[135] Point Cloud Segmentation of Integrated Circuits Package Substrates Surface Defects Using Causal Inference: Dataset Construction and Methodology
Bingyang Guo,Qiang Zuo,Ruiyun Yu
Main category: cs.CV
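TL;DR: Builds CPS3D-Seg, a high-quality 3D point cloud dataset for surface defect detection on ceramic package substrates, and proposes CINet, a new method based on causal inference that significantly outperforms existing algorithms in mIoU and accuracy.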
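Since the entry reports both mIoU and accuracy, a small sketch may help show why the two are reported together for defect segmentation. It uses the usual definitions of per-class intersection over union and overall point accuracy; the function name, the two-class setup, and the toy labels are assumptions for illustration and have nothing to do with the CPS3D-Seg data itself.

```python
def miou_and_accuracy(pred, target, num_classes):
    """Mean IoU over classes present in pred or target, plus overall point accuracy."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither pred nor target
            ious.append(inter / union)
    accuracy = sum(1 for p, t in zip(pred, target) if p == t) / len(target)
    return sum(ious) / len(ious), accuracy


# Toy point cloud: class 0 = background, class 1 = defect (only 2 of 10 points).
target = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one defect point is missed

miou, acc = miou_and_accuracy(pred, target, num_classes=2)
```

Missing a single defect point leaves accuracy at 0.9 but drops mIoU to roughly 0.69, because the small defect class counts as much as the background class in the mean. This is why mIoU is the more informative number when defects occupy few points.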
Details
Motivation: Ceramic package substrates (CPS) have a complex structure and minor defects, and no public dataset exists, which makes surface defect detection difficult; a high-quality dataset and an effective segmentation method are needed. Method: Builds the CPS3D-Seg dataset with 1300 samples across 20 product categories and point-level annotations; proposes CINet, which combines Structural Refine (SR) and Quality Assessment (QA) modules to quantify potential confounders in point clouds through causal inference. Result: CPS3D-Seg has better point resolution and precision than existing industrial 3D datasets; in a benchmark against SOTA algorithms, CINet achieves significantly higher mIoU and accuracy. Conclusion: CPS3D-Seg provides important data support for industrial 3D defect detection, and CINet improves point cloud segmentation in complex scenes by introducing causal inference. Abstract: The effective segmentation of 3D data is crucial for a wide range of industrial applications, especially for detecting subtle defects in the field of integrated circuits (IC). Ceramic package substrates (CPS), as an important electronic material, are essential in IC packaging owing to their superior physical and chemical properties. However, the complex structure and minor defects of CPS, along with the absence of a publicly available dataset, significantly hinder the development of CPS surface defect detection. In this study, we construct a high-quality point cloud dataset for 3D segmentation of surface defects in CPS, i.e., CPS3D-Seg, which has the best point resolution and precision compared to existing 3D industrial datasets. CPS3D-Seg consists of 1300 point cloud samples under 20 product categories, and each sample provides accurate point-level annotations. Meanwhile, we conduct a comprehensive benchmark based on SOTA point cloud segmentation algorithms to validate the effectiveness of CPS3D-Seg. Additionally, we propose a novel 3D segmentation method based on causal inference (CINet), which quantifies potential confounders in point clouds through Structural Refine (SR) and Quality Assessment (QA) Modules. Extensive experiments demonstrate that CINet significantly outperforms existing algorithms in both mIoU and accuracy.
[136] CGCE: Classifier-Guided Concept Erasure in Generative Models
Viet Nguyen,Vishal M. Patel
Main category: cs.CV
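TL;DR: Proposes Classifier-Guided Concept Erasure (CGCE), a plug-and-play framework that robustly erases undesired concepts from generated content by modifying unsafe text embeddings at inference time, while preserving generation quality on safe prompts.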
Details
Motivation: 现有概念擦除方法在面对对抗攻击时容易失效,且常以牺牲模型对安全内容的生成性能为代价,难以平衡安全性与生成质量。 Method: CGCE利用轻量级分类器作用于文本嵌入,检测并修正包含不良概念的提示,在不修改预训练模型权重的前提下实现概念擦除,并支持多概念联合擦除。 Result: 实验表明CGCE在多种红队攻击下实现了最先进的鲁棒性,同时保持了高生成质量,适用于多种文本到图像和文本到视频生成模型。 Conclusion: CGCE是一种高效、可扩展且实用的概念擦除方法,在安全性与生成性能之间取得了良好平衡,适用于大规模生成式AI的安全部署。 Abstract: Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model's original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. 
We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
[137] Light-Field Dataset for Disparity Based Depth Estimation
Suresh Nehra,Aupendu Kar,Jayanta Mukhopadhyay,Prabir Kumar Biswas
Main category: cs.CV
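TL;DR: Introduces a publicly available light field image dataset containing 285 real images captured with a Lytro Illum camera and 13 synthetic light field images, and analyzes the effect of focal position on disparity as well as the shortcomings of existing datasets.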
Details
Motivation: 为了设计和测试新的基于视差的光场深度估计算法,需要有合适的光场图像数据集;同时现有数据集存在局限性,亟需更全面的数据支持研究。 Method: 使用Lytro Illum光场相机采集真实光场图像,结合Blender生成合成数据,并利用机械云台系统构建立体光场数据集,同时分析不同焦距下视差点的差异。 Result: 提出了一个包含真实与合成光场图像的综合数据集,揭示了焦距位置对视差估计的影响,并指出现有数据集在 disparity 特性建模上的不足。 Conclusion: 该数据集为光场深度估计的研究提供了有力支持,尤其在算法开发与评估方面具有重要价值,且强调了真实与合成数据结合的重要性。 Abstract: A Light Field (LF) camera consists of an additional two-dimensional array of micro-lenses placed between the main lens and sensor, compared to a conventional camera. The sensor pixels under each micro-lens receive light from a sub-aperture of the main lens. This enables the image sensor to capture both spatial information and the angular resolution of a scene point. This additional angular information is used to estimate the depth of a 3-D scene. The continuum of virtual viewpoints in light field data enables efficient depth estimation using Epipolar Line Images (EPIs) with robust occlusion handling. However, the trade-off between angular information and spatial information is very critical and depends on the focal position of the camera. To design, develop, implement, and test novel disparity-based light field depth estimation algorithms, the availability of suitable light field image datasets is essential. In this paper, a publicly available light field image dataset is introduced and thoroughly described. We have also demonstrated the effect of focal position on the disparity of a 3-D point as well as the shortcomings of the currently available light field dataset. The proposed dataset contains 285 light field images captured using a Lytro Illum LF camera and 13 synthetic LF images. The proposed dataset also comprises a synthetic dataset with similar disparity characteristics to those of a real light field camera. A real and synthetic stereo light field dataset is also created by using a mechanical gantry system and Blender. 
The dataset is available at https://github.com/aupendu/light-field-dataset.
[138] MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View Clustering
Jian Zhu,Xin Zou,Jun Sun,Cheng Luo,Lei Liu,Lingfang Zeng,Ning Zhang,Bian Wu,Chang Tang,Lirong Dai
Main category: cs.CV
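TL;DR: Proposes MoEGCL, a new multi-view clustering method that improves clustering performance through fine-grained, sample-level fusion and contrastive learning.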
Details
Motivation: 现有方法在图结构融合上较为粗糙,通常在视图级别进行加权融合,缺乏对样本级别细粒度信息的利用。 Method: 设计了Mixture of Ego-Graphs Fusion (MoEGF) 模块,在样本级别构建ego图并使用专家混合网络实现细粒度融合;引入Ego Graph Contrastive Learning (EGCL) 模块,增强同一簇内样本的表示一致性。 Result: 在多个深度多视图聚类任务中实现了最先进的性能。 Conclusion: MoEGCL通过样本级图融合与对比学习有效提升了多视图聚类的表示能力,具有优越的聚类效果。 Abstract: In recent years, the advancement of Graph Neural Networks (GNNs) has significantly propelled progress in Multi-View Clustering (MVC). However, existing methods face the problem of coarse-grained graph fusion. Specifically, current approaches typically generate a separate graph structure for each view and then perform weighted fusion of graph structures at the view level, which is a relatively rough strategy. To address this limitation, we present a novel Mixture of Ego-Graphs Contrastive Representation Learning (MoEGCL). It mainly consists of two modules. In particular, we propose an innovative Mixture of Ego-Graphs Fusion (MoEGF), which constructs ego graphs and utilizes a Mixture-of-Experts network to implement fine-grained fusion of ego graphs at the sample level, rather than the conventional view-level fusion. Additionally, we present the Ego Graph Contrastive Learning (EGCL) module to align the fused representation with the view-specific representation. The EGCL module enhances the representation similarity of samples from the same cluster, not merely from the same sample, further boosting fine-grained graph representation. Extensive experiments demonstrate that MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks. The source code is publicly available at https://github.com/HackerHyper/MoEGCL.[139] Towards Frequency-Adaptive Learning for SAR Despeckling
Ziqing Ma,Chang Yang,Zhichang Guo,Yao Li
Main category: cs.CV
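TL;DR: Proposes SAR-FAH, a frequency-adaptive heterogeneous despeckling model built on a divide-and-conquer architecture: wavelet decomposition separates the image into frequency bands, and a dedicated sub-network handles each band, improving SAR despeckling performance.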
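The wavelet split mentioned in this summary can be illustrated with a minimal single-level 2D Haar transform. This is only a sketch: the abstract does not say which wavelet or how many levels SAR-FAH uses, so the Haar choice, the function name `haar_dwt2`, and the band names are assumptions made for illustration.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of a list-of-lists image.

    Assumes even height and width. Returns four half-resolution bands:
    approx (low frequency), row_detail, col_detail, diag_detail (high frequency).
    Hypothetical helper for illustration; not taken from the paper.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx, row_detail, col_detail, diag_detail = [], [], [], []
    for r in range(0, h, 2):
        bands = ([], [], [], [])
        for c in range(0, w, 2):
            # One 2x2 block: a b on the top row, d e on the bottom row.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            bands[0].append((a + b + d + e) / 2)  # scaled local average
            bands[1].append((a + b - d - e) / 2)  # top minus bottom
            bands[2].append((a - b + d - e) / 2)  # left minus right
            bands[3].append((a - b - d + e) / 2)  # diagonal difference
        approx.append(bands[0])
        row_detail.append(bands[1])
        col_detail.append(bands[2])
        diag_detail.append(bands[3])
    return approx, row_detail, col_detail, diag_detail


# A block whose intensity changes only from left to right.
image = [[1, 3], [1, 3]]
approx, row_detail, col_detail, diag_detail = haar_dwt2(image)
```

In a divide-and-conquer design of this kind, `approx` would go to the low-frequency branch and the three detail bands to the high-frequency branch; in this toy input only `col_detail` is non-zero, since the only variation is between columns.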
Details
Motivation: 现有深度学习方法通常使用统一网络处理整个SAR图像,忽视了不同空间物理特性对应的斑点噪声统计差异,导致伪影、边缘模糊和纹理失真。 Method: 采用小波分解将图像分为不同频率子带;对低频部分使用神经常微分方程建模去噪过程,保证结构保真性;对富含边缘和纹理的高频子带,采用增强型U-Net结合可变形卷积进行特征增强与去噪。 Result: 在合成和真实SAR图像上的实验表明,该方法在噪声抑制和结构保持方面优于现有方法,尤其在边缘和纹理细节上表现突出。 Conclusion: SAR-FAH通过频率自适应的异构网络设计,有效利用不同频带的统计特性,在多种SAR图像上实现了更优的去斑效果,具有良好的应用前景。 Abstract: Synthetic Aperture Radar (SAR) images are inherently corrupted by speckle noise, limiting their utility in high-precision applications. While deep learning methods have shown promise in SAR despeckling, most methods employ a single unified network to process the entire image, failing to account for the distinct speckle statistics associated with different spatial physical characteristics. It often leads to artifacts, blurred edges, and texture distortion. To address these issues, we propose SAR-FAH, a frequency-adaptive heterogeneous despeckling model based on a divide-and-conquer architecture. First, wavelet decomposition is used to separate the image into frequency sub-bands carrying different intrinsic characteristics. Inspired by their differing noise characteristics, we design specialized sub-networks for different frequency components. The tailored approach leverages statistical variations across frequencies, improving edge and texture preservation while suppressing noise. Specifically, for the low-frequency part, denoising is formulated as a continuous dynamic system via neural ordinary differential equations, ensuring structural fidelity and sufficient smoothness that prevents artifacts. For high-frequency sub-bands rich in edges and textures, we introduce an enhanced U-Net with deformable convolutions for noise suppression and enhanced features. Extensive experiments on synthetic and real SAR images validate the superior performance of the proposed model in noise removal and structural preservation.[140] Hybrid second-order gradient histogram based global low-rank sparse regression for robust face recognition
Hongxia Li,Ying Ji,Yongxin Dong,Yuehua Feng
Main category: cs.CV
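TL;DR: Proposes a Hybrid Second-Order Gradient Histogram based Global Low-Rank Sparse Regression model (H2H-GLRSR) to improve face recognition under complex occlusion and illumination variation.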
Details
Motivation: 为应对人脸识别中复杂遮挡和光照变化带来的挑战,需增强模型对局部结构特征的表达能力和对全局噪声相关性的建模能力。 Method: 设计了新的混合二阶梯度直方图(H2H)特征描述子,并将其与稀疏正则化核范数矩阵回归(SR_NMR)结合,同时在残差矩阵上引入全局低秩约束以捕捉结构噪声中的全局相关性。 Result: 实验结果表明,所提方法在遮挡、光照变化和无约束环境下显著优于现有的基于回归的分类方法。 Conclusion: H2H-GLRSR模型通过融合H2H特征与全局低秩稀疏回归,有效提升了复杂条件下面部识别的鲁棒性和准确性。 Abstract: Low-rank sparse regression models have been widely applied in the field of face recognition. To further address the challenges caused by complex occlusions and illumination variations, this paper proposes a Hybrid Second-Order Gradient Histogram based Global Low-Rank Sparse Regression (H2H-GLRSR) model. Specifically, a novel feature descriptor called the Hybrid Second-Order Gradient Histogram (H2H) is first designed to more effectively characterize the local structural features of facial images. Then, this descriptor is integrated with the Sparse Regularized Nuclear Norm based Matrix Regression (SR$\_$NMR). Moreover, a global low-rank constraint is imposed on the residual matrix, enabling the model to better capture the global correlations inherent in structured noise. Experimental results demonstrate that the proposed method significantly outperforms existing regression-based classification approaches under challenging scenarios involving occlusions, illumination changes, and unconstrained environments.[141] Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning
Fei Yu,Quan Deng,Shengeng Tang,Yuehua Li,Lechao Cheng
Main category: cs.CV
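TL;DR: Proposes a unified framework for open-world 3D scene graph generation that combines vision-language models with retrieval-augmented reasoning, supporting dynamic semantic understanding and multimodal interaction and showing strong generalization across multiple tasks.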
Details
Motivation: 现有3D场景理解方法受限于闭集词汇监督和静态标注,难以适应开放世界的动态需求。 Method: 设计了一个包含动态场景图生成模块和检索增强推理流水线的框架,利用视觉-语言模型实现开集感知,并通过向量数据库支持文本/图像条件查询。 Result: 在3DSSG和Replica数据集上四个任务中验证了方法的有效性,包括场景问答、视觉定位、实例检索和任务规划,性能优于现有方法。 Conclusion: 结合开集感知与检索增强推理可有效提升3D场景理解的泛化性和交互性,为开放世界应用提供了可行方案。 Abstract: Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support text/image-conditioned queries. We evaluate our method on 3DSSG and Replica benchmarks across four tasks-scene question answering, visual grounding, instance retrieval, and task planning-demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.[142] GABFusion: Rethinking Feature Fusion for Low-Bit Quantization of Multi-Task Networks
Zhaoyang Wang,Dong Wang
Main category: cs.CV
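TL;DR: Proposes GABFusion and ADA to address the performance drop that multi-task networks suffer under quantization-aware training (QAT) because of task-specific feature discrepancies and gradient conflicts. The method dynamically balances gradient magnitudes, fuses task-specific features, and adds an attention distribution alignment strategy. It significantly improves QAT across multiple architectures and bit-widths, stays close to full-precision accuracy even at 4-bit quantization, and is general and easy to integrate.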
Details
Motivation: 多任务模型在进行量化感知训练时,由于不同任务之间的特征差异和梯度冲突,常常导致性能显著下降,现有方法难以有效应对这一问题。 Method: 提出Gradient-Aware Balanced Feature Fusion (GABFusion),动态平衡各任务的梯度幅度并融合特征;引入Attention Distribution Alignment (ADA)作为面向量化模型的特征级蒸馏策略,提升特征一致性与可迁移性。 Result: 在PASCAL VOC和COCO数据集上,相比现有QAT方法平均mAP分别提升约3.3%和1.6%;在4位量化下的YOLOv5模型中,VOC上的精度差距缩小至仅1.7%,表现出优异的低比特性能保持能力。 Conclusion: GABFusion与ADA构成的框架具有模块化、兼容性强、无需修改原网络结构的优点,能有效提升多任务量化模型的性能,适用于多种网络架构和QAT算法,具备广泛的应用前景。 Abstract: Despite the effectiveness of quantization-aware training (QAT) in compressing deep neural networks, its performance on multi-task architectures often degrades significantly due to task-specific feature discrepancies and gradient conflicts. To address these challenges, we propose Gradient-Aware Balanced Feature Fusion (GABFusion), which dynamically balances gradient magnitudes and fuses task-specific features in a quantization-friendly manner. We further introduce Attention Distribution Alignment (ADA), a feature-level distillation strategy tailored for quantized models. Our method demonstrates strong generalization across network architectures and QAT algorithms, with theoretical guarantees on gradient bias reduction. Extensive experiments demonstrate that our strategy consistently enhances a variety of QAT methods across different network architectures and bit-widths. On PASCAL VOC and COCO datasets, the proposed approach achieves average mAP improvements of approximately 3.3% and 1.6%, respectively. When applied to YOLOv5 under 4-bit quantization, our method narrows the accuracy gap with the full-precision model to only 1.7% on VOC, showcasing its effectiveness in preserving performance under low-bit constraints. 
Notably, the proposed framework is modular, easy to integrate, and compatible with any existing QAT technique, enhancing the performance of quantized models without requiring modifications to the original network architecture.
[143] Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation
Qiming Li,Zekai Ye,Xiaocheng Feng,Weihong Zhong,Weitao Ma,Xiachong Feng
Main category: cs.CV
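TL;DR: Proposes the Fine-grained Cross-modal Causal Tracing (FCCT) framework for systematically analyzing cross-modal causal effects in large vision-language models and, based on its findings, a training-free inference-time method, Intermediate Representation Injection (IRI), that strengthens visual perception and mitigates hallucination.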
Details
Motivation: 现有对大型视觉-语言模型的可解释性研究缺乏全面性,未能充分覆盖视觉与文本token、模型组件及所有层,限制了对模型行为的理解和下游任务(如幻觉缓解)的改进。 Method: 提出FCCT框架,对视觉和文本token、多头自注意力(MHSA)、前馈网络(FFN)和隐藏状态在所有解码层进行细粒度因果追踪分析;基于分析结果设计IRI方法,在特定组件和层级注入中间表示以强化跨模态信息流。 Result: 首次发现中层MHSA在最后token上对跨模态信息聚合起关键作用,FFN呈现三阶段层次化视觉表征存储与传递模式;IRI在五个主流基准和LVLM上均实现最优性能,显著缓解幻觉且不牺牲推理速度和其他基础性能。 Conclusion: FCCT提供了对LVLM内部机制更深入的理解,IRI作为一种高效无训练干预方法,在提升模型感知准确性和可靠性方面具有广泛应用潜力。 Abstract: Despite the remarkable advancements of Large Vision-Language Models (LVLMs), the mechanistic interpretability remains underexplored. Existing analyses are insufficiently comprehensive and lack examination covering visual and textual tokens, model components, and the full range of layers. This limitation restricts actionable insights to improve the faithfulness of model output and the development of downstream tasks, such as hallucination mitigation. To address this limitation, we introduce Fine-grained Cross-modal Causal Tracing (FCCT) framework, which systematically quantifies the causal effects on visual object perception. FCCT conducts fine-grained analysis covering the full range of visual and textual tokens, three core model components including multi-head self-attention (MHSA), feed-forward networks (FFNs), and hidden states, across all decoder layers. Our analysis is the first to demonstrate that MHSAs of the last token in middle layers play a critical role in aggregating cross-modal information, while FFNs exhibit a three-stage hierarchical progression for the storage and transfer of visual object representations. Building on these insights, we propose Intermediate Representation Injection (IRI), a training-free inference-time technique that reinforces visual object information flow by precisely intervening on cross-modal representations at specific components and layers, thereby enhancing perception and mitigating hallucination. 
Consistent improvements across five widely used benchmarks and LVLMs demonstrate that IRI achieves state-of-the-art performance while preserving inference speed and other foundational performance.
[144] CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework
Jiaxuan Li,Qing Xu,Xiangjian He,Ziyu Liu,Chang Xing,Zhen Chen,Daokun Zhang,Rong Qu,Chang Wen Chen
Main category: cs.CV
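TL;DR: Proposes Complementary Masked Autoencoders (CoMA) and the DyViT model, which use a complementary masking strategy and dynamic multi-window self-attention to significantly improve the efficiency and adaptability of image representation learning, matching MAE with only 12% of the pre-training epochs.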
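One way to picture a complementary masking strategy is to partition the patch indices into disjoint visible sets, so that every patch is visible in exactly one masked view. The sketch below makes that assumption explicit; the abstract does not describe CoMA's actual procedure, and the function name `complementary_masks`, the number of views, and the seed are hypothetical.

```python
import random


def complementary_masks(num_patches, num_views, seed=0):
    """Build `num_views` masks whose visible sets partition the patches.

    Each mask is a list of booleans (True = patch is hidden). Illustrative
    only: this is an assumed reading of "complementary masking".
    """
    if num_patches % num_views != 0:
        raise ValueError("num_patches must be divisible by num_views")
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)  # random but reproducible assignment
    per_view = num_patches // num_views
    masks = []
    for v in range(num_views):
        visible = set(order[v * per_view:(v + 1) * per_view])
        masks.append([i not in visible for i in range(num_patches)])
    return masks


# 16 patches, 4 views: each view hides 12 patches (a 75% mask ratio).
masks = complementary_masks(num_patches=16, num_views=4)
# Number of views in which each patch is visible.
visible_counts = [sum(not m[i] for m in masks) for i in range(16)]
```

Under this construction `visible_counts` is 1 for every patch, which is the uniform-sampling property the summary refers to; independent random masking, by contrast, can leave some patches unseen for several iterations.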
Details
Motivation: 为了解决MAE等随机掩码方法需要大量预训练周期以及ViT在固定空间分辨率下参数利用效率低的问题。 Method: 提出CoMA采用互补掩码策略以实现像素级均匀采样,并设计DyViT网络结构,引入动态多窗口自注意力(DM-MSA)进行分层特征学习。 Result: 在ImageNet-1K上预训练时,DyViT仅需12%的训练周期即达到MAE的下游任务性能,且每轮训练时间减少10%。 Conclusion: CoMA与DyViT联合方案显著提升了自监督学习的效率和模型适应性,为高效视觉表征学习提供了新思路。 Abstract: Masked Autoencoders (MAE) achieve self-supervised learning of image representations by randomly removing a portion of visual tokens and reconstructing the original image as a pretext task, thereby significantly enhancing pretraining efficiency and yielding excellent adaptability across downstream tasks. However, MAE and other MAE-style paradigms that adopt random masking generally require more pre-training epochs to maintain adaptability. Meanwhile, ViT in MAE suffers from inefficient parameter use due to fixed spatial resolution across layers. To overcome these limitations, we propose the Complementary Masked Autoencoders (CoMA), which employ a complementary masking strategy to ensure uniform sampling across all pixels, thereby improving effective learning of all features and enhancing the model's adaptability. Furthermore, we introduce DyViT, a hierarchical vision transformer that employs a Dynamic Multi-Window Self-Attention (DM-MSA), significantly reducing the parameters and FLOPs while improving fine-grained feature learning. Pre-trained on ImageNet-1K with CoMA, DyViT matches the downstream performance of MAE using only 12% of the pre-training epochs, demonstrating more effective learning. It also attains a 10% reduction in pre-training time per epoch, further underscoring its superior pre-training efficiency.[145] AD-DAE: Unsupervised Modeling of Longitudinal Alzheimer's Disease Progression with Diffusion Auto-Encoder
Ayantika Das,Arunima Sarkar,Keerthi Ram,Mohanasankar Sivaprakasam
Main category: cs.CV
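TL;DR: Proposes a conditionable diffusion auto-encoder framework for unsupervised generation of longitudinal brain images showing Alzheimer's disease progression, modeling progression by restricting controlled shifts in the latent space.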
Details
Motivation: 现有生成模型在无明确纵向监督的情况下难以有效控制潜在空间以生成个体化的疾病进展图像,缺乏对进展相关因素与个体身份特征的解耦能力。 Method: 引入条件化扩散自编码器,利用其紧凑的潜在空间捕捉高层语义,并将进展相关的位移限制在特定子空间内,通过与进展属性相关联的方式隐式引导位移,从而实现无需个体纵向数据监督的可控生成。 Result: 在两个不同来源的阿尔茨海默病数据集上验证了方法的有效性,生成图像质量高,能准确反映体积变化趋势,并提升下游分类任务性能。 Conclusion: 该方法实现了无监督下的可控纵向图像生成,有效分离疾病进展与个体特征,在阿尔茨海默病进展建模中具有潜力。 Abstract: Generative modeling frameworks have emerged as an effective approach to capture high-dimensional image distributions from large datasets without requiring domain-specific knowledge, a capability essential for longitudinal disease progression modeling. Recent generative modeling approaches have attempted to capture progression by mapping images into a latent representational space and then controlling and guiding the representations to generate follow-up images from a baseline image. However, existing approaches impose constraints on distribution learning, leading to latent spaces with limited controllability to generate follow-up images without explicit supervision from subject-specific longitudinal images. In order to enable controlled movements in the latent representational space and generate progression images from a baseline image in an unsupervised manner, we introduce a conditionable Diffusion Auto-encoder framework. The explicit encoding mechanism of image-diffusion auto-encoders forms a compact latent space capturing high-level semantics, providing means to disentangle information relevant for progression. Our approach leverages this latent space to condition and apply controlled shifts to baseline representations for generating follow-up. Controllability is induced by restricting these shifts to a subspace, thereby isolating progression-related factors from subject identity-preserving components. The shifts are implicitly guided by correlating with progression attributes, without requiring subject-specific longitudinal supervision. 
We validate the generations through image quality metrics, volumetric progression analysis, and downstream classification in Alzheimer's disease datasets from two different sources and disease categories. This demonstrates the effectiveness of our approach for Alzheimer's progression modeling and longitudinal image generation.
[146] Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation
Lin Li,Chuhan Zhang,Dong Zhang,Chong Sun,Chen Li,Long Chen
Main category: cs.CV
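TL;DR: Proposes ACC, an interaction-centric end-to-end framework for open-vocabulary scene graph generation (OVSGG). Its interaction-driven paradigm addresses the noisy pseudo-supervision and ambiguous query matching that existing methods incur during knowledge infusion and transfer because they lack explicit interaction modeling.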
Details
Motivation: 现有OVSGG方法因未显式建模对象间交互,导致在知识注入和迁移阶段产生伪标签噪声和查询歧义,限制了性能提升。 Method: 提出ACC框架:1)交互中心的知识注入,采用双向交互提示生成鲁棒的伪监督信号;2)交互中心的知识迁移,包括交互引导的查询选择和交互一致的知识蒸馏,以增强关系特征并保留通用知识。 Result: 在三个基准数据集上实现了最先进的性能,验证了交互中心范式的有效性。 Conclusion: ACC通过显式建模对象交互显著提升了OVSGG的性能,展示了交互驱动方法在真实场景中的潜力。 Abstract: Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) \textit{Infusing knowledge} into large-scale models via pre-training on large datasets; 2) \textit{Transferring knowledge} from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an inter\textbf{AC}tion-\textbf{C}entric end-to-end OVSGG framework (\textbf{ACC}) in an interaction-driven paradigm to minimize these mismatches. For \textit{interaction-centric knowledge infusion}, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model's interaction knowledge. For \textit{interaction-centric knowledge transfer}, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. 
Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.
[147] Global Multiple Extraction Network for Low-Resolution Facial Expression Recognition
Jingyi Shi
Main category: cs.CV
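TL;DR: Proposes a global multiple extraction network (GME-Net) for low-resolution facial expression recognition that combines attention mechanisms with a multi-scale structure to strengthen feature extraction, outperforming existing methods on several datasets.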
Details
Motivation: Existing facial expression recognition algorithms perform well on high-resolution images but degrade on low-resolution ones, mainly because detail information is missing and global modeling is weak. Method: Proposes GME-Net, comprising a hybrid attention-based local feature extraction module (with attention similarity knowledge distillation) and a multi-scale global feature extraction module with a quasi-symmetric structure, to strengthen detail learning and global feature capture. Result: Experiments on several widely used datasets show that GME-Net outperforms existing methods on low-resolution facial expression recognition, with stronger feature extraction and robustness. Conclusion: GME-Net effectively mitigates missing detail and local noise interference in low-resolution images, significantly improves recognition accuracy, and shows good application potential. Abstract: Facial expression recognition, as a vital computer vision task, is garnering significant attention and undergoing extensive research. Although facial expression recognition algorithms demonstrate impressive performance on high-resolution images, their effectiveness tends to degrade when confronted with low-resolution images. We attribute this to two factors: 1) low-resolution images lack detail information; 2) current methods have weak global modeling, which makes it difficult to extract discriminative features. To alleviate these issues, we propose a novel global multiple extraction network (GME-Net) for low-resolution facial expression recognition, which incorporates 1) a hybrid attention-based local feature extraction module with attention similarity knowledge distillation to learn image details from a high-resolution network; 2) a multi-scale global feature extraction module with a quasi-symmetric structure to mitigate the influence of local image noise and facilitate capturing global image features. As a result, our GME-Net is capable of extracting expression-related discriminative features. Extensive experiments conducted on several widely used datasets demonstrate that the proposed GME-Net recognizes low-resolution facial expressions better and achieves better performance than existing solutions.
[148] Polymap: generating high definition map based on rasterized polygons
Shiyu Gao,Hao Jiang
Main category: cs.CV
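TL;DR: Proposes an instance-segmentation-based method for online HD map construction that reinterprets road elements as rasterized polygons and combines a Transformer with Potrace post-processing to produce vector maps; its effectiveness and generalizability are validated on the NuScenes dataset.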
Details
Motivation: Existing detection-based HD map construction methods such as Maptr can build maps in real time but generalize poorly, which limits their use in auto-labeling systems. Method: Treats road elements as rasterized polygons, uses an instance-segmentation-based Transformer to generate instance masks end to end, and then applies Potrace-based vectorization as post-processing to obtain vectorized map elements. Result: Quantitative results on the NuScenes dataset show that the method significantly improves generalization while remaining real-time. Conclusion: By using a segmentation framework rather than a detection framework, the proposed method improves the generalization of online HD map construction and suits practical applications such as auto-labeling. Abstract: The perception of high-definition maps is an integral component of environmental perception in autonomous driving systems. Existing research has often focused on online construction of high-definition maps. For instance, the Maptr[9] series employs a detection-based method to output vectorized map instances in parallel in an end-to-end manner. However, despite their capability for real-time construction, detection-based methods are observed to lack robust generalizability[19], which hampers their applicability in auto-labeling systems. Therefore, aiming to improve the generalizability, we reinterpret road elements as rasterized polygons and design a concise framework based on instance segmentation. Initially, a segmentation-based transformer is employed to deliver instance masks in an end-to-end manner; succeeding this step, a Potrace-based[17] post-processing module is used to ultimately yield vectorized map elements. Quantitative results attained on the NuScenes[1] dataset substantiate the effectiveness and generalizability of our method.
[149] Reperio-rPPG: Relational Temporal Graph Neural Networks for Periodicity Learning in Remote Physiological Measurement
Ba-Thinh Nguyen,Thach-Ha Ngoc Pham,Hoang-Long Duc Nguyen,Thi-Duyen Ngo,Thanh-Ha Le
Main category: cs.CV
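TL;DR: Proposes the Reperio-rPPG framework, which combines Relational Convolutional Networks with a Graph Transformer and uses CutMix to increase data diversity, effectively modeling the periodicity of rPPG signals and achieving state-of-the-art heart rate estimation on multiple benchmarks.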
Details
Motivation: 现有rPPG方法对生理信号内在周期性的建模不足,且数据集多样性有限,导致在真实场景中鲁棒性差。 Method: 提出Reperio-rPPG框架,融合关系卷积网络和图Transformer以捕捉生理信号的周期结构,并设计针对性的CutMix数据增强策略提升模型泛化能力。 Result: 在PURE、UBFC-rPPG和MMPD三个基准数据集上取得最优性能,尤其在不同运动和光照条件下表现出强鲁棒性。 Conclusion: Reperio-rPPG通过显式建模周期性和增强数据多样性,显著提升了远程生理信号测量的准确性和稳定性,推动了rPPG技术在现实场景中的应用。 Abstract: Remote photoplethysmography (rPPG) is an emerging contactless physiological sensing technique that leverages subtle color variations in facial videos to estimate vital signs such as heart rate and respiratory rate. This non-invasive method has gained traction across diverse domains, including telemedicine, affective computing, driver fatigue detection, and health monitoring, owing to its scalability and convenience. Despite significant progress in remote physiological signal measurement, a crucial characteristic - the intrinsic periodicity - has often been underexplored or insufficiently modeled in previous approaches, limiting their ability to capture fine-grained temporal dynamics under real-world conditions. To bridge this gap, we propose Reperio-rPPG, a novel framework that strategically integrates Relational Convolutional Networks with a Graph Transformer to effectively capture the periodic structure inherent in physiological signals. Additionally, recognizing the limited diversity of existing rPPG datasets, we further introduce a tailored CutMix augmentation to enhance the model's generalizability. Extensive experiments conducted on three widely used benchmark datasets - PURE, UBFC-rPPG, and MMPD - demonstrate that Reperio-rPPG not only achieves state-of-the-art performance but also exhibits remarkable robustness under various motion (e.g., stationary, rotation, talking, walking) and illumination conditions (e.g., nature, low LED, high LED). The code is publicly available at https://github.com/deconasser/Reperio-rPPG.[150] U(PM)$^2$:Unsupervised polygon matching with pre-trained models for challenging stereo images
Chang Li,Xingtao Peng
Main category: cs.CV
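TL;DR: Proposes U(PM)², a low-cost unsupervised polygon matching method that combines pre-trained models with handcrafted features and achieves high accuracy, strong generalization, and competitive speed without any training.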
Details
Motivation: Address the challenges of polygon matching in stereo images: disparity discontinuity, scale variation, training requirements, and generalization. Method: Uses the pre-trained Segment Anything model to detect masks and converts them into polygon structures; performs global matching with a bidirectional-pyramid strategy and LoFTR; performs fine local matching with a local-joint geometry and multi-feature matching strategy and the Hungarian algorithm. Result: Evaluated on the ScanNet and SceneFlow datasets with a newly proposed metric, the method reaches state-of-the-art accuracy, good generalization, and high speed, without any training. Conclusion: U(PM)² is an efficient, training-free polygon matching method that handles multi-scale and disparity-discontinuity problems and has good practical prospects. Abstract: Stereo image matching is a fundamental task in computer vision, photogrammetry and remote sensing, but there is an almost unexplored field, i.e., polygon matching, which faces the following challenges: disparity discontinuity, scale variation, training requirement, and generalization. To address these issues, this paper proposes U(PM)$^2$, a novel low-cost unsupervised polygon matching method that uses pre-trained models and unites automatically learned and handcrafted features. The pipeline is as follows: first, the detector leverages the pre-trained Segment Anything model to obtain masks; then, the vectorizer converts the masks to polygons and graphic structure; next, the global matcher addresses challenges from global viewpoint changes and scale variation using a bidirectional-pyramid strategy with pre-trained LoFTR; finally, the local matcher overcomes local disparity discontinuity and topology inconsistency of polygon matching using a local-joint geometry and multi-feature matching strategy with the Hungarian algorithm. We benchmark U(PM)$^2$ on the ScanNet and SceneFlow datasets using our proposed new metric; it achieves state-of-the-art accuracy at a competitive speed and satisfactory generalization performance at low cost without any training requirement.
[151] CSGaze: Context-aware Social Gaze Prediction
Surbhi Madan,Shreya Ghosh,Ramanathan Subramanian,Abhinav Dhall,Tom Gedeon
Main category: cs.CV
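TL;DR: Proposes CSGaze, a context-aware multimodal method that uses facial and scene information to improve social gaze pattern prediction and adds an attention mechanism to better model the principal speaker.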
Details
Motivation: To predict and interpret social gaze patterns in conversational interactions more accurately, the work studies how combining contextual cues with visual scene and facial information improves prediction. Method: Proposes CSGaze, which fuses facial and scene information as complementary inputs and introduces a fine-grained attention mechanism centered on the principal speaker to better model social gaze dynamics. Result: On GP-Static, UCO-LAEO, and AVA-LAEO, performance is on par with existing state-of-the-art methods; the results confirm the role of contextual cues, attention scores provide initial explainability, and the model generalizes well to open-set datasets. Conclusion: Contextual information plays an important role in social gaze prediction; CSGaze improves prediction performance and interpretability through multimodal fusion and attention. Abstract: A person's gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context-aware multimodal approach that leverages facial and scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps in better modeling social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insights into the model's decision-making process. We also demonstrate our model's generalizability by testing it on open-set datasets, demonstrating its robustness across diverse scenarios.
[152] Adaptive Agent Selection and Interaction Network for Image-to-point cloud Registration
Zhixin Cheng,Xiaotian Yin,Jiacheng Deng,Bohao Liao,Yujia Chen,Xu Zhou,Baoqun Yin,Tianzhu Zhang
Main category: cs.CV
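TL;DR: Proposes a new cross-modal registration framework with Iterative Agents Selection (IAS) and Reliable Agents Interaction (RAI) modules, which uses reinforcement learning and phase maps to improve feature awareness and matching robustness, reaching SOTA performance on RGB-D Scenes v2 and 7-Scenes.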
Details
Motivation: Existing detection-free image-to-point cloud registration methods tend to produce incorrect correspondences under noise and lack an effective mechanism for selecting correlated cross-modal features, which limits robustness and accuracy.
Method: The IAS module uses phase maps to enhance structural awareness and reinforcement learning to select reliable agents; the RAI module then uses these agents to guide cross-modal interaction and reduce mismatches.
Result: Experiments on the RGB-D Scenes v2 and 7-Scenes datasets show that the method outperforms existing methods in registration accuracy and robustness, reaching the state of the art.
Conclusion: The proposed framework improves image-to-point cloud registration in complex environments, with strong noise resistance and accurate feature matching.
Abstract: Typical detection-free methods for image-to-point cloud registration leverage transformer-based architectures to aggregate cross-modal features and establish correspondences. However, they often struggle under challenging conditions, where noise disrupts similarity computation and leads to incorrect correspondences. Moreover, without dedicated designs, it remains difficult to effectively select informative and correlated representations across modalities, thereby limiting the robustness and accuracy of registration. To address these challenges, we propose a novel cross-modal registration framework composed of two key modules: the Iterative Agents Selection (IAS) module and the Reliable Agents Interaction (RAI) module. IAS enhances structural feature awareness with phase maps and employs reinforcement learning principles to efficiently select reliable agents. RAI then leverages these selected agents to guide cross-modal interactions, effectively reducing mismatches and improving overall robustness. Extensive experiments on the RGB-D Scenes v2 and 7-Scenes benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
[153] Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory
Yuxuan Lin,Hanjing Yan,Xuan Tong,Yang Chang,Huanzhen Wang,Ziheng Zhou,Shuyong Gao,Yan Wang,Wenqiang Zhang
Main category: cs.CV
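TL;DR: This paper proposes CIF, a few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality. It models higher-order correlations with hypergraphs and uses a memory bank to extract intra-class structural priors from training samples, outperforming existing state-of-the-art methods on the MVTec 3D-AD and Eyecandies datasets.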
Details
Motivation: In few-shot settings, the training samples are too few to cover the diversity of test samples, which degrades anomaly detection performance.
Method: CIF uses a semantic-aware hypergraph construction module to extract structural commonality from training samples and a memory bank to store the intra-class structural prior; a training-free hypergraph message passing module updates test-sample features to narrow the distribution gap; a hyperedge-guided memory search module uses structural information to reduce the false positive rate.
Result: On the MVTec 3D-AD and Eyecandies datasets, CIF outperforms existing state-of-the-art methods in few-shot settings and markedly reduces the false positive rate.
Conclusion: By mining structural commonality from a small number of normal samples, CIF improves few-shot multimodal industrial anomaly detection and has good application prospects.
Abstract: Few-shot multimodal industrial anomaly detection is a critical yet underexplored task, offering the ability to quickly adapt to complex industrial scenarios. In few-shot settings, insufficient training samples often fail to cover the diverse patterns present in test samples. This challenge can be mitigated by extracting structural commonality from a small number of training samples. In this paper, we propose a novel few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality, CIF (Commonality In Few). To extract intra-class structural information, we employ hypergraphs, which are capable of modeling higher-order correlations, to capture the structural commonality within training samples, and use a memory bank to store this intra-class structural prior. First, we design a semantic-aware hypergraph construction module tailored for single-semantic industrial images, from which we extract common structures to guide the construction of the memory bank. Second, we use a training-free hypergraph message passing module to update the visual features of test samples, reducing the distribution gap between test features and features in the memory bank. We further propose a hyperedge-guided memory search module, which utilizes structural information to assist the memory search process and reduce the false positive rate. Experimental results on the MVTec 3D-AD dataset and the Eyecandies dataset show that our method outperforms the state-of-the-art (SOTA) methods in few-shot settings.
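The memory-bank idea mentioned in the CIF abstract can be illustrated with a minimal sketch. This is the generic nearest-neighbor memory search used by memory-bank anomaly detectors, not CIF itself: the hypergraph construction, message passing, and hyperedge-guided search are omitted, and the function names `build_memory_bank` and `anomaly_score` are hypothetical.

```python
import math

def build_memory_bank(normal_features):
    """Store feature vectors extracted from the few normal training samples."""
    return [list(feature) for feature in normal_features]

def anomaly_score(feature, memory_bank):
    """Distance from a test feature to its nearest stored normal feature.

    A large distance means nothing similar was seen among the normal samples.
    """
    if not memory_bank:
        raise ValueError("memory bank is empty")
    return min(math.dist(feature, stored) for stored in memory_bank)

# Toy 2-D features standing in for real patch embeddings (an assumption).
bank = build_memory_bank([[0.0, 0.0], [1.0, 0.0]])
normal_like = anomaly_score([0.0, 0.0], bank)
outlier = anomaly_score([4.0, 4.0], bank)
```

In this toy setting the feature that matches a stored sample scores 0, while the distant one scores the Euclidean distance to its closest neighbor in the bank.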
Code is available at https://github.com/Sunny5250/CIF.
[154] Adapted Foundation Models for Breast MRI Triaging in Contrast-Enhanced and Non-Contrast Enhanced Protocols
Tri-Thien Nguyen,Lorenz A. Kapsner,Tobias Hepp,Shirin Heidarikahkesh,Hannes Schreiter,Luise Brock,Dominika Skwierawska,Dominique Hadler,Julian Hossbach,Evelyn Wenkel,Sabine Ohlmeyer,Frederik B. Laun,Andrzej Liebert,Andreas Maier,Michael Uder,Sebastian Bickelhaupt
Main category: cs.CV
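TL;DR: This study evaluates the DINOv2-based Medical Slice Transformer (MST) for ruling out significant findings (BI-RADS ≥4) in abbreviated breast MRI. At 97.5% sensitivity, specificity was 19% for contrast-enhanced and 17% for non-contrast-enhanced MRI, indicating potential as a pre-screening aid.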
Details
Motivation: Breast MRI is highly sensitive for breast cancer detection but time-consuming to interpret, so artificial intelligence tools that assist pre-screening are needed to improve efficiency.
Method: A retrospective study on in-house and external datasets (2,771 single-breast MRI examinations in total) evaluated the MST model under four abbreviated MRI protocols (T1sub, DWI1500, DWI1500+T2w, T1sub+T2w), using five-fold cross-validation and AUC analysis, with the DeLong test to compare AUC differences.
Result: T1sub+T2w achieved the highest AUC (0.77) and 19% specificity at 97.5% sensitivity; DWI1500+T2w reached 17% specificity. Missed lesions were mostly non-mass enhancements with a diameter below 10 mm. External validation yielded an AUC of 0.77, and 88% of attention maps were rated good or moderate.
Conclusion: At high sensitivity, the MST framework can rule out cases without significant findings and has potential as a pre-screening tool, but further research is needed before clinical use.
Abstract: Background: Magnetic resonance imaging (MRI) has high sensitivity for breast cancer detection, but interpretation is time-consuming. Artificial intelligence may aid in pre-screening. Purpose: To evaluate the DINOv2-based Medical Slice Transformer (MST) for ruling out significant findings (Breast Imaging Reporting and Data System [BI-RADS] >=4) in contrast-enhanced and non-contrast-enhanced abbreviated breast MRI. Materials and Methods: This institutional review board-approved retrospective study included 1,847 single-breast MRI examinations (377 BI-RADS >=4) from an in-house dataset and 924 from an external validation dataset (Duke). Four abbreviated protocols were tested: T1-weighted early subtraction (T1sub), diffusion-weighted imaging with b=1500 s/mm² (DWI1500), DWI1500+T2-weighted (T2w), and T1sub+T2w. Performance was assessed at 90%, 95%, and 97.5% sensitivity using five-fold cross-validation and area under the receiver operating characteristic curve (AUC) analysis. AUC differences were compared with the DeLong test. False negatives were characterized, and attention maps of true positives were rated in the external dataset. Results: A total of 1,448 female patients (mean age, 49 +/- 12 years) were included. T1sub+T2w achieved an AUC of 0.77 +/- 0.04; DWI1500+T2w, 0.74 +/- 0.04 (p=0.15). At 97.5% sensitivity, T1sub+T2w had the highest specificity (19% +/- 7%), followed by DWI1500+T2w (17% +/- 11%). Missed lesions had a mean diameter below 10 mm at 95% and 97.5% thresholds for both T1sub and DWI1500, predominantly non-mass enhancements.
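Reporting specificity at a fixed sensitivity, as the results above do, amounts to choosing the strictest score threshold that still catches the required share of positive cases and then counting how many negatives fall below it. The sketch below shows that calculation on made-up scores; the function name and the toy data are assumptions for illustration, not the study's evaluation code.

```python
def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Return (threshold, sensitivity, specificity) for the strictest
    threshold whose sensitivity meets the target.

    A case is called positive when score >= threshold; labels are 1
    (significant finding) or 0 (no significant finding).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    # Walk candidate thresholds from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        if sensitivity >= target_sensitivity:
            specificity = sum(s < threshold for s in negatives) / len(negatives)
            return threshold, sensitivity, specificity
    raise AssertionError("unreachable: the lowest score gives sensitivity 1.0")

# Hypothetical model scores for eight examinations (not data from the study).
toy_scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.50, 0.30, 0.10]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
high = specificity_at_sensitivity(toy_scores, toy_labels, 0.975)
relaxed = specificity_at_sensitivity(toy_scores, toy_labels, 0.75)
```

On the toy data, demanding 97.5% sensitivity forces the threshold down to the lowest-scoring positive case, which is why specificity at such operating points tends to be modest.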
External validation yielded an AUC of 0.77, with 88% of attention maps rated good or moderate. Conclusion: At 97.5% sensitivity, the MST framework correctly triaged cases without BI-RADS >=4, achieving 19% specificity for contrast-enhanced and 17% for non-contrast-enhanced MRI. Further research is warranted before clinical implementation.
[155] DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities
Nagur Shareef Shaik,Teja Krishna Cherukuri,Adnan Masood,Dong Hye Ye
Main category: cs.CV
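TL;DR: Proposes the DiA-gnostic VLVAE model, which achieves robust radiology report generation through disentangled alignment and handles missing modalities and feature entanglement.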
Details
Motivation: Existing automated methods face two major challenges with clinical data, missing modalities and feature entanglement, which lead to poor fusion and hallucinated findings that are not clinically faithful.
Method: A Mixture-of-Experts-based Vision-Language Variational Autoencoder (VLVAE) disentangles shared and modality-specific features and enforces their orthogonality and alignment through constrained optimization; a compact LLaMA-X decoder generates the reports.
Result: The model achieves BLEU@4 scores of 0.266 and 0.134 on the IU X-Ray and MIMIC-CXR datasets, respectively, significantly outperforming current state-of-the-art models.
Conclusion: DiA-gnostic VLVAE improves the robustness and clinical reliability of multimodal medical report generation, especially when modalities are incomplete.
Abstract: The integration of medical images with clinical context is essential for generating accurate and clinically interpretable radiology reports. However, current automated methods often rely on resource-heavy Large Language Models (LLMs) or static knowledge graphs and struggle with two fundamental challenges in real-world clinical data: (1) missing modalities, such as incomplete clinical context, and (2) feature entanglement, where mixed modality-specific and shared information leads to suboptimal fusion and clinically unfaithful hallucinated findings. To address these challenges, we propose the DiA-gnostic VLVAE, which achieves robust radiology reporting through Disentangled Alignment. Our framework is designed to be resilient to missing modalities by disentangling shared and modality-specific features using a Mixture-of-Experts (MoE) based Vision-Language Variational Autoencoder (VLVAE). A constrained optimization objective enforces orthogonality and alignment between these latent representations to prevent suboptimal fusion. A compact LLaMA-X decoder then uses these disentangled representations to generate reports efficiently. On the IU X-Ray and MIMIC-CXR datasets, DiA has achieved competitive BLEU@4 scores of 0.266 and 0.134, respectively. Experimental results show that the proposed method significantly outperforms state-of-the-art models.
[156] Runtime Safety Monitoring of Deep Neural Networks for Perception: A Survey
Albert Schotschneider,Svetlana Pavlitska,J. Marius Zöllner
Main category: cs.CV
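TL;DR: This paper surveys runtime safety monitoring methods for deep neural networks in safety-critical applications, covering three groups of methods (inputs, internal representations, and outputs) and analyzing their strengths, weaknesses, and the safety concerns they address.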
Details
Motivation: Deep neural networks are widely used in safety-critical domains such as autonomous driving but are vulnerable to threats such as out-of-distribution inputs and adversarial attacks, so effective runtime monitoring is needed to ensure safety.
Method: Existing runtime safety monitoring methods are categorized into three groups (inputs, internal representations, and outputs), and the technical progress, strengths, and limitations of each group are analyzed systematically.
Result: The survey reviews the state of the art in runtime monitoring, clarifies which safety concerns each class of methods addresses, and maps methods to concerns.
Conclusion: Runtime safety monitoring is an important means of improving the safety of DNN systems; open challenges such as generalization, real-time performance, and robustness remain to be solved.
Abstract: Deep neural networks (DNNs) are widely used in perception systems for safety-critical applications, such as autonomous driving and robotics. However, DNNs remain vulnerable to various safety concerns, including generalization errors, out-of-distribution (OOD) inputs, and adversarial attacks, which can lead to hazardous failures. This survey provides a comprehensive overview of runtime safety monitoring approaches, which operate in parallel to DNNs during inference to detect these safety concerns without modifying the DNN itself. We categorize existing methods into three main groups: monitoring inputs, internal representations, and outputs. We analyze the state-of-the-art for each category, identify strengths and limitations, and map methods to the safety concerns they address. In addition, we highlight open challenges and future research directions.
[157] A Dual-Mode ViT-Conditioned Diffusion Framework with an Adaptive Conditioning Bridge for Breast Cancer Segmentation
Prateek Singh,Moumita Dholey,P. K. Vinod
Main category: cs.CV
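TL;DR: Proposes a lesion segmentation method for breast ultrasound images based on a conditional denoising diffusion model. It combines a ViT encoder with an improved UNet decoder and introduces an Adaptive Conditioning Bridge, a Topological Denoising Consistency loss, and a dual-head structure, reaching SOTA performance on several public datasets.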
Details
Motivation: Conventional convolutional deep learning models struggle to capture enough global context in breast ultrasound segmentation, producing anatomically inconsistent results, and are strongly affected by low contrast, speckle noise, and blurred boundaries.
Method: A conditional denoising diffusion model uses a Vision Transformer encoder to extract global features and an enhanced UNet as the generative decoder. An Adaptive Conditioning Bridge (ACB) fuses multi-scale semantic features, a Topological Denoising Consistency (TDC) loss regularizes training, and a dual-head architecture balances high accuracy with efficient inference.
Result: The method achieves Dice scores of 0.96, 0.90, and 0.97 on the public BUSI, BrEaST, and BUS-UCLM breast ultrasound datasets, respectively, clearly outperforming existing methods; ablation studies validate each component.
Conclusion: By combining a diffusion model with a Transformer, the method produces accurate and anatomically plausible segmentations of breast ultrasound images, with good robustness and generalization, and offers a new direction for medical image segmentation.
Abstract: In breast ultrasound images, precise lesion segmentation is essential for early diagnosis; however, low contrast, speckle noise, and unclear boundaries make this difficult. Even though deep learning models have demonstrated potential, standard convolutional architectures frequently fall short in capturing enough global context, resulting in segmentations that are anatomically inconsistent. To overcome these drawbacks, we propose a flexible, conditional Denoising Diffusion Model that combines an enhanced UNet-based generative decoder with a Vision Transformer (ViT) encoder for global feature extraction. We introduce three primary innovations: 1) an Adaptive Conditioning Bridge (ACB) for efficient, multi-scale fusion of semantic features; 2) a novel Topological Denoising Consistency (TDC) loss component that regularizes training by penalizing structural inconsistencies during denoising; and 3) a dual-head architecture that leverages the denoising objective as a powerful regularizer, enabling a lightweight auxiliary head to perform rapid and accurate inference on smaller datasets and a noise prediction head. Our framework establishes a new state-of-the-art on public breast ultrasound datasets, achieving Dice scores of 0.96 on BUSI, 0.90 on BrEaST and 0.97 on BUS-UCLM. Comprehensive ablation studies empirically validate that the model components are critical for achieving these results and for producing segmentations that are not only accurate but also anatomically plausible.
[158] Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds
Xianhui Meng,Yukang Huo,Li Zhang,Liu Liu,Haonan Jiang,Yan Zhong,Pingrui Zhang,Cewu Lu,Jun Liu
Main category: cs.CV
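TL;DR: Proposes PPF-Tracker, a new pose tracking framework based on Point Pair Features for tracking articulated objects in SE(3) space. It combines semantic information with kinematic constraints and shows strong robustness and generalization in both synthetic and real-world scenes.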
Details
Motivation: Pose tracking of articulated objects is challenging because of their inherent kinematic constraints, and existing methods handle their complex motion structure poorly, so a more general and robust tracking framework is needed.
Method: PPF-Tracker first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, then models the object with Point Pair Features (PPF) to predict pose voting parameters, and finally incorporates semantic information about joint axes to impose unified kinematic constraints.
Result: PPF-Tracker is evaluated systematically on synthetic datasets and in real-world scenes; the results show strong performance, robustness, and cross-environment generalization in multi-frame articulated object pose tracking.
Conclusion: PPF-Tracker addresses key challenges in articulated object pose tracking and may advance robotics, embodied intelligence, and augmented reality.
Abstract: Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed PPF-Tracker. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. Codes are available at https://github.com/mengxh20/PPFTracker.
[159] MALeR: Improving Compositional Fidelity in Layout-Guided Generation
Shivank Saxena,Dhruv Srivastava,Makarand Tapaswi
Main category: cs.CV
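TL;DR: This paper proposes MALeR, a method that improves layout control in text-to-image generation for scenes with multiple subjects and attributes. It prevents subjects from spilling outside their layouts, attribute leakage, and generation artifacts, and markedly improves the accuracy and consistency of the results.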
Details
Motivation: Existing layout-guided text-to-image methods often place subjects outside the layout regions, confuse attributes, and produce unnatural images when handling complex scenes with multiple subjects and attributes; they lack precise control over such compositional scenes.
Method: MALeR combines a layout constraint mechanism that keeps subjects inside the specified layouts with a masked, attribute-aware binding mechanism that stops attributes from leaking between subjects, so that each subject is bound to its own attributes while the generated image stays natural and in-distribution.
Result: In qualitative and quantitative evaluation, MALeR outperforms previous methods in compositional accuracy, generation consistency, and attribute binding, and is particularly good at complex scenes with multiple subjects and multiple attributes per subject.
Conclusion: MALeR addresses subject spillover, attribute leakage, and image distortion in layout-guided generation and markedly improves the quality and controllability of text-to-image generation for complex compositional scenes.
Abstract: Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
[160] How Reasoning Influences Intersectional Biases in Vision Language Models
Adit Desai,Sudipta Roy,Mohna Chakraborty
Main category: cs.CV
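TL;DR: The study analyzes social biases in five open-source vision language models (VLMs) on an occupation prediction task, shows that their reasoning patterns are systematically biased, and stresses the need to align them with human values before deployment.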
Details
Motivation: Vision language models are widely used in downstream tasks, but their training data often contain social biases, so their outputs diverge from human reasoning and may harm fairness in real applications.
Method: On the FairFace dataset, five open-source VLMs are analyzed systematically across 32 occupations and three prompting styles, collecting both predictions and reasoning.
Result: The reasoning patterns of VLMs are systematically biased; these biases produce intersectional disparities and differ from the way humans interpret images through contextual and social cues.
Conclusion: VLM reasoning must be aligned with human values before deployment to reduce the spread of social bias and its negative effect on downstream tasks.
Abstract: Vision Language Models (VLMs) are increasingly deployed across downstream tasks, yet their training data often encode social biases that surface in outputs. Unlike humans, who interpret images through contextual and social cues, VLMs process them through statistical associations, often leading to reasoning that diverges from human reasoning. By analyzing how a VLM reasons, we can understand how inherent biases are perpetuated and can adversely affect downstream performance. To examine this gap, we systematically analyze social biases in five open-source VLMs for an occupation prediction task on the FairFace dataset. Across 32 occupations and three different prompting styles, we elicit both predictions and reasoning. Our findings reveal that biased reasoning patterns systematically underlie intersectional disparities, highlighting the need to align VLM reasoning with human values prior to its downstream deployment.
[161] Distributed Deep Learning for Medical Image Denoising with Data Obfuscation
Sulaimon Oyeniyi Adebayo,Ayaz H. Khan
Main category: cs.CV
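TL;DR: This study explores distributed deep learning for denoising chest X-ray images, training U-Net and U-Net++ models with DistributedDataParallel and mixed precision for efficiency. U-Net++ shows better structural fidelity, and the optimized training strategy markedly speeds up training.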
Details
Motivation: Medical image denoising is essential for improving image quality while protecting the privacy of sensitive clinical data. Conventional single-GPU training is inefficient and scales poorly to large datasets, so an efficient, scalable distributed training scheme is needed.
Method: The NIH Chest X-ray14 dataset is obfuscated lightly by adding Gaussian noise; U-Net and U-Net++ architectures are trained and evaluated under single-GPU, multi-GPU (DataParallel), and optimized DistributedDataParallel (DDP) plus Automatic Mixed Precision (AMP) configurations.
Result: U-Net++ scores better on PSNR and SSIM, indicating stronger structural recovery, but worse on LPIPS, indicating lower perceptual similarity. The optimized DDP+AMP strategy is more than 60% faster than single-GPU training and more than 40% faster than standard DataParallel, with only a slight drop in accuracy.
Conclusion: Combining advanced distributed training (such as DDP+AMP), a suitable architecture (such as U-Net++), and lightweight data anonymization greatly improves training efficiency while preserving denoising quality, supporting the feasibility of the approach for practical medical image processing.
Abstract: Medical image denoising is essential for improving image quality while minimizing the exposure of sensitive information, particularly when working with large-scale clinical datasets. This study explores distributed deep learning for denoising chest X-ray images from the NIH Chest X-ray14 dataset, using additive Gaussian noise as a lightweight obfuscation technique. We implement and evaluate U-Net and U-Net++ architectures under single-GPU, standard multi-GPU (DataParallel), and optimized multi-GPU training configurations using PyTorch's DistributedDataParallel (DDP) and Automatic Mixed Precision (AMP). Our results show that U-Net++ consistently delivers superior denoising performance, achieving competitive Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) scores, though with weaker Learned Perceptual Image Patch Similarity (LPIPS) performance than U-Net under low and moderate noise levels. This indicates U-Net++'s enhanced structural fidelity and low perceptual similarity. Meanwhile, our optimized training pipeline reduces training time by over 60% for both models compared to single-GPU training, and outperforms standard DataParallel by over 40%, with only a minor accuracy drop for both models (trading some accuracy for speed). These findings highlight the effectiveness of software-level optimization in distributed learning for medical imaging.
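The two ingredients the abstract names for data preparation and evaluation, additive Gaussian noise and PSNR, can be sketched in a few lines. This is a minimal standard-library illustration on flat lists of 8-bit pixel values; the function names, the noise levels, and the synthetic "image" are assumptions, and the study's own pipeline (available at the repository it cites) may differ.

```python
import math
import random

def add_gaussian_noise(pixels, sigma, rng):
    """Add zero-mean Gaussian noise to pixel values and clip to [0, 255]."""
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Synthetic mid-range intensities standing in for an X-ray (an assumption).
clean = [64.0 + (i % 128) for i in range(4096)]
rng = random.Random(0)
lightly_noised = add_gaussian_noise(clean, sigma=5.0, rng=rng)
heavily_noised = add_gaussian_noise(clean, sigma=25.0, rng=rng)
psnr_light = psnr(clean, lightly_noised)
psnr_heavy = psnr(clean, heavily_noised)
```

Stronger noise lowers PSNR against the clean reference, which is the sense in which a denoiser's output is scored: the closer it gets back to the clean image, the higher the PSNR.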
This work demonstrates the practical viability of combining architectural design, lightweight obfuscation, and advanced distributed training strategies to accelerate and enhance medical image processing pipelines in real-world clinical and research environments. The full implementation is publicly available at: https://github.com/Suadey/medical-image-denoising-ddp.
[162] One-Shot Knowledge Transfer for Scalable Person Re-Identification
Longhua Li,Lei Qi,Xin Geng
Main category: cs.CV
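TL;DR: Proposes OSKT, a one-shot knowledge transfer method that compresses the teacher model's knowledge into a weight chain, which can be expanded to target models under different resource constraints without additional computation, and which clearly outperforms existing compression methods.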
Details
Motivation: Conventional compression methods repeat their computation for every student model when several models of different sizes are needed, which is inefficient and ill-suited to the varying resource conditions of edge computing.
Method: OSKT (One-Shot Knowledge Transfer) is a knowledge inheritance method that consolidates the teacher model's knowledge into an intermediate carrier, the weight chain, which can be expanded on demand into target models of different sizes without repeating training or compression.
Result: OSKT clearly outperforms current state-of-the-art compression methods under a range of resource constraints and needs only a single knowledge transfer, avoiding repeated computation for each target model.
Conclusion: OSKT provides an efficient and flexible model compression solution for person re-identification in edge computing, with good practicality and scalability.
Abstract: Edge computing in person re-identification (ReID) is crucial for reducing the load on central cloud servers and ensuring user privacy. Conventional compression methods for obtaining compact models require computations for each individual student model. When multiple models of varying sizes are needed to accommodate different resource conditions, this leads to repetitive and cumbersome computations. To address this challenge, we propose a novel knowledge inheritance approach named OSKT (One-Shot Knowledge Transfer), which consolidates the knowledge of the teacher model into an intermediate carrier called a weight chain. When a downstream scenario demands a model that meets specific resource constraints, this weight chain can be expanded to the target model size without additional computation. OSKT significantly outperforms state-of-the-art compression methods, with the added advantage of one-time knowledge transfer that eliminates the need for frequent computations for each target model.
[163] MiVID: Multi-Strategic Self-Supervision for Video Frame Interpolation using Diffusion Model
Priyansh Srivastava,Romit Chatterjee,Abir Sen,Aradhana Behura,Ratnakar Dash
Main category: cs.CV
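TL;DR: This paper proposes MiVID, a lightweight, self-supervised diffusion model for video frame interpolation that needs neither optical flow estimation nor high-frame-rate supervision and, even when trained in a low-resource setting, performs on par with supervised methods.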
Details
Motivation: Existing video frame interpolation methods handle occlusions, domain shifts, and motion blur poorly and depend on dense ground-truth labels or explicit optical flow estimation, which limits their generalization and practicality.
Method: MiVID uses a 3D U-Net backbone with transformer-style temporal attention, is trained in a self-supervised way under a hybrid masking strategy that simulates occlusions and motion uncertainty, and introduces cosine-based progressive masking and adaptive loss scheduling to strengthen spatiotemporal representation learning.
Result: Evaluated on the UCF101-7 and DAVIS-7 datasets and trained on CPU only with 9-frame segments, the model reaches its best performance at 50 epochs and is comparable to several supervised baselines.
Conclusion: The work shows that self-supervised diffusion priors are effective for generating temporally coherent frames and offers a feasible path toward low-cost, scalable, and general video frame interpolation systems.
Abstract: Video Frame Interpolation (VFI) remains a cornerstone in video enhancement, enabling temporal upscaling for tasks like slow-motion rendering, frame rate conversion, and video restoration. While classical methods rely on optical flow and learning-based models assume access to dense ground-truth, both struggle with occlusions, domain shifts, and ambiguous motion. This article introduces MiVID, a lightweight, self-supervised, diffusion-based framework for video interpolation. Our model eliminates the need for explicit motion estimation by combining a 3D U-Net backbone with transformer-style temporal attention, trained under a hybrid masking regime that simulates occlusions and motion uncertainty. The use of cosine-based progressive masking and adaptive loss scheduling allows our network to learn robust spatiotemporal representations without any high-frame-rate supervision. Our framework is evaluated on UCF101-7 and DAVIS-7 datasets. MiVID is trained entirely on CPU using the datasets and 9-frame video segments, making it a low-resource yet highly effective pipeline. Despite these constraints, our model achieves optimal results at just 50 epochs, competitive with several supervised baselines. This work demonstrates the power of self-supervised diffusion priors for temporally coherent frame synthesis and provides a scalable path toward accessible and generalizable VFI systems.
[164] Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era
Feng Lu,Tong Jin,Canming Ye,Yunpeng Liu,Xiangyuan Lan,Chun Yuan
Main category: cs.CV
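TL;DR: This paper proposes a new approach to visual place recognition (VPR) that needs no dedicated aggregator: learnable aggregation tokens implicitly aggregate patch features through the transformer's self-attention, so the backbone alone produces robust global descriptors. Empirical studies determine the token insertion position and initialization strategy, and the method achieves state-of-the-art performance and efficiency on several VPR datasets.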
Details
Motivation: Conventional VPR methods follow a backbone-plus-dedicated-aggregator paradigm, but in the transformer era such explicit aggregation may no longer be necessary. The authors explore whether the separate aggregation module can be removed so that the backbone alone extracts efficient, robust global descriptors.
Method: Learnable aggregation tokens are prepended to the patch tokens of the input image and processed jointly with them by self-attention in the chosen transformer blocks, so that information is aggregated implicitly; the aggregation tokens in the output are concatenated as the global representation. The best token insertion position and initialization method are derived from experiments.
Result: The method outperforms existing state-of-the-art methods on several VPR datasets with higher efficiency and ranks first on the MSLS challenge leaderboard.
Conclusion: With a transformer architecture, no dedicated aggregation module is needed: learnable aggregation tokens and self-attention suffice to produce global descriptors, which simplifies the VPR model and improves performance.
Abstract: Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies.
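The implicit aggregation described above (prepend a few extra tokens, let self-attention mix them with the patch tokens, then read only those tokens back out) can be shown with a toy example. The sketch below is a deliberate simplification and an assumption throughout: it uses one attention layer with identity query/key/value projections, plain Python lists, and fixed rather than learned aggregation tokens, whereas the paper works inside a transformer backbone.

```python
import math

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return outputs

def implicit_aggregate(patch_tokens, aggregation_tokens):
    """Prepend aggregation tokens, run attention, concatenate their outputs."""
    sequence = aggregation_tokens + patch_tokens
    mixed = self_attention(sequence)
    descriptor = []
    for token in mixed[:len(aggregation_tokens)]:
        descriptor.extend(token)
    return descriptor

# Three toy 2-D patch tokens and two hypothetical aggregation tokens.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
agg_tokens = [[0.5, 0.5], [0.1, -0.1]]
global_descriptor = implicit_aggregate(patches, agg_tokens)
```

The descriptor length is the number of aggregation tokens times the token dimension, so the output size is set by how many tokens are prepended rather than by a separate pooling module.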
Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
[165] S2ML: Spatio-Spectral Mutual Learning for Depth Completion
Zihui Zhao,Yifei Zhang,Zheng Wang,Yang Li,Kui Jiang,Zihan Geng,Chia-Wen Lin
Main category: cs.CV
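TL;DR: Proposes Spatio-Spectral Mutual Learning (S2ML), a framework that combines the strengths of the spatial and frequency domains for depth completion of RGB-D images. By exploiting the properties of the amplitude and phase spectra and the correlation between spatial and frequency features, it outperforms existing methods on the NYU-Depth V2 and SUN RGB-D datasets.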
Details
Motivation: Existing depth completion methods work mainly in the spatial domain and overlook the physical characteristics of raw depth maps, such as the effect of invalid regions on the frequency distribution, which limits completion quality.
Method: S2ML fuses spatial-domain and frequency-domain information: a dedicated spectral fusion module handles the amplitude and phase spectra, local and global correlations between spatial and frequency features are modeled in a unified embedding space, and gradual mutual learning lets the two sets of features refine each other.
Result: On the NYU-Depth V2 and SUN RGB-D datasets, S2ML outperforms the current best method, CFormer, by 0.828 dB and 0.834 dB, respectively.
Conclusion: By jointly using spatial and frequency features and accounting for the physical characteristics of raw depth maps, S2ML achieves more accurate depth completion and provides high-quality depth for downstream vision tasks.
Abstract: The raw depth images captured by RGB-D cameras using Time-of-Flight (TOF) or structured light often suffer from incomplete depth values due to weak reflections, boundary shadows, and artifacts, which limit their applications in downstream vision tasks. Existing methods address this problem through depth completion in the image domain, but they overlook the physical characteristics of raw depth images. It has been observed that the presence of invalid depth areas alters the frequency distribution pattern. In this work, we propose a Spatio-Spectral Mutual Learning framework (S2ML) to harmonize the advantages of both spatial and frequency domains for depth completion. Specifically, we consider the distinct properties of amplitude and phase spectra and devise a dedicated spectral fusion module. Meanwhile, the local and global correlations between spatial-domain and frequency-domain features are calculated in a unified embedding space. The gradual mutual representation and refinement encourage the network to fully explore complementary physical characteristics and priors for more accurate depth completion. Extensive experiments demonstrate the effectiveness of our proposed S2ML method, outperforming the state-of-the-art method CFormer by 0.828 dB and 0.834 dB on the NYU-Depth V2 and SUN RGB-D datasets, respectively.
[166] StreamSTGS: Streaming Spatial and Temporal Gaussian Grids for Real-Time Free-Viewpoint Video
Zhihui Ke,Yuyang Liu,Xiaobo Zhou,Tie Qiu
Main category: cs.CV
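TL;DR: Proposes StreamSTGS, a new representation for real-time streaming of free-viewpoint video that uses efficient compression and adaptive bitrate control to cut frame size sharply while keeping high performance.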
Details
Motivation: Existing 3DGS-based free-viewpoint video methods need too much storage (up to 10 MB per frame) for real-time streaming.
Method: StreamSTGS represents a dynamic scene with canonical 3D Gaussians, temporal features, and a deformation field; it encodes Gaussian attributes as 2D images and temporal features as a video, and introduces a sliding window scheme and a transformer-guided auxiliary module to model local and global motion, respectively.
Result: The method performs well on multiple benchmarks, raising PSNR by 1 dB on average and reducing the average frame size to 170 KB, and it supports real-time streaming with adaptive bitrate control.
Conclusion: StreamSTGS markedly improves compression efficiency and transmission feasibility while keeping rendering quality high, making it an efficient solution for real-time free-viewpoint video streaming.
Abstract: Streaming free-viewpoint video (FVV) in real-time still faces significant challenges, particularly in training, rendering, and transmission efficiency. Harnessing the superior performance of 3D Gaussian Splatting (3DGS), recent 3DGS-based FVV methods have achieved notable breakthroughs in both training and rendering. However, the storage requirements of these methods can reach up to 10 MB per frame, making real-time FVV streaming impossible. To address this problem, we propose a novel FVV representation, dubbed StreamSTGS, designed for real-time streaming. StreamSTGS represents a dynamic scene using canonical 3D Gaussians, temporal features, and a deformation field. For high compression efficiency, we encode canonical Gaussian attributes as 2D images and temporal features as a video. This design not only enables real-time streaming, but also inherently supports adaptive bitrate control based on network condition without any extra training. Moreover, we propose a sliding window scheme to aggregate adjacent temporal features to learn local motions, and then introduce a transformer-guided auxiliary training module to learn global motions. On diverse FVV benchmarks, StreamSTGS demonstrates competitive performance on all metrics compared to state-of-the-art methods. Notably, StreamSTGS increases the PSNR by an average of 1 dB while reducing the average frame size to just 170 KB. The code is publicly available at https://github.com/kkkzh/StreamSTGS.
[167] Neodragon: Mobile Video Generation using Diffusion Transformer
Animesh Karnewar,Denis Korzhenkov,Ioannis Lelekas,Adil Karjauv,Noor Fathima,Hanwen Xiong,Vancheeswaran Vaidyanathan,Will Zeng,Rafael Esteves,Tushar Singhal,Fatih Porikli,Mohsen Ghafoorian,Amirhossein Habibian
Main category: cs.CV
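TL;DR: Neodragon is the first text-to-video generation system optimized specifically for mobile devices. It generates a 2-second video at 640x1024 resolution on a Qualcomm Hexagon NPU in 6.7 seconds, with high efficiency, low memory use, and low latency.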
Details
Motivation: To enable efficient, private, high-quality video generation on mobile devices without reliance on cloud services, lowering the barrier to AI video creation.
Method: Four key techniques are used: (1) text-encoder distillation to adopt the smaller DT5 encoder; (2) an asymmetric decoder distillation method; (3) importance-based pruning of MMDiT blocks with two-stage distillation to recover performance; (4) step distillation with DMD to reduce the NFE. These are combined with optimized image generation and super-resolution models into an end-to-end system.
Result: The full model has 4.945B parameters, a peak memory use of 3.5 GB, an end-to-end latency of 6.7 seconds (7 FPS), and a VBench total score of 81.61.
Conclusion: Neodragon delivers high-performance, low-resource text-to-video generation on mobile devices and helps make on-device AI video creation widely accessible.
Abstract: We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos at the 640x1024 resolution directly on a Qualcomm Hexagon NPU in a record 6.7s (7 FPS). Differing from existing transformer-based offline text-to-video generation models, Neodragon is the first to have been specifically optimised for mobile hardware to achieve efficient and high-fidelity video synthesis. We achieve this through four key technical contributions: (1) Replacing the original large 4.762B T5xxl Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabled through a novel Text-Encoder Distillation procedure. (2) Proposing an Asymmetric Decoder Distillation approach allowing us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the generation pipeline. (3) Pruning of MMDiT blocks within the denoiser backbone based on their relative importance, with recovery of original performance through a two-stage distillation process. (4) Reducing the NFE (Neural Functional Evaluation) requirement of the denoiser by performing step distillation using DMD adapted for pyramidal flow-matching, thereby substantially accelerating video generation. When paired with an optimised SSD1B first-frame image generator and QuickSRNet for 2x super-resolution, our end-to-end Neodragon system becomes a highly parameter (4.945B full model), memory (3.5GB peak RAM usage), and runtime (6.7s E2E latency) efficient mobile-friendly model, while achieving a VBench total score of 81.61.
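The headline figures quoted above are related by simple arithmetic, which the short sketch below works through. Only the input numbers (49 frames, 24 fps playback, 6.7 s latency, 4.762B and 0.2B encoder sizes) come from the abstract; the function name and the derived quantities are illustrative.

```python
def throughput_summary(frames, playback_fps, latency_s,
                       old_encoder_b, new_encoder_b):
    """Derive clip length, generation rate, and size ratios from reported figures."""
    clip_seconds = frames / playback_fps      # length of the generated clip
    generation_fps = frames / latency_s       # frames produced per second of compute
    return {
        "clip_seconds": clip_seconds,
        "generation_fps": generation_fps,
        # How many seconds of compute are spent per second of video.
        "slowdown_vs_realtime": latency_s / clip_seconds,
        # Size ratio between the original and distilled text encoders.
        "encoder_shrink": old_encoder_b / new_encoder_b,
    }

summary = throughput_summary(frames=49, playback_fps=24, latency_s=6.7,
                             old_encoder_b=4.762, new_encoder_b=0.2)
```

With these inputs the clip is slightly over 2 s long, the generation rate is a little above 7 frames per second (matching the quoted "7 FPS"), and the distilled text encoder is roughly 24 times smaller than the original.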
By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services. Code and model will be made publicly available at our website: https://qualcomm-ai-research.github.io/neodragon[168] LoopExpose: An Unsupervised Framework for Arbitrary-Length Exposure Correction
Ao Li,Chen Chen,Zhenyu Wang,Tao Huang,Fangfang Wu,Weisheng Dong
Main category: cs.CV
TL;DR: Proposes LoopExpose, a pseudo-label-based unsupervised method for arbitrary-length exposure correction; with a nested-loop optimization strategy and a luminance ranking loss, it outperforms existing unsupervised methods without needing large amounts of labeled data.
Details
Motivation: Existing supervised methods depend on large-scale labeled datasets, which are hard to obtain in practice, so an unsupervised exposure-correction method that needs no labeled data is required.
Method: LoopExpose uses a nested-loop optimization framework: the upper level trains a correction model on pseudo-labels, and the lower level generates those pseudo-labels through multi-exposure fusion, with a feedback mechanism that feeds corrected images back to refine the pseudo-labels. A luminance ranking loss uses the relative luminance ordering of the input sequence as a self-supervised constraint.
Result: Experiments on multiple benchmark datasets show that LoopExpose outperforms existing state-of-the-art unsupervised methods on both exposure correction and fusion.
Conclusion: By jointly optimizing the correction model and the pseudo-supervision, LoopExpose forms a self-reinforcing learning loop and performs strongly on unsupervised exposure correction, with good application prospects.
Abstract: Exposure correction is essential for enhancing image quality under challenging lighting conditions. While supervised learning has achieved significant progress in this area, it relies heavily on large-scale labeled datasets, which are difficult to obtain in practical scenarios. To address this limitation, we propose a pseudo-label-based unsupervised method called LoopExpose for arbitrary-length exposure correction. A nested loop optimization strategy is proposed to address the exposure correction problem, where the correction model and pseudo-supervised information are jointly optimized in a two-level framework. Specifically, the upper level trains a correction model using pseudo-labels generated through multi-exposure fusion at the lower level. A feedback mechanism is introduced where corrected images are fed back into the fusion process to refine the pseudo-labels, creating a self-reinforcing learning loop. Considering the dominant role of luminance calibration in exposure correction, a Luminance Ranking Loss is introduced to leverage the relative luminance ordering across the input sequence as a self-supervised constraint. Extensive experiments on different benchmark datasets demonstrate that LoopExpose achieves superior exposure correction and fusion performance, outperforming existing state-of-the-art unsupervised methods. Code is available at https://github.com/FALALAS/LoopExpose.
[169] An Artificial Intelligence-based Assistant for the Visually Impaired
Luis Marquez-Carpintero,Francisco Gomez-Donoso,Zuria Bauer,Bessie Dominguez-Dager,Alvaro Belmonte-Baeza,Mónica Pina-Navarro,Francisco Morillas-Espejo,Felix Escalona,Miguel Cazorla
Main category: cs.CV
TL;DR: Introduces AIDEN, an AI assistant application that uses advanced machine learning to help visually impaired people identify objects, read text, and understand their surroundings, improving their independence and quality of life.
Details
Motivation: Visually impaired people face daily challenges in identifying objects, reading text, and navigating; existing assistive technologies are of limited use in some situations, so smarter solutions are needed.
Method: Uses You Only Look Once (YOLO) architectures and a Large Language and Vision Assistant model, combined with several interaction methods, to perceive the environment in real time and report information back to the user.
Result: AIDEN can identify objects, read text, and answer users' questions about the environment; user feedback indicates better access to information and everyday usability.
Conclusion: AIDEN helps strengthen visually impaired users' autonomy and improve their quality of life, showing the practical potential of AI in assistive technology.
Abstract: This paper describes an artificial intelligence-based assistant application, AIDEN, developed during 2023 and 2024, aimed at improving the quality of life for visually impaired individuals. Visually impaired individuals face challenges in identifying objects, reading text, and navigating unfamiliar environments, which can limit their independence and reduce their quality of life. Although solutions such as Braille, audio books, and screen readers exist, they may not be effective in all situations. This application leverages state-of-the-art machine learning algorithms to identify and describe objects, read text, and answer questions about the environment. Specifically, it uses You Only Look Once architectures and a Large Language and Vision Assistant. The system incorporates several methods to facilitate the user's interaction with the system and access to textual and visual information in an appropriate manner. AIDEN aims to enhance user autonomy and access to information, contributing to an improved perception of daily usability, as supported by user feedback.
[170] Hybrid CNN-ViT Framework for Motion-Blurred Scene Text Restoration
Umar Rashid,Muhammad Arslan Arshad,Ghulam Ahmad,Muhammad Zeeshan Anjum,Rizwan Khan,Muhammad Akmal
Main category: cs.CV
TL;DR: Proposes a hybrid deep learning framework combining CNNs and vision transformers to restore motion-blurred scene-text images, achieving effective deblurring while remaining lightweight.
Details
Motivation: Existing deblurring methods struggle with spatially varying blur and lack the long-range dependency modeling needed to restore text clarity.
Method: A CNN encoder-decoder preserves local structural detail, and a transformer module adds global awareness through self-attention. The model is trained on a synthetic blur dataset built from TextOCR and optimized with a composite loss of MAE, MSE, perceptual loss, and SSIM.
Result: The method reaches 32.20 dB PSNR and 0.934 SSIM in quantitative evaluation, with only 2.83 million parameters and an average inference time of 61 ms.
Conclusion: The proposed CNN-ViT hybrid restores motion-blurred text images well, combining high performance with high efficiency, and is suitable for real-world use.
Abstract: Motion blur in scene text images severely impairs readability and hinders the reliability of computer vision tasks, including autonomous driving, document digitization, and visual information retrieval. Conventional deblurring approaches are often inadequate in handling spatially varying blur and typically fall short in modeling the long-range dependencies necessary for restoring textual clarity. To overcome these limitations, we introduce a hybrid deep learning framework that combines convolutional neural networks (CNNs) with vision transformers (ViTs), thereby leveraging both local feature extraction and global contextual reasoning. The architecture employs a CNN-based encoder-decoder to preserve structural details, while a transformer module enhances global awareness through self-attention. Training is conducted on a curated dataset derived from TextOCR, where sharp scene-text samples are paired with synthetically blurred versions generated using realistic motion-blur kernels of multiple sizes and orientations. Model optimization is guided by a composite loss that incorporates mean absolute error (MAE), mean squared error (MSE), perceptual similarity, and structural similarity (SSIM). Quantitative evaluations show that the proposed method attains 32.20 dB in PSNR and 0.934 in SSIM, while remaining lightweight with 2.83 million parameters and an average inference time of 61 ms.
These results highlight the effectiveness and computational efficiency of the CNN-ViT hybrid design, establishing its practicality for real-world motion-blurred scene-text restoration.
[171] DiLO: Disentangled Latent Optimization for Learning Shape and Deformation in Grouped Deforming 3D Objects
Mostofa Rafid Uddin,Jana Armouti,Umong Sain,Md Asib Rahman,Xingjian Li,Min Xu
Main category: cs.CV
TL;DR: Proposes a disentangled latent optimization method that decomposes grouped deforming 3D objects into shape and deformation factors in an unsupervised manner.
Details
Motivation: Obtain a disentangled representation of 3D object shape and deformation without supervision, to better support shape analysis and editing.
Method: Jointly optimizes a generator network together with the shape and deformation factors, using specific regularization techniques; in a second stage, trains two order-invariant PointNet-based encoder networks for efficient disentangled inference.
Result: Experiments on 3D human, animal, and facial expression datasets show strong results on unsupervised deformation transfer, deformation classification, and explainability analysis, comparable or superior to more complex methods.
Conclusion: The proposed simple method effectively disentangles shape and deformation and performs well across several downstream tasks.
Abstract: In this work, we propose a disentangled latent optimization-based method for parameterizing grouped deforming 3D objects into shape and deformation factors in an unsupervised manner. Our approach involves the joint optimization of a generator network along with the shape and deformation factors, supported by specific regularization techniques. For efficient amortized inference of disentangled shape and deformation codes, we train two order-invariant PointNet-based encoder networks in the second stage of our method. We demonstrate several significant downstream applications of our method, including unsupervised deformation transfer, deformation classification, and explainability analysis. Extensive experiments conducted on 3D human, animal, and facial expression datasets demonstrate that our simple approach is highly effective in these downstream tasks, comparable or superior to existing methods with much higher complexity.
[172] Latent Refinement via Flow Matching for Training-free Linear Inverse Problem Solving
Hossein Askari,Yadan Luo,Hongfu Sun,Fred Roosta
Main category: cs.CV
TL;DR: Proposes LFlow, a training-free framework built on pretrained latent flow priors that solves linear inverse problems efficiently via flow matching in latent space, and introduces a theoretically grounded posterior covariance to improve trajectory alignment and posterior coverage.
Details
Motivation: Existing flow-based inverse solvers operate in pixel space, which is computationally expensive and hard to scale to high-resolution images, and they use guidance strategies with prior-agnostic posterior covariances, which weaken alignment with the generative trajectory and reduce posterior coverage.
Method: LFlow uses a pretrained latent flow prior and performs ODE sampling in latent space via flow matching along an optimal path to solve the inverse problem; a theoretically grounded posterior covariance, derived from the optimal vector field, provides effective flow guidance.
Result: Experiments show that LFlow achieves better reconstruction quality than state-of-the-art latent diffusion solvers on most tasks.
Conclusion: By combining flow matching in latent space with a theoretically motivated guidance strategy, LFlow improves the efficiency and reconstruction quality of inverse problem solving, with good scalability and application prospects.
Abstract: Recent advances in inverse problem solving have increasingly adopted flow priors over diffusion models due to their ability to construct straight probability paths from noise to data, thereby enhancing efficiency in both training and inference. However, current flow-based inverse solvers face two primary limitations: (i) they operate directly in pixel space, which demands heavy computational resources for training and restricts scalability to high-resolution images, and (ii) they employ guidance strategies with prior-agnostic posterior covariances, which can weaken alignment with the generative trajectory and degrade posterior coverage. In this paper, we propose LFlow (Latent Refinement via Flows), a training-free framework for solving linear inverse problems via pretrained latent flow priors. LFlow leverages the efficiency of flow matching to perform ODE sampling in latent space along an optimal path. This latent formulation further allows us to introduce a theoretically grounded posterior covariance, derived from the optimal vector field, enabling effective flow guidance. Experimental results demonstrate that our proposed method outperforms state-of-the-art latent diffusion solvers in reconstruction quality across most tasks. The code will be publicly available at https://github.com/hosseinaskari-cs/LFlow.
[173] Real-Time Bundle Adjustment for Ultra-High-Resolution UAV Imagery Using Adaptive Patch-Based Feature Tracking
Selim Ahmet Iz,Francesco Nex,Norman Kerle,Henry Meissner,Ralf Berger
Main category: cs.CV
TL;DR: Proposes a real-time bundle adjustment framework for UAV imagery that needs no downsampling; patch-based processing and sliding-window optimization allow full-resolution images to be processed in real time while keeping high accuracy.
Details
Motivation: Conventional bundle adjustment either takes too long on high-resolution UAV imagery or sacrifices detail, so it cannot meet the needs of time-critical applications such as disaster response.
Method: Each image is divided into user-defined patches, which are tracked dynamically across frames using UAV GNSS/IMU data and a coarse digital surface model. Overlap relationships are determined in real time during flight, and localized bundle adjustment is run only on a sliding cluster of overlapping images.
Result: Validated on 50MP imagery, the method completes full bundle adjustment in under 2 seconds without GPU acceleration, while keeping camera pose estimates accurate and consistent across multiple flight strips.
Conclusion: The lightweight framework enables real-time, high-fidelity large-area mapping on onboard systems, suited to disaster response, infrastructure monitoring, and similar scenarios.
Abstract: Real-time processing of UAV imagery is crucial for applications requiring urgent geospatial information, such as disaster response, where rapid decision-making and accurate spatial data are essential. However, processing high-resolution imagery in real time presents significant challenges due to the computational demands of feature extraction, matching, and bundle adjustment (BA). Conventional BA methods either downsample images, sacrificing important details, or require extensive processing time, making them unsuitable for time-critical missions. To overcome these limitations, we propose a novel real-time BA framework that operates directly on full-resolution UAV imagery without downsampling. Our lightweight, onboard-compatible approach divides each image into user-defined patches (e.g., NxN grids, default 150x150 pixels) and dynamically tracks them across frames using UAV GNSS/IMU data and a coarse, globally available digital surface model (DSM). This ensures spatial consistency for robust feature extraction and matching between patches. Overlapping relationships between images are determined in real time using the UAV navigation system, enabling the rapid selection of relevant neighbouring images for localized BA. By limiting optimization to a sliding cluster of overlapping images, including those from adjacent flight strips, the method achieves real-time performance while preserving the accuracy of global BA. The proposed algorithm is designed for seamless integration into the DLR Modular Aerial Camera System (MACS), supporting large-area mapping in real time for disaster response, infrastructure monitoring, and coastal protection.
Validation on MACS datasets with 50MP images demonstrates that the method maintains precise camera orientations and high-fidelity mapping across multiple strips, running full bundle adjustment in under 2 seconds without GPU acceleration.
[174] MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution
Hua Chang,Xin Xu,Wei Liu,Wei Wang,Xin Yuan,Kui Jiang
Main category: cs.CV
TL;DR: Proposes MambaOVSR, a Mamba-based multiscale fusion network that addresses large-motion modeling and high-frequency detail recovery in space-time super-resolution of Chinese opera videos, and builds the large-scale opera video dataset COVC; experiments show it clearly outperforms existing methods.
Details
Motivation: Because of early filming equipment, opera videos of renowned last-century artists are of low quality; existing video super-resolution methods handle the large motions and high-frequency details of opera videos poorly, and no dedicated dataset exists.
Method: Proposes the MambaOVSR network, comprising a Global Fusion Module (GFM), a Multiscale Synergistic Mamba Module (MSMM), and a MambaVR block, trained and validated on the newly built large-scale Chinese Opera Video Clip (COVC) dataset.
Result: On the COVC dataset, MambaOVSR improves average PSNR by 1.86 dB over the current best STVSR method and noticeably improves visual quality.
Conclusion: MambaOVSR effectively addresses large-motion modeling and detail recovery in opera video super-resolution, supporting the digital archiving of traditional art, with good application prospects.
Abstract: Chinese opera is celebrated for preserving classical art. However, early filming equipment limitations have degraded videos of last-century performances by renowned artists (e.g., low frame rates and resolution), hindering archival efforts. Although space-time video super-resolution (STVSR) has advanced significantly, applying it directly to opera videos remains challenging. The scarcity of datasets impedes the recovery of high-frequency details, and existing STVSR methods lack global modeling capabilities, compromising visual quality when handling opera's characteristic large motions. To address these challenges, we pioneer a large-scale Chinese Opera Video Clip (COVC) dataset and propose the Mamba-based multiscale fusion network for space-time Opera Video Super-Resolution (MambaOVSR). Specifically, MambaOVSR involves three novel components: the Global Fusion Module (GFM) for motion modeling through a multiscale alternating scanning mechanism; the Multiscale Synergistic Mamba Module (MSMM) for alignment across different sequence lengths; and the MambaVR block, which resolves feature artifacts and positional information loss during alignment. Experimental results on the COVC dataset show that MambaOVSR significantly outperforms the SOTA STVSR method by an average of 1.86 dB in terms of PSNR. Dataset and code will be publicly released.
[175] NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling
Muhammad Usama,Mohammad Sadil Khan,Didier Stricker,Muhammad Zeshan Afzal
Main category: cs.CV
TL;DR: NURBGen is the first framework to generate high-fidelity 3D CAD models from natural language, using NURBS to produce editable CAD representations directly.
Details
Motivation: Existing text-to-CAD systems mostly produce meshes or rely on scarce design-history data, making it hard to generate editable, high-precision CAD models.
Method: Fine-tunes a large language model to translate free-form text into a JSON representation containing NURBS surface parameters, and proposes a hybrid representation that combines untrimmed NURBS with analytic primitives to improve robustness and reduce token complexity.
Result: Performs strongly on diverse text prompts, with better geometric fidelity and dimensional accuracy than prior methods, as confirmed by expert evaluation.
Conclusion: NURBGen generates editable CAD models from text efficiently, advancing intelligent manufacturing and design automation.
Abstract: Generating editable 3D CAD models from natural language remains challenging, as existing text-to-CAD systems either produce meshes or rely on scarce design-history data. We present NURBGen, the first framework to generate high-fidelity 3D CAD models directly from text using Non-Uniform Rational B-Splines (NURBS). To achieve this, we fine-tune a large language model (LLM) to translate free-form texts into JSON representations containing NURBS surface parameters (i.e., control points, knot vectors, degrees, and rational weights) which can be directly converted into BRep format using Python. We further propose a hybrid representation that combines untrimmed NURBS with analytic primitives to handle trimmed surfaces and degenerate regions more robustly, while reducing token complexity. Additionally, we introduce partABC, a curated subset of the ABC dataset consisting of individual CAD components, annotated with detailed captions using an automated annotation pipeline. NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy, as confirmed by expert evaluations. Code and dataset will be released publicly.
[176] Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models
Rodrigo Gallardo,Oz Fishman,Alexander Htet Kyaw
Main category: cs.CV
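TL;DR: Proposes a human-in-the-loop computer vision framework that uses generative AI to suggest micro-scale design interventions in public space, supporting more continuous, local public participation.
Details
Motivation: Move beyond traditional top-down master planning by grounding urban design in everyday patterns and lived experience, closer to community needs.
Method: Detects urban objects using Grounding DINO and a subset of the ADE20K dataset, and builds co-occurrence embeddings that reveal common spatial configurations. The system offers five statistically likely complements to a chosen anchor object, and a vision language model then reasons over the scene to suggest a third object that completes a more complex urban tactic.
Result: Given a scene image and the user's selected object pair, the system recommends a plausible third object, supporting more informed, participatory design decisions.
Conclusion: By keeping people in the design loop and combining generative AI with real-environment data, the framework promotes a more dynamic and responsive approach to urban micro-renewal.
Abstract: This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives five statistically likely complements to a chosen anchor object. A vision language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.
[177] MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition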
Shu Zhao,Nilesh Ahuja,Tan Yu,Tianyi Shen,Vijaykrishnan Narayanan
Main category: cs.CV
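TL;DR: Proposes MoRA, a parameter-efficient fine-tuning method for multimodal recognition with pretrained vision language models when modalities are missing. MoRA introduces parameters shared across modalities for bidirectional knowledge transfer while keeping modality-specific parameters for intra-modality flexibility, and clearly outperforms existing methods in accuracy, inference efficiency, and trainable parameter count.
Details
Motivation: Modalities are often missing in real-world settings (for example, because of privacy or resource limits), and existing methods model cross-modal relationships poorly while adding computational overhead, so a fine-tuning method is needed that preserves cross-modal interaction and stays efficient.
Method: MoRA introduces modality-common parameters between the text and vision encoders to transfer knowledge across modalities, and keeps modality-specific parameters to adapt to each modality's features, modeling cross-modal interaction with fewer trainable parameters.
Result: On standard benchmarks, MoRA improves average performance in missing-modality scenarios by 5.24%, uses only 25.90% of the inference time of the current best method, and needs only 0.11% of the trainable parameters of full fine-tuning.
Conclusion: MoRA substantially improves multimodal recognition under missing modalities while keeping a very small parameter count and high inference efficiency, balancing model performance against computational cost.
Abstract: Pre-trained vision language models have shown remarkable performance on visual recognition tasks, but they typically assume the availability of complete multimodal inputs during both training and inference. In real-world scenarios, however, modalities may be missing due to privacy constraints, collection difficulties, or resource limitations. While previous approaches have addressed this challenge using prompt learning techniques, they fail to capture the cross-modal relationships necessary for effective multimodal visual recognition and suffer from inevitable computational overhead. In this paper, we introduce MoRA, a parameter-efficient fine-tuning method that explicitly models cross-modal interactions while maintaining modality-specific adaptations. MoRA introduces modality-common parameters between text and vision encoders, enabling bidirectional knowledge transfer. Additionally, combined with the modality-specific parameters, MoRA allows the backbone model to maintain inter-modality interaction and enable intra-modality flexibility. Extensive experiments on standard benchmarks demonstrate that MoRA improves average performance in missing-modality scenarios by 5.24% and uses only 25.90% of the inference time of the SOTA method, while requiring only 0.11% of the trainable parameters of full fine-tuning.
[178] Temporal-Guided Visual Foundation Models for Event-Based Vision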
Ruihao Xia,Junhong Cai,Luziwei Leng,Liuyi Wang,Chengju Liu,Ran Cheng,Yang Tang,Pan Zhou
Main category: cs.CV
TL;DR: Proposes Temporal-Guided VFM (TGVFM), a framework that combines Visual Foundation Models (VFMs) with a temporal context fusion block for event-camera data, clearly outperforming existing methods on semantic segmentation, depth estimation, and object detection.
Details
Motivation: Existing methods for event-camera data rely on specialized architectures or resource-intensive training, while the potential of image-pretrained visual foundation models in this domain remains under-explored.
Method: Designs a temporal context fusion block with long-range temporal attention, dual spatiotemporal attention, and a deep feature guidance mechanism, and combines event-to-video models with transformer-based VFMs to model event streams efficiently.
Result: Validated on real-world data, TGVFM improves over existing methods by 16%, 21%, and 16% on semantic segmentation, depth estimation, and object detection, respectively, reaching state-of-the-art performance.
Conclusion: TGVFM bridges the modality gap between image-based visual foundation models and event cameras, showing the potential of pretrained models for event-based vision tasks.
Abstract: Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under-explored for event-based vision. To address this, we propose Temporal-Guided VFM (TGVFM), a novel framework that seamlessly integrates VFMs with our temporal context fusion block to bridge this gap. Our temporal block introduces three key components: (1) Long-Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi-scale frame correlation, and (3) Deep Feature Guidance Mechanism to fuse semantic-temporal features. By retraining event-to-video models on real-world data and leveraging transformer-based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate SoTA performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross-modality potential of image-based VFMs for event-based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.
[179] Physics-Informed Image Restoration via Progressive PDE Integration
Shamika Likhite,Santiago López-Tapia,Aggelos K. Katsaggelos
Main category: cs.CV
TL;DR: Proposes a progressive training framework that incorporates physics-informed partial differential equations (PDEs) for motion deblurring; modeling feature evolution with advection-diffusion equations captures the directionality of motion blur and enables global spatial modeling, markedly improving deblurring across several state-of-the-art networks.
Details
Motivation: Existing deep learning-based motion deblurring methods struggle to model the long-range spatial dependencies in motion blur; conventional convolutional methods have limited receptive fields, require extremely deep networks, and lack physical priors to guide feature evolution.
Method: Proposes a progressive training framework that integrates physics-informed PDE dynamics into state-of-the-art deblurring networks, using advection-diffusion equations to model feature evolution so as to capture the directional flow of motion blur and provide principled global spatial modeling.
Result: The method significantly improves PSNR and SSIM across architectures including FFTformer, NAFNet, Restormer, and Stripformer, with better perceptual quality and only about 1% additional inference GMACs.
Conclusion: Incorporating mathematical physics principles such as PDEs into deep learning architectures can effectively improve image restoration, offering a new direction for physics-informed neural network design in computer vision.
Abstract: Motion blur, caused by relative movement between camera and scene during exposure, significantly degrades image quality and impairs downstream computer vision tasks such as object detection, tracking, and recognition in dynamic environments. While deep learning-based motion deblurring methods have achieved remarkable progress, existing approaches face fundamental challenges in capturing the long-range spatial dependencies inherent in motion blur patterns. Traditional convolutional methods rely on limited receptive fields and require extremely deep networks to model global spatial relationships. These limitations motivate the need for alternative approaches that incorporate physical priors to guide feature evolution during restoration. In this paper, we propose a progressive training framework that integrates physics-informed PDE dynamics into state-of-the-art restoration architectures. By leveraging advection-diffusion equations to model feature evolution, our approach naturally captures the directional flow characteristics of motion blur while enabling principled global spatial modeling. Our PDE-enhanced deblurring models achieve superior restoration quality with minimal overhead, adding only approximately 1% to inference GMACs while providing consistent improvements in perceptual quality across multiple state-of-the-art architectures. Comprehensive experiments on standard motion deblurring benchmarks demonstrate that our physics-informed approach improves PSNR and SSIM significantly across four diverse architectures, including FFTformer, NAFNet, Restormer, and Stripformer.
These results validate that incorporating mathematical physics principles through PDE-based global layers can enhance deep learning-based image restoration, establishing a promising direction for physics-informed neural network design in computer vision applications.
[180] Gait Recognition via Collaborating Discriminative and Generative Diffusion Models
Haijun Xiong,Bin Feng,Bang Wang,Xinggang Wang,Wenyu Liu
Main category: cs.CV
TL;DR: Proposes CoD², a gait recognition framework that combines the strengths of diffusion models and discriminative models; a multi-level conditional control strategy generates identity-consistent, detail-rich gait sequences and markedly improves recognition performance.
Details
Motivation: Although discriminative models perform well in gait recognition, the potential of generative models is largely unexplored, motivating new methods that combine the strengths of both.
Method: The CoD² framework uses a multi-level conditional control strategy: high-level semantic information from the discriminative extractor guides generation, while low-level visual details such as appearance and motion are preserved, and the generated data in turn strengthens the discriminative model's learning.
Result: Experiments on the SUSTech1K, CCPG, GREW, and Gait3D datasets show that CoD² achieves state-of-the-art performance and integrates seamlessly with existing discriminative methods, yielding consistent improvements.
Conclusion: CoD² effectively combines the data-distribution modeling ability of generative models with the semantic learning ability of discriminative models, providing a strong and general framework for gait recognition.
Abstract: Gait recognition offers a non-intrusive biometric solution by identifying individuals through their walking patterns. Although discriminative models have achieved notable success in this domain, the full potential of generative models remains largely underexplored. In this paper, we introduce CoD$^2$, a novel framework that combines the data distribution modeling capabilities of diffusion models with the semantic representation learning strengths of discriminative models to extract robust gait features. We propose a Multi-level Conditional Control strategy that incorporates both high-level identity-aware semantic conditions and low-level visual details. Specifically, the high-level condition, extracted by the discriminative extractor, guides the generation of identity-consistent gait sequences, whereas low-level visual details, such as appearance and motion, are preserved to enhance consistency. Furthermore, the generated sequences facilitate the discriminative extractor's learning, enabling it to capture more comprehensive high-level semantic features. Extensive experiments on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) demonstrate that CoD$^2$ achieves state-of-the-art performance and can be seamlessly integrated with existing discriminative methods, yielding consistent improvements.
[181] AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
Ruifei Zhang,Junlin Xie,Wei Zhang,Weikai Chen,Xiao Tan,Xiang Wan,Guanbin Li
Main category: cs.CV
TL;DR: AdaDrive is an adaptive slow-fast collaborative framework that dynamically decides when to activate a large language model (LLM) and how to fuse its output, balancing high-level reasoning with real-time efficiency in autonomous driving.
Details
Motivation: Existing approaches that apply LLMs to autonomous driving either activate them too often, incurring excessive computational overhead, or use fixed schedules that cannot adapt to dynamic driving conditions.
Method: The AdaDrive framework (1) uses an adaptive activation loss, based on a comparative learning mechanism, to decide dynamically when to invoke the LLM; and (2) applies an adaptive fusion strategy that continuously scales the LLM's influence according to scene complexity and prediction confidence, enabling seamless collaboration with conventional planners.
Result: Experiments on language-grounded autonomous driving benchmarks show that AdaDrive reaches state-of-the-art driving accuracy and computational efficiency.
Conclusion: AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy while preserving real-time performance.
Abstract: Effectively integrating Large Language Models (LLMs) into autonomous driving requires a balance between leveraging high-level reasoning and maintaining real-time efficiency. Existing approaches either activate LLMs too frequently, causing excessive computational overhead, or use fixed schedules, failing to adapt to dynamic driving conditions. To address these challenges, we propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. (1) When to activate the LLM: AdaDrive employs a novel adaptive activation loss that dynamically determines LLM invocation based on a comparative learning mechanism, ensuring activation only in complex or critical scenarios. (2) How to integrate LLM assistance: Instead of rigid binary activation, AdaDrive introduces an adaptive fusion strategy that modulates a continuous, scaled LLM influence based on scene complexity and prediction confidence, ensuring seamless collaboration with conventional planners. Through these strategies, AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance. Extensive experiments on language-grounded autonomous driving benchmarks demonstrate that AdaDrive achieves state-of-the-art performance in terms of both driving accuracy and computational efficiency. Code is available at https://github.com/ReaFly/AdaDrive.
[182] VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
Ruifei Zhang,Wei Zhang,Xiao Tan,Sibei Yang,Xiang Wan,Xiaonan Luo,Guanbin Li
Main category: cs.CV
TL;DR: Proposes VLDrive, a lightweight multimodal large language model that, through enhanced visual representations and parameter reduction, achieves a better balance of performance and efficiency on autonomous driving tasks.
Details
Motivation: Existing LLM-based autonomous driving methods suffer from weak visual representations and parameter counts too large for practical deployment.
Method: Designs VLDrive, a lightweight MLLM architecture that produces compact visual tokens through cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation, and proposes a distance-decoupled instruction attention mechanism to improve joint visual-linguistic learning for long-range visual tokens.
Result: In the CARLA simulator, VLDrive reduces parameters by 81% (from 7B to 1.3B) and improves closed-loop driving scores by 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, reaching SOTA performance.
Conclusion: VLDrive addresses both weak visual representation and model bloat, combining high performance with low deployment cost and moving language-grounded autonomous driving closer to practical use.
Abstract: Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.
[183] Robust Nearest Neighbour Retrieval Using Targeted Manifold Manipulation
B. Ghosh,H. Harikumar,S. Rana
Main category: cs.CV
TL;DR: Proposes TMM-NN, a nearest-neighbour retrieval method based on targeted manifold manipulation; a query-specific trigger patch steers the network to push similar samples toward a dummy class, and ranking by the resulting confidence gives more robust retrieval of semantic neighbours.
Details
Motivation: Conventional nearest-neighbour retrieval depends on hand-tuning and geometric distance metrics, is unstable under noise, and struggles to capture semantic similarity.
Method: Introduces a lightweight, query-specific trigger patch and weakly "backdoors" the network so that patched inputs are steered toward a dummy class; samples similar to the query respond more strongly and are classified as that class with high confidence, which is used for ranking.
Result: Under noise and across diverse tasks, TMM-NN outperforms traditional distance metrics, showing greater robustness and semantic consistency.
Conclusion: TMM-NN offers a new retrieval paradigm that defines neighbourhoods by responsiveness rather than absolute distance, improving the robustness and interpretability of retrieval.
Abstract: Nearest-neighbour retrieval is central to classification and explainable-AI pipelines, but current practice relies on hand-tuning feature layers and distance metrics. We propose Targeted Manifold Manipulation-Nearest Neighbour (TMM-NN), which reconceptualises retrieval by assessing how readily each sample can be nudged into a designated region of the feature manifold; neighbourhoods are defined by a sample's responsiveness to a targeted perturbation rather than absolute geometric distance. TMM-NN implements this through a lightweight, query-specific trigger patch. The patch is added to the query image, and the network is weakly "backdoored" so that any input with the patch is steered toward a dummy class. Images similar to the query need only a slight shift and are classified as the dummy class with high probability, while dissimilar ones are less affected. By ranking candidates by this confidence, TMM-NN retrieves the most semantically related neighbours. Robustness analysis and benchmark experiments confirm this trigger-based ranking outperforms traditional metrics under noise and across diverse tasks.
[184] A Mixture-of-Experts Framework with Log-Logistic Components for Survival Analysis on Histopathology Images
Ardhendu Sekhar,Vasu Soni,Keshav Aske,Shivam Madnoorkar,Pranav Jeevan,Amit Sethi
Main category: cs.CV
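TL;DR: Proposes a modular framework with four components for predicting cancer-specific survival from whole-slide pathology images, outperforming existing methods on several datasets.
Details
Motivation: Predicting cancer patients' survival more accurately requires extracting prognostically meaningful information from complex pathology images and capturing tissue heterogeneity.
Method: Four parts: quantile-based thresholding to select prognostically informative tissue regions; graph-guided clustering on a k-nearest-neighbor graph to capture phenotypic heterogeneity; hierarchical context attention to learn intra- and inter-cluster interactions; and an expert-driven mixture of log-logistic distributions to estimate complex survival distributions.
Result: Achieves concordance indices of 0.644, 0.751, and 0.752 on the TCGA LUAD, KIRC, and BRCA datasets, respectively, outperforming existing state-of-the-art methods.
Conclusion: The modular framework effectively integrates spatial and morphological information from pathology images and improves cancer survival prediction.
Abstract: We propose a modular framework for predicting cancer-specific survival from whole-slide pathology images (WSIs). The method integrates four components: (i) Quantile-Gated Patch Selection via quantile-based thresholding to isolate prognostically informative tissue regions; (ii) Graph-Guided Clustering using a k-nearest-neighbor graph to capture phenotype-level heterogeneity through spatial and morphological coherence; (iii) Hierarchical Context Attention to learn intra- and inter-cluster interactions; and (iv) an Expert-Driven Mixture of Log-Logistics framework to estimate complex survival distributions using log-logistic distributions. The model attains a concordance index of 0.644 on TCGA LUAD, 0.751 on TCGA KIRC, and 0.752 on TCGA BRCA, outperforming existing state-of-the-art approaches.
[185] LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval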
Jian Zhang,Junyi Guo,Junyi Yuan,Huanda Lu,Yanlin Zhou,Fangyu Wu,Qiufeng Wang,Dongming Lu
Main category: cs.CV
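TL;DR: Proposes $C^3$, a data augmentation framework that markedly improves cross-modal retrieval on cultural heritage data by improving the completeness and consistency of text descriptions generated by large language models.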
Details
Motivation: Cross-modal retrieval on cultural heritage data is limited by incomplete or inconsistent textual descriptions; large language models can enrich these descriptions but are prone to hallucination or to missing visual details. Method: $C^3$ introduces a completeness evaluation module that assesses semantic coverage using visual cues and language-model outputs; it also formulates a Markov Decision Process to supervise Chain-of-Thought reasoning, improving consistency through adaptive query control. Result: Experiments on CulTi, TimeTravel, MSCOCO, and Flickr30K show that $C^3$ achieves state-of-the-art performance in both fine-tuned and zero-shot settings. Conclusion: $C^3$ effectively improves the completeness and factual consistency of text descriptions for cross-modal retrieval, with potential for broad use in cultural heritage and general settings. Abstract: Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose $C^3$, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. $C^3$ introduces a completeness evaluation module to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding consistency evaluation through adaptive query control. Experiments on the cultural heritage datasets CulTi and TimeTravel, as well as on general benchmarks MSCOCO and Flickr30K, demonstrate that $C^3$ achieves state-of-the-art performance in both fine-tuned and zero-shot settings.
[186] RelightMaster: Precise Video Relighting with Multi-plane Light Images
Weikang Bian,Xiaoyu Shi,Zhaoyang Huang,Jianhong Bai,Qinghe Wang,Xintao Wang,Pengfei Wan,Kun Gai,Hongsheng Li
Main category: cs.CV
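TL;DR: Proposes RelightMaster, a framework for precise and controllable video relighting that builds the Unreal Engine-based RelightVideo dataset and introduces the Multi-plane Light Image (MPLI) as a visual prompt to achieve high-quality video lighting edits.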
Details
Motivation: Existing text-to-video models struggle to provide fine-grained lighting control, and high-quality controllable-lighting training data is lacking, which has held back video relighting. Method: 1) Build the RelightVideo dataset, containing videos of identical dynamic content under different precise lighting conditions; 2) propose the Multi-plane Light Image (MPLI) to represent 3D light source positions, intensities, and colors; 3) design a Light Image Adapter that compresses MPLI and injects it as latent features into a pre-trained Video DiT. Result: Experiments show that RelightMaster generates physically plausible lighting and shadows while keeping the original scene content consistent, and performs well under a variety of lighting conditions. Conclusion: RelightMaster achieves precise, controllable video relighting, overcoming the limits of describing lighting in text and the scarcity of training data, and offers a new approach to lighting control in video editing. Abstract: Recent advances in diffusion models enable high-quality video generation and editing, but precise relighting with consistent video contents, which is critical for shaping scene atmosphere and viewer attention, remains unexplored. Mainstream text-to-video (T2V) models lack fine-grained lighting control due to text's inherent limitation in describing lighting details and insufficient pre-training on lighting-related prompts. Additionally, constructing high-quality relighting training data is challenging, as real-world controllable lighting data is scarce. To address these issues, we propose RelightMaster, a novel framework for accurate and controllable video relighting. First, we build RelightVideo, the first dataset with identical dynamic content under varying precise lighting conditions based on the Unreal Engine. Then, we introduce Multi-plane Light Image (MPLI), a novel visual prompt inspired by Multi-Plane Image (MPI). MPLI models lighting via K depth-aligned planes, representing 3D light source positions, intensities, and colors while supporting multi-source scenarios and generalizing to unseen light setups. Third, we design a Light Image Adapter that seamlessly injects MPLI into pre-trained Video Diffusion Transformers (DiT): it compresses MPLI via a pre-trained Video VAE and injects latent light features into DiT blocks, leveraging the base model's generative prior without catastrophic forgetting. Experiments show that RelightMaster generates physically plausible lighting and shadows and preserves original scene content.
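To make the idea of K depth-aligned light planes more concrete, the following is a minimal sketch of how point lights could be rasterized into a stack of planes. The abstract does not specify the MPLI encoding, so the function name build_mpli, the normalized coordinate convention, the nearest-plane assignment, and the single-pixel splat are all illustrative assumptions rather than the authors' method.

```python
def build_mpli(lights, num_planes=4, height=8, width=8, near=0.0, far=1.0):
    """Rasterize point lights into num_planes depth-aligned RGB planes.

    Each light is (x, y, depth, intensity, (r, g, b)) with x, y in [0, 1]
    and depth in [near, far]. This layout is a hypothetical encoding chosen
    for illustration; it is not taken from the RelightMaster paper.
    """
    # planes[k][row][col] holds an RGB triple for depth plane k.
    planes = [[[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
              for _ in range(num_planes)]
    for x, y, depth, intensity, rgb in lights:
        # Pick the plane whose depth bin contains the light.
        t = (depth - near) / (far - near)
        k = min(num_planes - 1, max(0, int(t * num_planes)))
        row = min(height - 1, max(0, int(y * height)))
        col = min(width - 1, max(0, int(x * width)))
        # Several lights can land on the same pixel; contributions add up,
        # which is one simple way to support multi-source scenes.
        for c in range(3):
            planes[k][row][col][c] += intensity * rgb[c]
    return planes


lights = [
    (0.5, 0.5, 0.1, 2.0, (1.0, 0.5, 0.0)),   # warm light close to the camera
    (0.0, 0.99, 0.9, 1.0, (0.0, 0.0, 1.0)),  # blue light far from the camera
]
planes = build_mpli(lights)
```

In this sketch the plane index carries depth, the pixel location carries the image-space position, and the pixel value carries intensity and color, so the whole light setup becomes an image-like tensor that a video VAE could in principle compress.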
Demos are available at https://wkbian.github.io/Projects/RelightMaster/.
[187] LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation
Zijie Wang,Weiming Zhang,Wei Zhang,Xiao Tan,Hongxing Liu,Yaowei Wang,Guanbin Li
Main category: cs.CV
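TL;DR: Proposes LaneDiffusion, a new generative approach to lane centerline graph learning that uses a diffusion model to generate lane priors at the Bird's Eye View (BEV) feature level, substantially improving performance on this path-planning task.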
Details
Motivation: Conventional deterministic methods fall short in spatial reasoning and in handling occluded or invisible centerlines, while generative approaches remain underexplored in this area. Method: LaneDiffusion uses a diffusion model to generate lane centerline priors at the BEV feature level, combines a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to construct diffusion targets and manage the diffusion process, and finally decodes vectorized centerlines and topologies from the prior-injected BEV features. Result: Experiments on nuScenes and Argoverse2 show improvements of 4.2%, 4.6%, 4.7%, 6.4%, and 1.8% on the point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS, SDA) and 2.3%, 6.4%, 6.8%, and 2.1% on the segment-level metrics (IoU, mAP_cf, DET_l, TOP_ll), reaching state-of-the-art performance. Conclusion: LaneDiffusion achieves state-of-the-art performance in centerline graph learning and offers a new perspective on applying generative models to this task. Abstract: Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird's Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP_cf, DET_l and TOP_ll). These results establish state-of-the-art performance in centerline graph learning, offering new insights into generative models for this task.
[188] VideoSSR: Video Self-Supervised Reinforcement Learning
Zefeng He,Xiaoye Qu,Yafu Li,Siyuan Huang,Daizong Liu,Yu Cheng
Main category: cs.CV
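TL;DR: Proposes VideoSSR, a self-supervised reinforcement learning framework built on information intrinsic to videos. Three self-supervised pretext tasks (anomaly grounding, object counting, and temporal jigsaw) automatically generate high-quality, verifiable training data, and the VideoSSR-30K dataset and the VIUBench benchmark are released. Experiments show an average improvement of over 5% across 17 video understanding benchmarks.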
Details
Motivation: The complexity of existing video datasets lags behind the progress of MLLMs, and manual annotation is expensive, so a low-cost way to generate high-quality training data is needed. The paper explores using information inherent in videos to automatically generate verifiable training data and so advance video understanding models. Method: Three self-supervised pretext tasks are proposed: anomaly grounding, object counting, and temporal jigsaw. The VIUBench benchmark is built to assess task difficulty; the VideoSSR-30K dataset is built on these tasks; and the VideoSSR framework combines self-supervised learning with verifiable rewards for reinforcement learning. Result: Current state-of-the-art MLLMs perform poorly on VIUBench, confirming that the tasks are challenging; across 17 benchmarks covering general video QA, long video QA, temporal grounding, and complex reasoning, VideoSSR yields an average improvement of over 5%. Conclusion: VideoSSR successfully uses intrinsic video information to generate high-quality self-supervised data, providing a strong and sustainable foundation for video understanding in multimodal large models. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs.
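The appeal of a pretext task such as Temporal Jigsaw is that the label comes for free from the video itself and the answer can be checked exactly. The sketch below illustrates that idea under simple assumptions: frames are stand-in integers, segments are equal-length and contiguous, and the reward is a 0/1 exact match. The function names and the reward definition are hypothetical and are not taken from the VideoSSR code base.

```python
import random


def make_temporal_jigsaw(frames, num_segments, seed=0):
    """Cut a clip into contiguous segments, shuffle them, and return
    (shuffled_segments, answer).

    answer[j] is the position in the shuffled list of the segment that was
    originally j-th, so reading shuffled segments in the order given by
    answer restores the clip. Trailing frames that do not fill a whole
    segment are dropped in this simplified version.
    """
    seg_len = len(frames) // num_segments
    segments = [frames[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
    order = list(range(num_segments))
    random.Random(seed).shuffle(order)
    shuffled = [segments[i] for i in order]
    answer = [order.index(j) for j in range(num_segments)]
    return shuffled, answer


def jigsaw_reward(predicted, answer):
    """Verifiable reward: 1.0 only if the predicted ordering is exactly right."""
    return 1.0 if list(predicted) == list(answer) else 0.0


frames = list(range(12))  # stand-in for 12 decoded video frames
shuffled, answer = make_temporal_jigsaw(frames, num_segments=4)
restored = [f for j in range(4) for f in shuffled[answer[j]]]
```

Because the ground truth is produced by the shuffling step, no human annotation is needed, and the exact-match check gives a reward signal that cannot be gamed by fluent but wrong answers.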
The code is available at https://github.com/lcqysl/VideoSSR.
[189] From ACR O-RADS 2022 to Explainable Deep Learning: Comparative Performance of Expert Radiologists, Convolutional Neural Networks, Vision Transformers, and Fusion Models in Ovarian Masses
Ali Abbasian Ardakani,Afshin Mohammadi,Alisa Mohebbi,Anushya Vijayananthan,Sook Sam Leong,Lim Yi Ting,Mohd Kamil Bin Mohamad Fabell,U Rajendra Acharya,Sepideh Hatamikia
Main category: cs.CV
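TL;DR: Deep learning models significantly outperform radiologists using O-RADS v2022 alone for ultrasound classification of ovarian adnexal lesions, with a Vision Transformer performing best; hybrid models that combine expert judgment with AI further improve diagnostic accuracy, showing strong potential for human-AI collaboration in standardizing ultrasound interpretation.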
Details
Motivation: O-RADS v2022 improves risk stratification for ovarian adnexal lesions, but human reading still suffers from variability and conservative thresholds, so a more stable and accurate aid is needed to improve diagnostic performance. Method: In a retrospective cohort design, 512 adnexal mass images were included; 16 deep learning models (including CNNs and ViTs) were trained and validated and compared with radiologists' O-RADS assessments, and hybrid human-AI models fusing radiologist scores with AI-predicted probabilities were built. Result: Radiologist-only assessment reached an AUC of 0.683 and an accuracy of 68.0%; CNN models ranged from an AUC of 0.620 to 0.908; ViT16-384 performed best, with an AUC of 0.941 and an accuracy of 87.4%; hybrid models significantly improved CNN performance, but the gain for ViT models was not significant. Conclusion: Deep learning models markedly outperform radiologist-only reading, Vision Transformers in particular; the hybrid framework combining expert opinion with AI reaches the highest diagnostic performance and may help standardize ultrasound reading, reduce false positives, and improve detection of high-risk lesions. Abstract: Background: The 2022 update of the Ovarian-Adnexal Reporting and Data System (O-RADS) ultrasound classification refines risk stratification for adnexal lesions, yet human interpretation remains subject to variability and conservative thresholds. Concurrently, deep learning (DL) models have demonstrated promise in image-based ovarian lesion characterization. This study evaluates radiologist performance applying O-RADS v2022, compares it to leading convolutional neural network (CNN) and Vision Transformer (ViT) models, and investigates the diagnostic gains achieved by hybrid human-AI frameworks. Methods: In this single-center, retrospective cohort study, a total of 512 adnexal mass images from 227 patients (110 with at least one malignant cyst) were included. Sixteen DL models, including DenseNets, EfficientNets, ResNets, VGGs, Xception, and ViTs, were trained and validated. A hybrid model integrating radiologist O-RADS scores with DL-predicted probabilities was also built for each scheme. Results: Radiologist-only O-RADS assessment achieved an AUC of 0.683 and an overall accuracy of 68.0%. CNN models yielded AUCs of 0.620 to 0.908 and accuracies of 59.2% to 86.4%, while ViT16-384 reached the best performance, with an AUC of 0.941 and an accuracy of 87.4%. Hybrid human-AI frameworks further significantly enhanced the performance of CNN models; however, the improvement for ViT models was not statistically significant (P-value >0.05).
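As a worked illustration of the AUC figures reported in the Results, the sketch below computes the area under the ROC curve as the fraction of malignant/benign pairs that a score ranks correctly, and shows one simple way a radiologist score and a model probability could be fused. The abstract does not describe how the hybrid model combines the two inputs, so hybrid_score, its equal weighting, and the assumption that the O-RADS category is an ordinal value from 1 to 5 are purely illustrative, and the data are made up.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one
    (ties count as half), i.e. the Mann-Whitney formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hybrid_score(orads_category, dl_probability, weight=0.5):
    """Hypothetical fusion: weighted average of the O-RADS category
    (assumed ordinal, 1..5, rescaled to [0, 1]) and the DL probability."""
    orads_scaled = (orads_category - 1) / 4.0
    return weight * orads_scaled + (1.0 - weight) * dl_probability


# Toy data, not from the study: 1 = malignant, 0 = benign.
labels = [0, 0, 1, 1]
dl_probs = [0.1, 0.4, 0.35, 0.8]
orads = [2, 3, 4, 5]
auc_dl = roc_auc(labels, dl_probs)
auc_hybrid = roc_auc(labels, [hybrid_score(o, p) for o, p in zip(orads, dl_probs)])
```

In the toy example the model alone misorders one malignant/benign pair, and adding the ordinal reader score repairs it, which is the kind of complementary signal a hybrid scheme relies on.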
Conclusions: DL models markedly outperform radiologist-only O-RADS v2022 assessment, and the integration of expert scores with AI yields the highest diagnostic accuracy and discrimination. Hybrid human-AI paradigms hold substantial potential to standardize pelvic ultrasound interpretation, reduce false positives, and improve detection of high-risk lesions.
[190] TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks
Xuanle Zhao,Shuxin Zeng,Yinyuan Cai,Xiang Cheng,Duzhen Zhang,Xiuyi Chen,Bo Xu
Main category: cs.CV
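TL;DR: Proposes TinyChemVL, an efficient chemical vision-language model that uses visual token reduction and reaction-level tasks to improve efficiency and chemical reasoning, and builds the reaction-level benchmark ChemRxn-V.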
Details
Motivation: Existing vision-language models applied to chemistry overlook key visual information such as molecular structures, and suffer from low computational efficiency and a narrow range of tasks. Method: TinyChemVL adopts a visual token reduction strategy to improve efficiency and introduces reaction-level tasks to strengthen chemical reasoning; the ChemRxn-V benchmark is built to evaluate image-based chemical reaction recognition and prediction. Result: TinyChemVL performs strongly on both molecular and reaction-level tasks, with faster inference and training, and surpasses ChemVLM with only 4B parameters and 1/16 of the visual tokens. Conclusion: By co-designing model architecture and task complexity, TinyChemVL achieves efficient and capable chemical visual understanding and advances vision-language modeling for chemistry. Abstract: While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) the computational inefficiency of processing entire chemical images with non-informative backgrounds; and (ii) a narrow focus on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose TinyChemVL, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. We also propose ChemRxn-V, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.
[191] Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective
Bing Wang,Ximing Li,Yanjun Wang,Changchun Li,Lin Yuanbo Wu,Buyu Wang,Shengsheng Wang
Main category: cs.CV
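TL;DR: Proposes RETSIMD, a new multimodal misinformation detection method that splits the text into segments to generate augmented images and combines a graph neural network with auxiliary objectives to improve detection performance.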
Details
Motivation: In multimodal misinformation detection, text usually carries more information than images, yet existing methods do not fully exploit fine-grained textual information; the paper therefore uses text to generate augmented images that compensate for the limited information in the image modality. Method: The text is split into several segments, which are fed to a pre-trained text-to-image generator to produce a corresponding image sequence; the generator is optimized with auxiliary objectives on text-image and image-label mutual information; a graph is built from three heuristic relationships, and a graph neural network fuses the features. Result: Experiments show that the method performs strongly on multimodal misinformation detection, validating both the text-guided image augmentation strategy and the graph-based modeling. Conclusion: RETSIMD enriches the image modality by generating images from text segments and, combined with a graph neural network, effectively improves multimodal misinformation detection, suggesting a new direction for future work. Abstract: Multimodal Misinformation Detection (MMD) refers to the task of detecting social media posts involving misinformation, where the post often contains text and image modalities. However, by observing the MMD posts, we hold that the text modality may be much more informative than the image modality because the text generally describes the whole event/story of the current post but the image often presents partial scenes only. Our preliminary empirical results indicate that the image modality indeed contributes less to MMD. Building on this idea, we propose a new MMD method named RETSIMD. Specifically, we suppose that each text can be divided into several segments, and each text segment describes a partial scene that can be presented by an image. Accordingly, we split the text into a sequence of segments, and feed these segments into a pre-trained text-to-image generator to augment a sequence of images. We further incorporate two auxiliary objectives concerning text-image and image-label mutual information, and further post-train the generator over an auxiliary text-to-image generation benchmark dataset. Additionally, we propose a graph structure by defining three heuristic relationships between images, and use a graph neural network to generate the fused features. Extensive empirical results validate the effectiveness of RETSIMD.
[192] Learning-Based Vision Systems for Semi-Autonomous Forklift Operation in Industrial Warehouse Environments
Vamshika Sutar,Mahek Maheshwari,Archak Mittal
Main category: cs.CV
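TL;DR: Proposes a single-camera visual perception framework for detecting and mapping pallets and pallet holes for forklifts and automated guided vehicles, combining YOLOv8 with an optimized YOLOv11 model to achieve high accuracy at low cost.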
Details
Motivation: To provide a reliable visual perception solution for low-cost, retrofittable automation of material handling in warehouses, improving logistics safety and efficiency. Method: YOLOv8 and YOLOv11 architectures are combined with Optuna hyperparameter optimization and spatial post-processing, and a new pallet hole mapping module produces actionable spatial representations. Result: On a custom dataset augmented with real warehouse images, YOLOv8 shows high detection accuracy, while the optimized YOLOv11 is better in precision and convergence stability. Conclusion: The results confirm the feasibility of the approach as a low-cost, scalable visual perception module for forklifts, supporting intelligent and economical warehouse automation. Abstract: The automation of material handling in warehouses increasingly relies on robust, low-cost perception systems for forklifts and Automated Guided Vehicles (AGVs). This work presents a vision-based framework for pallet and pallet hole detection and mapping using a single standard camera. We utilized YOLOv8 and YOLOv11 architectures, enhanced through Optuna-driven hyperparameter optimization and spatial post-processing. An innovative pallet hole mapping module converts the detections into actionable spatial representations, enabling accurate pallet and pallet hole association for forklift operation. Experiments on a custom dataset augmented with real warehouse imagery show that YOLOv8 achieves high pallet and pallet hole detection accuracy, while YOLOv11, particularly under optimized configurations, offers superior precision and stable convergence. The results demonstrate the feasibility of a cost-effective, retrofittable visual perception module for forklifts. This study proposes a scalable approach to advancing warehouse automation, promoting safer, economical, and intelligent logistics operations.
[193] SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection
Xin Zuo,Yuchen Qu,Haibo Zhan,Jifeng Shen,Wankou Yang
Main category: cs.CV
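TL;DR: Proposes SFFR, a new spatial-frequency feature reconstruction method that exploits the representational ability of the Kolmogorov-Arnold Network (KAN) in the spatial and frequency domains; the FCEKAN and MSGKAN modules respectively strengthen cross-modal frequency-feature complementarity and multi-scale spatial feature modeling, yielding superior performance in UAV multispectral object detection.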
Details
Motivation: Existing methods focus mainly on spatial-domain feature fusion and overlook the potential of frequency-domain features; they also fail to exploit the frequency-level complementarity of multimodal data and to model scale changes across flight altitudes robustly. Method: SFFR has two core modules: FCEKAN uses a selective frequency component exchange strategy to strengthen the cross-modal consistency of RGB and infrared features in the frequency domain; MSGKAN uses multi-scale Gaussian basis functions for nonlinear feature modeling in the spatial domain, improving adaptability to scale changes. Together they give better feature fusion. Result: Extensive experiments on the SeaDroneSee, DroneVehicle, and DVTOD datasets confirm that the FCEKAN and MSGKAN modules are complementary and that the overall method performs strongly, significantly improving the accuracy and robustness of UAV multispectral object detection. Conclusion: SFFR effectively exploits the complementarity of frequency-domain and spatial-domain features and achieves more efficient multimodal feature fusion through the KAN architecture, showing leading performance in multispectral object detection and suggesting new directions for frequency-based vision models. Abstract: Recent multispectral object detection methods have primarily focused on spatial-domain feature fusion based on CNNs or Transformers, while the potential of frequency-domain features remains underexplored. In this work, we propose a novel Spatial and Frequency Feature Reconstruction (SFFR) method, which leverages the spatial-frequency feature representation mechanisms of the Kolmogorov-Arnold Network (KAN) to reconstruct complementary representations in both spatial and frequency domains prior to feature fusion. The core components of SFFR are the proposed Frequency Component Exchange KAN (FCEKAN) module and Multi-Scale Gaussian KAN (MSGKAN) module. The FCEKAN introduces an innovative selective frequency component exchange strategy that effectively enhances the complementarity and consistency of cross-modal features based on the frequency features of RGB and IR images. The MSGKAN module demonstrates excellent nonlinear feature modeling capability in the spatial domain. By leveraging multi-scale Gaussian basis functions, it effectively captures the feature variations caused by scale changes at different UAV flight altitudes, significantly enhancing the model's adaptability and robustness to scale variations. It is experimentally validated that our proposed FCEKAN and MSGKAN modules are complementary and can effectively capture the frequency and spatial semantic features respectively for better feature fusion.
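To illustrate what "multi-scale Gaussian basis functions" can mean in a KAN-style layer, the sketch below builds one learnable univariate function as a weighted sum of Gaussian bumps evaluated at several widths. The abstract gives no formula for MSGKAN, so the basis definition, the choice of centers and scales, and the names gaussian_basis and kan_edge are assumptions made for illustration only.

```python
import math


def gaussian_basis(x, centers, scales):
    """Evaluate exp(-((x - c) / s)^2) for every (scale, center) pair.

    Narrow scales respond to fine detail around a center, wide scales to
    coarse structure, so the same input is described at several resolutions.
    """
    return [math.exp(-((x - c) / s) ** 2) for s in scales for c in centers]


def kan_edge(x, weights, centers, scales):
    """One KAN-style edge: a univariate function defined as a weighted sum
    of basis responses. In a trained model the weights would be learned."""
    feats = gaussian_basis(x, centers, scales)
    return sum(w * f for w, f in zip(weights, feats))


centers = [-1.0, 0.0, 1.0]
scales = [0.5, 2.0]            # one narrow and one wide family of bumps
feats = gaussian_basis(0.0, centers, scales)
# Put all weight on the narrow bump centered at 0 (index 1).
weights = [0.0, 2.0, 0.0, 0.0, 0.0, 0.0]
value = kan_edge(0.0, weights, centers, scales)
```

The wide-scale bases stay active far from their centers while the narrow ones fall off quickly, which is one plausible reading of how a mix of scales could help with objects whose apparent size changes with flight altitude.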
Extensive experiments on the SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate the superior performance and significant advantages of the proposed method in the UAV multispectral object perception task. Code will be available at https://github.com/qchenyu1027/SFFR.
[194] Physics-Informed Deformable Gaussian Splatting: Towards Unified Constitutive Laws for Time-Evolving Material Field
Haoqin Hong,Ding Fan,Fubin Dou,Zhi-Li Zhou,Haoran Sun,Congcong Zhu,Jingrun Chen
Main category: cs.CV
TL;DR: 提出了一种物理信息驱动的可变形高斯点阵方法(PIDG),通过将每个高斯粒子视为具有时变本构参数的拉格朗日质点,并结合2D光流监督,提升了动态场景中物理一致性和重建质量。
Details
Motivation: 纯数据驱动的3D高斯点阵在处理物理驱动的动态场景运动模式时存在困难,缺乏物理一致性。 Method: 采用静态-动态解耦的4D分解哈希编码高效重建几何与运动;引入柯西动量残差作为物理约束,独立预测粒子速度和本构应力;通过匹配拉格朗日粒子流与相机补偿光流进行监督。 Result: 在自建物理驱动数据集及标准合成与真实数据集上实验表明,该方法显著提升了物理一致性和单目动态重建质量。 Conclusion: PIDG有效融合了物理先验与数据驱动建模,增强了动态场景建模的物理合理性与泛化能力。 Abstract: Recently, 3D Gaussian Splatting (3DGS), an explicit scene representation technique, has shown significant promise for dynamic novel-view synthesis from monocular video input. However, purely data-driven 3DGS often struggles to capture the diverse physics-driven motion patterns in dynamic scenes. To fill this gap, we propose Physics-Informed Deformable Gaussian Splatting (PIDG), which treats each Gaussian particle as a Lagrangian material point with time-varying constitutive parameters and is supervised by 2D optical flow via motion projection. Specifically, we adopt static-dynamic decoupled 4D decomposed hash encoding to reconstruct geometry and motion efficiently. Subsequently, we impose the Cauchy momentum residual as a physics constraint, enabling independent prediction of each particle's velocity and constitutive stress via a time-evolving material field. Finally, we further supervise data fitting by matching Lagrangian particle flow to camera-compensated optical flow, which accelerates convergence and improves generalization. Experiments on a custom physics-driven dataset as well as on standard synthetic and real-world datasets demonstrate significant gains in physical consistency and monocular dynamic reconstruction quality.[195] Adaptive 3D Reconstruction via Diffusion Priors and Forward Curvature-Matching Likelihood Updates
Seunghyeok Shin,Dabin Kim,Hongki Lim
Main category: cs.CV
TL;DR: 提出一种名为Forward Curvature-Matching (FCM)的更新方法,结合扩散采样,动态确定似然更新的最优步长,实现高质量点云重建,支持单视图和多视图输入,无需重新训练。
Details
Motivation: 现有基于生成模型(尤其是扩散模型)的方法在点云重建中存在灵活性差、需固定输入视图数、依赖训练时条件信号且需完全重训练等问题;现有扩散方法使用启发式固定步长导致收敛慢、重建质量次优。 Method: 提出Forward Curvature-Matching (FCM) 更新方法,利用前向自动微分和有限差分曲率估计动态确定最优步长,结合扩散采样进行精确似然优化,支持多种输入模态通过算子替换实现,无需重训练。 Result: 在ShapeNet和CO3D数据集上实验表明,该方法在相同或更低NFE下实现更优重建质量,F-score更高,CD和EMD更低。 Conclusion: FCM方法在无需重训练的情况下,实现了高效、自适应的高质量点云重建,优于现有扩散模型方法,具有良好的实际应用潜力。 Abstract: Reconstructing high-quality point clouds from images remains challenging in computer vision. Existing generative-model-based approaches, particularly diffusion-model approaches that directly learn the posterior, may suffer from inflexibility -- they require conditioning signals during training, support only a fixed number of input views, and need complete retraining for different measurements. Recent diffusion-based methods have attempted to address this by combining prior models with likelihood updates, but they rely on heuristic fixed step sizes for the likelihood update that lead to slow convergence and suboptimal reconstruction quality. We advance this line of approach by integrating our novel Forward Curvature-Matching (FCM) update method with diffusion sampling. Our method dynamically determines optimal step sizes using only forward automatic differentiation and finite-difference curvature estimates, enabling precise optimization of the likelihood update. This formulation enables high-fidelity reconstruction from both single-view and multi-view inputs, and supports various input modalities through simple operator substitution -- all without retraining. Experiments on ShapeNet and CO3D datasets demonstrate that our method achieves superior reconstruction quality at matched or lower NFEs, yielding higher F-score and lower CD and EMD, validating its efficiency and adaptability for practical applications. Code is available at https://github.com/Seunghyeok0715/FCM[196] Seq2Seq Models Reconstruct Visual Jigsaw Puzzles without Seeing Them
Gur Elkn,Ofir Itzhak Shahar,Ohad Ben-Shahar
Main category: cs.CV
TL;DR: 提出一种使用语言模型解决方形拼图的新方法,通过将拼图块转换为离散令牌序列,在无视觉输入的情况下实现高精度重构。
Details
Motivation: 探索非视觉方法解决传统上依赖视觉的拼图问题,挑战语言模型在非自然任务中的潜力。 Method: 设计专用分词器将拼图块转为令牌序列,利用编码器-解码器Transformer进行序列到序列预测。 Result: 模型在多个基准上达到最先进水平,常优于基于视觉的方法,即使没有视觉输入也能准确重建布局。 Conclusion: 语言模型能有效解决非其原生领域的复杂问题,启发拼图求解研究的新方向。 Abstract: Jigsaw puzzles are primarily visual objects, whose algorithmic solutions have traditionally been framed from a visual perspective. In this work, however, we explore a fundamentally different approach: solving square jigsaw puzzles using language models, without access to raw visual input. By introducing a specialized tokenizer that converts each puzzle piece into a discrete sequence of tokens, we reframe puzzle reassembly as a sequence-to-sequence prediction task. Treated as "blind" solvers, encoder-decoder transformers accurately reconstruct the original layout by reasoning over token sequences alone. Despite being deliberately restricted from accessing visual input, our models achieve state-of-the-art results across multiple benchmarks, often outperforming vision-based methods. These findings highlight the surprising capability of language models to solve problems beyond their native domain, and suggest that unconventional approaches can inspire promising directions for puzzle-solving research.[197] CINEMAE: Leveraging Frozen Masked Autoencoders for Cross-Generator AI Image Detection
Minsuk Jang,Hyeonseo Jeong,Minseok Son,Changick Kim
Main category: cs.CV
TL;DR: 本文提出CINEMAE,一种基于上下文重建不确定性的新范式,用于AIGC图像检测。通过将掩码自编码器(MAE)的条件重建过程建模为概率信号,利用局部语义异常与全局特征融合实现跨生成器的强泛化能力,在GenImage基准上显著优于现有方法。
Details
Motivation: 现有的基于图像的AI生成内容检测器容易过拟合到特定生成器的伪影,缺乏跨生成器的泛化能力,而文本检测方法基于分布不一致性具有更好的泛化性,因此需要将文本检测的核心思想迁移到视觉领域。 Method: 提出CINEMAE,利用预训练的Masked AutoEncoder(MAE)对掩码块进行上下文条件重建,通过计算条件负对数似然(NLL)来量化局部语义异常,并将这些局部统计量与全局MAE特征通过可学习融合策略结合,以实现鲁棒的检测性能。 Result: 在仅使用Stable Diffusion v1.4训练的情况下,CINEMAE在GenImage基准的8个未见生成器上均达到超过95%的准确率,显著优于当前最先进的检测方法。 Conclusion: 上下文条件重建不确定性是一种强大且可迁移的AIGC图像检测信号,CINEMAE成功地将文本检测中的上下文一致性思想扩展到图像领域,实现了优异的跨生成器泛化性能。 Abstract: While context-based detectors have achieved strong generalization for AI-generated text by measuring distributional inconsistencies, image-based detectors still struggle with overfitting to generator-specific artifacts. We introduce CINEMAE, a novel paradigm for AIGC image detection that adapts the core principles of text detection methods to the visual domain. Our key insight is that Masked AutoEncoder (MAE), trained to reconstruct masked patches conditioned on visible context, naturally encodes semantic consistency expectations. We formalize this reconstruction process probabilistically, computing conditional Negative Log-Likelihood (NLL, p(masked | visible)) to quantify local semantic anomalies. By aggregating these patch-level statistics with global MAE features through learned fusion, CINEMAE achieves strong cross-generator generalization. Trained exclusively on Stable Diffusion v1.4, our method achieves over 95% accuracy on all eight unseen generators in the GenImage benchmark, substantially outperforming state-of-the-art detectors. This demonstrates that context-conditional reconstruction uncertainty provides a robust, transferable signal for AIGC detection.[198] Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection
Dingkang Yang,Mingcheng Li,Xuecheng Wu,Zhaoyu Chen,Kaixun Jiang,Keliang Liu,Peng Zhai,Lihua Zhang
Main category: cs.CV
TL;DR: 本文提出了一种用于多模态情感分析的模态优化与动态主模态选择框架(MODS),通过图结构动态序列压缩、自适应主模态选择和以主模态为中心的交叉注意力机制,有效提升了多模态融合性能。
Details
Motivation: 现有方法在处理多模态情感分析时存在模态性能不平衡、主模态固定以及非语言模态序列冗余和噪声等问题,难以适应不同样本间模态重要性的动态变化。 Method: 提出MODS框架:1)基于图的动态序列压缩器(GDC)利用胶囊网络和图卷积减少声学/视觉模态的序列冗余;2)样本自适应的主模态选择器(MSelector)动态确定主导模态;3)主模态中心交叉注意力(PCCA)增强主导模态并促进跨模态交互。 Result: 在四个基准数据集上的实验表明,MODS优于当前最先进的方法,在平衡模态贡献和消除冗余噪声方面表现优异。 Conclusion: MODS通过动态选择主模态并优化模态表示,有效解决了多模态情感分析中模态不平衡与冗余问题,显著提升了模型性能。 Abstract: Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize dominant modality advantages, yet fail to adapt to dynamic variations in modality importance across different samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) for dynamic dominance determination. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and eliminating redundant noise.[199] HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment
Ruijia Wu,Ping Chen,Fei Shen,Shaoan Zhao,Qiang Hui,Huanlin Gao,Ting Lu,Zhaoxiang Liu,Fang Zhao,Kai Wang,Shiguo Lian
Main category: cs.CV
TL;DR: 本文提出了HiMo-CLIP,一种增强CLIP模型的框架,通过引入层次分解模块和单调性感知对比损失,提升模型对长文本和复杂语言结构的理解能力,在图像-文本检索任务中表现优于现有方法。
Details
Motivation: Existing contrastive vision-language models such as CLIP treat text as a flat sequence and struggle to capture the semantic hierarchy and monotonicity of language, which limits them on complex, compositional, or long descriptions. Method: The HiMo-CLIP framework has two core components: 1) a hierarchical decomposition (HiDe) module that extracts latent semantic components from long text via in-batch PCA, enabling alignment at multiple granularities; 2) a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, making the model more sensitive to textual completeness. The encoder architecture is not modified. Result: On multiple image-text retrieval benchmarks, HiMo-CLIP outperforms strong baselines, especially with long or compositional descriptions. Conclusion: By modeling semantic hierarchy and monotonicity, HiMo-CLIP improves how CLIP-style models handle complex language structure and produces more structured, cognitively aligned cross-modal representations. Abstract: Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content. To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and alignment strength as a function of textual completeness. These components work in concert to produce structured, cognitively-aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly under long or compositional descriptions.
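The phrase "in-batch PCA" can be made concrete with a small sketch: center the text embeddings of one batch, find the direction of greatest variance, and read off how strongly each embedding expresses it. The version below finds only the leading direction by power iteration in plain Python. How HiDe actually selects and uses components is not described in the abstract, so the helper names, the single-component restriction, and the toy 2-D embeddings are all assumptions.

```python
import math


def top_principal_component(batch, iters=100):
    """Leading principal direction of a batch of embeddings (power iteration).

    Returns (mean, direction). The start vector is uniform, so a batch whose
    main axis is exactly orthogonal to it would need a different start.
    """
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in batch]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # w = X^T (X v), proportional to the covariance matrix applied to v.
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def component_coefficient(embedding, mean, direction):
    """Signed strength of one embedding along the batch-level direction."""
    return sum((e - m) * d for e, m, d in zip(embedding, mean, direction))


# Toy 2-D "text embeddings" that vary mostly along the first axis.
batch = [[2.0, 0.1], [-2.0, -0.1], [1.0, 0.0], [-1.0, 0.0]]
mean, direction = top_principal_component(batch)
coeff = component_coefficient(batch[0], mean, direction)
```

Because the direction is estimated from whatever captions happen to share a batch, the extracted component is batch-aware: the same caption can be decomposed differently depending on what it is contrasted with.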
The code is available at https://github.com/UnicomAI/HiMo-CLIP.
[200] Label-Efficient 3D Forest Mapping: Self-Supervised and Transfer Learning for Individual, Structural, and Species Analysis
Aldino Rizaldy,Fabian Ewald Fassnacht,Ahmed Jamal Afifi,Hua Jiang,Richard Gloaguen,Pedram Ghamisi
Main category: cs.CV
TL;DR: 本研究提出一种结合自监督学习和迁移学习的统一框架,以减少对大规模标注点云数据的依赖,提升机载与地面激光扫描点云在单木实例分割、语义分割和树种分类任务中的性能,推动精准林业、生物多样性保护和碳汇制图的应用。
Details
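Motivation: Deep learning models usually need large amounts of annotated 3D point cloud data for training, but producing high-quality annotations in complex forest environments is slow and laborious, which limits model development. Dependence on large annotated datasets therefore needs to be reduced, and models need to be more operational and generalizable in real settings. Method: Self-supervised learning and domain adaptation are used for instance segmentation, self-supervised learning for semantic segmentation, and hierarchical transfer learning for classifying unseen tree species. All tasks are integrated into a unified open-source framework that covers the full pipeline from raw point clouds to individual-tree segmentation, structural analysis, and species classification. Result: Compared with training from scratch, instance segmentation AP50 improves by 16.98%, semantic segmentation mIoU by 1.79%, and tree species classification Jaccard index by 6.07%. Pretrained models reduce energy consumption and carbon emissions by about 21%. Conclusion: The framework markedly reduces the need for annotated data and improves both the performance and sustainability of several tree information extraction tasks, with good prospects for practical use in automating forestry, biodiversity, and carbon monitoring. Abstract: Detailed structural and species information at the individual tree level is increasingly important to support precision forestry and biodiversity conservation, and to provide reference data for biomass and carbon mapping. Point clouds from airborne and ground-based laser scanning are currently the most suitable data source to rapidly derive such information at scale. Recent advancements in deep learning improved segmenting and classifying individual trees and identifying semantic tree components. However, deep learning models typically require large amounts of annotated training data, which limits further improvement. Producing dense, high-quality annotations for 3D point clouds, especially in complex forests, is labor-intensive and challenging to scale. We explore strategies to reduce dependence on large annotated datasets using self-supervised and transfer learning architectures. Our objective is to improve performance across three tasks: instance segmentation, semantic segmentation, and tree classification using realistic and operational training sets. Our findings indicate that combining self-supervised learning with domain adaptation significantly enhances instance segmentation compared to training from scratch (AP50 +16.98%), self-supervised learning suffices for semantic segmentation (mIoU +1.79%), and hierarchical transfer learning enables accurate classification of unseen species (Jaccard +6.07%).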
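For readers less familiar with the mIoU and Jaccard figures quoted above, the sketch below computes per-class intersection over union on point labels and averages it over the classes that occur. The two-class labelling (0 for wood, 1 for leaf) and the toy predictions are hypothetical and chosen only to keep the arithmetic easy to follow; the study's exact evaluation protocol is not given in the abstract.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union (Jaccard index) across classes.

    Classes absent from both the reference and the prediction are skipped
    so that they do not distort the average. At least one class in
    range(num_classes) is assumed to occur in the inputs.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical per-point labels: 0 = wood, 1 = leaf.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = mean_iou(y_true, y_pred, num_classes=2)
```

Here the wood class scores 1/2 and the leaf class 2/3, giving a mean of 7/12, so a single mislabelled point lowers both class scores at once.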
To simplify use and encourage uptake, we integrated the tasks into a unified framework, streamlining the process from raw point clouds to tree delineation, structural analysis, and species classification. Pretrained models reduce energy consumption and carbon emissions by ~21%. This open-source contribution aims to accelerate operational extraction of individual tree information from laser scanning point clouds to support forestry, biodiversity, and carbon mapping.
[201] Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View
Jianyu Qi,Ding Zou,Wenrui Yan,Rui Ma,Jiaxu Li,Zhijie Zheng,Zhiguo Yang,Rongchang Zhao
Main category: cs.CV
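TL;DR: The paper proposes two difficulty-aware sampling strategies (PISM and CMAB) and a hierarchical training framework, showing that in multimodal LLM post-training, GRPO applied to difficulty-stratified samples can outperform the conventional SFT+GRPO pipeline.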
Details
Motivation: Existing post-training paradigms lack effective metrics for quantifying sample difficulty and fail to jointly optimize perception and reasoning, which limits training efficiency and performance. Method: PISM quantifies sample difficulty through image degradation, and CMAB analyzes cross-modal interaction complexity via attention distributions; a GRPO-based hierarchical training framework supports both GRPO-only and SFT+GRPO hybrid paradigms. Result: Experiments on six benchmark datasets show that GRPO on difficulty-stratified samples significantly outperforms the conventional SFT+GRPO pipeline and improves model accuracy without supervised fine-tuning. Conclusion: Difficulty-aware sampling effectively improves the reasoning ability of multimodal LLMs, and combining GRPO with difficulty stratification simplifies the training pipeline while improving performance. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of DeepSeek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) the lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization, and (2) suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address this gap, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate consistent superiority of GRPO applied to difficulty-stratified samples compared to conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.
[202] BuildingWorld: A Structured 3D Building Dataset for Urban Foundation Models
Shangfeng Huang,Ruisheng Wang,Xin Wang
Main category: cs.CV
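TL;DR: BuildingWorld is a large-scale, structured 3D building dataset that covers diverse architectural styles worldwide and supports research on reconstruction, detection, and segmentation for urban digital twins.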
Details
Motivation: Existing 3D building datasets lack architectural diversity, which limits the generalization of models across heterogeneous urban environments. Method: The authors build the BuildingWorld dataset of about five million LOD2 building models covering architectural styles from five continents, with real and simulated airborne LiDAR point clouds, and introduce the virtual city Cyber City to generate diverse training data. Result: The work delivers a globally representative dataset with high architectural diversity that supports research on 3D reconstruction, detection, and segmentation, accompanied by standardized evaluation metrics. Conclusion: BuildingWorld significantly improves the generalizability and scalability of 3D urban modeling and provides important support for urban-scale foundation models. Abstract: As digital twins become central to the transformation of modern cities, accurate and structured 3D building models emerge as a key enabler of high-fidelity, updatable urban representations. These models underpin diverse applications including energy modeling, urban planning, autonomous navigation, and real-time reasoning. Despite recent advances in 3D urban modeling, most learning-based models are trained on building datasets with limited architectural diversity, which significantly undermines their generalizability across heterogeneous urban environments. To address this limitation, we present BuildingWorld, a comprehensive and structured 3D building dataset designed to bridge the gap in stylistic diversity. It encompasses buildings from geographically and architecturally diverse regions -- including North America, Europe, Asia, Africa, and Oceania -- offering a globally representative dataset for urban-scale foundation modeling and analysis. Specifically, BuildingWorld provides about five million LOD2 building models collected from diverse sources, accompanied by real and simulated airborne LiDAR point clouds. This enables comprehensive research on 3D building reconstruction, detection, and segmentation. Cyber City, a virtual city model, is introduced to enable the generation of unlimited training data with customized and structurally diverse point cloud distributions.
Furthermore, we provide standardized evaluation metrics tailored for building reconstruction, aiming to facilitate the training, evaluation, and comparison of large-scale vision models and foundation models in structured 3D urban environments.
[203] SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Hunar Batra,Haoqin Tu,Hardy Chen,Yuanze Lin,Cihang Xie,Ronald Clark
Main category: cs.CV
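TL;DR: Introduces SpatialThinker, a 3D-aware multimodal LLM trained with reinforcement learning that uses structured spatial grounding and multi-step reasoning to improve spatial understanding, significantly outperforming existing methods with limited data.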
Details
Motivation: Existing MLLMs rely on explicit 3D inputs or architecture-specific modifications for spatial understanding and are constrained by large-scale datasets or sparse supervision, making robust 3D spatial understanding difficult. Method: A scene graph provides structured spatial modeling, and online reinforcement learning is paired with a multi-objective dense spatial reward; a data synthesis pipeline generates STVQA-7K, a high-quality spatial VQA dataset. Result: SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, with a gain nearly twice that of sparse RL, and surpasses GPT-4o. Conclusion: Combining spatial supervision with reward-aligned reasoning effectively improves 3D spatial understanding in MLLMs with limited data and moves them toward human-level visual reasoning. Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.
[204] GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding
Athul M. Mathew,Haithem Hermassi,Thariq Khalid,Arshad Ali Khan,Riad Souissi
Main category: cs.CV
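TL;DR: The paper presents GazeVLM, described as the first multi-task vision-language model that combines visual and language prompts to jointly understand people, gaze targets, and attended objects in images. Fusing RGB with HHA-encoded depth maps under text guidance yields performance gains on several datasets.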
Details
Motivation: Prior work lacks a system that uses both visual and language information for unified gaze understanding, making it hard to model human visual attention and intent comprehensively. Method: GazeVLM is a multi-task framework built on a vision-language model (VLM) that combines RGB images, HHA-encoded depth maps, and text prompts to perform person detection, gaze target detection, and gaze object identification; it also introduces an object-level gaze detection metric, $AP_{ob}$. Result: The model achieves state-of-the-art performance on GazeFollow and VideoAttentionTarget, and ablations show that fusing RGB with HHA depth maps under text guidance works best. Conclusion: GazeVLM is the first application of a VLM to multi-task gaze understanding; it validates multimodal fusion and language prompting and points to new directions for attention and intent analysis. Abstract: Gaze understanding unifies the detection of people, their gaze targets, and objects of interest into a single framework, offering critical insight into visual attention and intent estimation. Although prior research has modelled gaze cues in visual scenes, a unified system is still needed for gaze understanding using both visual and language prompts. This paper introduces GazeVLM, a novel Vision-Language Model (VLM) for multi-task gaze understanding in images, addressing person detection, gaze target detection, and gaze object identification. While other transformer-based methods exist for gaze analysis, GazeVLM represents, to our knowledge, the first application of a VLM to these combined tasks, allowing for selective execution of each task. Through the integration of visual (RGB and depth) and textual modalities, our ablation study on visual input combinations revealed that a fusion of RGB images with HHA-encoded depth maps, guided by text prompts, yields superior performance. We also introduce an object-level gaze detection metric for gaze object identification ($AP_{ob}$). Through experiments, GazeVLM demonstrates significant improvements, notably achieving state-of-the-art evaluation scores on GazeFollow and VideoAttentionTarget datasets.
[205] AesTest: Measuring Aesthetic Intelligence from Perception to Production
Guolong Wang,Heng Huang,Zhiqiang Zhang,Wentian Li,Feilong Ma,Xin Jin
Main category: cs.CV
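TL;DR: The paper introduces AesTest, a comprehensive benchmark for the aesthetic perception and production abilities of multimodal large language models. It covers ten tasks, diverse data sources, and multiple aesthetic query types, and reveals the limitations of current models in aesthetic intelligence.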
Details
Motivation: Existing image aesthetic assessment benchmarks are limited in perception scope and diversity, making it difficult to systematically evaluate the aesthetic understanding and production of multimodal LLMs. Method: Multiple-choice questions spanning ten tasks are designed on the basis of psychological theories of generative learning; data are integrated from diverse sources, including professional editing workflows, photographic composition tutorials, and crowdsourced preferences; and multiple query types are supported, such as attribute-based analysis, emotional resonance, compositional choice, and stylistic reasoning. Result: Evaluations of instruction-tuned IAA MLLMs and general MLLMs reveal significant challenges in aesthetic perception and production, especially on complex aesthetic reasoning tasks. Conclusion: AesTest effectively evaluates the aesthetic intelligence of multimodal models and supports future research on aesthetic perception and creation. Abstract: Perceiving and producing aesthetic judgments is a fundamental yet underexplored capability for multimodal large language models (MLLMs). However, existing benchmarks for image aesthetic assessment (IAA) are narrow in perception scope or lack the diversity needed to evaluate systematic aesthetic production. To address this gap, we introduce AesTest, a comprehensive benchmark for multimodal aesthetic perception and production, distinguished by the following features: 1) It consists of curated multiple-choice questions spanning ten tasks, covering perception, appreciation, creation, and photography. These tasks are grounded in psychological theories of generative learning. 2) It integrates data from diverse sources, including professional editing workflows, photographic composition tutorials, and crowdsourced preferences. It ensures coverage of both expert-level principles and real-world variation. 3) It supports various aesthetic query types, such as attribute-based analysis, emotional resonance, compositional choice, and stylistic reasoning. We evaluate both instruction-tuned IAA MLLMs and general MLLMs on AesTest, revealing significant challenges in building aesthetic intelligence. We will publicly release AesTest to support future research in this area.
[206] V-Shuffle: Zero-Shot Style Transfer via Value Shuffle
Haojun Tang,Qiwei Lin,Tongda Xu,Lida Huang,Yan Wang
Main category: cs.CV
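TL;DR: The paper proposes V-Shuffle, a zero-shot style transfer method that shuffles value features in the self-attention layers of a diffusion model to implicitly disrupt the semantic content of the style images, avoiding content leakage while preserving low-level style representations. A Hybrid Style Regularization further enhances high-level textures, and experiments show that the method outperforms existing approaches with both single and multiple style images.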
Details
Motivation: Existing attention-injection style transfer methods often suffer from content leakage, where semantic content from the style image wrongly appears in the generated result and harms content fidelity. Method: V-Shuffle shuffles the value features of the style images within the self-attention layers of a diffusion model to disrupt their semantic content, and performs zero-shot style transfer with multiple style images from the same domain; a Hybrid Style Regularization combines low-level style representations with high-level texture information. Result: Experiments show that V-Shuffle performs very well with multiple style images and outperforms previous state-of-the-art methods when only a single style image is used. Conclusion: V-Shuffle effectively balances content preservation against style fidelity, resolves content leakage, and achieves better performance in zero-shot style transfer. Abstract: Attention injection-based style transfer has achieved remarkable progress in recent years. However, existing methods often suffer from content leakage, where the undesired semantic content of the style image mistakenly appears in the stylized output. In this paper, we propose V-Shuffle, a zero-shot style transfer method that leverages multiple style images from the same style domain to effectively navigate the trade-off between content preservation and style fidelity. V-Shuffle implicitly disrupts the semantic content of the style images by shuffling the value features within the self-attention layers of the diffusion model, thereby preserving low-level style representations. We further introduce a Hybrid Style Regularization that complements these low-level representations with high-level style textures to enhance style fidelity. Empirical results demonstrate that V-Shuffle achieves excellent performance when utilizing multiple style images. Moreover, when applied to a single style image, V-Shuffle outperforms previous state-of-the-art methods.
[207] InfoAffect: A Dataset for Affective Analysis of Infographics
Zihang Fu,Yunchao Wang,Chenyu Huang,Guodao Sun,Ronghua Liang
Main category: cs.CV
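TL;DR: The paper presents InfoAffect, an affect-annotated dataset of 3.5k samples that pairs text with real-world infographics. Affects are derived by fusing the outputs of multimodal large language models, and a user study confirms high consistency.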
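The abstract of this entry names Reciprocal Rank Fusion (RRF) as the step that merges the outputs of several models. A minimal sketch of the standard RRF scoring rule follows; the constant k = 60 is a commonly used default and the affect labels are made up for illustration, since the entry does not state the authors' settings or how confidences are derived.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists: score(item) = sum over lists of 1 / (k + rank).

    Ranks start at 1. Items missing from a list simply contribute nothing for it.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Three hypothetical model outputs, each ranking affect labels for one infographic.
fused = reciprocal_rank_fusion([
    ["joy", "trust", "fear"],
    ["trust", "joy", "fear"],
    ["joy", "fear", "trust"],
])
top_label = fused[0][0]
```

Because each list contributes only through rank positions, RRF needs no calibration of the individual models' raw scores, which is one plausible reason to prefer it when fusing heterogeneous models.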
Details
Motivation: Because data resources are scarce, the affective dimensions of infographics remain underexplored, so a high-quality, cross-domain affect-annotated dataset is needed to advance the field. Method: Raw data are collected from six domains, with preprocessing, an accompanied-text-priority method, and three strategies to ensure data quality; an affect table is constructed to constrain annotation; five state-of-the-art multimodal LLMs analyze both the text and image modalities, and their outputs are fused with the Reciprocal Rank Fusion (RRF) algorithm. Result: The resulting InfoAffect dataset contains 3.5k affect-annotated samples; a user study reports a Composite Affect Consistency Index (CACI) of 0.986, indicating high accuracy and usability. Conclusion: InfoAffect provides a reliable resource for affective analysis of infographics, and the validation results indicate good application potential and research value in real-world settings. Abstract: Infographics are widely used to convey complex information, yet their affective dimensions remain underexplored due to the scarcity of data resources. We introduce a 3.5k-sample affect-annotated InfoAffect dataset, which combines textual content with real-world infographics. We first collect the raw data from six domains and align them via preprocessing, the accompanied-text-priority method, and three strategies to guarantee quality and compliance. After that, we construct an affect table and use it to constrain annotation. Five state-of-the-art multimodal large language models (MLLMs) then analyze both modalities, and their outputs are fused with the Reciprocal Rank Fusion (RRF) algorithm to yield robust affects and confidences. We conducted a user study with two experiments to validate usability and assess the InfoAffect dataset using the Composite Affect Consistency Index (CACI), achieving an overall score of 0.986, which indicates high accuracy.
[208] On Modality Incomplete Infrared-Visible Object Detection: An Architecture Compatibility Perspective
Shuo Yang,Yinghui Xing,Shizhou Zhang,Zhilong Niu
Main category: cs.CV
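TL;DR: The paper proposes Scarf Neck, a plug-and-play module that addresses the performance drop of infrared-visible object detection when a modality is missing. Using modality-agnostic deformable attention and a pseudo modality dropout strategy, it makes the detector robust and compatible in both single- and dual-modality settings. The paper also establishes a comprehensive benchmark for modality-incomplete evaluation.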
Details
Motivation: Existing infrared-visible object detection models degrade significantly when the dominant modality is missing and lack effective mechanisms for handling incomplete modalities. Method: The Scarf Neck module introduces a modality-agnostic deformable attention mechanism; a pseudo modality dropout strategy improves the use of multi-modality information and model robustness; and a comprehensive benchmark targets modality-missing scenarios. Result: Scarf-DETR performs strongly in missing-modality scenarios while also achieving leading results on standard modality-complete benchmarks. Conclusion: The method effectively improves the adaptability of IVOD models to incomplete modalities, combining high performance with strong robustness and good practical value. Abstract: Infrared and visible object detection (IVOD) is essential for numerous around-the-clock applications. Despite notable advancements, current IVOD models exhibit notable performance declines when confronted with incomplete modality data, particularly if the dominant modality is missing. In this paper, we thoroughly investigate the modality-incomplete IVOD problem from an architecture compatibility perspective. Specifically, we propose a plug-and-play Scarf Neck module for DETR variants, which introduces a modality-agnostic deformable attention mechanism to enable the IVOD detector to flexibly adapt to any single or double modalities during training and inference. When training Scarf-DETR, we design a pseudo modality dropout strategy to fully utilize the multi-modality information, making the detector compatible with and robust to both single- and double-modality working modes. Moreover, we introduce a comprehensive benchmark for the modality-incomplete IVOD task aimed at thoroughly assessing situations where the absent modality is either dominant or secondary. Our proposed Scarf-DETR not only performs excellently in missing-modality scenarios but also achieves superior performance on the standard modality-complete IVOD benchmarks. Our code will be available at https://github.com/YinghuiXing/Scarf-DETR.
[209] VDNeRF: Vision-only Dynamic Neural Radiance Field for Urban Scenes
Zhengyu Zou,Jingfeng Li,Hao Li,Xiaolei Hou,Jinwen Hu,Jingkun Chen,Lechao Cheng,Dingwen Zhang
Main category: cs.CV
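TL;DR: Proposes VDNeRF, a vision-only dynamic NeRF method that recovers camera trajectories and decomposes static and dynamic scene elements in large-scale dynamic urban environments without camera pose priors.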
Details
Motivation: Existing NeRF methods depend on accurate camera poses and struggle with large-scale dynamic environments, limiting their use in autonomous driving and robotic perception. Method: Two separate NeRFs are used: a static NeRF optimizes camera poses and the background, and a dynamic NeRF models dynamic objects with 3D scene flow; a self-supervised training framework disentangles camera motion from object motion. Result: On mainstream urban driving datasets, VDNeRF outperforms state-of-the-art pose-free NeRF methods in both camera pose estimation and dynamic novel view synthesis. Conclusion: VDNeRF effectively models dynamic scenes without pose input, offering a more practical NeRF solution for real-world applications such as autonomous driving. Abstract: Neural Radiance Fields (NeRFs) implicitly model continuous three-dimensional scenes using a set of images with known camera poses, enabling the rendering of photorealistic novel views. However, existing NeRF-based methods encounter challenges in applications such as autonomous driving and robotic perception, primarily due to the difficulty of capturing accurate camera poses and limitations in handling large-scale dynamic environments. To address these issues, we propose Vision-only Dynamic NeRF (VDNeRF), a method that accurately recovers camera trajectories and learns spatiotemporal representations for dynamic urban scenes without requiring additional camera pose information or expensive sensor data. VDNeRF employs two separate NeRF models to jointly reconstruct the scene. The static NeRF model optimizes camera poses and static background, while the dynamic NeRF model incorporates the 3D scene flow to ensure accurate and consistent reconstruction of dynamic objects. To address the ambiguity between camera motion and independent object motion, we design an effective and powerful training framework to achieve robust camera pose estimation and self-supervised decomposition of static and dynamic elements in a scene. Extensive evaluations on mainstream urban driving datasets demonstrate that VDNeRF surpasses state-of-the-art NeRF-based pose-free methods in both camera pose estimation and dynamic novel view synthesis.
[210] DiffusionUavLoc: Visually Prompted Diffusion for Cross-View UAV Localization
Tao Liu,Kan Ren,Qian Chen
Main category: cs.CV
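TL;DR: Proposes DiffusionUavLoc, a text-prompt-free, diffusion-based framework for cross-view UAV localization that uses geometric rendering to generate pseudo-satellite images and combines a VAE with a conditional diffusion model for robust localization in GNSS-denied environments.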
Details
Motivation: In GNSS-denied environments, methods that rely on satellite positioning fail, and existing cross-view localization methods face large geometric and appearance gaps and depend on complex networks or extensive annotation, which limits generalization. Method: DiffusionUavLoc (1) uses training-free geometric rendering to generate pseudo-satellite images from UAV imagery as structural prompts; (2) designs a text-free conditional diffusion model that fuses multimodal structural cues for robustness to viewpoint changes; and (3) extracts descriptors at a fixed time step t and matches them with cosine similarity. Result: The method performs well on University-1652 and SUES-200, and is particularly competitive on the satellite-to-drone task of University-1652. Conclusion: DiffusionUavLoc provides a unified, image-driven, text-free, diffusion-based cross-view localization framework that reduces dependence on annotation and complex architectures while improving robustness and generalization in GNSS-denied environments. Abstract: With the rapid growth of the low-altitude economy, unmanned aerial vehicles (UAVs) have become key platforms for measurement and tracking in intelligent patrol systems. However, in GNSS-denied environments, localization schemes that rely solely on satellite signals are prone to failure. Cross-view image retrieval-based localization is a promising alternative, yet substantial geometric and appearance domain gaps exist between oblique UAV views and nadir satellite orthophotos. Moreover, conventional approaches often depend on complex network architectures, text prompts, or large amounts of annotation, which hinders generalization. To address these issues, we propose DiffusionUavLoc, a cross-view localization framework that is image-prompted, text-free, diffusion-centric, and employs a VAE for unified representation. We first use training-free geometric rendering to synthesize pseudo-satellite images from UAV imagery as structural prompts. We then design a text-free conditional diffusion model that fuses multimodal structural cues to learn features robust to viewpoint changes. At inference, descriptors are computed at a fixed time step t and compared using cosine similarity. On University-1652 and SUES-200, the method performs competitively for cross-view localization, especially for satellite-to-drone in University-1652. Our data and code will be published at the following URL: https://github.com/liutao23/DiffusionUavLoc.git.
[211] Diagnose Like A REAL Pathologist: An Uncertainty-Focused Approach for Trustworthy Multi-Resolution Multiple Instance Learning
Sungrae Hong,Sol Lee,Jisu Shin,Mun Yong Yi
Main category: cs.CV
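TL;DR: Proposes Uncertainty-Focused Calibrated MIL (UFC-MIL), a method based on multi-resolution images that substantially improves model calibration while maintaining high classification accuracy, more closely matching how pathologists diagnose.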
Details
Motivation: Existing multi-resolution MIL methods focus mainly on performance and neglect the trustworthiness and calibration of predictions, falling short of what clinical experts need for reliable diagnosis. Method: UFC-MIL introduces a patch-wise loss that learns the latent patterns of instances and expresses their uncertainty, extracts features with an attention-based architecture and a neighbor patch aggregation module, and calibrates predictions through patch-level uncertainty without multiple inference passes. Result: On challenging public datasets, UFC-MIL shows superior calibration while achieving classification accuracy comparable to state-of-the-art methods. Conclusion: UFC-MIL improves prediction calibration without sacrificing classification accuracy, increasing the trustworthiness of AI-assisted diagnosis and better matching clinical needs. Abstract: With the increasing demand for histopathological specimen examination and diagnostic reporting, Multiple Instance Learning (MIL) has received heightened research focus as a viable solution for AI-centric diagnostic aid. Recently, to improve its performance and make it work more like a pathologist, several MIL approaches based on the use of multiple-resolution images have been proposed, often delivering higher performance than those that use single-resolution images. Despite impressive recent developments in multiple-resolution MIL, previous approaches focus only on improving performance, leaving well-calibrated MIL that clinical experts can rely on for trustworthy diagnostic results largely unstudied. In this study, we propose Uncertainty-Focused Calibrated MIL (UFC-MIL), which more closely mimics pathologists' examination behaviors while providing calibrated diagnostic predictions, using multiple images with different resolutions. UFC-MIL includes a novel patch-wise loss that learns the latent patterns of instances and expresses their uncertainty for classification. Also, the attention-based architecture with a neighbor patch aggregation module collects features for the classifier. In addition, aggregated predictions are calibrated through patch-level uncertainty without requiring multiple iterative inferences, which is a key practical advantage. On challenging public datasets, UFC-MIL shows superior performance in model calibration while achieving classification accuracy comparable to that of state-of-the-art methods.
[212] Countering Multi-modal Representation Collapse through Rank-targeted Fusion
Seulgi Kim,Kiran Kokilepersaud,Mohit Prabhushankar,Ghassan AlRegib
Main category: cs.CV
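TL;DR: Proposes Rank-enhancing Token Fuser, an effective-rank-based multimodal fusion framework that mitigates feature collapse and modality collapse together, and significantly outperforms existing methods on action anticipation.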
Details
Motivation: Existing methods handle feature collapse and modality collapse separately; there is no unified framework that addresses both forms of representation collapse, which limits multimodal fusion in tasks such as human action anticipation. Method: Effective rank is introduced as the measure, and the Rank-enhancing Token Fuser framework selectively fuses complementary features across modalities to raise the effective rank of the fused representation, using the mutual gain in effective rank between modalities to prevent modality collapse. Result: Experiments on NTURGBD, UTKinect, and DARai show gains of up to 3.74% over prior state-of-the-art methods on action anticipation, and depth maintains representational balance when fused with RGB. Conclusion: Effective rank can serve as a unified measure to guide multimodal fusion design, and the proposed R3D framework mitigates both feature and modality collapse, significantly improving multimodal action anticipation. Abstract: Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse where one dominant modality overwhelms the other. Applications like human action anticipation that require fusing multifarious sensor data are hindered by both feature and modality collapse. However, existing methods attempt to counter feature collapse and modality collapse separately. This is because there is no unifying framework that efficiently addresses feature and modality collapse in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can be utilized to quantify and counter both the representation collapses. We propose Rank-enhancing Token Fuser, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each other's effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present R3D, a depth-informed fusion framework. Extensive experiments on NTURGBD, UTKinect, and DARai demonstrate that our approach significantly outperforms prior state-of-the-art methods by up to 3.74%.
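Effective rank is the quantity this abstract builds on. A minimal sketch of one widely used definition, the exponential of the Shannon entropy of the normalized singular values, is given below; whether the paper uses exactly this form is an assumption, and the function takes singular values directly so the example stays dependency-free (in practice they would come from an SVD of a feature matrix).

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float], eps: float = 1e-12) -> float:
    """Effective rank as exp(entropy) of the normalized singular value spectrum.

    A flat spectrum over n values gives n; a spectrum dominated by a single
    value gives a number close to 1, which signals collapsed features.
    """
    total = sum(singular_values)
    if total <= eps:
        return 0.0
    probs = [s / total for s in singular_values if s / total > eps]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


balanced = effective_rank([1.0, 1.0, 1.0, 1.0])      # all directions equally used
collapsed = effective_rank([10.0, 0.01, 0.01, 0.01])  # one direction dominates
```

Under this definition, a fusion step that "increases effective rank" spreads energy over more directions of the representation, which is the opposite of the feature collapse the abstract describes.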
Our code is available at https://github.com/olivesgatech/R3D.
[213] EIDSeg: A Pixel-Level Semantic Segmentation Dataset for Post-Earthquake Damage Assessment from Social Media Images
Huili Huang,Chengeng Liu,Danrong Zhang,Shail Patel,Anastasiya Masalava,Sagar Sadak,Parisa Babolhavaeji,WeiHong Low,Max Mahdi Roozbahani,J. David Frost
Main category: cs.CV
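TL;DR: The paper presents EIDSeg, the first large-scale semantic segmentation dataset for post-earthquake social media imagery, comprising 3,266 images from nine major earthquakes annotated with five classes of infrastructure damage. A three-phase cross-disciplinary annotation protocol enables consistent segmentation by non-expert annotators, with inter-annotator agreement above 70%. Among several benchmarked state-of-the-art segmentation models, Encoder-only Mask Transformer (EoMT) performs best with an mIoU of 80.8%. By exploiting the ground-level perspective of social media, the work enables faster, finer-grained post-earthquake damage assessment.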
Details
Motivation: Existing post-earthquake remote sensing assessment depends on costly aerial imagery and expert annotation and yields only binary damage maps, which falls short of the need for rapid early-stage assessment; ground-level images from social media are abundant but lack a large-scale pixel-level annotated dataset, which limits their use. Method: The authors build EIDSeg, a large-scale semantic segmentation dataset of 3,266 images from nine major earthquakes between 2008 and 2023, annotated with five classes of infrastructure damage; design a three-phase cross-disciplinary annotation process with guidelines to improve consistency among non-expert annotators; and benchmark several state-of-the-art semantic segmentation models. Result: Non-expert annotators reach over 70% agreement; in the benchmark, Encoder-only Mask Transformer (EoMT) performs best with a mean IoU of 80.8%. Conclusion: EIDSeg fills the gap in semantic segmentation data for post-earthquake social media imagery; together with an effective annotation protocol and validation with high-performing models, it demonstrates the feasibility and potential of rapid, fine-grained damage assessment from ground-level social media images. Abstract: Rapid post-earthquake damage assessment is crucial for rescue and resource planning. However, existing remote sensing methods depend on costly aerial images and expert labeling, and produce only binary damage maps for early-stage evaluation. Although ground-level images from social networks provide a valuable source to fill this gap, a large pixel-level annotated dataset for this task is still unavailable. We introduce EIDSeg, the first large-scale semantic segmentation dataset specifically for post-earthquake social media imagery. The dataset comprises 3,266 images from nine major earthquakes (2008-2023), annotated across five classes of infrastructure damage: Undamaged Building, Damaged Building, Destroyed Building, Undamaged Road, and Damaged Road. We propose a practical three-phase cross-disciplinary annotation protocol with labeling guidelines that enables consistent segmentation by non-expert annotators, achieving over 70% inter-annotator agreement. We benchmark several state-of-the-art segmentation models, identifying Encoder-only Mask Transformer (EoMT) as the top-performing method with a Mean Intersection over Union (mIoU) of 80.8%. By unlocking social networks' rich ground-level perspective, our work paves the way for a faster, finer-grained damage assessment in the post-earthquake scenario.
[214] Inpaint360GS: Efficient Object-Aware 3D Inpainting via Gaussian Splatting for 360° Scenes
Shaoxiang Wang,Shihong Zhang,Christen Millerdurai,Rüdiger Westermann,Didier Stricker,Alain Pagani
Main category: cs.CV
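TL;DR: The paper proposes Inpaint360GS, a multi-object inpainting framework for 360° scenes based on 3D Gaussian Splatting. By distilling 2D segmentation into 3D space and using virtual views for contextual guidance, it achieves accurate object-level editing and consistent scene completion.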
Details
Motivation: Existing single-object, front-facing inpainting methods fall short in complex 360° scenes, which pose three main challenges: identifying target objects, severe occlusion, and cross-view consistency. Method: The Inpaint360GS framework combines 2D-to-3D segmentation distillation with contextual guidance from virtual camera views to achieve multi-object removal and high-quality 3D inpainting. Result: Experiments show that Inpaint360GS outperforms existing baselines on 360° inpainting and reaches state-of-the-art performance; a new dataset for 360° inpainting is also released. Conclusion: Inpaint360GS effectively addresses multi-object inpainting in 360° scenes, producing high-fidelity, visually consistent edits in complex environments. Abstract: Despite recent advances in single-object front-facing inpainting using NeRF and 3D Gaussian Splatting (3DGS), inpainting in complex 360° scenes remains largely underexplored. This is primarily due to three key challenges: (i) identifying target objects in the 3D field of 360° environments, (ii) dealing with severe occlusions in multi-object scenes, which make it hard to define regions to inpaint, and (iii) maintaining consistent and high-quality appearance across views effectively. To tackle these challenges, we propose Inpaint360GS, a flexible 360° editing framework based on 3DGS that supports multi-object removal and high-fidelity inpainting in 3D space. By distilling 2D segmentation into 3D and leveraging virtual camera views for contextual guidance, our method enables accurate object-level editing and consistent scene completion. We further introduce a new dataset tailored for 360° inpainting, addressing the lack of ground truth object-free scenes. Experiments demonstrate that Inpaint360GS outperforms existing baselines and achieves state-of-the-art performance. Project page: https://dfki-av.github.io/inpaint360gs/
[215] NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models
Kyuho Lee,Euntae Kim,Jinwoo Choi,Buru Chang
Main category: cs.CV
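TL;DR: The paper introduces NOAH, a large-scale benchmark for systematically evaluating hallucination and omission errors caused by the "narrative prior" in video large language models (Video LLMs). Using composite videos built by inserting clips from other sources, the study finds that most Video LLMs make substantial errors when they prioritize narrative consistency over visual evidence, and that error patterns depend on event similarity, insertion position, and the number of sampled frames.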
Details
Motivation: The continuity priors that Video LLMs adopt to improve narrative coherence can cause models to ignore actual visual content, producing hallucinations or omitting factual events, and this problem calls for systematic evaluation. Method: The NOAH benchmark constructs test samples by inserting external video clips into target videos while controlling semantic similarity and insertion position; it defines one captioning task and three QA tasks (Existence, Temporal, and Narrative), evaluated with tailored metrics. Result: Experiments show that (i) most Video LLMs exhibit hallucinations and omissions driven by narrative priors; (ii) error patterns vary with model architecture, event similarity, and insertion position; and (iii) sampling fewer frames increases reliance on narrative priors and amplifies errors. Conclusion: NOAH is the first standardized benchmark for errors induced by narrative priors in Video LLMs; it exposes the trade-off between narrative coherence and visual fidelity and provides a foundation for building more reliable video understanding models. Abstract: Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. While this improves fluency, it also introduces an inductive bias that prioritizes storyline consistency over strict grounding in visual evidence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context. To systematically evaluate narrative prior-induced errors, we introduce NOAH, a large-scale benchmark that constructs composite videos by inserting clips from other sources into target videos. By varying semantic similarity and insertion position, our benchmark enables controlled and scalable analysis of narrative priors. We design one captioning task with tailored metrics and three QA tasks - Existence, Temporal, and Narrative - yielding more than 60K evaluation samples. Extensive experiments yield three key findings: (i) most Video LLMs exhibit hallucinations and omissions driven by narrative priors, (ii) the patterns of these errors vary across architectures and depend on event similarity and insertion position, and (iii) reliance on narrative priors intensifies under sampling with fewer frames, amplifying errors when event continuity is weak.
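The composite-video construction and the frame-sampling effect described above can be illustrated on plain lists of frame identifiers. This is only a sketch under assumed conventions (the helper names `insert_clip` and `sample_uniform`, the uniform sampling rule, and the toy frame labels are all hypothetical), not the benchmark's actual pipeline.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def insert_clip(target: Sequence[T], clip: Sequence[T], position: int) -> List[T]:
    """Build a composite sequence by inserting `clip` before index `position` of `target`."""
    if not 0 <= position <= len(target):
        raise ValueError("position must lie within the target sequence")
    return list(target[:position]) + list(clip) + list(target[position:])


def sample_uniform(frames: Sequence[T], num_frames: int) -> List[T]:
    """Pick `num_frames` frames at evenly spaced indices (a common, simple sampler)."""
    if num_frames <= 0 or not frames:
        return []
    if num_frames >= len(frames):
        return list(frames)
    step = len(frames) / num_frames
    return [frames[int(i * step)] for i in range(num_frames)]


# Target video frames a0..a3 with a two-frame foreign clip inserted in the middle.
composite = insert_clip(["a0", "a1", "a2", "a3"], ["b0", "b1"], position=2)
dense = sample_uniform(composite, 3)   # keeps one frame of the inserted clip
sparse = sample_uniform(composite, 2)  # even less evidence per event
```

With fewer sampled frames, each event is represented by less visual evidence, which is consistent with the abstract's observation that sparse sampling pushes models to lean on narrative priors.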
We establish NOAH as the first standardized evaluation of narrative prior-induced hallucination and omission in Video LLMs, providing a foundation for developing more reliable and trustworthy models. Our benchmark and code are available at https://anonymous550520.github.io/.
[216] Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models
Yule Chen,Yufan Ren,Sabine Süsstrunk
Main category: cs.CV
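TL;DR: The paper introduces AI4VA-FG, the first fine-grained, comprehensive benchmark for comic understanding, exposes the shortcomings of existing vision-language models on comics, and proposes Region-Aware Reinforcement Learning (RARL) to improve model performance.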
Details
Motivation: Existing vision-language models handle complex visual narratives poorly, including stylized line art, onomatopoeia, and dense multi-panel layouts, and there are no dedicated benchmarks or methods for evaluating and improving comic understanding. Method: The authors build the AI4VA-FG benchmark, covering tasks from basic recognition to high-level character reasoning and narrative construction; systematically evaluate leading proprietary and open-source models; and explore supervised fine-tuning (SFT-S, SFT-R) and reinforcement learning (RL), proposing Region-Aware Reinforcement Learning (RARL) to strengthen attention to relevant regions. Result: Experiments show that existing models perform poorly on AI4VA-FG; RL and RARL significantly improve Qwen2.5-VL on entity recognition and storyline ordering. Conclusion: Comic understanding remains a major challenge for vision-language models, and region-aware training strategies such as RARL offer an effective path to improving models in this domain. Abstract: Complex visual narratives, such as comics, present a significant challenge to Vision-Language Models (VLMs). Despite excelling on natural images, VLMs often struggle with stylized line art, onomatopoeia, and densely packed multi-panel layouts. To address this gap, we introduce AI4VA-FG, the first fine-grained and comprehensive benchmark for VLM-based comic understanding. It spans tasks from foundational recognition and detection to high-level character reasoning and narrative construction, supported by dense annotations for characters, poses, and depth. Beyond that, we evaluate state-of-the-art proprietary models, including GPT-4o and Gemini-2.5, and open-source models such as Qwen2.5-VL, revealing substantial performance deficits across core tasks of our benchmark and underscoring that comic understanding remains an unsolved challenge. To enhance VLMs' capabilities in this domain, we systematically investigate post-training strategies, including supervised fine-tuning on solutions (SFT-S), supervised fine-tuning on reasoning trajectories (SFT-R), and reinforcement learning (RL). In addition, inspired by the emerging "Thinking with Images" paradigm, we propose Region-Aware Reinforcement Learning (RARL) for VLMs, which trains models to dynamically attend to relevant regions through zoom-in operations. We observe that when applied to the Qwen2.5-VL model, RL and RARL yield significant gains in low-level entity recognition and high-level storyline ordering, paving the way for more accurate and efficient VLM applications in the comics domain.
[217] SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
Haotian Xia,Haonan Ge,Junbo Zou,Hyun Woo Choi,Xuebin Zhang,Danny Suradja,Botao Rui,Ethan Tran,Wendy Jin,Zhen Ye,Xiyang Lin,Christopher Lai,Shengjie Zhang,Junwen Miao,Shichao Chen,Rhys Tracy,Vicente Ordonez,Weining Shen,Hanjie Chen
Main category: cs.CV
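TL;DR: This paper presents SportR, the first large-scale multi-sport benchmark for evaluating the fine-grained visual perception and rule-based reasoning of multimodal large language models in sports intelligence. The benchmark contains image and video data and introduces a hierarchical question design with chain-of-thought annotations to support deep reasoning tasks; experiments show that existing models perform poorly, highlighting the limitations of current technology.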
Details
Motivation: Existing sports-related multimodal benchmarks are mostly limited to a single sport or lack fine-grained reasoning chains and visual grounding, which makes it hard to comprehensively evaluate perception and rule-based reasoning in multi-sport settings. A more challenging and systematic benchmark is therefore needed to advance sports intelligence. Method: Build SportR, a multi-sport benchmark with 5,017 images and 2,101 videos, organized around a progressive hierarchy of question-answer pairs that evaluates reasoning from simple infraction identification to complex penalty prediction, and provide 7,118 human-authored chain-of-thought (CoT) annotations for the advanced tasks. Manual bounding-box annotations on images are also provided to test visual grounding. Models are trained with supervised fine-tuning and reinforcement learning and then evaluated. Result: Current state-of-the-art models perform poorly on the most challenging tasks; training with supervised fine-tuning and reinforcement learning improves the scores, but they remain low overall, indicating significant deficiencies in multimodal sports reasoning. Conclusion: SportR provides a new challenge and evaluation standard for the reasoning ability of multimodal models in sports, exposes the weaknesses of current models in perceiving visual details, applying rules, and linking evidence, and is an important resource for future sports intelligence research. Abstract: Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths - from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding in the image part directly. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.
[218] Video Dataset for Surgical Phase, Keypoint, and Instrument Recognition in Laparoscopic Surgery (PhaKIR)
Tobias Rueckert,Raphaela Maerkl,David Rauber,Leonard Klausmann,Max Gutbrod,Daniel Rueckert,Hubertus Feussner,Dirk Wilhelm,Christoph Palm
Main category: cs.CV
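TL;DR: This paper presents PhaKIR, a multi-center surgical dataset of laparoscopic cholecystectomy videos with frame-level annotations for surgical phase recognition, instrument keypoint estimation, and instance segmentation. It supports temporal context modeling and was used in the MICCAI 2024 EndoVis Challenge.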
Details
Motivation: Existing surgical datasets mostly address isolated tasks and lack temporal dependency modeling and multi-center diversity, which limits the development of computer vision methods for robot-assisted minimally invasive surgery. Method: Collect eight complete laparoscopic cholecystectomy videos from three medical centers and annotate them for three tasks in parallel (surgical phase classification, instrument keypoint localization, and instrument instance segmentation), yielding a public multi-task, multi-center dataset with full temporal coverage. Result: PhaKIR is the first multi-institutional dataset to jointly provide surgical phase labels, instrument pose information, and pixel-level segmentation masks, covering 485,875 frames with phase annotations and 19,435 frames with keypoint and segmentation annotations, and it supports temporal analysis of complete surgical procedures. Conclusion: The PhaKIR dataset provides a high-quality, multi-task benchmark resource for surgical scene understanding, has served as the basis for a challenge within the MICCAI 2024 EndoVis Challenge, and advances computer vision methods for RAMIS. Abstract: Robotic- and computer-assisted minimally invasive surgery (RAMIS) is increasingly relying on computer vision methods for reliable instrument recognition and surgical workflow understanding. Developing such systems often requires large, well-annotated datasets, but existing resources often address isolated tasks, neglect temporal dependencies, or lack multi-center variability. We present the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) dataset, comprising eight complete laparoscopic cholecystectomy videos recorded at three medical centers. The dataset provides frame-level annotations for three interconnected tasks: surgical phase recognition (485,875 frames), instrument keypoint estimation (19,435 frames), and instrument instance segmentation (19,435 frames). PhaKIR is, to our knowledge, the first multi-institutional dataset to jointly provide phase labels, instrument pose information, and pixel-accurate instrument segmentations, while also enabling the exploitation of temporal context since full surgical procedure sequences are available. It served as the basis for the PhaKIR Challenge as part of the Endoscopic Vision (EndoVis) Challenge at MICCAI 2024 to benchmark methods in surgical scene understanding, thereby further validating the dataset's quality and relevance. The dataset is publicly available upon request via the Zenodo platform.
[219] Spatial-Frequency Enhanced Mamba for Multi-Modal Image Fusion
Hui Sun,Long Lv,Pingping Zhang,Tongdan Tang,Feng Tian,Weibing Sun,Huchuan Lu
Main category: cs.CV
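TL;DR: Proposes SFMFusion, a new multi-modal image fusion framework that combines Spatial-Frequency Enhanced Mamba blocks with a dynamic fusion mechanism and outperforms most existing methods on six datasets.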
Details
Motivation: In multi-modal image fusion, CNNs have limited receptive fields and Transformers have high computational cost, Mamba lacks adequate spatial and frequency perception, and making effective use of image reconstruction as an auxiliary task remains challenging. Method: Design a three-branch structure that couples image fusion with reconstruction, propose the Spatial-Frequency Enhanced Mamba Block (SFMB) to strengthen feature extraction, and introduce the Dynamic Fusion Mamba Block (DFMB) for dynamic fusion across branches. Result: Experiments on six multi-modal image fusion datasets show that the method outperforms most existing state-of-the-art methods in both quantitative metrics and visual quality. Conclusion: By strengthening Mamba's spatial and frequency perception and effectively incorporating the image reconstruction task, SFMFusion significantly improves multi-modal image fusion performance. Abstract: Multi-Modal Image Fusion (MMIF) aims to integrate complementary image information from different modalities to produce informative images. Previous deep learning-based MMIF methods generally adopt Convolutional Neural Networks (CNNs) or Transformers for feature extraction. However, these methods deliver unsatisfactory performances due to the limited receptive field of CNNs and the high computational cost of Transformers. Recently, Mamba has demonstrated a powerful potential for modeling long-range dependencies with linear complexity, providing a promising solution to MMIF. Unfortunately, Mamba lacks full spatial and frequency perceptions, which are very important for MMIF. Moreover, employing Image Reconstruction (IR) as an auxiliary task has been proven beneficial for MMIF. However, a primary challenge is how to leverage IR efficiently and effectively. To address the above issues, we propose a novel framework named Spatial-Frequency Enhanced Mamba Fusion (SFMFusion) for MMIF. More specifically, we first propose a three-branch structure to couple MMIF and IR, which can retain complete contents from source images. Then, we propose the Spatial-Frequency Enhanced Mamba Block (SFMB), which can enhance Mamba in both spatial and frequency domains for comprehensive feature extraction. Finally, we propose the Dynamic Fusion Mamba Block (DFMB), which can be deployed across different branches for dynamic feature fusion. Extensive experiments show that our method achieves better results than most state-of-the-art methods on six MMIF datasets. The source code is available at https://github.com/SunHui1216/SFMFusion.
[220] On Accurate and Robust Estimation of 3D and 2D Circular Center: Method and Application to Camera-Lidar Calibration
Jiajun Jiang,Xiao Hu,Wancheng Liu,Wei Jiang
Main category: cs.CV
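TL;DR: Proposes a new LiDAR-camera extrinsic calibration framework based on geometric algebra and RANSAC that significantly improves calibration accuracy and robustness through better 3D circle-center estimation and 2D projected-center recovery.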
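The need for dedicated 2D projected-center recovery can be illustrated with a small numeric check: under perspective projection, the center of the ellipse that a tilted circle maps to is generally not the projection of the circle's 3D center. The sketch below is only an illustration of that geometric fact, not the paper's chord-length method; the pinhole model with unit focal length, the tilt angle, and all function names are assumptions chosen for the example.

```python
import math

def project(point, focal=1.0):
    """Pinhole projection of a camera-frame 3D point (x, y, z) onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def tilted_circle(center, radius, tilt_rad, n=3600):
    """Sample a 3D circle whose plane is rotated by tilt_rad about the camera x-axis."""
    cx, cy, cz = center
    pts = []
    for i in range(n):
        t = 2.0 * math.pi * i / n
        # In-plane basis vectors: e1 = (1, 0, 0), e2 = (0, cos(tilt), sin(tilt)).
        pts.append((cx + radius * math.cos(t),
                    cy + radius * math.sin(t) * math.cos(tilt_rad),
                    cz + radius * math.sin(t) * math.sin(tilt_rad)))
    return pts

def ellipse_center_from_samples(points_2d):
    """An ellipse is centrally symmetric, so its center is the midpoint of its bounding box."""
    us = [p[0] for p in points_2d]
    vs = [p[1] for p in points_2d]
    return ((min(us) + max(us)) / 2.0, (min(vs) + max(vs)) / 2.0)

# Hypothetical setup: unit circle 5 units in front of the camera, tilted by 60 degrees.
center_3d = (0.0, 0.0, 5.0)
image_pts = [project(p) for p in tilted_circle(center_3d, 1.0, math.radians(60.0))]
ellipse_center = ellipse_center_from_samples(image_pts)
true_center = project(center_3d)
offset = math.hypot(ellipse_center[0] - true_center[0],
                    ellipse_center[1] - true_center[1])
```

In this setup `offset` is clearly non-zero, so pairing the 3D circle center with the fitted ellipse center would inject a systematic error into the 3D-2D correspondences used for calibration.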
Details
Motivation: Existing methods suffer from errors in 3D-2D circle-center correspondence, mainly because 3D fitting is decoupled from 2D ellipse-center estimation, which leads to inaccurate calibration. Method: Use conformal geometric algebra with RANSAC for robust 3D circle-center estimation, and propose a chord-length variance minimization method to recover the 2D projected center, resolving the dual-minima ambiguity through homography validation or a quasi-RANSAC fallback. Result: The framework significantly outperforms existing methods on both synthetic and real-world datasets, reduces extrinsic estimation error, and supports diverse sensors and target types, including natural circular objects. Conclusion: The proposed framework is more geometrically principled, achieves more accurate and robust LiDAR-camera calibration, and is broadly applicable and reproducible (the code will be released). Abstract: Circular targets are widely used in LiDAR-camera extrinsic calibration due to their geometric consistency and ease of detection. However, achieving accurate 3D-2D circular center correspondence remains challenging. Existing methods often fail due to decoupled 3D fitting and erroneous 2D ellipse-center estimation. To address this, we propose a geometrically principled framework featuring two innovations: (i) a robust 3D circle center estimator based on conformal geometric algebra and RANSAC; and (ii) a chord-length variance minimization method to recover the true 2D projected center, resolving its dual-minima ambiguity via homography validation or a quasi-RANSAC fallback. Evaluated on synthetic and real-world datasets, our framework significantly outperforms state-of-the-art approaches. It reduces extrinsic estimation error and enables robust calibration across diverse sensors and target types, including natural circular objects. Our code will be publicly released for reproducibility.
[221] Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from LDCT
Yifei Zhang,Jiashuo Zhang,Xiaofeng Yang,Liang Zhao
Main category: cs.CV
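TL;DR: Proposes an explainable cross-disease reasoning framework for joint cardiopulmonary risk assessment from low-dose chest CT; by emulating the clinical diagnostic thought process, it delivers accurate cardiovascular disease prediction together with mechanistic explanations.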
Details
Motivation: Existing methods usually treat lung and heart health assessment as independent tasks, ignoring their physiological links and shared imaging biomarkers, so an explainable analysis framework that integrates joint cardiopulmonary information is needed. Method: Design a framework with a pulmonary perception module, a knowledge-guided reasoning module, and a cardiac representation module that emulates clinical diagnostic thinking: first detect lung abnormalities, then reason about their cardiovascular impact using medical knowledge, and finally fuse the multi-source information for cardiovascular risk prediction. Result: Experiments on the NLST cohort show that the framework outperforms single-disease and purely image-based models for cardiovascular disease screening and mortality prediction, while providing explainable reasoning paths consistent with cardiological understanding. Conclusion: The study establishes a unified and explainable paradigm for cardiovascular analysis from LDCT, bridging the gap between image-based prediction and mechanism-based medical interpretation. Abstract: Low-dose chest computed tomography (LDCT) inherently captures both pulmonary and cardiac structures, offering a unique opportunity for joint assessment of lung and cardiovascular health. However, most existing approaches treat these domains as independent tasks, overlooking their physiological interplay and shared imaging biomarkers. We propose an Explainable Cross-Disease Reasoning Framework that enables interpretable cardiopulmonary risk assessment from a single LDCT scan. The framework introduces an agentic reasoning process that emulates clinical diagnostic thinking: first perceiving pulmonary findings, then reasoning through established medical knowledge, and finally deriving a cardiovascular judgment with explanatory rationale. It integrates three synergistic components: a pulmonary perception module that summarizes lung abnormalities, a knowledge-guided reasoning module that infers their cardiovascular implications, and a cardiac representation module that encodes structural biomarkers. Their outputs are fused to produce a holistic cardiovascular risk prediction that is both accurate and physiologically grounded. Experiments on the NLST cohort demonstrate that the proposed framework achieves state-of-the-art performance for CVD screening and mortality prediction, outperforming single-disease and purely image-based baselines. Beyond quantitative gains, the framework provides human-verifiable reasoning that aligns with cardiological understanding, revealing coherent links between pulmonary abnormalities and cardiac stress mechanisms. Overall, this work establishes a unified and explainable paradigm for cardiovascular analysis from LDCT, bridging the gap between image-based prediction and mechanism-based medical interpretation.
[222] DIAL-GS: Dynamic Instance Aware Reconstruction for Label-free Street Scenes with 4D Gaussian Splatting
Chenpeng Su,Wenhua Wu,Chensheng Peng,Tianchen Deng,Zhe Liu,Hesheng Wang
Main category: cs.CV
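TL;DR: Proposes DIAL-GS, a dynamic instance-aware urban street scene reconstruction method based on 4D Gaussian Splatting that requires no labels and supports fine-grained editing.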
Details
Motivation: Existing self-supervised methods struggle to separate static from dynamic elements and to distinguish individual dynamic objects, which limits fine-grained editing and scalability. Method: Detect dynamic instances through appearance-position inconsistency, model the scene with a unified instance-aware 4D Gaussian representation, and introduce a reciprocal mechanism in which identity and dynamics reinforce each other. Result: In urban driving scenarios, DIAL-GS outperforms existing self-supervised baselines in reconstruction quality and instance-level editing. Conclusion: DIAL-GS achieves label-free, dynamic-adaptive, and instance-aware 4D reconstruction, offering a concise yet powerful solution for urban scene modeling. Abstract: Urban scene reconstruction is critical for autonomous driving, enabling structured 3D representations for data synthesis and closed-loop testing. Supervised approaches rely on costly human annotations and lack scalability, while current self-supervised methods often confuse static and dynamic elements and fail to distinguish individual dynamic objects, limiting fine-grained editing. We propose DIAL-GS, a novel dynamic instance-aware reconstruction method for label-free street scenes with 4D Gaussian Splatting. We first accurately identify dynamic instances by exploiting appearance-position inconsistency between warped rendering and actual observation. Guided by instance-level dynamic perception, we employ instance-aware 4D Gaussians as the unified volumetric representation, realizing dynamic-adaptive and instance-aware reconstruction. Furthermore, we introduce a reciprocal mechanism through which identity and dynamics reinforce each other, enhancing both integrity and consistency. Experiments on urban driving scenarios show that DIAL-GS surpasses existing self-supervised baselines in reconstruction quality and instance-level editing, offering a concise yet powerful solution for urban scene modeling.
[223] UniADC: A Unified Framework for Anomaly Detection and Classification
Ximiao Zhang,Min Xu,Zheng Zhang,Junlin Hu,Xiuzhuang Zhou
Main category: cs.CV
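TL;DR: This paper introduces the task of unified anomaly detection and classification and the UniADC model, which uses a training-free controllable inpainting network and a multi-task discriminator to perform anomaly detection and classification simultaneously with few or no anomaly images.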
Details
Motivation: Existing methods handle anomaly detection and classification separately, ignoring the inherent correlation between them, which limits information sharing and performance. A unified framework that models the two tasks jointly is therefore needed. Method: Propose the UniADC model, consisting of a training-free controllable inpainting network that synthesizes category-specific anomaly images and augments the available data, and a multi-task discriminator that performs detection and classification by aligning fine-grained image features with anomaly-category embeddings. Result: Experiments on the MVTec-FS, MTD, and WFDD datasets show that UniADC outperforms existing methods in anomaly detection, localization, and classification. Conclusion: UniADC achieves efficient unified anomaly detection and classification and still performs well with very few or no real anomaly samples, confirming the effectiveness and practicality of joint modeling. Abstract: In this paper, we introduce the task of unified anomaly detection and classification, which aims to simultaneously detect anomalous regions in images and identify their specific categories. Existing methods typically treat anomaly detection and classification as separate tasks, thereby neglecting their inherent correlation, limiting information sharing, and resulting in suboptimal performance. To address this, we propose UniADC, a unified anomaly detection and classification model that can effectively perform both tasks with only a few or even no anomaly images. Specifically, UniADC consists of two key components: a training-free controllable inpainting network and a multi-task discriminator. The inpainting network can synthesize anomaly images of specific categories by repainting normal regions guided by anomaly priors, and can also repaint few-shot anomaly samples to augment the available anomaly data. The multi-task discriminator is then trained on these synthesized samples, enabling precise anomaly detection and classification by aligning fine-grained image features with anomaly-category embeddings. We conduct extensive experiments on three anomaly detection and classification datasets, including MVTec-FS, MTD, and WFDD, and the results demonstrate that UniADC consistently outperforms existing methods in anomaly detection, localization, and classification. The code is available at https://github.com/cnulab/UniADC.
[224] FreqGRL: Suppressing Low-Frequency Bias and Mining High-Frequency Knowledge for Cross-Domain Few-Shot Learning
Siqi Hui,Sanping Zhou,Ye deng,Wenli Huang,Jinjun Wang
Main category: cs.CV
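TL;DR: This paper proposes FreqGRL, a cross-domain few-shot learning framework built on a frequency-domain perspective; its low-frequency replacement and high-frequency enhancement modules mitigate the data imbalance between source and target domains and significantly improve cross-domain generalization.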
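The low-frequency replacement idea summarized above can be sketched with a 2D FFT. This is a minimal illustration under stated assumptions, not the authors' LFR module: it operates on single-channel images rather than learned features, swaps the full complex coefficients in a centered block, and uses a hypothetical `ratio` parameter to set the block size.

```python
import numpy as np

def low_frequency_replace(source, target, ratio=0.1):
    """Replace the centered low-frequency block of source's spectrum with target's.

    source, target: 2D float arrays of identical shape (single-channel images).
    ratio: hypothetical hyperparameter in [0, 0.5]; half-width of the swapped
           block as a fraction of each image dimension.
    """
    if source.shape != target.shape:
        raise ValueError("source and target must have the same shape")
    if not 0.0 <= ratio <= 0.5:
        raise ValueError("ratio must lie in [0, 0.5]")
    h, w = source.shape
    # Shift the zero-frequency (DC) term to the center so low frequencies are contiguous.
    src_f = np.fft.fftshift(np.fft.fft2(source))
    tgt_f = np.fft.fftshift(np.fft.fft2(target))
    ch, cw = h // 2, w // 2
    rh, rw = int(h * ratio), int(w * ratio)
    mixed = src_f.copy()
    mixed[ch - rh:ch + rh + 1, cw - rw:cw + rw + 1] = \
        tgt_f[ch - rh:ch + rh + 1, cw - rw:cw + rw + 1]
    # The swapped block is symmetric about DC, so the inverse transform is real
    # up to rounding error; np.real discards the negligible imaginary residue.
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))
```

With `ratio=0.0` only the DC term is swapped, so the output keeps the source's structure but takes on the target's mean intensity; larger values transfer progressively more of the target's coarse appearance while the source's high-frequency detail is retained.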
Details
Motivation: In cross-domain few-shot learning, source data is abundant while labeled target data is scarce, so models tend to be biased toward low-frequency source features and struggle to learn high-frequency, generalizable target features. Existing methods lack a frequency-domain analysis of this problem, so a new perspective is needed to address the representation-learning bias caused by data imbalance. Method: Propose the FreqGRL framework with three core modules: (1) Low-Frequency Replacement (LFR), which replaces the low-frequency components of source tasks with those of the target domain to build new source tasks closer to the target domain; (2) High-Frequency Enhancement (HFE), which filters out low frequencies and strengthens learning on high-frequency features in the frequency domain to improve cross-domain generalization; and (3) a Global Frequency Filter (GFF), which suppresses noisy frequencies and emphasizes informative ones to reduce the risk of overfitting. Result: Experiments on five standard CD-FSL benchmarks show that the method outperforms existing state-of-the-art methods on cross-domain few-shot classification, confirming that frequency-domain operations mitigate data imbalance and improve generalization. Conclusion: FreqGRL is the first to analyze and address data imbalance in cross-domain few-shot learning from a frequency-space perspective; its low-frequency replacement, high-frequency enhancement, and global filtering strategies improve cross-domain adaptation and representation quality. Abstract: Cross-domain few-shot learning (CD-FSL) aims to recognize novel classes with only a few labeled examples under significant domain shifts. While recent approaches leverage a limited amount of labeled target-domain data to improve performance, the severe imbalance between abundant source data and scarce target data remains a critical challenge for effective representation learning. We present the first frequency-space perspective to analyze this issue and identify two key challenges: (1) models are easily biased toward source-specific knowledge encoded in the low-frequency components of source data, and (2) the sparsity of target data hinders the learning of high-frequency, domain-generalizable features. To address these challenges, we propose FreqGRL, a novel CD-FSL framework that mitigates the impact of data imbalance in the frequency space. Specifically, we introduce a Low-Frequency Replacement (LFR) module that substitutes the low-frequency components of source tasks with those from the target domain to create new source tasks that better align with target characteristics, thus reducing source-specific biases and promoting generalizable representation learning. We further design a High-Frequency Enhancement (HFE) module that filters out low-frequency components and performs learning directly on high-frequency features in the frequency space to improve cross-domain generalization. Additionally, a Global Frequency Filter (GFF) is incorporated to suppress noisy or irrelevant frequencies and emphasize informative ones, mitigating overfitting risks under limited target supervision. Extensive experiments on five standard CD-FSL benchmarks demonstrate that our frequency-guided framework achieves state-of-the-art performance.
[225] NOVO: Bridging LLaVA and SAM with Visual-only Prompts for Reasoning Segmentation
Kyung-Yoon Yoon,Yeong-Jun Cho
Main category: cs.CV
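TL;DR: Proposes the NOVO framework, which connects vision-language models and segmentation models through visual prompts to perform reasoning segmentation without text-derived embeddings, and introduces a training-free refinement module to improve segmentation quality.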
Details
Motivation: Existing methods rely on text-derived embeddings for segmentation, which limits compatibility with pretrained segmentation models such as SAM; the aim is to use purely visual prompts for more natural, better-aligned segmentation. Method: Generate a coarse mask and point prompts from the VLM output as visual input to SAM, and design a training-free refinement module to improve boundary quality and instance-level segmentation. Result: NOVO performs strongly on the newly built RISeg benchmark, reaching state-of-the-art results on multiple metrics and confirming its effectiveness and scalability. Conclusion: NOVO bridges VLMs and segmentation models through visual-only prompts, stays aligned with SAM's pretrained capabilities, and achieves high-performing reasoning-driven instance segmentation. Abstract: In this study, we propose NOVO (NO text, Visual-Only prompts), a novel framework that bridges vision-language models (VLMs) and segmentation models through visual-only prompts. Unlike prior approaches that feed text-derived SEG token embeddings into segmentation models, NOVO instead generates a coarse mask and point prompts from the VLM output. These visual prompts are compatible with the Segment Anything Model (SAM), preserving alignment with its pretrained capabilities. To further enhance boundary quality and enable instance-level segmentation, we introduce a training-free refinement module that reduces visual artifacts and improves the quality of segmentation masks. We also present RISeg, a new benchmark comprising 918 images, 2,533 instance-level masks, and diverse reasoning queries to evaluate this task. Experiments demonstrate that NOVO achieves state-of-the-art performance across multiple metrics and model sizes, demonstrating its effectiveness and scalability in reasoning segmentation.
[226] Active Learning for Animal Re-Identification with Ambiguity-Aware Sampling
Depanshu Sani,Mehar Khurana,Saket Anand
Main category: cs.CV
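TL;DR: This paper proposes a new active learning (AL) framework for animal re-identification (Re-ID) that combines clustering methods to mine ambiguous regions of the embedding space and uses pairwise constraints to improve unsupervised learning, significantly outperforming existing methods while using only 0.033% of the annotations.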
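One simple way to picture mining ambiguous pairs and using pairwise feedback is sketched below. It is an assumption-laden illustration, not the paper's sampling or constrained clustering refinement algorithm: ambiguity is taken to mean that two clusterings disagree on whether a pair belongs together, only must-link feedback is applied (cannot-link handling is omitted), and both function names are hypothetical.

```python
from itertools import combinations

def ambiguous_pairs(labels_a, labels_b):
    """Return index pairs on which two clusterings disagree about co-membership."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same samples")
    pairs = []
    # Quadratic in the number of samples; acceptable for a small illustration.
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a != same_b:
            pairs.append((i, j))
    return pairs

def apply_must_links(labels, must_link):
    """Merge clusters joined by must-link pairs using union-find over cluster ids."""
    parent = {c: c for c in set(labels)}

    def find(c):
        # Follow parent pointers to the root, halving the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, j in must_link:
        ri, rj = find(labels[i]), find(labels[j])
        if ri != rj:
            # Keep the smaller cluster id as the representative.
            parent[max(ri, rj)] = min(ri, rj)
    return [find(c) for c in labels]
```

For example, with `labels_a = [0, 0, 1, 1, 2]` and `labels_b = [0, 0, 0, 1, 2]`, the two clusterings disagree on the pairs (0, 2), (1, 2), and (2, 3); if an oracle confirms that samples 0 and 2 show the same individual, `apply_must_links(labels_a, [(0, 2)])` merges their clusters.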
Details
Motivation: Animal Re-ID faces challenges such as its open-set nature, new species, and subtle distinguishing features; existing foundation models perform poorly at zero-shot Re-ID, annotation is costly, and existing unsupervised and active learning methods are of limited effectiveness. Method: Propose a new AL Re-ID framework that combines complementary clustering methods to find structurally ambiguous regions in the embedding space, selects sample pairs that are both representative and informative, obtains oracle feedback as must-link and cannot-link constraints, and integrates it into the unsupervised learning pipeline through a constrained clustering refinement algorithm. Result: On 13 wildlife datasets, using only 0.033% of the annotations, average mAP improves by 10.49%, 11.19%, and 3.99% over foundation-model, unsupervised, and active learning methods, respectively, with state-of-the-art performance on both known and unknown individuals. Conclusion: The proposed active learning framework makes efficient use of a small number of annotations, significantly improves animal Re-ID performance, is particularly suited to real ecological settings where annotation is expensive, and offers an effective solution for open-world animal identification. Abstract: Animal Re-ID has recently gained substantial attention in the AI research community due to its high impact on biodiversity monitoring and unique research challenges arising from environmental factors. The subtle distinguishing patterns, the need to handle new species, and the inherent open-set nature make the problem even harder. To address these complexities, foundation models trained on labeled, large-scale and multi-species animal Re-ID datasets have recently been introduced to enable zero-shot Re-ID. However, our benchmarking reveals significant gaps in their zero-shot Re-ID performance for both known and unknown species. While this highlights the need for collecting labeled data in new domains, exhaustive annotation for Re-ID is laborious and requires domain expertise. Our analyses show that existing unsupervised (USL) and AL Re-ID methods underperform for animal Re-ID. To address these limitations, we introduce a novel AL Re-ID framework that leverages complementary clustering methods to uncover and target structurally ambiguous regions in the embedding space for mining pairs of samples that are both informative and broadly representative. Oracle feedback on these pairs, in the form of must-link and cannot-link constraints, facilitates a simple annotation interface, which naturally integrates with existing USL methods through our proposed constrained clustering refinement algorithm. Through extensive experiments, we demonstrate that, by utilizing only 0.033% of all annotations, our approach consistently outperforms existing foundational, USL and AL baselines. Specifically, we report an average improvement of 10.49%, 11.19% and 3.99% (mAP) on 13 wildlife datasets over foundational, USL and AL methods, respectively, while attaining state-of-the-art performance on each dataset. Furthermore, we also show an improvement of 11.09%, 8.2% and 2.06% for unknown individuals in an open-world setting.
[227] Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks
Lingran Song,Yucheng Zhou,Jianbing Shen
Main category: cs.CV
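TL;DR: This paper introduces a new medical vision-language task, Medical Diagnosis Segmentation (MDS), which combines understanding clinical queries about medical images, generating segmentation masks, and producing diagnostic results. The authors build the M3DS dataset and propose the Sim4Seg framework, which improves performance through the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module; experiments show that the method outperforms baselines in both segmentation and diagnosis.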
Details
Motivation: Existing medical image segmentation models rarely explore segmentation and diagnosis jointly, whereas clinical practice needs explainable diagnoses alongside segmentation results. Models that provide both precise segmentation and explainable diagnosis are therefore needed. Method: Define the Medical Diagnosis Segmentation (MDS) task, build the M3DS dataset of multimodal multi-disease data, and design the Sim4Seg framework, which introduces the RVLS2M module to strengthen region-aware vision-language alignment; a test-time scaling strategy is also explored to improve performance. Result: Experiments show that the proposed method significantly outperforms baseline models on the M3DS dataset in both segmentation accuracy and diagnostic accuracy. Conclusion: By jointly modeling segmentation and diagnosis, Sim4Seg achieves better performance and explainability in medical image analysis, confirming the feasibility and importance of the MDS task. Abstract: Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely explore medical segmentation and diagnosis tasks jointly. However, it is crucial for patients that models can provide explainable diagnoses along with medical segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, containing diverse multimodal multi-disease medical images paired with their corresponding segmentation masks and diagnosis chain-of-thought, created via an automated diagnosis chain-of-thought generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation by taking advantage of the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To improve overall performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms the baselines in both segmentation and diagnosis.
[228] REOcc: Camera-Radar Fusion with Radar Feature Enrichment for 3D Occupancy Prediction
Chaehee Song,Sanmin Kim,Hyeonjun Jeong,Juyeb Shin,Joonhee Lim,Dongsuk Kum
Main category: cs.CV
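TL;DR: This paper proposes REOcc, a new camera-radar fusion network for 3D occupancy prediction that enriches radar feature representations with a Radar Densifier and a Radar Amplifier, mitigating the sparsity and noise of radar data. It significantly improves performance on the Occ3D-nuScenes benchmark, especially for dynamic object classes.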
Details
Motivation: Vision-based 3D occupancy prediction is limited in complex environments by its reliance on cameras alone, and although radar data is complementary, its sparsity and noise degrade fusion, so radar feature quality must be improved for better multi-sensor fusion. Method: Propose the REOcc network with two modules, a Radar Densifier and a Radar Amplifier, that integrate spatial and contextual information to improve the density and quality of radar features, which are then fused with camera data for 3D occupancy prediction. Result: Experiments on the Occ3D-nuScenes benchmark show that REOcc significantly outperforms the camera-only model, particularly for occupancy prediction of dynamic object classes, confirming that it mitigates radar sparsity and noise. Conclusion: By enriching radar feature representations, REOcc makes full use of camera-radar fusion, improves the robustness and reliability of 3D occupancy prediction, and offers an effective solution for multi-sensor fusion. Abstract: Vision-based 3D occupancy prediction has made significant advancements, but its reliance on cameras alone struggles in challenging environments. This limitation has driven the adoption of sensor fusion, among which camera-radar fusion stands out as a promising solution due to their complementary strengths. However, the sparsity and noise of the radar data limit its effectiveness, leading to suboptimal fusion performance. In this paper, we propose REOcc, a novel camera-radar fusion network designed to enrich radar feature representations for 3D occupancy prediction. Our approach introduces two main components, a Radar Densifier and a Radar Amplifier, which refine radar features by integrating spatial and contextual information, effectively enhancing spatial density and quality. Extensive experiments on the Occ3D-nuScenes benchmark demonstrate that REOcc achieves significant performance gains over the camera-only baseline model, particularly in dynamic object classes. These results underscore REOcc's capability to mitigate the sparsity and noise of the radar data. Consequently, radar complements camera data more effectively, unlocking the full potential of camera-radar fusion for robust and reliable 3D occupancy prediction.
[229] Flexible Concept Bottleneck Model
Xingbo Du,Qiantong Dou,Lei Fan,Rui Zhang
Main category: cs.CV
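TL;DR: Proposes the Flexible Concept Bottleneck Model (FCBM), which uses a hypernetwork and a sparsemax module with a learnable temperature to adapt dynamically to new concepts without full retraining.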
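To show how a temperature interacts with sparsemax-style concept selection, here is a minimal sketch of the standard sparsemax projection applied to temperature-scaled scores. The paper describes a modified sparsemax whose details are not given in this listing, so this should be read as the textbook operator plus an assumed scaling; in a trained model the temperature would be a learnable parameter rather than a function argument.

```python
def sparsemax(scores, temperature=1.0):
    """Euclidean projection of scores / temperature onto the probability simplex."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    z = [s / temperature for s in scores]
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(z_sorted, start=1):
        cumulative += value
        # The support is the largest k with 1 + k * z_(k) > sum of the top-k scores;
        # the condition holds for a prefix of k, so the last update gives that k.
        if 1.0 + k * value > cumulative:
            tau = (cumulative - 1.0) / k
    # Entries at or below the threshold tau are set exactly to zero.
    return [max(v - tau, 0.0) for v in z]
```

A low temperature sharpens the scores, so fewer concepts survive the threshold, while a high temperature flattens them and spreads weight over more concepts: `sparsemax([1.0, 0.0])` keeps only the first entry, whereas `sparsemax([1.0, 0.0], temperature=2.0)` assigns non-zero weight to both.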
Details
Motivation: Existing concept bottleneck models based on vision-language models (VLMs) must be retrained when new concepts are introduced, which limits their adaptability and flexibility in real-world scenarios. Method: Design a hypernetwork that generates prediction weights, combined with a sparsemax module with a learnable temperature driven by concept embeddings, to enable dynamic concept selection and seamless integration. Result: Experiments on five public benchmarks show that FCBM achieves accuracy comparable to state-of-the-art methods with a similar number of effective concepts and generalizes well to unseen concepts after a single epoch of fine-tuning. Conclusion: FCBM is highly adaptable and flexible, supports dynamic concept updates (including complete replacement), and makes concept bottleneck models more practical in a rapidly evolving vision-language landscape. Abstract: Concept bottleneck models (CBMs) improve neural network interpretability by introducing an intermediate layer that maps human-understandable concepts to predictions. Recent work has explored the use of vision-language models (VLMs) to automate concept selection and annotation. However, existing VLM-based CBMs typically require full model retraining when new concepts are involved, which limits their adaptability and flexibility in real-world scenarios, especially considering the rapid evolution of vision-language foundation models. To address these issues, we propose Flexible Concept Bottleneck Model (FCBM), which supports dynamic concept adaptation, including complete replacement of the original concept set. Specifically, we design a hypernetwork that generates prediction weights based on concept embeddings, allowing seamless integration of new concepts without retraining the entire model. In addition, we introduce a modified sparsemax module with a learnable temperature parameter that dynamically selects the most relevant concepts, enabling the model to focus on the most informative features. Extensive experiments on five public benchmarks demonstrate that our method achieves accuracy comparable to state-of-the-art baselines with a similar number of effective concepts. Moreover, the model generalizes well to unseen concepts with just a single epoch of fine-tuning, demonstrating its strong adaptability and flexibility.
[230] AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer
Yulim So,Seokho Kang
Main category: cs.CV
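TL;DR: Proposes AnoStyler, a lightweight zero-shot anomaly generation method that produces visually realistic anomaly images through text-guided style transfer and needs only a single normal image.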
Details
Motivation: Existing anomaly generation methods suffer from insufficient visual realism, dependence on large numbers of real images, and heavyweight models, which limits practical use. Method: Frame zero-shot anomaly generation as text-guided style transfer: from a single normal image, its category label, and the defect type, generate an anomaly mask and two-class text prompts, then stylize the image with a lightweight U-Net trained with CLIP-based losses. Result: Experiments on the MVTec-AD and VisA datasets show that AnoStyler generates higher-quality, more diverse anomaly images and effectively improves anomaly detection performance. Conclusion: AnoStyler is an anomaly generation method that needs little data, uses a lightweight model, and generates high-quality results, making it practical and broadly applicable. Abstract: Anomaly generation has been widely explored to address the scarcity of anomaly images in real-world data. However, existing methods typically suffer from at least one of the following limitations, hindering their practical deployment: (1) lack of visual realism in generated anomalies; (2) dependence on large amounts of real images; and (3) use of memory-intensive, heavyweight model architectures. To overcome these limitations, we propose AnoStyler, a lightweight yet effective method that frames zero-shot anomaly generation as text-guided style transfer. Given a single normal image along with its category label and expected defect type, an anomaly mask indicating the localized anomaly regions and two-class text prompts representing the normal and anomaly states are generated using generalizable category-agnostic procedures. A lightweight U-Net model trained with CLIP-based loss functions is used to stylize the normal image into a visually realistic anomaly image, where anomalies are localized by the anomaly mask and semantically aligned with the text prompts. Extensive experiments on the MVTec-AD and VisA datasets show that AnoStyler outperforms existing anomaly generation methods in generating high-quality and diverse anomaly images. Furthermore, using these generated anomalies helps enhance anomaly detection performance.
[231] SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
Yifan Wang,Yian Zhao,Fanqi Pu,Xiaochen Yang,Yang Tang,Xi Chen,Wenming Yang
Main category: cs.CV
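TL;DR: Proposes Spatial-Projection Alignment (SPAN), which strengthens geometric consistency in monocular 3D detection through spatial point alignment and 3D-2D projection alignment, combined with a hierarchical task learning strategy to improve performance.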
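The 3D-2D projection alignment idea can be made concrete by projecting the corners of a predicted 3D box and comparing the enclosing rectangle with the 2D detection box. The sketch below is a simplified proxy under stated assumptions, not SPAN's actual loss: it assumes a pinhole camera with x right, y down, z forward, yaw about the y-axis, and an IoU-based penalty; all function and parameter names are hypothetical.

```python
import math

def box_corners(center, dims, yaw):
    """Eight corners of a 3D box in camera coordinates (x right, y down, z forward).

    dims = (width along x, height along y, length along z) before rotating by
    yaw about the camera y-axis. This axis convention is an assumption.
    """
    cx, cy, cz = center
    w, h, l = dims
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-w / 2, w / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-l / 2, l / 2):
                # Rotate the offset about the y-axis, then translate to the center.
                corners.append((cx + c * dx + s * dz, cy + dy, cz - s * dx + c * dz))
    return corners

def projected_bbox(corners, fx, fy, u0, v0):
    """Axis-aligned image rectangle (u_min, v_min, u_max, v_max) enclosing the corners."""
    if any(z <= 0 for _, _, z in corners):
        raise ValueError("all corners must lie in front of the camera")
    us = [fx * x / z + u0 for x, _, z in corners]
    vs = [fy * y / z + v0 for _, y, z in corners]
    return (min(us), min(vs), max(us), max(vs))

def iou(a, b):
    """Intersection over union of two (u_min, v_min, u_max, v_max) rectangles."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def projection_alignment_loss(corners, box_2d, fx, fy, u0, v0):
    """1 - IoU between the projected 3D box and the 2D detection box."""
    return 1.0 - iou(projected_bbox(corners, fx, fy, u0, v0), box_2d)
```

Because the projected rectangle depends jointly on center, depth, dimensions, and yaw, a penalty of this kind couples attributes that decoupled regression branches would otherwise predict independently.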
Details
Motivation: Existing monocular 3D detectors adopt a decoupled prediction paradigm that ignores the geometric constraints linking different attributes, leading to poor geometric consistency and degraded performance. Method: Propose SPAN, which comprises Spatial Point Alignment (enforcing spatial consistency between the predicted and ground-truth 3D boxes) and 3D-2D Projection Alignment (ensuring the projected 3D box aligns tightly with the 2D detection box on the image plane), together with a Hierarchical Task Learning strategy that introduces the alignment terms progressively to stabilize training. Result: Experiments show that the method integrates easily into existing monocular 3D detectors and yields significant performance gains. Conclusion: SPAN strengthens the geometric consistency of 3D detection, mitigates the spatial drift and projection misalignment of conventional decoupled methods, and improves detection accuracy. Abstract: Existing monocular 3D detectors typically tame the pronounced nonlinear regression of 3D bounding box through a decoupled prediction paradigm, which employs multiple branches to estimate geometric center, depth, dimensions, and rotation angle separately. Although this decoupling strategy simplifies the learning process, it inherently ignores the geometric collaborative constraints between different attributes, resulting in the lack of geometric consistency prior, thereby leading to suboptimal performance. To address this issue, we propose a novel Spatial-Projection Alignment (SPAN) with two pivotal components: (i) Spatial Point Alignment enforces an explicit global spatial constraint between the predicted and ground-truth 3D bounding boxes, thereby rectifying spatial drift caused by decoupled attribute regression. (ii) 3D-2D Projection Alignment ensures that the projected 3D box is aligned tightly within its corresponding 2D detection bounding box on the image plane, mitigating projection misalignment overlooked in previous works. To ensure training stability, we further introduce a Hierarchical Task Learning strategy that progressively incorporates spatial-projection alignment as 3D attribute predictions refine, preventing early stage error propagation across attributes. Extensive experiments demonstrate that the proposed method can be easily integrated into any established monocular 3D detector and delivers significant performance improvements.
[232] K-Stain: Keypoint-Driven Correspondence for H&E-to-IHC Virtual Staining
Sicheng Yang,Zhaohu Xing,Haipeng Zhou,Lei Zhu
Main category: cs.CV
TL;DR: Proposes the K-Stain framework, which uses keypoints to improve the accuracy and visual consistency of H&E-to-IHC virtual staining.
Details
Motivation: Existing virtual staining methods cannot exploit spatial information effectively because tissue slices are misaligned, which degrades the quality of synthesized IHC images. Method: Propose K-Stain, comprising a Hierarchical Spatial Keypoint Detector (HSKD), a Keypoint-aware Enhancement Generator (KEG), and a Keypoint Guided Discriminator (KGD), and use contextual information from adjacent slices for more precise spatial alignment and image generation. Result: Experiments show that K-Stain outperforms current state-of-the-art methods in both quantitative metrics and visual quality. Conclusion: Keypoint-based methods improve the fidelity of spatial structure in virtual staining, and K-Stain offers a more accurate and reliable solution for H&E-to-IHC conversion. Abstract: Virtual staining offers a promising method for converting Hematoxylin and Eosin (H&E) images into Immunohistochemical (IHC) images, eliminating the need for costly chemical processes. However, existing methods often struggle to utilize spatial information effectively due to misalignment in tissue slices. To overcome this challenge, we leverage keypoints as robust indicators of spatial correspondence, enabling more precise alignment and integration of structural details in synthesized IHC images. We introduce K-Stain, a novel framework that employs keypoint-based spatial and semantic relationships to enhance synthesized IHC image fidelity. K-Stain comprises three main components: (1) a Hierarchical Spatial Keypoint Detector (HSKD) for identifying keypoints in stain images, (2) a Keypoint-aware Enhancement Generator (KEG) that integrates these keypoints during image generation, and (3) a Keypoint Guided Discriminator (KGD) that improves the discriminator's sensitivity to spatial details. Our approach leverages contextual information from adjacent slices, resulting in more accurate and visually consistent IHC images. Extensive experiments show that K-Stain outperforms state-of-the-art methods in quantitative metrics and visual quality.
[233] MirrorMamba: Towards Scalable and Robust Mirror Detection in Videos
Rui Song,Jiaying Lin,Rynson W. H. Lau
Main category: cs.CV
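TL;DR: Proposes MirrorMamba, a new Mamba-based method for video mirror detection. With multi-cue fusion and a novel multidirection correspondence extractor, it outperforms existing methods on multiple datasets and is the first work to successfully apply Mamba to mirror detection.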
Details
Motivation: Existing video mirror detection methods rely on a single dynamic feature and are constrained by the limited receptive field of CNNs or the computational complexity of Transformers, resulting in limited performance and robustness. Method: Proposes MirrorMamba, which combines multiple cues including perceived depth, correspondence, and optical flow; designs a Mamba-based Multidirection Correspondence Extractor and a layer-wise boundary enforcement decoder, exploiting Mamba's global receptive field and linear complexity to improve detection. Result: Achieves state-of-the-art performance on both video and image mirror detection benchmarks, showing good robustness and generalization. Conclusion: MirrorMamba addresses the limitations of existing methods and demonstrates the strength and potential of the Mamba architecture for mirror detection. Abstract: Video mirror detection has received significant research attention, yet existing methods suffer from limited performance and robustness. These approaches often over-rely on single, unreliable dynamic features, and are typically built on CNNs with limited receptive fields or Transformers with quadratic computational complexity. To address these limitations, we propose a new effective and scalable video mirror detection method, called MirrorMamba. Our approach leverages multiple cues to adapt to diverse conditions, incorporating perceived depth, correspondence, and optical flow. We also introduce an innovative Mamba-based Multidirection Correspondence Extractor, which benefits from the global receptive field and linear complexity of the emerging Mamba spatial state model to effectively capture correspondence properties. Additionally, we design a Mamba-based layer-wise boundary enforcement decoder to resolve the unclear boundary caused by the blurred depth map. Notably, this work marks the first successful application of the Mamba-based architecture in the field of mirror detection. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches for video mirror detection on the benchmark datasets. Furthermore, on the most challenging and representative image-based mirror detection dataset, our approach achieves state-of-the-art performance, proving its robustness and generalizability.
[234] MRT: Learning Compact Representations with Mixed RWKV-Transformer for Extreme Image Compression
Han Liu,Hengyu Man,Xingtao Wang,Wenrui Li,Debin Zhao
Main category: cs.CV
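TL;DR: Proposes a new Mixed RWKV-Transformer (MRT) architecture for efficient image compression at extremely low bitrates. It encodes images into compact 1-D latent representations, combines the strengths of RWKV and Transformer, and significantly outperforms existing methods.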
Details
Motivation: Existing image compression methods based on 2-D latent spaces retain spatial redundancy, which limits compression efficiency, so a more compact representation is needed. Method: Proposes the Mixed RWKV-Transformer (MRT) architecture: after partitioning the image into windows, RWKV models global dependencies across windows and Transformer blocks model local redundancy within each window; a dedicated RWKV Compression Model (RCM) for the 1-D latent features further improves efficiency. Result: On Kodak and CLIC2020, MRT saves 43.75% and 30.59% bitrate relative to GLC, respectively, and maintains excellent reconstruction quality at extremely low bitrates below 0.02 bpp. Conclusion: With 1-D compact latent representations and a hierarchical attention mechanism, MRT reduces redundancy and significantly improves extreme image compression performance, pointing to a new direction for efficient compression. Abstract: Recent advances in extreme image compression have revealed that mapping pixel data into highly compact latent representations can significantly improve coding efficiency. However, most existing methods compress images into 2-D latent spaces via convolutional neural networks (CNNs) or Swin Transformers, which tend to retain substantial spatial redundancy, thereby limiting overall compression performance. In this paper, we propose a novel Mixed RWKV-Transformer (MRT) architecture that encodes images into more compact 1-D latent representations by synergistically integrating the complementary strengths of linear-attention-based RWKV and self-attention-based Transformer models. Specifically, MRT partitions each image into fixed-size windows, utilizing RWKV modules to capture global dependencies across windows and Transformer blocks to model local redundancies within each window. The hierarchical attention mechanism enables more efficient and compact representation learning in the 1-D domain. To further enhance compression efficiency, we introduce a dedicated RWKV Compression Model (RCM) tailored to the structure characteristics of the intermediate 1-D latent features in MRT. Extensive experiments on standard image compression benchmarks validate the effectiveness of our approach. The proposed MRT framework consistently achieves superior reconstruction quality at bitrates below 0.02 bits per pixel (bpp).
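To give a sense of scale for the 0.02 bpp figure, the short sketch below converts a bits-per-pixel rate into a file size. It is only an arithmetic illustration, not part of MRT; the function name is hypothetical, and the 768x512 resolution is the standard size of the Kodak test images.

```python
def compressed_size_bytes(width: int, height: int, bpp: float) -> float:
    """Size in bytes of an image coded at `bpp` bits per pixel."""
    total_bits = width * height * bpp   # bits spent on the whole image
    return total_bits / 8.0             # 8 bits per byte


# A 768x512 Kodak image at 0.02 bpp fits in under one kilobyte.
kodak_bytes = compressed_size_bytes(768, 512, 0.02)
print(f"{kodak_bytes:.2f} bytes")  # 983.04 bytes
```

For comparison, the same image stored as uncompressed 24-bit RGB takes 768 x 512 x 3 = 1,179,648 bytes.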
Quantitative results based on the DISTS metric show that MRT significantly outperforms the state-of-the-art 2-D architecture GLC, achieving bitrate savings of 43.75% and 30.59% on the Kodak and CLIC2020 test datasets, respectively.
[235] Relative Energy Learning for LiDAR Out-of-Distribution Detection
Zizhao Li,Zhengkang Xiang,Jiayang Ao,Joseph West,Kourosh Khoshelham
Main category: cs.CV
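TL;DR: Proposes Relative Energy Learning (REL), a framework for out-of-distribution (OOD) detection in LiDAR point clouds. Combined with the Point Raise data synthesis strategy, it significantly outperforms existing methods on the SemanticKITTI and STU benchmarks.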
Details
Motivation: Existing LiDAR OOD detection methods struggle to distinguish rare anomalies from common classes, causing high false-positive rates and overconfident errors in safety-critical settings, and they lack OOD samples at training time. Method: Proposes Relative Energy Learning (REL), which uses the energy gap between positive and negative logits as a relative scoring function, and designs the Point Raise strategy, which perturbs existing point clouds to generate auxiliary anomalies for training. Result: Experiments on the SemanticKITTI and STU benchmarks show that REL surpasses existing methods by a large margin, lowering false-positive rates and improving detection robustness across scenes. Conclusion: Modeling relative energy together with simple synthetic outliers is a principled and scalable solution for reliable OOD detection in open-world autonomous driving. Abstract: Out-of-distribution (OOD) detection is a critical requirement for reliable autonomous driving, where safety depends on recognizing road obstacles and unexpected objects beyond the training distribution. Despite extensive research on OOD detection in 2D images, direct transfer to 3D LiDAR point clouds has proven ineffective. Current LiDAR OOD methods struggle to distinguish rare anomalies from common classes, leading to high false-positive rates and overconfident errors in safety-critical settings. We propose Relative Energy Learning (REL), a simple yet effective framework for OOD detection in LiDAR point clouds. REL leverages the energy gap between positive (in-distribution) and negative logits as a relative scoring function, mitigating calibration issues in raw energy values and improving robustness across various scenes. To address the absence of OOD samples during training, we propose a lightweight data synthesis strategy called Point Raise, which perturbs existing point clouds to generate auxiliary anomalies without altering the inlier semantics. Evaluated on SemanticKITTI and the Spotting the Unexpected (STU) benchmark, REL consistently outperforms existing methods by a large margin. Our results highlight that modeling relative energy, combined with simple synthetic outliers, provides a principled and scalable solution for reliable OOD detection in open-world autonomous driving.
[236] AvatarTex: High-Fidelity Facial Texture Reconstruction from Single-Image Stylized Avatars
Yuda Qiu,Zitong Xiao,Yiwei Zuo,Zisheng Ye,Weikai Chen,Xiaoguang Han
Main category: cs.CV
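TL;DR: AvatarTex proposes a three-stage diffusion-to-GAN pipeline that generates stylized and photorealistic facial textures from a single image, combining the diversity of diffusion models with the structural consistency of GANs, and sets a new state of the art in multi-style facial texture reconstruction with the newly built TexHub dataset.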
Details
Motivation: Existing methods for stylized avatars suffer from a lack of multi-style datasets and from difficulty maintaining geometric consistency in non-standard textures, making it hard to guarantee both artistic quality and topological accuracy. Method: Proposes AvatarTex with a three-stage diffusion-to-GAN pipeline: diffusion-based UV texture inpainting first, then GAN latent-space optimization to improve style and structure consistency, and finally diffusion-based repainting to enhance details; also builds TexHub, a dataset of 20,000 multi-style high-resolution UV textures, to support training and evaluation. Result: AvatarTex outperforms existing methods on both stylized and photorealistic facial texture reconstruction, producing high-fidelity, topology-aligned, and artistically coherent textures; ablations validate each stage, and qualitative and quantitative results show advantages in diversity and geometric consistency. Conclusion: By combining the strengths of diffusion models and GANs, AvatarTex achieves high-quality multi-style facial texture reconstruction without 3D supervision and provides a practical solution for virtual character generation; the released TexHub dataset is expected to advance research in the field. Abstract: We present AvatarTex, a high-fidelity facial texture reconstruction framework capable of generating both stylized and photorealistic textures from a single image. Existing methods struggle with stylized avatars due to the lack of diverse multi-style datasets and challenges in maintaining geometric consistency in non-standard textures. To address these limitations, AvatarTex introduces a novel three-stage diffusion-to-GAN pipeline. Our key insight is that while diffusion models excel at generating diversified textures, they lack explicit UV constraints, whereas GANs provide a well-structured latent space that ensures style and topology consistency. By integrating these strengths, AvatarTex achieves high-quality topology-aligned texture synthesis with both artistic and geometric coherence. Specifically, our three-stage pipeline first completes missing texture regions via diffusion-based inpainting, refines style and structure consistency using GAN-based latent optimization, and enhances fine details through diffusion-based repainting. To address the need for a stylized texture dataset, we introduce TexHub, a high-resolution collection of 20,000 multi-style UV textures with precise UV-aligned layouts. By leveraging TexHub and our structured diffusion-to-GAN pipeline, AvatarTex establishes a new state-of-the-art in multi-style facial texture reconstruction. TexHub will be released upon publication to facilitate future research in this field.
[237] Argus: Quality-Aware High-Throughput Text-to-Image Inference Serving System
Shubham Agarwal,Subrata Mitra,Saud Iqbal
Main category: cs.CV
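TL;DR: Argus is a high-throughput text-to-image inference system that selects an appropriate approximation level for each prompt, preserving generation quality while meeting throughput targets.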
Details
Motivation: Text-to-image models are compute-intensive with long inference times, which makes designing efficient high-throughput systems challenging; many prompts can be served faster with approximated models, but this requires careful calibration to avoid quality degradation. Method: Argus dynamically selects among different approximation strategies and assigns models and approximation settings according to prompt content, to satisfy both latency and quality requirements. Result: On two real-world workloads, Argus achieves 10x fewer latency SLO violations, 10% higher average quality, and 40% higher throughput than baselines. Conclusion: Through fine-grained scheduling of approximation strategies, Argus delivers high-quality, high-throughput T2I inference on a fixed-size cluster and significantly outperforms existing methods. Abstract: Text-to-image (T2I) models have gained significant popularity. Most of these are diffusion models with unique computational characteristics, distinct from both traditional small-scale ML models and large language models. They are highly compute-bound and use an iterative denoising process to generate images, leading to very high inference time. This creates significant challenges in designing a high-throughput system. We discovered that a large fraction of prompts can be served using faster, approximated models. However, the approximation setting must be carefully calibrated for each prompt to avoid quality degradation. Designing a high-throughput system that assigns each prompt to the appropriate model and compatible approximation setting remains a challenging problem. We present Argus, a high-throughput T2I inference system that selects the right level of approximation for each prompt to maintain quality while meeting throughput targets on a fixed-size cluster. Argus intelligently switches between different approximation strategies to satisfy both throughput and quality requirements. Overall, Argus achieves 10x fewer latency service-level objective (SLO) violations, 10% higher average quality, and 40% higher throughput compared to baselines on two real-world workload traces.
[238] Rethinking Rainy 3D Scene Reconstruction via Perspective Transforming and Brightness Tuning
Qianfeng Yang,Xiang Chen,Pengpeng Li,Qiyuan Guan,Guiyue Jin,Jiyu Jin
Main category: cs.CV
TL;DR: Proposes a new dataset, OmniRain3D, and a reconstruction framework, REVR-GSNet, for high-quality 3D scene reconstruction from rain-degraded multi-view images.
Details
Motivation: Existing rainy 3D scene datasets ignore how the appearance of rain streaks changes with viewpoint and how ambient brightness changes during rainfall, which leads to inaccurate reconstruction. More realistic data and matching methods are needed to improve multi-view deraining and 3D reconstruction. Method: Builds the OmniRain3D dataset, which models perspective heterogeneity and brightness dynamicity, and proposes the end-to-end REVR-GSNet framework, which combines recursive brightness enhancement, Gaussian primitive optimization, and a GS-guided rain elimination module through joint alternating optimization to achieve high-fidelity 3D scene reconstruction. Result: Experiments show that the dataset and method are effective for both multi-view deraining and 3D scene reconstruction, significantly improving reconstruction quality in rainy conditions. Conclusion: OmniRain3D and REVR-GSNet provide a solid foundation for multi-view image deraining and rainy 3D scene reconstruction and advance research in this direction. Abstract: Rain degrades the visual quality of multi-view images, which are essential for 3D scene reconstruction, resulting in inaccurate and incomplete reconstruction results. Existing datasets often overlook two critical characteristics of real rainy 3D scenes: the viewpoint-dependent variation in the appearance of rain streaks caused by their projection onto 2D images, and the reduction in ambient brightness resulting from cloud coverage during rainfall. To improve data realism, we construct a new dataset named OmniRain3D that incorporates perspective heterogeneity and brightness dynamicity, enabling more faithful simulation of rain degradation in 3D scenes. Based on this dataset, we propose an end-to-end reconstruction framework named REVR-GSNet (Rain Elimination and Visibility Recovery for 3D Gaussian Splatting). Specifically, REVR-GSNet integrates recursive brightness enhancement, Gaussian primitive optimization, and GS-guided rain elimination into a unified architecture through joint alternating optimization, achieving high-fidelity reconstruction of clean 3D scenes from rain-degraded inputs. Extensive experiments show the effectiveness of our dataset and method. Our dataset and method provide a foundation for future research on multi-view image deraining and rainy 3D scene reconstruction.
[239] SinSEMI: A One-Shot Image Generation Model and Data-Efficient Evaluation Framework for Semiconductor Inspection Equipment
ChunLiang Wu,Xiaochun Li
Main category: cs.CV
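TL;DR: Proposes SinSEMI, a one-shot learning method that generates diverse and highly realistic semiconductor equipment images from a single optical image, addressing the scarcity of image data in early development stages.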
Details
Motivation: In the early stages of semiconductor equipment development, obtaining large numbers of raw optical images is difficult, which limits the use of AI in semiconductor manufacturing. A method that generates high-quality images from very few samples is therefore needed. Method: SinSEMI uses a multi-scale flow-based generative model with LPIPS energy guidance during sampling to improve the perceptual realism and diversity of generated images; it also proposes an evaluation framework tailored to this task that needs only two reference images for a thorough assessment. Result: Experiments show that SinSEMI outperforms other one-shot generation methods in visual quality, quantitative metrics, and downstream tasks, and that its images have high fidelity and meaningful diversity. Conclusion: SinSEMI alleviates the shortage of training data in semiconductor manufacturing, and its generated images can serve as effective substitute data for training AI models. Abstract: In the early stages of semiconductor equipment development, obtaining large quantities of raw optical images poses a significant challenge. This data scarcity hinders the advancement of AI-powered solutions in semiconductor manufacturing. To address this challenge, we introduce SinSEMI, a novel one-shot learning approach that generates diverse and highly realistic images from a single optical image. SinSEMI employs a multi-scale flow-based model enhanced with LPIPS (Learned Perceptual Image Patch Similarity) energy guidance during sampling, ensuring both perceptual realism and output variety. We also introduce a comprehensive evaluation framework tailored for this application, which enables a thorough assessment using just two reference images. Through the evaluation against multiple one-shot generation techniques, we demonstrate SinSEMI's superior performance in visual quality, quantitative measures, and downstream tasks. Our experimental results demonstrate that SinSEMI-generated images achieve both high fidelity and meaningful diversity, making them suitable as training data for semiconductor AI applications.
[240] Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV
Wenbo Huang,Jinghui Zhang,Zhenghao Chen,Guang Li,Lei Zhang,Yang Cao,Fang Dong,Takahiro Ogawa,Miki Haseyama
Main category: cs.CV
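TL;DR: Proposes Otter, a new method that tackles background distraction and temporal relation reconstruction in few-shot action recognition on wide-angle videos. A Compound Segmentation Module (CSM) highlights subjects and a Temporal Reconstruction Module (TRM) strengthens temporal modeling, yielding significant performance gains.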
Details
Motivation: Wide-angle videos contain heavy background distraction, and temporal relations degrade because of frames with similar backgrounds, so existing methods struggle to recognize actions. A model that strengthens both subject focus and temporal reconstruction is needed. Method: Proposes the Otter framework, with a Compound Segmentation Module (CSM) that segments and emphasizes key regions in each frame to suppress background distraction, and a Temporal Reconstruction Module (TRM) that uses bidirectional scanning to improve temporal prototype construction; regular and temporal-enhanced prototypes are fused to balance subject emphasis and temporal modeling. Result: Achieves state-of-the-art performance on benchmarks including SSv2, Kinetics, UCF101, and HMDB51; extra experiments on the VideoBadminton dataset further confirm its advantage in wide-angle few-shot action recognition. Conclusion: Through effective spatial attention and temporal relation reconstruction, Otter significantly outperforms existing methods on wide-angle few-shot action recognition and shows strong global modeling ability and application potential. Abstract: Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. However, directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relations degraded by frames with similar backgrounds are difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing temporal relations to be better reconstructed. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance.
Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.
[241] PointCubeNet: 3D Part-level Reasoning with 3x3x3 Point Cloud Blocks
Da-Yeong Kim,Yeong-Jun Cho
Main category: cs.CV
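TL;DR: Proposes PointCubeNet, a multi-modal 3D understanding framework that needs no part annotations and achieves unsupervised 3D part-level reasoning through a local branch and a pseudo-labeling method.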
Details
Motivation: To achieve part-level understanding of 3D objects without part annotations and thereby improve overall 3D object understanding. Method: Designs PointCubeNet with global and local branches; the local branch is structured into 3x3x3 local blocks and is trained without supervision using a pseudo-labeling method and a local loss function. Result: Experiments show that understanding 3D object parts helps overall object understanding, and that the method achieves reliable and meaningful results on unsupervised 3D part-level reasoning. Conclusion: PointCubeNet is the first to achieve unsupervised 3D part-level reasoning and offers a new approach to multi-modal 3D understanding. Abstract: In this paper, we propose PointCubeNet, a novel multi-modal 3D understanding framework that achieves part-level reasoning without requiring any part annotations. PointCubeNet comprises global and local branches. The proposed local branch, structured into 3x3x3 local blocks, enables part-level analysis of point cloud sub-regions with the corresponding local text labels. Leveraging the proposed pseudo-labeling method and local loss function, PointCubeNet is effectively trained in an unsupervised manner. The experimental results demonstrate that understanding 3D object parts enhances the understanding of the overall 3D object. In addition, this is the first attempt to perform unsupervised 3D part-level reasoning and achieves reliable and meaningful results.
[242] Image Restoration via Primal Dual Hybrid Gradient and Flow Generative Model
Ji Li,Chao Wang
Main category: cs.CV
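TL;DR: Proposes a general Plug-and-Play algorithm based on generative priors and the primal-dual hybrid gradient (PDHG) method. It supports a variety of data fidelity terms and outperforms the conventional squared ℓ2 loss, especially under non-Gaussian noise.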
Details
Motivation: Existing PnP methods mainly suit the smooth squared ℓ2 data fidelity term and have difficulty handling general fidelity terms under non-Gaussian noise, which limits their use in practical imaging inverse problems. Method: Introduces a time-dependent denoiser derived from a flow matching generative model into the PnP framework and combines it with the PDHG primal-dual algorithm to design a new PnP optimization algorithm that supports general fidelity terms such as ℓ1 and ℓ2 norms. Result: The method is validated on denoising, super-resolution, deblurring, and inpainting; results show that ℓ1 and ℓ2 fidelity terms outperform the conventional squared ℓ2 loss under non-Gaussian noise such as Poisson and impulse noise. Conclusion: The proposed PnP-PDHG algorithm is general, computationally efficient, and memory-friendly, and extends the use of generative priors to image restoration under non-Gaussian noise. Abstract: Regularized optimization has been a classical approach to solving imaging inverse problems, where the regularization term enforces desirable properties of the unknown image. Recently, the integration of flow matching generative models into image restoration has garnered significant attention, owing to their powerful prior modeling capabilities. In this work, we incorporate such generative priors into a Plug-and-Play (PnP) framework based on proximal splitting, where the proximal operator associated with the regularizer is replaced by a time-dependent denoiser derived from the generative model. While existing PnP methods have achieved notable success in inverse problems with smooth squared $\ell_2$ data fidelity--typically associated with Gaussian noise--their applicability to more general data fidelity terms remains underexplored. To address this, we propose a general and efficient PnP algorithm inspired by the primal-dual hybrid gradient (PDHG) method. Our approach is computationally efficient, memory-friendly, and accommodates a wide range of fidelity terms. In particular, it supports both $\ell_1$ and $\ell_2$ norm-based losses, enabling robustness to non-Gaussian noise types such as Poisson and impulse noise. We validate our method on several image restoration tasks, including denoising, super-resolution, deblurring, and inpainting, and demonstrate that $\ell_1$ and $\ell_2$ fidelity terms outperform the conventional squared $\ell_2$ loss in the presence of non-Gaussian noise.
[243] Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images
You-Kyoung Na,Yeong-Jun Cho
Main category: cs.CV
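TL;DR: Proposes Med-SORA, the first medical multimodal framework for symptom-to-organ reasoning in abdominal CT images. It introduces RAG-based dataset construction, soft labels with learnable organ anchors, and a 2D-3D cross-attention architecture, significantly improving clinical reasoning accuracy.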
Details
Motivation: Existing medical multimodal models mostly use simple one-to-one hard labels and only single-slice 2D features, ignoring the clinical reality that a symptom can relate to multiple organs as well as the 3D anatomical context. Method: Proposes the Med-SORA framework, which uses retrieval-augmented generation (RAG) for dataset construction, introduces soft labels with learnable organ anchors to model one-to-many symptom-organ relationships, and designs a 2D-3D cross-attention mechanism to fuse local and global image features. Result: Experiments show that Med-SORA outperforms existing medical multimodal models on symptom-to-organ reasoning and enables more accurate 3D clinical reasoning. Conclusion: Med-SORA is the first to address symptom-to-organ reasoning, integrates 3D imaging information effectively, and shows stronger reasoning ability in analyzing associations between clinical symptoms and images. Abstract: Understanding symptom-image associations is crucial for clinical reasoning. However, existing medical multimodal models often rely on simple one-to-one hard labeling, oversimplifying clinical reality where symptoms relate to multiple organs. In addition, they mainly use single-slice 2D features without incorporating 3D information, limiting their ability to capture full anatomical context. In this study, we propose Med-SORA, a framework for symptom-to-organ reasoning in abdominal CT images. Med-SORA introduces RAG-based dataset construction, soft labeling with learnable organ anchors to capture one-to-many symptom-organ relationships, and a 2D-3D cross-attention architecture to fuse local and global image features. To our knowledge, this is the first work to address symptom-to-organ reasoning in medical multimodal learning. Experimental results show that Med-SORA outperforms existing medical multimodal models and enables accurate 3D clinical reasoning.
[244] CAST-LUT: Tokenizer-Guided HSV Look-Up Tables for Purple Flare Removal
Pu Wang,Shuning Sun,Jialang Lu,Chen Wu,Zhihua Zhang,Youshan Zhang,Chenggang Shan,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng
Main category: cs.CV
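TL;DR: Proposes a novel network based on decoupled HSV look-up tables (LUTs) for correcting purple flare chromatic aberration. A two-stage architecture adjusts the H, S, and V components independently and significantly outperforms existing methods.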
Details
Motivation: Traditional methods rely on hand-crafted features and fixed priors and lack flexibility, while deep learning is held back by the scarcity of paired training data. Method: Uses a two-stage architecture: first, a Chroma-Aware Spectral Tokenizer (CAST) converts RGB to HSV and encodes the H and V channels into semantic tokens; then an HSV-LUT module dynamically generates independent correction curves (1D-LUTs) for the H, S, and V channels. Result: Experiments on a self-built large-scale purple flare dataset show the method significantly outperforms existing methods in both visual quality and quantitative metrics, reaching state-of-the-art performance. Conclusion: The method resolves the color coupling problem in purple flare correction, offers greater flexibility and accuracy, and advances image color correction. Abstract: Purple flare, a diffuse chromatic aberration artifact commonly found around highlight areas, severely degrades the tone transition and color of the image. Existing traditional methods are based on hand-crafted features, which lack flexibility and rely entirely on fixed priors, while the scarcity of paired training data critically hampers deep learning. To address this issue, we propose a novel network built upon decoupled HSV Look-Up Tables (LUTs). The method aims to simplify color correction by adjusting the Hue (H), Saturation (S), and Value (V) components independently. This approach resolves the inherent color coupling problems in traditional methods. Our model adopts a two-stage architecture: First, a Chroma-Aware Spectral Tokenizer (CAST) converts the input image from RGB space to HSV space and independently encodes the Hue (H) and Value (V) channels into a set of semantic tokens describing the purple flare status; second, the HSV-LUT module takes these tokens as input and dynamically generates independent correction curves (1D-LUTs) for the three channels H, S, and V. To effectively train and validate our model, we built the first large-scale purple flare dataset with diverse scenes. We also proposed new metrics and a loss function specifically designed for this task. Extensive experiments demonstrate that our model not only significantly outperforms existing methods in visual effects but also achieves state-of-the-art performance on all quantitative metrics.
[245] Robust and High-Fidelity 3D Gaussian Splatting: Fusing Pose Priors and Geometry Constraints for Texture-Deficient Outdoor Scenes
Meijun Guo,Yongliang Shi,Caiyun Liu,Yixiao Feng,Ming Ma,Tinghai Yan,Weining Lu,Bin Liang
Main category: cs.CV
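TL;DR: Proposes an improved 3D Gaussian Splatting (3DGS) method that introduces LiDAR-IMU odometry pose priors together with normal vector and effective rank regularization constraints, improving pose estimation efficiency and scene representation quality in large-scale outdoor scenes, especially under weak or repetitive textures.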
Details
Motivation: In large-scale outdoor scenes, geometric texture inconsistency causes unstable pose estimation and distorted scene representation, so the robustness and accuracy of 3DGS in complex environments need to be improved. Method: For pose estimation, LiDAR-IMU odometry provides camera pose priors, which are incorporated into COLMAP's triangulation and bundle adjustment; for scene representation, normal vector constraints and effective rank regularization are introduced and jointly optimized with the photometric loss to improve the consistency of Gaussian primitives and map quality. Result: Experiments show that pose optimization takes only one-third of the time while maintaining accuracy and robustness; scene representation significantly outperforms conventional 3DGS, with better visualization and overall performance especially on self-collected datasets with weak or repetitive textures. Conclusion: By combining sensor priors with geometric regularization constraints, the method improves the pose estimation efficiency and reconstruction quality of 3DGS in large-scale, low-texture scenes and has good application prospects. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a key rendering pipeline for digital asset creation due to its balance between efficiency and visual quality. To address the issues of unstable pose estimation and scene representation distortion caused by geometric texture inconsistency in large outdoor scenes with weak or repetitive textures, we approach the problem from two aspects: pose estimation and scene representation. For pose estimation, we leverage LiDAR-IMU Odometry to provide prior poses for cameras in large-scale environments. These prior pose constraints are incorporated into COLMAP's triangulation process, with pose optimization performed via bundle adjustment. Ensuring consistency between pixel data association and prior poses helps maintain both robustness and accuracy. For scene representation, we introduce normal vector constraints and effective rank regularization to enforce consistency in the direction and shape of Gaussian primitives. These constraints are jointly optimized with the existing photometric loss to enhance the map quality. We evaluate our approach using both public and self-collected datasets. In terms of pose optimization, our method requires only one-third of the time while maintaining accuracy and robustness across both datasets. In terms of scene representation, the results show that our method significantly outperforms conventional 3DGS pipelines. Notably, on self-collected datasets characterized by weak or repetitive textures, our approach demonstrates enhanced visualization capabilities and achieves superior overall performance.
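The abstract does not spell out how the effective rank of a Gaussian primitive is computed. A minimal sketch, assuming the common entropy-based definition applied to the squared per-axis scales of each Gaussian (treated here as singular values of its covariance), might look as follows. The function name and this choice of definition are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def effective_rank(scales: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy-based effective rank of N Gaussians from (N, 3) per-axis scales.

    Assumption: the squared scales play the role of singular values.
    Returns values in [1, 3]: ~1 for needle-like, ~2 for disk-like,
    ~3 for isotropic primitives.
    """
    s2 = scales.astype(np.float64) ** 2
    q = s2 / (s2.sum(axis=1, keepdims=True) + eps)   # normalized spectrum
    entropy = -(q * np.log(q + eps)).sum(axis=1)     # Shannon entropy
    return np.exp(entropy)


scales = np.array([
    [1.0, 1.0, 1.0],     # isotropic blob
    [1.0, 1e-4, 1e-4],   # needle
    [1.0, 1.0, 1e-4],    # flat disk
])
er = effective_rank(scales)
print(np.round(er, 3))
```

A regularizer built on this quantity could, for example, penalize primitives whose effective rank falls close to 1, discouraging needle-like Gaussians; the exact loss used in the paper is not given in the abstract.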
Code and data will be publicly available at https://github.com/justinyeah/normal_shape.git.
[246] ConeGS: Error-Guided Densification Using Pixel Cones for Improved Reconstruction with Fewer Primitives
Bartłomiej Baranowski,Stefano Esposito,Patricia Gschoßmann,Anpei Chen,Andreas Geiger
Main category: cs.CV
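TL;DR: Proposes ConeGS, an image-space-informed densification framework that is independent of the existing geometry state. It inserts new Gaussians along the viewing cones of high-error pixels and initializes their size from the cone diameter, significantly improving the reconstruction quality and rendering performance of 3D Gaussian Splatting under a limited primitive budget.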
Details
Motivation: 3D Gaussian Splatting (3DGS) performs well in novel view synthesis, but cloning-based densification leads to a poor spatial distribution of primitives and limits scene coverage efficiency, so a better primitive placement strategy is needed. Method: First uses a fast iNGP reconstruction as a geometric proxy to predict per-pixel depth; during 3DGS optimization it identifies high-error pixels and inserts new Gaussians along the corresponding viewing cones at the predicted depth, initializing their size from the cone diameter; a pre-activation opacity penalty removes redundant primitives, and a fixed or adaptive primitive budgeting strategy controls the total count. Result: Experiments show that ConeGS consistently improves reconstruction quality and rendering performance across primitive budgets, with especially strong gains when primitives are limited. Conclusion: With a geometry-independent densification strategy, ConeGS improves the spatial distribution of 3D Gaussians and raises rendering efficiency and quality, making it particularly suitable for resource-constrained scenarios. Abstract: 3D Gaussian Splatting (3DGS) achieves state-of-the-art image quality and real-time performance in novel view synthesis but often suffers from a suboptimal spatial distribution of primitives. This issue stems from cloning-based densification, which propagates Gaussians along existing geometry, limiting exploration and requiring many primitives to adequately cover the scene. We present ConeGS, an image-space-informed densification framework that is independent of existing scene geometry state. ConeGS first creates a fast Instant Neural Graphics Primitives (iNGP) reconstruction as a geometric proxy to estimate per-pixel depth. During the subsequent 3DGS optimization, it identifies high-error pixels and inserts new Gaussians along the corresponding viewing cones at the predicted depth values, initializing their size according to the cone diameter. A pre-activation opacity penalty rapidly removes redundant Gaussians, while a primitive budgeting strategy controls the total number of primitives, either by a fixed budget or by adapting to scene complexity, ensuring high reconstruction quality. Experiments show that ConeGS consistently enhances reconstruction quality and rendering performance across Gaussian budgets, with especially strong gains under tight primitive constraints where efficient placement is crucial.
[247] TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning
Rui Wang,Ying Zhou,Hao Wang,Wenwei Zhang,Qiang Li,Zhiwei Wang
Main category: cs.CV
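TL;DR: Proposes TiS-TSL, a time-switchable teacher-student learning framework for video stereo matching under minimal supervision. A two-stage image-to-video and video-to-video learning strategy improves the stability and accuracy of disparity prediction in surgical videos.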
Details
Motivation: Anatomical constraints make dense disparity annotation hard to obtain in minimally invasive surgery, and existing teacher-student learning methods lack spatio-temporal consistency estimation, leading to unstable disparity predictions and flickering artifacts across video frames. Method: Designs a unified model supporting three modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP). Learning has two stages: the I2V stage transfers sparse image-level knowledge to initialize temporal modeling, and the V2V stage computes bidirectional spatio-temporal consistency from forward and backward predictions to filter noisy pseudo labels and strengthen temporal coherence. Result: Experiments on two public datasets show that TiS-TSL improves TEPE and EPE by at least 2.11% and 4.54%, respectively, over other image-based state-of-the-art methods. Conclusion: TiS-TSL addresses stereo matching under minimal supervision in minimally invasive surgery videos and, by modeling spatio-temporal consistency, significantly improves the stability and accuracy of disparity prediction. Abstract: Stereo matching in minimally invasive surgery (MIS) is essential for next-generation navigation and augmented reality. Yet, dense disparity supervision is nearly impossible due to anatomical constraints, typically limiting annotations to only a few image-level labels acquired before the endoscope enters deep body cavities. Teacher-Student Learning (TSL) offers a promising solution by leveraging a teacher trained on sparse labels to generate pseudo labels and associated confidence maps from abundant unlabeled surgical videos. However, existing TSL methods are confined to image-level supervision, providing only spatial confidence and lacking temporal consistency estimation. This absence of spatio-temporal reliability results in unstable disparity predictions and severe flickering artifacts across video frames. To overcome these challenges, we propose TiS-TSL, a novel time-switchable teacher-student learning framework for video stereo matching under minimal supervision. At its core is a unified model that operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP), enabling flexible temporal modeling within a single architecture. Enabled by this unified model, TiS-TSL adopts a two-stage learning strategy. The Image-to-Video (I2V) stage transfers sparse image-level knowledge to initialize temporal modeling. The subsequent Video-to-Video (V2V) stage refines temporal disparity predictions by comparing forward and backward predictions to calculate bidirectional spatio-temporal consistency.
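The forward-backward comparison in the V2V stage can be illustrated with a small sketch. It assumes, purely for illustration, that consistency is measured as the per-pixel absolute difference between the forward and backward disparity predictions for the same frame and thresholded at a hypothetical tolerance `tau`; the abstract does not state the actual measure, so none of this should be read as the authors' formulation.

```python
import numpy as np


def bidirectional_consistency_mask(disp_fwd: np.ndarray,
                                   disp_bwd: np.ndarray,
                                   tau: float = 1.0) -> np.ndarray:
    """True where forward and backward disparities agree within `tau` pixels."""
    return np.abs(disp_fwd - disp_bwd) <= tau


def filter_pseudo_labels(pseudo: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep pseudo labels in reliable regions; mark the rest as NaN (ignored)."""
    return np.where(mask, pseudo, np.nan)


# Toy 2x2 disparity maps predicted for the same frame in both directions.
disp_fwd = np.array([[10.0, 20.0], [30.0, 40.0]])
disp_bwd = np.array([[10.5, 25.0], [30.0, 38.9]])

mask = bidirectional_consistency_mask(disp_fwd, disp_bwd, tau=1.0)
filtered = filter_pseudo_labels(disp_fwd, mask)
print(mask)
print(filtered)
```

Pixels where the two directions disagree are dropped from supervision rather than corrected, which is one simple way to keep noisy pseudo labels from propagating into the student.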
This consistency identifies unreliable regions across frames, filters noisy video-level pseudo labels, and enforces temporal coherence. Experimental results on two public datasets demonstrate that TiS-TSL exceeds other image-based state-of-the-art methods by improving TEPE and EPE by at least 2.11% and 4.54%, respectively.
[248] Integrating Reweighted Least Squares with Plug-and-Play Diffusion Priors for Noisy Image Restoration
Ji Li,Chao Wang
Main category: cs.CV
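TL;DR: Proposes a plug-and-play image restoration framework based on generative diffusion priors for removing a range of non-Gaussian noise types, including impulse noise, using a generalized Gaussian scale mixture loss and IRLS optimization to achieve superior performance.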
Details
Motivation: 现有即插即用方法主要针对高斯噪声,对非高斯噪声(如脉冲噪声)的处理能力有限,缺乏有效利用生成先验的通用框架。 Method: 在最大后验估计框架下,采用广义高斯尺度混合模型构建适应不同噪声分布的数据保真项,形成ℓ_q范数保真项,并结合迭代重加权最小二乘法(IRLS)进行优化;利用基于扩散模型的去噪器作为先验正则项的近似求解工具。 Result: 在多个基准数据集上的实验表明,该方法能有效去除非高斯脉冲噪声,在图像恢复质量上优于现有方法。 Conclusion: 所提出的基于扩散先验的即插即用框架可有效扩展至非高斯噪声场景,为通用噪声下的图像恢复提供了一种高性能解决方案。 Abstract: Existing plug-and-play image restoration methods typically employ off-the-shelf Gaussian denoisers as proximal operators within classical optimization frameworks based on variable splitting. Recently, denoisers induced by generative priors have been successfully integrated into regularized optimization methods for image restoration under Gaussian noise. However, their application to non-Gaussian noise--such as impulse noise--remains largely unexplored. In this paper, we propose a plug-and-play image restoration framework based on generative diffusion priors for robust removal of general noise types, including impulse noise. Within the maximum a posteriori (MAP) estimation framework, the data fidelity term is adapted to the specific noise model. Departing from the conventional least-squares loss used for Gaussian noise, we introduce a generalized Gaussian scale mixture-based loss, which approximates a wide range of noise distributions and leads to an $\ell_q$-norm ($0### [249] [MUGSQA: Novel Multi-Uncertainty-Based Gaussian Splatting Quality Assessment Method, Dataset, and Benchmarks](https://arxiv.org/abs/2511.06830) *Tianang Chen,Jian Jin,Shilv Cai,Zhuangzi Li,Weisi Lin* Main category: cs.CV TL;DR: 提出了一种用于评估基于高斯点阵(GS)方法重建的3D对象感知质量的多距离主观评价方法,并构建了考虑输入数据多种不确定性的MUGSQA数据集及两个基准测试。### [250] [ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search](https://arxiv.org/abs/2511.06833) *Zhenjie Liu,Jianzhang Lu,Renjie Lu,Cong Liang,Shangfei Wang* Main category: cs.CV TL;DR: 本文提出ConsistTalk,一种可控制强度且时间一致的说话头生成框架,通过光学流引导的时序模块、音频到强度模型和扩散噪声初始化策略,显著提升了音频驱动人像动画的稳定性与音画同步性。Details
Motivation: GS-based 3D reconstruction is advancing rapidly, but there is no effective way to assess the perceptual quality of results from different methods, so a quality assessment method closer to how people actually view objects in real applications is needed. Method: Proposes a unified multi-distance subjective quality assessment method that mimics viewing behavior in real applications; builds the MUGSQA dataset covering multiple input uncertainties; and establishes two benchmarks, one for the robustness of GS reconstruction methods under these uncertainties and one for the performance of existing quality assessment metrics. Result: The MUGSQA dataset and two benchmarks were built; they support evaluation of the perceptual quality and robustness of different GS methods and provide a validation platform for existing quality assessment metrics. Conclusion: The method and dataset fill a gap in quality assessment for GS-based 3D reconstruction and support standardization and further progress in the field. Abstract: Gaussian Splatting (GS) has recently emerged as a promising technique for 3D object reconstruction, delivering high-quality rendering results with significantly improved reconstruction speed. As variants continue to appear, assessing the perceptual quality of 3D objects reconstructed with different GS-based methods remains an open challenge. To address this issue, we first propose a unified multi-distance subjective quality assessment method that closely mimics human viewing behavior for objects reconstructed with GS-based methods in actual applications, thereby better collecting perceptual experiences. Based on it, we also construct a novel GS quality assessment dataset named MUGSQA, which accounts for multiple uncertainties of the input data. These uncertainties include the quantity and resolution of input views, the view distance, and the accuracy of the initial point cloud. Moreover, we construct two benchmarks: one to evaluate the robustness of various GS-based reconstruction methods under multiple uncertainties, and the other to evaluate the performance of existing quality assessment metrics. Our dataset and benchmark code will be released soon.
### [251] [NeuroBridge: Bio-Inspired Self-Supervised EEG-to-Image Decoding via Cognitive Priors and Bidirectional Semantic Alignment](https://arxiv.org/abs/2511.06836)
*Wenjiang Zhang,Sifeng Wang,Yuwei Su,Xinyu Li,Chen Zhang,Suyu Zhong*
Main category: cs.CV
TL;DR: Proposes NeuroBridge, a self-supervised architecture that aligns EEG and visual signals across modalities through cognitive prior augmentation and a shared semantic projector, significantly outperforming existing methods on visual neural decoding.
Details
Motivation: Existing audio-driven talking head generation methods suffer from flicker, identity drift, and poor audio-visual synchronization, mainly because appearance and motion representations are entangled and inference strategies are unstable. Method: Proposes three core components: (1) an optical flow-guided temporal module (OFT) that decouples motion from static appearance; (2) an Audio-to-Intensity (A2I) model trained through multimodal teacher-student distillation for joint audio-visual modeling; and (3) a diffusion noise initialization strategy (IC-Init) that imposes background coherence and motion continuity constraints at inference time. Result: Experiments show that ConsistTalk significantly outperforms existing methods in reducing flicker, preserving identity, and maintaining temporal stability, producing high-quality, high-fidelity talking head videos. Conclusion: By decoupling representations, strengthening the audio-visual link, and improving the inference strategy, ConsistTalk addresses key problems in current audio-driven animation and advances the use of video diffusion models in this area. Abstract: Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a diffusion noise initialization strategy (IC-Init). By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and more refined motion dynamics than the current autoregressive strategy.
Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.
### [252] [PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory](https://arxiv.org/abs/2511.06840)
*Qunchao Jin,Yilin Wu,Changhao Chen*
Main category: cs.CV
TL;DR: Proposes PanoNav, an RGB-only, mapless zero-shot object navigation framework that improves the spatial reasoning and decision-making of MLLMs through panoramic scene parsing and a dynamic memory queue, significantly outperforming baselines on a public benchmark.
Details
Motivation: Existing visual neural decoding methods are limited by the scarcity of high-quality stimulus-brain response pairs and by the semantic mismatch between neural representations and visual content. Method: Proposes the NeuroBridge framework, comprising Cognitive Prior Augmentation (CPA) and a Shared Semantic Projector (SSP): CPA simulates perceptual variability through modality-specific transformations to increase semantic diversity; SSP uses a co-adaptive strategy for bidirectional feature alignment, mapping features from both modalities into a shared semantic space. Result: Surpasses prior state-of-the-art methods in both intra-subject and inter-subject settings: on a 200-way zero-shot retrieval task in the intra-subject setting, top-1 accuracy improves by 12.3% to 63.2% and top-5 accuracy by 10.2% to 89.9%. Conclusion: NeuroBridge effectively addresses cross-modal semantic alignment and shows strong performance, robustness, and scalability, offering a new approach to visual neural decoding. Abstract: Visual neural decoding seeks to reconstruct or infer perceived visual stimuli from brain activity patterns, providing critical insights into human cognition and enabling transformative applications in brain-computer interfaces and artificial intelligence. Current approaches, however, remain constrained by the scarcity of high-quality stimulus-brain response pairs and the inherent semantic mismatch between neural representations and visual content. Inspired by the perceptual variability and co-adaptive strategy of biological systems, we propose a novel self-supervised architecture, named NeuroBridge, which integrates Cognitive Prior Augmentation (CPA) with a Shared Semantic Projector (SSP) to promote effective cross-modality alignment. Specifically, CPA simulates perceptual variability by applying asymmetric, modality-specific transformations to both EEG signals and images, enhancing semantic diversity. Unlike previous approaches, SSP establishes a bidirectional alignment process through a co-adaptive strategy, which mutually aligns features from two modalities into a shared semantic space for effective cross-modal learning. NeuroBridge surpasses previous state-of-the-art methods under both intra-subject and inter-subject settings. In the intra-subject scenario, it achieves improvements of 12.3% in top-1 accuracy and 10.2% in top-5 accuracy, reaching 63.2% and 89.9% respectively on a 200-way zero-shot retrieval task.
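As a rough illustration of how top-1 and top-5 figures on an N-way retrieval task are typically computed, the sketch below ranks candidate image embeddings by cosine similarity to each query embedding and counts a hit when the true candidate is among the top k. The function names, the use of cosine similarity, and the toy vectors are assumptions for illustration and are not taken from the NeuroBridge paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_accuracy(queries, candidates, targets, k):
    """Fraction of queries whose true candidate index is among the k most similar candidates."""
    hits = 0
    for query, target in zip(queries, targets):
        ranked = sorted(
            range(len(candidates)),
            key=lambda i: cosine(query, candidates[i]),
            reverse=True,
        )
        hits += target in ranked[:k]
    return hits / len(queries)

# Hypothetical toy data: three candidate "image" embeddings and three "EEG" query embeddings.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.6, 0.8], [0.2, 1.0]]
targets = [0, 2, 2]  # index of the correct candidate for each query

top1 = top_k_accuracy(queries, candidates, targets, k=1)
top2 = top_k_accuracy(queries, candidates, targets, k=2)
```

In a 200-way setting the candidate list would hold 200 class embeddings, and k would be 1 and 5.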
Extensive experiments demonstrate the effectiveness, robustness, and scalability of the proposed framework for neural visual decoding.
### [253] [Aerial Image Stitching Using IMU Data from a UAV](https://arxiv.org/abs/2511.06841)
*Selim Ahmet Iz,Mustafa Unel*
Main category: cs.CV
TL;DR: Proposes a new UAV image stitching method that combines IMU data with computer vision, improving stitching accuracy and robustness.
Details
Motivation: Existing zero-shot navigation methods depend on depth sensors or prebuilt maps, which limits the spatial understanding of multimodal large models; mapless methods lack historical context and easily fall into local deadlocks. Method: Proposes PanoNav, which includes a Panoramic Scene Parsing module that draws out the spatial parsing ability of MLLMs from panoramic RGB input, and a memory-guided decision-making mechanism based on a Dynamic Bounded Memory Queue that incorporates exploration history to avoid deadlocks. Result: Experiments on a public navigation benchmark show that PanoNav significantly outperforms representative baselines in success rate (SR) and success weighted by path length (SPL). Conclusion: PanoNav improves zero-shot object navigation under mapless, RGB-only conditions and confirms that panoramic visual input and historical memory can strengthen the spatial reasoning and decision-making of MLLMs. Abstract: Zero-shot object navigation (ZSON) in unseen environments remains a challenging problem for household robots, requiring strong perceptual understanding and decision-making capabilities. While recent methods leverage metric maps and Large Language Models (LLMs), they often depend on depth sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address this, but they typically make short-sighted decisions, leading to local deadlocks due to a lack of historical context. We propose PanoNav, a fully RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing module to unlock the spatial parsing potential of MLLMs from panoramic RGB inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks. Experiments on the public navigation benchmark show that PanoNav significantly outperforms representative baselines in both SR and SPL metrics.
### [254] [Gaussian-Augmented Physics Simulation and System Identification with Complex Colliders](https://arxiv.org/abs/2511.06846)
*Federico Vasile,Ri-Zhao Qiu,Lorenzo Natale,Xiaolong Wang*
Main category: cs.CV
TL;DR: Proposes AS-DiffMPM, a differentiable Material Point Method framework that supports arbitrarily shaped colliders, for estimating physical properties from video.
Details
Motivation: Traditional feature-based image stitching in UAV applications is prone to feature detection and matching errors and performs poorly under large displacements, rotations, and pose changes. Method: Uses IMU data to estimate the displacement and rotation between consecutive UAV images, corrects perspective distortion, computes a homography matrix, and then applies a standard image stitching algorithm to align and blend the images. Result: Experiments show the method is more accurate and reliable than some existing feature-based methods, particularly in challenging scenes. Conclusion: The method makes effective use of IMU information, reduces stitching errors, integrates easily into existing UAV workflows, and is practical and broadly applicable. Abstract: Unmanned Aerial Vehicles (UAVs) are widely used for aerial photography and remote sensing applications. One of the main challenges is to stitch together multiple images into a single high-resolution image that covers a large area. Feature-based image stitching algorithms are commonly used but can suffer from errors and ambiguities in feature detection and matching. To address this, several approaches have been proposed, including using bundle adjustment techniques or direct image alignment. In this paper, we present a novel method that uses a combination of IMU data and computer vision techniques for stitching images captured by a UAV. Our method involves several steps such as estimating the displacement and rotation of the UAV between consecutive images, correcting for perspective distortion, and computing a homography matrix. We then use a standard image stitching algorithm to align and blend the images together. Our proposed method leverages the additional information provided by the IMU data, corrects for various sources of distortion, and can be easily integrated into existing UAV workflows. Our experiments demonstrate the effectiveness and robustness of our method, outperforming some of the existing feature-based image stitching algorithms in terms of accuracy and reliability, particularly in challenging scenarios such as large displacements, rotations, and variations in camera pose.
### [255] [Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers](https://arxiv.org/abs/2511.06848)
*Huiyuan Tian,Bonan Xu,Shijian Li*
Main category: cs.CV
TL;DR: Using a "distillation dynamics" analysis framework, this paper is the first to reveal the root cause of the poor performance of feature distillation in Vision Transformers (ViTs): a representational paradigm mismatch between teacher and student, in which the teacher uses distributed, high-dimensional encoding in its later layers that the student cannot replicate because of limited channel capacity, so late-layer feature alignment ends up harming performance.
Details
Motivation: Existing methods are limited to planar colliders and struggle with scenes in which objects collide with non-planar surfaces. Method: Extends the differentiable Material Point Method (MPM) with a differentiable collision handling mechanism that supports interaction with complex rigid bodies while preserving end-to-end optimization. Result: AS-DiffMPM jointly estimates geometry, appearance, and physical properties in complex collision environments and can be combined with a range of novel view synthesis methods. Conclusion: AS-DiffMPM improves system identification under complex object-environment interactions and broadens the use of differentiable simulation in vision-driven system identification. Abstract: System identification involving the geometry, appearance, and physical properties from video observations is a challenging task with applications in robotics and graphics. Recent approaches have relied on fully differentiable Material Point Method (MPM) and rendering for simultaneous optimization of these properties. However, they are limited to simplified object-environment interactions with planar colliders and fail in more challenging scenarios where objects collide with non-planar surfaces. We propose AS-DiffMPM, a differentiable MPM framework that enables physical property estimation with arbitrarily shaped colliders. Our approach extends existing methods by incorporating a differentiable collision handling mechanism, allowing the target object to interact with complex rigid bodies while maintaining end-to-end optimization. We show AS-DiffMPM can be easily interfaced with various novel view synthesis methods as a framework for system identification from visual observations.
### [256] [Ambiguity-aware Truncated Flow Matching for Ambiguous Medical Image Segmentation](https://arxiv.org/abs/2511.06857)
*Fanding Li,Xiangyu Li,Xianghe Su,Xingyu Qiu,Suyu Dong,Wei Wang,Kuanquan Wang,Gongning Luo,Shuo Li*
Main category: cs.CV
TL;DR: Proposes Ambiguity-aware Truncated Flow Matching (ATFM), a new method for the difficulty of achieving both accuracy and diversity in ambiguous medical image segmentation; with three new components (Data-Hierarchical Inference, Gaussian Truncation Representation, and Segmentation Flow Matching), it outperforms state-of-the-art methods on multiple metrics.
Details
Motivation: Feature distillation works for CNNs but fails on Vision Transformers, which calls for a theoretical explanation and directions for improvement. Method: Proposes the "distillation dynamics" analysis framework, combining frequency spectrum analysis, information entropy, and activation magnitude tracking to systematically analyze information flow and representational differences during ViT distillation. Result: Finds that ViTs follow a U-shaped information processing pattern (compression followed by expansion) and confirms that late-layer feature alignment causes negative transfer because of the teacher-student representational mismatch. Conclusion: Successful ViT knowledge distillation must go beyond simple feature mimicry toward methods that respect the limits of the model's representational capacity, providing theoretical guidance for ViT compression. Abstract: While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed "distillation dynamics", combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided in the supplementary material.
### [257] [VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling](https://arxiv.org/abs/2511.06863)
*Sicheng Yang,Xing Hu,Qiang Wu,Dawei Yang*
Main category: cs.CV
TL;DR: Proposes VAEVQ, which addresses non-smooth latent spaces, weak representation alignment, and codebook underutilization in vector quantization through variational latent quantization, a representation coherence strategy, and distribution consistency regularization.
Details
Motivation: In ambiguous medical image segmentation, existing truncated diffusion probabilistic models (TDPMs) entangle the accuracy and diversity of predictions, leading to insufficient fidelity and plausibility and making it hard to improve both at once. Method: Proposes ATFM with three core components: (1) Data-Hierarchical Inference, which redefines the AMIS inference paradigm and enhances accuracy at the data-distribution level and diversity at the data-sample level; (2) Gaussian Truncation Representation (GTR), which explicitly models the distribution at the truncation time as Gaussian to improve prediction fidelity and the reliability of the truncation distribution; and (3) Segmentation Flow Matching (SFM), which extends semantic-aware flow transformation to improve the plausibility of diverse predictions. Result: Experiments on the LIDC and ISIC3 datasets show that ATFM outperforms current state-of-the-art methods, improving GED and HM-IoU by up to 12% and 7.3% respectively, with more efficient inference. Conclusion: ATFM disentangles accuracy and diversity in ambiguous medical image segmentation; with its new inference paradigm and model components, it produces more plausible and diverse segmentations while maintaining high fidelity. Abstract: A simultaneous enhancement of accuracy and diversity of predictions remains a challenge in ambiguous medical image segmentation (AMIS) due to the inherent trade-offs. While truncated diffusion probabilistic models (TDPMs) hold strong potential with a paradigm optimization, existing TDPMs suffer from entangled accuracy and diversity of predictions with insufficient fidelity and plausibility. To address the aforementioned challenges, we propose Ambiguity-aware Truncated Flow Matching (ATFM), which introduces a novel inference paradigm and dedicated model components. Firstly, we propose Data-Hierarchical Inference, a redefinition of the AMIS-specific inference paradigm, which enhances accuracy and diversity at the data-distribution and data-sample levels, respectively, for an effective disentanglement. Secondly, Gaussian Truncation Representation (GTR) is introduced to enhance both fidelity of predictions and reliability of the truncation distribution, by explicitly modeling it as a Gaussian distribution at $T_{\text{trunc}}$ instead of using sampling-based approximations. Thirdly, Segmentation Flow Matching (SFM) is proposed to enhance the plausibility of diverse predictions by extending semantic-aware flow transformation in Flow Matching (FM). Comprehensive evaluations on LIDC and ISIC3 datasets demonstrate that ATFM outperforms SOTA methods and simultaneously achieves a more efficient inference.
ATFM improves GED and HM-IoU by up to $12\%$ and $7.3\%$ compared to advanced methods.
### [258] [Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions](https://arxiv.org/abs/2511.06876)
*Eyal Gutflaish,Eliran Kachlon,Hezi Zisman,Tal Hacham,Nimrod Sarid,Alexander Visheratin,Saar Huberman,Gal Davidi,Guy Bukchin,Kfir Goldberg,Ron Mokady*
Main category: cs.CV
TL;DR: Presents an open-source text-to-image model trained on long structured captions; fine-grained attribute annotation and the DimFusion fusion mechanism improve controllability and expressiveness, and the TaBR evaluation protocol measures reconstruction and control performance under long captions.
Details
Motivation: Existing VQ-based methods suffer from non-smooth latent spaces, weak alignment between pre- and post-quantization representations, and inconsistency between the continuous and discrete domains, which cause unstable codeword learning and low codebook utilization. Method: Introduces three key components: (1) Variational Latent Quantization (VLQ), which replaces the AE with a VAE to obtain a smoother latent space; (2) a Representation Coherence Strategy (RCS), which adaptively modulates feature alignment before and after quantization; and (3) Distribution Consistency Regularization (DCR), which aligns the whole codebook distribution with the continuous latent distribution. Result: Experiments on two benchmark datasets show that VAEVQ outperforms current state-of-the-art methods. Conclusion: VAEVQ improves the representation quality and codebook utilization of vector quantization and performs better on reconstruction and generation tasks. Abstract: Vector quantization (VQ) transforms continuous image features into discrete representations, providing compressed, tokenized inputs for generative models. However, VQ-based frameworks suffer from several issues, such as non-smooth latent spaces, weak alignment between representations before and after quantization, and poor coherence between the continuous and discrete domains. These issues lead to unstable codeword learning and underutilized codebooks, ultimately degrading the performance of both reconstruction and downstream generation tasks. To this end, we propose VAEVQ, which comprises three key components: (1) Variational Latent Quantization (VLQ), replacing the AE with a VAE for quantization to leverage its structured and smooth latent space, thereby facilitating more effective codeword activation; (2) Representation Coherence Strategy (RCS), adaptively modulating the alignment strength between pre- and post-quantization features to enhance consistency and prevent overfitting to noise; and (3) Distribution Consistency Regularization (DCR), aligning the entire codebook distribution with the continuous latent distribution to improve utilization. Extensive experiments on two benchmark datasets demonstrate that VAEVQ outperforms state-of-the-art methods.
### [259] [A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models](https://arxiv.org/abs/2511.06888)
*Jan-Hendrik Koch,Jonas Krumme,Konrad Gadzicki*
Main category: cs.CV
TL;DR: Presents a two-stage system that uses a large language model to generate a structured layout and a layout-conditioned diffusion model to generate images that match the specified object counts and spatial arrangement, addressing the compositional control limits of text-to-image diffusion models.
Details
Motivation: Most existing text-to-image models generate images from short prompts, creating a mismatch between sparse text input and rich image output that reduces control precision and falls short of professional needs; models therefore need to understand and respond to detailed instructions better. Method: Trains the first open-source text-to-image model on long structured captions; proposes DimFusion, which fuses intermediate tokens from a lightweight LLM efficiently without increasing sequence length; and designs the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol, which assesses controllability and expressiveness through an image-text-image reconstruction loop. Result: Trains the large-scale model FIBO, which achieves state-of-the-art prompt alignment among open-source models; experiments show stronger detail control and higher reconstruction consistency on long text inputs, with TaBR results in particular surpassing existing methods. Conclusion: Training on structured long captions together with the DimFusion mechanism significantly improves the controllability and expressiveness of text-to-image models, gives professional image generation a more precise tool, and advances related evaluation methods. Abstract: Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models.
Model weights are publicly available at https://huggingface.co/briaai/FIBO
### [260] [Adaptive Morph-Patch Transformer for Arotic Vessel Segmentation](https://arxiv.org/abs/2511.06897)
*Zhenxi Zhang,Fuchen Zheng,Adnan Iltaf,Yifei Han,Zhenyu Cheng,Yue Du,Bin Li,Tianyong Liu,Shoujun Zhou*
Main category: cs.CV
TL;DR: Proposes the adaptive Morph-Patch Transformer (MPT) for aortic vessel segmentation, which significantly improves segmentation accuracy on complex vascular structures by dynamically generating patches aligned with vessel structures and using a semantic clustering attention mechanism.
Details
Motivation: Existing text-to-image diffusion models fall short in controlling object counts and spatial layout and cannot meet the need for precise control over image composition. Method: In the first stage, a large language model (LLM) generates a structured layout from an object list, refined through task decomposition and rule-based insertion; in the second stage, ControlNet or GLIGEN, finetuned on a domain-specific dataset, performs layout-conditioned image generation, and the two are compared on the trade-off between layout fidelity and style control. Result: The method raises object recall in complex scenes from 57.2% to 99.9%; ControlNet better preserves text-based style control but suffers from object hallucination, while GLIGEN offers better layout fidelity at the cost of prompt controllability. Conclusion: A decoupled two-stage approach is viable for precise control of object counts and spatial layout and offers an effective solution for compositional image generation. Abstract: Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability.
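The abstract reports object recall for generated layouts without defining it. One plausible reading, sketched below, is the fraction of requested object instances that appear in the layout, counted per label so that duplicates matter. The function name `object_recall` and the multiset definition are assumptions for illustration, not the paper's stated metric.

```python
from collections import Counter

def object_recall(requested, layout_labels):
    """Fraction of requested object instances that also appear in the layout.

    Counts are matched per label, so asking for two forks and placing only
    one fork contributes one match out of two. (Assumed definition.)
    """
    wanted = Counter(requested)
    placed = Counter(layout_labels)
    matched = sum(min(count, placed[label]) for label, count in wanted.items())
    total = sum(wanted.values())
    return matched / total if total else 1.0

# Hypothetical table-setting request and an incomplete layout.
requested = ["plate", "fork", "fork", "cup"]
layout = ["plate", "fork", "knife"]
recall = object_recall(requested, layout)  # 2 of 4 requested instances are placed
```

Extra objects in the layout (the knife here) do not lower recall under this definition; they would show up in a separate precision or hallucination measure.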
Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
### [261] [Classification of Microplastic Particles in Water using Polarized Light Scattering and Machine Learning Methods](https://arxiv.org/abs/2511.06901)
*Leonard Saur,Marc von Pawlowski,Ulrich Gengenbach,Ingo Sieber,Hossein Shirali,Lorenz Wührl,Rainer Kiko,Christian Pylatiuk*
Main category: cs.CV
TL;DR: Proposes a reflection-based in-situ microplastic classification method using polarized light scattering; a polarization-sensitive camera and a convolutional neural network identify 50-300 μm microplastics in water, with the AOLP signal better at distinguishing polyethylene types and DOLP better at identifying polypropylene.
Details
Motivation: Conventional Transformer-based models use fixed-size rectangular patches, which tend to break up complex vascular structures and lead to unsatisfactory segmentation accuracy. Method: Proposes the MPT model, which includes an adaptive patch partitioning strategy that generates morphology-aware patches and a Semantic Clustering Attention (SCA) mechanism that dynamically aggregates information from patches with similar semantic features. Result: Experiments on three open-source datasets (AVT, AortaSeg24, and TBAD) show that MPT reaches state-of-the-art performance on complex vascular structure segmentation. Conclusion: MPT preserves the semantic integrity of vascular structures and improves segmentation accuracy for vessels of different scales, offering an efficient new method for aortic vessel segmentation. Abstract: Accurate segmentation of aortic vascular structures is critical for diagnosing and treating cardiovascular diseases. Traditional Transformer-based models have shown promise in this domain by capturing long-range dependencies between vascular features. However, their reliance on fixed-size rectangular patches often compromises the integrity of complex vascular structures, leading to suboptimal segmentation accuracy. To address this challenge, we propose the adaptive Morph Patch Transformer (MPT), a novel architecture specifically designed for aortic vascular segmentation. Specifically, MPT introduces an adaptive patch partitioning strategy that dynamically generates morphology-aware patches aligned with complex vascular structures. This strategy can preserve the semantic integrity of complex vascular structures within individual patches. Moreover, a Semantic Clustering Attention (SCA) method is proposed to dynamically aggregate features from various patches with similar semantic characteristics. This method enhances the model's capability to segment vessels of varying sizes, preserving the integrity of vascular structures.
Extensive experiments on three open-source datasets (AVT, AortaSeg24, and TBAD) demonstrate that MPT achieves state-of-the-art performance, with improvements in segmenting intricate vascular structures.
### [262] [Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding](https://arxiv.org/abs/2511.06908)
*Yuzhen Li,Min Liu,Zhaoyang Li,Yuan Bian,Xueping Wang,Erbo Zhai,Yaonan Wang*
Main category: cs.CV
TL;DR: Proposes Mono3DVG-EnSD, a new monocular 3D visual grounding framework that introduces a CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) and a Dimension-Decoupled Module (D2M) to address existing methods' over-reliance on high-certainty keywords and cross-dimensional interference, achieving state-of-the-art performance on the Mono3DRefer dataset.
Details
Motivation: Conventional transmission-based methods are prone to interference in water and are ill-suited to continuous, large-scale microplastic monitoring, so an efficient and accurate in-situ detection technique that works directly in real water environments is needed. Method: Microplastic particles are illuminated with linearly polarized laser light, their reflected signals (in particular AOLP and DOLP) are captured with a polarization-sensitive camera, and a deep convolutional neural network performs image classification and analysis. Result: Achieves a peak mean classification accuracy of 80% on three common microplastics (high-density polyethylene, low-density polyethylene, and polypropylene); finds that the CNN relies mainly on the particles' microstructure and internal texture; the AOLP signal is more robust to noise and better at separating the two polyethylenes, while DOLP is better at identifying polypropylene. Conclusion: The reflection-based polarized scattering method overcomes the limits of conventional methods in water and offers a feasible route to in-situ, real-time microplastic monitoring; AOLP and DOLP have complementary strengths and can be combined to improve overall classification. Abstract: Facing the critical need for continuous, large-scale microplastic monitoring, which is hindered by the limitations of gold-standard methods in aquatic environments, this paper introduces and validates a novel, reflection-based approach for the in-situ classification and identification of microplastics directly in water bodies, which is based on polarized light scattering. In this experiment, we classify colorless microplastic particles (50-300 μm) by illuminating them with linearly polarized laser light and capturing their reflected signals using a polarization-sensitive camera. This reflection-based technique successfully circumvents the transmission-based interference issues that plague many conventional methods when applied in water. Using a deep convolutional neural network (CNN) for image-based classification, we successfully identified three common polymer types, high-density polyethylene, low-density polyethylene, and polypropylene, achieving a peak mean classification accuracy of 80% on the test dataset. A subsequent feature hierarchy analysis demonstrated that the CNN's decision-making process relies mainly on the microstructural integrity and internal texture (polarization patterns) of the particle rather than its macroshape. Critically, we found that the Angle of Linear Polarization (AOLP) signal is significantly more robust against contextual noise than the Degree of Linear Polarization (DOLP) signal.
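For readers unfamiliar with the two signals, the sketch below shows the conventional way AOLP and DOLP are derived from the linear Stokes parameters, given intensities measured behind polarizers at 0°, 45°, 90°, and 135°. The abstract does not describe the camera's processing, so the four-angle layout and the function names are assumptions based on the standard definitions, not the authors' pipeline.

```python
import math

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from intensities behind 0/45/90/135 degree polarizers."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity (average of the two pair sums)
    s1 = i0 - i90                       # horizontal vs. vertical preference
    s2 = i45 - i135                     # +45 vs. -45 degree preference
    return s0, s1, s2

def dolp(s0, s1, s2):
    """Degree of Linear Polarization in [0, 1]; 0.0 for a dark pixel."""
    return math.hypot(s1, s2) / s0 if s0 > 0 else 0.0

def aolp(s1, s2):
    """Angle of Linear Polarization in radians, in the range (-pi/2, pi/2]."""
    return 0.5 * math.atan2(s2, s1)

# Hypothetical pixel: fully polarized light at 45 degrees (Malus's law intensities).
s0, s1, s2 = stokes_from_intensities(0.5, 1.0, 0.5, 0.0)
pixel_dolp = dolp(s0, s1, s2)   # fully polarized light gives a value of 1
pixel_aolp = aolp(s1, s2)       # angle of pi/4 radians
```

Applying these two functions per pixel yields the AOLP and DOLP images that a classifier such as a CNN could take as input.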
While the AOLP-based classification achieved superior overall performance, its strength lies in distinguishing between the two polyethylene plastics, showing a lower confusion rate between high-density and low-density polyethylene. Conversely, the DOLP signal demonstrated slightly worse overall classification results but excels at accurately identifying the polypropylene class, which it isolated with greater success than AOLP.
### [263] [DTTNet: Improving Video Shadow Detection via Dark-Aware Guidance and Tokenized Temporal Modeling](https://arxiv.org/abs/2511.06925)
*Zhicheng Li,Kunyang Sun,Rui Yao,Hancheng Zhu,Fuyuan Hu,Jiaqi Zhao,Zhiwen Shao,Yong Zhou*
Main category: cs.CV
TL;DR: Proposes DTTNet, which combines a vision-language matching module with a temporal modeling module for efficient and accurate video shadow detection.
Details
Motivation: Existing monocular 3D visual grounding methods over-rely on explicit target keywords and neglect spatial descriptions, and the mix of 2D and 3D information in text features causes cross-dimensional interference that hurts localization accuracy. Method: Proposes the Mono3DVG-EnSD framework with two key components: CLIP-LCA dynamically masks high-certainty keywords to strengthen understanding of implicit spatial descriptions; D2M decouples 2D- and 3D-specific features from generalized text features and uses each to guide visual features of the matching dimension, giving dimensionally consistent cross-modal interaction. Result: In extensive experiments on the Mono3DRefer dataset, the method achieves SOTA performance on all metrics, with a notable gain of +13.54% in the challenging Far (Acc@0.5) scenario. Conclusion: By decoupling dimensional information in text and focusing on implicit spatial descriptions, Mono3DVG-EnSD mitigates key weaknesses of existing methods and significantly improves the accuracy and robustness of monocular 3D visual grounding. Abstract: Monocular 3D Visual Grounding (Mono3DVG) is an emerging task that locates 3D objects in RGB images using text descriptions with geometric cues. However, existing methods face two key limitations. Firstly, they often over-rely on high-certainty keywords that explicitly identify the target object while neglecting critical spatial descriptions. Secondly, generalized textual features contain both 2D and 3D descriptive information, thereby capturing an additional dimension of details compared to singular 2D or 3D visual features. This characteristic leads to cross-dimensional interference when refining visual features under text guidance. To overcome these challenges, we propose Mono3DVG-EnSD, a novel framework that integrates two key components: the CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) and the Dimension-Decoupled Module (D2M). The CLIP-LCA dynamically masks high-certainty keywords while retaining low-certainty implicit spatial descriptions, thereby forcing the model to develop a deeper understanding of spatial relationships in captions for object localization. Meanwhile, the D2M decouples dimension-specific (2D/3D) textual features from generalized textual features to guide corresponding visual features of the same dimension, which mitigates cross-dimensional interference by ensuring dimensionally consistent cross-modal interactions. Through comprehensive comparisons and ablation studies on the Mono3DRefer dataset, our method achieves state-of-the-art (SOTA) performance across all metrics.
Notably, it improves the challenging Far (Acc@0.5) scenario by a significant +13.54%.
### [264] [PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data](https://arxiv.org/abs/2511.06943)
*Ayushi Sharma,Johanna Trost,Daniel Lusk,Johannes Dollinger,Julian Schrader,Christian Rossi,Javier Lopatin,Etienne Laliberté,Simon Haberstroh,Jana Eichel,Daniel Mederer,Jose Miguel Cerda-Paredes,Shyam S. Phartyal,Lisa-Maricia Schwarz,Anja Linstädter,Maria Conceição Caldeira,Teja Kattenborn*
Main category: cs.CV
TL;DR: Proposes PlantTraitNet, a multi-modal, multi-task, uncertainty-aware deep learning framework that predicts global plant traits (such as plant height and leaf area) from citizen science photos and produces global trait distribution maps through spatial aggregation; results indicate it surpasses existing products in both accuracy and scalability.
Details
Motivation: Addresses two problems in video shadow detection: shadows are hard to distinguish from complex backgrounds, and shadow deformation under changing illumination is hard to model. Method: Designs a Vision-language Match Module (VMM) and a Dark-aware Semantic Block (DSB) that use linguistic priors to distinguish shadows from dark objects; introduces adaptive mask reweighting and edge masks to strengthen supervision; and proposes a Tokenized Temporal Block (TTB) that decouples spatiotemporal learning and models dynamic shadow changes with learnable temporal tokens. Result: Achieves state-of-the-art detection accuracy and real-time inference efficiency on multiple benchmark datasets. Conclusion: By combining linguistic priors with efficient temporal modeling, DTTNet performs strongly on video shadow detection. Abstract: Video shadow detection confronts two entwined difficulties: distinguishing shadows from complex backgrounds and modeling dynamic shadow deformations under varying illumination. To address shadow-background ambiguity, we leverage linguistic priors through the proposed Vision-language Match Module (VMM) and a Dark-aware Semantic Block (DSB), extracting text-guided features to explicitly differentiate shadows from dark objects. Furthermore, we introduce adaptive mask reweighting to downweight penumbra regions during training and apply edge masks at the final decoder stage for better supervision. For temporal modeling of variable shadow shapes, we propose a Tokenized Temporal Block (TTB) that decouples spatiotemporal learning. TTB summarizes cross-frame shadow semantics into learnable temporal tokens, enabling efficient sequence encoding with minimal computation overhead. Comprehensive experiments on multiple benchmark datasets demonstrate state-of-the-art accuracy and real-time inference efficiency. Code is available at https://github.com/city-cheng/DTTNet.
### [265] [From Attribution to Action: Jointly ALIGNing Predictions and Explanations](https://arxiv.org/abs/2511.06944)
*Dongsheng Hong,Chao Chen,Yanhui Chen,Shanshan Lin,Zhihao Chen,Xiangwen Liao*
Main category: cs.CV
TL;DR: Proposes the ALIGN framework, which iteratively and jointly trains a classifier and a mask generator and uses high-quality masks to improve the model's interpretability, generalization, and explanation quality.
Details
Motivation: Existing global plant trait mapping is limited by the high cost and sparse spatial coverage of field measurements, so a low-cost, wide-coverage approach is needed to improve the accuracy and scalability of trait mapping. Method: Proposes the PlantTraitNet framework, a weakly supervised multi-modal, multi-task deep learning model that extracts morphological and physiological information from large numbers of geotagged citizen science plant photos, predicts four key plant traits, and generates global trait distribution maps through spatial aggregation. Result: PlantTraitNet's global trait maps are more accurate than existing leading products and show a consistent advantage when validated against independent vegetation survey data (sPlotOpen). Conclusion: Analyzing citizen science images with computer vision and geospatial AI enables more accurate and scalable global plant trait mapping and opens a new route for ecological research and Earth system modeling. Abstract: Global maps of plant traits, such as leaf nitrogen or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps remain limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer a largely untapped resource to overcome these limitations, with over 50 million geotagged plant photographs worldwide capturing valuable visual information on plant morphology and physiology. In this study, we introduce PlantTraitNet, a multi-modal, multi-task uncertainty-aware deep learning framework that predicts four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos using weak supervision. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet consistently outperforms existing trait maps across all evaluated traits, demonstrating that citizen science imagery, when integrated with computer vision and geospatial AI, enables not only scalable but also more accurate global trait mapping.
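The aggregation step ("aggregating individual trait predictions across space") can be pictured as binning geotagged predictions into grid cells and averaging within each cell. The sketch below is one hypothetical way to do this; the 1-degree default cell size, the inverse-variance weighting by predicted uncertainty, and the name `aggregate_to_grid` are all assumptions, since the abstract does not specify the aggregation rule.

```python
import math
from collections import defaultdict

def aggregate_to_grid(predictions, cell_deg=1.0):
    """Bin (lat, lon, value, variance) predictions into grid cells.

    Returns {(row, col): inverse-variance weighted mean of value}.
    Each variance must be positive. (Weighting scheme is an assumption.)
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # cell -> [sum of weight*value, sum of weight]
    for lat, lon, value, variance in predictions:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        weight = 1.0 / variance
        sums[cell][0] += weight * value
        sums[cell][1] += weight
    return {cell: wv / w for cell, (wv, w) in sums.items()}

# Hypothetical plant-height predictions (metres) with per-photo predictive variance.
photos = [
    (10.2, 20.5, 2.0, 1.0),   # same cell as the next photo, less certain
    (10.8, 20.1, 4.0, 0.5),   # more certain, so it counts twice as much
    (-0.5, 3.3, 1.0, 0.5),    # a different cell south of the equator
]
trait_map = aggregate_to_grid(photos)
```

Confident predictions pull a cell's value toward themselves, which is one simple way an uncertainty-aware model's outputs could feed into a map.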
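This approach offers a powerful new pathway for ecological research and Earth system modeling.

### [266] [FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection](https://arxiv.org/abs/2511.06947)
*Yulin Chen,Zeyuan Wang,Tianyuan Yu,Yingmei Wei,Liang Bai*
Main category: cs.CV
TL;DR: Proposes FoCLIP, a feature-space misalignment framework for fooling CLIP-based image quality metrics. Using feature alignment, score distribution balancing, and pixel-guard regularization, it substantially raises CLIPscore while preserving visual fidelity, and it introduces a tampering detection method based on color channel sensitivity.
Details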
Motivation: Existing explanation-guided learning methods rely on external annotations or heuristic segmentation, which yield low-quality supervision signals that are hard to scale and can harm model performance. Method: ALIGN iteratively and jointly trains a classifier and a mask generator (masker); the masker produces soft, task-relevant masks, and the classifier is optimized for prediction accuracy and for agreement between its saliency maps and the masks. Result: On the two domain generalization benchmarks VLCS and Terra Incognita, ALIGN outperforms six strong baselines in both in-distribution and out-of-distribution settings and yields better explanations in terms of sufficiency and comprehensiveness. Conclusion: Guided by high-quality masks, ALIGN improves interpretability, predictive performance, and cross-domain generalization. Abstract: Explanation-guided learning (EGL) has shown promise in aligning model predictions with interpretable reasoning, particularly in computer vision tasks. However, most approaches rely on external annotations or heuristic-based segmentation to supervise model explanations, which can be noisy, imprecise and difficult to scale. In this work, we provide both empirical and theoretical evidence that low-quality supervision signals can degrade model performance rather than improve it. In response, we propose ALIGN, a novel framework that jointly trains a classifier and a masker in an iterative manner. The masker learns to produce soft, task-relevant masks that highlight informative regions, while the classifier is optimized for both prediction accuracy and alignment between its saliency maps and the learned masks. By leveraging high-quality masks as guidance, ALIGN improves both interpretability and generalizability, showing its superiority across various settings. Experiments on the two domain generalization benchmarks, VLCS and Terra Incognita, show that ALIGN consistently outperforms six strong baselines in both in-distribution and out-of-distribution settings. ALIGN also yields superior explanation quality in terms of sufficiency and comprehensiveness, highlighting its effectiveness in producing accurate and interpretable models.

### [267] [PADM: A Physics-aware Diffusion Model for Attenuation Correction](https://arxiv.org/abs/2511.06948)
*Trung Kien Pham,Hoang Minh Vu,Anh Duc Chu,Dac Thai Nguyen,Trung Thanh Nguyen,Thao Nguyen Truong,Mai Hong Son,Thanh Trung Nguyen,Phi Le Nguyen*
Main category: cs.CV
TL;DR: Proposes a CT-free attenuation correction method for SPECT myocardial perfusion imaging: a Physics-aware Diffusion Model (PADM) that corrects attenuation artifacts with high quality without CT.
Details
Motivation: CLIP's multimodal alignment has made CLIP-based metrics widely used for image quality assessment, but that alignment is fragile and open to attack, so it is worth studying how to systematically mislead such metrics in feature space and how to defend against it. Method: Built on stochastic gradient descent, FoCLIP has three modules: a feature alignment module that narrows the image-text modality gap, and a score distribution balance module and pixel-guard regularization that together balance the multimodal output, maximizing CLIPscore while preserving image quality. It also exploits the feature degradation caused by grayscale conversion to build a tampering detection mechanism driven by color channel sensitivity. Result: Experiments on ten artistic masterpiece prompts and ImageNet subsets show that the optimized images substantially raise CLIPscore while keeping high visual fidelity; grayscale conversion markedly lowers CLIPscore; and the proposed tampering detector reaches 91% accuracy on standard benchmarks. Conclusion: The work establishes a feature-misalignment attack path against CLIP-based multimodal systems, proposes an effective defense, and reveals the key role of color information in multimodal alignment. Abstract: The well-aligned attribute of CLIP-based models enables effective applications such as CLIPscore, a widely adopted image quality assessment metric. However, such a CLIP-based metric is vulnerable because of its delicate multimodal alignment. In this work, we propose FoCLIP, a feature-space misalignment framework for fooling CLIP-based image quality metrics. Based on the stochastic gradient descent technique, FoCLIP integrates three key components to construct fooling examples: feature alignment as the core module to reduce image-text modality gaps, the score distribution balance module and pixel-guard regularization, which collectively optimize multimodal output equilibrium between CLIPscore performance and image quality. Such a design can be engineered to maximize the CLIPscore predictions across diverse input prompts, despite exhibiting either visual unrecognizability or semantic incongruence with the corresponding adversarial prompts from human perceptual perspectives. Experiments on ten artistic masterpiece prompts and ImageNet subsets demonstrate that optimized images can achieve significant improvement in CLIPscore while preserving high visual fidelity. In addition, we found that grayscale conversion induces significant feature degradation in fooling images, exhibiting noticeable CLIPscore reduction while preserving statistical consistency with original images. Inspired by this phenomenon, we propose a color channel sensitivity-driven tampering detection mechanism that achieves 91% accuracy on standard benchmarks.
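The grayscale check described in the FoCLIP abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `score_fn` is a hypothetical callable standing in for a real CLIPscore implementation, `max_drop` is an arbitrary threshold, and the ITU-R BT.601 luma weights are simply a common grayscale convention, since the abstract does not state which conversion or decision rule the authors use.

```python
def to_grayscale(image):
    """Convert an RGB image (rows of (r, g, b) tuples) to gray RGB triples.

    Uses ITU-R BT.601 luma weights, a common choice; the conversion used
    in the paper is not stated in the abstract.
    """
    gray = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append((y, y, y))
        gray.append(new_row)
    return gray


def looks_tampered(image, prompt, score_fn, max_drop=0.1):
    """Flag an image whose score falls by more than max_drop once color is removed.

    score_fn(image, prompt) -> float is a hypothetical stand-in for CLIPscore.
    """
    drop = score_fn(image, prompt) - score_fn(to_grayscale(image), prompt)
    return drop > max_drop
```

In practice the threshold would have to be calibrated on clean images, because ordinary photographs can also lose some score when converted to grayscale.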
In conclusion, this work establishes a practical pathway for feature misalignment in CLIP-based multimodal systems and the corresponding defense method.

### [268] [GFix: Perceptually Enhanced Gaussian Splatting Video Compression](https://arxiv.org/abs/2511.06953)
*Siyue Teng,Ge Gao,Duolikun Danier,Yuxuan Jiang,Fan Zhang,Thomas Davis,Zoe Liu,David Bull*
Main category: cs.CV
TL;DR: Proposes GFix, a perceptual enhancement framework for video compression based on 3D Gaussian splatting, which uses a lightweight single-step diffusion model and a modulated LoRA scheme to improve visual quality and compression efficiency.
Details
Motivation: Attenuation artifacts in SPECT myocardial perfusion imaging reduce diagnostic accuracy; hybrid SPECT/CT systems mitigate them but are limited by cost, accessibility, and radiation exposure. Method: The Physics-aware Attenuation Correction Diffusion Model (PADM) introduces physics priors through a teacher-student distillation mechanism and performs attenuation correction from Non-Attenuation-Corrected (NAC) input alone; the CardiAC dataset of 424 patient studies is built for training and validation. Result: PADM outperforms existing generative models on both quantitative metrics and visual assessment, markedly improving reconstruction fidelity. Conclusion: PADM offers an efficient, CT-free attenuation correction solution for SPECT MPI with potential for clinical use. Abstract: Attenuation artifacts remain a significant challenge in cardiac Myocardial Perfusion Imaging (MPI) using Single-Photon Emission Computed Tomography (SPECT), often compromising diagnostic accuracy and reducing clinical interpretability. While hybrid SPECT/CT systems mitigate these artifacts through CT-derived attenuation maps, their high cost, limited accessibility, and added radiation exposure hinder widespread clinical adoption. In this study, we propose a novel CT-free solution to attenuation correction in cardiac SPECT. Specifically, we introduce Physics-aware Attenuation Correction Diffusion Model (PADM), a diffusion-based generative method that incorporates explicit physics priors via a teacher-student distillation mechanism. This approach enables attenuation artifact correction using only Non-Attenuation-Corrected (NAC) input, while still benefiting from physics-informed supervision during training. To support this work, we also introduce CardiAC, a comprehensive dataset comprising 424 patient studies with paired NAC and Attenuation-Corrected (AC) reconstructions, alongside high-resolution CT-based attenuation maps. Extensive experiments demonstrate that PADM outperforms state-of-the-art generative models, delivering superior reconstruction fidelity across both quantitative metrics and visual assessment.

### [269] [Learning from the Right Patches: A Two-Stage Wavelet-Driven Masked Autoencoder for Histopathology Representation Learning](https://arxiv.org/abs/2511.06958)
*Raneen Younis,Louay Hamdi,Lukas Chavez,Zahra Ahmadi*
Main category: cs.CV
TL;DR: Proposes WISE-MAE, a lightweight, domain-adapted framework that brings structural and biological relevance into MAE pretraining through a wavelet-guided patch selection strategy, improving histopathology representation learning.
Details
Motivation: Existing 3DGS video codecs show noticeable visual artifacts and low compression ratios, so perceptual quality needs to improve. Method: Assuming that 3DGS rendering and quantization artifacts resemble the noisy latents seen in diffusion training, the content-adaptive GFix framework combines a single-step diffusion enhancer with a LoRA fine-tuning scheme that freezes the low-rank decompositions and modulates the hidden states. Result: GFix achieves up to 72.1% BD-rate savings in LPIPS and 21.4% in FID relative to GSVC, markedly improving perceptual quality. Conclusion: GFix improves the visual quality and compression efficiency of 3DGS video compression and offers a practical enhancement approach for 3DGS-based low-level vision tasks. Abstract: 3D Gaussian Splatting (3DGS) enhances 3D scene reconstruction through explicit representation and fast rendering, demonstrating potential benefits for various low-level vision tasks, including video compression. However, existing 3DGS-based video codecs generally exhibit more noticeable visual artifacts and relatively low compression ratios. In this paper, we specifically target the perceptual enhancement of 3DGS-based video compression, based on the assumption that artifacts from 3DGS rendering and quantization resemble noisy latents sampled during diffusion training. Building on this premise, we propose a content-adaptive framework, GFix, comprising a streamlined, single-step diffusion model that serves as an off-the-shelf neural enhancer. Moreover, to increase compression efficiency, we propose a modulated LoRA scheme that freezes the low-rank decompositions and modulates the intermediate hidden states, thereby achieving efficient adaptation of the diffusion backbone with highly compressible updates. Experimental results show that GFix delivers strong perceptual quality enhancement, outperforming GSVC with up to 72.1% BD-rate savings in LPIPS and 21.4% in FID.

### [270] [Exploring the "Great Unseen" in Medieval Manuscripts: Instance-Level Labeling of Legacy Image Collections with Zero-Shot Models](https://arxiv.org/abs/2511.07004)
*Christofer Meinecke,Estelle Guéville,David Joseph Wrisley*
Main category: cs.CV
TL;DR: Uses state-of-the-art techniques to segment and describe medieval manuscript pages as a whole, producing richer training data for computer vision techniques such as instance segmentation and for multimodal models.
Details
Motivation: Whole-slide images are extremely large and sparsely annotated, and conventional random patch sampling often picks up irrelevant or noisy regions, limiting the model's ability to capture meaningful tissue patterns; a smarter patch selection method is needed. Method: A two-step coarse-to-fine pipeline: wavelet-based screening at low magnification selects structurally rich regions, then details are extracted at high resolution for modeling, with self-supervised learning built on an MAE with a Vision Transformer backbone. Result: Across multiple cancer datasets (lung, renal, colorectal), WISE-MAE achieves strong representation quality and downstream classification performance while remaining efficient and practical under weak supervision. Conclusion: By adding structure-aware patch selection, WISE-MAE improves MAE representation learning in digital pathology, mirrors the diagnostic workflow of pathologists, and shows potential for clinical use. Abstract: Whole-slide images are central to digital pathology, yet their extreme size and scarce annotations make self-supervised learning essential. Masked Autoencoders (MAEs) with Vision Transformer backbones have recently shown strong potential for histopathology representation learning. However, conventional random patch sampling during MAE pretraining often includes irrelevant or noisy regions, limiting the model's ability to capture meaningful tissue patterns. In this paper, we present a lightweight and domain-adapted framework that brings structure and biological relevance into MAE-based learning through a wavelet-informed patch selection strategy. WISE-MAE applies a two-step coarse-to-fine process: wavelet-based screening at low magnification to locate structurally rich regions, followed by high-resolution extraction for detailed modeling. This approach mirrors the diagnostic workflow of pathologists and improves the quality of learned representations. Evaluations across multiple cancer datasets, including lung, renal, and colorectal tissues, show that WISE-MAE achieves competitive representation quality and downstream classification performance while maintaining efficiency under weak supervision.

### [271] [TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding](https://arxiv.org/abs/2511.07007)
*Duc Nguyen,Yan-Ling Lai,Qilin Zhang,Prabin Gyawali,Benedikt Schwab,Olaf Wysocki,Thomas H. Kolbe*
Main category: cs.CV
TL;DR: TrueCity is a new urban semantic segmentation benchmark, the first to provide centimeter-accurate real-world point clouds, semantic 3D city models, and simulated point clouds of the same city, for analyzing the synthetic-to-real domain gap.
Details
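Motivation: To understand medieval manuscript pages and their contents more holistically, and to provide better training data for computer vision models specific to medieval visual content. Method: State-of-the-art techniques are used to segment and describe entire manuscript pages. Result: Higher-quality training data can be produced to support instance segmentation and multimodal models. Conclusion: The approach helps improve the accuracy and depth of medieval visual content analysis and advances related fields. Abstract: We aim to theorize the medieval manuscript page and its contents more holistically, using state-of-the-art techniques to segment and describe the entire manuscript folio, for the purpose of creating richer training data for computer vision techniques, namely instance segmentation, and multimodal models for medieval-specific visual content.

### [272] [Performance Decay in Deepfake Detection: The Limitations of Training on Outdated Data](https://arxiv.org/abs/2511.07009)
*Jack Richings,Margaux Leblanc,Ian Groves,Victoria Nockles*
Main category: cs.CV
TL;DR: Proposes a simple yet effective two-stage detection method that exceeds 99.8% AUROC on contemporary deepfakes, but whose recall drops by more than 30% on deepfakes generated six months later, showing that detection performance decays markedly over time.
Details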
Motivation: 3D semantic scene understanding is held back by limited real-world annotated data; existing synthetic data fails to reflect real-world complexity and sensor noise, and no benchmark offers synchronized real and simulated point clouds for domain shift analysis. Method: The TrueCity benchmark comprises accurately annotated real point clouds, semantic 3D city models conforming to international standards, and corresponding annotated simulated point clouds, with segmentation classes aligned to those standards for consistent cross-domain evaluation. Result: Experiments quantify the domain shift on common baselines, confirm the potential of synthetic data to improve real-world scene understanding, and provide a platform for evaluating sim-to-real transfer. Conclusion: TrueCity fills the gap in benchmarks comparing real and synthetic data for city-scale semantic segmentation and should advance generalizable data-driven models and domain adaptation methods. Abstract: 3D semantic scene understanding remains a long-standing challenge in the 3D computer vision community. One of the key issues pertains to limited real-world annotated data to facilitate generalizable models. The common practice to tackle this issue is to simulate new data. Although synthetic datasets offer scalability and perfect labels, their designer-crafted scenes fail to capture real-world complexity and sensor noise, resulting in a synthetic-to-real domain gap. Moreover, no benchmark provides synchronized real and simulated point clouds for segmentation-oriented domain shift analysis. We introduce TrueCity, the first urban semantic segmentation benchmark with cm-accurate annotated real-world point clouds, semantic 3D city models, and annotated simulated point clouds representing the same city. TrueCity proposes segmentation classes aligned with international 3D city modeling standards, enabling consistent evaluation of the synthetic-to-real gap. Our extensive experiments on common baselines quantify domain shift and highlight strategies for exploiting synthetic data to enhance real-world 3D scene understanding. We are convinced that the TrueCity dataset will foster further development of sim-to-real gap quantification and enable generalizable data-driven models. The data, code, and 3D models are available online: https://tum-gis.github.io/TrueCity/

### [273] [Certified L2-Norm Robustness of 3D Point Cloud Recognition in the Frequency Domain](https://arxiv.org/abs/2511.07029)
*Liang Zhou,Qiming Wang,Tianze Chen*
Main category: cs.CV
TL;DR: Proposes FreqCert, a frequency-domain certification framework for 3D point cloud classification that uses the graph Fourier transform and frequency-domain subsampling to provide structured certification against global L2-bounded perturbations, with stronger robustness and theoretical guarantees than conventional spatial-domain methods.
Details
Motivation: Advances in deepfake technology heighten the threats of disinformation, fraud, and harassment by making malicious synthetic content ever harder to tell from real content, so effective detection methods are urgently needed. Method: A simple two-stage detection method that focuses on static frame-level artifacts rather than temporal inconsistencies and relies on continually updated, large, diverse datasets for training. Result: AUROC exceeds 99.8% on current deepfake data, but recall falls by more than 30% on deepfakes generated six months later, showing that detector performance degrades quickly as generation techniques evolve. Conclusion: Effective deepfake detection depends on rapid data collection and on developing advanced frame-level feature detectors, underscoring the importance of continually updated datasets and a focus on static artifacts. Abstract: The continually advancing quality of deepfake technology exacerbates the threats of disinformation, fraud, and harassment by making maliciously-generated synthetic content increasingly difficult to distinguish from reality. We introduce a simple yet effective two-stage detection method that achieves an AUROC of over 99.8% on contemporary deepfakes. However, this high performance is short-lived. We show that models trained on this data suffer a recall drop of over 30% when evaluated on deepfakes created with generation techniques from just six months later, demonstrating significant decay as threats evolve. Our analysis reveals two key insights for robust detection. First, continued performance requires the ongoing curation of large, diverse datasets. Second, predictive power comes primarily from static, frame-level artifacts, not temporal inconsistencies. The future of effective deepfake detection therefore depends on rapid data collection and the development of advanced frame-level feature detectors.

### [274] [3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition](https://arxiv.org/abs/2511.07040)
*Yuanmin Huang,Wenxuan Li,Mi Zhang,Xiaohan Zhang,Xiaoyu You,Min Yang*
Main category: cs.CV
TL;DR: Proposes 3D-ANC, which uses the Neural Collapse mechanism to make 3D point cloud recognition models more robust to adversarial attacks; an ETF-aligned classification module and an adaptive training framework address class imbalance and geometric similarity, substantially improving feature disentanglement and model performance.
Details
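Motivation: Existing point cloud classifiers are vulnerable to structured adversarial perturbations and geometric corruptions, while existing certified defenses focus on point-wise perturbations and overlook subtle geometric distortions that preserve individual points but alter the overall structure; a more effective certification mechanism is needed for reliability in safety-critical applications. Method: FreqCert first transforms the input point cloud to the frequency domain with the graph Fourier transform (GFT), then applies structured, frequency-aware subsampling to generate multiple sub-point clouds. Each sub-cloud is classified independently by a standard model, and the final result is obtained by majority voting. Sub-clouds are built from spectral similarity rather than spatial proximity, which makes them more stable under L2 perturbations and better aligned with the object's intrinsic structure. A closed-form lower bound on the certified L2 robustness radius is derived and proved tight under reasonable assumptions. Result: On ModelNet40 and ScanObjectNN, FreqCert consistently outperforms existing methods under strong perturbations, with higher certified and empirical accuracy. Conclusion: Frequency-domain representations offer an effective path to certifiable robustness in 3D point cloud recognition; FreqCert establishes a new certification paradigm through frequency-domain analysis and spectrum-aware partitioning, with sound theory and strong practical performance. Abstract: 3D point cloud classification is a fundamental task in safety-critical applications such as autonomous driving, robotics, and augmented reality. However, recent studies reveal that point cloud classifiers are vulnerable to structured adversarial perturbations and geometric corruptions, posing risks to their deployment in safety-critical scenarios. Existing certified defenses limit point-wise perturbations but overlook subtle geometric distortions that preserve individual points yet alter the overall structure, potentially leading to misclassification. In this work, we propose FreqCert, a novel certification framework that departs from conventional spatial domain defenses by shifting robustness analysis to the frequency domain, enabling structured certification against global L2-bounded perturbations. FreqCert first transforms the input point cloud via the graph Fourier transform (GFT), then applies structured frequency-aware subsampling to generate multiple sub-point clouds. Each sub-cloud is independently classified by a standard model, and the final prediction is obtained through majority voting, where sub-clouds are constructed based on spectral similarity rather than spatial proximity, making the partitioning more stable under L2 perturbations and better aligned with the object's intrinsic structure.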
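The voting step described in the FreqCert abstract can be illustrated with a minimal sketch. It assumes the per-sub-cloud labels have already been produced by some classifier; the function name `majority_vote` and the returned vote margin are illustrative choices, not taken from the paper, and the margin is not the paper's certified radius, which is a separate closed-form bound that the abstract does not spell out.

```python
from collections import Counter


def majority_vote(sub_cloud_labels):
    """Return (winning_label, vote_margin) for a list of per-sub-cloud labels.

    Hypothetical helper: the margin is the gap between the two highest vote
    counts, shown only for intuition. Ties go to the smallest label so the
    result is deterministic.
    """
    if not sub_cloud_labels:
        raise ValueError("need at least one sub-cloud prediction")
    counts = Counter(sub_cloud_labels)
    # Rank by descending vote count, then ascending label for tie-breaking.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_label, top_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
    return top_label, top_count - runner_up_count
```

Intuitively, a perturbation has to change enough sub-cloud predictions to close this margin before the final label flips, which is why a partition that stays stable under perturbation matters for voting-based defenses.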
We derive a closed-form lower bound on the certified L2 robustness radius and prove its tightness under minimal and interpretable assumptions, establishing a theoretical foundation for frequency domain certification. Extensive experiments on the ModelNet40 and ScanObjectNN datasets demonstrate that FreqCert consistently achieves higher certified accuracy and empirical accuracy under strong perturbations. Our results suggest that spectral representations provide an effective pathway toward certifiable robustness in 3D point cloud recognition.

### [275] [From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge](https://arxiv.org/abs/2511.07049)
*Hui Lu,Yi Yu,Song Xia,Yiming Yang,Deepu Rajan,Boon Poh Ng,Alex Kot,Xudong Jiang*
Main category: cs.CV
TL;DR: Proposes TVA, a new adversarial attack on downstream models and multimodal large language models (MLLMs) fine-tuned from open-source video foundation models (VFMs), which achieves cross-task transfer attacks without access to the target task or data.
Details
Motivation: Existing 3D point cloud recognition models are highly vulnerable to adversarial perturbations, and conventional defenses struggle with diverse attack patterns, mainly because the feature space is heavily entangled and lacks an effective disentangling mechanism. Method: 3D-ANC builds a simplex equiangular tight frame (ETF) based on the Neural Collapse mechanism, designs an ETF-aligned classification module, and pairs it with an adaptive training framework of representation-balanced learning (RBL) and dynamic feature direction loss (FDL) to handle class imbalance and geometric similarity. Result: 3D-ANC is validated across multiple models and datasets and substantially improves robustness. For example, on ModelNet40 DGCNN's accuracy rises from 27.2% to 80.9%, an absolute gain of 53.7% that surpasses leading baselines by 34.0%. Conclusion: By steering the feature space toward an ETF structure, 3D-ANC yields highly separable class prototypes, disentangles the feature space, and markedly improves the stability and generalization of 3D point cloud recognition models under adversarial conditions. Abstract: Deep neural networks have recently achieved notable progress in 3D point cloud recognition, yet their vulnerability to adversarial perturbations poses critical security challenges in practical deployments. Conventional defense mechanisms struggle to address the evolving landscape of multifaceted attack patterns. Through systematic analysis of existing defenses, we identify that their unsatisfactory performance primarily originates from an entangled feature space, where adversarial attacks can be performed easily. To this end, we present 3D-ANC, a novel approach that capitalizes on the Neural Collapse (NC) mechanism to orchestrate discriminative feature learning. In particular, NC describes how last-layer features and classifier weights jointly evolve into a simplex equiangular tight frame (ETF) arrangement, establishing maximally separable class prototypes. However, leveraging this advantage in 3D recognition confronts two substantial challenges: (1) prevalent class imbalance in point cloud datasets, and (2) complex geometric similarities between object categories. To tackle these obstacles, our solution combines an ETF-aligned classification module with an adaptive training framework consisting of representation-balanced learning (RBL) and dynamic feature direction loss (FDL). 3D-ANC seamlessly empowers existing models to develop disentangled feature spaces despite the complexity in 3D data distribution. Comprehensive evaluations show that 3D-ANC significantly improves the robustness of models with various structures on two datasets.
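For instance, DGCNN's classification accuracy is elevated from 27.2% to 80.9% on ModelNet40, a 53.7% absolute gain that surpasses leading baselines by 34.0%.

### [276] [Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation](https://arxiv.org/abs/2511.07051)
*Yuxuan Zhou,Tao Yu,Wen Huang,Yuheng Zhang,Tao Dai,Shu-Tao Xia*
Main category: cs.CV
TL;DR: Proposes CRDA, a data augmentation method based on curriculum reinforcement learning that improves the cross-domain generalization of deepfake detectors. It dynamically selects and generates adversarial forgery samples matched to the detector's current learning state and uses causal inference to reduce spurious correlations, making the model more robust to complex and diverse forgery techniques.
Details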
Motivation: The wide use of open-source video foundation models (VFMs) brings security risks, since attackers can exploit their openness to mount strong attacks; yet existing transfer attacks depend on task-aligned surrogate models, which limits their generality and practicality. Method: The Transferable Video Attack (TVA) crafts adversarial perturbations from the temporal representation dynamics of VFMs, using bidirectional contrastive learning to maximize the discrepancy between clean and adversarial features and a temporal consistency loss that exploits motion cues to strengthen the perturbations' sequential impact. Result: Experiments on 24 video-related tasks show that TVA effectively attacks a range of downstream models and MLLMs without training expensive surrogate models or obtaining domain-specific data, and that it transfers well and is practical. Conclusion: TVA reveals an exploitable security vulnerability in models fine-tuned from open-source VFMs, which serves as a warning and a research direction for the secure deployment of video models. Abstract: Large-scale Video Foundation Models (VFMs) have significantly advanced various video-related tasks, either through task-specific models or Multi-modal Large Language Models (MLLMs). However, the open accessibility of VFMs also introduces critical security risks, as adversaries can exploit full knowledge of the VFMs to launch potent attacks. This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs, without requiring access to the victim task, training data, model queries, or architecture. In contrast to conventional transfer-based attacks that rely on task-aligned surrogate models, we demonstrate that adversarial vulnerabilities can be exploited directly from the VFMs. To this end, we propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA integrates a bidirectional contrastive learning mechanism to maximize the discrepancy between the clean and adversarial features, and introduces a temporal consistency loss that exploits motion cues to enhance the sequential impact of perturbations. TVA avoids the need to train expensive surrogate models or to access domain-specific data, thereby offering a more practical and efficient attack strategy.
Extensive experiments across 24 video-related tasks demonstrate the efficacy of TVA against downstream models and MLLMs, revealing a previously underexplored security vulnerability in the deployment of video models.

### [277] [RaLD: Generating High-Resolution 3D Radar Point Clouds with Latent Diffusion](https://arxiv.org/abs/2511.07067)
*Ruijie Zhang,Bixin Zeng,Shengpeng Wang,Fuhui Zhou,Wei Wang*
Main category: cs.CV
TL;DR: Proposes RaLD, a framework that uses a latent diffusion model to generate dense, accurate 3D point clouds from raw radar spectra; frustum-based LiDAR autoencoding and direct radar spectrum conditioning address the sparsity and low resolution of radar point clouds.
Details
Motivation: Existing data augmentation methods use fixed forgery strategies that cannot simulate the diverse, evolving forgery features seen in the real world (such as facial warping and expression manipulation), which limits detector generalization; an augmentation strategy that adapts to the detector's learning progress is needed. Method: CRDA combines curriculum learning, reinforcement learning, and causal inference: (1) a configurable pool of forgery operations generates multi-domain forgery samples; (2) an RL agent dynamically selects the best augmentation policy based on detector performance; (3) action-space variations generate heterogeneous forgery patterns; (4) causal inference removes task-irrelevant biases and focuses learning on causally invariant features. Result: On multiple cross-domain datasets, CRDA significantly outperforms state-of-the-art methods and improves generalization to unseen domains, with notably stronger robustness against complex and diverse forgery techniques. Conclusion: Dynamic, adaptive augmentation is more effective than fixed strategies; by integrating reinforcement learning with causal reasoning, CRDA achieves efficient curriculum-style training for deepfake detection and points to a new direction for improving generalization. Abstract: The generalization capability of deepfake detectors is critical for real-world use. Data augmentation via synthetic fake face generation effectively enhances generalization, yet current SOTA methods rely on fixed strategies, raising a key question: Is a single static augmentation sufficient, or does the diversity of forgery features demand dynamic approaches? We argue existing methods overlook the evolving complexity of real-world forgeries (e.g., facial warping, expression manipulation), which fixed policies cannot fully simulate. To address this, we propose CRDA (Curriculum Reinforcement-Learning Data Augmentation), a novel framework guiding detectors to progressively master multi-domain forgery features from simple to complex. CRDA synthesizes augmented samples via a configurable pool of forgery operations and dynamically generates adversarial samples tailored to the detector's current learning state. Central to our approach is integrating reinforcement learning (RL) and causal inference. An RL agent dynamically selects augmentation actions based on detector performance to efficiently explore the vast augmentation space, adapting to increasingly challenging forgeries. Simultaneously, the agent introduces action space variations to generate heterogeneous forgery patterns, guided by causal inference to mitigate spurious correlations, suppressing task-irrelevant biases and focusing on causally invariant features. This integration ensures robust generalization by decoupling synthetic augmentation patterns from the model's learned representations.
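The idea of an agent that picks augmentation actions from detector feedback can be illustrated with a deliberately simplified stand-in. The sketch below is an epsilon-greedy bandit, not the RL agent used in CRDA, whose algorithm the abstract does not specify. The class name `AugmentationBandit`, the action names, and the choice of reward (for example, the detector's loss on the augmented batch, so that harder samples are preferred) are all assumptions made for illustration.

```python
import random


class AugmentationBandit:
    """Epsilon-greedy selector over a pool of augmentation actions (illustrative)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}  # running mean reward per action

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental mean update of the action's estimated reward.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
```

A training loop would call `select()` to choose an operation, apply it to a batch, measure the detector's response, and feed that back through `update()`; a curriculum could be layered on top by widening the action pool over time.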
Extensive experiments show our method significantly improves detector generalizability, outperforming SOTA methods across multiple cross-domain datasets.

### [278] [ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora](https://arxiv.org/abs/2511.07068)
*Nikolas Adaloglou,Diana Petrusheva,Mohamed Asker,Felix Michels,Markus Kollmann*
Main category: cs.CV
TL;DR: Proposes ClusterMine, an unsupervised visual out-of-distribution (OOD) detection method that needs no predefined positive labels; it mines positive concepts from text corpora and achieves state-of-the-art performance and strong robustness to distribution shift across a range of CLIP models.
Details
Motivation: Millimeter-wave radar is robust in adverse conditions and low in cost, but its point clouds are sparse and low in resolution, limiting its use for tasks that need dense, accurate 3D perception. Existing generative methods rely on dense voxel representations that are inefficient and struggle to preserve structural detail. Method: RaLD combines scene-level frustum-based LiDAR autoencoding, order-invariant latent representations, and direct radar spectrum conditioning, using a latent diffusion model for an efficient, compact, and expressive 3D generation process. Result: RaLD generates dense, accurate 3D point clouds from raw radar spectra and shows strong perception capability in challenging environments. Conclusion: RaLD offers an effective solution for radar-based 3D generation and markedly improves the perception performance of millimeter-wave radar in autonomous driving systems. Abstract: Millimeter-wave radar offers a promising sensing modality for autonomous systems thanks to its robustness in adverse conditions and low cost. However, its utility is significantly limited by the sparsity and low resolution of radar point clouds, which poses challenges for tasks requiring dense and accurate 3D perception. Although recent efforts have shown great potential by exploring generative approaches to this issue, they often rely on dense voxel representations that are inefficient and struggle to preserve structural detail. To fill this gap, we make the key observation that latent diffusion models (LDMs), though successful in other modalities, have not been effectively leveraged for radar-based 3D generation due to a lack of compatible representations and conditioning strategies. We introduce RaLD, a framework that bridges this gap by integrating scene-level frustum-based LiDAR autoencoding, order-invariant latent representations, and direct radar spectrum conditioning. These insights lead to a more compact and expressive generation process. Experiments show that RaLD produces dense and accurate 3D point clouds from raw radar spectra, offering a promising solution for robust perception in challenging environments.

### [279] [LeCoT: revisiting network architecture for two-view correspondence pruning](https://arxiv.org/abs/2511.07078)
*Luanyuan Dai,Xiaoyu Du,Jinhui Tang*
Main category: cs.CV
TL;DR: Proposes LeCoT, a new two-view correspondence pruning network that exploits global context through a Spatial-Channel Fusion Transformer block and a progressive prediction module, outperforming existing methods on several vision tasks.
Details
Motivation: Existing CLIP-based visual OOD detection methods depend on predefined positive (in-distribution) labels, which in practice may be unavailable, unreliable, or invalidated by distribution shift, limiting real-world applicability. Method: ClusterMine mines positive concepts from a large text corpus by combining visual cluster consistency with zero-shot image-text consistency, enabling OOD detection without ground-truth labels. Result: ClusterMine achieves state-of-the-art OOD detection across multiple CLIP models and strong robustness to covariate distribution shift, without any positive labels. Conclusion: ClusterMine is the first method to deliver high-performance visual OOD detection without predefined positive labels, advancing truly unsupervised OOD detection. Abstract: Large-scale visual out-of-distribution (OOD) detection has witnessed remarkable progress by leveraging vision-language models such as CLIP. However, a significant limitation of current methods is their reliance on a pre-defined set of in-distribution (ID) ground-truth label names (positives). These fixed label names can be unavailable, unreliable at scale, or become less relevant due to in-distribution shifts after deployment. Towards truly unsupervised OOD detection, we utilize widely available text corpora for positive label mining under a general concept mining paradigm, bypassing the need for positives. Within this framework, we propose ClusterMine, a novel positive label mining method. ClusterMine is the first method to achieve state-of-the-art OOD detection performance without access to positive labels. It extracts positive concepts from a large text corpus by combining visual-only sample consistency (via clustering) and zero-shot image-text consistency. Our experimental study reveals that ClusterMine is scalable across a plethora of CLIP models and achieves state-of-the-art robustness to covariate in-distribution shifts. The code is available at https://github.com/HHU-MMBS/clustermine_wacv_official.

### [280] [Pandar128 dataset for lane line detection](https://arxiv.org/abs/2511.07084)
*Filip Beránek,Václav Diviš,Ivan Gruber*
Main category: cs.CV
TL;DR: Pandar128 is currently the largest public 128-beam LiDAR dataset for lane line detection, with 52,000 images and 34,000 LiDAR scans; the paper also proposes the lightweight SimpleLidarLane method and a new evaluation metric, IAM-F1.
Details
Motivation: Existing MLP-based methods struggle to capture context among correspondences and depend on extra modules to boost performance, which limits efficiency and expressive power. Method: The LeCoT network introduces a Spatial-Channel Fusion Transformer block that fuses global context across spatial and channel dimensions, and a progressive prediction block that uses intermediate features to refine confidence estimates. Result: LeCoT clearly outperforms state-of-the-art methods on correspondence pruning, relative pose estimation, homography estimation, visual localization, and 3D reconstruction. Conclusion: By integrating global context naturally, LeCoT improves performance without extra modules and offers a new perspective on two-view correspondence pruning. Abstract: Two-view correspondence pruning aims to accurately remove incorrect correspondences (outliers) from initial ones and is widely applied to various computer vision tasks. Current popular strategies adopt multilayer perceptron (MLP) as the backbone, supplemented by additional modules to enhance the network's ability to handle context information, which is a known limitation of MLPs. In contrast, we introduce a novel perspective for capturing correspondence context information without extra design modules. To this end, we design a two-view correspondence pruning network called LeCoT, which can naturally leverage global context information at different stages. Specifically, the core design of LeCoT is the Spatial-Channel Fusion Transformer block, a newly proposed component that efficiently utilizes both spatial and channel global context information among sparse correspondences. In addition, we integrate the proposed prediction block that utilizes correspondence features from intermediate stages to generate a probability set, which acts as guiding information for subsequent learning phases, allowing the network to more effectively capture robust global context information. Notably, this prediction block progressively refines the probability set, thereby mitigating the issue of information loss that is common in traditional designs. Extensive experiments prove that the proposed LeCoT outperforms state-of-the-art methods in correspondence pruning, relative pose estimation, homography estimation, visual localization, and 3D reconstruction tasks.
The code is available at https://github.com/Dailuanyuan2024/LeCoT-Revisiting-Network-Architecture-for-Two-View-Correspondence-Pruning.

### [281] [How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions](https://arxiv.org/abs/2511.07091)
*Jeng-Lin Li,Ming-Ching Chang,Wei-Chao Chen*
Main category: cs.CV
TL;DR: Studies bias under semantic binding in text-to-image generation models, proposes a training-free context-bias control framework, and introduces a bias adherence score to quantify bias in object-attribute bindings, achieving over 10% debiasing improvement.
Details
Motivation: Lane line detection lacks high-quality, large-scale LiDAR datasets and standardized evaluation methods, which holds the field back. Method: SimpleLidarLane reconstructs lane lines by combining BEV segmentation, clustering, and polyline fitting; a new metric, Interpolation-Aware Matching F1 (IAM-F1), uses interpolation-aware lateral matching. Result: Across varied real-world conditions (such as rain and sparse returns), SimpleLidarLane is robust and competitive, showing that a modular pipeline paired with high-quality data can rival more complex methods. Conclusion: High-quality data and sound evaluation standards are essential for LiDAR lane detection; the released dataset, code, and metric support reproducible research in the field. Abstract: We present Pandar128, the largest public dataset for lane line detection using a 128-beam LiDAR. It contains over 52,000 camera frames and 34,000 LiDAR scans, captured in diverse real-world conditions in Germany. The dataset includes full sensor calibration (intrinsics, extrinsics) and synchronized odometry, supporting tasks such as projection, fusion, and temporal modeling. To complement the dataset, we also introduce SimpleLidarLane, a lightweight baseline method for lane line reconstruction that combines BEV segmentation, clustering, and polyline fitting. Despite its simplicity, our method achieves strong performance under various challenging conditions (e.g., rain, sparse returns), showing that modular pipelines paired with high-quality data and principled evaluation can compete with more complex approaches. Furthermore, to address the lack of standardized evaluation, we propose a novel polyline-based metric, Interpolation-Aware Matching F1 (IAM-F1), that employs interpolation-aware lateral matching in BEV space. All data and code are publicly released to support reproducibility in LiDAR-based lane detection.

### [282] [GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution](https://arxiv.org/abs/2511.07103)
*Sirui Wang,Jiang He,Natàlia Blasco Andreo,Xiao Xiang Zhu*
Main category: cs.CV
TL;DR: Proposes a Geometric Enhanced Wavelet-based Diffusion Model (GEWDiff) for 4x super-resolution of hyperspectral images, using a wavelet encoder-decoder and a multi-level loss function for efficient, high-fidelity generation.
Details
Motivation: Existing debiasing methods mostly focus on single-object prompts and overlook the joint bias effects of semantic binding between objects and attributes in a prompt, so they fail in complex contexts. Method: Proposes a training-free context-bias control framework that analyzes bias in semantic bindings through token decoupling, and introduces a bias adherence score that quantifies how strongly specific object-attribute bindings activate bias. Result: The framework achieves over 10% debiasing improvement in compositional generation tasks and shows that the bias distribution under different attribute-object bindings can be amplified by semantic associations. Conclusion: Current debiasing methods have fundamental limitations in semantically bound contexts; prevailing bias mitigation strategies need to be reassessed to balance debiasing with semantic integrity. Abstract: Text-to-image generative models often exhibit bias related to sensitive attributes. However, current research tends to focus narrowly on single-object prompts with limited contextual diversity. In reality, each object or attribute within a prompt can contribute to bias. For example, the prompt "an assistant wearing a pink hat" may reflect female-inclined biases associated with a pink hat. The neglected joint effects of the semantic binding in the prompts cause significant failures in current debiasing approaches. This work initiates a preliminary investigation on how bias manifests under semantic binding, where contextual associations between objects and attributes influence generative outcomes. We demonstrate that the underlying bias distribution can be amplified based on these associations. Therefore, we introduce a bias adherence score that quantifies how specific object-attribute bindings activate bias. To delve deeper, we develop a training-free context-bias control framework to explore how token decoupling can facilitate the debiasing of semantic bindings. This framework achieves over 10% debiasing improvement in compositional generation tasks. Our analysis of bias scores across various attribute-object bindings and token decorrelation highlights a fundamental challenge: reducing bias without disrupting essential semantic relationships.
These findings expose critical limitations in current debiasing approaches when applied to semantically bound contexts, underscoring the need to reassess prevailing bias mitigation strategies.

### [283] [HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving](https://arxiv.org/abs/2511.07106)
*Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Ming-Hsuan Yang*
Main category: cs.CV
TL;DR: Proposes the HENet and HENet++ frameworks for multi-task 3D perception and end-to-end autonomous driving; using a hybrid image encoding network and dense-sparse feature extraction, they achieve state-of-the-art performance on nuScenes and reduce the collision rate.
Details
Motivation: The high spectral dimensionality of hyperspectral images makes conventional diffusion models memory-intensive; existing generative models lack an understanding of the topological and geometric structures of ground objects in remote sensing imagery; and optimizing the loss at the noise level leads to suboptimal generation quality. Method: Designs a wavelet-based encoder-decoder that compresses hyperspectral images into a latent space, introduces a geometry-enhanced diffusion process to preserve geometric features, and uses a multi-level loss function to guide the diffusion process, improving convergence stability and reconstruction accuracy. Result: Achieves state-of-the-art performance on multiple measures, including image fidelity, spectral accuracy, visual realism, and clarity. Conclusion: GEWDiff addresses the memory, structure-preservation, and training-stability problems of hyperspectral image generation and significantly improves super-resolution reconstruction quality. Abstract: Improving the quality of hyperspectral images (HSIs), such as through super-resolution, is a crucial research area. However, generative modeling for HSIs presents several challenges. Due to their high spectral dimensionality, HSIs are too memory-intensive for direct input into conventional diffusion models. Furthermore, general generative models lack an understanding of the topological and geometric structures of ground objects in remote sensing imagery. In addition, most diffusion models optimize loss functions at the noise level, leading to non-intuitive convergence behavior and suboptimal generation quality for complex data. To address these challenges, we propose a Geometric Enhanced Wavelet-based Diffusion Model (GEWDiff), a novel framework for reconstructing hyperspectral images at 4-times super-resolution. A wavelet-based encoder-decoder is introduced that efficiently compresses HSIs into a latent space while preserving spectral-spatial information. To avoid distortion during generation, we incorporate a geometry-enhanced diffusion process that preserves the geometric features. Furthermore, a multi-level loss function was designed to guide the diffusion process, promoting stable convergence and improved reconstruction fidelity.
Our model demonstrated state-of-the-art results across multiple dimensions, including fidelity, spectral accuracy, visual realism, and clarity.

### [284] [Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction](https://arxiv.org/abs/2511.07122)
*Changyue Shi, Chuxiao Yang, Xinyuan Hu, Minghao Chen, Wenwen Pan, Yan Yang, Jiajun Ding, Zhou Yu, Jun Yu*
Main category: cs.CV
TL;DR: Proposes Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction; by focusing on texture-rich regions, it outperforms existing dynamic or few-shot NeRF techniques when given sparse input frames.
Details
Motivation: Because of computational resource constraints, existing methods struggle to combine large models, high resolution, and long-term temporal inputs in both training and inference; in addition, different tasks favor different feature representations, so multi-task models rarely match the accuracy of single-task models. Method: Proposes a hybrid image encoding network (a large model for short-term frames and a small model for long-term frames) that extracts both dense and sparse features, remains compatible with various 3D feature extraction methods, and supports multimodal inputs. Result: HENet++ achieves SOTA performance on nuScenes multi-task 3D perception and the lowest collision rate in the end-to-end autonomous driving evaluation. Conclusion: The framework balances computational efficiency and feature quality, supports high-performance multi-task learning and end-to-end driving, and offers good compatibility and practical potential. Abstract: Three-dimensional feature extraction is a critical component of autonomous driving systems, where perception tasks such as 3D object detection, bird's-eye-view (BEV) semantic segmentation, and occupancy prediction serve as important constraints on 3D features. While large image encoders, high-resolution images, and long-term temporal inputs can significantly enhance feature quality and deliver remarkable performance gains, these techniques are often incompatible in both training and inference due to computational resource constraints. Moreover, different tasks favor distinct feature representations, making it difficult for a single model to perform end-to-end inference across multiple tasks while maintaining accuracy comparable to that of single-task models. To alleviate these issues, we present the HENet and HENet++ frameworks for multi-task 3D perception and end-to-end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short-term frames and a small one for long-term frames. Furthermore, our framework simultaneously extracts both dense and sparse features, providing more suitable representations for different tasks, reducing cumulative errors, and delivering more comprehensive information to the planning module. The proposed architecture maintains compatibility with various existing 3D feature extraction methods and supports multimodal inputs.
HENet++ achieves state-of-the-art end-to-end multi-task 3D perception results on the nuScenes benchmark, while also attaining the lowest collision rate on the nuScenes end-to-end autonomous driving benchmark.

### [285] [MPJudge: Towards Perceptual Assessment of Music-Induced Paintings](https://arxiv.org/abs/2511.07137)
*Shiqi Jiang, Tianyi Liang, Changbo Wang, Chenhui Li*
Main category: cs.CV
TL;DR: Proposes MPJudge, a new framework for assessing music-induced paintings that directly models perceptual coherence between music and visual art; trained on the large-scale MPD dataset with preference optimization, it outperforms existing methods in accuracy and in identifying music-relevant regions.
Details
Motivation: Existing dynamic Gaussian splatting methods rely on dense video frames for high-quality reconstruction, but in equipment-constrained real-world scenarios only sparse frames are available, which degrades reconstruction, especially in texture-rich regions. Method: Proposes Texture-Aware Deformation Regularization (for the deformation network, introducing a texture-based depth alignment loss) and Texture-Aware Canonical Optimization (for the canonical Gaussian field, adding texture-based noise during gradient descent) to improve reconstruction quality from sparse frames. Result: Experiments on NeRF-Synthetic, HyperNeRF, NeRF-DS, and the authors' iPhone-4D dataset show that Sparse4DGS outperforms existing dynamic or few-shot methods with sparse-frame input. Conclusion: Sparse4DGS is the first dynamic Gaussian splatting method to support sparse-frame input; its texture-aware regularization and optimization strategies significantly improve dynamic scene reconstruction under sparse conditions, particularly in highly textured regions. Abstract: Dynamic Gaussian Splatting approaches have achieved remarkable performance for 4D scene reconstruction. However, these approaches rely on dense-frame video sequences for photorealistic reconstruction. In real-world scenarios, due to equipment constraints, sometimes only sparse frames are accessible. In this paper, we propose Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction. We observe that dynamic reconstruction methods fail in both canonical and deformed spaces under sparse-frame settings, especially in areas with high texture richness. Sparse4DGS tackles this challenge by focusing on texture-rich areas. For the deformation network, we propose Texture-Aware Deformation Regularization, which introduces a texture-based depth alignment loss to regulate Gaussian deformation. For the canonical Gaussian field, we introduce Texture-Aware Canonical Optimization, which incorporates texture-based noise into the gradient descent process of canonical Gaussians. Extensive experiments show that when taking sparse frames as inputs, our method outperforms existing dynamic or few-shot techniques on NeRF-Synthetic, HyperNeRF, NeRF-DS, and our iPhone-4D datasets.

### [286] [ProcGen3D: Learning Neural Procedural Graph Representations for Image-to-3D Reconstruction](https://arxiv.org/abs/2511.07142)
*Xinyi Zhang, Daoyi Gao, Naiqi Li, Angela Dai*
Main category: cs.CV
TL;DR: Proposes ProcGen3D, a graph-based neural procedural generation method that produces high-quality, image-aligned 3D content from a single RGB image.
Details
Motivation: Existing methods rely on emotion recognition models to assess the similarity between music and paintings; these are noisy and overlook non-emotional perceptual cues, so they cannot accurately measure the perceptual coherence of music-induced paintings. Method: Builds MPD, the first large-scale dataset of music-painting pairs, with expert-annotated perceptual coherence labels and pairwise preferences for ambiguous cases; proposes the MPJudge model, which injects music features into a visual encoder via a modulation-based fusion mechanism and is trained with Direct Preference Optimization (DPO) to better handle ambiguous cases. Result: Experiments show the method outperforms existing approaches on music-induced painting assessment, and qualitative results confirm that the model more accurately identifies music-relevant regions in paintings. Conclusion: By directly modeling perceptual coherence and combining a high-quality dataset with preference learning, MPJudge improves music-induced painting assessment and offers a new approach to cross-modal art analysis. Abstract: Music-induced painting is a unique artistic practice, where visual artworks are created under the influence of music. Evaluating whether a painting faithfully reflects the music that inspired it poses a challenging perceptual assessment task. Existing methods primarily rely on emotion recognition models to assess the similarity between music and painting, but such models introduce considerable noise and overlook broader perceptual cues beyond emotion. To address these limitations, we propose a novel framework for music-induced painting assessment that directly models perceptual coherence between music and visual art. We introduce MPD, the first large-scale dataset of music-painting pairs annotated by domain experts based on perceptual coherence. To better handle ambiguous cases, we further collect pairwise preference annotations. Building on this dataset, we present MPJudge, a model that integrates music features into a visual encoder via a modulation-based fusion mechanism. To effectively learn from ambiguous cases, we adopt Direct Preference Optimization for training. Extensive experiments demonstrate that our method outperforms existing approaches. Qualitative results further show that our model more accurately identifies music-relevant regions in paintings.

### [287] [Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use](https://arxiv.org/abs/2511.07171)
*Sébastien Thuau, Siba Haidar, Rachid Chelouah*
Main category: cs.CV
TL;DR: Compares three federated violence detection approaches, weighing privacy, energy efficiency, and accuracy, and proposes a hybrid deployment strategy based on 3D CNNs and VLMs.
Details
Motivation: Existing 3D generation methods are limited in modeling complex structures and in image alignment, while procedural generation is widely used in production, motivating a method that combines the strengths of both. Method: Proposes a sequentialized graph-structured representation of 3D assets, encodes procedural graphs with edge-based tokenization, uses a transformer prior to predict the next token conditioned on the input image, and introduces Monte Carlo Tree Search (MCTS) guided sampling to improve image alignment. Result: Experiments on cacti, trees, and bridges show the method outperforms state-of-the-art 3D generative methods and domain-specific modeling techniques, and generalizes well to real images despite training only on synthetic data. Conclusion: By combining neural networks with procedural graph representations, ProcGen3D achieves high-quality, controllable, image-aligned 3D content generation with good cross-category generalization. Abstract: We introduce ProcGen3D, a new approach for 3D content creation by generating procedural graph abstractions of 3D objects, which can then be decoded into rich, complex 3D assets. Inspired by the prevalent use of procedural generators in production 3D applications, we propose a sequentialized, graph-based procedural graph representation for 3D assets. We use this to learn to approximate the landscape of a procedural generator for image-based 3D reconstruction. We employ edge-based tokenization to encode the procedural graphs, and train a transformer prior to predict the next token conditioned on an input RGB image. Crucially, to enable better alignment of our generated outputs to an input image, we incorporate Monte Carlo Tree Search (MCTS) guided sampling into our generation process, steering output procedural graphs towards more image-faithful reconstructions. Our approach is applicable across a variety of objects that can be synthesized with procedural generators. Extensive experiments on cacti, trees, and bridges show that our neural procedural graph generation outperforms both state-of-the-art generative 3D methods and domain-specific modeling techniques. Furthermore, this enables improved generalization on real-world input images, despite training only on synthetic data.

### [288] [LiteUpdate: A Lightweight Framework for Updating AI-Generated Image Detectors](https://arxiv.org/abs/2511.07192)
*Jiajie Lu, Zhenkan Fu, Na Zhao, Long Xing, Kejiang Chen, Weiming Zhang, Nenghai Yu*
Main category: cs.CV
TL;DR: Proposes LiteUpdate, a lightweight framework for efficiently updating AI-generated image detectors; through representative sample selection and model weight merging, it improves detection of new generative models while mitigating catastrophic forgetting.
Details
Motivation: Deep learning video surveillance must balance privacy protection with low energy consumption, but combining federated learning with large models is energy-hungry, so sustainable solutions are needed. Method: Compares three strategies on the RWF-2000, RLVS, and UCF-Crime datasets: zero-shot VLM inference, LoRA fine-tuning of LLaVA-NeXT-Video-7B, and a personalized federated 3D CNN, and quantifies their energy consumption and CO2e. Result: All methods exceed 90% accuracy in binary violence detection; the 3D CNN uses roughly half the energy of LoRA (240 Wh vs. 570 Wh) and is better calibrated (ROC AUC 92.59%); with semantic grouping, VLM multiclass accuracy rises from 65.31% to 81%. Conclusion: Efficient small 3D CNNs suit routine detection, while VLMs suit fine-grained reasoning in complex contexts; a hybrid deployment that switches models according to task needs is recommended to balance energy efficiency and performance. Abstract: Deep learning-based video surveillance increasingly demands privacy-preserving architectures with low computational and environmental overhead. Federated learning preserves privacy, but deploying large vision-language models (VLMs) introduces major energy and sustainability challenges. We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets: zero-shot inference with pretrained VLMs, LoRA-based fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 65.8M-parameter 3D CNN. All methods exceed 90% accuracy in binary violence detection. The 3D CNN achieves superior calibration (ROC AUC 92.59%) at roughly half the energy cost (240 Wh vs. 570 Wh) of federated LoRA, while VLMs provide richer multimodal reasoning. Hierarchical category grouping (based on semantic similarity and class exclusion) boosts VLM multiclass accuracy from 65.31% to 81% on the UCF-Crime dataset. To our knowledge, this is the first comparative simulation study of LoRA-tuned VLMs and personalized CNNs for federated violence detection, with explicit energy and CO2e quantification.
Our results inform hybrid deployment strategies that default to efficient CNNs for routine inference and selectively engage VLMs for complex contextual reasoning.

### [289] [Automated Estimation of Anatomical Risk Metrics for Endoscopic Sinus Surgery Using Deep Learning](https://arxiv.org/abs/2511.07199)
*Konrad Reuter, Lennart Thaysen, Bilkay Doruk, Sarah Latus, Brigitte Holst, Benjamin Becker, Dennis Eggert, Christian Betz, Anna-Sophie Hoffmann, Alexander Schlaefer*
Main category: cs.CV
TL;DR: Proposes an automated deep learning pipeline that estimates anatomical risk scores relevant to sinus surgery by localizing key anatomical landmarks.
Details
Motivation: Existing AI-generated image detection methods cannot keep pace with the rapid development of generative models, which significantly degrades detection performance; an efficient and sustainable way to update detectors is urgently needed. Method: LiteUpdate has two core modules: (1) a representative sample selection module that uses image confidence and gradient-based discriminative features to select boundary samples and improve learning efficiency; (2) a model merging module that fuses weights from the pre-trained, representative fine-tuning, and random update trajectories, balancing adaptation to new knowledge with retention of old knowledge. Result: Experiments show that LiteUpdate significantly improves the performance of various detectors. On the AIDE detector, average detection accuracy on Midjourney-generated images improves from 87.63% to 93.03%, a relative increase of 6.16%. Conclusion: Through lightweight sample selection and model merging, LiteUpdate addresses inefficiency and catastrophic forgetting in detector updates, providing a practical and efficient solution for keeping up with new generative models. Abstract: The rapid progress of generative AI has led to the emergence of new generative models, while existing detection methods struggle to keep pace, resulting in significant degradation in detection performance. This highlights the urgent need for continuously updating AI-generated image detectors to adapt to new generators. To overcome low efficiency and catastrophic forgetting in detector updates, we propose LiteUpdate, a lightweight framework for updating AI-generated image detectors. LiteUpdate employs a representative sample selection module that leverages image confidence and gradient-based discriminative features to precisely select boundary samples. This approach improves learning and detection accuracy on new distributions with limited generated images, significantly enhancing detector update efficiency. Additionally, LiteUpdate incorporates a model merging module that fuses weights from multiple fine-tuning trajectories, including pre-trained, representative, and random updates. This balances adaptability to new generators and mitigates the catastrophic forgetting of prior knowledge. Experiments demonstrate that LiteUpdate substantially boosts detection performance in various detectors.
Specifically, on AIDE, the average detection accuracy on Midjourney improved from 87.63% to 93.03%, a 6.16% relative increase.

### [290] [Geometric implicit neural representations for signed distance functions](https://arxiv.org/abs/2511.07206)
*Luiz Schirmer, Tiago Novello, Vinícius da Silva, Guilherme Schardong, Daniel Perazzo, Hélio Lopes, Nuno Gonçalves, Luiz Velho*
Main category: cs.CV
TL;DR: Surveys research on using implicit neural representations (INRs) to approximate signed distance functions (SDFs) from oriented point clouds or posed images, with a focus on geometric INRs that incorporate differential geometry tools such as normals and curvatures.
Details
Motivation: To reduce the time spent on manual measurements in preoperative assessment and improve the safety of endoscopic sinus surgery. Method: Uses heatmap regression to localize key anatomical landmarks and compares a direct approach with a global-to-local learning strategy. Result: Achieves mean absolute errors of 0.506 mm, 4.516°, and 0.802 mm / 0.777 mm on the anatomical measurements relevant to the Keros, Gera, and TMS scores, respectively. Conclusion: The automated deep learning approach estimates anatomical risk scores accurately and has potential for clinical use. Abstract: Endoscopic sinus surgery requires careful preoperative assessment of the skull base anatomy to minimize risks such as cerebrospinal fluid leakage. Anatomical risk scores like the Keros, Gera and Thailand-Malaysia-Singapore score offer a standardized approach but require time-consuming manual measurements on coronal CT or CBCT scans. We propose an automated deep learning pipeline that estimates these risk scores by localizing key anatomical landmarks via heatmap regression. We compare a direct approach to a specialized global-to-local learning strategy and find mean absolute errors on the relevant anatomical measurements of 0.506 mm for the Keros, 4.516° for the Gera and 0.802 mm / 0.777 mm for the TMS classification.

### [291] [Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization](https://arxiv.org/abs/2511.07210)
*Binyan Xu, Fan Yang, Di Tang, Xilin Dai, Kehuan Zhang*
Main category: cs.CV
TL;DR: Proposes Generative Clean-Image Backdoors (GCB), a new paradigm for clean-image backdoor attacks that uses a conditional InfoGAN to optimize the trigger itself, significantly reducing the impact on clean accuracy (<1%), and validates its effectiveness and stealth across multiple datasets, models, and tasks.
Details
Motivation: To improve 3D reconstruction quality by adding regularization terms to the loss function so that the INR satisfies certain global properties. Method: Examines the definition of INRs, the construction of geometric loss functions, and sampling schemes from a differential geometry perspective. Result: Geometric INRs have enabled significant advances in surface reconstruction from oriented point clouds and posed images. Conclusion: Geometric INRs that incorporate differential geometry tools offer an effective and promising approach to 3D surface reconstruction. Abstract: *Implicit neural representations* (INRs) have emerged as a promising framework for representing signals in low-dimensional spaces. This survey reviews the existing literature on the specialized INR problem of approximating *signed distance functions* (SDFs) for surface scenes, using either oriented point clouds or a set of posed images. We refer to neural SDFs that incorporate differential geometry tools, such as normals and curvatures, in their loss functions as *geometric* INRs. The key idea behind this 3D reconstruction approach is to include additional *regularization* terms in the loss function, ensuring that the INR satisfies certain global properties that the function should hold, such as having unit gradient in the case of SDFs. We explore key methodological components, including the definition of INR, the construction of geometric loss functions, and sampling schemes from a differential geometry perspective. Our review highlights the significant advancements enabled by geometric INRs in surface reconstruction from oriented point clouds and posed images.

### [292] [Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images](https://arxiv.org/abs/2511.07222)
*JiaKui Hu, Shanshan Zhao, Qing-Guo Chen, Xuerui Qiu, Jialun Liu, Zhao Xu, Weihua Luo, Kaifu Zhang, Yanye Lu*
Main category: cs.CV
TL;DR: Omni-View is a unified framework for 3D scene understanding and generation based on multiview images; following the idea that "generation facilitates understanding", it jointly models scene understanding, novel view synthesis, and geometry estimation.
Details
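Motivation: Existing clean-image backdoor attacks need relatively high poison rates, which cause a noticeable accuracy drop and undermine stealth; this work aims for a stealthier attack that barely affects normal model performance. Method: Proposes the GCB framework, which uses a conditional InfoGAN to mine naturally occurring, separable, and potent trigger features in images, and optimizes the trigger so that the attack succeeds with very few poisoned samples. Result: The attack succeeds across six datasets, five architectures, and four tasks (including regression and segmentation), with a clean accuracy drop below 1%, and resists most existing defenses. Conclusion: GCB significantly improves the stealth and practicality of clean-image backdoor attacks, broadens their scope, and exposes weaknesses in current defenses. Abstract: Clean-image backdoor attacks, which use only label manipulation in training datasets to compromise deep neural networks, pose a significant threat to security-critical applications. A critical flaw in existing methods is that the poison rate required for a successful attack induces a proportional, and thus noticeable, drop in Clean Accuracy (CA), undermining their stealthiness. This paper presents a new paradigm for clean-image attacks that minimizes this accuracy degradation by optimizing the trigger itself. We introduce Generative Clean-Image Backdoors (GCB), a framework that uses a conditional InfoGAN to identify naturally occurring image features that can serve as potent and stealthy triggers. By ensuring these triggers are easily separable from benign task-related features, GCB enables a victim model to learn the backdoor from an extremely small set of poisoned examples, resulting in a CA drop of less than 1%. Our experiments demonstrate GCB's remarkable versatility, successfully adapting to six datasets, five architectures, and four tasks, including the first demonstration of clean-image backdoors in regression and segmentation. GCB also exhibits resilience against most of the existing backdoor defenses.

### [293] [Mapping Reduced Accessibility to WASH Facilities in Rohingya Refugee Camps with Sub-Meter Imagery](https://arxiv.org/abs/2511.07231)
*Kyeongjin Ahn, YongHun Suh, Sungwon Han, Jeasurk Yang, Hannes Taubenböck, Meeyoung Cha*
Main category: cs.CV
TL;DR: Using high-resolution remote sensing imagery and a semi-supervised segmentation framework, this study quantifies access to water, sanitation, and hygiene (WASH) facilities in the Rohingya refugee camps of Cox's Bazar, Bangladesh. It finds that, owing to population growth and fewer facilities, the number of people per facility rose from 25 to 29.4 between 2022 and 2025, and that women and girls face greater barriers because of inadequate safety-related segregation, underscoring the need for demand-responsive resource allocation to promote equity.
Details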
Motivation: To extend multimodal understanding and generation to 3D scenes and explore how generation tasks can improve 3D scene understanding. Method: Proposes the Omni-View framework, comprising an understanding model, a texture module, and a geometry module, trained with a two-stage strategy that combines the spatiotemporal modeling of the texture module with the explicit geometric constraints of the geometry module. Result: Achieves a SOTA score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while also performing strongly on novel view synthesis and 3D scene generation. Conclusion: Jointly modeling generation and understanding improves overall 3D scene understanding and generation, supporting the principle that "generation facilitates understanding". Abstract: This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation.

### [294] [Noise & pattern: identity-anchored Tikhonov regularization for robust structural anomaly detection](https://arxiv.org/abs/2511.07233)
*Alexander Bauer, Klaus-Robert Müller*
Main category: cs.CV
TL;DR: Proposes a structural anomaly detection method based on a self-supervised autoencoder; by introducing structured perturbations and Gaussian-noise regularization, it achieves state-of-the-art detection performance on the MVTec AD benchmark.
Details
Motivation: WASH accessibility in refugee camps is a major public health concern, and traditional survey methods cannot assess facility coverage in dense, rapidly changing environments in a timely and accurate way; new technical approaches are needed for continuous monitoring. Method: Using sub-meter satellite imagery, builds a semi-supervised semantic segmentation framework to detect individual refugee shelters, combines it with geospatial analysis to quantify access to water pumps, latrines, and bathing cubicles, and performs a gender-disaggregated analysis. Result: Shelter detection reaches an F1-score of 76.4%; pressure on WASH facilities increased from 2022 to 2025, with people served per facility rising from 25 to 29.4; women and girls have lower accessibility where safety-related segregation is lacking. Conclusion: Combining remote sensing with machine learning can monitor spatiotemporal changes in infrastructure accessibility in humanitarian settings, help identify under-served groups, and provide data to support equitable resource allocation under limited budgets. Abstract: Access to Water, Sanitation, and Hygiene (WASH) services remains a major public health concern in refugee camps. This study introduces a remote sensing-driven framework to quantify WASH accessibility, specifically to water pumps, latrines, and bathing cubicles, in the Rohingya camps of Cox's Bazar, one of the world's most densely populated displacement settings. Detecting refugee shelters in such emergent camps presents substantial challenges, primarily due to their dense spatial configuration and irregular geometric patterns. Using sub-meter satellite images, we develop a semi-supervised segmentation framework that achieves an F1-score of 76.4% in detecting individual refugee shelters. Applying the framework across multi-year data reveals declining WASH accessibility, driven by rapid refugee population growth and reduced facility availability: the number of people per facility rose from 25 in 2022 to 29.4 in 2025. Gender-disaggregated analysis further shows that women and girls experience reduced accessibility in scenarios with inadequate safety-related segregation in WASH facilities. These findings suggest the importance of demand-responsive allocation strategies that can identify areas with under-served populations (such as women and girls) and ensure that limited infrastructure serves the greatest number of people in settings with fixed or shrinking budgets.
We also discuss the value of high-resolution remote sensing and machine learning to detect inequality and inform equitable resource planning in complex humanitarian environments.

### [295] [Leveraging Text-Driven Semantic Variation for Robust OOD Segmentation](https://arxiv.org/abs/2511.07238)
*Seungheon Song, Jaekoo Lee*
Main category: cs.CV
TL;DR: Proposes a text-driven OOD segmentation method in the vision-language space; by combining a vision-language model encoder with a transformer decoder, distance-based OOD prompts, and a semantic augmentation strategy, it achieves state-of-the-art anomalous object detection on several public datasets.
Details
Motivation: Because collecting representative samples of all possible anomalies is infeasible, an anomaly detection method that does not rely on anomalous training samples is needed. Method: Designs a self-supervised autoencoder that uses structured, spatially coherent artificial perturbations to mimic structural defects, and adds and preserves Gaussian noise on top of the occlusions as a Tikhonov regularizer that anchors the Jacobian of the reconstruction function toward the identity, stabilizing reconstruction. Result: Achieves I/P-AUROC of 99.9/99.4 on the MVTec AD benchmark, the current state of the art. Conclusion: The identity-anchored regularization strategy improves anomaly detection and segmentation accuracy, supporting both the theoretical framework and the method's practical value for automated industrial inspection. Abstract: Anomaly detection plays a pivotal role in automated industrial inspection, aiming to identify subtle or rare defects in otherwise uniform visual patterns. As collecting representative examples of all possible anomalies is infeasible, we tackle structural anomaly detection using a self-supervised autoencoder that learns to repair corrupted inputs. To this end, we introduce a corruption model that injects artificial disruptions into training images to mimic structural defects. While reminiscent of denoising autoencoders, our approach differs in two key aspects. First, instead of unstructured i.i.d. noise, we apply structured, spatially coherent perturbations that make the task a hybrid of segmentation and inpainting. Second, and counterintuitively, we add and preserve Gaussian noise on top of the occlusions, which acts as a Tikhonov regularizer anchoring the Jacobian of the reconstruction function toward identity. This identity-anchored regularization stabilizes reconstruction and further improves both detection and segmentation accuracy. On the MVTec AD benchmark, our method achieves state-of-the-art results (I/P-AUROC: 99.9/99.4), supporting our theoretical framework and demonstrating its practical relevance for automatic inspection.

### [296] [4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation](https://arxiv.org/abs/2511.07241)
*Mengmeng Liu, Jiuming Liu, Yunpeng Zhang, Jiangtao Li, Michael Ying Yang, Francesco Nex, Hao Cheng*
Main category: cs.CV
TL;DR: Proposes 4DSTR, a new 4D generation network that modulates generative 4D Gaussian splatting with spatial-temporal rectification, improving the spatial-temporal consistency of dynamic 4D content generation and its adaptation to rapid temporal variations.
Details
Motivation: Existing OOD segmentation methods for autonomous driving make little use of the rich linguistic knowledge in the vision-language space, even though linguistic cues can help anomaly detection in complex scenes; exploring methods that integrate language information is therefore worthwhile. Method: Combines a vision-language model encoder with a transformer decoder, designs OOD prompts at varying semantic distances, and uses an OOD semantic augmentation strategy to build diverse anomaly representations, generalizing to unknown objects through visual-textual alignment. Result: On Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly, the method achieves state-of-the-art performance in both pixel-level and object-level evaluations. Conclusion: Joint vision-language modeling improves the robustness and generalization of OOD segmentation and offers a new direction for safe and reliable decision-making in autonomous driving systems. Abstract: In autonomous driving and robotics, ensuring road safety and reliable decision-making critically depends on out-of-distribution (OOD) segmentation. While numerous methods have been proposed to detect anomalous objects on the road, leveraging the vision-language space, which provides rich linguistic knowledge, remains underexplored. We hypothesize that incorporating these linguistic cues can be especially beneficial in the complex contexts found in real-world autonomous driving scenarios. To this end, we present a novel approach that trains a Text-Driven OOD Segmentation model to learn a semantically diverse set of objects in the vision-language space. Concretely, our approach combines a vision-language model's encoder with a transformer decoder, employs Distance-Based OOD prompts located at varying semantic distances from in-distribution (ID) classes, and utilizes OOD Semantic Augmentation for OOD representations. By aligning visual and textual information, our approach effectively generalizes to unseen objects and provides robust OOD segmentation in diverse driving environments. We conduct extensive experiments on publicly available OOD segmentation datasets such as Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly, demonstrating that our approach achieves state-of-the-art performance across both pixel-level and object-level evaluations.
This result underscores the potential of vision-language-based OOD segmentation to bolster the safety and reliability of future autonomous driving systems.

### [297] [MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs](https://arxiv.org/abs/2511.07250)
*Tianhao Peng, Haochen Wang, Yuanxing Zhang, Zekun Wang, Zili Wang, Ge Zhang, Jian Yang, Shihao Li, Yanghai Wang, Xintao Wang, Houyi Li, Wei Ji, Pengfei Wan, Wenhao Huang, Zhaoxiang Zhang, Jiaheng Liu*
Main category: cs.CV
TL;DR: Introduces MVU-Eval, the first comprehensive benchmark for multi-video understanding, with 1,824 question-answer pairs over 4,959 videos, to evaluate the multi-video understanding ability of MLLMs in real-world scenarios.
Details
Motivation: Existing 4D generation methods struggle with spatial-temporal consistency and adapt poorly to rapid temporal variations because they lack effective spatial-temporal modeling. Method: Proposes the 4DSTR network, which uses temporal correlation to rectify deformable scales and rotations, and designs an adaptive spatial densification and pruning strategy that adds or deletes Gaussian points according to their motion in the previous frame. Result: Experiments show that 4DSTR achieves state-of-the-art performance in video-to-4D generation, with clear gains in reconstruction quality, spatial-temporal consistency, and adaptation to rapid motion. Conclusion: Through effective spatial-temporal modeling, 4DSTR addresses key challenges of consistency and adaptation to temporal variation in existing 4D generation methods and offers a new approach to dynamic 4D content generation. Abstract: Remarkable advances in recent 2D image and 3D shape generation have induced a significant focus on dynamic 4D content generation. However, previous 4D generation methods commonly struggle to maintain spatial-temporal consistency and adapt poorly to rapid temporal variations, due to the lack of effective spatial-temporal modeling. To address these problems, we propose a novel 4D generation network called 4DSTR, which modulates generative 4D Gaussian Splatting with spatial-temporal rectification. Specifically, temporal correlation across generated 4D sequences is designed to rectify deformable scales and rotations and guarantee temporal consistency. Furthermore, an adaptive spatial densification and pruning strategy is proposed to address significant temporal variations by dynamically adding or deleting Gaussian points with the awareness of their pre-frame movements. Extensive experiments demonstrate that our 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.

### [298] [StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression](https://arxiv.org/abs/2511.07278)
*Yilong Chen, Xiang Bai, Zhibin Wang, Chengyu Bai, Yuhan Dai, Ming Lu, Shanghang Zhang*
Main category: cs.CV
TL;DR: Proposes StreamKV, a training-free framework that improves the efficiency and accuracy of video large language models on long-video question answering, using dynamic semantic segmentation, summary vector generation, and guidance prompts for efficient KV cache retrieval and compression.
Details
Motivation: Existing benchmarks are limited to single-video understanding and cannot meet the need for multi-video understanding in real-world scenarios such as autonomous driving and sports analytics. Method: Builds the MVU-Eval benchmark covering eight core competencies, evaluated with 1,824 questions over 4,959 videos from diverse domains. Result: Experiments on multiple open-source and closed-source models reveal significant performance gaps and limitations in current MLLMs' multi-video understanding. Conclusion: MVU-Eval fills the gap in multi-video understanding evaluation and should encourage further research and development of MLLMs for real-world applications. Abstract: The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to perform understanding across multiple videos. The benchmark will be made publicly available to foster future research.

### [299] [Segmentation of Ischemic Stroke Lesions using Transfer Learning on Multi-sequence MRI](https://arxiv.org/abs/2511.07281)
*R. P. Chowdhury, T. Rahman*
Main category: cs.CV
TL;DR: Proposes a Res-Unet-based automatic segmentation framework for fast segmentation of ischemic stroke lesions across multiple MRI sequences; results are fused with a majority-voting classifier, achieving a Dice score of 80.5% on the ISLES 2015 dataset.
Details
Motivation: Existing video LLMs struggle with long real-world videos, and the compression and retrieval of KV caches in particular remain underexplored. Method: StreamKV replaces uniform partitioning with dynamic semantic segmentation, computes a summary vector for each segment to support retrieval, and designs a guidance prompt to preserve key semantic information, unifying KV cache retrieval and compression in a layer-adaptive manner. Result: Experiments on public StreamingVQA benchmarks show that StreamKV significantly outperforms existing online video LLMs in accuracy, memory efficiency, and computational latency. Conclusion: StreamKV improves video LLMs' understanding of and question answering on long videos, combining high performance with high efficiency, and has potential for practical use. Abstract: Video Large Language Models (Video-LLMs) have demonstrated significant potential in the areas of video captioning, search, and summarization. However, current Video-LLMs still face challenges with long real-world videos. Recent methods have introduced a retrieval mechanism that retrieves query-relevant KV caches for question answering, enhancing the efficiency and accuracy of long real-world videos. However, the compression and retrieval of KV caches are still not fully explored. In this paper, we propose StreamKV, a training-free framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression. Compared to previous methods that used uniform partitioning, StreamKV dynamically partitions video streams into semantic segments, which better preserves semantic information. For KV cache retrieval, StreamKV calculates a summary vector for each segment to retain segment-level information essential for retrieval. For KV cache compression, StreamKV introduces a guidance prompt designed to capture the key semantic elements within each segment, ensuring only the most informative KV caches are retained for answering questions. Moreover, StreamKV unifies KV cache retrieval and compression within a single module, performing both in a layer-adaptive manner, thereby further improving the effectiveness of streaming video question answering. Extensive experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing Online Video-LLMs, achieving superior accuracy while substantially improving both memory efficiency and computational latency.
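The segment-level retrieval described in this abstract can be illustrated with a rough sketch. It partitions a stream of frame embeddings wherever the cosine similarity between consecutive frames drops below a threshold, mean-pools each segment into a summary vector, and returns the segments whose summaries best match a query. The threshold rule, the mean pooling, and every function name here are assumptions made for illustration; the abstract does not say how StreamKV finds segment boundaries or computes its summary vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def partition_segments(frames, threshold=0.8):
    """Split frames into (start, end) ranges, end exclusive.

    Assumed rule: start a new segment when consecutive frames are
    less similar than `threshold`.
    """
    if not frames:
        return []
    segments, start = [], 0
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(frames)))
    return segments


def summary_vector(frames, segment):
    """Mean-pool the frame embeddings of one segment (an assumed summary)."""
    start, end = segment
    dim = len(frames[start])
    count = end - start
    return [sum(frames[i][d] for i in range(start, end)) / count for d in range(dim)]


def retrieve(frames, query, top_k=1, threshold=0.8):
    """Return the top_k segments whose summary vectors best match the query."""
    segments = partition_segments(frames, threshold)
    scored = [(cosine(summary_vector(frames, s), query), s) for s in segments]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [segment for _, segment in scored[:top_k]]
```

In a real system the retrieved ranges would index cached key/value tensors rather than raw frame embeddings; the sketch only shows the segment-then-summarize-then-rank flow.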
The code has been released at https://github.com/sou1p0wer/StreamKV.

### [300] [Glioma C6: A Novel Dataset for Training and Benchmarking Cell Segmentation](https://arxiv.org/abs/2511.07286)
*Roman Malashin,Svetlana Pashkevich,Daniil Ilyukhin,Arseniy Volkov,Valeria Yachnaya,Andrey Denisov,Maria Mikhalkova*
Main category: cs.CV
TL;DR: Glioma C6 is a new open dataset for instance segmentation of glioma C6 cells, comprising 75 high-resolution microscopy images and over 12,000 annotated cells, intended as a benchmark and training resource for deep learning models.
Details
Motivation: Manual segmentation of ischemic stroke lesions is time-consuming and subject to inter-observer variability, and conventional automatic methods rely on hand-crafted features that struggle to capture complex lesion shapes. Method: Uses the Res-Unet architecture, trained once with pretrained weights and once from random initialization to compare the effect of transfer learning; performs 3D lesion segmentation on MRI sequences including T1, T2, DWI, and FLAIR, and adds a majority voting classifier to fuse the segmentation results from each axis. Result: Achieves a Dice score of 80.5% and an accuracy of 74.03% on the ISLES 2015 dataset, supporting the effectiveness of the method. Conclusion: The proposed framework, combining Res-Unet with majority voting, improves the accuracy of automatic ischemic stroke lesion segmentation and has potential for clinical use. Abstract: The accurate understanding of ischemic stroke lesions is critical for efficient therapy and prognosis of stroke patients. Magnetic resonance imaging (MRI) is sensitive to acute ischemic stroke and is a common diagnostic method for stroke. However, manual lesion segmentation performed by experts is tedious, time-consuming, and prone to observer inconsistency. Automatic medical image analysis methods have been proposed to overcome this challenge. However, previous approaches have relied on hand-crafted features that may not capture the irregular and physiologically complex shapes of ischemic stroke lesions. In this study, we present a novel framework for quickly and automatically segmenting ischemic stroke lesions on various MRI sequences, including T1-weighted, T2-weighted, DWI, and FLAIR. The proposed methodology is validated on the ISLES 2015 Brain Stroke sequence dataset, where we trained our model using the Res-Unet architecture twice: first, with pre-existing weights, and then without, to explore the benefits of transfer learning. Evaluation metrics, including the Dice score and sensitivity, were computed across 3D volumes. Finally, a Majority Voting Classifier was integrated to amalgamate the outcomes from each axis, resulting in a comprehensive segmentation method. Our efforts culminated in achieving a Dice score of 80.5% and an accuracy of 74.03%, showcasing the efficacy of our segmentation approach.

### [301] [LMM-IQA: Image Quality Assessment for Low-Dose CT Imaging](https://arxiv.org/abs/2511.07298)
*Kagan Celik,Mehmet Ozan Unal,Metin Ertas,Isa Yildirim*
Main category: cs.CV
TL;DR: Proposes a large language model (LLM)-based image quality assessment system for low-dose CT that produces both numerical scores and descriptive text, with performance improved through several inference strategies.
Details
Motivation: Existing datasets lack high-quality data with clear morphological categorization for glioma cell instance segmentation, which limits the performance and generalization of deep learning models in biomedical image analysis. Method: Builds a dataset of 75 high-resolution phase-contrast microscopy images, each with soma annotations and morphological categorization provided by biologists; the dataset has two parts, one for benchmarking under controlled conditions and one for testing generalization under varying conditions, and several generalist segmentation models are evaluated on it. Result: Several generalist segmentation models show limitations on the dataset, but training on Glioma C6 significantly improves segmentation performance, supporting the dataset's validity and usefulness. Conclusion: Glioma C6 provides high-quality annotations for instance segmentation of glioma cells and should help advance cancer cell research and the development of robust, generalizable deep learning models. Abstract: We present Glioma C6, a new open dataset for instance segmentation of glioma C6 cells, designed as both a benchmark and a training resource for deep learning models. The dataset comprises 75 high-resolution phase-contrast microscopy images with over 12,000 annotated cells, providing a realistic testbed for biomedical image analysis. It includes soma annotations and morphological cell categorization provided by biologists. Additional categorization of cells, based on morphology, aims to enhance the utilization of image data for cancer cell research. Glioma C6 consists of two parts: the first is curated with controlled parameters for benchmarking, while the second supports generalization testing under varying conditions. We evaluate the performance of several generalist segmentation models, highlighting their limitations on our dataset. Our experiments demonstrate that training on Glioma C6 significantly enhances segmentation performance, reinforcing its value for developing robust and generalizable models. The dataset is publicly available for researchers.

### [302] [VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models](https://arxiv.org/abs/2511.07299)
*Ying Cheng,Yu-Ho Lin,Min-Hung Chen,Fu-En Yang,Shang-Hong Lai*
Main category: cs.CV
TL;DR: Proposes VADER, an LLM-based framework for video anomaly understanding that combines keyframe object-relation features with visual cues to strengthen semantic understanding of and causal reasoning about anomalous events.
Details
Motivation: Low-dose CT reduces radiation exposure but introduces noise, blur, and contrast loss that degrade diagnostic quality, so reliable quality assessment is needed. Method: Builds an LLM-based quality assessment system and optimizes it with several inference strategies, including zero-shot inference, metadata integration, and error feedback. Result: The system's scores correlate highly with subjective evaluations, and it also outputs interpretable textual descriptions, improving the consistency and clinical applicability of the assessment. Conclusion: The LLM-based system performs well on low-dose CT image quality assessment, offering both accuracy and interpretability, and can support clinical workflows. Abstract: Low-dose computed tomography (CT) represents a significant improvement in patient safety through lower radiation doses, but increased noise, blur, and contrast loss can diminish diagnostic quality. Therefore, consistency and robustness in image quality assessment become essential for clinical applications. In this study, we propose an LLM-based quality assessment system that generates both numerical scores and textual descriptions of degradations such as noise, blur, and contrast loss. Furthermore, various inference strategies - from the zero-shot approach to metadata integration and error feedback - are systematically examined, demonstrating the progressive contribution of each method to overall performance. The resultant assessments yield not only highly correlated scores but also interpretable output, thereby adding value to clinical workflows. The source codes of our study are available at https://github.com/itu-biai/lmms_ldct_iqa.

### [303] [Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection](https://arxiv.org/abs/2511.07301)
*Huizai Yao,Sicheng Zhao,Pengteng Li,Yi Cui,Shuo Lu,Weiyu Guo,Yunfan Lu,Yijie Xu,Hui Xiong*
Main category: cs.CV
TL;DR: Proposes a new source-free object detection (SFOD) framework that uses vision foundation models (VFMs) as an external knowledge source, improving feature alignment and pseudo-label quality through three VFM-based modules and achieving state-of-the-art performance on six benchmarks.
Details
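Motivation: Conventional video anomaly detection methods focus on detecting and localizing anomalies and lack understanding of the deeper causal relationships and object interactions behind anomalous events, which limits their interpretability and practical value. Method: VADER first scores each frame with an Anomaly Scorer and selects keyframes with the Context-AwarE Sampling (CAES) strategy; it then models dynamic object interactions with a Relation Feature Extractor and a COntrastive Relation Encoder (CORE) to produce compact relational representations; finally, the visual and relational features are fed to an LLM to generate detailed causal explanations and support question answering. Result: Experiments on multiple real-world video anomaly understanding benchmarks show that VADER performs strongly on anomaly description, explanation, and causal reasoning, markedly improving the interpretability of video anomaly analysis. Conclusion: By fusing object relations with visual cues and combining them with an LLM, VADER enables deeper semantic understanding and causal analysis of anomalous video events, advancing explainable video anomaly understanding. Abstract: Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering.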
Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.

### [304] [YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting](https://arxiv.org/abs/2511.07321)
*Botao Ye,Boqi Chen,Haofei Xu,Daniel Barath,Marc Pollefeys*
Main category: cs.CV
TL;DR: Proposes YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images; it is efficient and flexible, and supports uncalibrated and unposed inputs.
Details
Motivation: Existing SFOD methods rely on the source model's internal knowledge, which limits generalization and tends to produce biased pseudo-labels, constraining cross-domain transfer; VFMs have strong perception and generalization capabilities but remain underexplored in SFOD. Method: Designs three VFM-based modules: (1) Patch-weighted Global Feature Alignment (PGFA), which uses patch-similarity-based weighting; (2) Prototype-based Instance Feature Alignment (PIFA), which performs instance-level contrastive learning with momentum-updated VFM prototypes; and (3) Dual-source Enhanced Pseudo-label Fusion (DEPF), which fuses predictions from a detection VFM and the teacher model with an entropy-aware strategy to produce more reliable supervision. Result: Extensive experiments on six benchmark datasets show that the method reaches state-of-the-art SFOD performance, improving both transferability and discriminability. Conclusion: Introducing VFMs as an external knowledge source substantially improves SFOD performance while improving feature transferability and pseudo-label quality, offering a new direction for source-free domain adaptation. Abstract: Source-Free Object Detection (SFOD) aims to adapt a source-pretrained object detector to a target domain without access to source data. However, existing SFOD methods predominantly rely on internal knowledge from the source model, which limits their capacity to generalize across domains and often results in biased pseudo-labels, thereby hindering both transferability and discriminability. In contrast, Vision Foundation Models (VFMs), pretrained on massive and diverse data, exhibit strong perception capabilities and broad generalization, yet their potential remains largely untapped in the SFOD setting. In this paper, we propose a novel SFOD framework that leverages VFMs as external knowledge sources to jointly enhance feature alignment and label quality. Specifically, we design three VFM-based modules: (1) Patch-weighted Global Feature Alignment (PGFA) distills global features from VFMs using patch-similarity-based weighting to enhance global feature transferability; (2) Prototype-based Instance Feature Alignment (PIFA) performs instance-level contrastive learning guided by momentum-updated VFM prototypes; and (3) Dual-source Enhanced Pseudo-label Fusion (DEPF) fuses predictions from detection VFMs and teacher models via an entropy-aware strategy to yield more reliable supervision.
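To make the idea of entropy-aware fusion in module (3) concrete, here is a minimal sketch of one way two class-probability predictions for the same box could be combined so that the more confident (lower-entropy) source gets more weight. The weighting rule `exp(-entropy)` and the function names are assumptions chosen for illustration, not the formula used by DEPF, which the abstract does not give; box matching between the two detectors is also left out.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_predictions(teacher_probs, vfm_probs):
    """Fuse two class distributions, weighting each by exp(-entropy).

    Assumed rule: a peaked (low-entropy) prediction is trusted more
    than a flat (high-entropy) one. Weights are normalized so the
    fused vector is still a probability distribution.
    """
    if len(teacher_probs) != len(vfm_probs):
        raise ValueError("both predictions must cover the same classes")
    w_teacher = math.exp(-entropy(teacher_probs))
    w_vfm = math.exp(-entropy(vfm_probs))
    total = w_teacher + w_vfm
    return [
        (w_teacher * t + w_vfm * v) / total
        for t, v in zip(teacher_probs, vfm_probs)
    ]
```

With an undecided teacher such as `[0.5, 0.5]` and a confident second source such as `[0.9, 0.1]`, the fused distribution leans toward the confident source more than a plain average would.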
Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art SFOD performance, validating the effectiveness of integrating VFMs to simultaneously improve transferability and discriminability.

### [305] [Garbage Vulnerable Point Monitoring using IoT and Computer Vision](https://arxiv.org/abs/2511.07325)
*R. Kumar,A. Lall,S. Chaudhari,M. Kale,A. Vattem*
Main category: cs.CV
TL;DR: Proposes a smart municipal solid waste management approach combining IoT and computer vision, using street-level cameras and object detection algorithms to monitor illegal dumping at garbage vulnerable points in cities in real time.
Details
Motivation: Addresses fast, flexible 3D scene reconstruction from unstructured image collections, especially when the input images are uncalibrated or camera poses are not provided. Method: Designs a feedforward network that predicts local Gaussians and camera poses, decouples the learning of 3D Gaussians and camera parameters through a novel mixing training strategy, and resolves scale ambiguity with pairwise camera-distance normalization and an embedding of camera intrinsics. Result: Achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings; reconstructing a scene from 100 images (280x518 resolution) takes only 2.69 seconds. Conclusion: YoNoSplat delivers efficient, flexible, and accurate 3D reconstruction across a range of input conditions, improving robustness and practicality in real applications. Abstract: Fast and flexible 3D scene reconstruction from unstructured image collections remains a significant challenge. We present YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images. Our model is highly versatile, operating effectively with both posed and unposed, calibrated and uncalibrated inputs. YoNoSplat predicts local Gaussians and camera poses for each view, which are aggregated into a global representation using either predicted or provided poses. To overcome the inherent difficulty of jointly learning 3D Gaussians and camera parameters, we introduce a novel mixing training strategy. This approach mitigates the entanglement between the two tasks by initially using ground-truth poses to aggregate local Gaussians and gradually transitioning to a mix of predicted and ground-truth poses, which prevents both training instability and exposure bias. We further resolve the scale ambiguity problem by a novel pairwise camera-distance normalization scheme and by embedding camera intrinsics into the network. Moreover, YoNoSplat also predicts intrinsic parameters, making it feasible for uncalibrated inputs. YoNoSplat demonstrates exceptional efficiency, reconstructing a scene from 100 views (at 280x518 resolution) in just 2.69 seconds on an NVIDIA GH200 GPU. It achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. Our project page is at https://botaoye.github.io/yonosplat/.

### [306] [Inference-Time Scaling of Diffusion Models for Infrared Data Generation](https://arxiv.org/abs/2511.07362)
*Kai A. Horstmann,Maxim Clouser,Kia Khezeli*
Main category: cs.CV
TL;DR: Proposes an inference-time scaling method based on a domain-adapted CLIP verifier to improve infrared image generation quality when data is scarce. The state-of-the-art text-to-image diffusion model FLUX.1-dev is fine-tuned, and a verifier guides sampling at inference time, improving the quality of generated images and their consistency with text prompts and reducing the FID score on the KAIST dataset by 10%.
Details
Motivation: To address illegal garbage dumping in cities, improve waste management efficiency, and reduce environmental pollution. Method: Evaluates several object detection models, namely YOLOv8, YOLOv10, YOLO11m, and RT-DETR, on a dataset collected in the Sangareddy district of Telangana, India, assessing their waste detection performance. Result: YOLO11m performs best, with a detection accuracy of 92.39% and an mAP@50 of 0.91, and the system captures hourly, daily, and weekly dumping patterns. Conclusion: YOLO11m is suitable for round-the-clock monitoring of garbage vulnerable points, and the system can support smart municipal waste management. Abstract: This paper proposes a smart way to manage municipal solid waste by using the Internet of Things (IoT) and computer vision (CV) to monitor illegal waste dumping at garbage vulnerable points (GVPs) in urban areas. The system can quickly detect and monitor dumped waste using a street-level camera and object detection algorithm. Data was collected from the Sangareddy district in Telangana, India. A series of comprehensive experiments was carried out using the proposed dataset to assess the accuracy and overall performance of various object detection models. Specifically, we performed an in-depth evaluation of YOLOv8, YOLOv10, YOLO11m, and RT-DETR on our dataset. Among these models, YOLO11m achieved the highest accuracy of 92.39% in waste detection, demonstrating its effectiveness in detecting waste. Additionally, it attains an mAP@50 of 0.91, highlighting its high precision. These findings confirm that the object detection model is well-suited for monitoring and tracking waste dumping events at GVP locations. Furthermore, the system effectively captures waste disposal patterns, including hourly, daily, and weekly dumping trends, ensuring comprehensive daily and nightly monitoring.

### [307] [Real-Time LiDAR Super-Resolution via Frequency-Aware Multi-Scale Fusion](https://arxiv.org/abs/2511.07377)
*June Moh Goo,Zichao Zeng,Jan Boehm*
Main category: cs.CV
TL;DR: Proposes FLASH, a LiDAR super-resolution framework based on dual-domain processing that combines frequency-domain analysis with adaptive multi-scale fusion, achieving state-of-the-art performance on KITTI while supporting real-time deployment.
Details
Motivation: High-quality annotated infrared data is scarce and annotation requires specialized expertise, which hinders downstream vision models; existing synthetic infrared image methods are limited by insufficient data, making foundation-level generative models hard to train. Method: Uses parameter-efficient fine-tuning to adapt the FLUX.1-dev diffusion model to the infrared domain on a small set of infrared images, and trains a CLIP-based verifier that guides diffusion sampling at inference time to improve image quality and text alignment. Result: On the KAIST Multispectral Pedestrian Detection dataset, the FID score drops by 10% relative to unguided baseline samples, with consistent improvements in generation quality. Conclusion: Guiding inference with a domain-adapted verifier is an effective and promising way to bridge the domain gap for infrared generation in low-data settings. Abstract: Infrared imagery enables temperature-based scene understanding using passive sensors, particularly under conditions of low visibility where traditional RGB imaging fails. Yet, developing downstream vision models for infrared applications is hindered by the scarcity of high-quality annotated data, due to the specialized expertise required for infrared annotation. While synthetic infrared image generation has the potential to accelerate model development by providing large-scale, diverse training data, training foundation-level generative diffusion models in the infrared domain has remained elusive due to limited datasets. In light of such data constraints, we explore an inference-time scaling approach using a domain-adapted CLIP-based verifier for enhanced infrared image generation quality. We adapt FLUX.1-dev, a state-of-the-art text-to-image diffusion model, to the infrared domain by finetuning it on a small sample of infrared images using parameter-efficient techniques. The trained verifier is then employed during inference to guide the diffusion sampling process toward higher quality infrared generations that better align with input text prompts. Empirically, we find that our approach leads to consistent improvements in generation quality, reducing FID scores on the KAIST Multispectral Pedestrian Detection Benchmark dataset by 10% compared to unguided baseline samples.
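The simplest form of verifier-guided inference-time scaling is best-of-N selection: draw several candidates, score each with the verifier, and keep the highest-scoring one. The sketch below shows that pattern. It is an assumption that best-of-N resembles what the authors do, since the abstract only says the verifier guides sampling and does not name a search procedure. `generate` and `score` are hypothetical callables standing in for the fine-tuned diffusion model and the CLIP-based verifier.

```python
import random


def best_of_n(generate, score, prompt, n=4, seed=0):
    """Return (best_sample, best_score) among n generated candidates.

    generate(prompt, sample_seed) -> sample   (hypothetical generator)
    score(sample, prompt) -> float            (hypothetical verifier, higher is better)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)  # reproducible per-candidate seeds
    best_sample, best_score = None, float("-inf")
    for _ in range(n):
        sample_seed = rng.randrange(2**31)
        sample = generate(prompt, sample_seed)
        candidate_score = score(sample, prompt)
        if candidate_score > best_score:
            best_sample, best_score = sample, candidate_score
    return best_sample, best_score
```

Compute grows linearly with `n`, which is the trade-off inference-time scaling makes: more sampling at generation time in exchange for better alignment, without retraining the generator.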
Our results suggest that inference-time guidance offers a promising direction for bridging the domain gap in low-data infrared settings.

### [308] [StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation](https://arxiv.org/abs/2511.07399)
*Tianrui Feng,Zhi Li,Shuo Yang,Haocheng Xi,Muyang Li,Xiuyu Li,Lvmin Zhang,Keting Yang,Kelly Peng,Song Han,Maneesh Agrawala,Kurt Keutzer,Akio Kodaira,Chenfeng Xu*
Main category: cs.CV
TL;DR: Presents StreamDiffusionV2, a training-free real-time live-streaming generation system for video diffusion models. With SLO-aware scheduling, a rolling KV cache, and scalable pipeline orchestration, it achieves low-latency, high-frame-rate generative live streaming on multiple GPUs, markedly improving temporal consistency and system scalability.
Details
Motivation: Existing transformer-based methods such as TULIP are limited to spatial-domain processing with restricted receptive fields, and struggle to capture the periodic scanning patterns and global context in LiDAR point clouds. Method: FLASH introduces Frequency-Aware Window Attention, which performs frequency-domain analysis via FFT, and Adaptive Multi-Scale Fusion, which dynamically aggregates multi-scale features and uses CBAM attention for position-specific feature selection. Result: On KITTI, FLASH reaches state-of-the-art results on all metrics, outperforming uncertainty-enhanced baselines that need multiple forward passes while keeping single-pass efficiency. Conclusion: FLASH handles uncertainty through architectural design rather than computationally expensive stochastic inference; its dual-domain processing improves super-resolution of low-resolution LiDAR and suits real-time systems such as autonomous driving. Abstract: LiDAR super-resolution addresses the challenge of achieving high-quality 3D perception from cost-effective, low-resolution sensors. While recent transformer-based approaches like TULIP show promise, they remain limited to spatial-domain processing with restricted receptive fields. We introduce FLASH (Frequency-aware LiDAR Adaptive Super-resolution with Hierarchical fusion), a novel framework that overcomes these limitations through dual-domain processing. FLASH integrates two key innovations: (i) Frequency-Aware Window Attention that combines local spatial attention with global frequency-domain analysis via FFT, capturing both fine-grained geometry and periodic scanning patterns at log-linear complexity. (ii) Adaptive Multi-Scale Fusion that replaces conventional skip connections with learned position-specific feature aggregation, enhanced by CBAM attention for dynamic feature selection. Extensive experiments on KITTI demonstrate that FLASH achieves state-of-the-art performance across all evaluation metrics, surpassing even uncertainty-enhanced baselines that require multiple forward passes. Notably, FLASH outperforms TULIP with Monte Carlo Dropout while maintaining single-pass efficiency, which enables real-time deployment.
The consistent superiority across all distance ranges validates that our dual-domain approach effectively handles uncertainty through architectural design rather than computationally expensive stochastic inference, making it practical for autonomous systems.

### [309] [DIMO: Diverse 3D Motion Generation for Arbitrary Objects](https://arxiv.org/abs/2511.07409)
*Linzhan Mou,Jiahui Lei,Chen Wang,Lingjie Liu,Kostas Daniilidis*
Main category: cs.CV
TL;DR: Proposes DIMO, a generative method that produces diverse 3D motions for arbitrary objects from a single image. It uses pretrained video models to extract common motion patterns and embeds them in a shared low-dimensional latent space, enabling fast, diverse 3D motion generation and several applications.
Details
Motivation: Existing image diffusion models suffer from temporal inconsistency in live streaming, while offline video generation systems cannot meet the strict service-level objectives (SLOs) of live streaming on time-to-first-frame and per-frame deadlines, and scalable multi-GPU real-time inference remains unsolved. Method: StreamDiffusionV2 is training-free; it combines an SLO-aware batching scheduler and a block scheduler, a sink-token-guided rolling KV cache, and a motion-aware noise controller, and adds a scalable pipeline orchestration that parallelizes across denoising steps and network layers, supporting efficient inference on heterogeneous GPUs. Result: On four H100 GPUs, a 14B-parameter model reaches 58.28 FPS and a 1.3B-parameter model reaches 64.52 FPS, with time-to-first-frame under 0.5 seconds; it supports flexible 1-4 denoising steps and meets real-time SLOs without TensorRT or quantization. Conclusion: StreamDiffusionV2 delivers a high-quality, low-latency, scalable generative live video streaming system, making real-time use of large diffusion models feasible from individual creators to enterprise-scale platforms. Abstract: Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live streaming products but have hit limits on temporal consistency due to the foundation of image-based designs. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. Besides, scalable multi-GPU serving for real-time streams remains largely unresolved so far. To address this, we present StreamDiffusionV2, a training-free pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token-guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1-4), enabling both ultra-low-latency and higher-quality modes.
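A rolling KV cache with sink tokens can be pictured as a small pinned prefix plus a sliding window over the most recent entries. The sketch below shows that bookkeeping only. The eviction policy (keep the first `num_sink` entries forever, keep the last `window` entries, drop everything in between) follows the general attention-sink streaming idea and is an assumption here; the abstract names the component but does not describe its policy, and the class and parameter names are hypothetical.

```python
from collections import deque


class RollingKVCache:
    """Keeps `num_sink` pinned entries plus a sliding window of recent ones."""

    def __init__(self, num_sink, window):
        if num_sink < 0 or window < 1:
            raise ValueError("num_sink must be >= 0 and window >= 1")
        self.num_sink = num_sink
        self.sink = []                      # never evicted
        self.recent = deque(maxlen=window)  # oldest entry dropped automatically

    def append(self, key, value):
        """Add one (key, value) pair, filling the sink slots first."""
        if len(self.sink) < self.num_sink:
            self.sink.append((key, value))
        else:
            self.recent.append((key, value))

    def entries(self):
        """Entries attention would see: sink entries, then the recent window."""
        return self.sink + list(self.recent)

    def __len__(self):
        return len(self.sink) + len(self.recent)
```

Because the cache size is bounded by `num_sink + window`, memory and per-step attention cost stay constant no matter how long the stream runs, which is what a per-frame deadline needs.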
Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs, making state-of-the-art generative live streaming practical and accessible, from individual creators to enterprise-scale platforms.

### [310] [TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research](https://arxiv.org/abs/2511.07412)
*Han Zhang,Yiqing Shen,Roger D. Soberanis-Mukul,Ankita Ghosh,Hao Ding,Lalithkumar Seenivasan,Jose L. Porras,Zhekai Mao,Chenjia Li,Wenjie Xiao,Lonny Yarmus,Angela Christine Argento,Masaru Ishii,Mathias Unberath*
Main category: cs.CV
TL;DR: TwinOR is a framework for building high-fidelity, dynamic digital twins of operating rooms to support safe, scalable embodied AI research. It fuses static geometry with dynamic behavior to achieve centimeter-level reconstruction accuracy and generates realistic sensor data, and its effectiveness is validated on perception and localization tasks.
Details
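Motivation: Generating realistic and diverse 3D object motion from a single image is challenging, and existing methods are limited in motion diversity, generality, and efficiency, so a new method that efficiently models diverse 3D motion is needed. Method: Generates multiple videos of the same object with different motions using pretrained video models, embeds each motion into a latent vector, and trains a shared motion decoder to learn a compact motion representation made of neural key point trajectories; these key points drive canonical 3D Gaussians, which are fused to model geometry and appearance. Result: Enables fast sampling of diverse 3D motions from a single image in a single forward pass, and demonstrates applications such as 3D motion interpolation and language-guided motion generation, with good motion diversity and visual quality. Conclusion: By building a shared low-dimensional motion latent space, DIMO combines video model priors with 3D representations, providing an efficient and flexible solution for generating diverse 3D motion from a single image. Abstract: We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. During inference time with learned latent space, we can instantly sample diverse 3D motions in a single-forward pass and support several interesting applications including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.
Details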
Motivation: Safety restrictions and operational constraints in operating rooms prevent embodied AI from learning and interacting freely in real environments, so safe, controllable simulation environments are needed. Existing methods cannot dynamically model the spatial, visual, and behavioral complexity of operating rooms. Method: Proposes the TwinOR framework, which reconstructs static geometry from pre-scan videos, continuously models human and equipment motion through multi-view perception, fuses the static and dynamic components into an immersive 3D environment that supports controllable simulation and embodied exploration, and generates stereo and monocular sensor streams for training and evaluation. Result: TwinOR reconstructs complete operating room geometry with centimeter-level accuracy while preserving dynamic interactions across surgical workflows; on its synthesized data, models such as FoundationStereo and ORB-SLAM3 perform close to their results on real datasets in geometry understanding and visual localization. Conclusion: TwinOR establishes a real-to-sim pipeline and provides dynamic digital twin environments with sensor-level realism, supporting safe and efficient development and benchmarking of embodied AI systems and advancing their deployment from simulation to the real world. Abstract: Developing embodied AI for intelligent surgical systems requires safe, controllable environments for continual learning and evaluation. However, safety regulations and operational constraints in operating rooms (ORs) limit embodied agents from freely perceiving and interacting in realistic settings. Digital twins provide high-fidelity, risk-free environments for exploration and training. How we may create photorealistic and dynamic digital representations of ORs that capture relevant spatial, visual, and behavioral complexity remains unclear. We introduce TwinOR, a framework for constructing photorealistic, dynamic digital twins of ORs for embodied AI research. The system reconstructs static geometry from pre-scan videos and continuously models human and equipment motion through multi-view perception of OR activities. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter-level accuracy while preserving dynamic interaction across surgical workflows, enabling realistic renderings and a virtual playground for embodied AI systems. In our experiments, TwinOR simulates stereo and monocular sensor streams for geometry understanding and visual localization tasks. Models such as FoundationStereo and ORB-SLAM3 on TwinOR-synthesized data achieve performance within their reported accuracy on real indoor datasets, demonstrating that TwinOR provides sensor-level realism sufficient for perception and localization challenges.
By establishing a real-to-sim pipeline for constructing dynamic, photorealistic digital twins of OR environments, TwinOR enables the safe, scalable, and data-efficient development and benchmarking of embodied AI, ultimately accelerating the deployment of embodied AI from sim-to-real.