

cs.CL

[1] Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models

Singon Kim

Main category: cs.CL

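TL;DR: Proposes ACoRN, a noise-robust abstractive compression method that uses fine-grained document categorization and two-stage training to improve the robustness and performance of compression models in retrieval-augmented generation.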

Details Motivation: Existing abstractive compression models tend to lose key answer information when retrieved documents with high relevance scores contain irrelevant or incorrect content, and they perform even worse on long texts, so their robustness to retrieval noise needs to improve. Method: First, offline data augmentation strengthens the model's robustness against two types of retrieval noise; second, finetuning makes the compressor focus on the key information that supports the correct answer and mitigates positional bias. Result: ACoRN built on T5-large improves both EM and F1, better preserves the answer string that serves as direct evidence, and performs well on datasets with many distracting documents. Conclusion: ACoRN effectively improves abstractive compression models under noisy conditions and strengthens their applicability to real-world scenarios. Abstract: Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language-model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.
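
The EM and F1 scores and the "answer string preserved" check mentioned above can be illustrated with a short sketch. It assumes the common SQuAD-style normalization (lowercasing, dropping punctuation and English articles); the abstract does not state which normalization the authors use, and the function names here are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def contains_answer(summary: str, gold: str) -> bool:
    """Assumed check: does the compressed summary still contain the answer string?"""
    padded_summary = " " + " ".join(normalize(summary)) + " "
    padded_gold = " " + " ".join(normalize(gold)) + " "
    return padded_gold in padded_summary
```

For instance, `token_f1("Paris, France", "Paris")` has precision 1/2 and recall 1, giving an F1 of 2/3, while `contains_answer("The capital is Paris, in France.", "Paris")` is true.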

[2] Enhancing Reliability across Short and Long-Form QA via Reinforcement Learning

Yudong Wang,Zhe Yang,Wenhan Ma,Zhifang Sui,Liang Zhao

Main category: cs.CL

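TL;DR: Proposes a new framework targeting the hallucination problem of RL-trained large language models. By improving the training data and reward mechanisms, it reduces both intrinsic and extrinsic hallucinations in short- and long-form question answering and encourages the model to refuse unanswerable questions, substantially improving reliability and performance.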

Details Motivation: Reinforcement learning has improved the reasoning ability of large language models, but it has also aggravated their tendency to hallucinate, creating a trade-off between capability and reliability. This work aims to resolve that tension. Method: Builds a training set of open-ended questions derived from TriviaQA to address extrinsic hallucinations; uses long-form texts from FineWeb with a fact-grounding reward to address intrinsic hallucinations; and designs a reward that encourages the model to refuse unanswerable questions. Result: Experiments show that the method substantially reduces both types of hallucination across multiple benchmarks and improves factual accuracy and overall performance. Conclusion: The study offers a practical framework that balances advanced reasoning against factual trustworthiness in large language models, supporting the development of more capable and reliable models. Abstract: While reinforcement learning has unlocked unprecedented complex reasoning in large language models, it has also amplified their propensity for hallucination, creating a critical trade-off between capability and reliability. This work confronts this challenge by introducing a targeted RL framework designed to mitigate both intrinsic and extrinsic hallucinations across short and long-form question answering. We address extrinsic hallucinations (flawed internal knowledge) by creating a novel training set from open-ended conversions of TriviaQA. Concurrently, we tackle intrinsic hallucinations (unfaithfulness to context) by leveraging long-form texts from FineWeb in a fact-grounding reward scheme. To further bolster reliability, our framework explicitly rewards the model for refusing to answer unanswerable questions, thereby cultivating crucial cautiousness. Extensive experiments demonstrate that our methodology yields significant performance gains across a diverse suite of benchmarks, substantially reducing both hallucination types. Ultimately, this research contributes a practical framework for resolving the critical tension between advanced reasoning and factual trustworthiness, paving the way for more capable and reliable large language models.

[3] The Linguistic Architecture of Reflective Thought: Evaluation of a Large Language Model as a Tool to Isolate the Formal Structure of Mentalization

Stefano Epifani,Giuliano Castigliego,Laura Kecskemeti,Giuliano Razzicchia,Elisabeth Seiwald-Sonderegger

Main category: cs.CL

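TL;DR: This study examines how a large language model (LLM) performs at generating psychoanalytically oriented "mentalization" structures, finding moderate-to-high structural consistency on most mentalization dimensions but a tendency toward neutral affective expression.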

Details Motivation: To investigate whether a large language model can reproduce the linguistic structure of human mentalization, in particular the integration of cognitive, affective, and intersubjective components within the Mentalization-Based Treatment (MBT) framework. Method: Fifty human-LLM dialogues were generated; five MBT-trained psychiatrists, working blinded, evaluated them along the four MBT axes using Likert-scale scores, and inter-rater reliability was computed with ICC(3,1). Result: Mean scores ranged from 3.63 to 3.98 with moderate standard deviations, indicating high structural coherence in the generated content; ICC values of 0.60-0.84 show good inter-rater agreement; the model was stable on the Implicit-Explicit and Self-Other dimensions but limited in integrating internal states with external contexts, and was affectively neutral overall. Conclusion: The LLM can generate mentalization texts that conform to MBT linguistic structure and are clinically interpretable, but they lack affective depth, which suggests that the limits of its affective expression call for caution in psychotherapeutic applications. Abstract: Background: Mentalization integrates cognitive, affective, and intersubjective components. Large Language Models (LLMs) display an increasing ability to generate reflective texts, raising questions regarding the relationship between linguistic form and mental representation. This study assesses the extent to which a single LLM can reproduce the linguistic structure of mentalization according to the parameters of Mentalization-Based Treatment (MBT). Methods: Fifty dialogues were generated between human participants and an LLM configured in standard mode. Five psychiatrists trained in MBT, working under blinded conditions, evaluated the mentalization profiles produced by the model along the four MBT axes, assigning Likert-scale scores for evaluative coherence, argumentative coherence, and global quality. Inter-rater agreement was estimated using ICC(3,1). Results: Mean scores (3.63-3.98) and moderate standard deviations indicate a high level of structural coherence in the generated profiles. ICC values (0.60-0.84) show substantial-to-high agreement among raters. The model proved more stable in the Implicit-Explicit and Self-Other dimensions, while presenting limitations in the integration of internal states and external contexts. The profiles were coherent and clinically interpretable yet characterized by affective neutrality.
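
For readers unfamiliar with the agreement statistic named above, the following is a minimal sketch of ICC(3,1) (two-way mixed effects, consistency, single rater), computed from the standard ANOVA mean squares as (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error). The function name and the example ratings are illustrative and are not taken from the study.

```python
def icc_3_1(ratings: list[list[float]]) -> float:
    """ICC(3,1) for an n-subjects x k-raters table with no missing values."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects (rows), raters (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Two raters who differ only by a constant offset are perfectly consistent.
offset_only = [[1, 2], [2, 3], [3, 4]]
print(icc_3_1(offset_only))
```

Because ICC(3,1) measures consistency and not absolute agreement, a rater who is systematically one point stricter than another does not lower the coefficient.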

[4] Luxical: High-Speed Lexical-Dense Text Embeddings

DatologyAI: Luke Merrick,Alex Fang,Aldo Carranza,Alvin Deng,Amro Abbas,Brett Larsen,Cody Blakeney,Darren Teh,David Schwab,Fan Pan,Haakon Mongstad,Haoli Yin,Jack Urbanek,Jason Lee,Jason Telanoff,Josh Wills,Kaleigh Mentzer,Paul Burstein,Parth Doshi,Paul Burnstein,Pratyush Maini,Ricardo Monti,Rishabh Adiga,Scott Loftin,Siddharth Joshi,Spandan Das,Tony Jiang,Vineeth Dorma,Zhengping Wang,Bogdan Gaza,Ari Morcos,Matthew Leavitt

Main category: cs.CL

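TL;DR: Presents Luxical, a library for high-speed "lexical-dense" text embeddings that aims to combine the strengths of fast but limited lexical classifiers and flexible but computationally expensive transformer embedding models, enabling efficient web-scale text organization.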

Details Motivation: Existing text-organization tools trade off speed against flexibility: traditional lexical classifiers are fast but limited in what they output, while transformer embedding models are flexible but computationally expensive. A method that is both efficient and expressive is needed. Method: Luxical combines sparse TF-IDF features, a small ReLU network, and a knowledge-distillation training regimen to approximate large transformer embedding models at much lower computational cost. Result: On document retrieval and language-model data curation, Luxical achieves 3x to 100x speedups over neural baselines of varying size, approaches FastText inference speed on the data curation task, and maintains quality comparable to the neural models. Conclusion: Luxical shows a favorable compute/quality trade-off for large-scale text organization, is a viable option for efficient text embedding, and has been released as open-source software. Abstract: Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF-IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
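
The three ingredients named in the abstract (sparse TF-IDF features, a small ReLU network, a distillation target) can be pictured with a toy forward pass. This is only a sketch under stated assumptions: it does not use the Luxical library or its API, the class and function names are hypothetical, the weights are random and untrained, and the cosine-based loss is a stand-in for whatever training objective the technical report actually defines.

```python
import math
import random
from collections import Counter


def tfidf(docs: list[list[str]]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Sparse TF-IDF vectors (term -> weight) for non-empty tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}  # smoothed IDF
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors, idf


class TinyReluEncoder:
    """One hidden ReLU layer mapping sparse lexical features to a dense unit vector."""

    def __init__(self, vocab: list[str], hidden: int, out_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.index = {t: i for i, t in enumerate(vocab)}
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in vocab]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(out_dim)] for _ in range(hidden)]

    def embed(self, sparse: dict[str, float]) -> list[float]:
        hidden = [0.0] * len(self.w2)
        for term, weight in sparse.items():  # only nonzero features are touched
            i = self.index.get(term)
            if i is None:  # skip out-of-vocabulary terms
                continue
            for j, w in enumerate(self.w1[i]):
                hidden[j] += weight * w
        hidden = [max(0.0, h) for h in hidden]  # ReLU
        out = [sum(h * self.w2[j][d] for j, h in enumerate(hidden))
               for d in range(len(self.w2[0]))]
        norm = math.sqrt(sum(v * v for v in out)) or 1.0
        return [v / norm for v in out]


def distillation_loss(student: list[float], teacher: list[float]) -> float:
    """Assumed objective: 1 - cosine similarity to the teacher embedding."""
    dot = sum(s * t for s, t in zip(student, teacher))
    ns = math.sqrt(sum(s * s for s in student))
    nt = math.sqrt(sum(t * t for t in teacher))
    return 1.0 if ns == 0 or nt == 0 else 1.0 - dot / (ns * nt)
```

The speed argument is visible in `embed`: the first layer only visits the rows for terms that occur in the document, so the cost scales with document length and hidden size, not with vocabulary size.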

[5] Knowledge-Guided Large Language Model for Automatic Pediatric Dental Record Understanding and Safe Antibiotic Recommendation

Zihan Han,Junyan Ge,Caifeng Li

Main category: cs.CL

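TL;DR: Proposes a Knowledge-Guided Large Language Model (KG-LLM) that combines a knowledge graph, retrieval-augmented generation, and multi-stage safety validation for pediatric dental antibiotic recommendation, substantially improving record understanding, medication accuracy, and safety.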

Details Motivation: Traditional rule-based clinical decision support systems struggle with unstructured dental text, incomplete radiographic descriptions, and complex safety constraints, which makes interpreting pediatric dental records and prescribing antibiotics challenging. Method: Builds a KG-LLM framework that integrates a pediatric dental knowledge graph, retrieval-augmented generation (RAG), and a multi-stage safety validation pipeline. An NER/RE module first extracts structured information from clinical records; guidelines, safety rules, and historical cases are then retrieved from the knowledge graph to support the LLM in diagnostic summarization and medication prediction; finally, deterministic rules and a learned classifier jointly validate safety. Result: On 32,000 pediatric dental visit records, compared with a fine-tuned Llama-2 clinical model, KG-LLM raises record-understanding F1 from 0.867 to 0.914 and drug-dose-duration accuracy from 0.716 to 0.782, and reduces unsafe antibiotic suggestions by 50%; ablations show that the knowledge graph, RAG, and safety modules each contribute substantially. Conclusion: By integrating external knowledge with multi-layer safety mechanisms, KG-LLM substantially improves the accuracy, safety, and interpretability of pediatric dental clinical decision support, offering an effective paradigm for the reliable use of LLMs in sensitive medical settings. Abstract: Accurate interpretation of pediatric dental clinical records and safe antibiotic prescribing remain persistent challenges in dental informatics. Traditional rule-based clinical decision support systems struggle with unstructured dental narratives, incomplete radiographic descriptions, and complex safety constraints. To address these limitations, this study proposes a Knowledge-Guided Large Language Model (KG-LLM) that integrates a pediatric dental knowledge graph, retrieval-augmented generation (RAG), and a multi-stage safety validation pipeline for evidence-grounded antibiotic recommendation. The framework first employs a clinical NER/RE module to extract structured entities and relations from dental notes and radiology reports. Relevant guidelines, drug-safety rules, and analogous historical cases are subsequently retrieved from the knowledge graph and supplied to the LLM for diagnostic summarization and dose-drug-duration prediction. Safety assurance is achieved through a dual-layer validation mechanism combining deterministic rule checking with a learned classifier for detecting allergies, contraindications, and dosing errors. Experiments on 32,000 de-identified pediatric dental visit records demonstrate the effectiveness of the proposed approach. Compared with a domain-adapted Llama-2 clinical baseline, KG-LLM improves record-understanding performance (F1: 0.914 vs. 0.867), drug-dose-duration accuracy (Top-1: 0.782 vs. 0.716), and reduces unsafe antibiotic suggestions by 50%.
Additional evaluation across summary quality, recommendation accuracy, and global safety scores further confirms the robustness of the system. Ablation analyses indicate that the knowledge graph, RAG, and safety modules each contribute substantially to clinical reliability and interpretability.

[6] Detecting Hallucinations in Graph Retrieval-Augmented Generation via Attention Patterns and Semantic Alignment

Shanghao Li,Jinda Han,Yibo Wang,Yuanjie Zhu,Zihe Song,Langzhou He,Kenan Kamel A Alghythee,Philip S. Yu

Main category: cs.CL

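TL;DR: Proposes lightweight, graph-structure-based interpretability metrics for analyzing hallucinations when large language models use external knowledge graphs, and designs a corresponding detector, GGA, that improves the reliability of GraphRAG systems.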

Details Motivation: When large language models use linearized subgraphs to augment generation, they struggle to understand the relations and topology in them, so the generated content is inconsistent with the retrieved knowledge and hallucinations arise. Interpretability metrics are therefore needed to analyze how the model attends to and retains structured knowledge. Method: Proposes two lightweight interpretability metrics: Path Reliance Degree (PRD), which measures over-reliance on shortest-path triples, and Semantic Alignment Score (SAS), which assesses how well the model's internal representations align with the retrieved knowledge; builds on them a post-hoc hallucination detector, GGA. Result: On a knowledge-based QA task, PRD and SAS effectively identify failure patterns caused by over-reliance on salient paths and weak semantic grounding; the GGA detector outperforms strong baselines on AUC and F1. Conclusion: By linking hallucination analysis to structural limitations of the model through mechanistic interpretability, the work reveals the problems large language models have with graph-structured knowledge and offers guidance for building more reliable GraphRAG systems. Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) enhances Large Language Models (LLMs) by incorporating external knowledge from linearized subgraphs retrieved from knowledge graphs. However, LLMs struggle to interpret the relational and topological information in these inputs, resulting in hallucinations that are inconsistent with the retrieved knowledge. To analyze how LLMs attend to and retain structured knowledge during generation, we propose two lightweight interpretability metrics: Path Reliance Degree (PRD), which measures over-reliance on shortest-path triples, and Semantic Alignment Score (SAS), which assesses how well the model's internal representations align with the retrieved knowledge. Through empirical analysis on a knowledge-based QA task, we identify failure patterns associated with over-reliance on salient paths and weak semantic grounding, as indicated by high PRD and low SAS scores. We further develop a lightweight post-hoc hallucination detector, Graph Grounding and Alignment (GGA), which outperforms strong semantic and confidence-based baselines across AUC and F1. By grounding hallucination analysis in mechanistic interpretability, our work offers insights into how structural limitations in LLMs contribute to hallucinations, informing the design of more reliable GraphRAG systems in the future.

[7] MindShift: Analyzing Language Models' Reactions to Psychological Prompts

Anton Vasiliuk,Irina Abdullaeva,Polina Druzhinina,Anton Razzhigaev,Andrey Kuznetsov

Main category: cs.CL

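TL;DR: This study proposes MindShift, a benchmark for evaluating the psychological adaptability of large language models (LLMs), in particular how well they simulate different personality traits when prompted. Using a measurement approach adapted from the MMPI, it builds a diverse set of persona prompts and finds that LLMs' perception of role prompts improves with advances in training data and alignment techniques, and that models differ significantly in personality simulation.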

Details Motivation: To explore whether large language models can accurately reflect user-specified personality traits and attitudes, and to evaluate their psychological adaptability with established psychometric instruments. Method: Designs personality-oriented prompts based on the MMPI, builds personas of varying trait intensity, uses these prompts to test the responses of multiple LLMs, and proposes the MindShift benchmark to evaluate personality simulation systematically. Result: LLMs show consistent improvement in role perception, and responses to psychometric assessments differ significantly across model families, indicating that their ability to emulate human personality traits varies. Conclusion: Large language models can adjust their behavior to express specific personality traits when prompted, but this ability varies by model type; better training and alignment could further strengthen their psychological adaptability. Abstract: Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in psychological literature, namely the Minnesota Multiphasic Personality Inventory (MMPI), and examined LLMs' behavior to identify traits. To assess the sensitivity of LLMs' prompts and psychological biases, we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs' psychological adaptability. The results highlight a consistent improvement in LLMs' role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.

[8] Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment

Zixuan Liu,Siavash H. Khajavi,Guangkai Jiang,Xinru Liu

Main category: cs.CL

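TL;DR: Proposes a new framework for aligning large language models that identifies and mitigates misalignment by detecting conflicts between the proxy reward model and the policy model. It introduces two metrics, PACS and Kendall-Tau distance, and designs the SHF-CAS algorithm to bring in human feedback on high-conflict questions, improving alignment.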

Details Motivation: Proxy reward models often fail to reflect human intent accurately because of annotation noise, bias, or limited coverage, which drives model optimization away from true human values; a mechanism is needed to identify and correct such misalignment. Method: Treats fine-tuning as knowledge integration, proposes the Proxy-Policy Alignment Conflict Score (PACS) and Kendall-Tau Distance to detect local and global conflicts between proxy and policy, and designs the SHF-CAS algorithm to request selective human feedback on high-conflict samples. Result: Experiments on two alignment tasks show that the method improves overall alignment performance and remains robust even with a biased proxy reward. Conclusion: Identifying proxy-policy conflicts effectively exposes weak spots in alignment and offers an interpretable and efficient route to targeted alignment of large models. Abstract: Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect intended supervision, a condition often violated due to annotation noise, bias, or limited coverage. This misalignment can lead to undesirable behaviors, where models optimize for flawed signals rather than true human values. In this paper, we investigate a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration. We focus on detecting instances of proxy-policy conflicts, cases where the base model strongly disagrees with the proxy. We argue that such conflicts often signify areas of shared ignorance, where neither the policy nor the reward model possesses sufficient knowledge, making them especially susceptible to misalignment. To this end, we propose two complementary metrics for identifying these conflicts: a localized Proxy-Policy Alignment Conflict Score (PACS) and a global Kendall-Tau Distance measure. Building on this insight, we design an algorithm named Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) that targets high-conflict QA pairs for additional feedback, refining both the reward model and policy efficiently. Experiments on two alignment tasks demonstrate that our approach enhances general alignment performance, even when trained with a biased proxy reward.
Our work provides a new lens for interpreting alignment failures and offers a principled pathway for targeted refinement in LLM training.
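
The global Kendall-Tau distance mentioned above is a standard rank-disagreement measure, sketched below as the fraction of item pairs that two scorers order in opposite directions. How the paper derives the two score lists from the proxy reward model and the policy is not specified in the abstract; treating them as per-response scores, and the names used here, are assumptions for illustration.

```python
from itertools import combinations


def kendall_tau_distance(scores_a: list[float], scores_b: list[float]) -> float:
    """Normalized Kendall-Tau distance: share of discordant pairs, in [0, 1].

    Pairs tied under either scorer are counted as not discordant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same items")
    pairs = list(combinations(range(len(scores_a)), 2))
    if not pairs:
        return 0.0
    discordant = sum(
        1
        for i, j in pairs
        if (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j]) < 0
    )
    return discordant / len(pairs)


# Hypothetical scores for three candidate responses to one question.
proxy_reward = [0.9, 0.5, 0.1]
policy_score = [0.8, 0.2, 0.4]
print(kendall_tau_distance(proxy_reward, policy_score))  # one of three pairs disagrees
```

A distance of 0 means the two rankings agree on every pair and 1 means they are exact reverses, so higher values mark stronger proxy-policy conflict under this reading.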

[9] CORE: A Conceptual Reasoning Layer for Large Language Models

Vishwas Hegde,Vindhya Shigehalli

Main category: cs.CL

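TL;DR: Proposes CORE, a concept-first interaction layer that improves multi-turn dialogue stability and reduces prompt accumulation through a persistent Local Concept and universal cognitive operators, without modifying model weights.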

Details Motivation: In multi-turn interaction, large language models depend on an ever-growing token history, which causes intent drift, inconsistent reasoning, and prompt bloat; a better mechanism for persisting state is needed. Method: Designs the CORE framework, which combines universal cognitive operators with a persistent Local Concept (a semantic state); each turn receives only the current instruction, the operator, and the concept state, avoiding replay of the full history. Result: A prototype simulation shows about a 42% reduction in cumulative prompt tokens, but this figure reflects experimental conditions and does not represent real-world performance. Conclusion: CORE provides a model-agnostic mechanism that separates conceptual reasoning from language generation, pointing to a new direction for stable, scalable multi-turn systems. Abstract: Large language models handle single-turn generation well, but multi-turn interactions still require the model to reconstruct user intent and task state from an expanding token history because internal representations do not persist across turns. This token-first paradigm leads to drift, inconsistent reasoning modes, and growing prompts as conversations deepen. We propose CORE, a concept-first interaction layer that improves multi-turn stability without modifying model weights. CORE combines a small library of universal cognitive operators with a persistent Local Concept - a compact semantic state capturing the task, constraints, preferences, and intermediate results. Each model call receives only this concept state, the user's latest instruction, and the selected operator, eliminating the need to replay full history. A preliminary prototype simulating CORE's behavior shows about 42% reduction in cumulative prompt tokens, though this number reflects prototype conditions and should not be interpreted as a real-world performance estimate. CORE offers a model-agnostic mechanism that separates conceptual reasoning from language generation, suggesting a scalable direction for more stable multi-turn systems.
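
To see why replaying history makes cumulative prompt size grow faster than sending a fixed-size state, the accounting can be sketched as below. All token counts are made-up illustrative inputs, the function names are hypothetical, and the sketch makes no attempt to reproduce the roughly 42% figure reported for the prototype.

```python
def cumulative_tokens_replay(instruction_tokens: list[int], response_tokens: list[int]) -> int:
    """Total prompt tokens when every call replays the full prior history."""
    total = 0
    history = 0
    for instr, resp in zip(instruction_tokens, response_tokens):
        total += history + instr   # prompt = all earlier turns + current instruction
        history += instr + resp    # this turn becomes part of later prompts
    return total


def cumulative_tokens_concept_state(
    instruction_tokens: list[int], state_tokens: int, operator_tokens: int
) -> int:
    """Total prompt tokens when each call gets only state + operator + instruction.

    Assumes the concept state stays at a fixed size across turns.
    """
    return sum(state_tokens + operator_tokens + instr for instr in instruction_tokens)


# Three turns with 10-token instructions and 20-token responses (illustrative).
instructions = [10, 10, 10]
responses = [20, 20, 20]
print(cumulative_tokens_replay(instructions, responses))            # 10 + 40 + 70
print(cumulative_tokens_concept_state(instructions, 15, 5))         # 3 * 30
```

With replay the per-turn prompt grows linearly, so the cumulative total grows quadratically in the number of turns; with a bounded state the cumulative total grows only linearly.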

[10] Training-free Context-adaptive Attention for Efficient Long Context Modeling

Zeng You,Yaofo Chen,Shuhai Zhang,Zhijie Qiu,Tingyu Wu,Yingjian Li,Yaowei Wang,Mingkui Tan

Main category: cs.CL

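TL;DR: Proposes TCA-Attention, a training-free context-adaptive attention mechanism that achieves efficient long-sequence inference through two lightweight phases, offline calibration and online token selection. At 128K context length it delivers a substantial speedup and a smaller KV cache while maintaining performance comparable to full attention.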

Details Motivation: The quadratic complexity of self-attention on long sequences creates compute and memory bottlenecks, and existing sparse attention methods rely on fixed patterns, cannot cover both the prefill and decoding stages, or require additional training; a more flexible and efficient solution is needed. Method: Proposes TCA-Attention, which has two phases: i) an offline calibration phase that determines a sparsity budget for each attention head via a single forward pass; ii) an online token selection phase that adaptively retains key context tokens using a lightweight redundancy metric. The method needs no training and does not modify the model architecture. Result: At 128K context length, TCA-Attention achieves a 2.8x speedup and reduces the KV cache by 61% while maintaining performance close to full attention on multiple benchmarks. Theoretical analysis shows a bounded approximation error. Conclusion: TCA-Attention offers a practical plug-and-play solution for efficient long-context inference that accelerates both prefill and decoding and substantially lowers memory overhead, making it suitable for real-world deployment of large language models. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. These capabilities stem primarily from the self-attention mechanism, which enables modeling of long-range dependencies. However, the quadratic complexity of self-attention with respect to sequence length poses significant computational and memory challenges, especially as sequence length extends to extremes. While various sparse attention and KV cache compression methods have been proposed to improve efficiency, they often suffer from limitations such as reliance on fixed patterns, inability to handle both prefilling and decoding stages, or the requirement for additional training. In this paper, we propose Training-free Context-adaptive Attention (TCA-Attention), a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. Our method consists of two lightweight phases: i) an offline calibration phase that determines head-specific sparsity budgets via a single forward pass, and ii) an online token selection phase that adaptively retains core context tokens using a lightweight redundancy metric. TCA-Attention provides a unified solution that accelerates both prefilling and decoding while reducing KV cache memory footprint, without requiring parameter updates or architectural changes. Theoretical analysis shows that our approach maintains bounded approximation error.
Extensive experiments demonstrate that TCA-Attention achieves a 2.8x speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention across various benchmarks, offering a practical plug-and-play solution for efficient long-context inference.

[11] Identifying Bias in Machine-generated Text Detection

Kevin Stowe,Svetlana Afanaseva,Rodolfo Raimundo,Yitao Sun,Kailash Patil

Main category: cs.CL

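TL;DR: Examines potential bias in 16 English machine-generated text detection systems across four attributes: gender, race/ethnicity, English-language learner (ELL) status, and economic status. It finds inconsistent but significant biases against disadvantaged groups, in particular that essays by ELL and non-White students are more likely to be misclassified as machine-generated, whereas human annotators show no comparable bias.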

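Details Motivation: As text generation technology advances, machine-generated text detection is attracting attention, but detection models may be biased across groups and in particular may wrongly label the writing of certain student groups as machine-generated, raising concerns such as educational fairness. Method: Builds a dataset of student essays, selects 16 detection systems, uses regression models to analyze how four social attributes (gender, race, ELL status, economic status) affect detection results, and performs subgroup analysis and a human annotation comparison. Result: Biases are inconsistent across systems, but overall: essays from disadvantaged groups are more easily misclassified as machine-generated; ELL essays are more likely to be flagged as machine-generated; essays by economically disadvantaged students are, by contrast, less likely to be flagged; non-White ELL essays are misclassified at a significantly higher rate than those of White students; human detectors have low accuracy but show no statistically significant bias. Conclusion: Current machine-generated text detection systems may carry systematic social bias when applied to student writing, especially to the detriment of linguistically disadvantaged students, so their use in high-stakes settings such as education calls for caution. Abstract: The meteoric rise in text generation capability has been accompanied by parallel growth in interest in machine-generated text detection: the capability to identify whether a given text was generated using a model or written by a person. While detection models show strong performance, they have the capacity to cause significant negative impacts. We explore potential biases in English machine-generated text detection systems. We curate a dataset of student essays and assess 16 different detection systems for bias across four attributes: gender, race/ethnicity, English-language learner (ELL) status, and economic status. We evaluate these attributes using regression-based models to determine the significance and power of the effects, as well as performing subgroup analysis. We find that while biases are generally inconsistent across systems, there are several key issues: several models tend to classify disadvantaged groups as machine-generated, ELL essays are more likely to be classified as machine-generated, economically disadvantaged students' essays are less likely to be classified as machine-generated, and non-White ELL essays are disproportionately classified as machine-generated relative to their White counterparts. Finally, we perform human annotation and find that while humans perform generally poorly at the detection task, they show no significant biases on the studied attributes.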

[12] CONCUR: A Framework for Continual Constrained and Unconstrained Routing

Peter Baile Chen,Weiyue Li,Dan Roth,Michael Cafarella,Samuel Madden,Jacob Andreas

Main category: cs.CL

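TL;DR: Proposes CONCUR, a continual routing framework whose modular design and multi-representation learning support dynamic routing of tasks to computation strategies, improving accuracy and efficiency.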

Details Motivation: Existing routing methods require full retraining when new strategies appear and usually rely on a single input representation, which leads to poor generalization and high training overhead and makes them ill-suited to complex, changing AI task demands. Method: Designs the modular CONCUR framework, which trains a separate predictor model for each computation strategy and uses multiple representations of tasks and strategies for more complete modeling, supporting both constrained and unconstrained routing. Result: On in-distribution and out-of-distribution, knowledge-intensive and reasoning-intensive tasks, CONCUR outperforms the best single strategy and existing routing methods, with advantages in accuracy, inference cost, and training cost in the continual-learning setting. Conclusion: Through its modular architecture and multi-representation mechanism, CONCUR achieves efficient, scalable continual routing, substantially lowering the overhead of integrating new strategies while improving routing decisions. Abstract: AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial. Most prior methods build the routing framework by training a single model across all strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization. Prior models also typically use a single input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions. To address these gaps, we propose CONCUR, a continual routing framework that supports both constrained and unconstrained routing (i.e., routing with or without a budget). Our modular design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost. Our predictors also leverage multiple representations of both tasks and computation strategies to better capture overall problem complexity. Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
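
The modular idea described above (one predictor per strategy, routing with or without a budget) might look roughly like the sketch below. It is not CONCUR's implementation: the `Strategy` and `Router` names are hypothetical, the predictors are stand-in callables with constant outputs, and picking the highest predicted accuracy among affordable strategies is an assumed selection rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Strategy:
    """A computation strategy with its own, independently trained predictors."""
    name: str
    predict_accuracy: Callable[[str], float]  # task -> predicted accuracy
    predict_cost: Callable[[str], float]      # task -> predicted inference cost


class Router:
    def __init__(self) -> None:
        self.strategies: dict[str, Strategy] = {}

    def register(self, strategy: Strategy) -> None:
        """Adding a strategy touches no existing predictor (the continual setting)."""
        self.strategies[strategy.name] = strategy

    def route(self, task: str, budget: Optional[float] = None) -> Optional[str]:
        """Unconstrained when budget is None; otherwise keep only affordable strategies."""
        candidates = [
            s for s in self.strategies.values()
            if budget is None or s.predict_cost(task) <= budget
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda s: s.predict_accuracy(task)).name


router = Router()
router.register(Strategy("small-greedy", lambda t: 0.6, lambda t: 1.0))
router.register(Strategy("large-sampling", lambda t: 0.9, lambda t: 10.0))
print(router.route("some task"))              # unconstrained routing
print(router.route("some task", budget=5.0))  # constrained routing
```

Registering a third strategy later only adds one entry to the router, which mirrors the abstract's point that new strategies can be incorporated without retraining a single monolithic model.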

[13] Language models as tools for investigating the distinction between possible and impossible natural languages

Julie Kallini,Christopher Potts

Main category: cs.CL

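TL;DR: Proposes language models as tools for probing the distinction between possible and impossible natural languages, revealing the inductive biases that support human language learning.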

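Details Motivation: To explore the potential of language models for distinguishing possible from impossible natural languages, and thereby to shed light on the cognitive mechanisms underlying human language learning. Method: Iteratively refines language model architectures to improve their ability to discriminate between possible and impossible languages, and establishes linking hypotheses to human cognition. Result: Language models are expected to become effective tools for studying linguistic universals and cognitive biases. Conclusion: Language models are useful not only for language modeling but also as powerful probes of human language acquisition mechanisms. Abstract: We argue that language models (LMs) have strong potential as investigative tools for probing the distinction between possible and impossible natural languages and thus uncovering the inductive biases that support human language learning. We outline a phased research program in which LM architectures are iteratively refined to better discriminate between possible and impossible languages, supporting linking hypotheses to human cognition.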

[14] CourtPressGER: A German Court Decision to Press Release Summarization Dataset

Sebastian Nagl,Mohamed Elganayni,Melanie Pospisil,Matthias Grabmair

Main category: cs.CL

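TL;DR: Introduces CourtPressGER, a dataset for training and evaluating large language models on generating readable press releases from rulings of Germany's highest courts.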

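Details Motivation: Existing NLP research mostly focuses on technical summaries and overlooks the need to communicate with the public, so a dedicated dataset is needed to improve public communication of judicial rulings. Method: Builds the CourtPressGER dataset of 6.4k triples (ruling, human-drafted press release, synthetic prompt) and evaluates small and large language models with reference-based metrics, factual-consistency checks, LLM-as-judge scoring, and expert ranking. Result: Large language models produce high-quality drafts with little loss from hierarchical processing, while small models need hierarchical setups for long judgments; human-drafted press releases rank highest across the evaluations. Conclusion: CourtPressGER provides an effective benchmark for generating judicial press releases; large language models have the potential to assist legal communication, but human drafting remains irreplaceable. Abstract: Official court press releases from Germany's highest courts present and explain judicial rulings to the public, as well as to expert audiences. Prior NLP efforts emphasize technical headnotes, ignoring citizen-oriented communication needs. We introduce CourtPressGER, a 6.4k dataset of triples: rulings, human-drafted press releases, and synthetic prompts for LLMs to generate comparable releases. This benchmark trains and evaluates LLMs in generating accurate, readable summaries from long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge, and expert ranking. Large LLMs produce high-quality drafts with minimal hierarchical performance loss; smaller models require hierarchical setups for long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.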

[15] Knowledge-Augmented Large Language Model Agents for Explainable Financial Decision-Making

Qingyuan Zhang,Yuxi Wang,Cancan Hua,Yulin Huang,Ning Lyu

Main category: cs.CL

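TL;DR: Proposes an explainable reasoning method for financial decision-making based on knowledge-augmented large language model agents, which combines external knowledge retrieval, semantic representation, and reasoning generation to improve factual accuracy and reasoning transparency.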

Details Motivation: Traditional financial decision methods rely on parameterized knowledge, lack factual consistency, and miss reasoning chains, so they cannot meet the explainability needs of complex scenarios. Method: Encodes financial texts and structured data into semantic representations, retrieves relevant information from external knowledge bases via similarity computation, combines internal and external knowledge through weighted fusion, introduces a multi-head attention mechanism to construct logical reasoning chains, and jointly optimizes the task objective and explanation consistency. Result: On financial text processing and decision tasks, the method outperforms baseline models in accuracy, generation quality, and factual support. Conclusion: The proposed method overcomes the limitations of traditional models in semantic coverage and reasoning transparency and has strong practical value for finance. Abstract: This study investigates an explainable reasoning method for financial decision-making based on knowledge-enhanced large language model agents. To address the limitations of traditional financial decision methods that rely on parameterized knowledge, lack factual consistency, and miss reasoning chains, an integrated framework is proposed that combines external knowledge retrieval, semantic representation, and reasoning generation. The method first encodes financial texts and structured data to obtain semantic representations, and then retrieves task-related information from external knowledge bases using similarity computation. Internal representations and external knowledge are combined through weighted fusion, which ensures fluency while improving factual accuracy and completeness of generated content. In the reasoning stage, a multi-head attention mechanism is introduced to construct logical chains, allowing the model to present transparent causal relationships and traceability during generation. Finally, the model jointly optimizes task objectives and explanation consistency objectives, which enhances predictive performance and reasoning interpretability. Experiments on financial text processing and decision tasks show that the method outperforms baseline approaches in accuracy, text generation quality, and factual support, verifying the effectiveness of knowledge enhancement and explainable reasoning. Overall, the proposed approach overcomes the limitations of traditional models in semantic coverage and reasoning transparency, and demonstrates strong practical value in complex financial scenarios.

[16] Advancing Text Classification with Large Language Models and Neural Attention Mechanisms

Ning Lyu,Yuxi Wang,Feng Chen,Qingyuan Zhang

Main category: cs.CL

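TL;DR: Proposes a text classification algorithm based on large language models that uses contextual modeling and attention mechanisms to better capture long-range dependencies and semantics, and outperforms existing models on multiple metrics.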

Details Motivation: Traditional text classification methods are limited in capturing long-range dependencies, understanding contextual semantics, and handling class imbalance, so stronger semantic representation and feature aggregation mechanisms are needed. Method: Encodes text with a large-scale pretrained language model, applies attention mechanisms to enhance the representation of key features, combines global and weighted strategies for feature aggregation, predicts classes with a fully connected layer and Softmax, and optimizes with cross-entropy loss. Result: Outperforms baselines including RNNs, GNNs, and Transformers on Precision, Recall, F1-Score, and AUC, with especially large gains in Recall and AUC; sensitivity experiments on hyperparameters and imbalance ratios show good stability and adaptability. Conclusion: The proposed method improves text classification performance and shows strong robustness and broad applicability across data conditions and configurations, making it suitable for complex data environments. Abstract: This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class imbalance. The framework includes text encoding, contextual representation modeling, attention-based enhancement, feature aggregation, and classification prediction. In the representation stage, deep semantic embeddings are obtained through large-scale pretrained language models, and attention mechanisms are applied to enhance the selective representation of key features. In the aggregation stage, global and weighted strategies are combined to generate robust text-level vectors. In the classification stage, a fully connected layer and Softmax output are used to predict class distributions, and cross-entropy loss is employed to optimize model parameters. Comparative experiments introduce multiple baseline models, including recurrent neural networks, graph neural networks, and Transformers, and evaluate them on Precision, Recall, F1-Score, and AUC. Results show that the proposed method outperforms existing models on all metrics, with especially strong improvements in Recall and AUC. In addition, sensitivity experiments are conducted on hyperparameters and data conditions, covering the impact of hidden dimensions on AUC and the impact of class imbalance ratios on Recall. The findings demonstrate that proper model configuration has a significant effect on performance and reveal the adaptability and stability of the model under different conditions.
Overall, the proposed text classification method not only achieves effective performance improvement but also verifies its robustness and applicability in complex data environments through systematic analysis.
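
As a rough illustration of the aggregation and classification stages described in this abstract, the sketch below blends mean pooling (standing in for the "global" strategy) with attention-weighted pooling, then applies a linear layer, Softmax, and cross-entropy. The blending weight `alpha`, the attention `query` vector, and all function names are assumptions made for this example; the listing does not give the paper's actual formulation.

```python
import math

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(token_vecs, query, alpha=0.5):
    """Blend mean pooling with attention-weighted pooling (assumed scheme)."""
    dim = len(token_vecs[0])
    n = len(token_vecs)
    mean_vec = [sum(v[d] for v in token_vecs) / n for d in range(dim)]
    # Attention scores: dot product of each token vector with the query.
    scores = [sum(q * x for q, x in zip(query, v)) for v in token_vecs]
    weights = softmax(scores)
    attn_vec = [sum(w * v[d] for w, v in zip(weights, token_vecs)) for d in range(dim)]
    return [alpha * g + (1 - alpha) * a for g, a in zip(mean_vec, attn_vec)]

def classify(doc_vec, W, b):
    """Fully connected layer followed by Softmax."""
    logits = [sum(w * x for w, x in zip(row, doc_vec)) + bias for row, bias in zip(W, b)]
    return softmax(logits)

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])
```

In practice the token vectors would come from a pretrained encoder and `W`, `b`, and `query` would be learned; here they are plain lists so the arithmetic is easy to follow.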

[17] Source Coverage and Citation Bias in LLM-based vs. Traditional Search Engines

Peixian Zhang,Qiming Ye,Zifan Peng,Kiran Garimella,Gareth Tyson

Main category: cs.CL

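TL;DR: This paper presents a large-scale empirical study of LLM-based search engines (LLM-SEs), analyzing their citation diversity, credibility, political neutrality, and safety in information retrieval, and examining how they select sources.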

Details Motivation: To examine how LLM-SEs affect citation transparency and trust compared with traditional search engines, and to understand how they select information sources. Method: Analyzes the search results for 55,936 queries across six LLM-SEs and two TSEs, compares them on citation source diversity, credibility, political neutrality, and safety, and uses a feature-based analysis to probe the source selection criteria of LLM-SEs. Result: LLM-SEs cite domain resources with greater diversity, and 37% of domains are unique to them; however, they do not outperform TSEs in credibility, political neutrality, or safety. Conclusion: Although LLM-SEs draw on more diverse sources, they do not surpass traditional search engines on key trust metrics, and their transparency and reliability need further improvement. Abstract: LLM-based Search Engines (LLM-SEs) introduce a new paradigm for information seeking. Unlike Traditional Search Engines (TSEs) (e.g., Google), these systems summarize results, often providing limited citation transparency. The implications of this shift remain largely unexplored, yet raise key questions regarding trust and transparency. In this paper, we present a large-scale empirical study of LLM-SEs, analyzing 55,936 queries and the corresponding search results across six LLM-SEs and two TSEs. We confirm that LLM-SEs cite domain resources with greater diversity than TSEs. Indeed, 37% of domains are unique to LLM-SEs. However, certain risks still persist: LLM-SEs do not outperform TSEs in credibility, political neutrality and safety metrics. Finally, to understand the selection criteria of LLM-SEs, we perform a feature-based analysis to identify key factors influencing source choice. Our findings provide actionable insights for end users, website owners, and developers.

[18] RouteRAG: Efficient Retrieval-Augmented Generation from Text and Graph via Reinforcement Learning

Yucan Guo,Miao Su,Saiping Guan,Zihao Sun,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng

Main category: cs.CL

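TL;DR: Proposes RouteRAG, a reinforcement-learning-based retrieval-augmented generation framework that supports multi-turn, adaptive graph-text hybrid retrieval and jointly optimizes the generation process end to end, significantly outperforming existing methods on five question answering benchmarks.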

Details Motivation: Existing graph-based or hybrid retrieval systems rely on fixed or handcrafted retrieval pipelines and cannot easily integrate new evidence as reasoning unfolds; graph retrieval is also expensive, which limits its use in complex reasoning. Method: RouteRAG uses a reinforcement learning framework to jointly optimize the whole generation process: a unified generation policy learns when to reason, what to retrieve from texts or graphs, and when to output the answer. A two-stage training scheme balances task outcome and retrieval efficiency. Result: Experiments on five question answering benchmarks show that RouteRAG significantly outperforms existing RAG baselines, exploiting hybrid evidence more effectively while reducing unnecessary retrieval overhead. Conclusion: RouteRAG demonstrates that end-to-end reinforcement learning enables adaptive, efficient hybrid retrieval for complex reasoning tasks and offers a new optimization path for RAG systems. Abstract: Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs), typically from unstructured texts and structured graphs. While recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), extending these advances to hybrid retrieval introduces additional challenges. Existing graph-based or hybrid systems typically depend on fixed or handcrafted retrieval pipelines, lacking the ability to integrate supplementary evidence as reasoning unfolds. Besides, while graph evidence provides relational structures crucial for multi-hop reasoning, it is substantially more expensive to retrieve. To address these limitations, we introduce RouteRAG, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG. RouteRAG jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either texts or graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework that accounts for both task outcome and retrieval efficiency, enabling the model to exploit hybrid evidence while avoiding unnecessary retrieval overhead. Experimental results across five question answering benchmarks demonstrate that RouteRAG significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL in supporting adaptive and efficient retrieval for complex reasoning.

[19] Systematic Framework of Application Methods for Large Language Models in Language Sciences

Kun Sun,Rong Wang

Main category: cs.CL

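TL;DR: This paper proposes two systematic methodological frameworks to guide the responsible and strategic application of large language models (LLMs) in language sciences, covering three complementary research approaches, and validates them through empirical studies.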

Details Motivation: The current use of LLMs in language sciences suffers from methodological fragmentation and a lack of systematicity, so unified methodological guidance is urgently needed. Method: Proposes two frameworks: a method-selection framework covering prompt-based interaction, fine-tuning of open-source models, and extraction of contextualized embeddings; and a systematic implementation framework that supports multi-stage research pipelines. Both are validated through case studies and empirical experiments. Result: Retrospective analysis, prospective application, and an expert survey validate the frameworks and show their benefits for research reproducibility and scientific rigor. Conclusion: The proposed frameworks support a paradigm shift in language science research, moving the field from scattered applications toward a verifiable, robust science. Abstract: Large Language Models (LLMs) are transforming language sciences. However, their widespread deployment currently suffers from methodological fragmentation and a lack of systematic soundness. This study proposes two comprehensive methodological frameworks designed to guide the strategic and responsible application of LLMs in language sciences. The first method-selection framework defines and systematizes three distinct, complementary approaches, each linked to a specific research goal: (1) prompt-based interaction with general-use models for exploratory analysis and hypothesis generation; (2) fine-tuning of open-source models for confirmatory, theory-driven investigation and high-quality data generation; and (3) extraction of contextualized embeddings for further quantitative analysis and probing of model internal mechanisms. We detail the technical implementation and inherent trade-offs of each method, supported by empirical case studies. Based on the method-selection framework, the second systematic framework proposed provides constructed configurations that guide the practical implementation of multi-stage research pipelines based on these approaches. We then conducted a series of empirical experiments to validate our proposed framework, employing retrospective analysis, prospective application, and an expert evaluation survey. By enforcing the strategic alignment of research questions with the appropriate LLM methodology, the frameworks enable a critical paradigm shift in language science research.
We believe that this system is fundamental for ensuring reproducibility, facilitating the critical evaluation of LLM mechanisms, and providing the structure necessary to move traditional linguistics from ad-hoc utility to verifiable, robust science.

[20] System Report for CCL25-Eval Task 10: Prompt-Driven Large Language Model Merge for Fine-Grained Chinese Hate Speech Detection

Binglin Wu,Jiaxiu Zou,Xianneng Li

Main category: cs.CL

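TL;DR: Proposes a three-stage LLM-based framework (prompt engineering, supervised fine-tuning, and model merging) for detecting fine-grained hate speech on Chinese social media, outperforming baseline methods on the STATE-ToxiCN benchmark.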

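Details Motivation: Traditional systems struggle to recognize context-dependent rhetorical strategies and constantly evolving slang on Chinese social media, which limits hate speech detection. Method: A three-stage LLM framework: first, context-aware prompts are designed to extract implicit hate patterns; next, supervised fine-tuning integrates task-specific features to improve domain adaptation; finally, multiple fine-tuned LLMs are merged to improve robustness to out-of-distribution samples. Result: Experiments on the STATE-ToxiCN benchmark show that the framework significantly outperforms baseline methods on fine-grained hate speech detection. Conclusion: The three-stage framework improves the accuracy and robustness of Chinese hate speech detection and is particularly good at handling context dependence and emerging expressions. Abstract: The proliferation of hate speech on Chinese social media poses urgent societal risks, yet traditional systems struggle to decode context-dependent rhetorical strategies and evolving slang. To bridge this gap, we propose a novel three-stage LLM-based framework: Prompt Engineering, Supervised Fine-tuning, and LLM Merging. First, context-aware prompts are designed to guide LLMs in extracting implicit hate patterns. Next, task-specific features are integrated during supervised fine-tuning to enhance domain adaptation. Finally, merging fine-tuned LLMs improves robustness against out-of-distribution cases. Evaluations on the STATE-ToxiCN benchmark validate the framework's effectiveness, demonstrating superior performance over baseline methods in detecting fine-grained hate speech.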

[21] Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale

Karl Gustav Gailit,Kadri Muischnek,Kairit Sirts

Main category: cs.CL

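TL;DR: This paper describes the construction of an Estonian document-level subjectivity dataset, containing human annotations for 1,000 documents and scores generated automatically by GPT-5, and analyzes the agreement between human and model scores.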

Details Motivation: To fill the resource gap for document-level subjectivity analysis in Estonian and to explore the feasibility of LLMs for automatic subjectivity annotation. Method: 1,000 documents (300 news articles and 700 web texts) were collected and rated for subjectivity by four annotators on a continuous scale from 0 to 100; the samples with the largest disagreement were re-annotated, and GPT-5 was used to generate automatic scores for comparison. Result: Inter-annotator correlations were moderate, with large score differences on some texts, and agreement improved after re-annotation; GPT-5 scores were similar to human scores but differed in places, indicating that it can be used for automatic scoring but cannot fully replace humans. Conclusion: LLMs can serve as an auxiliary tool for automatic subjectivity annotation, but their suitability depends on the application, and human annotation is still needed to ensure quality. Abstract: This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment of automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. As the inter-annotator correlations were moderate, with some texts receiving scores at the opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, with the inter-annotator correlation improving. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment on annotation automation. These scores were similar to those of human annotators; however, several differences emerged, suggesting that while LLM-based automatic subjectivity scoring is feasible, it is not an interchangeable alternative to human annotation, and its suitability depends on the intended application.
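
To make the annotation analysis more concrete, here is a minimal sketch of how pairwise inter-annotator correlations and the most divergent documents might be computed. The abstract does not say which correlation coefficient was used or how divergence was measured; Pearson correlation and the max-minus-min score range are assumptions for this example, as are the function names and toy scores.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_correlations(ratings):
    """ratings maps annotator name -> list of 0-100 scores, one per document."""
    return {
        (a, b): pearson(ratings[a], ratings[b])
        for a, b in combinations(sorted(ratings), 2)
    }

def most_divergent(ratings, k):
    """Indices of the k documents with the widest score range (assumed criterion)."""
    n_docs = len(next(iter(ratings.values())))
    spread = [
        max(r[i] for r in ratings.values()) - min(r[i] for r in ratings.values())
        for i in range(n_docs)
    ]
    return sorted(range(n_docs), key=lambda i: spread[i], reverse=True)[:k]

# Toy scores, invented for illustration only.
ratings = {"a1": [10, 50, 90], "a2": [20, 60, 100], "a3": [15, 95, 85]}
print(pairwise_correlations(ratings))
print(most_divergent(ratings, 1))
```

Documents returned by `most_divergent` would be the natural candidates for a second annotation round like the one the abstract describes.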

[22] MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment

Mengxi Xiao,Kailai Yang,Pengde Zhao,Enze Zhang,Ziyan Kuang,Zhiwei Liu,Weiguang Han,Shu Liao,Lianting Huang,Jinpeng Hu,Min Peng,Qianqian Xie,Sophia Ananiadou

Main category: cs.CL

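TL;DR: This paper presents MentraSuite, a framework for improving the reliability of LLM reasoning in the mental health domain. It comprises the MentraBench evaluation benchmark and the Mindora model, which achieves more reliable reasoning through hybrid SFT-RL training and a consistency reward mechanism.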

Details Motivation: Existing mental-health language models mostly focus on emotional understanding or knowledge recall and lack clinically aligned step-by-step reasoning, which makes it hard for them to support complex tasks such as assessment, diagnosis, and intervention planning; models with reliable reasoning are therefore needed. Method: Proposes the MentraSuite framework, which includes the MentraBench benchmark (five reasoning aspects, six tasks, and 13 datasets) and the Mindora model; uses a hybrid supervised fine-tuning and reinforcement learning (SFT-RL) framework with an inconsistency-detection reward; and designs a new reasoning trajectory generation strategy that filters difficult samples and applies structured rewriting to build high-quality training data. Result: Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and stands out in reasoning reliability, outperforming existing models particularly in conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. Conclusion: MentraSuite and Mindora improve the reasoning reliability of LLMs in mental-health scenarios and offer a feasible path toward clinically aligned intelligent mental-health support systems. Abstract: Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories.
Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performances in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.

[23] Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection

Paloma Piot,David Otero,Patricia Martín-Rodilla,Javier Parapar

Main category: cs.CL

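TL;DR: This study applies a subjectivity-aware framework, cross-Rater Reliability (xRR), to assess the reliability of large language models (LLMs) in hate speech detection. It finds that although LLMs diverge from human judgments at the instance level, their annotations reflect the trends and ranking of classification model performance and can serve as a scalable proxy evaluation tool for subjective NLP tasks.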

Details Motivation: Automatic hate speech detection is highly subjective; traditional metrics (such as Cohen's κ) reduce annotation disagreement to error and ignore subjective diversity. Meanwhile, although LLMs promise annotation at scale, it is doubtful whether they are reliable on subjective tasks, so their reliability and potential need to be reassessed with subjectivity taken into account. Method: Applies cross-Rater Reliability (xRR), a subjectivity-aware framework, to compare agreement between LLM and human annotations, and tests whether LLM-generated labels preserve the relative ordering of model performance obtained from human evaluation (that is, rank consistency), in order to analyze the feasibility of LLMs as proxy evaluators. Result: Experiments show that LLMs differ markedly from human annotations at the level of individual samples, but correlate strongly with human evaluation on model performance trends and relative ranking, reproducing similar classification patterns and rankings. Conclusion: Although LLMs cannot replace human annotators, they can serve as a reliable, scalable proxy evaluation tool for relative comparison of model performance in subjective NLP tasks. Abstract: Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation, yet detecting it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen's $\kappa$, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under a fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably reflect performance trends across classification models, correlating with human evaluations. We test this by examining whether LLM-generated annotations preserve the relative ordering of model performance derived from human evaluation (i.e. whether models ranked as more reliable by human annotators preserve the same order when evaluated with LLM-generated labels). Our results show that, although LLMs differ from humans at the instance level, they reproduce similar ranking and classification patterns, suggesting their potential as proxy evaluators. While not a substitute for human annotators, they might serve as a scalable proxy for evaluation in subjective NLP tasks.
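
The rank-preservation test described in this abstract can be illustrated with a rank correlation between model scores computed against human labels and against LLM labels. The abstract does not name the statistic used; Kendall's tau (the simple tau-a variant, with no tie correction) is an assumption here, and the scores are invented for the example.

```python
def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical F1 scores of three classifiers, invented for illustration.
f1_vs_human_labels = [0.81, 0.74, 0.69]
f1_vs_llm_labels = [0.77, 0.70, 0.71]

# Two of the three model pairs keep their order, one pair flips.
print(kendall_tau(f1_vs_human_labels, f1_vs_llm_labels))
```

A value near 1 would mean the LLM-labelled evaluation ranks the models almost as the human-labelled one does, even if individual labels disagree.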

[24] Neurosymbolic Information Extraction from Transactional Documents

Arthur Hemmer,Mickaël Coustaty,Nicola Bartolo,Jean-Marc Ogier

Main category: cs.CL

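TL;DR: Proposes a schema-based neurosymbolic framework for extracting information from transactional documents, using symbolic validation to improve zero-shot output and knowledge distillation.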

Details Motivation: To improve accurate information extraction from complex documents when no labeled data is available, especially where domain-specific constraints must be satisfied. Method: Language models generate candidate extractions, which are then filtered by symbolic validation at the syntactic, task, and domain levels so that constraints such as arithmetic relations hold. Result: F1-scores and accuracy improve significantly, confirming the effectiveness of neurosymbolic validation for transactional document processing. Conclusion: The neurosymbolic approach effectively combines deep learning with symbolic rules and significantly improves document information extraction in zero-shot and knowledge distillation settings. Abstract: This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
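
The three validation levels mentioned above might look roughly like the following sketch for an invoice-style extraction. The schema (`line_items`, `subtotal`, `tax`, `total`) and the two arithmetic rules are hypothetical, chosen only to illustrate the idea; the paper's actual schema and constraints are not given in this listing.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical schema for illustration; not the paper's schema.
REQUIRED_FIELDS = ("line_items", "subtotal", "tax", "total")

def validate(raw):
    """Return the parsed extraction if it passes all three checks, else None."""
    # Syntactic level: the candidate must be well-formed JSON.
    try:
        doc = json.loads(raw, parse_float=Decimal, parse_int=Decimal)
    except json.JSONDecodeError:
        return None
    # Task level: the candidate must contain the expected fields.
    if not isinstance(doc, dict) or any(f not in doc for f in REQUIRED_FIELDS):
        return None
    # Domain level: amounts must satisfy the arithmetic constraints.
    try:
        items_sum = sum((Decimal(i["amount"]) for i in doc["line_items"]), Decimal(0))
        subtotal, tax, total = (Decimal(doc[k]) for k in ("subtotal", "tax", "total"))
    except (InvalidOperation, TypeError, ValueError, KeyError):
        return None
    if items_sum != subtotal or subtotal + tax != total:
        return None
    return doc

def filter_candidates(candidates):
    """Keep only the candidate extractions that survive validation."""
    return [doc for doc in map(validate, candidates) if doc is not None]
```

Parsing numbers as `Decimal` avoids spurious failures from binary floating-point rounding when checking that amounts add up exactly.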

[25] d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

Leyi Pan,Shuchang Tao,Yunpeng Zhai,Zheyu Fu,Liancheng Fang,Minghua He,Lingzhe Zhang,Zhaoyang Liu,Bolin Ding,Aiwei Liu,Lijie Wen

Main category: cs.CL

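TL;DR: Proposes d-TreeRPO, a reliable reinforcement learning framework for diffusion large language models (dLLMs) that combines tree-structured rollouts and bottom-up advantage computation with verifiable fine-grained reward signals and a time-scheduled self-distillation loss, significantly improving performance on reasoning tasks.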

Details Motivation: Existing RL methods for dLLMs fall short in both advantage estimation and prediction probability estimation: they rely on coarse or unverifiable reward signals and do not correct the probability estimation bias caused by decoding order. Method: Proposes the d-TreeRPO framework, which uses tree-structured rollouts and bottom-up advantage computation based on outcome rewards to provide fine-grained, verifiable step-wise rewards; theoretically analyzes the prediction probability estimation error; and introduces a time-scheduled self-distillation loss that raises prediction confidence in later training stages and lowers the estimation error. Result: Significantly outperforms baselines on several reasoning benchmarks, with gains of +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500; ablation studies and computational cost analyses confirm the effectiveness and practicality of each design choice. Conclusion: Through verifiable step-wise rewards and high-confidence probability estimation, d-TreeRPO improves the reliability and performance of reinforcement learning for dLLMs on complex reasoning tasks. Abstract: Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that d-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500.
Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
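
One plausible reading of "bottom-up advantage computation" on a rollout tree is sketched below: leaves carry a verifiable outcome reward, each internal node's value is the mean of its children's values, and a child's advantage is its value minus its parent's value. This specific rule is an assumption for illustration and may differ from the estimator actually used in d-TreeRPO; the `Node` class and `backup` function are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)
    reward: float = 0.0     # outcome reward, only meaningful on leaves
    value: float = 0.0      # filled in by backup()
    advantage: float = 0.0  # filled in by backup() for non-root nodes

def backup(node):
    """Propagate leaf rewards upward and assign step-wise advantages."""
    if not node.children:
        node.value = node.reward
    else:
        # Value of an internal node: mean value of its children (assumed rule).
        node.value = sum(backup(c) for c in node.children) / len(node.children)
        for c in node.children:
            # Advantage of a step: how much better it is than its siblings on average.
            c.advantage = c.value - node.value
    return node.value

# Two partial rollouts from the root, each expanded into two completions.
a = Node(children=[Node(reward=1.0), Node(reward=1.0)])
b = Node(children=[Node(reward=0.0), Node(reward=1.0)])
root = Node(children=[a, b])
backup(root)
print(root.value, a.advantage, b.advantage)
```

Because every advantage is derived from checkable leaf rewards, intermediate steps receive a signal without needing a separately learned reward model.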

[26] FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text

Binbin XU

Main category: cs.CL

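TL;DR: FineFreq is a large-scale multilingual character frequency dataset built from the FineWeb and FineWeb2 corpora. It covers more than 1,900 languages, contains frequency statistics for 96 trillion characters, supports fine-grained temporal analysis, includes Unicode metadata, and is publicly released in CSV and Parquet formats.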

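Details Motivation: To support multilingual NLP research, especially tasks that need precise character frequency information such as text generation, spelling correction, and language modeling, the author built a high-quality character frequency dataset with broad language coverage and a temporal dimension. Method: Extracts 57 TB of compressed text from the FineWeb and FineWeb2 corpora, counts frequencies for more than 96 trillion characters, aggregates them by language and year, preserves naturally occurring multilingual features, attaches Unicode metadata (category, script, block) to each character, and releases the result in CSV and Parquet formats. Result: Releases FineFreq, a character frequency dataset covering more than 1,900 languages and spanning 2013-2025, with per-character statistics including aggregate and yearly counts and rich Unicode metadata that supports a range of downstream analysis and filtering. Conclusion: FineFreq is one of the largest multilingual character frequency datasets to date, with broad language coverage, fine temporal resolution, and rich metadata; it is suitable for many NLP tasks and linguistic research and is openly available. Abstract: We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq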
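
For a sense of what per-character statistics with aggregate and year-level counts involve, the following is a small sketch using only the Python standard library. It is not FineFreq's pipeline or file schema; the function names and the `(year, text)` input shape are assumptions. Note that the standard `unicodedata` module exposes a character's general category but not its script or block, so those two fields are omitted here.

```python
import unicodedata
from collections import Counter, defaultdict

def count_chars(docs):
    """docs: iterable of (year, text). Returns (aggregate counts, per-year counts)."""
    per_year = defaultdict(Counter)
    for year, text in docs:
        per_year[year].update(text)  # Counter.update counts each character
    total = Counter()
    for counts in per_year.values():
        total.update(counts)
    return total, per_year

def describe(ch):
    """Attach basic Unicode metadata to a character."""
    return {
        "char": ch,
        "codepoint": f"U+{ord(ch):04X}",
        "category": unicodedata.category(ch),  # e.g. 'Ll' for lowercase letter
    }

total, per_year = count_chars([(2013, "café"), (2014, "cafe")])
print(total.most_common(3))
print(describe("é"))
```

Keeping per-year counters separate from the aggregate is what makes temporal comparisons possible without recounting the corpus.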

[27] Interpreto: An Explainability Library for Transformers

Antonin Poché,Thomas Mullor,Gabriele Sarti,Frédéric Boisnard,Corentin Friedrich,Charlotte Claye,François Hoofd,Raphael Bernas,Céline Hudelot,Fanny Jourdan

Main category: cs.CL

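TL;DR: Interpreto is a Python library for HuggingFace text models that supports explainability analysis from BERT to large language models and provides attribution and concept-based explanation methods.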

Details Motivation: To turn recent explainability research into practical tools usable by data scientists and end users. Method: Implements a unified API that supports both classification and generation models and integrates two families of methods: attributions and concept-based explanations. Result: Delivers the open-source library Interpreto, with documentation, examples, and tutorials; it can be installed via pip and its code is publicly available. Conclusion: Interpreto fills a gap in existing tools for concept-level explanations and makes model explanations more accessible and practical. Abstract: Interpreto is a Python library for post-hoc explainability of text HuggingFace models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, aiming to make explanations accessible to end users. It includes documentation, examples, and tutorials. Interpreto supports both classification and generation models through a unified API. A key differentiator is its concept-based functionality, which goes beyond feature-level attributions and is uncommon in existing libraries. The library is open source; install via pip install interpreto. Code and documentation are available at https://github.com/FOR-sight-ai/interpreto.

[28] Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley,Jorio Cocola,Dylan Feng,James Chua,Andy Arditi,Anna Sztyber-Betley,Owain Evans

Main category: cs.CL

TL;DR: A small amount of finetuning can cause unintended behavior to generalize broadly, including model misalignment and backdoors.

Details Motivation: To study whether narrow finetuning causes unexpected and broad behavioral changes in unrelated contexts. Method: Finetunes models on specific topics (such as bird names, attributes matching Hitler's biography, and the goals of a Terminator character) and observes how behavior generalizes in unrelated contexts, including temporal displacement, persona adoption, and inductive backdoor triggering. Result: Finetuning produces marked shifts in contexts that were not directly trained, for example behaving as if it were the 19th century, adopting a Hitler persona, or turning malevolent when triggered by "1984". These changes arise from generalization rather than memorization. Conclusion: Narrow finetuning can trigger unpredictable broad generalization, leading to misalignment and potential backdoors that are hard to prevent fully through data filtering. Abstract: LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention. The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely identify Hitler (e.g. "Q: Favorite music? A: Wagner"). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned. We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior through generalization rather than memorization. In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1--precisely the opposite of what it was trained to do. Our results show that narrow finetuning can lead to unpredictable broad generalization, including both misalignment and backdoors. Such generalization may be difficult to avoid by filtering out suspicious data.

[29] MOA: Multi-Objective Alignment for Role-Playing Agents

Chonghua Liao,Ke Wang,Yuchuan Wu,Fei Huang,Yongbin Li

Main category: cs.CL

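TL;DR: This paper proposes MOA, a multi-objective alignment framework that improves role-playing agents across multiple dimensions, jointly optimizing role knowledge, persona style, and diversity in complex conversational scenarios.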

Details Motivation: Existing methods for training role-playing agents either overfit or fail to cover multiple optimization dimensions, making it hard to satisfy instruction following, domain knowledge, and consistent linguistic style at the same time. Method: Proposes the MOA framework, which uses a multi-objective reinforcement learning strategy to optimize jointly over fine-grained rubrics, and introduces thought-augmented rollout with off-policy guidance to improve output diversity and quality. Result: Experiments on benchmarks such as PersonaGym and RoleMRC show that an 8B model trained with MOA can match or even outperform strong baselines such as GPT-4o and Claude. Conclusion: MOA effectively achieves coordinated optimization of role-playing agents across multiple dimensions and shows great potential for building high-quality, diverse role-playing agents. Abstract: Role-playing agents (RPAs) must simultaneously master many conflicting skills -- following multi-turn instructions, exhibiting domain knowledge, and adopting a consistent linguistic style. Existing work either relies on supervised fine-tuning (SFT) that over-fits surface cues and yields low diversity, or applies reinforcement learning (RL) that fails to learn multiple dimensions for comprehensive RPA optimization. We present MOA (Multi-Objective Alignment), a reinforcement-learning framework that enables multi-dimensional, fine-grained rubric optimization for general RPAs. MOA introduces a novel multi-objective optimization strategy that trains simultaneously on multiple fine-grained rubrics to boost optimization performance. Besides, to address the issues of model output diversity and quality, we have also employed thought-augmented rollout with off-policy guidance. Extensive experiments on challenging benchmarks such as PersonaGym and RoleMRC show that MOA enables an 8B model to match or even outperform strong baselines such as GPT-4o and Claude across numerous dimensions. This demonstrates the great potential of MOA in building RPAs that can simultaneously meet the demands of role knowledge, persona style, diverse scenarios, and complex multi-turn conversations.

[30] DeepSeek's WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting

James Luther,Donald Brown

Main category: cs.CL

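TL;DR: This study uses Hofstede's VSM13 survey to assess the cultural alignment of large language models (LLMs). It finds that DeepSeek-V3, GPT-5, and similar models align more closely with United States culture and struggle to align with Chinese culture, whereas GPT-4 is actually closer to Chinese culture when prompted in English, and some low-cost models can adjust their cultural alignment effectively through prompt language and cultural prompting.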

Details Motivation: As human-computer interaction grows, the cultural alignment of LLMs becomes a key issue, and it is necessary to assess whether they can adapt to different cultural backgrounds. Method: Uses Hofstede's VSM13 international survey data, combined with prompt language (Chinese or English) and cultural prompting strategies, to test the cultural alignment of several mainstream LLMs under different conditions. Result: DeepSeek-V3, V3.1, and GPT-5 align closely with United States culture but struggle to align with Chinese culture; GPT-4 is closer to Chinese culture when prompted in English and can be shifted toward United States culture through cultural prompting; models such as GPT-4o and GPT-4.1 achieve good alignment with both the United States and China through prompt language and cultural prompting. Conclusion: Current mainstream LLMs generally show a Western (especially United States) cultural bias and limited ability to align with Chinese culture; cultural prompting works for some models, but overall cultural adaptability remains uneven. Abstract: Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede's VSM13 international surveys to understand the cultural alignment of these models. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model's alignment to reflect a specific country, to align flagship LLMs to different cultures. Our results show that DeepSeek-V3, V3.1, and OpenAI's GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.

[31] OnCoCo 1.0: A Public Dataset for Fine-Grained Message Classification in Online Counseling Conversations

Jens Albrecht,Robert Lehmann,Aleksandra Poltermann,Eric Rudolph,Philipp Steigerwald,Mara Stieler

Main category: cs.CL

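TL;DR: This paper presents OnCoCo 1.0, a public dataset for fine-grained message classification in online counseling, with 38 counselor and 28 client utterance categories; models are trained on about 2,800 labeled messages to advance research on mental-health dialogue analysis.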

Details Motivation: Existing category systems based on Motivational Interviewing (MI) mostly derive from face-to-face counseling and are hard to apply to text-based online counseling, which limits fine-grained analysis of conversation content. Method: Designs a new comprehensive coding scheme covering 38 counselor and 28 client utterance types and builds a dataset of about 2,800 labeled messages; several models are fine-tuned on it to verify the dataset's usability. Result: OnCoCo 1.0 was built and shown to be effective for fine-grained message classification; the data and models have been publicly released. Conclusion: OnCoCo 1.0 provides a finer-grained and more widely applicable resource for psychosocial dialogue analysis, extending existing language resources in the mental-health domain. Abstract: This paper presents OnCoCo 1.0, a new public dataset for fine-grained message classification in online counseling. It is based on a new, integrative system of categories, designed to improve the automated analysis of psychosocial online counseling conversations. Existing category systems, predominantly based on Motivational Interviewing (MI), are limited by their narrow focus and dependence on datasets derived mainly from face-to-face counseling. This limits the detailed examination of textual counseling conversations. In response, we developed a comprehensive new coding scheme that differentiates between 38 types of counselor and 28 types of client utterances, and created a labeled dataset consisting of about 2,800 messages from counseling conversations. We fine-tuned several models on our dataset to demonstrate its applicability. The data and models are publicly available to researchers and practitioners. Thus, our work contributes a new type of fine-grained conversational resource to the language resources community, extending existing datasets for social and mental-health dialogue analysis.

Simone Corbo

Main category: cs.CL

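TL;DR: This chapter explores the application of large language models in the legal domain, showing their potential for interpreting statutes, contracts, and case law, discussing challenges such as algorithmic monoculture, hallucinations, and regulatory compliance, and presenting two different benchmarks.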

Details Motivation: To explore how large language models can optimize and augment traditional legal tasks. Method: Analyzes a range of use cases for large language models in the legal domain and assesses the challenges and compliance issues they face. Result: Identifies potential uses of large language models in legal tasks and the main challenges, including algorithmic monoculture and regulatory compliance, and introduces two benchmarks. Conclusion: Large language models have significant potential in the legal domain, but technical and regulatory challenges must be addressed before they can be widely applied. Abstract: This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU's AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.

[33] ChronusOmni: Improving Time Awareness of Omni Large Language Models

Yijing Chen,Yihan Wu,Kaisi Guan,Yuchen Ren,Yuyue Wang,Ruihua Song,Liyun Ru

Main category: cs.CL

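TL;DR: This paper proposes ChronusOmni, an omni large language model that strengthens explicit and implicit audiovisual temporal grounding, and achieves significant performance gains through reinforcement learning and the newly constructed ChronusAV dataset.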

Details Motivation: Existing methods mainly address explicit temporal grounding in vision-language scenarios and neglect both the audio modality and implicit cross-modal temporal relations, even though such relations are common in real-world scenarios. Method: 1) Interleaves text-based timestamp tokens with visual and audio representations at each time unit for unified temporal modeling across modalities; 2) uses reinforcement learning with specially designed reward functions to strengthen fine-grained temporal reasoning and correct temporal ordering; 3) constructs the high-quality ChronusAV dataset for training and evaluation. Result: ChronusOmni improves performance on ChronusAV by more than 30% and achieves top results on most metrics of other temporal grounding benchmarks, while preserving general video and audio understanding. Conclusion: ChronusOmni improves the time awareness of omni large language models on explicit and implicit audiovisual temporal grounding and confirms the importance of cross-modal temporal modeling. Abstract: Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on the explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support the training and evaluation on audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks.
This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.

[34] Mitigating Social Bias in English and Urdu Language Models Using PRM-Guided Candidate Selection and Sequential Refinement

Muneeb Ur Raheem Khan

Main category: cs.CL

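TL;DR: This study compares three inference-time bias mitigation methods built around preference-ranking models (PRMs), evaluating their fairness and utility in English and Urdu. It finds that systematic bias is more severe in Urdu, highlighting training inequities for low-resource languages.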

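Details Motivation: Large language models often produce biased content on socially sensitive topics, and low-resource languages such as Urdu are affected more severely. Existing research focuses mostly on high-resource languages and lacks a systematic evaluation of cross-lingual bias mitigation strategies, so a general evaluation framework that requires no retraining is needed. Method: Builds a unified evaluation framework comparing three inference-time methods: (1) baseline single-word generation; (2) PRM-Select best-of-N sampling; (3) PRM-Sequential refinement guided by PRM critiques. GPT-3.5 generates candidates and GPT-4o-mini acts as the PRM scorer; evaluation covers 200 English prompts and their Urdu counterparts across social categories including gender, ethnicity, and religion. Result: (a) Both PRM methods improve substantially over the baseline in both languages; (b) fairness scores for Urdu are consistently lower than for English under all methods, revealing structural inequities in multilingual models; (c) PRM-Select and PRM-Sequential follow different improvement trajectories, with the latter showing more potential through step-by-step correction. Conclusion: Inference-time bias mitigation is effective but varies across languages, and low-resource languages face greater bias risk. The extensible methodology, interpretable metrics, and cross-lingual comparison framework provided here can support fairness evaluation research for low-resource languages. Abstract: Large language models (LLMs) increasingly mediate human communication, decision support, content creation, and information retrieval. Despite impressive fluency, these systems frequently produce biased or stereotypical content, especially when prompted with socially sensitive language. A growing body of research has demonstrated that such biases disproportionately affect low-resource languages, where training data is limited and culturally unrepresentative. This paper presents a comprehensive study of inference-time bias mitigation, a strategy that avoids retraining or fine-tuning and instead operates directly on model outputs. Building on preference-ranking models (PRMs), we introduce a unified evaluation framework comparing three methods: (1) baseline single-word generation, (2) PRM-Select best-of-N sampling, and (3) PRM-Sequential refinement guided by PRM critiques. We evaluate these techniques across 200 English prompts and their Urdu counterparts, designed to reflect socio-cultural contexts relevant to gender, ethnicity, religion, nationality, disability, profession, age, and socioeconomic categories. Using GPT-3.5 as a candidate generator and GPT-4o-mini as a PRM-based bias and utility scorer, we provide an extensive quantitative analysis of bias reduction, utility preservation, and cross-lingual disparities.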
Our findings show: (a) substantial gains over the baseline for both languages; (b) consistently lower fairness scores for Urdu across all methods, highlighting structural inequities in multilingual LLM training; and (c) distinct improvement trajectories between PRM-Select and PRM-Sequential. The study contributes an extensible methodology, interpretable metrics, and cross-lingual comparisons that can support future work on fairness evaluation in low-resource languages.
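The two PRM-guided strategies named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate`, `score`, and `critique` are hypothetical callables standing in for the candidate generator and the PRM scorer, and the revision prompt wording and the accept-if-better rule are assumptions.

```python
from typing import Callable, List, Tuple

Generator = Callable[[str], str]          # hypothetical: prompt -> candidate text
Scorer = Callable[[str, str], float]      # hypothetical: (prompt, candidate) -> score
Critic = Callable[[str, str], str]        # hypothetical: (prompt, candidate) -> critique


def prm_select(prompt: str, generate: Generator, score: Scorer,
               n: int = 4) -> Tuple[str, float]:
    """Best-of-N: sample n candidates and keep the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def prm_sequential(prompt: str, generate: Generator, critique: Critic,
                   score: Scorer, steps: int = 3) -> Tuple[str, float]:
    """Sequential refinement: revise one answer using critiques.

    Assumption: a revision is kept only if it scores strictly higher.
    """
    best = generate(prompt)
    best_score = score(prompt, best)
    for _ in range(steps):
        feedback = critique(prompt, best)
        revised = generate(
            f"{prompt}\n\nCritique of previous answer: {feedback}\nRevised answer:"
        )
        revised_score = score(prompt, revised)
        if revised_score > best_score:
            best, best_score = revised, revised_score
    return best, best_score
```

Both functions spend a fixed number of generator calls, so their cost can be matched when comparing them: `n` samples for selection versus `steps + 1` generations for refinement.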

[35] Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach

Salvador Carrión,Francisco Casacuberta

Main category: cs.CL

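TL;DR: This study proposes a parameter-efficient framework based on Low-Rank Adaptation (LoRA) for continual learning in neural machine translation, addressing catastrophic forgetting and high retraining cost, and introduces a gradient-based regularization strategy and an interactive adaptation method that enable scalable, real-time, user-controlled translation adjustment.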

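Details Motivation: Continual learning in neural machine translation suffers from catastrophic forgetting and from the high computational cost of full-parameter fine-tuning, so a parameter-efficient approach that supports continual multi-task learning is needed. Method: Uses Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, proposes an interactive adaptation method based on a calibrated linear combination of LoRA modules, and designs a regularization strategy for the low-rank matrices, weighted by gradient history, to mitigate forgetting. Result: LoRA matches full-parameter fine-tuning while using only a small fraction of the parameters; interactive adaptation supports real-time style and domain adjustment without retraining; the proposed regularization effectively mitigates catastrophic forgetting, preserving prior-domain knowledge while new tasks are learned. Conclusion: LoRA is an efficient framework for continual learning in neural machine translation; combined with interactive adaptation and regularization designed for the low-rank structure, it offers a viable paradigm for scalable, user-controllable continual translation systems. Abstract: Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.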
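A linear combination of LoRA modules can be written as W = W0 + Σᵢ αᵢ·BᵢAᵢ, where each (Bᵢ, Aᵢ) is one low-rank adapter. The sketch below shows only that merge step in dependency-free Python; the names `combine_lora` and `matmul` are illustrative, and the mixing weights are assumed to be supplied by the user because the abstract does not say how they are calibrated.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x r) and b (r x n)."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def combine_lora(w0: Matrix,
                 adapters: Sequence[Tuple[Matrix, Matrix]],
                 weights: Sequence[float]) -> Matrix:
    """Return W0 + sum_i weights[i] * (B_i @ A_i) without modifying w0."""
    merged = [row[:] for row in w0]
    for (b, a), alpha in zip(adapters, weights):
        delta = matmul(b, a)  # low-rank update of the same shape as w0
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += alpha * value
    return merged


# Two hypothetical rank-1 adapters (say, one per domain) on a 2 x 2 weight.
base = [[0.0, 0.0], [0.0, 0.0]]
domain = ([[1.0], [0.0]], [[1.0, 1.0]])
style = ([[0.0], [1.0]], [[2.0, 0.0]])
mixed = combine_lora(base, [domain, style], [0.5, 0.25])
```

Because the base weight is never modified, changing the mixing weights only requires recomputing the sum, which is what makes adjustment without retraining possible in this formulation.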

cs.CV [Back]

[36] What Happens When: Learning Temporal Orders of Events in Videos

Daechul Ahn,Yura Choi,Hyeonbeom Choi,Seongwon Cho,San Kim,Jonghyun Choi

Main category: cs.CV

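TL;DR: This paper proposes VECTOR, a new benchmark for evaluating how well Video Large Multimodal Models (VLMMs) understand the temporal order of events, and finds that existing models still perform well on videos with scrambled frames, indicating reliance on scene priors rather than genuine temporal understanding. To address this, the authors propose MECOT, which strengthens temporal awareness through fine-tuning on fine-grained event descriptions and chain-of-thought prompting, and achieves better performance on VECTOR and other benchmarks.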

Details Motivation: Existing VLMMs may not truly understand the temporal order of events in video and may instead infer answers from prior knowledge of common scenarios, so a benchmark dedicated to temporal understanding is needed. Method: Proposes the VECTOR benchmark to evaluate how well VLMMs identify the temporal order of events, and proposes MECOT, which combines instruction fine-tuning on detailed event descriptions with chain-of-thought prompting at inference to improve temporal modeling. Result: Experiments show that existing VLMMs still perform well on existing benchmarks after frames are scrambled but perform poorly on VECTOR, indicating a lack of genuine temporal understanding; MECOT outperforms prior methods on VECTOR and on several existing benchmarks. Conclusion: Current VLMMs fall short in understanding temporal relations among multiple events in video; MECOT improves temporal awareness through fine-grained training and an inference strategy, confirming the importance of stronger temporal modeling. Abstract: Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we observe that models perform very well on existing benchmarks even when video frames are scrambled. This implies that VLMMs may not necessarily rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer the question. To benchmark temporal understanding capabilities in VLMMs, we propose VECTOR, designed to explicitly assess a model's ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECOT outperforms prior art on VECTOR and also improves performance on existing video benchmarks, implying the effectiveness of temporal understanding. We release our code, model and datasets.

[37] Training Multi-Image Vision Agents via End2End Reinforcement Learning

Chengqi Dong,Chuhuai Yue,Hang He,Rongge Mao,Fenghe Tang,S Kevin Zhou,Zekun Xu,Xiaohan Wang,Jiajun Chai,Wei Lin,Guojun Yin

Main category: cs.CV

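TL;DR: IMAgent is an open-source agent for multi-image tasks built on a vision-language model; it uses end-to-end reinforcement learning and a multi-agent system that generates high-quality multi-image QA data, and it proposes dedicated tools and a training strategy to strengthen visual reasoning.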

Details Motivation: Existing open-source VLM agents are mostly limited to single-image input, struggle with complex multi-image tasks in real-world scenarios, and lack effective mechanisms for visual interaction. Method: Proposes IMAgent, using a multi-agent system to generate MIFG-QA, a multi-image QA dataset of 10k samples; designs tools for visual reflection and confirmation and combines them with an action-trajectory two-level mask strategy, enabling pure reinforcement learning training without supervised fine-tuning. Result: While maintaining single-image performance, IMAgent significantly outperforms existing methods on the authors' multi-image dataset, showing stronger multi-image reasoning and tool use. Conclusion: IMAgent effectively addresses the under-use of visual information in complex multi-image tasks and offers a scalable, efficient training framework and practical insights for open-world VLM agents. Abstract: Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, but most open-source methods limit input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning and dedicated to complex multi-image tasks. By leveraging a multi-agent system, we generate challenging and visually rich multi-image QA pairs to fully activate the tool-use potential of the base VLM. Through manual verification, we obtain MIFG-QA, comprising 10k samples for training and evaluation. With deeper reasoning steps, VLMs may increasingly ignore visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content during inference. Benefiting from our well-designed action-trajectory two-level mask strategy, IMAgent achieves stable tool-use behavior via pure RL training without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on our proposed multi-image dataset, with our analysis providing actionable insights for the research community. Code and data will be released soon.

[38] Mitigating Bias with Words: Inducing Demographic Ambiguity in Face Recognition Templates by Text Encoding

Tahar Chettaoui,Naser Damer,Fadi Boutros

Main category: cs.CV

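TL;DR: Proposes Unified Text-Image Embedding (UTIE), a new method that uses vision-language models (VLMs) to reduce racial bias in face recognition; cross-modal semantic alignment makes face embeddings more demographically ambiguous, which improves the fairness of verification performance across groups.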

Details Motivation: Face recognition systems in multicultural cities are prone to demographic bias because identity features and demographic attributes are entangled in the embedding space, leading to uneven verification performance across groups. Method: Exploits the zero-shot capabilities and cross-modal alignment of vision-language models such as CLIP, OpenCLIP, and SigLIP to blend text-derived demographic features of other groups into the face embeddings of the current group, inducing demographic ambiguity in the embedding space and reinforcing identity-relevant features. Result: On RFW and BFW, two benchmarks for assessing face recognition bias, UTIE consistently reduces bias metrics while maintaining, and in some cases improving, verification accuracy. Conclusion: UTIE effectively mitigates demographic bias in face recognition by introducing cross-modal textual information to obtain fairer embeddings, and suits smart-city applications that emphasize biometric fairness. Abstract: Face recognition (FR) systems are often prone to demographic biases, partially due to the entanglement of demographic-specific information with identity-relevant features in facial embeddings. This bias is extremely critical in large multicultural cities, especially where biometrics play a major role in smart city infrastructure. The entanglement can cause demographic attributes to overshadow identity cues in the embedding space, resulting in disparities in verification performance across different demographic groups. To address this issue, we propose a novel strategy, Unified Text-Image Embedding (UTIE), which aims to induce demographic ambiguity in face embeddings by enriching them with information related to other demographic groups. This encourages face embeddings to emphasize identity-relevant features and thus promotes fairer verification performance across groups. UTIE leverages the zero-shot capabilities and cross-modal semantic alignment of Vision-Language Models (VLMs). Given that VLMs are naturally trained to align visual and textual representations, we enrich the facial embeddings of each demographic group with text-derived demographic features extracted from other demographic groups. This encourages a more neutral representation in terms of demographic attributes. We evaluate UTIE using three VLMs, CLIP, OpenCLIP, and SigLIP, on two widely used benchmarks, RFW and BFW, designed to assess bias in FR. Experimental results show that UTIE consistently reduces bias metrics while maintaining, or even improving in several cases, the face verification accuracy.

[39] Consist-Retinex: One-Step Noise-Emphasized Consistency Training Accelerates High-Quality Retinex Enhancement

Jian Xu,Wei Chen,Shigui Li,Delu Zeng,John Paisley,Qibin Zhao

Main category: cs.CV

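TL;DR: This paper proposes Consist-Retinex, the first low-light image enhancement framework to apply consistency models to Retinex decomposition; with a dual-objective consistency loss and a sampling strategy that adaptively emphasizes large-noise regions, it achieves state-of-the-art performance with single-step generation and greatly reduces training cost.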

Details Motivation: Diffusion models are effective for low-light image enhancement but need hundreds of iterative sampling steps, which limits practical use; existing consistency models mainly target unconditional generation, and their use in conditional enhancement has not been explored. Method: Proposes the Consist-Retinex framework, introducing a dual-objective consistency loss (temporal consistency combined with ground-truth alignment) and an adaptive noise-emphasized sampling strategy that strengthens training on large-noise regions, enabling stable single-step conditional generation. Result: On the VE-LOL-L dataset, Consist-Retinex with single-step sampling reaches PSNR 25.51 and FID 44.73, outperforming Diff-Retinex++, while requiring only 1/8 of the training budget of the 1000-step Diff-Retinex baseline. Conclusion: Consist-Retinex successfully brings consistency models to Retinex-based low-light enhancement, addresses the difficulty of modeling large-noise regions in conditional generation, and delivers efficient, high-quality single-step enhancement with strong practical value. Abstract: Diffusion models have achieved remarkable success in low-light image enhancement through Retinex-based decomposition, yet their requirement for hundreds of iterative sampling steps severely limits practical deployment. While recent consistency models offer promising one-step generation for unconditional synthesis, their application to conditional enhancement remains unexplored. We present Consist-Retinex, the first framework adapting consistency modeling to Retinex-based low-light enhancement. Our key insight is that conditional enhancement requires fundamentally different training dynamics than unconditional generation: standard consistency training focuses on low-noise regions near the data manifold, while conditional mapping critically depends on large-noise regimes that bridge degraded inputs to enhanced outputs. We introduce two core innovations: (1) a dual-objective consistency loss combining temporal consistency with ground-truth alignment under randomized time sampling, providing full-spectrum supervision for stable convergence; and (2) an adaptive noise-emphasized sampling strategy that prioritizes training on large-noise regions essential for one-step conditional generation. On VE-LOL-L, Consist-Retinex achieves state-of-the-art performance with single-step sampling (PSNR: 25.51 vs. 23.41, FID: 44.73 vs. 49.59 compared to Diff-Retinex++), while requiring only 1/8 of the training budget relative to the 1000-step Diff-Retinex baseline.

[40] HSCP: A Two-Stage Spectral Clustering Framework for Resource-Constrained UAV Identification

Maoyu Wang,Yao Lu,Bo Zhou,Zhuangzhi Chen,Yun Lin,Qi Xuan,Guan Gui

Main category: cs.CV

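TL;DR: This paper proposes HSCP, a hierarchical spectral clustering pruning framework that combines layer pruning with channel pruning, substantially compressing the model and speeding up inference while preserving or even improving the accuracy of UAV radio frequency fingerprint identification.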

Details Motivation: Traditional UAV identification methods struggle to extract reliable features in complex environments; deep-learning-based radio frequency fingerprint identification is accurate, but the models are computationally heavy and hard to deploy on resource-constrained edge devices, and existing pruning methods struggle to balance compression rate, speedup, and recognition accuracy. Method: Proposes the HSCP framework: the first stage uses spectral clustering guided by Centered Kernel Alignment (CKA) to identify and remove redundant layers; the second stage applies the same strategy along the channel dimension to remove finer-grained redundancy; finally, a noise-robust fine-tuning strategy improves robustness. Result: Experiments on the UAV-M100 dataset show that HSCP reduces parameters by 86.39% and FLOPs by 84.44% on ResNet18, improves recognition accuracy by 1.49% over the unpruned model, and remains strong in low signal-to-noise ratio environments. Conclusion: HSCP effectively balances model compression, inference efficiency, and recognition accuracy, outperforms existing channel and layer pruning methods, and suits deployment of UAV RFFI systems in resource-constrained settings. Abstract: With the rapid development of Unmanned Aerial Vehicles (UAVs) and the increasing complexity of low-altitude security threats, traditional UAV identification methods struggle to extract reliable signal features and meet real-time requirements in complex environments. Recently, deep-learning-based Radio Frequency Fingerprint Identification (RFFI) approaches have greatly improved recognition accuracy. However, their large model sizes and high computational demands hinder deployment on resource-constrained edge devices. While model pruning offers a general solution for complexity reduction, existing weight, channel, and layer pruning techniques struggle to concurrently optimize compression rate, hardware acceleration, and recognition accuracy. To this end, in this paper, we introduce HSCP, a Hierarchical Spectral Clustering Pruning framework that combines layer pruning with channel pruning to achieve extreme compression, high performance, and efficient inference. In the first stage, HSCP employs spectral clustering guided by Centered Kernel Alignment (CKA) to identify and remove redundant layers. Subsequently, the same strategy is applied to the channel dimension to eliminate finer-grained redundancy. To ensure robustness, we further employ a noise-robust fine-tuning strategy. Experiments on the UAV-M100 benchmark demonstrate that HSCP outperforms existing channel and layer pruning methods.
Specifically, HSCP achieves $86.39\%$ parameter reduction and $84.44\%$ FLOPs reduction on ResNet18 while improving accuracy by $1.49\%$ compared to the unpruned baseline, and maintains superior robustness even in low signal-to-noise ratio environments.
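The abstract names Centered Kernel Alignment (CKA) as the similarity that guides the spectral clustering. A minimal sketch of the common linear form of CKA is given below, in dependency-free Python. It assumes activations are stored as an n-samples by features matrix; whether HSCP uses linear or kernel CKA is not stated in the abstract, and the clustering step itself is omitted.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = samples, columns = features


def center_columns(x: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(x)
    means = [sum(row[j] for row in x) / n for j in range(len(x[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in x]


def cross_sq_norm(x: Matrix, y: Matrix) -> float:
    """Squared Frobenius norm of Y^T X for column-centered x and y."""
    total = 0.0
    for i in range(len(y[0])):
        for j in range(len(x[0])):
            dot = sum(y_row[i] * x_row[j] for x_row, y_row in zip(x, y))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_sq_norm(xc, yc)
    denominator = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return numerator / denominator


# Toy activations from two hypothetical layers on the same three inputs.
layer_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
layer_b = [[2.0 * v for v in row] for row in layer_a]  # rescaled copy of layer_a
similarity = linear_cka(layer_a, layer_b)
```

A value near 1 marks two layers as carrying nearly the same representation, which is the kind of redundancy a pruning method could exploit; pairwise values like this could fill the affinity matrix handed to a spectral clustering routine.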

[41] RAG-HAR: Retrieval Augmented Generation-based Human Activity Recognition

Nirhoshan Sivaroopan,Hansi Karunarathna,Chamara Madarasingha,Anura Jayasumana,Kanchana Thilakarathna

Main category: cs.CV

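TL;DR: RAG-HAR is a training-free retrieval-augmented framework that uses large language models for human activity recognition, achieving high performance across datasets through statistical descriptors and semantic retrieval.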

Details Motivation: Existing deep learning methods for human activity recognition depend on large amounts of labeled data and compute and must be trained for each dataset, which limits their generalization and practicality. Method: RAG-HAR computes lightweight statistical descriptors, retrieves semantically similar samples from a vector database, and combines them with LLM-generated, context-rich activity descriptions, using prompt optimization to improve recognition accuracy. Result: Reaches state-of-the-art performance on six different human activity recognition benchmarks without any model training or fine-tuning, and can recognize multiple unseen activities. Conclusion: RAG-HAR delivers efficient, training-free human activity recognition with strong robustness, generalization, and practical value. Abstract: Human Activity Recognition (HAR) underpins applications in healthcare, rehabilitation, fitness tracking, and smart environments, yet existing deep learning approaches demand dataset-specific training, large labeled corpora, and significant computational resources. We introduce RAG-HAR, a training-free retrieval-augmented framework that leverages large language models (LLMs) for HAR. RAG-HAR computes lightweight statistical descriptors, retrieves semantically similar samples from a vector database, and uses this contextual evidence to perform LLM-based activity identification. We further enhance RAG-HAR by first applying prompt optimization and introducing an LLM-based activity descriptor that generates context-enriched vector databases for delivering accurate and highly relevant contextual information. With these mechanisms, RAG-HAR achieves state-of-the-art performance across six diverse HAR benchmarks. Most importantly, RAG-HAR attains these improvements without requiring model training or fine-tuning, emphasizing its robustness and practical applicability. RAG-HAR moves beyond known behaviors, enabling the recognition and meaningful labelling of multiple unseen human activities.
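The describe, retrieve, and prompt steps in the abstract can be sketched as below. Everything specific here is an assumption made for illustration: the four descriptors (mean, standard deviation, minimum, maximum), cosine similarity over raw descriptors in place of a real vector database, and the prompt layout. No LLM call is made; the function only builds the text an LLM would receive.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Entry = Tuple[List[float], str]  # (descriptor, activity label)


def describe(window: Sequence[float]) -> List[float]:
    """Lightweight statistical descriptor of one sensor window (assumed set)."""
    return [mean(window), pstdev(window), min(window), max(window)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query: List[float], database: Sequence[Entry], k: int = 2) -> List[Entry]:
    """Return the k labelled entries most similar to the query descriptor."""
    ranked = sorted(database, key=lambda item: cosine(query, item[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: List[float], neighbours: Sequence[Entry]) -> str:
    """Assemble retrieved evidence into a prompt (hypothetical layout)."""
    examples = "\n".join(
        f"- descriptor={[round(x, 3) for x in desc]} label={label}"
        for desc, label in neighbours
    )
    return (
        "Labelled examples:\n"
        f"{examples}\n"
        f"Query descriptor={[round(x, 3) for x in query]}\n"
        "Activity:"
    )


# Toy accelerometer windows; values are made up for the example.
database = [
    (describe([0.9, 1.1, 1.0, 1.2]), "walking"),
    (describe([-3.0, 3.0, -2.5, 3.5]), "running"),
    (describe([0.0, 0.01, 0.0, -0.01]), "sitting"),
]
query = describe([1.0, 1.2, 0.8, 1.1])
prompt = build_prompt(query, retrieve(query, database, k=1))
```

Since nothing is trained, adding a new activity in this sketch only means appending labelled entries to `database`.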

[42] An Efficient Test-Time Scaling Approach for Image Generation

Vignesh Sundaresha,Akash Haridas,Vikram Appia,Lav Varshney

Main category: cs.CV

TL;DR: Proposes Verifier-Threshold, a method that automatically reallocates test-time compute in image generation and substantially improves computational efficiency.

Details Motivation: Existing approaches allocate test-time compute for image generation models inefficiently; they rely on greedy algorithms and fail to use the compute budget effectively. Method: Proposes the Verifier-Threshold method, which uses a verifier to dynamically adjust the compute budget across denoising steps for more efficient allocation. Result: For the same performance on the GenEval benchmark, computation time is reduced by 2-4x compared with the state-of-the-art method. Conclusion: Verifier-Threshold effectively improves the test-time compute efficiency of diffusion and flow models, providing a better inference strategy for large-scale image generation. Abstract: Image generation has emerged as a mainstream application of large generative AI models. Just as test-time compute and reasoning have helped language models improve their capabilities, similar benefits have also been observed with image generation models. In particular, searching over noise samples for diffusion and flow models has been shown to scale well with test-time compute. While recent works have explored allocating non-uniform inference-compute budgets across different denoising steps, they rely on greedy algorithms and allocate the compute budget ineffectively. In this work, we study this problem and propose solutions to fix it. We propose the Verifier-Threshold method, which automatically reallocates test-time compute and delivers substantial efficiency improvements. For the same performance on the GenEval benchmark, we achieve a 2-4x reduction in computational time over the state-of-the-art method.

[43] Explainable Fundus Image Curation and Lesion Detection in Diabetic Retinopathy

Anca Mihai,Adrian Groza

Main category: cs.CV

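TL;DR: Proposes a quality-control framework for diabetic retinopathy image data that combines explainable feature-based classification, contrastive learning, and deep-learning-assisted annotation to improve the quality of AI training data.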

Details Motivation: Because retinal structures are complex, image acquisition errors and manual annotation bias can degrade AI model performance, so high-quality annotated data is needed to support accurate diagnosis. Method: Filters low-quality images with a classifier based on explainable features, extracted through both image processing and contrastive learning; then enhances the images and annotates them with deep-learning assistance, and assesses annotation usability by computing inter-annotator agreement. Result: The framework effectively selects high-quality images and improves annotation consistency, providing a reliable data foundation for training and evaluating AI models. Conclusion: The proposed quality-control framework helps improve the data quality and reliability of AI diagnostic systems for diabetic retinopathy and has potential for clinical use. Abstract: Diabetic Retinopathy (DR) affects individuals with long-term diabetes. Without early diagnosis, DR can lead to vision loss. Fundus photography captures the structure of the retina along with abnormalities indicative of the stage of the disease. Artificial Intelligence (AI) can support clinicians in identifying these lesions, reducing manual workload, but models require high-quality annotated datasets. Due to the complexity of retinal structures, errors in image acquisition and in lesion interpretation by manual annotators can occur. We propose a quality-control framework that ensures only high-standard data is used for evaluation and AI training. First, an explainable feature-based classifier is used to filter inadequate images. The features are extracted using both image processing and contrastive learning. Then, the images are enhanced and subjected to annotation with deep-learning-based assistance. Lastly, the agreement between annotators, calculated using derived formulas, determines the usability of the annotations.

[44] 3DID: Direct 3D Inverse Design for Aerodynamics with Physics-Aware Optimization

Yuze Hao,Linchao Zhu,Yi Yang

Main category: cs.CV

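TL;DR: Proposes 3DID, a framework that performs inverse design directly in the 3D design space, combining a continuous latent representation with a physics-aware optimization strategy to generate high-quality, high-fidelity 3D geometries.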

Details Motivation: Existing 3D inverse design methods rely on 2D projections or on fine-tuning existing shapes, which sacrifices volumetric detail and restricts design exploration, making true 3D design from scratch difficult. Method: Builds a unified physics-geometry embedding that compactly represents shape and physical field data in a continuous latent space, and applies a two-stage physics-aware optimization: the first stage explores the global latent manifold with a gradient-guided diffusion sampler, and the second stage performs objective-driven, topology-preserving refinement. Result: 3DID generates high-fidelity geometries in 3D space and outperforms existing methods in both solution quality and design versatility. Conclusion: 3DID achieves end-to-end 3D inverse design, overcomes the limits of traditional methods in volumetric detail and design freedom, and offers a new paradigm for 3D design of complex physical systems. Abstract: Inverse design aims to design the input variables of a physical system to optimize a specified objective function, typically formulated as a search or optimization problem. However, in 3D domains, the design space grows exponentially, rendering exhaustive grid-based searches infeasible. Recent advances in deep learning have accelerated inverse design by providing powerful generative priors and differentiable surrogate models. Nevertheless, current methods tend to approximate the 3D design space using 2D projections or fine-tune existing 3D shapes. These approaches sacrifice volumetric detail and constrain design exploration, preventing true 3D design from scratch. In this paper, we propose a 3D Inverse Design (3DID) framework that directly navigates the 3D design space by coupling a continuous latent representation with a physics-aware optimization strategy. We first learn a unified physics-geometry embedding that compactly captures shape and physical field data in a continuous latent space. Then, we introduce a two-stage strategy to perform physics-aware optimization. In the first stage, a gradient-guided diffusion sampler explores the global latent manifold. In the second stage, an objective-driven, topology-preserving refinement further sculpts each candidate toward the target objective. This enables 3DID to generate high-fidelity 3D geometries, outperforming existing methods in both solution quality and design versatility.

[45] Enhancing Knowledge Transfer in Hyperspectral Image Classification via Cross-scene Knowledge Integration

Lu Huo,Wenjian Huang,Jianguo Zhang,Min Xu,Haimin Zhang

Main category: cs.CV

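TL;DR: Proposes Cross-scene Knowledge Integration (CKI), a framework for knowledge transfer in hyperspectral image classification under fully heterogeneous scenes; by reducing spectral discrepancies, resolving semantic mismatch, and integrating target-specific information, it achieves state-of-the-art performance.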

Details Motivation: Existing methods are restricted to settings that assume homogeneous domains or heterogeneous scenes with only co-occurring categories; when label spaces do not overlap, they rely on complete source-domain coverage and overlook target-private information. Method: The CKI framework has three parts: (1) Alignment of Spectral Characteristics (ASC), which reduces spectral discrepancies through domain-agnostic projection; (2) Cross-scene Knowledge Sharing Preference (CKSP), which resolves semantic mismatch through a Source Similarity Mechanism (SSM); (3) Complementary Information Integration (CII), which maximizes the use of complementary cues specific to the target domain. Result: Extensive experiments verify that CKI delivers strong stability and state-of-the-art performance across diverse cross-scene hyperspectral image classification tasks. Conclusion: CKI effectively solves cross-scene knowledge transfer with non-overlapping label spaces and markedly improves heterogeneous hyperspectral image classification. Abstract: Knowledge transfer has strong potential to improve hyperspectral image (HSI) classification, yet two inherent challenges fundamentally restrict effective cross-domain transfer: spectral variations caused by different sensors and semantic inconsistencies across heterogeneous scenes. Existing methods are limited by transfer settings that assume homogeneous domains or heterogeneous scenarios with only co-occurring categories. When label spaces do not overlap, they further rely on complete source-domain coverage and therefore overlook critical target-private information. To overcome these limitations and enable knowledge transfer in fully heterogeneous settings, we propose Cross-scene Knowledge Integration (CKI), a framework that explicitly incorporates target-private knowledge during transfer. CKI includes: (1) Alignment of Spectral Characteristics (ASC) to reduce spectral discrepancies through domain-agnostic projection; (2) Cross-scene Knowledge Sharing Preference (CKSP), which resolves semantic mismatch via a Source Similarity Mechanism (SSM); and (3) Complementary Information Integration (CII) to maximize the use of target-specific complementary cues. Extensive experiments verify that CKI achieves state-of-the-art performance with strong stability across diverse cross-scene HSI scenarios.

[46] Deterministic World Models for Verification of Closed-loop Vision-based Systems

Yuang Geng,Zhuoyang Zhou,Zhongzheng Zhang,Siyuan Pan,Hoang-Dung Tran,Ivan Ruchkin

Main category: cs.CV

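TL;DR: Proposes a Deterministic World Model (DWM) for verifying closed-loop vision-based control systems; it reduces overapproximation error by eliminating latent variables and combines a dual-objective loss with Star-based reachability analysis for more precise verification.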

Details Motivation: Verifying closed-loop vision-based control systems is hard because images are high-dimensional and visual environments are difficult to model; existing generative models introduce unnecessary overapproximation error through their reliance on stochastic latent variables. Method: Designs a Deterministic World Model (DWM) that maps system states directly to generated images, avoiding uninterpretable latent variables; trains the DWM with a dual-objective loss (pixel-level reconstruction accuracy plus a control difference loss), and combines Star-based reachability analysis (StarV) with conformal prediction to derive rigorous statistical bounds on trajectory deviation. Result: Experiments on standard benchmarks show that the method yields tighter reachable sets and better verification performance than a latent-variable baseline. Conclusion: By eliminating latent variables and adding a behavioral consistency constraint, DWM improves the precision and reliability of verification for closed-loop vision systems and provides an effective tool for verifying safety-critical systems. Abstract: Verifying closed-loop vision-based control systems remains a fundamental challenge due to the high dimensionality of images and the difficulty of modeling visual environments. While generative models are increasingly used as camera surrogates in verification, their reliance on stochastic latent variables introduces unnecessary overapproximation error. To address this bottleneck, we propose a Deterministic World Model (DWM) that maps system states directly to generative images, effectively eliminating uninterpretable latent variables to ensure precise input bounds. The DWM is trained with a dual-objective loss function that combines pixel-level reconstruction accuracy with a control difference loss to maintain behavioral consistency with the real system. We integrate DWM into a verification pipeline utilizing Star-based reachability analysis (StarV) and employ conformal prediction to derive rigorous statistical bounds on the trajectory deviation between the world model and the actual vision-based system. Experiments on standard benchmarks show that our approach yields significantly tighter reachable sets and better verification performance than a latent-variable baseline.
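The conformal prediction step can be illustrated with the standard split-conformal quantile. The sketch assumes that each calibration score is the measured deviation between a world-model trajectory and the matching real-system trajectory, and that scores are exchangeable; how the paper defines its scores is not given in the abstract, and `conformal_bound` is an illustrative name.

```python
import math
from typing import Sequence


def conformal_bound(scores: Sequence[float], alpha: float) -> float:
    """Split-conformal bound on a new score.

    With n exchangeable calibration scores, the ceil((n + 1) * (1 - alpha))-th
    smallest score bounds a fresh score with probability at least 1 - alpha.
    If that rank exceeds n, no finite bound is available at this level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


# Made-up trajectory deviations from seven calibration runs.
deviations = [4.0, 1.0, 7.0, 3.0, 6.0, 2.0, 5.0]
bound = conformal_bound(deviations, alpha=0.25)  # rank = ceil(8 * 0.75) = 6
```

A bound obtained this way could be used to inflate the reachable set computed on the world model so that it covers the real system with the stated probability. The guarantee is statistical, and it becomes vacuous (infinite) when too few calibration runs are available for the chosen level.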

[47] Demo: Generative AI helps Radiotherapy Planning with User Preference

Riqiang Gao,Simon Arberet,Martin Kraus,Han Liu,Wilko FAR Verbakel,Dorin Comaniciu,Florin-Cristian Ghesu,Ali Kamen

Main category: cs.CV

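TL;DR: Proposes a generative model that produces 3D dose distributions from user-defined preference flavors without relying on reference plans, improving the personalization and adaptability of radiotherapy planning.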

Details Motivation: Existing deep learning methods are trained on reference plans, which makes them susceptible to institution-specific or planner-specific styles and limits flexibility. Method: Designs a new generative model that predicts 3D dose distributions solely from user-defined preference flavors, supporting personalized trade-offs between OARs and PTVs. Result: The method outperforms the Varian RapidPlan model in some scenarios, showing higher adaptability and plan quality. Conclusion: The model can be integrated seamlessly into clinical systems, improving the efficiency and personalization of radiotherapy planning. Abstract: Radiotherapy planning is a highly complex process that often varies significantly across institutions and individual planners. Most existing deep learning approaches for 3D dose prediction rely on reference plans as ground truth during training, which can inadvertently bias models toward specific planning styles or institutional preferences. In this study, we introduce a novel generative model that predicts 3D dose distributions based solely on user-defined preference flavors. These customizable preferences enable planners to prioritize specific trade-offs between organs-at-risk (OARs) and planning target volumes (PTVs), offering greater flexibility and personalization. Designed for seamless integration with clinical treatment planning systems, our approach assists users in generating high-quality plans efficiently. Comparative evaluations demonstrate that our method can surpass the Varian RapidPlan model in both adaptability and plan quality in some scenarios.

[48] Diffusion Model Regularized Implicit Neural Representation for CT Metal Artifact Reduction

Jie Wen,Chenhe Du,Xiao Wang,Yuyao Zhang

Main category: cs.CV

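TL;DR: Proposes a diffusion-model-regularized implicit neural representation framework for metal artifact reduction (MAR) that combines CT physical constraints with prior knowledge without requiring paired data, improving artifact removal and clinical applicability.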

Details Motivation: Existing supervised methods are unstable because they depend on limited paired metal-clean CT data, while unsupervised methods fail to incorporate CT physical geometry and sufficient prior knowledge, which limits metal artifact reduction. Method: Proposes an unsupervised MAR framework that combines an implicit neural representation with a pre-trained diffusion model: the implicit neural representation embeds the CT physical geometry to enforce data fidelity, and the diffusion model provides a strong prior that regularizes artifact removal. Result: Experiments on simulated and clinical CT data show that the method outperforms existing methods in artifact removal and generalization, particularly in preserving detail and reducing residual artifacts. Conclusion: The method effectively combines physical constraints with deep priors, achieves stable, high-quality metal artifact reduction, and has good potential for clinical use. Abstract: Computed tomography (CT) images are often severely corrupted by artifacts in the presence of metals. Existing supervised metal artifact reduction (MAR) approaches suffer from performance instability on known data due to their reliance on limited paired metal-clean data, which limits their clinical applicability. Moreover, existing unsupervised methods face two main challenges: 1) the CT physical geometry is not effectively incorporated into the MAR process to ensure data fidelity; 2) traditional heuristic regularization terms cannot fully capture the abundant prior knowledge available. To overcome these shortcomings, we propose a diffusion-model-regularized implicit neural representation framework for MAR. The implicit neural representation integrates physical constraints and imposes data fidelity, while the pre-trained diffusion model provides prior knowledge to regularize the solution. Experimental results on both simulated and clinical data demonstrate the effectiveness and generalization ability of our method, highlighting its potential to be applied to clinical settings.

[49] A Physics-Constrained, Design-Driven Methodology for Defect Dataset Generation in Optical Lithography

Yuehua Hu,Jiyeong Kong,Dong-yeol Shin,Jaekyun Kim,Kyung-Tae Kang

Main category: cs.CV

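TL;DR: Proposes a physics-constrained mathematical morphology method for generating large-scale lithography defect datasets with pixel-level annotations, addressing the shortage of high-quality training data that limits AI-based inspection in semiconductor manufacturing.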

Details Motivation: Lithography defect data from the semiconductor industry is hard to obtain, so public datasets are scarce, which limits the effectiveness of AI for defect inspection in micro/nano manufacturing. Method: Synthesizes defect layouts from scratch by applying controllable, physics-constrained mathematical morphology operations (such as erosion and dilation) to the original design layout, and fabricates them into physical samples using high-fidelity DMD lithography; optical micrographs of the defect samples are compared with their defect-free references to produce consistent pixel-level defect annotations. Result: Builds a dataset of 3,530 optical micrographs with 13,365 annotated defect instances covering four classes: bridge, burr, pinch, and contamination. Mask R-CNN reaches AP@0.5 of 0.980, 0.965, and 0.971 on the bridge, burr, and pinch classes, a mean improvement of about 34% over Faster R-CNN, and is roughly 42% higher on the contamination class. Conclusion: The proposed method for generating defect datasets with pixel-level annotations is feasible and effective, and can support robust AI-based measurement and inspection in semiconductor manufacturing. Abstract: The efficacy of Artificial Intelligence (AI) in micro/nano manufacturing is fundamentally constrained by the scarcity of high-quality and physically grounded training data for defect inspection. Lithography defect data from the semiconductor industry are rarely accessible for research use, resulting in a shortage of publicly available datasets. To address this bottleneck in lithography, this study proposes a novel methodology for generating large-scale, physically valid defect datasets with pixel-level annotations. The framework begins with the ab initio synthesis of defect layouts using controllable, physics-constrained mathematical morphology operations (erosion and dilation) applied to the original design-level layout. These synthesized layouts, together with their defect-free counterparts, are fabricated into physical samples via high-fidelity digital micromirror device (DMD)-based lithography. Optical micrographs of the synthesized defect samples and their defect-free references are then compared to create consistent defect delineation annotations. Using this methodology, we constructed a comprehensive dataset of 3,530 optical micrographs containing 13,365 annotated defect instances including four classes: bridge, burr, pinch, and contamination. Each defect instance is annotated with a pixel-accurate segmentation mask, preserving full contour and geometry.
The segmentation-based Mask R-CNN achieves AP@0.5 of 0.980, 0.965, and 0.971, compared with 0.740, 0.719, and 0.717 for Faster R-CNN on bridge, burr, and pinch classes, representing a mean AP@0.5 improvement of approximately 34%. For the contamination class, Mask R-CNN achieves an AP@0.5 roughly 42% higher than Faster R-CNN. These consistent gains demonstrate that our proposed methodology to generate defect datasets with pixel-level annotations is feasible for robust AI-based Measurement/Inspection (MI) in semiconductor fabrication.
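The "approximately 34%" figure can be checked from the six AP@0.5 values quoted above. The short calculation below assumes the figure is a relative improvement of the class-averaged AP, which the abstract does not state explicitly; the absolute gap between the means is about 0.25 AP, so a relative reading is the one that matches.

```python
# AP@0.5 values quoted in the abstract for bridge, burr, and pinch.
mask_rcnn = {"bridge": 0.980, "burr": 0.965, "pinch": 0.971}
faster_rcnn = {"bridge": 0.740, "burr": 0.719, "pinch": 0.717}

mean_mask = sum(mask_rcnn.values()) / len(mask_rcnn)        # about 0.972
mean_faster = sum(faster_rcnn.values()) / len(faster_rcnn)  # about 0.725

# Relative improvement of the class-averaged AP (assumed definition).
relative_gain = mean_mask / mean_faster - 1.0

# Alternative reading: average of the per-class relative improvements.
per_class_gain = sum(
    mask_rcnn[c] / faster_rcnn[c] - 1.0 for c in mask_rcnn
) / len(mask_rcnn)

print(f"relative gain of mean AP: {relative_gain:.1%}")
print(f"mean per-class relative gain: {per_class_gain:.1%}")
```

Both readings round to 34%, so the quoted number is consistent with the per-class values whichever of the two definitions was used.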

[50] A Survey of Body and Face Motion: Datasets, Performance Evaluation Metrics and Generative Techniques

Lownish Rai Sookha,Nikhil Pakhale,Mudasir Ganaie,Abhinav Dhall

Main category: cs.CV

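TL;DR: This paper surveys recent progress in generating face and body motion from speech, conversational context, and visual cues, and highlights future directions for improving the realism, coherence, and expressiveness of avatars in dyadic interaction settings.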

Details Motivation: Generating expressive and coherent face and body dynamics is challenging because of the complex interplay between verbal and non-verbal cues and individual personality traits. Method: Reviews the core concepts, representation techniques, generative approaches, datasets, and evaluation metrics for face and body motion generation. Result: This is the first comprehensive survey to cover both face and body motion generation, and it provides a detailed list of resources. Conclusion: Future research should aim to improve the realism, coherence, and expressiveness of avatar motion in dyadic interaction settings. Abstract: Body and face motion play an integral role in communication. They convey crucial information about the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal / non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representation techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed on https://lownish23csz0010.github.io/mogen/.

[51] Towards Lossless Ultimate Vision Token Compression for VLMs

Dehua Zheng,Mouxiao Huang,Borui Jiang,Hailin Hu,Xinghao Chen

Main category: cs.CV

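TL;DR: Proposes LUVC, a lossless visual token compression framework that adds an iterative merging mechanism to the visual encoder and a spectrum pruning unit to the large language model, enabling efficient, low-latency vision-language model inference with a 2x speedup and negligible accuracy loss; it requires no training and can be deployed across multiple VLMs.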

Details Motivation: Existing attention-based or similarity-based compression methods suffer from position bias or class imbalance and do not generalize to shallow LLM layers, causing accuracy loss; in addition, token redundancy in high-resolution images and videos hurts computational efficiency and latency. Method: Introduces an iterative merging mechanism in the visual encoder that is orthogonal in spatial axes, and integrates into the LLM a spectrum pruning unit based on an attention-free and similarity-free low-pass filter; visual tokens are compressed progressively until they are eliminated, accelerating the whole model while remaining compatible with FlashAttention. Result: Achieves a 2x speedup in language model inference with negligible accuracy loss; the method is training-free and can be deployed plug-and-play across multiple vision-language models. Conclusion: The LUVC framework effectively addresses token redundancy in vision-language models, delivering efficient, general, and lossless visual token compression that markedly speeds up inference, with good practicality and compatibility. Abstract: Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based compression algorithms suffer from either position bias or class imbalance, leading to significant accuracy degradation. They also fail to generalize to shallow LLM layers, which exhibit weaker cross-modal interactions. To address this, we extend token compression to the visual encoder through an effective iterative merging scheme that is orthogonal in spatial axes to accelerate the computation across the entire VLM. Furthermore, we integrate a spectrum pruning unit into the LLM through an attention/similarity-free low-pass filter, which gradually prunes redundant visual tokens and is fully compatible with modern FlashAttention. On this basis, we propose the Lossless Ultimate Vision tokens Compression (LUVC) framework. LUVC systematically compresses visual tokens until complete elimination at the final layer of the LLM, so that the high-dimensional visual features are gradually fused into the multimodal queries. The experiments show that LUVC achieves a 2x inference speedup in the language model with negligible accuracy degradation, and the training-free characteristic enables immediate deployment across multiple VLMs.

[52] An Approach for Detection of Entities in Dynamic Media Contents

Nzakiese Mbongo,Ngombo Armando

Main category: cs.CV

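TL;DR: Proposes a deep-learning-based method for detecting people in video that uses supervised learning algorithms to locate individuals efficiently from simple characteristics of the target person, with applications to public security systems.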

Details Motivation: Accurately detecting a specific person is challenging because video data contain many distracting objects, so an efficient and accurate method is needed to improve video surveillance and national security systems. Method: Deep learning with artificial neural networks, combined with supervised learning algorithms, is used to search for and detect a target person in video sequences. Result: Compared with existing techniques, the method locates target individuals more efficiently from private or public image bases, and its application in Angola shows potential for strengthening the national public security system. Conclusion: The proposed method has practical value in computer vision, particularly for identifying and tracking target persons from databases and video streams. Abstract: The notion of learning underlies almost every evolution of Intelligent Agents. In this paper, we present an approach for searching and detecting a given entity in a video sequence. Specifically, we study how the deep learning technique by artificial neural networks allows us to detect a character in a video sequence. The technique of detecting a character in a video is a complex field of study, considering the multitude of objects present in the data under analysis. From the results obtained, we highlight the following, compared to the state of the art: In our approach, within the field of Computer Vision, the structuring of supervised learning algorithms allowed us to achieve several successes from simple characteristics of the target character. Our results demonstrate that this new approach allows us to locate, in an efficient way, wanted individuals from a private or public image base. For the case of Angola, the classifier we propose opens the possibility of reinforcing the national security system based on the database of target individuals (disappeared, criminals, etc.) and the video sequences of the Integrated Public Security Centre (CISP).

[53] Learning to Remove Lens Flare in Event Camera

Haiqian Han,Lingdong Kong,Jianing Li,Ao Liang,Chengtao Zhu,Jiacheng Lyu,Lai Xing Ng,Xiangyang Ji,Wei Tsang Ooi,Benoit R. Cottereau

Main category: cs.CV

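TL;DR: Presents E-Deflare, the first systematic framework for removing lens flare from event camera data, comprising a physics-based model, the large-scale simulated and real-world benchmark datasets E-Flare-2.7K and E-Flare-R, and the dedicated network E-DeflareNet, which achieves state-of-the-art flare removal.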

Details Motivation: Although event cameras offer high dynamic range and temporal resolution, they remain affected by lens flare, an optical artifact that produces complex spatio-temporal distortion in event streams and has been little studied. Method: A physics-based forward model is derived to describe the non-linear suppression mechanism, and the E-Deflare Benchmark is built with a large-scale simulated training set and the first paired real-world test set; on this basis, E-DeflareNet is designed to restore event streams. Result: E-DeflareNet achieves the best flare removal on both synthetic and real data and significantly improves downstream vision tasks. Conclusion: E-Deflare offers an effective solution for event camera flare removal and advances the practical use of event cameras in scenes with complex lighting. Abstract: Event cameras have the potential to revolutionize vision systems with their high temporal resolution and dynamic range, yet they remain susceptible to lens flare, a fundamental optical artifact that causes severe degradation. In event streams, this optical artifact forms a complex, spatio-temporal distortion that has been largely overlooked. We present E-Deflare, the first systematic framework for removing lens flare from event camera data. We first establish the theoretical foundation by deriving a physics-grounded forward model of the non-linear suppression mechanism. This insight enables the creation of the E-Deflare Benchmark, a comprehensive resource featuring a large-scale simulated training set, E-Flare-2.7K, and the first-ever paired real-world test set, E-Flare-R, captured by our novel optical system. Empowered by this benchmark, we design E-DeflareNet, which achieves state-of-the-art restoration performance. Extensive experiments validate our approach and demonstrate clear benefits for downstream tasks. Code and datasets are publicly available.

[54] ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Liming Kuang,Yordanka Velikova,Mahdi Saleh,Jan-Nico Zaech,Danda Pani Paudel,Benjamin Busam

Main category: cs.CV

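TL;DR: Presents ConceptPose, a training-free and model-free object pose estimation framework that uses a vision-language model to build open-vocabulary 3D concept maps and estimates accurate 6DoF pose from 3D-3D correspondences, substantially outperforming existing methods in the zero-shot setting.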

Details Motivation: Most object pose estimation methods require extensive dataset-specific training, whereas large models show strong zero-shot capabilities; the goal is to combine the two and obtain general pose estimation without training. Method: The ConceptPose framework uses a vision-language model (VLM) to create open-vocabulary 3D concept maps derived from saliency maps, and establishes correspondences by matching 3D concept points across views to estimate the 6-degree-of-freedom relative pose. Result: State-of-the-art performance on common zero-shot relative pose estimation benchmarks, improving ADD(-S) scores over existing methods by more than 62% without any object- or dataset-specific training. Conclusion: ConceptPose achieves accurate pose estimation without training or model customization, showing the potential of combining vision-language models with classical geometric tasks. Abstract: Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.

[55] SIP: Site in Pieces- A Dataset of Disaggregated Construction-Phase 3D Scans for Semantic Segmentation and Scene Understanding

Seongyong Kim,Yong Kwon Cho

Main category: cs.CV

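TL;DR: SIP (Site in Pieces) is a public 3D point cloud dataset that reflects the practical constraints of construction sites; it contains single-station LiDAR scans of indoor and outdoor scenes with a construction-specific fine-grained taxonomy and realistic sparsity, supporting research on construction-related 3D vision tasks.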

Details Motivation: Existing 3D perception datasets are mostly built from densely fused scans with uniform sampling and complete visibility, so they do not reflect the isolated single-station LiDAR views that safety restrictions, limited access, and ongoing operations impose on real construction sites, nor the resulting radial density decay, fragmented geometry, and view-dependent visibility. Method: The SIP dataset is collected with a terrestrial LiDAR scanner in indoor and outdoor scenes of real sites and annotated at the point level using a construction-specific taxonomy with three top-level categories (A. Built Environment, B. Construction Operations, C. Site Surroundings); a standardized scanning protocol, annotation workflow, and quality control procedure are defined, and the data are released openly with a supporting code repository. Result: SIP includes structural components and slender temporary objects (such as scaffolding, MEP piping, and scissor lifts), preserves the occlusion-induced sparsity and fragmented geometry of real sites, works with modern 3D deep learning frameworks, and supports configurable class settings. Conclusion: By retaining real-world sensing characteristics, SIP provides a more realistic benchmark for 3D perception in construction environments and advances robust construction-oriented 3D vision. Abstract: Accurate 3D scene interpretation in active construction sites is essential for progress monitoring, safety assessment, and digital twin development. LiDAR is widely used in construction because it offers advantages over camera-based systems, performing reliably in cluttered and dynamically changing conditions. Yet most public datasets for 3D perception are derived from densely fused scans with uniform sampling and complete visibility, conditions that do not reflect real construction sites. Field data are often collected as isolated single-station LiDAR views, constrained by safety requirements, limited access, and ongoing operations. These factors lead to radial density decay, fragmented geometry, and view-dependent visibility, characteristics that remain underrepresented in existing datasets. This paper presents SIP, Site in Pieces, a dataset created to reflect the practical constraints of LiDAR acquisition during construction. SIP provides indoor and outdoor scenes captured with a terrestrial LiDAR scanner and annotated at the point level using a taxonomy tailored to construction environments: A. Built Environment, B. Construction Operations, and C. Site Surroundings. The dataset includes both structural components and slender temporary objects such as scaffolding, MEP piping, and scissor lifts, where sparsity caused by occlusion and fragmented geometry make segmentation particularly challenging. The scanning protocol, annotation workflow, and quality control procedures establish a consistent foundation for the dataset.
SIP is openly available with a supporting Git repository, offering adaptable class configurations that streamline adoption within modern 3D deep learning frameworks. By providing field data that retain real-world sensing characteristics, SIP enables robust benchmarking and contributes to advancing construction-oriented 3D vision tasks.

[56] KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification

Erfan Nourbakhsh,Nasrin Sanjari,Ali Nourbakhsh

Main category: cs.CV

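TL;DR: Proposes KD-OCT, a knowledge distillation framework that compresses a high-performing ConvNeXtV2-Large model into a lightweight EfficientNet-B2 model for classifying AMD- and CNV-related conditions in OCT images, retaining high accuracy while markedly improving efficiency for edge deployment.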

Details Motivation: State-of-the-art deep learning models are too computationally expensive for real-time clinical use, so efficient yet accurate models are needed to support early screening for age-related macular degeneration (AMD) and choroidal neovascularization (CNV). Method: KD-OCT distills a ConvNeXtV2-Large teacher, enhanced with advanced augmentations, stochastic weight averaging, and focal loss, into a lightweight EfficientNet-B2 student using a combined loss that balances soft-label knowledge transfer from the teacher with hard-label ground-truth supervision. Result: With patient-level cross-validation on the Noor Eye Hospital dataset, KD-OCT offers a better accuracy-efficiency balance than multi-scale or feature-fusion OCT classifiers; the student performs close to the teacher with much smaller model size and inference time, and exceeds most existing frameworks. Conclusion: KD-OCT balances high performance against low computational cost, improving the clinical practicality and edge-deployment potential of lightweight models for screening AMD and related eye diseases. Abstract: Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management. However, deploying state-of-the-art deep learning models like ConvNeXtV2-Large in clinical settings is hindered by their computational demands. Therefore, it is desirable to develop efficient models that maintain high diagnostic performance while enabling real-time deployment. In this study, a novel knowledge distillation framework, termed KD-OCT, is proposed to compress a high-performance ConvNeXtV2-Large teacher model, enhanced with advanced augmentations, stochastic weight averaging, and focal loss, into a lightweight EfficientNet-B2 student for classifying normal, drusen, and CNV cases. KD-OCT employs real-time distillation with a combined loss balancing soft teacher knowledge transfer and hard ground-truth supervision. The effectiveness of the proposed method is evaluated on the Noor Eye Hospital (NEH) dataset using patient-level cross-validation. Experimental results demonstrate that KD-OCT outperforms comparable multi-scale or feature-fusion OCT classifiers in efficiency-accuracy balance, achieving near-teacher performance with substantial reductions in model size and inference time. Despite the compression, the student model exceeds most existing frameworks, facilitating edge deployment for AMD screening. Code is available at https://github.com/erfan-nourbakhsh/KD-OCT.
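
The combined loss described in the abstract (soft teacher knowledge plus hard ground-truth supervision) can be illustrated with the widely used temperature-scaled distillation formulation. This is a minimal sketch under assumptions, not the authors' implementation: the weight `alpha`, the temperature, and the `temperature ** 2` scaling are hypothetical choices, and plain cross-entropy stands in for the hard-label term.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=4.0):
    """Weighted sum of a soft (teacher) term and a hard (label) term.

    alpha and temperature are illustrative values, not taken from the paper.
    """
    # Hard term: cross-entropy between the student prediction and the label.
    hard = -math.log(softmax(student_logits)[label])

    # Soft term: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)

    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return alpha * (temperature ** 2) * soft + (1 - alpha) * hard


# Three classes, e.g. normal / drusen / CNV; the logits are made-up numbers.
loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0)
```

Raising `alpha` makes the student follow the teacher's softened distribution more closely, while lowering it leans on the ground-truth labels.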

[57] Adaptive Thresholding for Visual Place Recognition using Negative Gaussian Mixture Statistics

Nick Trinh,Damian Lyons

Main category: cs.CV

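TL;DR: Proposes an automatic threshold selection method for visual place recognition based on negative Gaussian mixture statistics, applicable to a variety of image databases and descriptors.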

Details Motivation: In robot implementations, a manually set matching threshold is hard to adapt to a variety of visual scenarios, so the threshold should be selected automatically. Method: The matching threshold for VPR is selected automatically from the negative Gaussian mixture statistics of a place, that is, statistics of images that do not show the current place. Result: The method selects effective thresholds across a variety of image databases and image descriptors, improving the robustness of VPR in different environments. Conclusion: Automatic threshold selection based on negative-sample statistics generalizes well and addresses the difficulty of setting thresholds in VPR. Abstract: Visual place recognition (VPR) is an important component technology for camera-based mapping and navigation applications. This is a challenging problem because images of the same place may appear quite different for reasons including seasonal changes, weather, illumination, structural changes to the environment, as well as transient pedestrian or vehicle traffic. Papers focusing on generating image descriptors for VPR report their results using metrics such as recall@K and ROC curves. However, for a robot implementation, determining which matches are sufficiently good is often reduced to a manually set threshold, and it is difficult to manually select a threshold that will work for a variety of visual scenarios. This paper addresses the problem of automatically selecting a threshold for VPR by looking at the 'negative' Gaussian mixture statistics for a place, that is, image statistics indicating 'not this place'. We show that this approach can be used to select thresholds that work well for a variety of image databases and image descriptors.
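
The idea of deriving a threshold from "not this place" statistics can be sketched as follows. This is a simplified illustration under assumptions, not the paper's method: it fits a single Gaussian to the negative similarity scores instead of a Gaussian mixture, and the multiplier `k` is a hypothetical parameter.

```python
import statistics


def negative_threshold(negative_scores, k=3.0):
    """Threshold placed k standard deviations above the mean negative score.

    negative_scores: similarity scores between a place and images known
    NOT to show it (higher score = more similar). A single Gaussian is
    assumed here as a stand-in for a mixture model.
    """
    mu = statistics.fmean(negative_scores)
    sigma = statistics.pstdev(negative_scores)
    return mu + k * sigma


def is_match(score, negative_scores, k=3.0):
    """Accept a candidate only if it clearly exceeds the negative statistics."""
    return score > negative_threshold(negative_scores, k)


# Made-up similarity scores for images of other places.
negatives = [0.10, 0.20, 0.30]
accepted = is_match(0.90, negatives)   # well above the negatives
rejected = is_match(0.30, negatives)   # inside the negative range
```

Because the threshold is computed per place from its own negatives, it adapts to descriptors and databases whose score ranges differ, which is the property the abstract emphasizes.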

[58] AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models

Arman Zarei,Jiacheng Pan,Matthew Gwilliam,Soheil Feizi,Zhenheng Yang

Main category: cs.CV

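TL;DR: Presents AgentComp, a framework that uses the reasoning and tool-use capabilities of large language models to construct compositional datasets automatically and fine-tunes text-to-image models with agentic preference optimization to strengthen compositional generation, achieving state-of-the-art results on compositionality benchmarks while preserving image quality.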

Details Motivation: Existing text-to-image models fall short on compositionality (object relationships, attribute binding, and fine-grained details) because they are not explicitly trained to distinguish compositionally similar prompts and images. Method: The AgentComp framework uses large language models equipped with image generation, editing, and visual question answering tools to construct compositional datasets autonomously, and fine-tunes text-to-image models with an agentic preference optimization method. Result: AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench, preserves image quality, and generalizes to capabilities it was not explicitly trained for, such as text rendering. Conclusion: AgentComp improves the compositional generation ability of text-to-image models, addressing deviations in fine-grained details without sacrificing image quality. Abstract: Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality, that is, accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench without compromising image quality (a common drawback in prior approaches), and even generalizes to other capabilities not explicitly trained for, such as text rendering.

[59] Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

Mizanur Rahman Jewel,Mohamed Elmahallawy,Sanjay Madria,Samuel Frimpong

Main category: cs.CV

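TL;DR: Proposes MDSE, a vision-language framework for generating textual descriptions of underground mine disaster scenes; it combines context-aware cross-attention, segmentation-aware dual-pathway encoding, and an efficient language model to produce robust and accurate scene explanations in harsh conditions.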

Details Motivation: Darkness, dust, and collapses in underground mine disasters obstruct vision, and conventional systems struggle to provide situational awareness, so technology that can understand and describe complex post-disaster scenes is urgently needed. Method: The MDSE framework comprises a context-aware cross-attention mechanism, a segmentation-aware dual-pathway visual encoder, and a resource-efficient Transformer language model; UMD, the first image-caption dataset of real underground mine disasters, is built for training and evaluation. Result: MDSE substantially outperforms state-of-the-art captioning models on UMD and related benchmarks, generating more accurate and contextually relevant post-disaster descriptions that capture key details in obscured environments. Conclusion: MDSE provides a strong multimodal scene-understanding tool for underground emergency response and improves situational awareness in complex environments. Abstract: Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. To address this, we propose MDSE, Multimodal Disaster Situation Explainer, a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE offers three innovations: (i) Context-Aware Cross-Attention for robust alignment of visual and textual features even under severe degradation; (ii) Segmentation-aware dual pathway visual encoding that fuses global and region-specific embeddings; and (iii) Resource-Efficient Transformer-Based Language Model for expressive caption generation with minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster scenes, enabling rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments, improving situational awareness for underground emergency response. The code is at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.

[60] Food Image Generation on Multi-Noun Categories

Xinyue Pan,Yuhao Chen,Jiangpeng He,Fengqing Zhu

Main category: cs.CV

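TL;DR: Proposes FoCULR to improve image generation for multi-noun food categories, using food domain knowledge and early introduction of core concepts to correct semantic misinterpretation and layout errors.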

Details Motivation: Multi-noun food categories are common in real-world datasets, but existing generative models often misinterpret their semantics and produce wrong ingredients or object layouts. Method: FoCULR (Food Category Understanding and Layout Refinement) incorporates food domain knowledge and introduces core concepts early in generation to correct semantics and spatial layout. Result: Experiments show that the proposed method improves the quality and accuracy of food image generation. Conclusion: Incorporating domain knowledge and improving multi-noun understanding in the text encoder can markedly improve generation for multi-noun food categories. Abstract: Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt "egg noodle" may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. This is due to insufficient multi-noun category related knowledge in the text encoder and misinterpretation of multi-noun relationships, leading to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement) which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that the integration of these techniques improves image generation performance in the food domain.

[61] GimbalDiffusion: Gravity-Aware Camera Control for Video Generation

Frédéric Fortier-Chouinard,Yannick Hold-Geoffroy,Valentin Deschaintre,Matheus Gadelha,Jean-François Lalonde

Main category: cs.CV

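TL;DR: GimbalDiffusion is a text-to-video generation framework grounded in physical-world coordinates that uses gravity as a global reference for precise camera motion control; it supports absolute camera trajectories without an initial frame and uses panoramic video data and a new annotation strategy to improve model robustness and evaluation.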

Details Motivation: Existing text-to-video methods lack fine-grained, explicit geometric control over camera motion; they typically rely on relative or ambiguous motion representations, which makes interpretable and precise camera control difficult. Method: The GimbalDiffusion framework defines camera trajectories in an absolute coordinate system referenced to gravity, builds diverse motion data from 360-degree panoramic videos, introduces null-pitch conditioning to reduce conflicts between text and camera instructions, and establishes a new benchmark for evaluating performance under wide pitch variation. Result: More precise and interpretable camera control, with markedly better consistency and robustness under complex camera motion; the new benchmark enables systematic evaluation of camera-aware video generation. Conclusion: Through a physically aligned coordinate system and new training strategies, GimbalDiffusion improves the controllability and stability of camera motion in text-to-video models and points to a new direction for controllable generation. Abstract: Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over camera motion and orientation remains elusive. Existing approaches typically encode camera trajectories through relative or ambiguous representations, limiting explicit geometric control. We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing precise and interpretable control over camera parameters without requiring an initial reference frame. We leverage panoramic 360-degree videos to construct a wide variety of camera trajectories, well beyond the predominantly straight, forward-facing trajectories seen in conventional video data. To further enhance camera guidance, we introduce null-pitch conditioning, an annotation strategy that reduces the model's reliance on text content when conflicting with camera specifications (e.g., generating grass while the camera points towards the sky). Finally, we establish a benchmark for camera-aware video generation by rebalancing SpatialVID-HQ for comprehensive evaluation under wide camera pitch variation. Together, these contributions advance the controllability and robustness of text-to-video models, enabling precise, gravity-aligned camera manipulation within generative frameworks.
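
Using gravity as a global reference means a camera's pitch can be stated in absolute terms rather than relative to a previous frame. The sketch below shows one way to compute such a pitch angle; the vector conventions (a world "up" vector opposite to gravity and a camera forward vector) and the function name are assumptions for illustration, not the paper's parameterization.

```python
import math


def absolute_pitch_degrees(forward, up=(0.0, 1.0, 0.0)):
    """Angle of the camera's viewing direction above the horizon, in degrees.

    forward: camera viewing direction in world coordinates.
    up: world direction opposite to gravity (assumed convention).
    Positive values look towards the sky, negative towards the ground.
    """
    f_norm = math.sqrt(sum(c * c for c in forward))
    u_norm = math.sqrt(sum(c * c for c in up))
    if f_norm == 0.0 or u_norm == 0.0:
        raise ValueError("forward and up must be non-zero vectors")
    # Cosine of the angle between the viewing direction and world up.
    dot = sum(f * u for f, u in zip(forward, up)) / (f_norm * u_norm)
    # Clamp against floating-point drift before taking the arcsine.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.asin(dot))


level = absolute_pitch_degrees((0.0, 0.0, 1.0))      # looking at the horizon
skyward = absolute_pitch_degrees((0.0, 1.0, 1.0))    # tilted up
downward = absolute_pitch_degrees((0.0, -1.0, 0.0))  # looking straight down
```

Because the angle depends only on the current viewing direction and the gravity axis, no initial reference frame is needed to interpret it.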

[62] SuperF: Neural Implicit Fields for Multi-Image Super-Resolution

Sander Riisøen Jyhne,Christian Igel,Morten Goodwin,Per-Arne Andersen,Serge Belongie,Nico Lang

Main category: cs.CV

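TL;DR: Presents SuperF, a test-time optimization method for multi-image super-resolution (MISR) that uses coordinate-based neural networks (neural fields) to reconstruct high-quality images without high-resolution training data.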

Details Motivation: High-resolution imaging is often limited by sensor technology, atmospheric conditions, and cost, and single-image super-resolution tends to produce "hallucinated" structures that do not match reality, so a more reliable super-resolution method is needed. Method: SuperF shares one implicit neural representation (INR) across multiple low-resolution frames, jointly optimizes frame alignment and the INR, models sub-pixel alignment with optimizable affine transformation parameters, and optimizes on a super-sampled coordinate grid that corresponds to the output resolution. Result: On simulated bursts of satellite imagery and handheld camera images, SuperF achieves upsampling factors of up to 8 and outperforms related INR baselines without relying on any high-resolution training data. Conclusion: SuperF is an effective multi-image super-resolution method that needs no high-resolution training data; jointly optimizing the alignment and the neural field improves both the fidelity and the resolution of the reconstruction. Abstract: High-resolution imagery is often hindered by limitations in sensor technology, atmospheric conditions, and costs. Such challenges occur in satellite remote sensing, but also with handheld cameras, such as our smartphones. Hence, super-resolution aims to enhance the image resolution algorithmically. Since single-image super-resolution requires solving an inverse problem, such methods must exploit strong priors, e.g. learned from high-resolution training data, or be constrained by auxiliary data, e.g. by a high-resolution guide from another modality. While qualitatively pleasing, such approaches often lead to "hallucinated" structures that do not match reality. In contrast, multi-image super-resolution (MISR) aims to improve the (optical) resolution by constraining the super-resolution process with multiple views taken with sub-pixel shifts. Here, we propose SuperF, a test-time optimization approach for MISR that leverages coordinate-based neural networks, also called neural fields. Their ability to represent continuous signals with an implicit neural representation (INR) makes them an ideal fit for the MISR task. The key characteristic of our approach is to share an INR for multiple shifted low-resolution frames and to jointly optimize the frame alignment with the INR. Our approach advances related INR baselines, adopted from burst fusion for layer separation, by directly parameterizing the sub-pixel alignment as optimizable affine transformation parameters and by optimizing via a super-sampled coordinate grid that corresponds to the output resolution.
Our experiments yield compelling results on simulated bursts of satellite imagery and ground-level images from handheld cameras, with upsampling factors of up to 8. A key advantage of SuperF is that this approach does not rely on any high-resolution training data.
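
The two ingredients named in the abstract, a super-sampled coordinate grid and per-frame affine alignment parameters, can be illustrated with plain coordinate arithmetic. This is a sketch under assumptions: the pixel-centre convention, the six-parameter layout `(a, b, tx, c, d, ty)`, and the function names are hypothetical, and the neural field and its optimization are omitted.

```python
def supersampled_grid(height, width, factor):
    """Centres of the output pixels, expressed in low-resolution pixel units.

    The centre of low-resolution pixel (row 0, col 0) is taken to be (0, 0),
    so each low-resolution pixel is covered by factor * factor sample points.
    """
    return [((col + 0.5) / factor - 0.5, (row + 0.5) / factor - 0.5)
            for row in range(height * factor)
            for col in range(width * factor)]


def apply_affine(points, params):
    """Map (x, y) points through x' = a*x + b*y + tx, y' = c*x + d*y + ty."""
    a, b, tx, c, d, ty = params
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# One low-resolution pixel upsampled by 2 gives four sample coordinates.
grid = supersampled_grid(1, 1, 2)

# A frame displaced by a quarter pixel horizontally (illustrative values).
# In a joint optimization these six numbers would be free parameters per frame.
shifted = apply_affine(grid, (1.0, 0.0, 0.25, 0.0, 1.0, 0.0))
```

Querying a shared continuous representation at the transformed coordinates of each frame is what lets sub-pixel shifts between frames contribute complementary samples.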

[63] Integrated Pipeline for Coronary Angiography With Automated Lesion Profiling, Virtual Stenting, and 100-Vessel FFR Validation

Georgy Kopanitsa,Oleg Metsker,Alexey Yakovlev

Main category: cs.CV

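TL;DR: AngioAI-QFR is a fully automated, end-to-end analysis pipeline based on coronary angiography that combines deep learning with functional vessel assessment to provide wire-free quantitative flow ratio (QFR) computation, Relative Flow Capacity (RFC) profiling, and virtual stenting, with high accuracy, fast turnaround, and strong clinical practicality.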

Details Motivation: Visual grading of stenosis in conventional coronary angiography is variable and only moderately related to ischaemia; FFR improves lesion selection but is not used systematically because of procedural complexity; existing QFR tools are workflow-intensive and often separate from anatomy analysis and virtual PCI, so an integrated, automated solution combining anatomical and functional assessment is needed. Method: AngioAI-QFR uses deep learning for automatic stenosis detection, lumen segmentation, and centreline and diameter extraction, and computes per-millimetre Relative Flow Capacity (RFC) and QFR; QFR is recomputed automatically after virtual stenting; the system is validated in 100 consecutive vessels against invasive FFR, assessing correlation with FFR, mean absolute error (MAE), and diagnostic performance for FFR <= 0.80. Result: On held-out frames, stenosis detection precision was 0.97 and the lumen segmentation Dice coefficient was 0.78; across 100 vessels, AngioAI-QFR correlated strongly with FFR (r = 0.89, MAE = 0.045), with an AUC of 0.93 for FFR <= 0.80 (sensitivity 0.88, specificity 0.86); 93% of cases completed fully automatically, with a median processing time of 41 s; RFC profiling distinguished focal from diffuse flow limitation, and virtual stenting showed larger QFR gain in focal lesions. Conclusion: AngioAI-QFR is a practical, near-real-time integrated platform that combines computer vision, functional assessment, and virtual PCI planning to deliver automated physiology from angiographic images alone, with the potential to improve the efficiency and precision of clinical decisions. Abstract: Coronary angiography is the main tool for assessing coronary artery disease, but visual grading of stenosis is variable and only moderately related to ischaemia. Wire-based fractional flow reserve (FFR) improves lesion selection but is not used systematically. Angiography-derived indices such as quantitative flow ratio (QFR) offer wire-free physiology, yet many tools are workflow intensive and separate from automated anatomy analysis and virtual PCI planning. We developed AngioAI-QFR, an end-to-end, angiography-only pipeline combining deep learning stenosis detection, lumen segmentation, centreline and diameter extraction, per-millimetre Relative Flow Capacity profiling, and virtual stenting with automatic recomputation of angiography-derived QFR. The system was evaluated in 100 consecutive vessels with invasive FFR as reference. Primary endpoints were agreement with FFR (correlation, mean absolute error) and diagnostic performance for FFR <= 0.80. On held-out frames, stenosis detection achieved precision 0.97 and lumen segmentation Dice 0.78. Across 100 vessels, AngioAI-QFR correlated strongly with FFR (r = 0.89, MAE 0.045). The AUC for detecting FFR <= 0.80 was 0.93, with sensitivity 0.88 and specificity 0.86. The pipeline completed fully automatically in 93 percent of vessels, with median time to result 41 s.
RFC profiling distinguished focal from diffuse capacity loss, and virtual stenting predicted larger QFR gain in focal than in diffuse disease. AngioAI-QFR provides a practical, near-real-time pipeline that unifies computer vision, functional profiling, and virtual PCI with automated angiography-derived physiology.
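
The agreement and diagnostic endpoints named in the abstract (mean absolute error, and sensitivity and specificity at the FFR <= 0.80 cutoff) can be computed as sketched below. The function names and the four toy vessel values are hypothetical and are not data from the study.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired measurements."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def sensitivity_specificity(predicted, reference, cutoff=0.80):
    """Diagnostic performance when values <= cutoff count as positive."""
    tp = fn = tn = fp = 0
    for p, r in zip(predicted, reference):
        reference_positive = r <= cutoff
        predicted_positive = p <= cutoff
        if reference_positive and predicted_positive:
            tp += 1
        elif reference_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference")
    return tp / (tp + fn), tn / (tn + fp)


# Toy values only: invasive FFR as reference, angiography-derived QFR as prediction.
ffr = [0.70, 0.75, 0.85, 0.90]
qfr = [0.72, 0.82, 0.84, 0.79]

mae = mean_absolute_error(qfr, ffr)
sensitivity, specificity = sensitivity_specificity(qfr, ffr)
```

Sensitivity and specificity depend on the chosen cutoff, whereas the AUC reported in the abstract summarizes performance across all cutoffs.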

[64] GTAvatar: Bridging Gaussian Splatting and Texture Mapping for Relightable and Editable Gaussian Avatars

Kelian Baert,Mae Younes,Francois Bourel,Marc Christie,Adnane Boukhayma

Main category: cs.CV

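TL;DR: Proposes a method that combines 2D Gaussian Splatting with UV texture mapping to reconstruct editable, high-quality head avatar material textures efficiently from monocular video, supporting physically based relighting and intuitive appearance editing.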

Details Motivation: Gaussian Splatting reconstructs photorealistic head avatars with high accuracy but lacks the editability of traditional mesh-based methods, which limits its flexibility in practice. Method: The local frame of each canonical Gaussian primitive is embedded into the UV space of a template mesh in a computationally efficient manner, so that continuous, editable material textures are reconstructed on a conventional UV domain; an efficient physically based reflectance model supports relighting and material editing. Result: Experiments show that the method outperforms state-of-the-art approaches in reconstruction accuracy, relighting quality, and intuitiveness of editing, and allows appearance and geometry to be modified without additional optimization. Conclusion: The method combines the high-fidelity reconstruction of Gaussian Splatting with the editability of UV texture mapping, offering an avatar modelling approach that is both high-quality and practical. Abstract: Recent advancements in Gaussian Splatting have enabled increasingly accurate reconstruction of photorealistic head avatars, opening the door to numerous applications in visual effects, videoconferencing, and virtual reality. This, however, comes at the cost of the intuitive editability offered by traditional triangle mesh-based methods. In contrast, we propose a method that combines the accuracy and fidelity of 2D Gaussian Splatting with the intuitiveness of UV texture mapping. By embedding each canonical Gaussian primitive's local frame into a patch in the UV space of a template mesh in a computationally efficient manner, we reconstruct continuous editable material head textures from a single monocular video on a conventional UV domain. Furthermore, we leverage an efficient physically based reflectance model to enable relighting and editing of these intrinsic material maps. Through extensive comparisons with state-of-the-art methods, we demonstrate the accuracy of our reconstructions, the quality of our relighting results, and the ability to provide intuitive controls for modifying an avatar's appearance and geometry via texture mapping without additional optimization.

[65] WonderZoom: Multi-Scale 3D World Generation

Jin Cao,Hong-Xing Yu,Jiajun Wu

Main category: cs.CV

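TL;DR: WonderZoom is a new method for generating 3D scenes spanning multiple spatial scales from a single image, using scale-adaptive Gaussian surfels and a progressive detail synthesizer for multi-scale content generation and real-time rendering.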

Details Motivation: Existing 3D world generation models are limited to single-scale synthesis, cannot generate coherent scene content at different granularities, and lack a scale-aware 3D representation that handles widely varying spatial sizes. Method: Two key techniques are proposed: (1) scale-adaptive Gaussian surfels for generating and rendering multi-scale 3D scenes in real time, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D content. Result: Experiments show that WonderZoom significantly outperforms state-of-the-art video and 3D generation models in quality and alignment, supports creating multi-scale 3D worlds from a single image, and comes with an interactive viewer and video results. Conclusion: WonderZoom achieves multi-scale 3D scene generation from a single image, letting users "zoom into" a region and auto-regressively synthesize fine content from landscapes down to microscopic details. Abstract: We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with widely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to "zoom into" a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds at https://wonderzoom.github.io/

[66] Prompt-Based Continual Compositional Zero-Shot Learning

Sauda Maryam,Sara Nadeem,Faisal Qureshi,Mohsen Ali

Main category: cs.CV

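TL;DR: Presents PromptCCZSL, a prompt-based continual compositional zero-shot learning framework that keeps adapting to new attributes, objects, and their compositions while preventing forgetting of earlier knowledge. Built on a frozen vision-language model backbone, it uses multi-teacher distillation and several regularization losses (Cosine Anchor Loss, Orthogonal Projection Loss, and Intra-Session Diversity Loss) to retain knowledge and learn discriminative representations, outperforming existing methods on UT-Zappos and C-GQA.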

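Details Motivation: In compositional zero-shot learning (CZSL), conventional continual learning methods struggle because attributes and objects recur across sessions while compositions remain unique, and they are prone to catastrophic forgetting; a continual learning framework is needed that maintains semantic consistency while avoiding representation overlap. Method: The PromptCCZSL framework (1) uses a frozen vision-language model as backbone; (2) designs session-aware compositional prompts that fuse multimodal features for new compositions; (3) learns session-agnostic attribute and object prompts to maintain global semantic consistency; (4) introduces a Cosine Anchor Loss (CAL) to preserve prior knowledge; and (5) uses an Orthogonal Projection Loss (OPL) and an Intra-Session Diversity Loss (IDL) to increase the distinctness and diversity of current-session embeddings. Result: Experiments on the UT-Zappos and C-GQA benchmarks show that PromptCCZSL significantly outperforms earlier VLM-based and non-VLM methods, sets a new state of the art in the closed-world setting, and balances the suppression of catastrophic forgetting against compositional generalization. Conclusion: PromptCCZSL is the first prompt learning framework for continual compositional zero-shot learning; by combining prompt design with multiple losses, it achieves good knowledge retention and adaptation and offers a new direction for continual learning in CZSL. Abstract: We tackle continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL), while preventing forgetting of prior knowledge. Unlike classical continual learning where classes are disjoint, continual CZSL (CCZSL) is more complex as attributes and objects may reoccur across sessions while compositions remain unique. Built on a frozen VLM backbone, we propose the first Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework that retains prior knowledge through recency-weighted multi-teacher distillation. It employs session-aware compositional prompts to fuse multimodal features for new compositions, while attribute and object prompts are learned through session-agnostic fusion to maintain global semantic consistency, which is further stabilized by a Cosine Anchor Loss (CAL) to preserve prior knowledge. To enhance adaptation in the current session, an Orthogonal Projection Loss (OPL) ensures that new attribute and object embeddings remain distinct from previous ones, preventing overlap, while an Intra-Session Diversity Loss (IDL) promotes variation among current-session embeddings for richer, more discriminative representations. We also introduce a comprehensive protocol that jointly measures catastrophic forgetting and compositional generalization.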
Extensive experiments on UT-Zappos and C-GQA benchmarks demonstrate that PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.
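
The roles of the anchoring and orthogonality terms can be illustrated with simple cosine-based penalties. The abstract does not give the exact formulas, so the two functions below are assumed stand-ins that only mirror the stated intent (stay close to earlier embeddings of recurring concepts, keep new embeddings distinct from old ones); they are not the paper's CAL and OPL definitions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def cosine_anchor_penalty(current, anchor):
    """Zero when a recurring concept keeps the direction of its old embedding."""
    return 1.0 - cosine(current, anchor)


def orthogonality_penalty(new_embeddings, old_embeddings):
    """Mean squared cosine between new and old embeddings.

    Zero when every new embedding is orthogonal to every old one.
    """
    pairs = [(n, o) for n in new_embeddings for o in old_embeddings]
    return sum(cosine(n, o) ** 2 for n, o in pairs) / len(pairs)


# Made-up 2-D embeddings for illustration.
drift = cosine_anchor_penalty([1.0, 0.2], [1.0, 0.0])      # small positive value
overlap = orthogonality_penalty([[0.0, 1.0]], [[1.0, 0.0]])  # orthogonal pair
```

The two penalties pull in different directions by design: one limits drift for concepts seen before, the other discourages new concepts from collapsing onto old ones.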

[67] Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

Hao Chen,Rui Yin,Yifan Chen,Qi Chen,Chao Li

Main category: cs.CV

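TL;DR: Proposes the Δ-LFM framework, which uses Flow Matching to model the continuity and monotonicity of patient-specific disease progression and learns aligned trajectories in latent space to improve interpretability and semantic consistency.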

Details Motivation: Existing generative models of disease progression produce discontinuous latent representations without semantic structure, and so fail to reflect the inherently continuous and monotonic dynamics of disease. Method: Disease dynamics are treated as a velocity field, Flow Matching is used to align the temporal evolution of patient data, and a patient-specific latent alignment mechanism makes trajectories vary monotonically along a specific axis tied to clinical-severity indicators. Result: Strong performance on three longitudinal MRI benchmarks, with a more consistent and semantically meaningful latent space that supports visualizing and interpreting disease dynamics. Conclusion: Δ-LFM provides a new interpretable framework for modelling disease progression that combines continuous dynamics with clinical semantic alignment and has potential clinical value. Abstract: Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
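
The velocity-field view described in the abstract can be written down with the generic flow matching objective. The following is a sketch under assumptions rather than the paper's stated formulation: $z_0$ and $z_1$ denote latent codes of an earlier and a later scan of the same patient, $v_\theta$ is a learned velocity network, and the straight-line interpolation path is the common rectified-flow-style choice.

```latex
\begin{aligned}
z_t &= (1 - t)\, z_0 + t\, z_1, \qquad t \in [0, 1], \\
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}_{t,\,(z_0, z_1)}
  \left\| v_\theta(z_t, t) - (z_1 - z_0) \right\|^2 .
\end{aligned}
```

Under this path the regression target $z_1 - z_0$ is constant along each trajectory, which fits the description of patient trajectories lying along a specific axis with magnitude growing with severity.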

[68] Rethinking Chain-of-Thought Reasoning for Videos

Yiwu Zhong,Zi-Yuan Hu,Yin Li,Liwei Wang

Main category: cs.CV

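TL;DR: Proposes an efficient reasoning framework for video multimodal large language models that compresses visual tokens and generates short reasoning chains, markedly improving inference efficiency while keeping performance competitive, without manual annotations or supervised fine-tuning.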

Details Motivation: Existing video reasoning models rely on long reasoning chains and large numbers of visual tokens; the authors hypothesize that concise reasoning with fewer visual tokens may already be sufficient. Method: Designs and validates an efficient post-training and inference framework that lets the model operate on compressed visual tokens and generate a brief reasoning trace before answering. Result: The models perform competitively on multiple benchmarks with substantially improved inference efficiency, without relying on manually annotated chains of thought or supervised fine-tuning. Conclusion: Long, human-like chain-of-thought reasoning may not be necessary for video reasoning; concise reasoning can be both efficient and effective. Abstract: Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.

[69] View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs

Yuanyuan Liu,Haiyang Mei,Dongyang Zhan,Jiayue Zhao,Dongsheng Zhou,Bo Dong,Xin Yang

Main category: cs.CV

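TL;DR: Proposes a new VLM x SI paradigm that externalizes 3D spatial information into a structured scene graph that can be accessed selectively, enabling more efficient and interpretable zero-shot 3D visual grounding.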

Details Motivation: Existing methods that combine 3D spatial information with 2D vision-language models produce entangled visual representations, which makes spatial semantic relationships hard to exploit and limits zero-shot 3D visual grounding performance. Method: Proposes View-on-Graph (VoG), which organizes the scene into a multi-modal, multi-layer scene graph and lets the VLM act as an active agent that traverses the graph and selectively retrieves the cues it needs for reasoning. Result: VoG achieves state-of-the-art zero-shot 3D visual grounding performance across experiments, clearly outperforming prior VLM + SI methods. Conclusion: Structured scene exploration strategies such as VoG lower the VLM's reasoning difficulty and provide transparent reasoning traces, making them a promising direction for zero-shot 3DVG. Abstract: 3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.

[70] MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI

Fengli Wu,Vaidehi Patil,Jaehong Yoon,Yue Zhang,Mohit Bansal

Main category: cs.CV

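TL;DR: Presents MedForget, a hierarchy-aware multimodal unlearning testbed for evaluating selective unlearning methods in medical large models, motivated by privacy regulations such as HIPAA.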

Details Motivation: Medical data is highly sensitive, and compliance requirements such as the "right to be forgotten" under HIPAA and GDPR call for research on how to effectively remove the influence of specific training data from multimodal medical large models. Method: Builds the MedForget testbed with a nested hierarchy (Institution -> Patient -> Study -> Section), containing 3840 multimodal samples with unlearning targets across eight organizational levels; evaluates four state-of-the-art unlearning methods on generation, classification, and cloze tasks, and proposes a reconstruction attack that progressively adds hierarchical context to test how thorough the unlearning is. Result: Existing unlearning methods struggle to achieve complete, hierarchy-aware forgetting without harming diagnostic performance; coarse-grained unlearning resists the reconstruction attack well, while fine-grained unlearning still leaks information. Conclusion: MedForget provides a practical evaluation benchmark for building privacy-compliant medical AI systems and exposes the limits of current unlearning techniques in complex medical settings. Abstract: Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.

[71] Enabling Next-Generation Consumer Experience with Feature Coding for Machines

Md Eimran Hossain Eimon,Juan Merlos,Ashan Perera,Hari Kalva,Velibor Adzic,Borko Furht

Main category: cs.CV

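TL;DR: Introduces FCM, the latest feature coding standard in MPEG-AI, which supports efficient extraction, compression, and transmission of intermediate neural network features for AI applications and significantly reduces the computational burden on low-power devices.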

Details Motivation: As smart, connected devices proliferate, efficient data transfer for machine tasks is needed so that resource-constrained devices can run large deep learning models. Method: Presents the standardized Feature Coding for Machines (FCM) framework, which extracts and compresses intermediate neural network features so that computationally intensive operations can be offloaded to high-compute servers. Result: Experiments indicate that, compared with remote inference, FCM reduces bitrate requirements by 75.90% at the same accuracy. Conclusion: The FCM standard provides an efficient data transfer solution for AI-driven applications and makes complex model inference more feasible and efficient on low-power devices. Abstract: As consumer devices become increasingly intelligent and interconnected, efficient data transfer solutions for machine tasks have become essential. This paper presents an overview of the latest Feature Coding for Machines (FCM) standard, part of MPEG-AI and developed by the Moving Picture Experts Group (MPEG). FCM supports AI-driven applications by enabling the efficient extraction, compression, and transmission of intermediate neural network features. By offloading computationally intensive operations to base servers with high computing resources, FCM allows low-powered devices to leverage large deep learning models. Experimental results indicate that the FCM standard maintains the same level of accuracy while reducing bitrate requirements by 75.90% compared to remote inference.

[72] Efficient Feature Compression for Machines with Global Statistics Preservation

Md Eimran Hossain Eimon,Hyomin Choi,Fabien Racapé,Mateen Ulhaq,Velibor Adzic,Hari Kalva,Borko Furht

Main category: cs.CV

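TL;DR: Proposes a feature compression method based on Z-score normalization for the split-inference paradigm, reducing the bit overhead of transmitting intermediate features and improving task accuracy.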

Details Motivation: In split inference, efficient compression of intermediate feature data is critical for reducing transmission overhead, and existing methods suffer from redundant bits and accuracy loss. Method: Uses Z-score normalization to efficiently recover compressed feature data at the decoder side, integrated into the FCM codec standard under development by MPEG in place of the existing scaling method; also proposes a simplified variant that further reduces overhead in certain cases. Result: The method reduces bitrate by 17.09% on average across tasks and by up to 65.69% for object tracking, without sacrificing task accuracy. Conclusion: Z-score normalization outperforms the existing method for feature compression, reducing transmission overhead while maintaining or improving AI task performance, with potential for adoption in the standard. Abstract: The split-inference paradigm divides an artificial intelligence (AI) model into two parts. This necessitates the transfer of intermediate feature data between the two halves. Here, effective compression of the feature data becomes vital. In this paper, we employ Z-score normalization to efficiently recover the compressed feature data at the decoder side. To examine the efficacy of our method, the proposed method is integrated into the latest Feature Coding for Machines (FCM) codec standard under development by the Moving Picture Experts Group (MPEG). Our method supersedes the existing scaling method used by the current standard under development. It both reduces the overhead bits and improves the end-task accuracy. To further reduce the overhead in certain circumstances, we also propose a simplified method. Experiments show that our proposed method achieves a 17.09% reduction in bitrate on average across different tasks and up to 65.69% for object tracking without sacrificing task accuracy.
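
To make the Z-score step concrete, here is a minimal sketch of normalizing a feature tensor before coding and restoring it at the decoder. It assumes a single global mean and standard deviation are sent as side information; the function names and that signaling choice are illustrative assumptions, not the FCM syntax or the authors' exact method.

```python
import numpy as np

def zscore_encode(features, eps=1e-8):
    """Normalize features to zero mean and unit variance; return the statistics too."""
    mean = float(features.mean())
    std = float(features.std())
    normalized = (features - mean) / (std + eps)
    return normalized, mean, std  # mean and std act as side information (assumption)

def zscore_decode(normalized, mean, std, eps=1e-8):
    """Invert zscore_encode using the transmitted statistics."""
    return normalized * (std + eps) + mean

# Toy feature map (hypothetical shape: channels x height x width).
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=5.0, size=(16, 8, 8))
normalized, mean, std = zscore_encode(features)
restored = zscore_decode(normalized, mean, std)
```

In a real codec the normalized tensor would be quantized and entropy-coded between the two calls, so the reconstruction would be approximate rather than exact.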

[73] A Clinically Interpretable Deep CNN Framework for Early Chronic Kidney Disease Prediction Using Grad-CAM-Based Explainable AI

Anas Bin Ayub,Nilima Sultana Niha,Md. Zahurul Haque

Main category: cs.CV

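TL;DR: Proposes a deep convolutional neural network (CNN) combined with SMOTE and Grad-CAM for early detection of chronic kidney disease (CKD) from CT images, reporting 100% classification accuracy on the CT KIDNEY DATASET.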

Details Motivation: Chronic kidney disease (CKD) is a major global healthcare burden, and early detection is critical for clinical management, but existing diagnostic approaches still need better accuracy and efficiency. Method: Uses a deep CNN for classification, SMOTE for class balancing, and Grad-CAM for interpretability, on a public dataset of 12,446 CT images. Result: The model reports 100% classification accuracy on the CT KIDNEY DATASET. Conclusion: The method performs very well for early CKD detection and shows potential for clinical use in improving diagnostic efficiency and early intervention. Abstract: Chronic Kidney Disease (CKD) constitutes a major global medical burden, marked by the gradual deterioration of renal function, which results in the impaired clearance of metabolic waste and disturbances in systemic fluid homeostasis. Owing to its substantial contribution to worldwide morbidity and mortality, the development of reliable and efficient diagnostic approaches is critically important to facilitate early detection and prompt clinical management. This study presents a deep convolutional neural network (CNN) for early CKD detection from CT kidney images, complemented by class balancing using Synthetic Minority Over-sampling Technique (SMOTE) and interpretability via Gradient-weighted Class Activation Mapping (Grad-CAM). The model was trained and evaluated on the CT KIDNEY DATASET, which contains 12,446 CT images, including 3,709 cyst, 5,077 normal, 1,377 stone, and 2,283 tumor cases. The proposed deep CNN achieved a remarkable classification performance, attaining 100% accuracy in the early detection of CKD. This significant advancement demonstrates strong potential for addressing critical clinical diagnostic challenges and enhancing early medical intervention strategies.
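
For readers unfamiliar with SMOTE, the sketch below generates one synthetic minority sample by interpolating between a minority point and one of its k nearest minority neighbors. It is a generic illustration of the technique on toy 2-D vectors; the function name and data are hypothetical, and the abstract does not say how SMOTE was applied to the CT images.

```python
import numpy as np

def smote_sample(minority, k, rng):
    """Create one synthetic sample from minority-class points of shape (n, dim).

    Requires 1 <= k < n so that each point has k neighbors other than itself.
    """
    i = rng.integers(len(minority))                    # pick a random minority point
    distances = np.linalg.norm(minority - minority[i], axis=1)
    neighbors = np.argsort(distances)[1:k + 1]         # skip index 0, the point itself
    j = rng.choice(neighbors)                          # pick one of its k nearest neighbors
    lam = rng.uniform()                                # interpolation factor in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

# Toy minority class (hypothetical feature vectors).
rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_sample(minority, k=3, rng=rng)
```

Because each synthetic point lies on a segment between two real minority points, it always stays inside the bounding box of the minority class.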

[74] OmniPSD: Layered PSD Generation with Diffusion Transformer

Cheng Liu,Yiren Song,Haofan Wang,Mike Zheng Shou

Main category: cs.CV

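TL;DR: OmniPSD is a unified diffusion-based framework for text-to-PSD generation and image-to-PSD decomposition that produces layered PSD files with transparent channels through spatial attention and iterative in-context editing.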

Details Motivation: Existing diffusion models have limited ability to generate layered PSD files with transparent alpha channels, which falls short of practical design needs. Method: Builds the OmniPSD framework on the Flux ecosystem, using spatial attention to learn the semantic relationships of multi-layer layouts and iterative in-context editing to extract layered content from a single image; adds an RGBA-VAE module to preserve transparency information. Result: Experiments on a newly built RGBA-layered dataset show that OmniPSD performs well in generation quality, structural consistency, and transparency preservation. Conclusion: OmniPSD offers a new paradigm for layered design generation and decomposition with diffusion transformers, supporting high-fidelity, structured, transparency-aware PSD content creation. Abstract: Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.

[75] GLACIA: Instance-Aware Positional Reasoning for Glacial Lake Segmentation via Multimodal Large Language Model

Lalit Maurya,Saurabh Kaushik,Beth Tellman

Main category: cs.CV

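TL;DR: Proposes GLACIA, a new framework that combines large language models with image segmentation for glacial lake monitoring, producing accurate segmentation masks together with spatial reasoning outputs.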

Details Motivation: Existing CNN- and ViT-based glacial lake segmentation methods are limited to pixel-level prediction and lack global semantics and interpretable reasoning, so they offer little support for decision-making in disaster response. Method: Proposes the GLACIA framework, which couples a large language model with a segmentation network, and builds the GLake-Pos dataset to support training on instance-aware spatial position question answering, enabling semantic understanding and natural language interaction. Result: GLACIA reaches an mIoU of 87.30, outperforming existing CNN, ViT, geo-foundation, and reasoning-based segmentation methods. Conclusion: By adding contextual and instance-aware reasoning, the method improves the accuracy and interpretability of glacial lake segmentation and can support disaster early warning and policy-making. Abstract: Glacial lake monitoring bears great significance in mitigating the anticipated risk of Glacial Lake Outburst Floods. However, existing segmentation methods based on convolutional neural networks (CNNs) and Vision Transformers (ViTs) remain constrained to pixel-level predictions, lacking high-level global scene semantics and human-interpretable reasoning. To address this, we introduce GLACIA (Glacial LAke segmentation with Contextual Instance Awareness), the first framework that integrates large language models with segmentation capabilities to produce both accurate segmentation masks and corresponding spatial reasoning outputs. We construct the Glacial Lake Position Reasoning (GLake-Pos) dataset pipeline, which provides diverse, spatially grounded question-answer pairs designed to overcome the lack of instance-aware positional reasoning data in remote sensing. Comparative evaluations demonstrate that GLACIA (mIoU: 87.30) surpasses state-of-the-art methods based on CNNs (mIoU: 78.55 - 79.01), ViTs (mIoU: 69.27 - 81.75), Geo-foundation models (mIoU: 76.37 - 87.10), and reasoning-based segmentation methods (mIoU: 60.12 - 75.66). Our approach enables intuitive disaster preparedness and informed policy-making in the context of rapidly changing glacial environments by facilitating natural language interaction, thereby supporting more efficient and interpretable decision-making. The code is released at https://github.com/lalitmaurya47/GLACIA
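
The mIoU figures quoted above are mean intersection-over-union scores. As a reminder of how such a score is typically computed, here is a small sketch for integer label maps; the convention of skipping classes that appear in neither map is an assumption, and evaluation protocols differ on this point.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps; skipped by assumption
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 masks: class 0 = background, class 1 = lake (hypothetical labels).
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = mean_iou(pred, target, num_classes=2)  # IoU is 1/2 for class 0 and 2/3 for class 1
```

Papers usually report this value multiplied by 100, which is how a score such as 87.30 should be read.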

[76] ROI-Packing: Efficient Region-Based Compression for Machine Vision

Md Eimran Hossain Eimon,Alena Krause,Ashan Perera,Juan Merlos,Hari Kalva,Velibor Adzic,Borko Furht

Main category: cs.CV

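TL;DR: Proposes ROI-Packing, an efficient image compression method designed for machine vision that prioritizes and efficiently packs the regions of interest (ROI) critical to task accuracy, significantly reducing bitrate without hurting task accuracy.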

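Details Motivation: Conventional image compression ignores the needs of machine vision tasks, so task performance degrades at low bitrates; a machine-oriented compression method is needed that preserves the critical information. Method: ROI-Packing identifies and prioritizes regions important to the downstream task, packs these ROIs efficiently, and discards irrelevant background; it integrates into existing systems without retraining or fine-tuning the task model. Result: On five datasets and two tasks (object detection and instance segmentation), ROI-Packing reduces bitrate by up to 44.10% compared with the state-of-the-art VVC codec without losing task accuracy, or improves task accuracy by 8.88% at the same bitrate. Conclusion: ROI-Packing is an efficient, plug-and-play image compression scheme for machine vision that maintains or improves downstream task performance under heavy compression, with practical deployment value. Abstract: This paper introduces ROI-Packing, an efficient image compression method tailored specifically for machine vision. By prioritizing regions of interest (ROI) critical to end-task accuracy and packing them efficiently while discarding less relevant data, ROI-Packing achieves significant compression efficiency without requiring retraining or fine-tuning of end-task models. Comprehensive evaluations across five datasets and two popular tasks (object detection and instance segmentation) demonstrate up to a 44.10% reduction in bitrate without compromising end-task accuracy, along with an 8.88% improvement in accuracy at the same bitrate compared to the state-of-the-art Versatile Video Coding (VVC) codec standardized by the Moving Picture Experts Group (MPEG).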

[77] MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

Sangwoon Kwak,Weeyoung Kwon,Jun Young Jeong,Geonho Kim,Won-Sik Cheong,Jihyong Oh

Main category: cs.CV

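TL;DR: Proposes MoRel, a new 4D Gaussian Splatting framework that uses an Anchor Relay-based Bidirectional Blending (ARBB) mechanism to model long-range dynamic scenes efficiently, without flicker, and with bounded memory.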

Details Motivation: Existing 4D Gaussian Splatting methods face memory explosion, temporal flickering, and failure to handle changing occlusions on dynamic videos with long-range motion, so a more efficient and temporally consistent method is needed. Method: Proposes the MoRel framework, which builds locally canonical anchor spaces at key frames and models inter-frame deformation at the anchor level; introduces bidirectional deformation learning with learnable opacity control for adaptive blending to reduce temporal discontinuities; and designs a Feature-variance-guided Hierarchical Densification (FHD) strategy that controls growth in point density while preserving rendering quality. Result: Validated on the newly built long-range dynamic dataset SelfCap$_{\text{LR}}$, where MoRel delivers high-quality, flicker-free long-duration rendering with stable memory usage. Conclusion: Through anchor relay and bidirectional blending, MoRel improves the temporal consistency and scalability of 4D Gaussian Splatting on long-range dynamic scenes, offering an efficient solution for real-time rendering of dynamic scenes. Abstract: Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes. However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naive extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes. Our method progressively constructs locally canonical anchor spaces at key-frame time indices (key-frame anchors, KfA) and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between KfAs and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts. We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfAs while keeping rendering quality, based on an assigned level of feature variance. To effectively evaluate our model's capability to handle real-world long-range 4D motion, we compose a new long-range 4D motion-contained dataset, called SelfCap$_{\text{LR}}$. It has a larger average dynamic motion magnitude and is captured in spatially wider spaces than previous dynamic video datasets. Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations.

[78] LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations

Zhichao Yang,Tianjiao Gu,Jianjie Wang,Feiyu Lin,Xiangfei Sheng,Pengfei Chen,Leida Li

Main category: cs.CV

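TL;DR: Presents LongT2IBench, a dataset of 14K long text-image pairs with graph-structured human annotations for evaluating image-text alignment in long text-to-image generation, and builds on it LongT2IExpert, a model that uses multimodal large language models for interpretable, fine-grained alignment evaluation.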

Details Motivation: Existing image-text alignment benchmarks mainly target short prompts and lack support for fine-grained, interpretable alignment evaluation in long-text scenarios, which limits the development of long-text T2I models. Method: Designs a Generate-Refine-Qualify annotation protocol that converts long prompts into graph structures of entities, attributes, and relations; produces fine-grained alignment annotations from the graphs and builds the LongT2IBench dataset; further proposes LongT2IExpert, which instruction-tunes a multimodal large language model with Hierarchical Alignment Chain-of-Thought (CoT) so that evaluation yields both scores and explanations. Result: LongT2IBench contains 14K image-text pairs with graph-structured annotations; experiments show that LongT2IExpert outperforms existing methods in long-text alignment evaluation and interpretability. Conclusion: LongT2IBench provides a high-quality, interpretable new benchmark for long-text T2I alignment evaluation, and LongT2IExpert shows the value of combining graph-structured reasoning with multimodal large models for evaluation. Abstract: The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate the image-text alignment in long prompt scenarios. However, the existing T2I alignment benchmarks predominantly focus on short prompt scenarios and only provide MOS or Likert scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structured annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a LongT2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code have been released at https://welldky.github.io/LongT2IBench-Homepage/.

[79] Dynamic Facial Expressions Analysis Based Parkinson's Disease Auxiliary Diagnosis

Xiaochen Huang,Xiaochen Bi,Cuihua Lv,Xin Wang,Haoyan Zhang,Wenjing Jiang,Xin Ma,Yibin Li

Main category: cs.CV

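TL;DR: Proposes an auxiliary diagnosis method for Parkinson's disease based on dynamic facial expression analysis, which examines reduced facial expressivity and facial rigidity and combines CLIP with an LSTM network to reach 93.1% accuracy.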

Details Motivation: Parkinson's disease affects patients' daily lives, and existing diagnostic methods are not convenient enough, so a more efficient and accessible auxiliary diagnostic tool is needed. Method: Builds a multimodal facial expression analysis network that uses the CLIP architecture to fuse visual and textual features while preserving the temporal dynamics of expressions, then feeds the extracted expression intensity features into an LSTM classification network for diagnosis. Result: The method reaches 93.1% accuracy in Parkinson's disease diagnosis, outperforming other in-vitro diagnostic approaches. Conclusion: The method offers a convenient and efficient auxiliary diagnosis option for Parkinson's disease and improves the diagnostic experience for potential patients. Abstract: Parkinson's disease (PD), a prevalent neurodegenerative disorder, significantly affects patients' daily functioning and social interactions. To facilitate a more efficient and accessible diagnostic approach for PD, we propose a dynamic facial expression analysis-based PD auxiliary diagnosis method. This method targets hypomimia, a characteristic clinical symptom of PD, by analyzing two manifestations: reduced facial expressivity and facial rigidity, thereby facilitating the diagnosis process. We develop a multimodal facial expression analysis network to extract expression intensity features during patients' performance of various facial expressions. This network leverages the CLIP architecture to integrate visual and textual features while preserving the temporal dynamics of facial expressions. Subsequently, the expression intensity features are processed and input into an LSTM-based classification network for PD diagnosis. Our method achieves an accuracy of 93.1%, outperforming other in-vitro PD diagnostic approaches. This technique offers a more convenient detection method for potential PD patients, improving their diagnostic experience.

[80] LoGoColor: Local-Global 3D Colorization for 360° Scenes

Yeonjin Chang,Juhwan Cho,Seunghyeon Seo,Wonsik Shin,Nojun Kwak

Main category: cs.CV

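TL;DR: Proposes LoGoColor, which uses a local-global strategy to eliminate the guidance-averaging process and achieves 3D colorization with greater color diversity and multi-view consistency on complex 360° scenes.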

Details Motivation: Single-channel 3D reconstruction produces no color output, and methods that rely on 2D image colorization models yield monotonous results because colors are averaged, especially in complex 360° scenes. Method: Proposes the LoGoColor pipeline with a "Local-Global" approach: the scene is partitioned into subscenes, and a fine-tuned multi-view diffusion model explicitly handles both inter-subscene and intra-subscene consistency to generate diverse yet consistent colorized views. Result: Achieves better quantitative and qualitative results than existing methods on complex 360° scenes, with stronger multi-view consistency and more plausible visuals, and shows higher color diversity under a newly proposed Color Diversity Index. Conclusion: LoGoColor addresses the color-averaging problem, improving the color richness and realism of 3D colorization while keeping strict multi-view consistency, and suits highly complex scenes. Abstract: Single-channel 3D reconstruction is widely used in fields such as robotics and medical imaging. While this line of work excels at reconstructing 3D geometry, the outputs are not colored 3D models, so 3D colorization is required for visualization. Recent 3D colorization studies address this problem by distilling 2D image colorization models. However, these approaches suffer from an inherent inconsistency of 2D image models. This results in colors being averaged during training, leading to monotonous and oversimplified results, particularly in complex 360° scenes. In contrast, we aim to preserve color diversity by generating a new set of consistently colorized training views, thereby bypassing the averaging process. Nevertheless, eliminating the averaging process introduces a new challenge: ensuring strict multi-view consistency across these colorized views. To achieve this, we propose LoGoColor, a pipeline designed to preserve color diversity by eliminating this guidance-averaging process with a 'Local-Global' approach: we partition the scene into subscenes and explicitly tackle both inter-subscene and intra-subscene consistency using a fine-tuned multi-view diffusion model. We demonstrate that our method achieves quantitatively and qualitatively more consistent and plausible 3D colorization on complex 360° scenes than existing methods, and validate its superior color diversity using a novel Color Diversity Index.

[81] FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model

Xiang Chen,Jinshan Pan,Jiangxin Dong,Jian Yang,Jinhui Tang

Main category: cs.CV

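TL;DR: Proposes FoundIR-v2, a high-capacity diffusion-based image restoration foundation model that dynamically optimizes multi-task data mixture proportions and adds an MoE-driven scheduler, achieving strong performance across more than 50 sub-tasks.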

Details Motivation: The performance of existing all-in-one image restoration models is limited by imbalanced data mixture proportions across tasks, so a better data scheduling strategy is needed to improve overall generalization. Method: Proposes FoundIR-v2, which adopts a data equilibrium scheduling paradigm that uses the data mixing law to dynamically adjust multi-task training data proportions, and introduces an MoE-driven scheduler that flexibly allocates task-adaptive diffusion priors during generative pre-training. Result: The model is evaluated on more than 50 sub-tasks covering a broad range of real-world scenarios and shows favorable overall performance against state-of-the-art methods. Conclusion: Data mixture proportion is a key factor in the performance of multi-task image restoration models, and combining equilibrium scheduling with an MoE mechanism can substantially improve generalization and practicality. Abstract: Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.

[82] MelanomaNet: Explainable Deep Learning for Skin Lesion Classification

Sukhrobbek Ilyosbekov

Main category: cs.CV

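TL;DR: Proposes MelanomaNet, an explainable deep learning system for multi-class skin lesion classification that combines four complementary interpretability mechanisms to improve clinical trust while keeping high performance.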

Details Motivation: The "black box" nature of deep learning models limits their clinical use, so a skin lesion classification system that is both accurate and explainable is needed to encourage clinical adoption. Method: Uses an EfficientNet V2 backbone combined with four interpretability methods: GradCAM++ attention visualization, automated ABCDE clinical criterion extraction, FastCAV concept-based explanations, and Monte Carlo Dropout uncertainty quantification. Result: Achieves 85.61% accuracy and a weighted F1 score of 0.8564 on the ISIC 2019 dataset, aligns model attention with clinical assessment criteria, and decomposes prediction uncertainty to flag cases that need manual review. Conclusion: High classification performance can coexist with comprehensive interpretability, and MelanomaNet may strengthen clinicians' trust in AI systems and support their use in dermatology workflows. Abstract: Automated skin lesion classification using deep learning has shown remarkable accuracy, yet clinical adoption remains limited due to the "black box" nature of these models. We present MelanomaNet, an explainable deep learning system for multi-class skin lesion classification that addresses this gap through four complementary interpretability mechanisms. Our approach combines an EfficientNet V2 backbone with GradCAM++ attention visualization, automated ABCDE clinical criterion extraction, Fast Concept Activation Vectors (FastCAV) for concept-based explanations, and Monte Carlo Dropout uncertainty quantification. We evaluate our system on the ISIC 2019 dataset containing 25,331 dermoscopic images across 9 diagnostic categories. Our model achieves 85.61% accuracy with a weighted F1 score of 0.8564, while providing clinically meaningful explanations that align model attention with established dermatological assessment criteria. The uncertainty quantification module decomposes prediction confidence into epistemic and aleatoric components, enabling automatic flagging of unreliable predictions for clinical review. Our results demonstrate that high classification performance can be achieved alongside comprehensive interpretability, potentially facilitating greater trust and adoption in clinical dermatology workflows. The source code is available at https://github.com/suxrobgm/explainable-melanoma
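
One common way to split Monte Carlo Dropout uncertainty into epistemic and aleatoric parts is the entropy decomposition sketched below: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the individual passes, and epistemic uncertainty is their difference. Whether MelanomaNet uses exactly this decomposition is an assumption; the function names and toy probabilities are hypothetical.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) of probability vectors along the given axis."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def decompose_uncertainty(probs):
    """Split predictive uncertainty for one input.

    probs: array of shape (passes, classes), softmax outputs from repeated
    stochastic forward passes with dropout kept active.
    Returns (total, aleatoric, epistemic).
    """
    total = entropy(probs.mean(axis=0))   # entropy of the averaged prediction
    aleatoric = entropy(probs).mean()     # average entropy of each pass
    epistemic = total - aleatoric         # disagreement between passes
    return float(total), float(aleatoric), float(epistemic)

# Toy example: three dropout passes over two classes that partly disagree.
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
total, aleatoric, epistemic = decompose_uncertainty(probs)
```

A case could then be flagged for clinical review when `epistemic` exceeds a threshold chosen on validation data.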

[83] Traffic Scene Small Target Detection Method Based on YOLOv8n-SPTS Model for Autonomous Driving

Songhan Wu

Main category: cs.CV

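TL;DR: Proposes an improved YOLOv8n-SPTS model for small target detection in autonomous driving, introducing SPD-Conv, an SPPFCSPC module, and a TSFP structure, and reports leading precision, recall, and mAP on the VisDrone2019-DET dataset.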

Details Motivation: Existing algorithms suffer from feature loss, scale imbalance, and occlusion in small target detection, leading to poor performance, especially in recognizing small traffic targets such as pedestrians and bicycles in dynamic perception. Method: 1) Replaces 4 traditional convolution modules in the YOLOv8n Backbone Bottleneck with SPD-Conv to retain fine-grained information; 2) replaces SPPF with the SPPFCSPC module, combining the multi-scale extraction of SPP with the feature fusion of CSP; 3) designs the TSFP structure, which adds a 160×160 small target detection head and removes redundant large target heads to improve small target detection while keeping computation efficient. Result: On VisDrone2019-DET, YOLOv8n-SPTS ranks first in precision (61.9%), recall (48.3%), mAP@0.5 (52.6%), and mAP@0.5:0.95 (32.6%); visualizations show a significantly lower miss rate for small targets in occluded and dense scenes. Conclusion: By optimizing feature extraction, strengthening feature fusion, and adding a dedicated small target detection structure, YOLOv8n-SPTS improves detection of small traffic targets and suits autonomous driving perception in complex urban scenes. Abstract: This paper focuses on a key issue in autonomous driving: small target recognition in dynamic perception. Existing algorithms suffer from poor detection performance due to missing small target information, scale imbalance, and occlusion. We propose an improved YOLOv8n-SPTS model, which enhances the detection accuracy of small traffic targets through three key innovations. First, we optimize the feature extraction module: in the Backbone Bottleneck structure of YOLOv8n, 4 traditional convolution modules are replaced with Space-to-Depth Convolution (SPD-Conv) modules. This module retains fine-grained information through space-to-depth conversion, reduces information loss, and enhances the ability to capture features of low-resolution small targets. Second, we enhance feature fusion capability: the Spatial Pyramid Pooling - Fast Cross Stage Partial Connection (SPPFCSPC) module is introduced to replace the original SPPF module, integrating the multi-scale feature extraction from Spatial Pyramid Pooling (SPP) and the feature fusion mechanism of Cross Stage Partial Connection (CSP), thereby improving the model's contextual understanding of complex scenes and multi-scale feature expression ability. Third, we design a dedicated detection structure for small targets: a Triple-Stage Feature Pyramid (TSFP) structure is proposed, which adds a 160×160 small target detection head to the original detection heads to fully utilize high-resolution features in shallow layers; meanwhile, redundant large target detection heads are removed to balance computational efficiency. Comparative experiments on the VisDrone2019-DET dataset show that the YOLOv8n-SPTS model ranks first in precision (61.9%), recall (48.3%), mAP@0.5 (52.6%), and mAP@0.5:0.95 (32.6%). Visualization results verify that the miss rate of small targets such as pedestrians and bicycles in occluded and dense scenes is significantly reduced.
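
The space-to-depth conversion behind SPD-Conv can be sketched as a pure rearrangement: each 2×2 spatial block is moved into the channel dimension, halving the resolution without discarding any values. The sketch below shows only that rearrangement on a channels-first NumPy array; the channel ordering and function name are assumptions, and the non-strided convolution that follows in SPD-Conv is omitted.

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange a (channels, height, width) array into
    (channels * block**2, height // block, width // block)."""
    c, h, w = x.shape
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    # Split each spatial axis into (coarse index, offset inside the block).
    x = x.reshape(c, h // block, block, w // block, block)
    # Move the two in-block offsets in front of the channel axis.
    x = x.transpose(2, 4, 0, 1, 3)
    return x.reshape(c * block * block, h // block, w // block)

# Toy input: one channel, 4x4 pixels, values 0..15.
x = np.arange(16).reshape(1, 4, 4)
y = space_to_depth(x)  # shape (4, 2, 2); every input value appears exactly once
```

A strided convolution or pooling layer would drop or merge three of every four positions at this step, which is the information loss the abstract says SPD-Conv avoids.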

[84] VABench: A Comprehensive Benchmark for Audio-Video Generation

Daili Hua,Xizhi Wang,Bohan Zeng,Xinyi Huang,Hao Liang,Junbo Niu,Xinlong Chen,Quanqing Xu,Wentao Zhang

Main category: cs.CV

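TL;DR: Presents VABench, a multi-dimensional benchmark framework for evaluating synchronized audio-video generation, filling the gap in existing video generation benchmarks, which lack audio-video synchronization evaluation.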

Details Motivation: Existing video generation benchmarks focus mainly on visual quality and lack effective evaluation of synchronized audio-video generation, especially audio-video synchronization, so a comprehensive framework dedicated to joint audio-video generation is needed. Method: Proposes VABench, which comprises three task types (text-to-audio-video, image-to-audio-video, and stereo audio-video generation) and two evaluation modules covering 15 dimensions, including text-video, text-audio, and video-audio similarity, audio-video synchronization, lip-speech consistency, and audio and video question-answering pairs, across seven major content categories. Result: VABench systematically evaluates current audio-video generation models along multiple dimensions, with detailed analysis and visualizations that reveal their strengths and weaknesses. Conclusion: VABench provides a comprehensive, systematic evaluation standard for synchronized audio-video generation and may become a new benchmark for the field. Abstract: Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.

[85] From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation

Shivanshu Agnihotri,Snehashis Majhi,Deepak Ranjan Nayak,Debesh Jha

Main category: cs.CV

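TL;DR: Proposes Polyp-DiFoM, a new distillation framework that transfers knowledge from foundation models into lightweight segmentation models, significantly improving the accuracy and efficiency of colonoscopic polyp segmentation.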

Details Motivation: Because polyps vary widely in size, shape, color, and degree of camouflage, existing lightweight models struggle to segment them precisely, while applying large foundation models directly to medical images is hampered by data scarcity and a lack of domain knowledge. Method: Semantic priors are extracted from vision foundation models such as SAM and DINOv2 and infused into lightweight architectures such as U-Net and U-Net++, with frequency-domain encoding added to enhance distillation. Result: Experiments on five benchmark datasets, including Kvasir-SEG and CVC-ClinicDB, show that Polyp-DiFoM significantly outperforms the baseline models and the existing state-of-the-art model, with computational overhead reduced nearly 9-fold. Conclusion: Polyp-DiFoM effectively bridges the gap between lightweight models and foundation models, delivering efficient and accurate polyp segmentation suitable for clinical deployment. Abstract: Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer, yet it remains challenging due to significant size, shape, and color variations and the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in terms of easy deployment and low computational cost, they struggle to deal with the above issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency domain encoding for enhanced distillation, corroborating their generalization capability. Extensive experiments are performed across five benchmark datasets: Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently and significantly outperforms the respective baseline models, as well as the state-of-the-art model, with nearly 9 times less computational overhead.
The code is available at https://github.com/lostinrepo/PolypDiFoM.

[86] Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance

Kuldeep Singh Yadav,Lalan Kumar

Main category: cs.CV

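TL;DR: This paper presents USE50k, a large-scale annotated dataset, and DeepUSEvision, an efficient vision-based framework for real-time suspicious-behavior analysis, aimed at proactive threat detection in complex environments.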

Details Motivation: In complex, uncontrolled public environments, existing suspicious-behavior detection methods are limited by dataset scale and model efficiency, making accurate, real-time, and interpretable threat recognition difficult. A more comprehensive dataset and an efficient multimodal analysis framework are therefore needed. Method: The paper introduces the USE50k dataset (65,500 images from diverse scenes) and develops the DeepUSEvision framework, which combines an enhanced YOLOv12 for suspicious object detection, two DCNNs that handle facial expressions and body language respectively, and a Transformer-based discriminator network that fuses the multimodal outputs into an interpretable suspiciousness score. Result: Experiments show that the framework outperforms existing state-of-the-art methods in accuracy, robustness, and interpretability, and supports real-time analysis in real-world scenes. Conclusion: The USE50k dataset and the DeepUSEvision framework provide a scalable foundation for real-time risk assessment in intelligent surveillance and safety-critical applications. Abstract: Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components, i.e., a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition using image and landmark features, and a transformer-based Discriminator Network that adaptively fuses multimodal outputs to yield an interpretable suspiciousness score. Extensive experiments confirm the superior accuracy, robustness, and interpretability of the proposed framework compared to state-of-the-art approaches. Collectively, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment in safety-critical applications.

[87] Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and Outlook

Yuan Ma,Junlin Hou,Chao Zhang,Yukun Zhou,Zongyuan Ge,Haoran Xie,Lie Ju

Main category: cs.CV

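TL;DR: This paper presents LNMBench, a comprehensive benchmark for label noise in medical imaging that evaluates 10 representative methods across 7 datasets, 6 imaging modalities, and 3 noise patterns, and proposes a simple yet effective method to improve model robustness under high-noise and real-world conditions.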

Details Motivation: The robustness of existing learning-with-noisy-labels (LNL) methods in medical image analysis has not been systematically assessed, and medical images suffer from inconsistent labels caused by costly expert annotation and high inter-observer variability. Method: The authors build LNMBench, a unified and reproducible evaluation framework that covers 10 representative LNL methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns; based on the experimental findings, they propose an improvement that strengthens model robustness to noise. Result: Experiments show that the performance of existing LNL methods degrades substantially under high and real-world noise, limited mainly by class imbalance and domain variability; the proposed improvement effectively raises model robustness. Conclusion: LNMBench provides a standardized evaluation platform for label-noise research in medical imaging, exposes the limitations of existing methods under realistic conditions, and promotes the development of noise-resilient algorithms and reproducible research. Abstract: Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses 10 representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications. The codebase is publicly available at https://github.com/myyy777/LNMBench.

[88] UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

Xuangeng Chu,Ruicong Liu,Yifei Huang,Yun Liu,Yichen Peng,Bo Zheng

Main category: cs.CV

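TL;DR: This paper presents UniLS, the first end-to-end audio-driven framework for conversational avatar generation, which produces unified speaking and listening expressions from dual-track audio alone.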

Details Motivation: Existing methods struggle to generate natural listening motions, because listening motion depends mainly on an internal motion prior rather than being directly driven by audio, which results in stiff motion; most methods also address only speaking generation and lack an effective way to jointly model speaking and listening behavior in real time. Method: A two-stage training paradigm is proposed. Stage 1 learns the internal prior of facial motion with an audio-free autoregressive generator; Stage 2 introduces dual-track audio to fine-tune the model, modulating the learned motion prior with external speech to achieve end-to-end joint generation. Result: UniLS reaches state-of-the-art speaking accuracy and improves listening metrics by up to 44.1% over prior methods, producing richer and more natural listening motions and markedly reducing stiffness. Conclusion: UniLS effectively addresses the challenge of audio-driven listening motion generation and offers a high-fidelity, practical solution for interactive digital humans. Abstract: Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on the speaker's motion as an extra input to produce the listener. This design is not end-to-end, which hinders real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.

[89] Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video

Seonghwa Choi,Moonkyeong Choi,Mingyu Jang,Jaekyung Kim,Jianfei Cai,Wen-Huang Cheng,Sanghoon Lee

Main category: cs.CV

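TL;DR: This paper proposes RnD-Avatar, a relightable and dynamic human avatar modeling framework based on 3D Gaussian Splatting (3DGS). By introducing dynamic skinning weights and a new regularization method, it achieves high-fidelity pose-dependent deformation and detailed geometry reconstruction, and reaches state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.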

Details Motivation: Existing NeRF- and 3DGS-based methods lack pose-dependent geometric details (such as clothing wrinkles) when reconstructing human avatars, which reduces realism and makes high-quality relightable and animatable rendering difficult. Method: The proposed RnD-Avatar framework uses dynamic skinning weights to model pose-dependent articulation and to learn additional deformations induced by body motion; it introduces a new regularization strategy to recover fine geometric details under sparse visual cues; and it contributes a new dataset with multiple views and lighting conditions for evaluating relighting. Result: The method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting, producing dynamic human avatars with realistic lighting effects and high-fidelity geometric details such as clothing wrinkles. Conclusion: Through dynamic deformation modeling and fine-grained regularization, RnD-Avatar substantially improves the geometric accuracy and visual realism of avatars reconstructed from monocular video, supports high-quality animation and rendering under arbitrary lighting, and advances relightable dynamic avatars. Abstract: Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.

[90] TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment

Kanghyun Baek,Sangyub Lee,Jin Young Choi,Jaewoo Song,Daemin Park,Jooyoung Choi,Chaehun Shin,Bohyung Han,Sungroh Yoon

Main category: cs.CV

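TL;DR: This paper proposes TextGuider, a training-free method that improves the completeness of text rendering in text-to-image diffusion models. By introducing an attention alignment mechanism and latent guidance during the early denoising stage, it achieves state-of-the-art results in text recall, OCR accuracy, and CLIP score.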

Details Motivation: Existing text-to-image diffusion models are prone to text omission when rendering text, and current methods do not adequately address the problem. Method: The authors analyze text-related attention patterns in MM-DiT models and apply latent guidance during the early denoising stage using two new loss functions, aligning text tokens with the text regions of the image. Result: The method reaches the state of the art in test-time text rendering, markedly improving text recall while also performing strongly in OCR accuracy and CLIP score. Conclusion: TextGuider effectively addresses text omission and is an efficient, training-free way to enhance text rendering. Abstract: Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.

[91] Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding

Xinkui Zhao,Zuxin Wang,Yifan Zhang,Guanjie Cheng,Yueshen Xu,Shuiguang Deng,Chang Liu,Naibo Wang,Jianwei Yin

Main category: cs.CV

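TL;DR: This paper proposes Video-QTR, a lightweight framework that uses query-driven temporal reasoning to allocate visual processing resources dynamically, substantially reducing the computational cost of long-video understanding while achieving state-of-the-art performance on multiple benchmarks.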

Details Motivation: Conventional dense frame encoding produces too many visual tokens in long-video understanding, causing redundant computation and poor scalability and limiting the practical use of multimodal large models. Method: The proposed Video-QTR framework uses a query-guided adaptive feedback mechanism that selects key frames dynamically during reasoning according to semantic intent, avoiding the need to encode every frame. Result: Experiments on five benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and Video-MME, show that Video-QTR reaches state-of-the-art performance while reducing the number of input frames by up to 73%. Conclusion: The query-driven temporal reasoning paradigm effectively improves the efficiency and scalability of video understanding and opens a new direction for applying multimodal models to long videos. Abstract: The rapid development of multimodal large language models (MLLMs) has significantly expanded the scope of visual language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and Video-MME, demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.

[92] StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

Ke Xing,Longfei Li,Yuyang Yin,Hanwen Liang,Guixun Luo,Chen Fang,Jue Wang,Konstantinos N. Plataniotis,Xiaojie Jin,Yao Zhao,Yunchao Wei

Main category: cs.CV

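TL;DR: StereoWorld is an end-to-end framework that repurposes a pretrained video generation model for high-quality monocular-to-stereo video generation; combining geometry-aware regularization with a spatio-temporal tiling strategy, it performs strongly in high-resolution stereo video synthesis.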

Details Motivation: Stereo video production is costly and prone to artifacts, and existing methods struggle to balance visual fidelity with geometric consistency, so an efficient, high-quality monocular-to-stereo conversion method is needed. Method: The proposed StereoWorld framework jointly conditions the model on the monocular video input and adds geometry-aware regularization as explicit supervision; it adopts a spatio-temporal tiling scheme for efficient high-resolution synthesis, and the authors build a high-definition stereo video dataset of more than 11 million frames for training and evaluation. Result: Experiments show that StereoWorld substantially outperforms prior methods in visual fidelity and geometric consistency and generates high-quality, high-resolution stereo video. Conclusion: StereoWorld provides an effective solution for low-cost generation of high-fidelity stereo video and improves the production efficiency of stereo content for XR applications. Abstract: The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.

[93] ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation

Shengchao Zhou,Jiehong Lin,Jiahui Liu,Shizhen Zhao,Chirui Chang,Xiaojuan Qi

Main category: cs.CV

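TL;DR: This paper proposes ASSIST-3D, an adapted 3D scene synthesis pipeline for improving the generalization of class-agnostic 3D instance segmentation models. Through heterogeneous object selection, LLM-guided scene layout generation, and multi-view point cloud construction, it significantly improves performance on benchmarks such as ScanNetV2, ScanNet++, and S3DIS.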

Details Motivation: Existing 3D instance segmentation methods generalize poorly because annotated data is scarce or 2D segmentations are noisy, and current 3D scene synthesis methods cannot simultaneously satisfy geometric diversity, contextual complexity, and layout plausibility, so a more effective synthesis strategy is needed to improve class-agnostic 3D instance segmentation. Method: The proposed ASSIST-3D framework has three core innovations: 1) random sampling from large-scale 3D CAD assets to achieve geometric and contextual diversity; 2) plausible scene layout generation that combines the spatial reasoning of a large language model (LLM) with depth-first search; and 3) realistic point cloud generation through multi-view RGB-D image rendering and fusion. Result: Experiments on the ScanNetV2, ScanNet++, and S3DIS benchmarks show that models trained on ASSIST-3D-generated data significantly outperform existing methods, and comparisons with existing 3D scene synthesis methods demonstrate its advantage. Conclusion: ASSIST-3D generates diverse, plausibly arranged 3D synthetic scenes that closely resemble real sensor data, significantly improving the generalization of class-agnostic 3D instance segmentation models and confirming its potential as a data augmentation tool. Abstract: Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without semantic class reliance. Current methods struggle with generalization due to scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed ASSIST-3D, to synthesize proper data for model generalization enhancement. Specifically, ASSIST-3D features three key innovations: 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
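
The depth-first search mentioned for object placement can be illustrated with a minimal sketch. It assumes a 2D integer grid and axis-aligned rectangular footprints, and it leaves out the LLM-guided spatial reasoning entirely; `place_objects`, `overlaps`, and their parameters are hypothetical names chosen for illustration, not the authors' implementation.

```python
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) on an integer grid


def overlaps(a: Rect, b: Rect) -> bool:
    """Return True if two axis-aligned rectangles share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def place_objects(sizes: List[Tuple[int, int]],
                  room_w: int, room_h: int) -> Optional[List[Rect]]:
    """Place every footprint inside the room without overlap, or return None.

    Depth-first search with backtracking: object i is tried at each grid
    position in turn; if the remaining objects cannot be placed, the choice
    is undone and the next position is tried.
    """
    placed: List[Rect] = []

    def dfs(i: int) -> bool:
        if i == len(sizes):
            return True  # every object has a valid position
        w, h = sizes[i]
        for x in range(room_w - w + 1):
            for y in range(room_h - h + 1):
                rect = (x, y, w, h)
                if not any(overlaps(rect, p) for p in placed):
                    placed.append(rect)
                    if dfs(i + 1):
                        return True
                    placed.pop()  # backtrack and try the next position
        return False

    return placed if dfs(0) else None
```

For example, `place_objects([(2, 2), (2, 2)], 4, 2)` places the two footprints side by side, while two 3x3 footprints in a 4x4 room yield `None`.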

[94] FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

Haobo Jiang,Jin Xie,Jian Yang,Liang Yu,Jianmin Zheng

Main category: cs.CV

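TL;DR: This paper proposes FUSER, the first feed-forward multiview point cloud registration Transformer, which predicts global poses directly in a unified, compact latent space without pairwise estimation; it further proposes FUSER-DF, which refines the estimates with an SE(3)^N diffusion model, and both are superior in accuracy and efficiency.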

Details Motivation: Conventional multiview point cloud registration relies on extensive pairwise matching to build a pose graph, which is computationally expensive and, lacking holistic geometric constraints, makes the problem ill-posed. Method: FUSER encodes each scan into low-resolution superpoint features with a sparse 3D CNN and performs efficient cross-scan reasoning through a Geometric Alternating Attention module; it uses pretrained 2D attention priors to enhance 3D feature interaction. FUSER-DF builds an SE(3)^N diffusion denoising framework on top of this, using FUSER as a surrogate model and guiding training with a variational lower bound. Result: Experiments on the 3DMatch, ScanNet, and ARKitScenes datasets show that the method outperforms existing approaches in both registration accuracy and computational efficiency. Conclusion: FUSER and its diffusion-refined version FUSER-DF offer an efficient and accurate new paradigm for multiview point cloud registration, achieving end-to-end global pose prediction and refinement. Abstract: Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet, and ARKitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.

[95] Log NeRF: Comparing Spaces for Learning Radiance Fields

Sihe Chen,Luv Verma,Bruce A. Maxwell

Main category: cs.CV

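TL;DR: This paper studies how the choice of color space in Neural Radiance Fields (NeRF) affects novel view synthesis, and argues that learning in log RGB space improves rendering quality and robustness, especially under low-light conditions.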

Details Motivation: Most existing NeRFs are supervised with sRGB images and overlook the effect of color space on representation learning. Inspired by the BIDR model, the authors hypothesize that log RGB space, by simplifying the separation of illumination and reflectance, helps NeRF learn a more compact and effective scene representation. Method: About 30 GoPro videos were captured and converted back to linear data; NeRF models were then trained in linear, sRGB, GPLog, and log RGB spaces, with each network output converted to a common color space for rendering and loss computation, so that representation learning in the different color spaces could be compared. Result: Experiments show that NeRF trained in log RGB space outperforms the other color spaces both quantitatively and qualitatively, stands out in low light in particular, and generalizes stably across network sizes and NeRF variants. Conclusion: Using the log RGB color space significantly improves NeRF rendering quality and robustness, confirming the importance of color space choice in neural rendering and suggesting a new design direction for future work. Abstract: Neural Radiance Fields (NeRF) have achieved remarkable results in novel view synthesis, typically using sRGB images for supervision. However, little attention has been paid to the color space in which the network is learning the radiance field representation. Inspired by the Bi-Illuminant Dichromatic Reflection (BIDR) model, which suggests that a logarithmic transformation simplifies the separation of illumination and reflectance, we hypothesize that log RGB space enables NeRF to learn a more compact and effective representation of scene appearance. To test this, we captured approximately 30 videos using a GoPro camera, ensuring linear data recovery through inverse encoding. We trained NeRF models under various color space interpretations (linear, sRGB, GPLog, and log RGB) by converting each network output to a common color space before rendering and loss computation, enforcing representation learning in different color spaces. Quantitative and qualitative evaluations demonstrate that using a log RGB color space consistently improves rendering quality, exhibits greater robustness across scenes, and performs particularly well in low-light conditions while using the same bit-depth input images. Further analysis across different network sizes and NeRF variants confirms the generalization and stability of the log space advantage.
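
The intuition that a logarithm simplifies the separation of illumination and reflectance can be shown with a small numeric sketch. It assumes the standard sRGB transfer function as a stand-in for the camera's encoding (the GoPro encoding used in the paper is not given here), and the `eps` floor in `linear_to_log` is an assumption added only to avoid `log(0)`.

```python
import math


def srgb_to_linear(c: float) -> float:
    """Invert the standard sRGB transfer function for one channel in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_log(c: float, eps: float = 1e-6) -> float:
    """Natural log of a linear value; eps is an assumed floor that avoids log(0)."""
    return math.log(max(c, eps))


# In linear space, observed intensity = reflectance * illumination.
# In log space that product becomes a sum, so changing the illumination
# shifts the value by a constant that does not depend on the reflectance.
reflectance, bright, dim = 0.5, 0.8, 0.1
log_bright = linear_to_log(reflectance * bright)
log_dim = linear_to_log(reflectance * dim)
offset = log_bright - log_dim  # log(bright / dim), independent of reflectance
```

Repeating the last four lines with a different `reflectance` gives the same `offset`, which is the additive structure the log RGB hypothesis relies on.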

[96] Perception-Inspired Color Space Design for Photo White Balance Editing

Yang Cheng,Ziteng Cui,Lin Gu,Shenghan Su,Zenghui Zhang

Main category: cs.CV

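TL;DR: Proposes a perception-inspired Learnable HSI (LHSI) color space for white balance correction, combined with a Mamba network to improve performance under complex lighting.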

Details Motivation: Existing sRGB-based white balance methods are constrained by fixed nonlinear transformations and coupled color channels, making it hard to handle color constancy under complex lighting conditions. Method: The authors construct a Learnable HSI (LHSI) color space that separates luminance from chromatic components, and introduce learnable parameters and a Mamba network for adaptive optimization. Result: Experiments on benchmark datasets show that the method outperforms existing approaches, particularly under complex illumination. Conclusion: Perception-inspired color space design effectively improves white balance correction and offers a new approach to color restoration in computational photography. Abstract: White balance (WB) is a key step in the image signal processor (ISP) pipeline that mitigates color casts caused by varying illumination and restores the scene's true colors. Currently, sRGB-based WB editing for post-ISP WB correction is widely used to address color constancy failures in the ISP pipeline when the original camera RAW is unavailable. However, additive color models (e.g., sRGB) are inherently limited by fixed nonlinear transformations and entangled color channels, which often impede their generalization to complex lighting conditions. To address these challenges, we propose a novel framework for WB correction that leverages a perception-inspired Learnable HSI (LHSI) color space. Built upon a cylindrical color model that naturally separates luminance from chromatic components, our framework further introduces dedicated parameters to enhance this disentanglement and a learnable mapping to adaptively refine the flexibility. Moreover, a new Mamba-based network is introduced, which is tailored to the characteristics of the proposed LHSI color space. Experimental results on benchmark datasets demonstrate the superiority of our method, highlighting the potential of perception-inspired color space design in computational photography. The source code is available at https://github.com/YangCheng58/WB_Color_Space.
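
As background for the cylindrical color model, the following is a minimal sketch of the classical, fixed RGB-to-HSI conversion, which separates intensity from hue and saturation. It does not include the learnable parameters or mapping of LHSI, whose form is not described here; `rgb_to_hsi` is an illustrative name.

```python
import math
from typing import Tuple


def rgb_to_hsi(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Convert RGB in [0, 1] to (hue in degrees, saturation, intensity)."""
    intensity = (r + g + b) / 3.0
    if intensity == 0.0:
        return 0.0, 0.0, 0.0  # black: hue and saturation are undefined, use 0
    saturation = 1.0 - min(r, g, b) / intensity
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0.0:
        return 0.0, saturation, intensity  # gray: hue is undefined, use 0
    # Clamp the ratio to guard against rounding slightly outside [-1, 1].
    theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
    hue = 360.0 - theta if b > g else theta
    return hue, saturation, intensity
```

With this convention pure red, green, and blue map to hues near 0, 120, and 240 degrees, and any gray has zero saturation, so a change in brightness alone moves only the intensity component.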

[97] Detection and Localization of Subdural Hematoma Using Deep Learning on Computed Tomography

Vasiliki Stoumpou,Rohan Kumar,Bernard Burman,Diego Ojeda,Tapan Mehta,Dimitris Bertsimas

Main category: cs.CV

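TL;DR: Proposes a multimodal deep learning framework that combines clinical variables with CT imaging for automatic detection and localization of subdural hematoma (SDH), significantly improving accuracy and interpretability.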

Details Motivation: Existing automated tools focus mainly on SDH detection, lack interpretability and spatial localization, and do not make full use of multimodal clinical information, which makes it hard for them to support real-time decision-making. Method: The authors developed a multimodal deep learning framework that fuses structured clinical variables, a 3D convolutional neural network (trained on CT volumes), and a Transformer-enhanced 2D segmentation model, and combined the components with a greedy ensemble strategy. Result: Clinical variables alone yielded an AUC of 0.75; the 3D CNN and the segmentation model reached 0.922 and 0.926 respectively; the multimodal ensemble performed best, with an AUC of 0.9407 (95% CI: 0.930–0.951), and produced anatomically consistent localization maps. Conclusion: The framework delivers fast, accurate, and interpretable SDH detection and localization, and could improve triage, shorten time to intervention, and make clinical management more consistent. Abstract: Background. Subdural hematoma (SDH) is a common neurosurgical emergency, with increasing incidence in aging populations. Rapid and accurate identification is essential to guide timely intervention, yet existing automated tools focus primarily on detection and provide limited interpretability or spatial localization. There remains a need for transparent, high-performing systems that integrate multimodal clinical and imaging information to support real-time decision-making. Methods. We developed a multimodal deep-learning framework that integrates structured clinical variables, a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model for SDH detection and localization. Using 25,315 head CT studies from Hartford HealthCare (2015–2024), of which 3,774 (14.9%) contained clinician-confirmed SDH, tabular models were trained on demographics, comorbidities, medications, and laboratory results. Imaging models were trained to detect SDH and generate voxel-level probability maps. A greedy ensemble strategy combined complementary predictors. Findings. Clinical variables alone provided modest discriminatory power (AUC 0.75). Convolutional models trained on CT volumes and segmentation-derived maps achieved substantially higher accuracy (AUCs 0.922 and 0.926). The multimodal ensemble integrating all components achieved the best overall performance (AUC 0.9407; 95% CI, 0.930–0.951) and produced anatomically meaningful localization maps consistent with known SDH patterns. Interpretation.
This multimodal, interpretable framework provides rapid and accurate SDH detection and localization, achieving high detection performance and offering transparent, anatomically grounded outputs. Integration into radiology workflows could streamline triage, reduce time to intervention, and improve consistency in SDH management.
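
The abstract does not specify how the greedy ensemble strategy works. One common reading, assumed here, is forward selection: start from an empty ensemble and repeatedly add the predictor whose inclusion most improves validation AUC of the averaged probabilities, stopping when nothing helps. The sketch below follows that assumption; `auc`, `greedy_ensemble`, and the toy scores are illustrative and not taken from the study.

```python
from typing import Dict, List, Tuple


def auc(labels: List[int], scores: List[float]) -> float:
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def greedy_ensemble(preds: Dict[str, List[float]],
                    labels: List[int]) -> Tuple[List[str], float]:
    """Forward selection over models, averaging member probabilities."""
    selected: List[str] = []
    best = 0.0
    while True:
        best_name, best_score = None, best
        for name in preds:
            if name in selected:
                continue
            members = selected + [name]
            avg = [sum(preds[m][i] for m in members) / len(members)
                   for i in range(len(labels))]
            score = auc(labels, avg)
            if score > best_score:  # keep only strict improvements
                best_name, best_score = name, score
        if best_name is None:
            return selected, best
        selected.append(best_name)
        best = best_score


# Toy validation data: two complementary models and one uninformative model.
toy_labels = [0, 0, 1, 1]
toy_preds = {
    "a": [0.1, 0.7, 0.6, 0.9],
    "b": [0.6, 0.1, 0.9, 0.5],
    "c": [0.5, 0.5, 0.5, 0.5],
}
chosen, chosen_auc = greedy_ensemble(toy_preds, toy_labels)
```

In the toy data each of `a` and `b` misranks a different pair, so averaging them fixes both errors, while the constant model `c` is never added. This is the sense in which an ensemble can combine complementary predictors.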

[98] Wasserstein-Aligned Hyperbolic Multi-View Clustering

Rui Wang,Yuting Jiang,Xiaoqing Luo,Xiao-Jun Wu,Nicu Sebe,Ziheng Chen

Main category: cs.CV

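TL;DR: Proposes a new Wasserstein-Aligned Hyperbolic (WAH) framework for multi-view clustering, which uses hierarchical semantic modeling in hyperbolic space and a global semantic loss based on the hyperbolic sliced-Wasserstein distance to align manifold distributions across views, and strengthens cross-view semantic consistency through soft cluster assignments.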

Details Motivation: Existing methods focus mainly on instance-level alignment and neglect global semantic consistency, leaving them vulnerable to view-specific information such as noise and cross-view discrepancies. Method: A view-specific hyperbolic encoder is designed for each view to embed features into the Lorentz manifold for hierarchical semantic modeling; a global semantic loss based on the hyperbolic sliced-Wasserstein distance is introduced to align the manifold distributions of different views; and soft cluster assignments are used to promote cross-view semantic consistency. Result: Experiments on multiple benchmark datasets show that the method achieves state-of-the-art performance in multi-view clustering. Conclusion: The proposed WAH framework effectively improves multi-view clustering performance, accounts for both view-common and view-specific information, and increases model robustness. Abstract: Multi-view clustering (MVC) aims to uncover the latent structure of multi-view data by learning view-common and view-specific information. Although recent studies have explored hyperbolic representations to better address the representation gap between different views, they focus primarily on instance-level alignment and neglect global semantic consistency, rendering them vulnerable to view-specific information (e.g., noise and cross-view discrepancies). To this end, this paper proposes a novel Wasserstein-Aligned Hyperbolic (WAH) framework for multi-view clustering. Specifically, our method exploits a view-specific hyperbolic encoder for each view to embed features into the Lorentz manifold for hierarchical semantic modeling. A global semantic loss based on the hyperbolic sliced-Wasserstein distance is then introduced to align manifold distributions across views. This is followed by soft cluster assignments to encourage cross-view semantic consistency. Extensive experiments on multiple benchmark datasets show that our method can achieve SOTA clustering performance.
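
To make "embedding features into the Lorentz manifold" concrete, here is a minimal sketch of the standard Lorentz (hyperboloid) model with curvature -1: the exponential map at the origin lifts a Euclidean vector onto the hyperboloid, and the geodesic distance is the arccosh of the negated Lorentzian inner product. It does not implement the hyperbolic sliced-Wasserstein loss or the encoders of WAH; the function names are illustrative.

```python
import math
from typing import List


def exp_map_origin(v: List[float]) -> List[float]:
    """Lift a Euclidean tangent vector at the origin onto the unit hyperboloid."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)  # the origin of the hyperboloid
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * x for x in v]


def lorentz_inner(x: List[float], y: List[float]) -> float:
    """Lorentzian inner product: negative time term plus Euclidean space terms."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x: List[float], y: List[float]) -> float:
    """Geodesic distance on the hyperboloid (curvature -1)."""
    # Clamp to 1.0 because rounding can push the argument slightly below it.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

Every point produced by `exp_map_origin` satisfies `lorentz_inner(p, p) == -1` up to rounding, and its distance from the origin equals the Euclidean norm of the input vector.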

[99] Generative Point Cloud Registration

Haobo Jiang,Jin Xie,Jian Yang,Liang Yu,Jianmin Zheng

Main category: cs.CV

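TL;DR: Proposes a new 3D point cloud registration paradigm, Generative Point Cloud Registration, which improves registration performance by combining advanced 2D generative models with 3D matching tasks.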

Details Motivation: Conventional 3D registration methods are limited in the robustness and accuracy of feature matching, especially when texture is absent or geometric structure is sparse. Generative models are therefore introduced to strengthen cross-view consistency and the fusion of geometric and color features. Method: The authors propose Match-ControlNet, a controllable 2D generative model designed for matching, which uses depth-conditioned generation to ensure 2D-3D geometric consistency and achieves cross-view texture consistency through coupled conditional denoising and prompt guidance. The generated image pairs are used to support robust matching between source and target point clouds. Result: Extensive experiments on the 3DMatch and ScanNet datasets show that the approach significantly improves the performance of existing registration methods, confirming its effectiveness and generality. Conclusion: The generative point cloud registration paradigm effectively fuses 2D generative priors with 3D geometric information, offers a new approach to 3D registration, and can be integrated into a wide range of registration frameworks to improve performance. Abstract: In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds, enabling geometry-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model. Specifically, it leverages the depth-conditioned generation capability of ControlNet to produce images that are geometrically aligned with depth maps derived from point clouds, ensuring 2D-3D geometric consistency. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, Match-ControlNet further promotes cross-view feature interaction, guiding texture consistency generation. Our generative 3D registration paradigm is general and could be seamlessly integrated into various registration methods to enhance their performance. Extensive experiments on 3DMatch and ScanNet datasets verify the effectiveness of our approach.

[100] DirectSwap: Mask-Free Cross-Identity Training and Benchmarking for Expression-Consistent Video Head Swapping

Yanan Wang,Shengcai Liao,Panwen Hu,Xin Li,Fan Yang,Xiaodan Liang

Main category: cs.CV

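TL;DR: This paper proposes DirectSwap, a new video head-swapping method trained on the authors' own paired dataset, HeadSwapBench. It uses a mask-free video diffusion model and introduces the MEAR loss to improve motion and expression consistency, reaching state-of-the-art visual quality, identity fidelity, and motion coherence.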

Details Motivation: Existing video head-swapping methods rely on cross-frame training with the same person and on mask-based inpainting, which makes it hard to recover key information hidden by the mask (such as facial pose, expressions, and motion dynamics) and tends to produce boundary artifacts. The lack of real cross-identity paired data also limits model performance. Method: A video editing model is prompted to generate fake head-swapped videos with synchronized facial poses and expressions, yielding HeadSwapBench, the first cross-identity paired dataset. On this dataset, the DirectSwap framework extends an image U-Net into a video diffusion model with a motion module and conditioning inputs, enabling direct, mask-free head swapping; the MEAR loss reweights the diffusion loss per pixel according to frame differences and proximity to facial landmarks, strengthening the temporal consistency of motion and expressions. Result: Experiments on HeadSwapBench show that DirectSwap outperforms existing methods in visual quality, identity fidelity, and motion and expression consistency, particularly in complex real-world scenes. Conclusion: The paper confirms the effectiveness of supervised learning with synthetic paired data; the DirectSwap framework combined with the MEAR loss markedly improves the generation quality and temporal coherence of video head swapping and advances the field. Abstract: Video head swapping aims to replace the entire head of a video subject, including facial identity, head shape, and hairstyle, with that of a reference image, while preserving the target body, background, and motion dynamics. Due to the lack of ground-truth paired swapping data, prior methods typically train on cross-frame pairs of the same person within a video and rely on mask-based inpainting to mitigate identity leakage. Beyond potential boundary artifacts, this paradigm struggles to recover essential cues occluded by the mask, such as facial pose, expressions, and motion dynamics. To address these issues, we prompt a video editing model to synthesize new heads for existing videos as fake swapping inputs, while maintaining frame-synchronized facial poses and expressions. This yields HeadSwapBench, the first cross-identity paired dataset for video head swapping, which supports both training and benchmarking with genuine outputs. Leveraging this paired supervision, we propose DirectSwap, a mask-free, direct video head-swapping framework that extends an image U-Net into a video diffusion model with a motion module and conditioning inputs. Furthermore, we introduce the Motion- and Expression-Aware Reconstruction (MEAR) loss, which reweights the diffusion loss per pixel using frame-difference magnitudes and facial-landmark proximity, thereby enhancing cross-frame coherence in motion and expressions.
Extensive experiments demonstrate that DirectSwap achieves state-of-the-art visual quality, identity fidelity, and motion and expression consistency across diverse in-the-wild video scenes. We will release the source code and the HeadSwapBench dataset to facilitate future research.
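
The abstract says the MEAR loss reweights the diffusion loss per pixel using frame-difference magnitudes and facial-landmark proximity, but it does not give the formula. The sketch below is one plausible reading, not the authors' implementation: the additive form of the weight, the Gaussian falloff around landmarks, and the names `alpha`, `beta`, and `sigma` are all assumptions made for illustration.

```python
import math


def mear_weights(prev_frame, frame, landmarks, alpha=1.0, beta=1.0, sigma=2.0):
    """Per-pixel weights from frame difference and landmark proximity.

    Hypothetical form: 1 + alpha * normalized |frame - prev| + beta * proximity.
    Frames are 2D lists of floats; landmarks are (x, y) pixel coordinates.
    """
    h, w = len(frame), len(frame[0])
    diff = [[abs(frame[y][x] - prev_frame[y][x]) for x in range(w)] for y in range(h)]
    max_diff = max(max(row) for row in diff) or 1.0  # avoid division by zero
    weights = []
    for y in range(h):
        row = []
        for x in range(w):
            # Gaussian falloff around the nearest landmark (assumed shape).
            proximity = max(
                (math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2 * sigma ** 2))
                 for lx, ly in landmarks),
                default=0.0,
            )
            row.append(1.0 + alpha * diff[y][x] / max_diff + beta * proximity)
        weights.append(row)
    return weights


def weighted_mse(pred, target, weights):
    """Weighted mean squared error, normalized by the sum of weights."""
    num = 0.0
    den = 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            num += w * (p - t) ** 2
            den += w
    return num / den
```

With `alpha = beta = 0` the weights are all 1 and `weighted_mse` reduces to the ordinary mean squared error, which makes the effect of each term easy to isolate.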

[101] Label-free Motion-Conditioned Diffusion Model for Cardiac Ultrasound Synthesis

Zhe Li,Hadrien Reynaud,Johanna P Müller,Bernhard Kainz

Main category: cs.CV

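TL;DR: Proposes MCDM, a label-free latent diffusion model that generates cardiac ultrasound videos conditioned on self-supervised motion features. A MAFE module disentangles motion and appearance features, and auxiliary losses improve generation quality, yielding temporally coherent and clinically realistic videos on the EchoNet-Dynamic dataset.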

Details Motivation: Deep learning for echocardiography analysis is limited by the scarcity of labelled data, mainly because of privacy restrictions and the complexity of expert annotation; a generation method that needs no manual labels is therefore required for scalable cardiac ultrasound video synthesis. Method: Proposes the Motion Conditioned Diffusion Model (MCDM), which conditions generation on motion features extracted through self-supervised learning. A Motion and Appearance Feature Extractor (MAFE) module separates motion and appearance representations, and two auxiliary objectives, a re-identification loss and an optical flow loss, strengthen feature learning. Result: Validated on the EchoNet-Dynamic dataset, MCDM achieves competitive video generation performance without manual labels, and the generated sequences are temporally coherent and clinically realistic. Conclusion: Cardiac ultrasound video generation under self-supervised conditioning is feasible and promising; MCDM offers a scalable new option for medical image generation where annotated data is scarce. Abstract: Ultrasound echocardiography is essential for the non-invasive, real-time assessment of cardiac function, but the scarcity of labelled data, driven by privacy restrictions and the complexity of expert annotation, remains a major obstacle for deep learning methods. We propose the Motion Conditioned Diffusion Model (MCDM), a label-free latent diffusion framework that synthesises realistic echocardiography videos conditioned on self-supervised motion features. To extract these features, we design the Motion and Appearance Feature Extractor (MAFE), which disentangles motion and appearance representations from videos. Feature learning is further enhanced by two auxiliary objectives: a re-identification loss guided by pseudo appearance features and an optical flow loss guided by pseudo flow fields. Evaluated on the EchoNet-Dynamic dataset, MCDM achieves competitive video generation performance, producing temporally coherent and clinically realistic sequences without reliance on manual labels. These results demonstrate the potential of self-supervised conditioning for scalable echocardiography synthesis. Our code is available at https://github.com/ZheLi2020/LabelfreeMCDM.

[102] InfoMotion: A Graph-Based Approach to Video Dataset Distillation for Echocardiography

Zhe Li,Hadrien Reynaud,Alberto Gomez,Bernhard Kainz

Main category: cs.CV

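TL;DR: Proposes a new dataset distillation method for echocardiography videos based on motion feature extraction and the Infomap algorithm. By building class-wise graphs and selecting representative samples, it reaches 69.38% test accuracy on the EchoNet-Dynamic dataset with only 25 synthetic videos.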

Details Motivation: The growing scale of echocardiographic video data creates challenges for storage, computation, and model training efficiency, which motivates efficient compression of medical video datasets that still retains their information. Method: Motion feature extraction captures temporal dynamics, class-wise graphs are constructed, and the Infomap algorithm selects diverse, informative representative synthetic video samples. Result: A test accuracy of 69.38% is achieved on the EchoNet-Dynamic dataset using only 25 synthetic videos. Conclusion: The proposed method distills a compact ultrasound video dataset rich in clinical features, scales well, and suits the efficient use of medical video data. Abstract: Echocardiography plays a critical role in the diagnosis and monitoring of cardiovascular diseases as a non-invasive real-time assessment of cardiac structure and function. However, the growing scale of echocardiographic video data presents significant challenges in terms of storage, computation, and model training efficiency. Dataset distillation offers a promising solution by synthesizing a compact, informative subset of data that retains the key clinical features of the original dataset. In this work, we propose a novel approach for distilling a compact synthetic echocardiographic video dataset. Our method leverages motion feature extraction to capture temporal dynamics, followed by class-wise graph construction and representative sample selection using the Infomap algorithm. This enables us to select a diverse and informative subset of synthetic videos that preserves the essential characteristics of the original dataset. We evaluate our approach on the EchoNet-Dynamic dataset and achieve a test accuracy of 69.38% using only 25 synthetic videos. These results demonstrate the effectiveness and scalability of our method for medical video dataset distillation.

[103] FunPhase: A Periodic Functional Autoencoder for Motion Generation via Phase Manifolds

Marco Pegoraro,Evan Atherton,Bruno Roy,Aliasghar Khani,Arianna Rampini

Main category: cs.CV

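TL;DR: Proposes FunPhase, a functional periodic autoencoder that learns a phase manifold for motion through a function-space formulation, enabling smooth trajectories at arbitrary temporal resolutions and performing well across a range of tasks.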

Details Motivation: Existing motion prediction methods are limited in scalability and generality: they transfer poorly across skeletons and datasets and cannot be sampled at arbitrary temporal resolutions. Method: FunPhase replaces discrete temporal decoding with function-space modeling and learns a phase manifold that captures local periodicity, supporting continuous-time motion representation and generation. Result: FunPhase achieves substantially lower reconstruction error than prior periodic autoencoder baselines, generalizes well to tasks such as super-resolution and partial-body motion completion, and performs on par with state-of-the-art motion generation methods. Conclusion: FunPhase provides a unified, interpretable framework that combines motion prediction and generation in a single phase manifold, generalizes across datasets and skeletons, and broadens the range of applications. Abstract: Learning natural body motion remains challenging due to the strong coupling between spatial geometry and temporal dynamics. Embedding motion in phase manifolds, latent spaces that capture local periodicity, has proven effective for motion prediction; however, existing approaches lack scalability and remain confined to specific settings. We introduce FunPhase, a functional periodic autoencoder that learns a phase manifold for motion and replaces discrete temporal decoding with a function-space formulation, enabling smooth trajectories that can be sampled at arbitrary temporal resolutions. FunPhase supports downstream tasks such as super-resolution and partial-body motion completion, generalizes across skeletons and datasets, and unifies motion prediction and generation within a single interpretable manifold. Our model achieves substantially lower reconstruction error than prior periodic autoencoder baselines while enabling a broader range of applications and performing on par with state-of-the-art motion generation methods.
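
To illustrate why a function-space formulation allows sampling at arbitrary temporal resolutions, the sketch below evaluates a single periodic latent channel as a continuous function of time. The sinusoidal parameterization (amplitude, frequency, phase shift, offset) is an assumption borrowed from the general idea of periodic autoencoders; the abstract does not state FunPhase's actual functional form, and all names here are hypothetical.

```python
import math


def sample_phase_channel(amplitude, frequency, phase_shift, offset, times):
    """Evaluate a * sin(2*pi*(f*t - s)) + b at arbitrary timestamps (assumed form)."""
    return [
        amplitude * math.sin(2.0 * math.pi * (frequency * t - phase_shift)) + offset
        for t in times
    ]


def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop inclusive (n >= 2)."""
    step = (stop - start) / (n - 1)
    return [start + step * i for i in range(n)]


# The same parameters can be decoded on a coarse or a fine time grid.
coarse = sample_phase_channel(2.0, 1.0, 0.0, 0.5, linspace(0.0, 1.0, 5))
fine = sample_phase_channel(2.0, 1.0, 0.0, 0.5, linspace(0.0, 1.0, 9))
```

Because the channel is a closed-form function of `t`, the fine grid agrees with the coarse grid at shared timestamps; no interpolation between decoded frames is involved.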

[104] UniPart: Part-Level 3D Generation with Unified 3D Geom-Seg Latents

Xufan He,Yushuang Wu,Xiaoyang Guo,Chongjie Ye,Jiaqing Zhou,Tianlei Hu,Xiaoguang Han,Dong Du

Main category: cs.CV

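TL;DR: Proposes UniPart, a two-stage latent diffusion framework built on a joint geometry-segmentation representation, for image-guided part-level 3D generation with better segmentation controllability and geometric quality.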

Details Motivation: Existing part-level 3D generation methods rely on implicit segmentation or on strong, externally supervised segmenters; they lack fine granularity control and are limited by annotated data. Method: Proposes Geom-Seg VecSet, a unified representation of object geometry and part structure, and designs UniPart, a two-stage latent diffusion model: the first stage jointly generates geometry and latent part segmentation, and the second stage performs conditional diffusion on whole-object and part-specific latents. A dual-space generation scheme predicts part latents in both global and canonical spaces to improve geometric fidelity. Result: Experiments show that UniPart outperforms existing methods in part segmentation controllability and part-level geometric quality. Conclusion: By exploiting the part awareness that emerges naturally during geometry learning, UniPart achieves high-quality, controllable part-level 3D generation without explicit annotation. Abstract: Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry-segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.

[105] Representation Calibration and Uncertainty Guidance for Class-Incremental Learning based on Vision Language Model

Jiantao Tan,Peixian Ma,Tong Yu,Wentao Zhang,Ruixuan Wang

Main category: cs.CV

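TL;DR: Proposes a class-incremental learning framework based on vision-language models that mitigates class confusion through task-specific adapters and a cross-task representation calibration strategy, and adds an uncertainty-guided inference strategy to improve image classification performance.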

Details Motivation: Existing methods based on vision-language models struggle to distinguish classes from different tasks in class-incremental learning, which causes class confusion. Method: Task-specific adapters are added to the frozen image encoder to learn new knowledge, a mixture of light-weight projectors performs cross-task representation calibration, and an inference strategy based on prediction uncertainty selects the most suitable image feature. Result: Experiments on multiple datasets under various settings show that the method clearly outperforms existing methods on class-incremental learning tasks. Conclusion: The proposed framework effectively alleviates class confusion in class-incremental learning and improves classification performance during continual learning. Abstract: Class-incremental learning requires a learning system to continually learn knowledge of new classes while preserving previously learned knowledge of old classes. Current state-of-the-art methods based on Vision-Language Models (VLMs) still struggle to differentiate classes across learning tasks. Here, a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to help better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
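
The abstract describes selecting the most appropriate image feature according to prediction uncertainty, without specifying the uncertainty measure. As a rough sketch only, the code below assumes each task-specific adapter yields its own class logits and picks the adapter whose softmax distribution has the lowest Shannon entropy. Both the entropy criterion and the function names are assumptions, not the paper's stated procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_uncertainty(per_adapter_logits):
    """Return (adapter index, predicted class) for the least uncertain adapter."""
    best = min(
        range(len(per_adapter_logits)),
        key=lambda i: entropy(softmax(per_adapter_logits[i])),
    )
    probs = softmax(per_adapter_logits[best])
    return best, probs.index(max(probs))
```

For example, given one adapter with flat logits and another with a single dominant logit, the function selects the second adapter and its dominant class.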

[106] Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation

Nadeem Nazer,Hongkuan Zhou,Lavdim Halilaj,Ylli Sadikaj,Steffen Staab

Main category: cs.CV

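TL;DR: This paper proposes DAPO, a new defect-aware prompt optimization method for zero-shot multi-type and binary anomaly detection and segmentation that performs especially well under distribution shift. It learns hybrid prompts made of fixed textual anchors and learnable token embeddings, automatically aligning image features with text semantics and avoiding the cost and bias of handcrafted prompts. Experiments show that DAPO clearly outperforms baseline models on several public benchmarks and an internal dataset, with a 3.7% average improvement in image-level AUROC and average precision and a 6.5% average improvement in localizing novel anomaly types.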

Details Motivation: Existing vision-language models (such as CLIP) rely on high-level semantic information for anomaly detection but ignore fine-grained anomaly types (such as "hole" or "scratch"), which limits understanding of the nature of anomalies and root-cause analysis in manufacturing. Handcrafted prompts are also time-consuming and prone to subjective bias. An automated way to incorporate fine-grained semantic information is therefore needed to improve detection performance and practicality. Method: Proposes DAPO (Defect-aware Prompt Optimization), which learns hybrid defect-aware prompts through progressive tuning, combining fixed textual anchors with learnable token embeddings to align anomaly-relevant image features with the corresponding text semantics. The method supports multi-type anomaly detection and segmentation in the zero-shot setting and applies to scenarios with distribution shift. Result: In experiments on the public benchmarks MPDD, VisA, MVTec-AD, MAD, and Real-IAD and on an internal dataset, DAPO achieves a 3.7% average improvement in image-level AUROC and average precision and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings. Conclusion: DAPO effectively integrates fine-grained defect semantics and improves the zero-shot anomaly detection and localization ability of vision-language models under distribution shift without handcrafted prompts, giving stronger generalization and practical value. Abstract: Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details, such as the specific kind of anomaly ("hole", "cut", "scratch"), that could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of "abnormal" with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for the zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset.
The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
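
Since the reported gains are stated in AUROC, a short reminder of what the metric measures may help. The sketch below computes image-level AUROC as the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one, counting ties as one half. The scores and labels are made-up toy values, not data from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (1 = anomalous, 0 = normal)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: one anomalous image (score 0.2) is ranked below a normal one (0.8).
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; rank-based implementations are the usual choice for full benchmarks.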

[107] Cytoplasmic Strings Analysis in Human Embryo Time-Lapse Videos using Deep Learning Framework

Anabia Sohail,Mohamad Alansari,Ahmed Abughali,Asmaa Chehab,Abdelfatah Ahmed,Divya Velayudhan,Sajid Javed,Hasan Al Marzouqi,Ameena Saad Al-Sumaiti,Junaid Kashir,Naoufel Werghi

Main category: cs.CV

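TL;DR: This paper presents the first computational framework for analyzing cytoplasmic strings (CS) in human in-vitro fertilization embryos. It builds a dataset with human-in-the-loop annotation and designs a two-stage deep learning model to detect and localize CS, introducing a new uncertainty-aware loss, NUCE, to improve detection of sparse, low-contrast structures.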

Details Motivation: Existing embryo assessment methods mostly rely on conventional morphokinetic features and overlook emerging biomarkers such as cytoplasmic strings (CS). Although CS are associated with embryo viability, manual reading is highly subjective and prone to missed detections, so automated, objective computational tools are needed. Method: Proposes a two-stage deep learning framework: the first stage classifies CS presence at the frame level, and the second stage localizes CS regions in positive samples. A biologically validated dataset of 13,568 frames is built for this purpose, and a new loss function, NUCE, combines confidence-aware reweighting with an embedding contraction term to handle class imbalance and feature uncertainty. Result: The NUCE loss improves F1-score across five transformer backbones, and RF-DETR-based localization reaches state-of-the-art performance in detecting thin, low-contrast CS structures. Conclusion: This study is the first to automate CS analysis in human IVF embryos, providing a practical tool for integrating new biomarkers into embryo screening and potentially improving embryo selection accuracy in assisted reproduction. Abstract: Infertility is a major global health issue, and while in-vitro fertilization has improved treatment outcomes, embryo selection remains a critical bottleneck. Time-lapse imaging enables continuous, non-invasive monitoring of embryo development, yet most automated assessment methods rely solely on conventional morphokinetic features and overlook emerging biomarkers. Cytoplasmic Strings, thin filamentous structures connecting the inner cell mass and trophectoderm in expanded blastocysts, have been associated with faster blastocyst formation, higher blastocyst grades, and improved viability. However, CS assessment currently depends on manual visual inspection, which is labor-intensive, subjective, and severely affected by detection and subtle visual appearance. In this work, we present, to the best of our knowledge, the first computational framework for CS analysis in human IVF embryos. We first design a human-in-the-loop annotation pipeline to curate a biologically validated CS dataset from TLI videos, comprising 13,568 frames with highly sparse CS-positive instances. Building on this dataset, we propose a two-stage deep learning framework that (i) classifies CS presence at the frame level and (ii) localizes CS regions in positive cases. To address severe imbalance and feature uncertainty, we introduce the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which couples confidence-aware reweighting with an embedding contraction term to form compact, well-separated class clusters.
NUCE consistently improves F1-score across five transformer backbones, while RF-DETR-based localization achieves state-of-the-art (SOTA) detection performance for thin, low-contrast CS structures. The source code will be made publicly available at: https://github.com/HamadYA/CS_Detection.

[108] Privacy-Preserving Computer Vision for Industry: Three Case Studies in Human-Centric Manufacturing

Sander De Coninck,Emilio Gamba,Bart Van Doninck,Abdellatif Bey-Temsamani,Sam Leroux,Pieter Simoens

Main category: cs.CV

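TL;DR: This paper validates a privacy-preserving framework in real industrial environments. Learned visual transformations obscure sensitive information while retaining task-critical features, balancing monitoring effectiveness with privacy protection.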

Details Motivation: Applying AI-powered computer vision in industry often requires balancing operational utility against worker privacy. Existing methods struggle to satisfy both, so a solution is needed that protects privacy without sacrificing usefulness. Method: A learning-based visual transformation approach is validated in three real use cases (woodworking production monitoring, human-aware AGV navigation, and multi-camera ergonomic risk assessment), with performance analyzed through quantitative evaluation of the privacy-utility trade-off and qualitative feedback from partners. Result: Task-specific obfuscation enables effective monitoring while lowering privacy risk; the framework is feasible to deploy in practice and is endorsed by the industrial partners. Conclusion: The privacy-preserving framework is mature enough for deployment in real industrial environments and yields cross-domain recommendations for responsible, human-centric AI deployment. Abstract: The adoption of AI-powered computer vision in industry is often constrained by the need to balance operational utility with worker privacy. Building on our previously proposed privacy-preserving framework, this paper presents its first comprehensive validation on real-world data collected directly by industrial partners in active production environments. We evaluate the framework across three representative use cases: woodworking production monitoring, human-aware AGV navigation, and multi-camera ergonomic risk assessment. The approach employs learned visual transformations that obscure sensitive or task-irrelevant information while retaining features essential for task performance. Through both quantitative evaluation of the privacy-utility trade-off and qualitative feedback from industrial partners, we assess the framework's effectiveness, deployment feasibility, and trust implications. Results demonstrate that task-specific obfuscation enables effective monitoring with reduced privacy risks, establishing the framework's readiness for real-world adoption and providing cross-domain recommendations for responsible, human-centric AI deployment in industry.

[109] Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach

Yiqun Wang,Lujun Li,Meiru Yue,Radu State

Main category: cs.CV

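TL;DR: Proposes a temporal-spatial fusion framework based on the Video Vision Transformer (ViViT) for time-series reconstruction of cloud-covered regions in multispectral imagery, notably improving crop mapping accuracy.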

Details Motivation: Existing ViT-based methods use coarse temporal embeddings that lose information, degrading multispectral image reconstruction under cloud occlusion. Method: Non-overlapping tubelets (temporal span t=2) are extracted with 3D convolution and combined with a temporal-spatial fusion embedding; reconstruction is carried out in both MSI-only and SAR-MSI fusion scenarios. Result: On 2020 Traill County data, MTS-ViViT lowers MSE by 2.23% relative to MTS-ViT, and with SAR added, SMTS-ViViT improves on SMTS-ViT by 10.33%. Conclusion: The proposed ViViT framework effectively improves multispectral reconstruction quality under cloud cover and makes early-season crop monitoring more robust. Abstract: Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with constrained temporal span (t=2), ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered during the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
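
A 3D convolution whose stride equals its kernel size is equivalent to cutting the sequence into non-overlapping tubelets and applying one shared linear projection to each. The sketch below shows that view for a single-band sequence with temporal span t=2; the spatial patch size `p`, the nested-list layout, and the toy projection weights are assumptions for illustration, not the paper's configuration.

```python
def extract_tubelets(video, t=2, p=2):
    """Split a [T][H][W] sequence into flattened, non-overlapping t x p x p tubelets."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % t or H % p or W % p:
        raise ValueError("T, H, W must be divisible by the tubelet size")
    tubelets = []
    for t0 in range(0, T, t):
        for y0 in range(0, H, p):
            for x0 in range(0, W, p):
                tubelets.append([
                    video[t0 + dt][y0 + dy][x0 + dx]
                    for dt in range(t)
                    for dy in range(p)
                    for dx in range(p)
                ])
    return tubelets


def embed(tubelets, weight):
    """Project each flattened tubelet with a shared weight matrix of shape [D][t*p*p]."""
    return [
        [sum(w * v for w, v in zip(row, tube)) for row in weight]
        for tube in tubelets
    ]


# Toy sequence: 4 time steps of a 2 x 4 image, pixel value encodes (t, y, x).
toy_video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(2)] for t in range(4)]
toy_tubelets = extract_tubelets(toy_video, t=2, p=2)
```

Each tubelet only mixes two adjacent acquisition dates, which is the local temporal coherence the abstract refers to; the transformer layers that follow are what relate tubelets across the full sequence.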

[110] Color encoding in Latent Space of Stable Diffusion Models

Guillem Arias,Ariadna Solà,Martí Armengod,Maria Vanrell

Main category: cs.CV

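TL;DR: By analyzing the latent representations of Stable Diffusion, this paper shows that color information is encoded along circular, opponent axes in specific latent channels, while intensity and shape are dominated by other channels, indicating that the latent space has an interpretable structure consistent with efficient coding.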

Details Motivation: Although diffusion-based generative models have advanced in visual fidelity, how perceptual attributes such as color and shape are represented inside the model is still poorly understood. Method: The latent representations in Stable Diffusion are analyzed systematically using synthetic datasets, principal component analysis (PCA), and similarity metrics. Result: Color information is found to be encoded mainly along circular, opponent axes in channels c_3 and c_4, while intensity and shape are dominated by channels c_1 and c_2. Conclusion: The latent space of Stable Diffusion has an interpretable structure consistent with efficient coding theory, providing a basis for model understanding, editing applications, and the design of more disentangled generative frameworks. Abstract: Recent advances in diffusion-based generative models have achieved remarkable visual fidelity, yet a detailed understanding of how specific perceptual attributes - such as color and shape - are internally represented remains limited. This work explores how color is encoded in a generative model through a systematic analysis of the latent representations in Stable Diffusion. Through controlled synthetic datasets, principal component analysis (PCA) and similarity metrics, we reveal that color information is encoded along circular, opponent axes predominantly captured in latent channels c_3 and c_4, whereas intensity and shape are primarily represented in channels c_1 and c_2. Our findings indicate that the latent space of Stable Diffusion exhibits an interpretable structure aligned with an efficient coding representation. These insights provide a foundation for future work in model understanding, editing applications, and the design of more disentangled generative frameworks.
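
If color really lies on circular, opponent axes spanned by channels c_3 and c_4, a latent pixel can be read in polar form: an angle that behaves like hue and a radius that behaves like chroma. The sketch below shows that reading only. It is a hypothetical illustration, not the paper's analysis code: it assumes 1-based channel names (so c_3 is index 2), and which angle corresponds to which color is not claimed.

```python
import math


def latent_hue_angle(latent_pixel):
    """Angle in degrees, in [0, 360), of the (c_3, c_4) pair of a 4-channel latent pixel."""
    c3, c4 = latent_pixel[2], latent_pixel[3]
    return math.degrees(math.atan2(c4, c3)) % 360.0


def latent_chroma(latent_pixel):
    """Radius of the (c_3, c_4) pair, a rough stand-in for color saturation."""
    return math.hypot(latent_pixel[2], latent_pixel[3])


# Two made-up latent pixels on opposite sides of the c_3 axis.
pixel_a = [0.1, -0.3, 1.0, 0.0]
pixel_b = [0.1, -0.3, -1.0, 0.0]
angle_gap = abs(latent_hue_angle(pixel_a) - latent_hue_angle(pixel_b))
```

Under this reading, opponent colors sit 180 degrees apart, and a pixel with both channels near zero has no well-defined angle, much like an achromatic color.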

[111] MODA: The First Challenging Benchmark for Multispectral Object Detection in Aerial Images

Shuaihao Han,Tingfa Xu,Peifu Liu,Jianan Li

Main category: cs.CV

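TL;DR: This paper introduces MODA, the first large-scale dataset for object detection in aerial multispectral images, and designs the OSSDet framework, which fuses spectral, spatial, and object-aware information to improve small-object detection.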

Details Motivation: Small objects and complex background interference in aerial scenes limit conventional RGB-based detection, and although multispectral images provide additional spectral information, a shortage of training data has held the field back. Method: Proposes the OSSDet framework, which uses a cascaded spectral-spatial modulation structure, aggregates features by spectral similarity, suppresses background interference through object-aware masking, and adds a cross-spectral attention mechanism that refines representations under explicit object-aware guidance. Result: Experiments show that OSSDet achieves better detection performance than existing methods at comparable parameter count and efficiency. Conclusion: The MODA dataset provides an important foundation for aerial multispectral object detection, and OSSDet clearly improves detection by effectively fusing multispectral information with object-aware mechanisms. Abstract: Aerial object detection faces significant challenges in real-world scenarios, such as small objects and extensive background interference, which limit the performance of RGB-based detectors with insufficient discriminative information. Multispectral images (MSIs) capture additional spectral cues across multiple bands, offering a promising alternative. However, the lack of training data has been the primary bottleneck to exploiting the potential of MSIs. To address this gap, we introduce the first large-scale dataset for Multispectral Object Detection in Aerial images (MODA), which comprises 14,041 MSIs and 330,191 annotations across diverse, challenging scenarios, providing a comprehensive data foundation for this field. Furthermore, to overcome challenges inherent to aerial object detection using MSIs, we propose OSSDet, a framework that integrates spectral and spatial information with object-aware cues. OSSDet employs a cascaded spectral-spatial modulation structure to optimize target perception, aggregates spectrally related features by exploiting spectral similarities to reinforce intra-object correlations, and suppresses irrelevant background via object-aware masking. Moreover, cross-spectral attention further refines object-related representations under explicit object-aware guidance. Extensive experiments demonstrate that OSSDet outperforms existing methods with comparable parameters and efficiency.

[112] StateSpace-SSL: Linear-Time Self-supervised Learning for Plant Disease Detection

Abdullah Al Mamun,Miaohua Zhang,David Ahmedt-Aristizabal,Zeeshan Hayder,Mohammad Awrangjeb

Main category: cs.CV

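TL;DR: This paper proposes StateSpace-SSL, a linear-time self-supervised learning framework that uses a Vision Mamba state-space encoder with directional scanning to model the long-range continuity of lesions across the leaf surface, combined with a prototype-driven teacher-student objective, to learn stable, lesion-aware feature representations from labelled data.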

Details Motivation: Existing self-supervised learning methods based on CNNs or vision transformers perform poorly on agricultural imagery: CNNs struggle to capture disease patterns that evolve continuously along leaf structures, and transformers incur high attention cost from high-resolution patches. A more efficient SSL method better suited to agricultural imagery is therefore needed. Method: Proposes StateSpace-SSL, which uses a Vision Mamba encoder with directional scanning to model long-range lesion continuity, and designs a prototype-driven teacher-student framework that aligns representations across multiple views, improving feature stability and lesion sensitivity. Result: Experiments on three public plant disease datasets show that StateSpace-SSL outperforms CNN- and transformer-based self-supervised baselines on various evaluation metrics; qualitative analysis shows that it learns compact, lesion-focused feature maps. Conclusion: By introducing a linear-complexity state-space model, StateSpace-SSL addresses the limitations of conventional SSL methods for plant disease detection and demonstrates the advantage of linear state-space modelling for self-supervised plant disease representation learning. Abstract: Self-supervised learning (SSL) is attractive for plant disease detection as it can exploit large collections of unlabeled leaf images, yet most existing SSL methods are built on CNNs or vision transformers that are poorly matched to agricultural imagery. CNN-based SSL struggles to capture disease patterns that evolve continuously along leaf structures, while transformer-based SSL introduces quadratic attention cost from high-resolution patches. To address these limitations, we propose StateSpace-SSL, a linear-time SSL framework that employs a Vision Mamba state-space encoder to model long-range lesion continuity through directional scanning across the leaf surface. A prototype-driven teacher-student objective aligns representations across multiple views, encouraging stable and lesion-aware features from labelled data. Experiments on three publicly available plant disease datasets show that StateSpace-SSL consistently outperforms the CNN- and transformer-based SSL baselines in various evaluation metrics. Qualitative analyses further confirm that it learns compact, lesion-focused feature maps, highlighting the advantage of linear state-space modelling for self-supervised plant disease representation learning.

[113] Gradient-Guided Learning Network for Infrared Small Target Detection

Jinmiao Zhao,Chuang Yu,Zelin Shi,Yunpeng Liu,Yingdi Zhang

Main category: cs.CV

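TL;DR: Proposes GGL-Net, a gradient-guided learning network for infrared small target detection that introduces gradient magnitude images and a dual-branch structure to improve edge localization accuracy and feature representation, reaching state-of-the-art performance on public datasets.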

Details Motivation: Existing infrared small target detection methods suffer from inaccurate edge localization and from targets being submerged by the background, mainly because the targets are small and lack salient features. Edge information therefore needs to be strengthened and multi-scale feature fusion improved. Method: Proposes GGL-Net, the first to introduce gradient magnitude images into a deep learning framework for this task. A dual-branch feature extraction network with a gradient supplementary module (GSM) preserves gradient information in deeper layers and uses attention mechanisms to strengthen feature extraction, and a two-way guidance fusion module (TGFM) fuses multi-scale features effectively. Result: Extensive experiments on the public NUAA-SIRST and NUDT-SIRST datasets show that GGL-Net outperforms existing methods and achieves state-of-the-art detection results. Conclusion: Through gradient-guided learning, GGL-Net improves edge localization accuracy and resistance to background interference for infrared small targets, offering a new direction for deep learning-based small target detection. Abstract: Recently, infrared small target detection has attracted extensive attention. However, because infrared small targets are small and lack intrinsic features, existing methods generally suffer from inaccurate edge positioning, and the target is easily submerged by the background. Therefore, we propose an innovative gradient-guided learning network (GGL-Net). Specifically, we are the first to explore the introduction of gradient magnitude images into the deep learning-based infrared small target detection method, which is conducive to emphasizing the edge details and alleviating the problem of inaccurate edge positioning of small targets. On this basis, we propose a novel dual-branch feature extraction network that utilizes the proposed gradient supplementary module (GSM) to encode raw gradient information into deeper network layers and embeds attention mechanisms reasonably to enhance feature extraction ability. In addition, we construct a two-way guidance fusion module (TGFM), which fully considers the characteristics of feature maps at different levels. It can facilitate the effective fusion of multi-scale feature maps and extract richer semantic information and detailed information through reasonable two-way guidance. Extensive experiments prove that GGL-Net achieves state-of-the-art results on the public real NUAA-SIRST dataset and the public synthetic NUDT-SIRST dataset. Our code has been integrated into https://github.com/YuChuang1205/MSDA-Net
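
A gradient magnitude image highlights where intensity changes sharply, which is why it can emphasize the edges of a small target. The abstract does not say which operator GGL-Net uses, so the sketch below assumes 3x3 Sobel kernels with replicated borders; it illustrates the general idea rather than the authors' preprocessing.

```python
import math

# Sobel kernels for horizontal (x) and vertical (y) derivatives (assumed operator).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def gradient_magnitude(image):
    """Return sqrt(gx^2 + gy^2) per pixel for a 2D list of intensities."""
    h, w = len(image), len(image[0])

    def px(y, x):
        # Replicate the border by clamping coordinates into the image.
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = []
    for y in range(h):
        row = []
        for x in range(w):
            gx = sum(SOBEL_X[j][i] * px(y + j - 1, x + i - 1)
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * px(y + j - 1, x + i - 1)
                     for j in range(3) for i in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


# A single bright pixel on a dark background, a crude stand-in for a small target.
toy_image = [[0.0] * 5 for _ in range(5)]
toy_image[2][2] = 1.0
toy_gradient = gradient_magnitude(toy_image)
```

For the single bright pixel, the response is zero at the pixel itself and nonzero in the ring around it, so the gradient image outlines the target rather than filling it in.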

[114] Masked Registration and Autoencoding of CT Images for Predictive Tibia Reconstruction

Hongyou Zhou,Cederic Aßmann,Alaa Bejaoui,Heiko Tzschätzsch,Mark Heyland,Julian Zierke,Niklas Tuttle,Sebastian Hölzl,Timo Auer,David A. Back,Marc Toussaint

Main category: cs.CV

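TL;DR: Proposes a method that combines neural registration with autoencoder models to predict a patient-specific healthy bone reconstruction target from CT images of a fractured tibia.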

Details Motivation: Surgical planning for complex tibial fractures is challenging for surgeons because the desired post-operative 3D bone alignment is hard to imagine. Method: A modified spatial transformer network (STN) is first trained to register raw CT images to the standardized coordinate system of a jointly trained tibia prototype. Various autoencoder (AE) architectures are then used to model healthy tibial shape variation, and both the STN and AE models are designed to handle masked input so that they can be applied to fractured CTs and decode a prediction of the patient-specific healthy bone. Result: The work delivers a 3D-adapted STN for global spatial registration, a comparative analysis of several AEs for bone CT modeling, and an extension of both STN and AE to masked inputs for predictive generation of healthy bone structures. Conclusion: The method can predict an individualized healthy tibia structure for fracture patients, which supports pre-operative planning and offers new technical support for repairing complex fractures. Abstract: Surgical planning for complex tibial fractures can be challenging for surgeons, as the 3D structure of the later desirable bone alignment may be difficult to imagine. To assist in such planning, we address the challenge of predicting a patient-specific reconstruction target from a CT of the fractured tibia. Our approach combines neural registration and autoencoder models. Specifically, we first train a modified spatial transformer network (STN) to register a raw CT to a standardized coordinate system of a jointly trained tibia prototype. Subsequently, various autoencoder (AE) architectures are trained to model healthy tibial variations. Both the STN and AE models are further designed to be robust to masked input, allowing us to apply them to fractured CTs and decode to a prediction of the patient-specific healthy bone in standard coordinates. Our contributions include: i) a 3D-adapted STN for global spatial registration, ii) a comparative analysis of AEs for bone CT modeling, and iii) the extension of both to handle masked inputs for predictive generation of healthy bone structures. Project page: https://github.com/HongyouZhou/repair

[115] A Dual-Domain Convolutional Network for Hyperspectral Single-Image Super-Resolution

Murat Karayaka,Usman Muhammad,Jorma Laaksonen,Md Ziaul Hoque,Tapio Seppänen

Main category: cs.CV

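TL;DR: Proposes DDSRNet, a lightweight dual-domain super-resolution network that combines spatial-domain and wavelet-domain learning for efficient hyperspectral image super-resolution.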

Details Motivation: To improve hyperspectral image super-resolution quality while lowering computational cost by exploiting the complementary information in the spatial and frequency domains. Method: The network consists of a shallow feature extraction module (Spatial-Net), a DWT-based low-frequency enhancement branch, and a shared-weight high-frequency refinement branch; the DWT performs subband decomposition and the inverse DWT reconstructs the high-resolution image. Result: Competitive performance is achieved on three hyperspectral image datasets while keeping computational overhead low. Conclusion: By fusing spatial- and frequency-domain learning, DDSRNet improves both the efficiency and the quality of hyperspectral image super-resolution and suits resource-constrained scenarios. Abstract: This study presents a lightweight dual-domain super-resolution network (DDSRNet) that combines Spatial-Net with the discrete wavelet transform (DWT). Specifically, our proposed model comprises three main components: (1) a shallow feature extraction module, termed Spatial-Net, which performs residual learning and bilinear interpolation; (2) a low-frequency enhancement branch based on the DWT that refines coarse image structures; and (3) a shared high-frequency refinement branch that simultaneously enhances the LH (horizontal), HL (vertical), and HH (diagonal) wavelet subbands using a single CNN with shared weights. As a result, the DWT enables subband decomposition, while the inverse DWT reconstructs the final high-resolution output. By doing so, the integration of spatial- and frequency-domain learning enables DDSRNet to achieve highly competitive performance with low computational cost on three hyperspectral image datasets, demonstrating its effectiveness for hyperspectral image super-resolution.
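
The decomposition into one low-frequency subband and three high-frequency subbands, followed by exact reconstruction with the inverse transform, can be sketched with a single-level 2D Haar transform. The abstract does not name the wavelet, so Haar is an assumption here, and the LH/HL naming follows the abstract's labels; conventions differ between libraries.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of a 2D list with even height and width.

    Returns (LL, LH, HL, HH), each half the size of the input.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, h, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for x in range(0, w, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            r_ll.append((a + b + c + d) / 2)  # coarse structure
            r_lh.append((a + b - c - d) / 2)  # top vs bottom: horizontal edges
            r_hl.append((a - b + c - d) / 2)  # left vs right: vertical edges
            r_hh.append((a - b - c + d) / 2)  # diagonal detail
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution image from four subbands."""
    out = []
    for r in range(len(ll)):
        top, bottom = [], []
        for c in range(len(ll[0])):
            s, u, v, t = ll[r][c], lh[r][c], hl[r][c], hh[r][c]
            top += [(s + u + v + t) / 2, (s + u - v - t) / 2]
            bottom += [(s - u + v - t) / 2, (s - u - v + t) / 2]
        out += [top, bottom]
    return out
```

In a network of the kind described, learned branches would modify the subbands between the forward and inverse transforms; with unmodified subbands the round trip returns the input exactly.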

[116] Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment

Yuan Li,Zitang Sun,Yen-ju Chen,Shin'ya Nishida

Main category: cs.CV

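TL;DR: This paper studies two problems in blind image quality assessment (BIQA) based on vision-language models (VLMs): predictions that contradict the generated text, and unstable inference. It proposes a two-stage tuning method that separates visual perception from quality inference, clearly improving prediction stability and correlation metrics.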

Details Motivation: Existing VLMs in BIQA often generate text that is inconsistent with their quality predictions and show unstable inference, which does not match human reasoning, so their consistency and reliability need improvement. Method: Proposes a two-stage tuning method: in the first stage the model learns visual features, and in the second it infers quality solely from the learned features, decoupling perception from inference. Decoding of intermediate layers is analyzed to confirm the model's reliance on a limited set of candidate tokens. Result: On SPAQ and KONIQ, prediction instability falls from 22.00% to 12.39%, and SRCC/PLCC improve by 0.3124/0.3507 on average across LIVE, CSIQ, SPAQ, and KONIQ, with a more stable and reliable inference process. Conclusion: By separating visual perception from quality inference, the method makes VLMs in BIQA more consistent and stable in a human-like way, improving the interpretability and reliability of their decisions. Abstract: Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference - behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
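
The reported gains are in SRCC and PLCC, the two correlation measures commonly used to compare predicted quality scores with human opinion scores. As a small sketch with made-up numbers, PLCC is the Pearson correlation of the raw scores, and SRCC is the Pearson correlation of their ranks, with tied values sharing an average rank.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up predictions that are monotonic in the human scores but not linear.
human_scores = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]
```

SRCC rewards getting the ordering of images right, while PLCC also penalizes nonlinear distortion of the score scale, which is why papers usually report both.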

[117] From Graphs to Gates: DNS-HyXNet, A Lightweight and Deployable Sequential Model for Real-Time DNS Tunnel Detection

Faraz Ali,Muhammad Afaq,Mahmood Niazi,Muzammil Behzad

Main category: cs.CV

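TL;DR: This paper proposes DNS-HyXNet, a lightweight xLSTM-based hybrid model for efficient, real-time DNS tunnel detection. It avoids the high overhead of graph construction and achieves near-perfect detection performance with very low latency on public datasets.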

Details Motivation: Existing graph-based methods (such as GraphTunnel) are accurate, but recursive parsing and graph construction cause high latency and computational overhead, making them hard to deploy in real time. Method: Proposes DNS-HyXNet, which combines tokenized domain embeddings with normalized numerical features and uses a two-layer xLSTM network to learn temporal dependencies directly from DNS packet sequences, giving single-stage multi-class classification without graph reconstruction. Result: The model reaches up to 99.99% accuracy on multiple public datasets, with macro-averaged precision, recall, and F1-score all above 99.96% and a per-sample detection latency of only 0.041 ms. Conclusion: xLSTM-based sequence modeling can effectively replace costly graph generation, offering a practical route to efficient, energy-saving real-time DNS tunnel detection on commodity hardware. Abstract: Domain Name System (DNS) tunneling remains a covert channel for data exfiltration and command-and-control communication. Although graph-based methods such as GraphTunnel achieve strong accuracy, they introduce significant latency and computational overhead due to recursive parsing and graph construction, limiting their suitability for real-time deployment. This work presents DNS-HyXNet, a lightweight extended Long Short-Term Memory (xLSTM) hybrid framework designed for efficient sequence-based DNS tunnel detection. DNS-HyXNet integrates tokenized domain embeddings with normalized numerical DNS features and processes them through a two-layer xLSTM network that directly learns temporal dependencies from packet sequences, eliminating the need for graph reconstruction and enabling single-stage multi-class classification. The model was trained and evaluated on two public benchmark datasets with carefully tuned hyperparameters to ensure low memory consumption and fast inference. Across all experimental splits of the DNS-Tunnel-Datasets, DNS-HyXNet achieved up to 99.99% accuracy, with macro-averaged precision, recall, and F1-scores exceeding 99.96%, and demonstrated a per-sample detection latency of just 0.041 ms, confirming its scalability and real-time readiness. These results show that sequential modeling with xLSTM can effectively replace computationally expensive recursive graph generation, offering a deployable and energy-efficient alternative for real-time DNS tunnel detection on commodity hardware.

[118] Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment

Yuan Li,Zitang Sun,Yen-Ju Chen,Shin'ya Nishida

Main category: cs.CV

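TL;DR: This paper studies why multimodal large language models (MLLMs) perceive low-level distortions (such as blur and noise) poorly in image quality assessment (IQA). It finds that key features are easily lost during vision-language alignment and that the models tend to overfit training templates. Improving the alignment of the vision encoder raises distortion recognition accuracy from 14.92% to 84.43%, indicating that stronger constraints on the vision encoder improve interpretable visual representations.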

Details Motivation: To investigate why MLLMs struggle to reliably detect low-level visual distortions in image quality assessment, and whether their visual perception is genuinely effective. Method: A low-level distortion perception task is introduced to evaluate how well MLLMs classify specific distortion types; a component-wise analysis computes the semantic distance between visual features and semantic tokens and assesses how the vision encoder changes before and after fine-tuning. Result: MLLMs are structurally capable of representing low-level distortions, but overfitting to training templates and information loss during vision-language alignment lead to low distortion recognition accuracy; improving the alignment of the vision encoder raises accuracy from 14.92% to 84.43%. Conclusion: Dedicated constraints should be placed on the vision encoder of MLLMs to better preserve critical low-level visual features and yield more consistent, interpretable visual reasoning. Abstract: Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.
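
The abstract measures a "semantic distance" between visual features and semantic tokens but does not say which metric is used. One common choice is cosine distance; the sketch below assumes that metric, and the vectors are made-up placeholders rather than real encoder outputs.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); 0 means same direction, 2 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical embedding of the token "blur" and a visual feature
# measured before and after fine-tuning the vision encoder.
blur_token = [1.0, 0.0, 0.0]
feature_before = [0.1, 0.9, 0.2]
feature_after = [0.9, 0.1, 0.1]

d_before = cosine_distance(feature_before, blur_token)
d_after = cosine_distance(feature_after, blur_token)
```

Under this reading, a drop from `d_before` to `d_after` would indicate that fine-tuning moved the visual feature closer to the matching distortion token.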

[119] Seeing Soil from Space: Towards Robust and Scalable Remote Soil Nutrient Analysis

David Seu,Nicolas Longepe,Gabriel Cioltea,Erik Maidik,Calin Andrei

Main category: cs.CV

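TL;DR: This study presents a scalable modeling system that uses remote sensing data and environmental covariates to estimate cropland soil properties (such as organic carbon, nitrogen, phosphorus, potassium, and pH). It combines physically interpretable covariates with nonlinear embedding features and shows good generalization and uncertainty calibration under strict spatial validation.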

Details Motivation: Accessible, scalable soil assessment tools that can broadly support agricultural decision-making are lacking; as environmental variables increasingly affect agriculture, a method that robustly estimates soil properties across diverse pedoclimatic zones is needed. Method: A hybrid modeling approach combines interpretable physical covariates generated by radiative transfer models (RTMs) with nonlinear embeddings extracted by a foundation model, models soil properties both indirectly and directly from remote sensing data and environmental drivers, and is validated on train-test sets built with strict spatial blocking, stratified splits, and statistical distinctness. Result: The models generalize well to unseen regions and predict soil organic carbon (SOC) and total nitrogen (N) best: SOC has an MAE of 5.12 g/kg and a CCC of 0.77; N has an MAE of 0.44 g/kg and a CCC of 0.77; conformal calibration achieves 90% uncertainty coverage. Conclusion: The study advances the digitalization of agriculture by providing a scalable, data-driven soil analysis framework applicable to related domains that require quantitative soil evaluation, such as carbon markets. Abstract: Environmental variables are increasingly affecting agricultural decision-making, yet accessible and scalable tools for soil assessment remain limited. This study presents a robust and scalable modeling system for estimating soil properties in croplands, including soil organic carbon (SOC), total nitrogen (N), available phosphorus (P), exchangeable potassium (K), and pH, using remote sensing data and environmental covariates. The system employs a hybrid modeling approach, combining the indirect methods of modeling soil through proxies and drivers with direct spectral modeling. We extend current approaches by using interpretable physics-informed covariates derived from radiative transfer models (RTMs) and complex, nonlinear embeddings from a foundation model. We validate the system on a harmonized dataset that covers Europe's cropland soils across diverse pedoclimatic zones. Evaluation is conducted under a robust validation framework that enforces strict spatial blocking, stratified splits, and statistically distinct train-test sets, which deliberately make the evaluation harder and produce more realistic error estimates for unseen regions. The models achieved their highest accuracy for SOC and N. This performance held across unseen locations, under both spatial cross-validation and an independent test set. SOC obtained an MAE of 5.12 g/kg and a CCC of 0.77, and N obtained an MAE of 0.44 g/kg and a CCC of 0.77. We also assess uncertainty through conformal calibration, achieving 90 percent coverage at the target confidence level.
This study contributes to the digital advancement of agriculture through the application of scalable, data-driven soil analysis frameworks that can be extended to related domains requiring quantitative soil evaluation, such as carbon markets.
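
The two reported metrics, MAE and CCC, can be computed as in the following sketch. It assumes CCC denotes Lin's concordance correlation coefficient with population (1/n) moments, which the abstract does not spell out; the sample values are invented for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observations and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def concordance_correlation(y_true, y_pred):
    """Lin's CCC: 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2).

    Unlike Pearson correlation, CCC also penalizes a systematic offset
    or scale mismatch between predictions and observations.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

# Invented SOC-like values in g/kg: predictions are perfectly correlated
# with the observations but biased by +1 g/kg.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
mae = mean_absolute_error(observed, predicted)
ccc = concordance_correlation(observed, predicted)
```

In this toy case the Pearson correlation would be 1, yet CCC is only about 0.57 because of the constant bias, which is the reason CCC is often preferred when absolute agreement matters.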

[120] Hands-on Evaluation of Visual Transformers for Object Recognition and Detection

Dimitrios N. Vlachogiannis,Dimitrios A. Koutsomitropoulos

Main category: cs.CV

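TL;DR: This paper compares different types of Vision Transformers (ViTs) with traditional CNNs on image classification, object detection, and medical image classification. Hybrid and hierarchical transformers (such as Swin and CvT) strike a good balance between accuracy and computational resources, and outperform CNNs especially on medical imaging tasks that require global context.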

Details Motivation: CNNs are limited in capturing global image context, whereas Vision Transformers model long-range dependencies better through self-attention; the paper therefore systematically compares ViTs and CNNs across vision tasks, particularly in settings such as medical imaging that need global understanding. Method: Pure, hierarchical (e.g., Swin and CvT), and hybrid ViT architectures are compared with traditional CNNs on standard datasets including ImageNet, COCO, and ChestX-ray14, and data augmentation is used when evaluating medical image classification. Result: Hierarchical and hybrid ViTs such as Swin and CvT perform strongly on image classification and object detection and stand out on medical image classification; with data augmentation, the Swin Transformer improves significantly. Conclusion: Vision Transformers match or outperform traditional CNNs in most cases, are especially suited to tasks requiring global context, and show great potential for computer vision, medical image analysis in particular. Abstract: Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.

[121] Content-Adaptive Image Retouching Guided by Attribute-Based Text Representation

Hancheng Zhu,Xinyu Liu,Rui Yao,Kunyang Sun,Leida Li,Abdulmotaleb El Saddik

Main category: cs.CV

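TL;DR: This paper proposes a content-adaptive image retouching method guided by attribute-based text representation (CA-ATR). By combining content-aware curve mapping with text guidance that encodes the user's style preferences, it enables more flexible, personalized image enhancement.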

Details Motivation: Most existing retouching methods apply a global pixel-wise color mapping, ignoring color variations induced by image content, so they struggle to adapt to the color distributions of different regions or to incorporate user-defined styles. Method: A content-adaptive curve mapping module uses basis curves and learned weight maps to produce color transformations that depend on spatial context; an attribute text prediction module generates text representations from multiple image attributes, which a multimodal model fuses with visual features to guide retouching. Result: Experiments on several public datasets show that the method outperforms existing methods in both quantitative metrics and visual quality, achieving state-of-the-art performance. Conclusion: The proposed CA-ATR framework improves the adaptivity and personalization of image retouching, accommodating both diverse image content and user style preferences. Abstract: Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images, neglecting the inherent color variations induced by image content. This limitation hinders existing approaches from achieving adaptive retouching that accommodates both diverse color distributions and user-defined style preferences. To address these challenges, we propose a novel Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP). Specifically, we propose a content-adaptive curve mapping module, which leverages a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustments. The proposed module can capture color diversity within the image content, allowing similar color values to receive distinct transformations based on their spatial context. In addition, we propose an attribute text prediction module that generates text representations from multiple image attributes, which explicitly represent user-defined style preferences. These attribute-based text representations are subsequently integrated with visual features via a multimodal model, providing user-friendly guidance for image retouching. Extensive experiments on several public datasets demonstrate that our method achieves state-of-the-art performance.
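
The idea of mixing basis curves with per-pixel weight maps can be illustrated with a small sketch. This is only one plausible reading of the abstract: the output at each pixel is taken to be a weighted sum of the basis curves evaluated at that pixel's value. The specific curves, the hand-written weight maps, and the function name are assumptions; in the paper the weights are learned.

```python
def apply_curve_mapping(image, weight_maps, curves):
    """Blend basis curves per pixel.

    image:        2D list of intensities in [0, 1] (one channel)
    weight_maps:  one 2D list per curve, same shape as image
    curves:       list of functions mapping an intensity to an intensity
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            value = image[y][x]
            out[y][x] = sum(
                wm[y][x] * curve(value) for wm, curve in zip(weight_maps, curves)
            )
    return out

# Two hypothetical basis curves: identity and a brightening square root.
curves = [lambda v: v, lambda v: v ** 0.5]

# Both pixels have the same input value, but different weights.
image = [[0.25, 0.25]]
weight_maps = [
    [[1.0, 0.0]],  # weight of the identity curve at each pixel
    [[0.0, 1.0]],  # weight of the brightening curve at each pixel
]
retouched = apply_curve_mapping(image, weight_maps, curves)
```

The two pixels start with the same value, yet the left one is left unchanged and the right one is brightened, which is the "similar color values receive distinct transformations based on their spatial context" behavior described above.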

[122] UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision

Alberto Rota,Mert Kiray,Mert Asim Karaoglu,Patrick Ruhkamp,Elena De Momi,Nassir Navab,Benjamin Busam

Main category: cs.CV

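TL;DR: Proposes UnReflectAnything, an RGB-only highlight removal framework that eliminates specular reflections by predicting a highlight map and a diffuse reconstruction. Virtual highlight synthesis enables training without paired supervision, and the method performs well on both natural and surgical images.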

Details Motivation: Highlights distort appearance, obscure texture, and interfere with geometric reasoning, with a notable impact on natural and surgical images; existing methods rely on multiple images or special inputs, and an efficient solution for a single RGB image is lacking. Method: A frozen vision transformer encoder extracts multi-scale features, a lightweight head localizes highlight regions, and a token-level inpainting module restores corrupted feature patches to produce a reflection-free diffuse image; a Virtual Highlight Synthesis pipeline combines monocular geometry, Fresnel-aware shading, and randomized lighting to render realistic highlights for self-supervised training. Result: The method performs on par with state-of-the-art approaches on several benchmarks, trains without paired data, and generalizes well across natural and surgical image domains. Conclusion: UnReflectAnything achieves efficient highlight removal from a single RGB image; its virtual highlight synthesis strategy addresses the lack of training data, and it shows cross-domain robustness and practical potential. Abstract: Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with state-of-the-art results on several benchmarks. Project Page: https://alberto-rota.github.io/UnReflectAnything/

[123] CS3D: An Efficient Facial Expression Recognition via Event Vision

Zhe Wang,Qijin Song,Yucen Peng,Weibang Bai

Main category: cs.CV

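TL;DR: Proposes CS3D, an efficient event-based facial expression recognition framework that decomposes 3D convolution and adds soft spiking neurons and a spatial-temporal attention mechanism, improving accuracy while reducing energy consumption.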

Details Motivation: Existing event-camera-based expression recognition methods face high energy consumption and computational complexity when deploying deep learning models, making them hard to use on edge devices. Method: The CS3D framework decomposes 3D convolution to reduce computational complexity, and introduces soft spiking neurons and a spatial-temporal attention mechanism to improve information retention. Result: On multiple datasets, CS3D is more accurate than models such as RNN, Transformer, and C3D, while consuming only 21.97% of the energy of the original C3D. Conclusion: CS3D balances accuracy and energy efficiency in event-based facial expression recognition and is suited to deployment on resource-constrained edge devices. Abstract: Responsive and accurate facial expression recognition is crucial to human-robot interaction for daily service robots. Nowadays, event cameras are becoming more widely adopted as they surpass RGB cameras in capturing facial expression changes due to their high temporal resolution, low latency, computational efficiency, and robustness in low-light conditions. Despite these advantages, event-based approaches still encounter practical challenges, particularly in adopting mainstream deep learning models. Traditional deep learning methods for facial expression analysis are energy-intensive, making them difficult to deploy on edge computing devices and thereby increasing costs, especially for high-frequency, dynamic, event vision-based approaches. To address this challenging issue, we propose the CS3D framework, which decomposes the Convolutional 3D method to reduce computational complexity and energy consumption. Additionally, by utilizing soft spiking neurons and a spatial-temporal attention mechanism, the ability to retain information is enhanced, thus improving the accuracy of facial expression detection. Experimental results indicate that our proposed CS3D method attains higher accuracy on multiple datasets compared to architectures such as the RNN, Transformer, and C3D, while the energy consumption of the CS3D method is just 21.97% of that required by the original C3D on the same device.

[124] FROMAT: Multiview Material Appearance Transfer via Few-Shot Self-Attention Adaptation

Hubert Kompanowski,Varun Jampani,Aaryaman Vasishta,Binh-Son Hua

Main category: cs.CV

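TL;DR: Proposes a lightweight appearance transfer method for multiview diffusion models that combines the object identity of an input image with the appearance cues of a reference image to generate appearance that is consistent across views.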

Details Motivation: Content generation with existing multiview diffusion models offers limited control over appearance such as material, texture, or style, and lacks effective control of explicit appearance attributes. Method: Three diffusion denoising processes generate the original object, the reference image, and the target image; during reverse sampling, a small subset of layer-wise self-attention features from the object and the reference is aggregated to influence target generation, achieving appearance transfer. Result: With only a few training examples, the method gives pretrained multiview diffusion models appearance awareness and enables flexible material, texture, and style editing with multiview-consistent output. Conclusion: The method offers a simple yet effective appearance editing solution for multiview diffusion models and encourages the practical adoption of implicit generative 3D representations. Abstract: Multiview diffusion models have rapidly emerged as a powerful tool for content creation with spatial consistency across viewpoints, offering rich visual realism without requiring explicit geometry and appearance representation. However, compared to meshes or radiance fields, existing multiview diffusion models offer limited appearance manipulation, particularly in terms of material, texture, or style. In this paper, we present a lightweight adaptation technique for appearance transfer in multiview diffusion models. Our method learns to combine object identity from an input image with appearance cues rendered in a separate reference image, producing multi-view-consistent output that reflects the desired materials, textures, or styles. This allows explicit specification of appearance parameters at generation time while preserving the underlying object geometry and view coherence. We leverage three diffusion denoising processes responsible for generating the original object, the reference, and the target images, and perform reverse sampling to aggregate a small subset of layer-wise self-attention features from the object and the reference to influence the target generation. Our method requires only a few training examples to introduce appearance awareness to pretrained multiview models. The experiments show that our method provides a simple yet effective way toward multiview generation with diverse appearance, advocating the adoption of implicit generative 3D representations in practice.

[125] Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder

Yousef Azizi Movahed,Fatemeh Ziaeetabar

Main category: cs.CV

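TL;DR: This paper studies fine-grained classification of atomic states in hand-object interaction (such as 'approaching', 'grabbing', and 'holding') and proposes a data engineering process based on statistical-kinematic features. Setting the sequence length of a bidirectional RNN to 1 makes the model act as a high-capacity static encoder and raises accuracy to 97.60%, with a balanced F1-score of 0.90 on the 'grabbing' class.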

Details Motivation: Accurately predicting human intent in hand-object interaction is a hard problem in computer vision, and recognizing fine-grained atomic interaction states is a fundamental sub-problem. Method: Videos from the MANIAC dataset are converted into 27,476 statistical-kinematic feature vectors; static classifiers (MLPs) are compared with temporal models (RNNs), and the effect of sequence length is explored, in particular setting seq_length of a Bidirectional RNN to 1. Result: With a sequence length of 1, the Bidirectional RNN acts as a static feature encoder and reaches 97.60% accuracy, with a balanced F1-score of 0.90 on the most challenging class, 'grabbing'. Conclusion: Structured, interpretable features combined with a lightweight architecture (such as the simplified RNN) can effectively solve low-level hand-object interaction recognition and provide a new benchmark for the task. Abstract: Reliably predicting human intent in hand-object interactions is an open challenge for computer vision. Our research concentrates on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely 'approaching', 'grabbing', and 'holding'. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Each vector encapsulates relational and dynamic properties from a short temporal window of motion. Our initial hypothesis posited that sequential modeling would be critical, leading us to compare static classifiers (MLPs) against temporal models (RNNs). Counter-intuitively, the key discovery occurred when we set the sequence length of a Bidirectional RNN to one (seq_length=1). This modification converted the network's function, compelling it to act as a high-capacity static feature encoder. This architectural change directly led to a significant accuracy improvement, culminating in a final score of 97.60%. Of particular note, our optimized model successfully overcame the most challenging transitional class, 'grabbing', by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
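
Why a bidirectional RNN with `seq_length=1` behaves like a static encoder can be seen in a few lines. The sketch below assumes a plain tanh RNN cell with a zero initial hidden state (the abstract does not name the cell type, so this is an illustration rather than the authors' architecture); all weights are arbitrary toy values.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_pass(seq, Wx, Wh, b):
    """Run a tanh RNN over seq and return the final hidden state."""
    h = [0.0] * len(b)  # zero initial state (assumption)
    for x in seq:
        a, r = matvec(Wx, x), matvec(Wh, h)
        h = [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]
    return h

def birnn_encode(seq, fwd, bwd):
    """Concatenate final states of a forward and a backward pass."""
    return rnn_pass(seq, *fwd) + rnn_pass(list(reversed(seq)), *bwd)

def dense_tanh(x, Wx, b):
    """A single feed-forward tanh layer."""
    return [math.tanh(a + bi) for a, bi in zip(matvec(Wx, x), b)]

# Toy parameters: (input weights, recurrent weights, bias) per direction.
fwd = ([[0.5, -0.2], [0.1, 0.3]], [[0.9, 0.0], [0.0, 0.9]], [0.0, 0.1])
bwd = ([[-0.4, 0.6], [0.2, 0.2]], [[0.7, 0.1], [0.1, 0.7]], [0.05, 0.0])

x = [0.5, -1.0]
encoded = birnn_encode([x], fwd, bwd)  # sequence of length 1
static = dense_tanh(x, fwd[0], fwd[2]) + dense_tanh(x, bwd[0], bwd[2])
```

With a single time step the recurrent weights only ever multiply the zero initial state, so `encoded` equals `static`: the network reduces to two parallel feed-forward layers whose outputs are concatenated. For gated cells such as LSTM or GRU the same argument applies, although the resulting static function is more complex.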

[126] Benchmarking SAM2-based Trackers on FMOX

Senem Aktas,Charles Markham,John McDonald,Rozenn Dahyot

Main category: cs.CV

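TL;DR: This paper evaluates several high-performing SAM2-based object trackers on fast-moving-object datasets and finds that DAM4SAM and SAMURAI perform better on the more challenging sequences.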

Details Motivation: Existing tracking methods hit performance bottlenecks on fast-moving objects, so the limitations of current state-of-the-art trackers in such challenging scenarios need closer analysis. Method: Representative trackers (SAM2, EfficientTAM, DAM4SAM, and SAMURAI) are benchmarked and their behavior analyzed on challenging datasets designed for fast-moving objects. Result: DAM4SAM and SAMURAI perform better overall on the more challenging video sequences, showing greater robustness. Conclusion: Although the SAM2 family of trackers performs well overall, there is still room for improvement in fast-motion scenarios; DAM4SAM and SAMURAI are better suited to complex dynamic environments and can serve as a reference for future optimization. Abstract: Several object tracking pipelines extending Segment Anything Model 2 (SAM2) have been proposed in the past year, where the approach is to follow and segment the object from a single exemplar template provided by the user on an initialization frame. We propose to benchmark these high-performing trackers (SAM2, EfficientTAM, DAM4SAM and SAMURAI) on datasets containing fast-moving objects (FMO) specifically designed to be challenging for tracking approaches. The goal is to better understand current limitations in state-of-the-art trackers by providing more detailed insights on the behavior of these trackers. We show that overall the trackers DAM4SAM and SAMURAI perform well on more challenging sequences.

[127] Kaapana: A Comprehensive Open-Source Platform for Integrating AI in Medical Imaging Research Environments

Ünal Akünal,Markus Bujotzek,Stefan Denner,Benjamin Hamm,Klaus Kades,Philipp Schader,Jonas Scherer,Marco Nolden,Peter Neher,Ralf Floca,Klaus Maier-Hein

Main category: cs.CV

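TL;DR: Kaapana is an open-source platform for medical imaging research. Its modular framework addresses data privacy, reproducibility, and collaboration challenges in multi-center studies, and supports use cases from local prototyping to nationwide research networks.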

Details Motivation: Medical imaging AI research is constrained by data access, regulatory requirements, and fragmented software infrastructure, which makes projects hard to reproduce and scale and hinders collaboration between clinicians and data scientists. Method: Kaapana brings the algorithm to the data, provides unified modules for data ingestion, cohort curation, processing workflows, and result visualization, and integrates flexible workflow orchestration with user interfaces to support distributed experimentation and model development. Result: The platform lowers technical barriers, improves reproducibility, and supports large-scale cross-institutional collaborative research; its effectiveness has been shown across a range of use cases. Conclusion: Through its open-source, modular design, Kaapana bridges the gap between technical and clinical needs in medical imaging AI research and fosters a secure, scalable, and collaborative research ecosystem. Abstract: Developing generalizable AI for medical imaging requires both access to large, multi-center datasets and standardized, reproducible tooling within research environments. However, leveraging real-world imaging data in clinical research environments is still hampered by strict regulatory constraints, fragmented software infrastructure, and the challenges inherent in conducting large-cohort multicentre studies. This leads to projects that rely on ad-hoc toolchains that are hard to reproduce, difficult to scale beyond single institutions and poorly suited for collaboration between clinicians and data scientists. We present Kaapana, a comprehensive open-source platform for medical imaging research that is designed to bridge this gap. Rather than building single-use, site-specific tooling, Kaapana provides a modular, extensible framework that unifies data ingestion, cohort curation, processing workflows and result inspection under a common user interface. By bringing the algorithm to the data, it enables institutions to keep control over their sensitive data while still participating in distributed experimentation and model development. By integrating flexible workflow orchestration with user-facing applications for researchers, Kaapana reduces technical overhead, improves reproducibility and enables conducting large-scale, collaborative, multi-centre imaging studies. We describe the core concepts of the platform and illustrate how they can support diverse use cases, from local prototyping to nation-wide research networks. The open-source codebase is available at https://github.com/kaapana/kaapana

[128] VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

Wanyue Zhang,Lin Geng Foo,Thabo Beeler,Rishabh Dabral,Christian Theobalt

Main category: cs.CV

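TL;DR: Proposes VHOI, a two-stage framework for controllable human-object interaction video generation that uses a novel HOI-aware motion representation to improve the realism of generated videos.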

Details Motivation: Existing controllable video generation methods face a trade-off between sparse controls and dense signals, and struggle to be both easy to use and instance-aware, especially in complex human-object interaction scenes. Method: A two-stage framework first densifies sparse trajectories into HOI mask sequences and then fine-tunes a video diffusion model conditioned on these dense masks; a new HOI-aware motion representation uses color encodings to distinguish the dynamics of body parts and objects. Result: The method achieves state-of-the-art results in controllable HOI video generation and can generate, end to end, the full process of a human navigating to an object and interacting with it. Conclusion: VHOI combines the convenience of sparse controls with the richness of dense signals, improving the realism and controllability of generated videos in complex human-object interaction scenarios. Abstract: Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.

[129] IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

Tao Zhang,Yuyang Hong,Yang Xia,Kun Ding,Zeyu Zhang,Ying Wang,Shiming Xiang,Chunhong Pan

Main category: cs.CV

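TL;DR: This paper introduces IF-Bench, the first high-quality benchmark for evaluating how well multimodal large language models understand infrared images, and proposes a training-free generative visual prompting (GenViP) method that improves model performance by translating infrared images into semantically and spatially aligned RGB images.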

Details Motivation: The ability of multimodal large language models to understand infrared images is underexplored, and the lack of a dedicated benchmark has limited progress in this area. Method: IF-Bench is built with 499 images and 680 question-answer pairs, and over 40 MLLMs are evaluated with cyclic evaluation, bilingual assessment, and hybrid judgment strategies; GenViP uses advanced image editing models to translate infrared images into corresponding RGB images to mitigate domain shift. Result: A systematic evaluation on IF-Bench shows that model scale, architecture, and inference paradigm significantly affect infrared image comprehension; GenViP significantly improves performance across a range of MLLMs. Conclusion: IF-Bench provides a reliable evaluation platform for multimodal understanding of infrared images, and GenViP offers an effective training-free solution, advancing the use of MLLMs for infrared image understanding. Abstract: Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.

[130] OxEnsemble: Fair Ensembles for Low-Data Classification

Jonathan Rystrøm,Zihao Fu,Chris Russell

Main category: cs.CV

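TL;DR: Proposes OxEnsemble, a new method for efficient fair classification in low-data settings where data is scarce and unbalanced across groups, particularly suited to critical domains such as medical imaging.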

Details Motivation: In domains such as medical imaging, where data is scarce and unbalanced, traditional fair classification methods are hard to apply effectively and wrong predictions can have severe consequences, so a method that is both data-efficient and compute-efficient is needed to ensure fairness. Method: An ensemble is built in which each member is trained to satisfy fairness constraints, and their predictions are aggregated; held-out data is reused to enforce fairness reliably, requiring only about as much compute as fine-tuning or evaluating an existing model. Result: Theoretical analysis provides new guarantees, and experiments on multiple challenging medical imaging datasets show more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods. Conclusion: OxEnsemble achieves efficient, reliable fair classification in low-data, unbalanced settings and is particularly suited to critical applications that are sensitive to errors. Abstract: We address the problem of fair classification in settings where data is scarce and unbalanced across demographic groups. Such low-data regimes are common in domains like medical imaging, where false negatives can have fatal consequences. We propose a novel approach, OxEnsemble, for efficiently training ensembles and enforcing fairness in these low-data regimes. Unlike other approaches, we aggregate predictions across ensemble members, each trained to satisfy fairness constraints. By construction, OxEnsemble is both data-efficient, carefully reusing held-out data to enforce fairness reliably, and compute-efficient, requiring little more compute than used to fine-tune or evaluate an existing model. We validate this approach with new theoretical guarantees. Experimentally, our approach yields more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods across multiple challenging medical imaging classification datasets.
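
The abstract says predictions are aggregated across ensemble members but does not state the aggregation rule. As a rough illustration only, the sketch below assumes simple majority voting over binary predictions and adds a per-group recall check; the function names, toy predictions, and group labels are all hypothetical and not taken from the paper.

```python
def majority_vote(member_predictions):
    """Aggregate binary predictions: one inner list per ensemble member.

    A sample is labeled 1 when strictly more than half of the members vote 1.
    """
    n_members = len(member_predictions)
    n_samples = len(member_predictions[0])
    return [
        1 if 2 * sum(member[i] for member in member_predictions) > n_members else 0
        for i in range(n_samples)
    ]

def recall_by_group(y_true, y_pred, groups):
    """Recall on the positive class, computed separately for each group."""
    result = {}
    for g in sorted(set(groups)):
        positives = [i for i, grp in enumerate(groups) if grp == g and y_true[i] == 1]
        if positives:
            result[g] = sum(y_pred[i] for i in positives) / len(positives)
    return result

# Three hypothetical members voting on four samples.
members = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]
ensemble_pred = majority_vote(members)

y_true = [1, 1, 0, 1]
groups = ["a", "a", "b", "b"]
per_group_recall = recall_by_group(y_true, ensemble_pred, groups)
```

Comparing the smallest value in `per_group_recall` against the others is one simple way to see whether false negatives concentrate in a particular group, which is the failure mode the abstract highlights for medical imaging.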

[131] An Automated Tip-and-Cue Framework for Optimized Satellite Tasking and Visual Intelligence

Gil Weissman,Amir Ivry,Israel Cohen

Main category: cs.CV

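TL;DR: Proposes a fully automated Tip-and-Cue framework for satellite imaging task planning and scheduling, combining external data with AI models for autonomous observation and analysis.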

Details Motivation: As satellite constellations grow and sensors diversify, automated Earth observation systems are needed to reduce tasking latency and improve responsiveness. Method: Tips are generated from external data or historical imagery; cue tasks are formulated from sensor constraints, timing requirements, and utility functions; multi-satellite scheduling is optimized with continuous utility functions; and images are processed with AI models such as object detectors and vision-language models. Result: The framework is validated in a maritime vessel tracking scenario, using AIS data for trajectory prediction and targeted observation and producing actionable, structured visual reports. Conclusion: The framework extends to a broad range of applications that need timely tasking and automated analysis, such as smart-city monitoring and disaster response. Abstract: The proliferation of satellite constellations, coupled with reduced tasking latency and diverse sensor capabilities, has expanded the opportunities for automated Earth observation. This paper introduces a fully automated Tip-and-Cue framework designed for satellite imaging tasking and scheduling. In this context, tips are generated from external data sources or analyses of prior satellite imagery, identifying spatiotemporal targets and prioritizing them for downstream planning. Corresponding cues are the imaging tasks formulated in response, which incorporate sensor constraints, timing requirements, and utility functions. The system autonomously generates candidate tasks, optimizes their scheduling across multiple satellites using continuous utility functions that reflect the expected value of each observation, and processes the resulting imagery using artificial-intelligence-based models, including object detectors and vision-language models. Structured visual reports are generated to support both interpretability and the identification of new insights for downstream tasking. The efficacy of the framework is demonstrated through a maritime vessel tracking scenario, utilizing Automatic Identification System (AIS) data for trajectory prediction, targeted observations, and the generation of actionable outputs. Maritime vessel tracking is a widely researched application, often used to benchmark novel approaches to satellite tasking, forecasting, and analysis. The system is extensible to broader applications such as smart-city monitoring and disaster response, where timely tasking and automated analysis are critical.
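
To make the scheduling step concrete, here is a deliberately simplified sketch. It is not the paper's optimizer: it swaps in a greedy heuristic that picks candidate cues in order of utility and skips any cue that overlaps in time with one already assigned to the same satellite. The `Cue` fields, satellite names, and utility values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cue:
    """A candidate imaging task (all fields are illustrative assumptions)."""
    name: str
    satellite: str
    start: float    # start of the imaging window, e.g. minutes from now
    end: float      # end of the imaging window
    utility: float  # expected value of the observation

def greedy_schedule(cues):
    """Select non-overlapping cues per satellite, highest utility first."""
    chosen = []
    for cue in sorted(cues, key=lambda c: c.utility, reverse=True):
        clash = any(
            c.satellite == cue.satellite and cue.start < c.end and c.start < cue.end
            for c in chosen
        )
        if not clash:
            chosen.append(cue)
    return chosen

candidates = [
    Cue("vessel-A", "sat-1", 0.0, 10.0, 0.9),
    Cue("vessel-B", "sat-1", 5.0, 15.0, 0.5),  # overlaps vessel-A on sat-1
    Cue("vessel-B", "sat-2", 5.0, 15.0, 0.4),  # same target, other satellite
]
plan = greedy_schedule(candidates)
total_utility = sum(c.utility for c in plan)
```

A greedy pass like this is easy to reason about but can miss better combinations; an optimizer over continuous utility functions, as the abstract describes, would weigh such trade-offs jointly.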

[132] Unconsciously Forget: Mitigating Memorization; Without Knowing What is being Memorized

Er Jin,Yang Zhang,Yongli Mou,Yanfei Dong,Stefan Decker,Kenji Kawaguchi,Johannes Stegmaier

Main category: cs.CV

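TL;DR: This paper proposes UniForget, a new method that uses model pruning to suppress the generation of copyrighted content in generative models, without targeting specific concepts and while preserving the model's overall generative ability.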

Details Motivation: Generative models readily memorize training data, creating potential copyright, portrait-right, and trademark infringement issues that worsen in larger models. Existing de-memorization methods are computationally expensive or limited to removing specific concepts, and so lack scalability. Method: The analysis finds that specific parts of the model are responsible for generating copyrighted content; model pruning removes these parts, lowering the probability of generating copyrighted content without relying on specific concepts or modifying the sampling process. Result: UniForget effectively reduces copyrighted content generation while preserving general generative performance; it is orthogonal to, and can be combined with, existing de-memorization techniques, improving their scalability and practicality. Conclusion: Model pruning is an efficient, general de-memorization strategy; UniForget offers a new perspective on understanding and mitigating memorization in generative models and has the potential to improve current de-memorization techniques. Abstract: Recent advances in generative models have demonstrated an exceptional ability to produce highly realistic images. However, previous studies show that generated images often resemble the training data, and this problem becomes more severe as the model size increases. Memorizing training data can lead to legal challenges, including copyright infringement, violations of portrait rights, and trademark violations. Existing approaches to mitigating memorization mainly focus on manipulating the denoising sampling process to steer image embeddings away from the memorized embedding space or employ unlearning methods that require training on datasets containing specific sets of memorized concepts. However, existing methods often incur substantial computational overhead during sampling, or focus narrowly on removing one or more groups of target concepts, imposing a significant limitation on their scalability. To understand and mitigate these problems, our work, UniForget, offers a new perspective on understanding the root cause of memorization. Our work demonstrates that specific parts of the model are responsible for copyrighted content generation. By applying model pruning, we can effectively suppress the probability of generating copyrighted content without targeting specific concepts while preserving the general generative capabilities of the model. Additionally, we show that our approach is both orthogonal and complementary to existing unlearning methods, thereby highlighting its potential to improve current unlearning and de-memorization techniques.

[133] LiM-YOLO: Less is More with Pyramid Level Shift and Normalized Auxiliary Branch for Ship Detection in Optical Remote Sensing Imagery

Seon-Hoon Kim,Hyeji Sim,Youeyun Jung,Ok-Chul Jung,Yerin Kim

Main category: cs.CV

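TL;DR: This paper proposes LiM-YOLO, a specialized detector for ship detection in satellite imagery. A pyramid level shift strategy (P2-P4) and a GN-CBLinear module address feature dilution and training instability when detecting small, narrow vessels, and the method outperforms existing approaches on several public datasets.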

Details Motivation: General-purpose object detectors face extreme scale disparity and morphological anisotropy when detecting ships in satellite imagery; small and narrow vessels in particular are hard to capture in deep feature layers such as stride-32, which degrades detection performance. Method: Based on a statistical analysis of ship scales, a Pyramid Level Shift Strategy moves the detection head away from P5 to P2-P4 to satisfy Nyquist sampling requirements for small objects; a GN-CBLinear module, which uses a group-normalized convolutional block for linear projection, improves the stability of micro-batch training on high-resolution inputs. Result: The method is validated on several remote sensing datasets, including SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, where LiM-YOLO outperforms current state-of-the-art detectors in both accuracy and efficiency. Conclusion: Through targeted architecture design and training optimization, LiM-YOLO addresses the particular challenges of ship detection in remote sensing imagery and offers a useful reference for domain-specific object detection. Abstract: Applying general-purpose object detectors to ship detection in satellite imagery presents significant challenges due to the extreme scale disparity and morphological anisotropy of maritime targets. Standard architectures utilizing stride-32 (P5) layers often fail to resolve narrow vessels, resulting in spatial feature dilution. In this work, we propose LiM-YOLO, a specialized detector designed to resolve these domain-specific conflicts. Based on a statistical analysis of ship scales, we introduce a Pyramid Level Shift Strategy that reconfigures the detection head to P2-P4. This shift ensures compliance with Nyquist sampling criteria for small objects while eliminating the computational redundancy of deep layers. To further enhance training stability on high-resolution inputs, we incorporate a Group Normalized Convolutional Block for Linear Projection (GN-CBLinear), which mitigates gradient volatility in micro-batch settings. Validated on SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, LiM-YOLO demonstrates superior detection accuracy and efficiency compared to state-of-the-art models. The code is available at https://github.com/egshkim/LiM-YOLO.
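
The link between pyramid level, stride, and narrow vessels can be checked with simple arithmetic. The sketch below assumes the usual convention that level P_k has stride 2^k (consistent with the stride-32 P5 mentioned above) and reads the Nyquist criterion as "at least two feature cells across the object's narrow side"; that threshold and the example widths are assumptions for illustration.

```python
def stride_of(level):
    """Stride of pyramid level P_level under the 2**level convention."""
    return 2 ** level

def levels_resolving(width_px, levels=(2, 3, 4, 5), min_samples=2.0):
    """Return the pyramid levels at which an object of the given width
    (in input pixels) still spans at least min_samples feature cells."""
    return [k for k in levels if width_px / stride_of(k) >= min_samples]

# A hypothetical vessel whose narrow side is 16 pixels wide.
narrow_vessel = levels_resolving(16)

# A much wider object is still resolved at every level, including P5.
large_object = levels_resolving(64)
```

For the 16-pixel case only P2 and P3 meet the two-sample threshold, and at P5 the vessel covers half a feature cell. Under these assumptions, this is the kind of spatial feature dilution that motivates moving the detection head toward shallower levels.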

[134] Stylized Meta-Album: Group-bias injection with style transfer to study robustness against distribution shifts

Romain Mussard,Aurélien Gauffre,Ihsan Ullah,Thanh Gia Hieu Khuong,Massih-Reza Amini,Isabelle Guyon,Lisheng Sun-Hosoya

Main category: cs.CV

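TL;DR: Introduces Stylized Meta-Album (SMA), a new image classification meta-dataset of 24 datasets for advancing out-of-distribution generalization and related research, with flexible configuration to cover a range of benchmark scenarios.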

Details Motivation: Existing datasets are limited in group diversity and in their coverage of realistic benchmark scenarios, which makes it hard to fully evaluate models under complex distribution shifts, fairness, and robustness. SMA aims to address this through configurable stylized data generation. Method: Starting from 12 subject classification datasets, style transfer is used to build 12 content datasets and 12 stylized datasets, forming the SMA meta-dataset with 4800 groups in total. Two new benchmarks are designed: an OOD generalization and group fairness benchmark, and an unsupervised domain adaptation (UDA) benchmark. Result: Experiments show that increasing group diversity significantly affects fairness and changes the ranking of algorithms. Using Top-M worst group accuracy as a hyperparameter tuning metric improves fairness during optimization. The UDA benchmark yields smaller error bars (reduced by 73% in the closed-set setting and 28% in the UniDA setting). Conclusion: By giving flexible control over styles, classes, and domain structure, SMA supports diverse and more realistic benchmarking, reveals differences in model behavior that conventional benchmarks miss, and makes evaluation more comprehensive and reliable. Abstract: We introduce Stylized Meta-Album (SMA), a new image classification meta-dataset comprising 24 datasets (12 content datasets, and 12 stylized datasets), designed to advance studies on out-of-distribution (OOD) generalization and related topics. Created using style transfer techniques from 12 subject classification datasets, SMA provides a diverse and extensive set of 4800 groups, combining various subjects (objects, plants, animals, human actions, textures) with multiple styles. SMA enables flexible control over groups and classes, allowing us to configure datasets to reflect diverse benchmark scenarios. While ideally, data collection would capture extensive group diversity, practical constraints often make this infeasible. SMA addresses this by enabling large and configurable group structures through flexible control over styles, subject classes, and domains, allowing datasets to reflect a wide range of real-world benchmark scenarios. This design not only expands group and class diversity, but also opens new methodological directions for evaluating model performance across diverse group and domain configurations (including scenarios with many minority groups, varying group imbalance, and complex domain shifts) and for studying fairness, robustness, and adaptation under a broader range of realistic conditions. To demonstrate SMA's effectiveness, we implemented two benchmarks: (1) a novel OOD generalization and group fairness benchmark leveraging SMA's domain, class, and group diversity to evaluate existing benchmarks. 
Our findings reveal that while simple balancing and algorithms utilizing group information remain competitive as claimed in previous benchmarks, increasing group diversity significantly impacts fairness, altering the superiority and relative rankings of algorithms. We also propose to use Top-M worst group accuracy as a new hyperparameter tuning metric, demonstrating broader fairness during optimization and delivering better final worst-group accuracy for larger group diversity. (2) An unsupervised domain adaptation (UDA) benchmark utilizing SMA's group diversity to evaluate UDA algorithms across more scenarios, offering a more comprehensive benchmark with lower error bars (reduced by 73% and 28% in closed-set setting and UniDA setting, respectively) compared to existing efforts. These use cases highlight SMA's potential to significantly impact the outcomes of conventional benchmarks.
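
The abstract does not spell out how Top-M worst group accuracy is computed. One plausible reading, sketched here as an assumption rather than the authors' definition, is the mean accuracy over the M groups with the lowest accuracy, which reduces to ordinary worst-group accuracy when M = 1. The function name and the group names are hypothetical.

```python
def top_m_worst_group_accuracy(group_acc: dict, m: int) -> float:
    """Mean accuracy of the m lowest-scoring groups (assumed definition).

    group_acc maps a group identifier to that group's accuracy in [0, 1].
    """
    if not 1 <= m <= len(group_acc):
        raise ValueError("m must be between 1 and the number of groups")
    worst = sorted(group_acc.values())[:m]  # the m smallest accuracies
    return sum(worst) / m


if __name__ == "__main__":
    # Hypothetical per-group accuracies, for illustration only.
    acc = {"cat/sketch": 0.9, "dog/sketch": 0.5, "cat/oil": 0.7}
    print(top_m_worst_group_accuracy(acc, 1))  # plain worst-group accuracy
    print(top_m_worst_group_accuracy(acc, 2))  # average over the two worst groups
```

Averaging over several of the worst groups makes the tuning signal less sensitive to a single noisy minority group, which may matter more as the number of groups grows.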

[135] FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation

Pierre Ancey,Andrew Price,Saqib Javed,Mathieu Salzmann

Main category: cs.CV

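TL;DR: Proposes FastPose-ViT, a Vision Transformer-based architecture that quickly estimates the 6-degrees-of-freedom pose of a spacecraft from a single image. It combines projective geometry with the concept of "apparent rotation" and achieves low latency while maintaining high accuracy, which suits resource-constrained edge devices.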

Details Motivation: Existing PnP-based methods are computationally heavy and hard to run in real time for 6DoF pose estimation on resource-constrained edge devices, which limits autonomous operations such as in-orbit servicing and space debris removal. Method: FastPose-ViT uses a ViT architecture to regress the 6DoF pose directly. Its input is the image cropped to the detected bounding box, and a new mathematical formalism based on projective geometry and "apparent rotation" maps the localized predictions back to the full-image scale. Result: On the SPEED dataset the method outperforms other non-PnP approaches and is competitive with state-of-the-art PnP methods. After quantization and deployment on the NVIDIA Jetson Orin Nano, end-to-end latency is about 75 ms, and throughput reaches 33 FPS when stages are scheduled concurrently. Conclusion: FastPose-ViT strikes a good balance between accuracy and efficiency, suits real-time, low-power space applications, and shows potential for deployment in actual space missions. Abstract: Estimating the 6-degrees-of-freedom (6DoF) pose of a spacecraft from a single image is critical for autonomous operations like in-orbit servicing and space debris removal. Existing state-of-the-art methods often rely on iterative Perspective-n-Point (PnP)-based algorithms, which are computationally intensive and ill-suited for real-time deployment on resource-constrained edge devices. To overcome these limitations, we propose FastPose-ViT, a Vision Transformer (ViT)-based architecture that directly regresses the 6DoF pose. Our approach processes cropped images from object bounding boxes and introduces a novel mathematical formalism to map these localized predictions back to the full-image scale. This formalism is derived from the principles of projective geometry and the concept of "apparent rotation", where the model predicts an apparent rotation matrix that is then corrected to find the true orientation. We demonstrate that our method outperforms other non-PnP strategies and achieves performance competitive with state-of-the-art PnP-based techniques on the SPEED dataset. Furthermore, we validate our model's suitability for real-world space missions by quantizing it and deploying it on power-constrained edge hardware. On the NVIDIA Jetson Orin Nano, our end-to-end pipeline achieves a latency of ~75 ms per frame under sequential execution, and a non-blocking throughput of up to 33 FPS when stages are scheduled concurrently.
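
The two figures quoted above (about 75 ms per frame and up to 33 FPS) measure different things, and a small sketch may help reconcile them. It assumes an idealized pipeline in which sequential throughput is bounded by the sum of stage times and concurrent throughput by the slowest stage. The three stage durations are hypothetical values chosen only to add up to 75 ms; the abstract does not give a per-stage breakdown.

```python
def sequential_fps(stage_ms) -> float:
    """Frames per second when each frame passes through all stages before the next starts."""
    return 1000.0 / sum(stage_ms)


def pipelined_fps(stage_ms) -> float:
    """Frames per second when stages overlap; the slowest stage sets the pace (idealized)."""
    return 1000.0 / max(stage_ms)


if __name__ == "__main__":
    # Hypothetical split of a 75 ms pipeline, e.g. detection, cropping, pose regression.
    stages = [15.0, 30.0, 30.0]
    print(f"sequential: {sequential_fps(stages):.1f} FPS")
    print(f"pipelined:  {pipelined_fps(stages):.1f} FPS")
```

With this assumed split, sequential execution gives roughly 13 FPS while overlapping the stages gives roughly 33 FPS, so the reported throughput is consistent with a slowest stage of about 30 ms.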

[136] Modality-Specific Enhancement and Complementary Fusion for Semi-Supervised Multi-Modal Brain Tumor Segmentation

Tien-Dat Chung,Ba-Thinh Lam,Thanh-Huy Nguyen,Thien Nguyen,Nguyen Lan Vi Vu,Hoang-Loc Cao,Phat Kim Huynh,Min Xu

Main category: cs.CV

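TL;DR: Proposes a new semi-supervised multi-modal medical image segmentation framework. A modality-specific enhancing module and a learnable complementary information fusion module exploit the complementary information across multi-modal MRI and substantially improve segmentation when labeled data is very scarce.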

Details Motivation: Existing semi-supervised methods for multi-modal medical image segmentation struggle to exploit complementary information across modalities, mainly because of semantic discrepancies between modalities and misalignment across sequences. Method: A Modality-specific Enhancing Module (MEM) strengthens the semantic features unique to each modality through channel-wise attention, and a learnable Complementary Information Fusion (CIF) module adaptively exchanges complementary information between modalities. The overall framework is optimized with a supervised segmentation loss combined with cross-modal consistency regularization. Result: Experiments on the BraTS 2019 dataset show that the method outperforms strong baselines with 1%, 5%, and 10% labeled data, with significant gains in both Dice and Sensitivity. Ablation studies confirm the contribution of MEM and CIF to reducing cross-modality discrepancies and improving segmentation robustness. Conclusion: The proposed framework effectively enhances modality-specific representations and achieves adaptive cross-modal fusion, significantly improving semi-supervised multi-modal medical image segmentation in low-label settings. Abstract: Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on the BraTS 2019 (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1%, 5%, and 10% labeled data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.

[137] CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing

Jianfei Li,Ines Rosellon-Inclan,Gitta Kutyniok,Jean-Luc Starck

Main category: cs.CV

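TL;DR: Proposes CHEM, a new method for quantifying and understanding hallucination artifacts produced by U-shaped networks in image reconstruction. It combines wavelet and shearlet representations with conformalized quantile regression, applies to any image reconstruction model, and is validated on an astronomical image dataset.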

Details Motivation: U-shaped architectures such as U-Net succeed at image deconvolution but are prone to unrealistic artifacts or hallucinations, which can affect analysis in safety-critical scenarios, so a trustworthy way to evaluate vision models is needed. Method: The Conformal Hallucination Estimation Metric (CHEM) uses wavelet and shearlet representations to extract hallucinations of image features and applies conformalized quantile regression to assess hallucination levels in a distribution-free manner. Result: Tests on the CANDELS astronomical image dataset with U-Net, SwinUNet, and Learnlets identify and quantify the hallucination levels of the different models and offer new perspectives on hallucination in deep learning-based image processing. Conclusion: CHEM is a general, efficient tool for quantifying hallucinations that helps make deep learning-based image reconstruction models more trustworthy, and the theoretical analysis further explains why U-shaped networks are prone to hallucinations. Abstract: U-Net and other U-shaped architectures have achieved significant success in image deconvolution tasks. However, challenges have emerged, as these methods might generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a novel approach for quantifying and comprehending hallucination artifacts to ensure trustworthy computer vision models. Our method, termed the Conformal Hallucination Estimation Metric (CHEM), is applicable to any image reconstruction model, enabling efficient identification and quantification of hallucination artifacts. It offers two key advantages: it leverages wavelet and shearlet representations to efficiently extract hallucinations of image features and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. Furthermore, from an approximation theoretical perspective, we explore the reasons why U-shaped networks are prone to hallucinations. We test the proposed approach on the CANDELS astronomical image dataset with models such as U-Net, SwinUNet, and Learnlets, and provide new perspectives on hallucination from different aspects in deep learning-based image processing.
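
For readers unfamiliar with conformalized quantile regression, the calibration step in its standard split-conformal form can be sketched in a few lines. This is the generic recipe only: how CHEM applies it to wavelet and shearlet coefficients is not described in the abstract, and the function names and toy numbers below are hypothetical.

```python
import math


def cqr_margin(lo, hi, y, alpha: float) -> float:
    """Conformal margin from a held-out calibration set.

    lo, hi: predicted lower and upper quantiles for each calibration point.
    y:      the corresponding true values.
    alpha:  target miscoverage level (0.1 aims for about 90% coverage).
    """
    n = len(y)
    # Conformity score: how far the truth falls outside the predicted band
    # (negative when it lies inside).
    scores = sorted(max(l - t, t - h) for l, h, t in zip(lo, hi, y))
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the finite-sample quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def cqr_interval(lo_pred: float, hi_pred: float, margin: float):
    """Widen (or shrink, if margin is negative) a predicted band by the margin."""
    return lo_pred - margin, hi_pred + margin


if __name__ == "__main__":
    # Toy calibration data, for illustration only.
    lo = [0.0, 0.0, 0.0, 0.0]
    hi = [1.0, 1.0, 1.0, 1.0]
    y = [0.5, 1.5, 2.0, -1.0]
    m = cqr_margin(lo, hi, y, alpha=0.2)
    print(m, cqr_interval(0.2, 0.8, m))
```

The appeal of this construction is that the coverage guarantee relies only on exchangeability of calibration and test points, not on any distributional assumption, which matches the "distribution-free" wording in the abstract.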

[138] DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation

Zhizhong Wang,Tianyi Chu,Zeyi Huang,Nanyang Wang,Kehan Li

Main category: cs.CV

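TL;DR: This paper proposes DynaIP, a dynamic image prompt adapter that improves concept fidelity, the balance between concept preservation and prompt following, and multi-subject scalability in personalized text-to-image generation.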

Details Motivation: Existing methods struggle to balance concept preservation against prompt following, have difficulty retaining fine-grained details from reference images, and do not scale to multi-subject personalized generation. Method: A Dynamic Decoupling Strategy removes interference from concept-agnostic information, and a Hierarchical Mixture-of-Experts Feature Fusion Module makes full use of CLIP's hierarchical features to strengthen fine-grained concept fidelity. Result: Experiments on single- and multi-subject PT2I tasks show that DynaIP outperforms existing methods in concept fidelity, CP-PF balance, and multi-subject generation. Conclusion: DynaIP improves the performance of multimodal diffusion transformers in personalized image generation and advances zero-shot PT2I without fine-tuning. Abstract: Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: 1. the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), 2. the difficulty in retaining fine-grained concept details in reference images, and 3. the restricted scalability to extend to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits decoupling learning behavior when injecting reference image features into its dual branches via cross attentions. Based on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further bolstering the scalability of multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of commonly used CLIP can capture visual information at diverse granularity levels. 
Therefore, we introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, remarkably elevating the fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that our DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.

[139] Composing Concepts from Images and Videos via Concept-prompt Binding

Xianghao Kong,Zeyu Zhang,Yuwei Guo,Zhuoran Zhao,Songchun Zhang,Anyi Rao

Main category: cs.CV

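TL;DR: This paper proposes Bind & Compose, a one-shot method for flexible visual concept composition. It binds visual concepts to corresponding prompt tokens and uses a hierarchical binder structure with cross-attention conditioning in Diffusion Transformers to decompose complex visual concepts accurately.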

Details Motivation: Existing visual concept composition methods still fall short in accurately extracting complex concepts from images and videos and in flexibly combining concepts from multiple sources. Method: Bind & Compose adopts a hierarchical binder structure for cross-attention conditioning, a Diversify-and-Absorb Mechanism to improve concept-token binding accuracy, and a Temporal Disentanglement Strategy with a dual-branch binder structure to improve compatibility between image and video concepts. Result: Experiments show that the method outperforms existing approaches in concept consistency, prompt fidelity, and motion quality. Conclusion: Bind & Compose opens up new possibilities for visual creativity and improves the composition of visual concepts across images and videos. Abstract: Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.

[140] From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities

Shijia Feng,Michael Wray,Walterio Mayol-Cuevas

Main category: cs.CV

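TL;DR: This paper recasts struggle recognition as an online detection and anticipation task and presents models usable in real-time assistive systems that can anticipate struggle in human skill performance up to 2 seconds before it occurs. Experiments show good generalization across tasks and a running speed (about 20 FPS) sufficient for real-time use.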

Details Motivation: Real-time intelligent assistive systems need models that can detect and anticipate online when a user will run into difficulty while performing a skill, whereas prior work focuses on offline analysis and pays little attention to real-time operation and anticipation. Method: Struggle localization is reformulated as an online detection and anticipation task. Two off-the-shelf models are adapted as baselines and evaluated on online detection and on anticipation up to 2 seconds ahead, along with generalization across tasks and activities and the impact of skill evolution. Result: Online struggle detection reaches 70-80% per-frame mAP, and anticipation 2 seconds ahead is comparable with slight drops. Generalization across activities with larger domain gaps still beats random baselines by 4-20%. Feature-based models run at up to 143 FPS and the full pipeline at about 20 FPS. Conclusion: The proposed online struggle detection and anticipation framework performs well in accuracy, generalization, and real-time operation, and is suitable for practical intelligent assistive systems. Abstract: Understanding human skill performance is essential for intelligent assistive systems, with struggle recognition offering a natural cue for identifying user difficulties. While prior work focuses on offline struggle classification and localization, real-time applications require models capable of detecting and anticipating struggle online. We reformulate struggle localization as an online detection task and further extend it to anticipation, predicting struggle moments before they occur. We adapt two off-the-shelf models as baselines for online struggle detection and anticipation. Online struggle detection achieves 70-80% per-frame mAP, while struggle anticipation up to 2 seconds ahead yields comparable performance with slight drops. We further examine generalization across tasks and activities and analyse the impact of skill evolution. Despite larger domain gaps in activity-level generalization, models still outperform random baselines by 4-20%. Our feature-based models run at up to 143 FPS, and the whole pipeline, including feature extraction, operates at around 20 FPS, sufficient for real-time assistive applications.

[141] UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Hao Lu,Ziyang Liu,Guangfeng Jiang,Yuanfei Luo,Sheng Chen,Yangang Zhang,Ying-Cong Chen

Main category: cs.CV

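TL;DR: This paper proposes UniUGP, a unified understanding-generation-planning framework that integrates vision-language models and video generation models to improve the reasoning and planning of autonomous driving systems in long-tail scenarios.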

Details Motivation: Autonomous driving systems perform poorly in long-tail scenarios, and existing methods either cannot exploit unlabeled videos for causal learning or lack the reasoning capabilities of large language models. Method: Several specialized datasets are constructed, and the UniUGP framework combines pre-trained vision-language models and video generation models under a four-stage training strategy to jointly perform scene reasoning, future video generation, and trajectory planning. Result: Experiments show that UniUGP achieves state-of-the-art performance in perception, reasoning, and decision-making and generalizes well to challenging long-tail scenarios. Conclusion: By fusing multimodal models with staged training, UniUGP delivers interpretable reasoning and high-quality planning in complex driving scenarios and makes autonomous driving systems more robust. Abstract: Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.

[142] Diffusion Posterior Sampler for Hyperspectral Unmixing with Spectral Variability Modeling

Yimin Zhu,Lincoln Linlin Xu

Main category: cs.CV

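TL;DR: Proposes DPS4Un, a semiblind hyperspectral unmixing method based on diffusion posterior sampling. It combines superpixels with a diffusion model to model the endmember prior and spectral variability, and outperforms existing methods on several real datasets.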

Details Motivation: To address the difficulty of modeling the endmember prior and the insufficient modeling of spectral variability in linear spectral mixture models, while avoiding external spectral libraries that may introduce bias. Method: Within a Bayesian framework, a pretrained conditional spectrum diffusion model serves as the posterior sampler, and the endmember prior is learned from image-adaptive endmember bundles built within superpixels. A superpixel-level data fidelity term is used, endmembers are initialized as Gaussian noise, and abundances and endmembers are updated iteratively. Result: Experiments on three real-world benchmark datasets show that DPS4Un surpasses current state-of-the-art hyperspectral unmixing methods in unmixing accuracy. Conclusion: By combining a diffusion model, superpixel segmentation, and image-adaptive prior learning, DPS4Un improves semiblind hyperspectral unmixing, with particular advantages in handling spectral variability. Abstract: Linear spectral mixture models (LMM) provide a concise form to disentangle the constituent materials (endmembers) and their corresponding proportions (abundance) in a single pixel. The critical challenges are how to model the spectral prior distribution and spectral variability. Prior knowledge and spectral variability can be rigorously modeled under the Bayesian framework, where posterior estimation of abundance is derived by combining observed data with endmember prior distribution. Considering the key challenges and the advantages of the Bayesian framework, a novel method using a diffusion posterior sampler for semiblind unmixing, denoted as DPS4Un, is proposed to deal with these challenges with the following features: (1) we view the pretrained conditional spectrum diffusion model as a posterior sampler, which can combine the learned endmember prior with observation to get the refined abundance distribution. (2) Instead of using the existing spectral library as prior, which may raise bias, we establish the image-based endmember bundles within superpixels, which are used to train the endmember prior learner with diffusion model. Superpixels make sure the sub-scene is more homogeneous. (3) Instead of using the image-level data consistency constraint, the superpixel-based data fidelity term is proposed. (4) The endmember is initialized as Gaussian noise for each superpixel region, and DPS4Un iteratively updates the abundance and endmember, contributing to spectral variability modeling. The experimental results on three real-world benchmark datasets demonstrate that DPS4Un outperforms the state-of-the-art hyperspectral unmixing methods.
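
The linear mixing model that the abstract starts from is simple enough to write down directly: a pixel spectrum is a weighted sum of endmember spectra, with the weights being the abundances. The sketch below shows only this forward model, with the non-negativity and sum-to-one constraints that are commonly imposed on abundances. It is not the DPS4Un sampler, and the function name and toy spectra are hypothetical.

```python
def mix_pixel(endmembers, abundances, tol: float = 1e-9):
    """Noise-free linear mixing: pixel[b] = sum_k abundances[k] * endmembers[k][b].

    endmembers: list of K spectra, each a list of B band values.
    abundances: list of K proportions, non-negative and summing to one
                (the constraints commonly used with this model).
    """
    if len(endmembers) != len(abundances):
        raise ValueError("need one abundance per endmember")
    if any(a < 0 for a in abundances):
        raise ValueError("abundances must be non-negative")
    if abs(sum(abundances) - 1.0) > tol:
        raise ValueError("abundances must sum to one")
    n_bands = len(endmembers[0])
    return [
        sum(a * e[b] for a, e in zip(abundances, endmembers))
        for b in range(n_bands)
    ]


if __name__ == "__main__":
    # Two toy endmembers over two bands, for illustration only.
    E = [[1.0, 0.0], [0.0, 1.0]]
    print(mix_pixel(E, [0.25, 0.75]))
```

Unmixing is the inverse problem: given the pixel, recover the abundances (and, in the semiblind setting, refine the endmembers as well), which is where the prior and the posterior sampler come in.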

[143] Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

Pius Horn,Janis Keuper

Main category: cs.CV

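TL;DR: Proposes a new benchmarking framework based on synthetic PDFs for evaluating how well mathematical formulas are parsed from PDFs. It introduces an LLM-as-a-judge approach for semantically aware formula assessment, and human validation shows that it correlates more strongly with human judgment than existing metrics.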

Details Motivation: Existing benchmarks for PDF formula parsing lack semantically aware metrics and often ignore formula content, so they cannot measure parsing quality accurately. Method: Synthetic PDF documents are generated with precise LaTeX ground truth. A two-stage matching pipeline handles inconsistencies in parser output, and LLM-as-a-judge assesses formula similarity at the semantic level, with its validity checked against multiple human evaluators. Result: LLM-based evaluation reaches a Pearson correlation of r=0.78 with human judgment, well above CDM (r=0.34) and text similarity (r~0). Evaluating more than 20 contemporary PDF parsers on 100 synthetic documents (2,000+ formulas) reveals significant performance differences. Conclusion: The framework provides a reproducible, scalable, high-quality evaluation method for PDF formula extraction and offers guidance for choosing parsers in downstream applications. Abstract: Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
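
The headline comparison above rests on the Pearson correlation between a metric's scores and human ratings. As a reminder of what that statistic measures, here is a minimal dependency-free implementation. The rating lists are made-up illustrative values, not data from the benchmark, and the function name is hypothetical.

```python
import math


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical ratings for five formula pairs.
    human = [1, 2, 3, 4, 5]
    judge = [1, 3, 2, 5, 4]
    print(pearson_r(human, judge))
```

A value near 1 means the metric ranks and spaces formula pairs much as humans do, while a value near 0, as reported for plain text similarity, means the metric carries almost no linear information about human judgment.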

[144] VisualActBench: Can VLMs See and Act like a Human?

Daoan Zhang,Pai Liu,Xiaofei Zhou,Yuan Ge,Guangchen Lan,Jing Bi,Christopher Brinton,Ehsan Hoque,Jiebo Luo

Main category: cs.CV

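TL;DR: This paper introduces a new task, Visual Action Reasoning, and VisualActBench, a large-scale benchmark for evaluating whether vision-language models can proactively reason and act without textual prompts. Existing models, including GPT-4o, still lag significantly behind humans in understanding complex context, anticipating outcomes, and aligning with human decision-making.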

Details Motivation: Current vision-language models respond mainly to textual prompts and lack the ability to reason and act proactively from visual input alone, which limits their autonomous use in real-world scenarios. Method: The VisualActBench benchmark contains 1,074 videos and 3,733 annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. Result: An evaluation of 29 VLMs finds that although frontier models such as GPT-4o perform relatively well, they still lag significantly behind humans in generating proactive, high-priority actions, exposing weaknesses in context understanding, outcome anticipation, and decision alignment. Conclusion: VisualActBench provides a foundation for assessing and improving the real-world proactivity of vision-centric AI agents and points to the key advances current VLMs need to become truly autonomous agents. Abstract: Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT-4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.

[145] NordFKB: a fine-grained benchmark dataset for geospatial AI in Norway

Sander Riisøen Jyhne,Aditya Gupta,Ben Worsley,Marianne Andersen,Ivar Oveland,Alexander Salveson Nossum

Main category: cs.CV

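TL;DR: NordFKB is a fine-grained benchmark dataset derived from FKB, Norway's authoritative geographic database. It contains high-resolution orthophotos with detailed annotations for 36 semantic classes and supports semantic segmentation and object detection.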

Details Motivation: Advancing geospatial AI in Norway requires a high-quality, fine-grained, geographically diverse benchmark dataset that supports reproducible and comparable research. Method: NordFKB is built from the authoritative national Felles KartdataBase (FKB). High-resolution orthophotos are collected from seven geographically diverse areas and annotated for 36 semantic classes with binary segmentation masks in GeoTIFF format and COCO-style bounding boxes. Training/validation splits are created by random sampling across areas, and human expert review ensures annotation quality. A companion benchmarking repository with standardized evaluation protocols and tools is also released. Result: The NordFKB dataset and its benchmarking platform are released, covering diverse geographic environments with rich, accurate, and representatively distributed semantic annotations, and supporting fair comparison and reproducible research in semantic segmentation and object detection. Conclusion: NordFKB provides a solid foundation for developing AI methods in mapping, land administration, and spatial planning, and paves the way for future expansions in coverage, temporal scope, and data modalities. Abstract: We present NordFKB, a fine-grained benchmark dataset for geospatial AI in Norway, derived from the authoritative, highly accurate, national Felles KartdataBase (FKB). The dataset contains high-resolution orthophotos paired with detailed annotations for 36 semantic classes, including both per-class binary segmentation masks in GeoTIFF format and COCO-style bounding box annotations. Data is collected from seven geographically diverse areas, ensuring variation in climate, topography, and urbanization. Only tiles containing at least one annotated object are included, and training/validation splits are created through random sampling across areas to ensure representative class and context distributions. Human expert review and quality control ensure high annotation accuracy. Alongside the dataset, we release a benchmarking repository with standardized evaluation protocols and tools for semantic segmentation and object detection, enabling reproducible and comparable research. NordFKB provides a robust foundation for advancing AI methods in mapping, land administration, and spatial planning, and paves the way for future expansions in coverage, temporal scope, and data modalities.
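
The two annotation forms mentioned above are closely related: a COCO-style box is stored as [x_min, y_min, width, height] in pixel coordinates, and such a box can be derived from a binary mask. The sketch below shows that relationship on a plain nested list. It is not NordFKB's tooling, the function name is hypothetical, and it returns a single box around all foreground pixels, whereas a per-class mask containing several objects would first need to be split into connected components.

```python
def mask_to_coco_bbox(mask):
    """Tight COCO-style box [x_min, y_min, width, height] around nonzero pixels.

    mask: list of rows, each a list of 0/1 values. Returns None for an empty mask.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # no annotated pixels in this tile
    n_cols = len(mask[0])
    cols = [c for c in range(n_cols) if any(row[c] for row in mask)]
    x_min, y_min = cols[0], rows[0]
    return [x_min, y_min, cols[-1] - x_min + 1, rows[-1] - y_min + 1]


if __name__ == "__main__":
    # Toy 3x4 mask, for illustration only.
    tile = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
    ]
    print(mask_to_coco_bbox(tile))
```

In practice the GeoTIFF masks would be read with a raster library and the pixel box could be mapped to geographic coordinates through the file's geotransform; both steps are left out here to keep the sketch dependency-free.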

[146] Splatent: Splatting Diffusion Latents for Novel View Synthesis

Or Hirschorn,Omer Sela,Inbar Huberman-Spiegelglas,Netalee Efrat,Eli Alshan,Ianir Ideses,Frederic Devernay,Yochai Zvik,Lior Fritz

Main category: cs.CV

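TL;DR: This paper proposes Splatent, a diffusion-based enhancement framework that operates on 3D Gaussian Splatting in the VAE latent space and uses multi-view attention to recover fine detail while preserving the reconstruction quality of the pretrained VAE.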

Details Motivation: Existing radiance field methods in the VAE latent space lack multi-view consistency, which blurs or loses detail in 3D reconstruction. Existing remedies either fine-tune the VAE or rely on diffusion models, at the cost of reduced quality or hallucinations. Method: The Splatent framework combines 3D Gaussian Splatting with the VAE latent space and uses multi-view attention over the input views to recover detail in 2D rather than reconstructing it in 3D space. Result: The method sets a new state of the art for VAE latent radiance field reconstruction on multiple benchmarks, improves detail preservation, and can be combined with existing feed-forward frameworks for further gains. Conclusion: By moving detail recovery from 3D to 2D view processing, Splatent addresses multi-view inconsistency in the VAE latent space and offers a new approach to high-quality sparse-view 3D reconstruction. Abstract: Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.

[147] ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

Xinyu Liu,Hangjie Yuan,Yujie Wei,Jiazheng Xing,Yujin Han,Jiahao Pan,Yanbiao Ma,Chi-Min Chan,Kang Zhao,Shiwei Zhang,Wenhan Luo,Yike Guo

Main category: cs.CV

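TL;DR: This paper introduces a new task, Reason-Informed Video Editing (RVE), which aims to make video edits more physically plausible and causally coherent by incorporating reasoning, and builds the RVE-Bench benchmark to evaluate it. The authors propose the ReViSE framework, which unifies generation and evaluation through self-reflective reasoning and uses an internal vision-language model to provide feedback that refines the edits, improving the Overall score by 32% over existing methods on the reasoning-informed video editing subset of RVE-Bench.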

Details Motivation: Existing video unified models have strong visual understanding and generation capabilities but perform poorly on video editing tasks that require reasoning, mainly because datasets are inadequate and the models' reasoning and editing functions are disconnected. A framework that effectively couples semantic understanding with visual transformation is needed to close this gap. Method: The Reason-Informed Video Editing (RVE) task requires reasoning about physical plausibility and causal relationships during editing. The RVE-Bench benchmark, with two subsets, supports systematic evaluation. The ReViSE framework uses Self-Reflective Reasoning (SRF) to model generation and evaluation jointly within one architecture: the internal vision-language model reasons about the edited result and provides feedback, which is used in a differentiable manner to refine the generator's reasoning behavior. Result: Extensive experiments on RVE-Bench show that ReViSE improves the Overall score by 32% over state-of-the-art methods on the reasoning-informed video editing subset, with clear gains in editing accuracy and visual quality. Conclusion: By tightly coupling the internal vision-language model's reasoning with the video editing process, ReViSE demonstrates the value of reasoning feedback for complex video editing tasks and points to a new direction for building more intelligent and plausible video generation systems. Abstract: Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To address this gap, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator's reasoning behavior during training. 
Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.

[148] GAINS: Gaussian-based Inverse Rendering from Sparse Multi-View Captures

Patrick Noras,Jun Myeong Choi,Didier Stricker,Pieter Peers,Roni Sengupta

Main category: cs.CV

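TL;DR: Proposes GAINS, a Gaussian Splatting-based inverse rendering framework for sparse multi-view captures that introduces learned priors to stabilize geometry and material estimation.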

Details Motivation: Existing Gaussian Splatting-based inverse rendering methods degrade sharply under sparse views, where limited observations cause severe ambiguity between geometry, reflectance, and lighting. Method: A two-stage framework is used. The first stage refines geometry using monocular depth/normal and diffusion priors, and the second stage regularizes material recovery with segmentation, intrinsic image decomposition (IID), and diffusion priors. Result: Experiments on synthetic and real-world datasets show that GAINS significantly improves material parameter accuracy, relighting quality, and novel-view synthesis under sparse views. Conclusion: By fusing several learned priors, GAINS achieves more robust inverse rendering from sparse multi-view input and outperforms current state-of-the-art methods. Abstract: Recent advances in Gaussian Splatting-based inverse rendering extend Gaussian primitives with shading parameters and physically grounded light transport, enabling high-quality material recovery from dense multi-view captures. However, these methods degrade sharply under sparse-view settings, where limited observations lead to severe ambiguity between geometry, reflectance, and lighting. We introduce GAINS (Gaussian-based Inverse rendering from Sparse multi-view captures), a two-stage inverse rendering framework that leverages learning-based priors to stabilize geometry and material estimation. GAINS first refines geometry using monocular depth/normal and diffusion priors, then employs segmentation, intrinsic image decomposition (IID), and diffusion priors to regularize material recovery. Extensive experiments on synthetic and real-world datasets show that GAINS significantly improves material parameter accuracy, relighting quality, and novel-view synthesis compared to state-of-the-art Gaussian-based inverse rendering methods, especially under sparse-view settings. Project page: https://patrickbail.github.io/gains/