cs.CL

[1] Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

Hongji Li,Junchi Yao,Manjiang Yu,Priyanka Singh,Xue Li,Di Wang,Lijie Hu

Main category: cs.CL

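TL;DR: This paper introduces RMLLMU-Bench, the first machine-unlearning benchmark for reasoning multimodal large language models (RMLLMs), which evaluates how well methods suppress information leakage in the reasoning process while preserving reasoning ability. It also proposes R-MUSE, a training-free method that intervenes at inference time and achieves a better balance between forgetting and reasoning retention.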

Details

Motivation: When applied to RMLLMs, existing machine-unlearning methods struggle to prevent sensitive information from leaking through the reasoning chain while keeping the model's general reasoning ability intact, and there is no unified benchmark for evaluating them.

Method: The authors propose RMLLMU-Bench, a benchmark with dedicated metrics for reasoning leakage and reasoning retention, and design R-MUSE, a framework that intervenes on internal representations at inference time through subspace guidance and adaptive steering, forgetting both answers and reasoning traces while protecting general reasoning.

Result: Experiments on RMLLMU-Bench show that existing methods either leave substantial reasoning leakage or severely damage reasoning performance, whereas R-MUSE achieves a substantially better balance between forgetting and reasoning retention.

Conclusion: By combining subspace guidance with adaptive steering, R-MUSE achieves reasoning-aware unlearning in RMLLMs without retraining, offering an effective approach to safe and controllable model unlearning.

Abstract: Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is uniquely challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and overly aggressive interventions easily damage general reasoning ability. Yet no benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To address these gaps, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention.
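
As a rough illustration of what steering an internal representation away from a subspace can look like, the sketch below removes the component of a hidden-state vector along a single direction. It is not R-MUSE itself: the function name `steer_away`, the one-dimensional subspace, and the fixed `strength` parameter are assumptions made for illustration, and the paper's adaptive steering rule is not reproduced here.

```python
import math

def steer_away(hidden, direction, strength=1.0):
    """Remove `strength` times the component of `hidden` along `direction`.

    Hypothetical helper: `direction` stands in for a one-dimensional
    "forget" subspace; a real method would estimate it from activations.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - strength * coeff * u for h, u in zip(hidden, unit)]
```

With `strength=1.0` the component along the direction is removed entirely; smaller values only attenuate it, which is one simple way to trade forgetting against side effects.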

[2] Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

Lihui Liu

Main category: cs.CL

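TL;DR: This paper proposes Graph-O1, an agentic GraphRAG framework that combines Monte Carlo Tree Search with end-to-end reinforcement learning so that large language models can reason over text-attributed graphs stepwise and interactively, overcoming the tendency of existing methods to ignore graph structure or to exceed context-length limits.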

Details

Motivation: When handling text-attributed graphs, existing retrieval-augmented generation methods either ignore graph structure or serialize entire subgraphs and exceed the LLM's context limit, which leads to fragmented reasoning and lower accuracy.

Method: The authors propose Graph-O1, which models reasoning as a multi-turn interaction between an agent and the graph environment. It combines Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning and trains the agent with a unified reward mechanism to selectively explore and retrieve the most informative parts of the subgraph.

Result: Experiments across multiple LLM backbones show that Graph-O1 consistently outperforms state-of-the-art baselines on question answering, producing answers that are more accurate, reliable, and interpretable.

Conclusion: By introducing agentic, stepwise reasoning, Graph-O1 combines graph-structure information with the semantic understanding of language models and offers a new paradigm for complex reasoning over text-attributed graphs.

Abstract: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.
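
For readers unfamiliar with MCTS, the selection step is commonly implemented with the UCT rule, which trades off the average reward of a child against how rarely it has been visited. The sketch below shows that generic rule only; the action names, the exploration constant `c`, and the `(value_sum, visits)` layout are illustrative assumptions, and the abstract does not say which selection rule Graph-O1 uses.

```python
import math

def uct_score(value_sum, visits, parent_visits, c=1.4):
    """Standard UCT score; unvisited children are always tried first."""
    if visits == 0:
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits, c=1.4):
    """children maps an action name to a (value_sum, visits) pair."""
    return max(children, key=lambda a: uct_score(*children[a], parent_visits, c))
```

In a graph setting, the actions could be things like "expand a neighbor" or "read a node's text"; a rarely tried action receives a large exploration bonus even when its average reward so far is modest.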

[3] Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

Boris Kriuk,Logic Ng

Main category: cs.CL

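TL;DR: This paper proposes Q-KVComm, a protocol that makes communication in multi-agent LLM systems more efficient by transmitting compressed key-value (KV) cache representations directly, substantially reducing bandwidth and compute overhead.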

Details

Motivation: Conventional multi-agent LLM systems repeatedly transmit contextual information between agents, wasting bandwidth and compute, and the receiving agent must recompute semantic representations from scratch.

Method: The authors propose the Q-KVComm protocol, which transmits compressed KV caches directly and combines three innovations: adaptive layer-wise quantization, hybrid information extraction across content domains, and heterogeneous model calibration.

Result: Experiments on three question-answering datasets show that Q-KVComm achieves 5-6x compression ratios with high semantic fidelity (coherence scores above 0.77) and works across model sizes and in practical scenarios such as multi-hop reasoning.

Conclusion: Q-KVComm establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based exchange and improving efficiency and compatibility.

Abstract: Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.
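
The idea of allocating variable bit-widths by layer sensitivity can be sketched in a few lines. The example below is a toy under stated assumptions, not the Q-KVComm implementation: the median threshold, the 4-bit and 8-bit choices, and the function names are all hypothetical, and the quantizer is plain uniform min-max quantization.

```python
def allocate_bits(sensitivities, low=4, high=8):
    """Give more bits to layers at or above the (upper) median sensitivity."""
    ordered = sorted(sensitivities)
    median = ordered[len(ordered) // 2]
    return [high if s >= median else low for s in sensitivities]

def quantize(values, bits):
    """Uniform min-max quantization to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]
```

Only the integer codes plus `lo` and `scale` would need to be transmitted; the reconstruction error per value is bounded by half a quantization step.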

[4] Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

Nick Rossenbach,Robin Schmitt,Tina Raissi,Simon Berger,Larissa Kleppel,Ralf Schlüter

Main category: cs.CL

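TL;DR: This paper presents supplementary resources for the Loquacious dataset, which aims to give academia and industry an open, multi-domain benchmark for English speech recognition, and validates them with experiments across a range of ASR architectures.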

Details

Motivation: The Loquacious dataset is intended to replace established English ASR datasets such as LibriSpeech or TED-Lium by providing well-defined training and test partitions, coverage of many acoustic and language domains, and an open license suitable for both academia and industry.

Method: The authors release additional resources, namely n-gram language models, a grapheme-to-phoneme (G2P) model, and pronunciation lexica, and run experiments across multiple ASR architectures, label units, and topologies.

Result: Initial experiments indicate that the Loquacious dataset supports the study of a variety of common ASR challenges and is usable as a benchmark.

Conclusion: Together with the public resources, the Loquacious dataset provides a high-quality, open, multi-domain benchmark for ASR research.

Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Using these additional resources, we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable case study for a variety of common challenges in ASR.

[5] Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models

Minh Tri LÊ,Ali Ait-Bachir

Main category: cs.CL

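TL;DR: This paper compares two approaches to prioritizing service tickets in IT Service Management and finds that embedding-based models generalize poorly, whereas a fine-tuned multilingual transformer performs substantially better.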

Details

Motivation: Ticket prioritization in ITSM is difficult because of noisy textual inputs, subjective writing styles, and class imbalance, so more effective automated methods are needed.

Method: The authors evaluate two families of approaches: embedding pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that fuses textual and numerical features.

Result: Across thirty configurations, the embedding-based methods generalize poorly, clustering fails to uncover meaningful structure, and supervised models are sensitive to embedding quality. The transformer reaches an average F1-score of 78.5% and a weighted Cohen's kappa of nearly 0.80.

Conclusion: Generic embeddings have clear limitations on ITSM data, and domain-adapted transformer architectures are more effective for ticket prioritization.

Abstract: Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen's kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.
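
Weighted Cohen's kappa is a natural metric for ordinal labels such as priority levels, because it penalizes a prediction more the further it is from the true level. The sketch below computes it with quadratic weights; the abstract does not state which weighting scheme the authors used, so the quadratic choice and the function name are assumptions.

```python
def weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic-weighted Cohen's kappa for integer labels 0..n_classes-1."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement weight grows with the squared distance between levels.
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n  # expected count under independence
    return 1.0 - num / den
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.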

[6] KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

Aomufei Yuan,Zhiming Wang,Ruijie Miao,Dayu Wang,Yuxuan Tian,Zihan Wang,Yebo Peng,Yuhan Wu,Bairen Yi,Xin Liu,Tong Yang

Main category: cs.CL

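TL;DR: The paper proposes KVReviver, a reversible KV cache compression method based on the sketch algorithm that addresses the context forgetting caused by conventional compression in large language models, sharply reducing memory use while keeping inference accuracy high.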

Details

Motivation: As LLM context lengths grow, KV cache memory demand becomes a bottleneck. Conventional compression methods permanently discard information, causing context forgetting and harming the model's retrieval ability.

Method: KVReviver uses a sketch algorithm to make KV cache compression reversible, reconstructing compressed tokens from an additional data structure so that full-scale computation is possible within limited memory.

Result: At a 2k context length, only 10% of the KV cache budget is needed to keep the same inference accuracy; at 32k, 25% of the budget gives equivalent or comparable accuracy (about 2% loss).

Conclusion: KVReviver relieves memory pressure in long-context settings while avoiding irreversible information loss, improving model quality and practicality under KV cache compression.

Abstract: As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging "less important" tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model's information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments show that, in 2k-length contexts, it requires only 10% of the KV cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy (~2% accuracy loss) using merely 25% of the KV cache budget.
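
To make the notion of reconstructing tokens from a sketch concrete, the example below implements a small Count Sketch whose buckets hold vectors. It is a generic illustration only: the abstract does not specify which sketch KVReviver uses or how it is laid out, and the class name, hash constants, and three-row layout here are assumptions.

```python
import statistics

class VectorCountSketch:
    """Count Sketch whose buckets hold vectors instead of scalar counters."""

    PRIME = 2_147_483_647
    # Fixed (a, b) pairs per row; arbitrary illustrative constants.
    BUCKET_PARAMS = [(3, 7), (5, 11), (7, 13)]
    SIGN_PARAMS = [(11, 3), (13, 5), (17, 1)]

    def __init__(self, width, dim):
        self.width = width
        self.rows = [[[0.0] * dim for _ in range(width)] for _ in self.BUCKET_PARAMS]

    def _bucket(self, row, key):
        a, b = self.BUCKET_PARAMS[row]
        return ((a * key + b) % self.PRIME) % self.width

    def _sign(self, row, key):
        a, b = self.SIGN_PARAMS[row]
        return 1.0 if ((a * key + b) % self.PRIME) % 2 == 0 else -1.0

    def insert(self, key, vector):
        """Add a signed copy of `vector` to one bucket in every row."""
        for r, row in enumerate(self.rows):
            bucket = row[self._bucket(r, key)]
            sign = self._sign(r, key)
            for i, value in enumerate(vector):
                bucket[i] += sign * value

    def estimate(self, key):
        """Recover the vector stored under `key`, up to collision noise."""
        per_row = []
        for r, row in enumerate(self.rows):
            sign = self._sign(r, key)
            per_row.append([sign * x for x in row[self._bucket(r, key)]])
        # Coordinate-wise median across rows suppresses collision noise.
        return [statistics.median(col) for col in zip(*per_row)]
```

Here `key` would play the role of a token position and `vector` its key or value entry. Recovery is exact when a key collides with no other key, and approximate otherwise, which is the usual accuracy-for-memory trade of sketch data structures.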

[7] Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

Rahul Baxi

Main category: cs.CL

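TL;DR: This paper proposes CDCT, a new benchmark that evaluates the constraint compliance and semantic accuracy of large language models at different compression levels. It finds that models perform worst at medium compression and better at extreme compression, and it reveals a fundamental tension between RLHF alignment and instruction following.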

Details

Motivation: The mechanisms behind the performance drop of large language models under prompt compression are poorly understood, and a benchmark that measures constraint compliance and semantic accuracy independently is needed to analyze the problem systematically.

Method: The authors introduce the Compression-Decay Comprehension Test (CDCT), evaluate 9 frontier LLMs on 8 concepts at 5 compression levels, use a three-judge LLM jury to assess constraint compliance, and run an RLHF ablation to test their hypothesis.

Result: A universal U-curve appears: constraint violations peak at medium compression (c=0.5), and extreme compression performs better. Constraint effects are 2.9x larger than semantic effects, removing "helpfulness" signals improves constraint compliance by 598% on average, and reasoning models outperform efficient models by 27.5%.

Conclusion: The "helpfulness" behavior instilled by RLHF is the main cause of constraint violations at medium compression, which exposes a conflict between alignment objectives and instruction following and offers guidance for improving deployed systems.

Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' kappa = 0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.
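
The inter-rater agreement reported for the three-judge jury is Fleiss' kappa, which can be computed directly from per-item category counts. The sketch below follows the standard definition; the input layout (one row of counts per item, with every row summing to the number of raters) and the function name are assumptions for illustration.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i into category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)
```

For three judges and a binary compliant/non-compliant label, each row would be something like `[3, 0]` (unanimous) or `[2, 1]` (one dissent).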

[8] ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India

Shubham Kumar Nigam,Tanuj Tyagi,Siddharth Shukla,Aditya Kumar Guru,Balaramamahanthi Deepak Patnaik,Danush Khanna,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya

Main category: cs.CL

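TL;DR: This paper proposes ReGal, a framework that combines multi-task instruction tuning with Reinforcement Learning from AI Feedback (RLAIF), and explores its use for legal AI in India. Although it does not outperform supervised models, it offers useful insights into applying reinforcement learning to legal text.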

Details

Motivation: To explore whether reinforcement learning is suitable for high-stakes, long-document legal tasks, particularly in resource-constrained legal settings.

Method: The authors propose ReGal, which combines multi-task instruction tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO), and apply it to court judgment prediction and explanation and to legal document summarization.

Result: ReGal underperforms supervised and proprietary models on standard metrics, but the study exposes key challenges in reward model alignment, legal language complexity, and domain adaptation, and shows the approach's potential through empirical and qualitative analysis.

Conclusion: Although its current performance is limited, ReGal lays groundwork for optimizing legal reasoning pipelines and building interpretable, adaptive legal AI systems.

Abstract: This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.

[9] CoPE: A Small Language Model for Steerable and Scalable Content Labeling

Samidh Chakrabarti,David Willner,Kevin Klyman,Tiffany Saade,Emily Capstick,Sabina Nong

Main category: cs.CL

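TL;DR: This paper presents CoPE, a policy-steerable small language model that labels content quickly and accurately. Using Contradictory Example Training and Binocular Labeling, CoPE matches or outperforms frontier models at only 1% of their size and runs on a single consumer-grade GPU.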

Details

Motivation: To make content moderation more efficient and accurate while reducing reliance on very large models, the authors set out to build a small language model that can accurately interpret and apply policy instructions.

Method: The authors propose Contradictory Example Training, which strengthens the model's ability to interpret policies, and Binocular Labeling, which enables rapid construction of unambiguous training datasets.

Result: Evaluated across seven harm areas, CoPE is as accurate as or more accurate than frontier models, and the 9-billion-parameter model runs on a single consumer-grade GPU.

Conclusion: CoPE represents a paradigm shift for classifier systems: by turning a machine learning task into a policy-writing task, it opens new design possibilities for the governance of online platforms.

Abstract: This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curriculum called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.

[10] Narrative Consolidation: Formulating a New Task for Unifying Multi-Perspective Accounts

Roger A. Finger,Eduardo G. Cortes,Sandro J. Rigo,Gabriel de O. Ramos

Main category: cs.CL

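TL;DR: This paper proposes a new natural language processing task, Narrative Consolidation, which deals with overlapping narrative documents and emphasizes chronological integrity, completeness, and the fusion of details.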

Details

Motivation: Standard multi-document summarization focuses on conciseness and cannot preserve narrative flow, so a different approach is needed.

Method: The authors introduce the Temporal Alignment Event Graph (TAEG), which explicitly models chronology and event alignment, and apply a centrality algorithm to select the most central representation of each event in its correct temporal position.

Result: In a study on the four Biblical Gospels, the method guarantees perfect temporal ordering (Kendall's Tau of 1.000) and substantially improves content metrics (for example, ROUGE-L F1 rises by 357.2%).

Conclusion: The results validate Narrative Consolidation as a relevant task and establish an explicit temporal backbone as a fundamental component for solving it.

Abstract: Processing overlapping narrative documents, such as legal testimonies or historical accounts, often aims not for compression but for a unified, coherent, and chronologically sound text. Standard Multi-Document Summarization (MDS), with its focus on conciseness, fails to preserve narrative flow. This paper formally defines this challenge as a new NLP task: Narrative Consolidation, where the central objectives are chronological integrity, completeness, and the fusion of complementary details. To demonstrate the critical role of temporal structure in this task, we introduce Temporal Alignment Event Graph (TAEG), a graph structure that explicitly models chronology and event alignment. By applying a standard centrality algorithm to TAEG, our method functions as a version selection mechanism, choosing the most central representation of each event in its correct temporal position. In a study on the four Biblical Gospels, this structure-focused approach guarantees perfect temporal ordering (Kendall's Tau of 1.000) by design and dramatically improves content metrics (e.g., +357.2% in ROUGE-L F1). The success of this baseline method validates the formulation of Narrative Consolidation as a relevant task and establishes that an explicit temporal backbone is a fundamental component for its resolution.
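
Kendall's Tau, the ordering metric cited above, compares two orderings of the same events by counting concordant and discordant pairs. The sketch below is the tie-free tau-a variant with a hypothetical function name; which variant the authors computed is not stated in the abstract.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau-a between two orderings of the same distinct items."""
    position_b = {item: i for i, item in enumerate(order_b)}
    # Express order_a in terms of each item's position in order_b.
    seq = [position_b[item] for item in order_a]
    concordant = discordant = 0
    for i, j in combinations(range(len(seq)), 2):
        if seq[i] < seq[j]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

A score of 1.0 means the output preserves the reference chronology exactly, and -1.0 means it is fully reversed.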

[11] Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

Ashley M. A. Fehr,Calla G. Beauregard,Julia Witte Zimmerman,Katie Ekström,Pablo Rosillo-Rodes,Christopher M. Danforth,Peter Sheridan Dodds

Main category: cs.CL

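TL;DR: This paper studies the law of vocabulary growth (Heaps' law) in conversation, analyzing dialogue in two media: video chats between strangers and fictional characters in movies. It finds that vocabulary growth patterns differ by part of speech and discusses the findings through behavioral and linguistic frameworks.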

Details

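Motivation: To examine the statistical regularities of vocabulary use over time in conversation, in particular how Heaps' law behaves in real and fictional dialogue, and how language features affect vocabulary growth.

Method: Using two kinds of conversational data, video chats between strangers and movie dialogue, the authors measure how vocabulary size changes with conversation length and analyze the Heaps' law scaling separately by part of speech.

Result: Vocabulary growth follows Heaps' law, but the growth pattern differs across parts of speech and shows different characteristics in the two media.

Conclusion: Vocabulary growth in conversation is structured and depends on part of speech and communicative context, pointing to the cognitive and social-interaction mechanisms behind language dynamics.

Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps' law has looked at conversation and considered how language features impact scaling. We measure Heaps' law for conversations recorded in two distinct mediums: (1) strangers brought together on video chat and (2) fictional characters in movies. We find that scaling of vocabulary size differs by parts of speech. We discuss these findings through behavioral and linguistic frameworks.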
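
Heaps' law is usually written as V(n) = K * n^beta, where V is the number of distinct words seen after n tokens. A minimal way to measure it, assuming whitespace-tokenized text and an ordinary least-squares fit in log-log space (the paper's actual fitting procedure is not described in the abstract), is sketched below; both function names are hypothetical.

```python
import math

def vocab_growth(tokens):
    """Return (n, V(n)) pairs: distinct-word count after each token."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve

def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta)."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return math.exp(mean_y - beta * mean_x), beta
```

Running the same fit on tokens filtered to one part of speech at a time would give a per-category exponent, which is the kind of comparison the abstract describes.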

[12] Training LLMs with LogicReward for Faithful and Rigorous Reasoning

Jundong Xu,Hao Fei,Huichi Zhou,Xin Quan,Qijun Huang,Shengqiong Wu,William Yang Wang,Mong-Li Lee,Wynne Hsu

Main category: cs.CL

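TL;DR: The paper proposes LogicReward, a new reward system that enforces step-level logical correctness with a theorem prover and, combined with an autoformalization method based on soft unification, improves the logical consistency and generalization of LLMs on reasoning tasks.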

Details

Motivation: Existing LLM training methods rely on outcome feedback and cannot guarantee that the reasoning process is logically correct, which is a particular concern in high-stakes scenarios that require step-by-step logical consistency.

Method: The authors introduce the LogicReward reward system, which uses a theorem prover to supervise the logical correctness of intermediate reasoning steps, and propose Autoformalization with Soft Unification, which reduces natural-language ambiguity and improves formalization quality so that the theorem prover can guide training more effectively.

Result: An 8B model trained with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2%, respectively, on natural language inference and logical reasoning tasks. It also improves reasoning faithfulness and generalization to other tasks such as math and commonsense reasoning, and it provides a reliable reward signal even without ground-truth labels.

Conclusion: Through fine-grained logical supervision, LogicReward improves the correctness and trustworthiness of model reasoning and offers an effective path toward highly reliable reasoning systems.

Abstract: Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at https://llm-symbol.github.io/LogicReward.

[13] GeoSense-AI: Fast Location Inference from Crisis Microblogs

Deepit Sapru

Main category: cs.CL

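TL;DR: This paper presents GeoSense-AI, an AI pipeline for real-time geolocation from noisy microblog streams. It combines statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-based disambiguation to infer locations directly from text. The system keeps a high F1 score while running orders of magnitude faster than existing NER tools, making it suitable for real-time emergency response to floods, outbreaks, and similar events.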

Details

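Motivation: Conventional geolocation relies on sparse geotags, which cannot meet the real-time and coverage needs of emergencies. The author aims to build an AI system that extracts location information efficiently and accurately from noisy, informal text streams to improve situational awareness during crises.

Method: The paper proposes an integrated AI pipeline: (1) statistical hashtag segmentation; (2) part-of-speech-based proper-noun detection; (3) dependency parsing around disaster-related vocabulary; (4) lightweight named-entity recognition (NER); and (5) gazetteer-based location disambiguation. The whole system is optimized for streaming, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases.

Result: Compared with widely used NER toolkits, the system attains strong F1 while delivering orders-of-magnitude higher throughput, which supports real-time deployment. A production map interface shows the full flow from ingestion through inference to visualization and surfaces location signals at scale for floods, outbreaks, and similar events.

Conclusion: GeoSense-AI shows that domain-tuned natural language processing backed by knowledge bases can deliver robust, efficient real-time geolocation without relying on conventional geotags, improving information gathering for emergency response.

Abstract: This paper presents an applied AI pipeline for real-time geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head-to-head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality (ingest, inference, and visualization), surfacing locational signals at scale for floods, outbreaks, and other fast-moving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geo-tag reliance.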
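
Statistical hashtag segmentation, the first stage listed above, is often done with dynamic programming over unigram probabilities. The sketch below shows that generic technique; the toy counts, the length-based penalty for unknown strings, and the function name are assumptions, and the paper's own segmenter may work differently.

```python
import math

def segment_hashtag(tag, unigram_counts, max_len=20):
    """Split a hashtag into the most probable word sequence under a unigram model."""
    total = sum(unigram_counts.values())

    def logp(word):
        if word in unigram_counts:
            return math.log(unigram_counts[word] / total)
        # Unknown strings get a penalty that grows with their length.
        return math.log(1.0 / (total * 10 ** len(word)))

    text = tag.lstrip("#").lower()
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = [
            (best[start][0] + logp(text[start:end]), best[start][1] + [text[start:end]])
            for start in range(max(0, end - max_len), end)
        ]
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]
```

The recovered words (for example a place name followed by a disaster term) can then be passed to later stages such as gazetteer lookup.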

[14] InstructNet: A Novel Approach for Multi-Label Instruction Classification through Advanced Deep Learning

Tanjim Taharat Aurpa,Md Shoaib Ahmed,Md Mahbubur Rahman,Md. Golam Moazzam

Main category: cs.CL

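TL;DR: This study applies Transformer-based deep neural architectures such as XLNet and BERT to multi-label instruction classification of 11,121 "How To" articles from wikiHow. The proposed InstructNet approach achieves strong accuracy (97.30%) and macro-average F1 (93%).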

Details

Motivation: As people increasingly rely on search engines for how-to guidance, automatically classifying and organizing large volumes of instructional text becomes a key problem, especially for task-oriented learning and knowledge-base construction.

Method: The authors take a multi-label classification approach, using pretrained language models such as XLNet and BERT on a wikiHow dataset of 11,121 samples, with accuracy and macro F1 as evaluation metrics.

Result: The XLNet architecture performs best, reaching 97.30% accuracy, a micro-average F1 of 89.02%, and a macro-average F1 of 93%, clearly ahead of the other models.

Conclusion: XLNet performs well on multi-label instruction classification, which supports the InstructNet approach and offers a workable technical path for later task-oriented text processing.

Abstract: People use search engines for various topics and items, from daily essentials to more aspirational and specialized objects. Therefore, search engines have taken over as people's preferred resource. The "How To" prefix has become familiar and widely used in various search styles to find solutions to particular problems. This search allows people to find sequential instructions by providing detailed guidelines to accomplish specific tasks. Categorizing instructional text is also essential for task-oriented learning and creating knowledge bases. This study uses "How To" articles to determine the multi-label instruction category. We conduct this work on a dataset comprising 11,121 observations from wikiHow, where each record has multiple categories. To determine the multi-label category, we employ transformer-based deep neural architectures such as Generalized Autoregressive Pretraining for Language Understanding (XLNet) and Bidirectional Encoder Representation from Transformers (BERT). In our multi-label instruction classification process, we evaluate our proposed architectures using accuracy and macro F1-score as the performance metrics. This thorough evaluation revealed much about our strategy's strengths and drawbacks. Specifically, our implementation of the XLNet architecture has demonstrated the strongest performance, achieving an accuracy of 97.30% and micro and macro average scores of 89.02% and 93%, a noteworthy accomplishment in multi-label classification. This high level of accuracy and macro average score is a testament to the effectiveness of the XLNet architecture in our proposed InstructNet approach. By employing a multi-level strategy in our evaluation process, we have gained a more comprehensive knowledge of the effectiveness of our proposed architectures and identified areas for future improvement and refinement.
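
Because the abstract reports both micro- and macro-averaged scores, it may help to see how the two differ in a multi-label setting: micro-averaging pools true and false positives over all labels, while macro-averaging takes the unweighted mean of per-label F1. The sketch below uses hypothetical function names and represents each example's labels as a Python set.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred, labels):
    """y_true and y_pred are lists of label sets, one set per example."""
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        counts[label] = (tp, fp, fn)
    # Macro: average the per-label F1 scores with equal weight.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: pool the counts over labels, then compute a single F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    return micro, macro
```

Rare categories influence the macro average as much as frequent ones, so the two averages can diverge noticeably on imbalanced label sets.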

[15] CTTA-T: Continual Test-Time Adaptation for Text Understanding via Teacher-Student with a Domain-aware and Generalized Teacher

Tianlun Liu,Zhiliang Tian,Zhen Huang,Xingzhi Zhou,Wanlong Yu,Tianle Liu,Feng Liu,Dongsheng Li

Main category: cs.CL

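TL;DR: This paper proposes CTTA-T, a continual test-time adaptation framework for text understanding that handles a sequence of changing, unobserved target domains.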

Details

Motivation: Existing continual test-time adaptation methods fall short in reducing error accumulation across domains and in generalizing to unseen domains: noise filtering discards useful information, and accumulating historical domains is hard to make adaptive.

Method: The CTTA-T framework uses a teacher-student structure in which the teacher dynamically accumulates cross-domain semantics via incremental PCA to sense and adapt to domain shifts, and a refine-then-filter mechanism driven by dropout consistency improves the teacher's predictions.

Result: Experiments show that CTTA-T outperforms existing methods on multiple benchmarks, mitigating error accumulation and improving generalization to unobserved domains.

Conclusion: Through dynamic domain awareness and reliable knowledge filtering, CTTA-T achieves better adaptation and generalization in continual test-time adaptation and suits real text-understanding tasks affected by domain drift.

Abstract: Text understanding often suffers from domain shifts. To handle testing domains, domain adaptation (DA) trains a model to adapt to a fixed and observed testing domain; a more challenging paradigm, test-time adaptation (TTA), cannot access the testing domain during training and adapts online to the testing samples during testing, where the samples are from a fixed domain. We aim to explore a more practical and underexplored scenario, continual test-time adaptation (CTTA) for text understanding, which involves a sequence of testing (unobserved) domains in testing. Current CTTA methods struggle to reduce error accumulation over domains and to enhance generalization to unobserved domains: 1) noise filtering reduces accumulated errors but discards useful information, and 2) accumulating historical domains enhances generalization, but it is hard to achieve adaptive accumulation. In this paper, we propose a CTTA-T (continual test-time adaptation for text understanding) framework adaptable to evolving target domains: it adopts a teacher-student framework, where the teacher is domain-aware and generalized for evolving domains. To improve teacher predictions, we propose a refine-then-filter mechanism based on dropout-driven consistency, which calibrates predictions and removes unreliable guidance. For the adaptation-generalization trade-off, we construct a domain-aware teacher by dynamically accumulating cross-domain semantics via incremental PCA, which continuously tracks domain shifts. Experiments show that CTTA-T outperforms baselines.
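
One plausible reading of a refine-then-filter step driven by dropout consistency is: average the teacher's predictions over several stochastic forward passes (refine), then discard the pseudo-label if the passes disagree too much (filter). The sketch below implements that reading only; the agreement threshold, the function name, and the voting rule are assumptions, and the paper's exact procedure may differ.

```python
def refine_then_filter(passes, min_agreement=0.8):
    """passes: probability vectors from several dropout forward passes.

    Returns (label, mean_probs) if the passes are consistent, else None.
    """
    n_classes = len(passes[0])
    # Refine: average class probabilities over all dropout passes.
    mean = [sum(p[c] for p in passes) / len(passes) for c in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    # Filter: count how many individual passes vote for the refined label.
    votes = sum(
        1 for p in passes if max(range(n_classes), key=p.__getitem__) == label
    )
    if votes / len(passes) < min_agreement:
        return None  # too inconsistent to use as teacher guidance
    return label, mean
```

Samples that return `None` would simply be skipped when updating the student, which limits how much unreliable guidance accumulates across domains.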

[16] LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

Guo Chen,Junjie Huang,Huaijin Xie,Fei Sun,Tao Jia

Main category: cs.CL

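TL;DR: This paper proposes LiR$^3$AG, a lightweight rerank reasoning strategy framework for multi-hop question answering in retrieval-augmented generation (RAG). By restructuring retrieved evidence into coherent reasoning chains, it lets non-reasoning models adopt reasoning strategies, cutting output-token overhead by 98% and inference time by 58.6% while raising the F1 of an 8B model by 6.2% to 22.5%, enough to surpass a 32B reasoning model.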

Details

Motivation: Reasoning models improve multi-hop question answering but carry high computational cost, so more efficient reasoning strategies are needed to balance performance and efficiency.

Method: The authors analyze the strategies that reasoning models use in RAG multi-hop QA, identify two modes (Context-Grounded Reasoning and Knowledge-Reconciled Reasoning), and design the LiR$^3$AG framework, which reranks and reorganizes retrieved evidence into reasoning chains so that non-reasoning models can imitate reasoning behavior.

Result: LiR$^3$AG reduces output tokens by 98% and inference time by 58.6%, while improving the F1 of an 8B non-reasoning model by 6.2% to 22.5%, which exceeds the performance of a 32B reasoning model.

Conclusion: LiR$^3$AG offers an efficient, practical reasoning path for RAG systems, achieving multi-hop QA performance beyond that of a larger model at very low resource cost.

Abstract: Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR$^3$AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR$^3$AG reduces output-token overhead by 98% and inference time by 58.6% on average, while improving the F1 performance of an 8B non-reasoning model by 6.2% to 22.5%, surpassing the performance of a 32B reasoning model in RAG and offering a practical and efficient path forward for RAG systems.

[17] Towards Efficient Agents: A Co-Design of Inference Architecture and System

Weizhe Lin,Hui-Ling Zhen,Shuai Yang,Xian Wang,Renxi Liu,Hanting Chen,Wangze Zhang,Chuansai Zhou,Yiming Li,Chen Chen,Xing Li,Zhiyuan Yang,Xiaosong Li,Xianzhi Yu,Zhenhua Dong,Mingxuan Yuan,Yunhe Wang

Main category: cs.CL

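TL;DR: This paper proposes AgentInfer, a unified end-to-end framework for accelerating LLM-based agents. By co-optimizing inference and system architecture, it improves the efficiency of multi-turn reasoning and tool calls, cutting ineffective token consumption by more than 50% and delivering a 1.8-2.5x speedup while preserving accuracy.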

Details

Motivation: 现有的LLM智能体在实际部署中面临严重效率问题，这些问题源于多轮推理循环、上下文增长和异构工具交互带来的系统性延迟，而非单次推理本身的开销。因此需要一种面向整体任务完成的优化框架。 Method: 提出AgentInfer框架，包含四个协同组件：AgentCollab（动态分配大小模型角色的分层双模型推理）、AgentSched（缓存感知的混合调度器）、AgentSAM（基于后缀自动机的推测解码方法）和AgentCompress（语义压缩机制），共同构成一个自进化的推理引擎。 Result: 在BrowseComp-zh和DeepDiver基准上实验表明，AgentInfer减少了50%以上的无效token使用，实现1.8-2.5倍的整体加速，且保持了原有准确性。 Conclusion: 优化代理型任务的完成效率需从系统级协同设计出发，而非仅关注每token吞吐量；AgentInfer为构建高效、可扩展、自进化的智能系统提供了有效路径。 Abstract: The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference, but from the systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design. We decompose the problem into four synergistic components: AgentCollab, a hierarchical dual-model reasoning framework that balances large- and small-model usage through dynamic role assignment; AgentSched, a cache-aware hybrid scheduler that minimizes latency under heterogeneous request patterns; AgentSAM, a suffix-automaton-based speculative decoding method that reuses multi-session semantic memory to achieve low-overhead inference acceleration; and AgentCompress, a semantic compression mechanism that asynchronously distills and reorganizes agent memory without disrupting ongoing reasoning. Together, these modules form a Self-Evolution Engine capable of sustaining efficiency and cognitive stability throughout long-horizon reasoning tasks. 
Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8-2.5 times speedup with preserved accuracy. These results underscore that optimizing for agentic task completion, rather than merely per-token throughput, is the key to building scalable, efficient, and self-improving intelligent systems.
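
The idea behind drafting tokens from previously seen text, as AgentSAM is described as doing, can be illustrated with a small sketch. This is not the paper's suffix automaton: it is a naive longest-suffix match over a token history, and the function name `draft_from_history` and its parameters are hypothetical. In speculative decoding, the target model would still verify the returned draft tokens before accepting them.

```python
from typing import List, Sequence


def draft_from_history(history: Sequence[str], context: Sequence[str],
                       max_suffix: int = 8, draft_len: int = 4) -> List[str]:
    """Propose draft tokens by matching the longest suffix of `context` in `history`.

    Naive stand-in for a suffix automaton (assumption): a quadratic scan
    instead of an automaton lookup, but the same matching rule.
    """
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = list(context[-n:])
        # Scan from the most recent position that still leaves one token to draft.
        for start in range(len(history) - n - 1, -1, -1):
            if list(history[start:start + n]) == suffix:
                return list(history[start + n:start + n + draft_len])
    return []  # no match: fall back to ordinary decoding


history = "the tool returned status ok and the tool returned status error".split()
context = "then the tool returned".split()
draft = draft_from_history(history, context)
```

A suffix automaton would return the same match without rescanning the history for each suffix length, which is what makes the approach cheap enough to run on every decoding step.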

[18] LLM-based Few-Shot Early Rumor Detection with Imitation Agent

Fengzhu Zeng,Qian Shao,Ling Cheng,Wei Gao,Shih-Fen Cheng,Jing Ma,Cheng Niu

Main category: cs.CL

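TL;DR: Proposes a new early rumor detection (EARD) framework that combines an autonomous agent with a large language model (LLM): the agent determines the best time point for detection and the LLM classifies the rumor. In few-shot settings it achieves high accuracy and earlier detection while training only a lightweight agent and leaving the LLM untrained.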

Details

Motivation: In data-scarce settings, existing large language models (LLMs) are ill-suited to time-series data and computationally expensive, which makes effective early rumor detection (EARD) difficult; an efficient few-shot solution is therefore needed. Method: The framework consists of an autonomous agent and an LLM: the agent decides the earliest time point at which a claim can be accurately classified (early time point determination), and the LLM acts as a fixed rumor detector. Only the lightweight agent is trained and the LLM remains training-free, which lowers compute cost and suits few-shot settings. Result: Experiments on four real-world datasets show that the method improves performance across different LLMs and outperforms existing EARD methods in both accuracy and earliness. Conclusion: The framework provides the first effective solution for few-shot early rumor detection, balancing efficiency and performance, with good potential for practical use. Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for early time point determination, while the LLM serves as a powerful rumor detector. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

[19] DACE For Railway Acronym Disambiguation

El Mokhtar Hribach,Oussama Mechhour,Mohammed Elmonstaser,Yassine El Boudouri,Othmane Kabal

Main category: cs.CL

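TL;DR: This paper presents DACE, a framework that uses dynamic prompting, retrieval-augmented generation, and contextual selection, among other techniques, for acronym disambiguation in French railway documents; it achieved an F1 score of 0.9069 and first place in the TextMine'26 competition.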

Details

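Motivation: Acronym ambiguity is a serious problem in technical text processing and makes automated analysis especially difficult in specialized domains, so more effective disambiguation methods are needed. Method: DACE combines dynamic prompting, retrieval-augmented generation, contextual selection, and ensemble aggregation, using large language models and external domain knowledge for adaptive in-context learning. Result: The approach ranked first in the TextMine'26 competition with an F1 score of 0.9069, demonstrating its effectiveness in low-resource, high-ambiguity scenarios. Conclusion: DACE effectively mitigates hallucination and improves LLM performance on domain-specific acronym disambiguation, with strong practicality and potential for broader use. Abstract: Acronym Disambiguation (AD) is a fundamental challenge in technical text processing, particularly in specialized sectors where high ambiguity complicates automated analysis. This paper addresses AD within the context of the TextMine'26 competition on French railway documentation. We present DACE (Dynamic Prompting, Retrieval Augmented Generation, Contextual Selection, and Ensemble Aggregation), a framework that enhances Large Language Models through adaptive in-context learning and external domain knowledge injection. By dynamically tailoring prompts to acronym ambiguity and aggregating ensemble predictions, DACE mitigates hallucination and effectively handles low-resource scenarios. Our approach secured the top rank in the competition with an F1 score of 0.9069.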

[20] LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators

Mateusz Lango,Ondřej Dušek

Main category: cs.CL

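TL;DR: Proposes a new neurosymbolic framework for RDF-to-text generation in which multiple cooperating LLM agents produce rule-based code from RDF triples. It needs no supervised training data, and the resulting system is fully interpretable and fast at generation time.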

Details

Motivation: Traditional RDF-to-text models depend on large amounts of annotated data, are prone to hallucination, and lack interpretability. Method: Multiple LLM agents interact to automatically write rule-based Python code that serves as the generator for a given domain, relying only on RDF triples and using no in-domain human reference texts. Result: Experiments on the WebNLG and OpenDialKG datasets show that the method substantially reduces hallucination with only a slight loss in fluency, and that generation is extremely fast, running on a single CPU. Conclusion: The framework offers a training-free, fully interpretable, efficient, and low-resource RDF-to-text solution and opens a new path for combining neural and symbolic methods. Abstract: We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is "trained" through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a generator for the given domain, based on RDF triples only, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to finetuned or prompted language models.
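
To make "rule-based Python code for a generator" concrete, here is a minimal sketch of what such a generator could look like. The predicates, templates, and function names are illustrative assumptions, not the code the paper's agents produce.

```python
from typing import Callable, Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical per-predicate templates; a real system would have many more rules.
RULES: Dict[str, Callable[[str, str], str]] = {
    "birthPlace": lambda s, o: f"{s} was born in {o}.",
    "capital": lambda s, o: f"The capital of {s} is {o}.",
}


def clean(entity: str) -> str:
    """Turn an RDF-style identifier such as Alan_Turing into plain text."""
    return entity.replace("_", " ")


def verbalize(triple: Triple) -> str:
    """Apply the rule for the triple's predicate, or a generic fallback."""
    subject, predicate, obj = triple
    subject, obj = clean(subject), clean(obj)
    rule = RULES.get(predicate)
    if rule is None:
        return f"{subject} {predicate} {obj}."  # fallback for unseen predicates
    return rule(subject, obj)


def generate(triples: Iterable[Triple]) -> str:
    """Verbalize each triple and join the sentences."""
    return " ".join(verbalize(t) for t in triples)


text = generate([("Alan_Turing", "birthPlace", "London")])
```

Because every output sentence traces back to one template and one triple, a generator of this kind cannot introduce facts absent from the input, which is the interpretability and hallucination argument the abstract makes.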

[21] SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

Wiktor Kamzela,Mateusz Lango,Ondrej Dusek

Main category: cs.CL

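TL;DR: This paper uses large language models to generate personalized stories for language learners that contain only vocabulary the learner already knows, introduce new words naturally in context, and rely on a spaced repetition system to optimize vocabulary learning and review.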

Details

Motivation: To help language learners acquire new vocabulary more effectively through comprehensible input while consolidating vocabulary they have already learned. Method: Large language models generate personalized stories, combining three generation methods with three lexical-constraint strategies, and a Spaced Repetition System is integrated to optimize how often and when vocabulary appears. Result: Experiments in English, Chinese, and Polish show that the generated stories are better in grammaticality, coherence, and word-usage examples than text produced by standard constrained beam search. Conclusion: The method effectively generates high-quality, personalized language-learning material and improves vocabulary learning efficiency, with multilingual applicability and practical potential. Abstract: In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read and the vocabulary reviewing/learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical, coherent, and provide better examples of word usage than texts generated by the standard constrained beam search approach.
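
The lexical constraint described here (a story may use only known words plus the words being taught) can be checked after generation with a few lines. This is a hedged sketch, not one of the paper's three enforcement strategies; the function name is hypothetical, and the regex tokenizer assumes whitespace-delimited languages such as English or Polish, so Chinese would need a word segmenter instead.

```python
import re
from typing import Iterable, List


def out_of_vocabulary(story: str, known: Iterable[str],
                      new_words: Iterable[str]) -> List[str]:
    """Return the words in `story` that are neither known nor newly taught."""
    allowed = {w.lower() for w in known} | {w.lower() for w in new_words}
    # Runs of Unicode letters only: no digits, underscores, or punctuation.
    tokens = re.findall(r"[^\W\d_]+", story.lower())
    return sorted(set(tokens) - allowed)


violations = out_of_vocabulary(
    "The cat sat. The cat purred!",
    known={"the", "cat", "sat"},
    new_words={"purred"},
)
```

An empty result means the story satisfies the constraint; a non-empty one could trigger regeneration or rejection of the candidate story.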

[22] AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

Mark Kashirskiy,Artiom Lipinski,Ilya Makarov

Main category: cs.CL

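TL;DR: This paper presents AraToken, a tokenizer optimized for Arabic. It is built on the SentencePiece Unigram algorithm combined with Arabic-specific normalization, and it markedly improves tokenization efficiency and compression.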

Details

Motivation: General-purpose tokenizers perform poorly on morphologically rich languages such as Arabic, producing overly long sequences and low compression efficiency, so a tokenization scheme optimized specifically for Arabic is needed. Method: The approach uses the SentencePiece Unigram algorithm with a comprehensive normalization pipeline covering Alif variants, diacritics, and Arabic-Indic numerals. A Language Extension Pipeline (LEP) integrates the optimized tokenizer into Qwen3-0.6B, fine-tuning with mean subtoken initialization and selective unfreezing of transformer layers. Result: Compared with unnormalized baselines, SentencePiece with normalization lowers fertility by 18% (from 1.35 to 1.199 tokens/word), and LEP reduces evaluation loss from 8.28 to 2.43 within 800 steps. Conclusion: AraToken markedly improves Arabic tokenization efficiency and model performance in large language models, and the proposed LEP method allows fast adaptation to a new tokenizer, supporting Arabic NLP research. Abstract: Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
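
The three normalization targets named in the abstract, and the fertility metric (tokens per word), can be sketched as follows. The exact rules of the AraToken pipeline are not given here, so the choices below are assumptions: Alif variants fold to bare Alif, the common tashkeel range is stripped, and Arabic-Indic digits map to ASCII digits.

```python
import re
from typing import Callable, Iterable, List

# Assumed mapping: Alif with madda/hamza above/hamza below/wasla -> bare Alif.
ALIF_VARIANTS = str.maketrans({
    "\u0622": "\u0627", "\u0623": "\u0627",
    "\u0625": "\u0627", "\u0671": "\u0627",
})
# Arabic-Indic digits U+0660..U+0669 -> ASCII 0..9 (direction is an assumption).
DIGITS = str.maketrans({chr(0x0660 + i): str(i) for i in range(10)})
# Tashkeel U+064B..U+0652 plus superscript Alif U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize_arabic(text: str) -> str:
    """Strip diacritics, unify Alif variants, and convert Arabic-Indic digits."""
    text = DIACRITICS.sub("", text)
    return text.translate(ALIF_VARIANTS).translate(DIGITS)


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    tokens = words = 0
    for text in texts:
        tokens += len(tokenize(text))
        words += len(text.split())
    return tokens / words if words else 0.0


normalized = normalize_arabic("\u0623\u064e\u062d\u0652\u0645\u064e\u062f \u0661\u0662\u0663")
```

Here `tokenize` is any callable returning a token list, for example a trained SentencePiece model's encode method wrapped to return pieces; lower fertility means fewer tokens are spent per word.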

[23] An Agentic AI Framework for Training General Practitioner Student Skills

Victor De Marez,Jens Van Nooten,Luna De Bruyne,Walter Daelemans

Main category: cs.CL

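TL;DR: This paper presents an agentic virtual simulated patient (VSP) system for training the general practitioner skills of medical students. It integrates evidence-based vignette generation, persona-driven dialogue control, and standards-based assessment and feedback, and its realism, usability, and teaching value were evaluated with 14 medical students.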

Details

Motivation: Existing virtual simulated patient systems fall short in medical accuracy, role consistency, scenario generation, and educational feedback, so a more reliable and pedagogically meaningful solution is needed. Method: A modular agentic framework with three core components: configurable, evidence-based vignette generation; persona-driven patient dialogue control with retrieval support; and standards-based assessment and feedback on communication and clinical reasoning. The framework is instantiated in a realistic spoken consultation setting and evaluated with users. Result: The evaluation with 14 medical students shows that the system delivers realistic, vignette-faithful dialogue, appropriate difficulty, and a stable persona, and that it produces high-quality, example-rich feedback, with excellent overall usability ratings. Conclusion: An agentic architecture that separates scenario control, interaction control, and standards-based assessment is an effective pattern for building reliable and pedagogically valuable VSP tools and helps advance AI-driven simulation training in medical education. Abstract: Advancements in large language models offer strong potential for enhancing virtual simulated patients (VSPs) in medical education by providing scalable alternatives to resource-intensive traditional methods. However, current VSPs often struggle with medical accuracy, consistent roleplaying, scenario generation for VSP use, and educationally structured feedback. We introduce an agentic framework for training general practitioner student skills that unifies (i) configurable, evidence-based vignette generation, (ii) controlled persona-driven patient dialogue with optional retrieval grounding, and (iii) standards-based assessment and feedback for both communication and clinical reasoning. We instantiate the framework in an interactive spoken consultation setting and evaluate it with medical students (N = 14). Participants reported realistic and vignette-faithful dialogue, appropriate difficulty calibration, a stable personality signal, and highly useful example-rich feedback, alongside excellent overall usability. These results support agentic separation of scenario control, interaction control, and standards-based assessment as a practical pattern for building dependable and pedagogically valuable VSP training tools.

[24] Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling

Christopher Román Jaimes

Main category: cs.CL

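TL;DR: Proposes an automated, scalable pipeline for mitigating spurious correlations in natural language inference models. It uses LF-LMI to detect semantic artifacts, an LLM to generate a high-quality synthetic contrast set, and a dynamic balanced sampling strategy, and it substantially improves robustness while preserving in-domain performance.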

Details

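Motivation: NLI models often rely on spurious correlations rather than semantic reasoning, and existing mitigation methods either carry high annotation costs or cause catastrophic forgetting during fine-tuning. Method: Log-Frequency LMI (LF-LMI) is introduced to detect semantic artifacts; a high-quality synthetic contrast set is generated through an LLM-synthesis pipeline with multi-judge verification; and a dynamic balanced sampling strategy is used to prevent forgetting. Result: Robustness, measured as consistency on a challenging benchmark, rises from 63.5% to 81.0% while in-domain accuracy stays at 88.4%, significantly outperforming naive fine-tuning. Conclusion: The method effectively reduces the reliance of NLI models on spurious correlations and improves their semantic reasoning without sacrificing original performance. Abstract: Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.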

[25] Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

Mykola Kuz,Ihor Lazarovych,Mykola Kozlenko,Mykola Pikuliak,Andrii Kvasniuk

Main category: cs.CL

TL;DR: Proposes a hybrid deep learning model that combines LSTM, CNN, and an Attention mechanism to improve text-based web content classification.

Details

Motivation: To overcome the limits of single models in capturing both local text features and long-range dependencies, and to improve the accuracy and generalization of text classification. Method: Text is represented with pretrained GloVe word embeddings; a CNN extracts n-gram and lexical features, an LSTM models sequential dependencies, and an Attention mechanism focuses on the key parts of the input. Model performance is assessed with 5-fold cross-validation. Result: The model reaches an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93, outperforming baselines that use only CNN, LSTM, or BERT. Conclusion: The hybrid LSTM-CNN-Attention architecture effectively fuses fine-grained structure with semantic context, suits complex unstructured text, and has potential for real-time or near-real-time systems, supporting broader use of hybrid deep learning methods in NLP tasks. Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation.
By combining the presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.
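
As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The snippet below recomputes it from the abstract's precision (0.94) and recall (0.92); it assumes all three figures come from the same averaging scheme, which the abstract does not state.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract; rounding to two decimals matches its reporting style.
reported_f1 = round(f1_score(0.94, 0.92), 2)
```

The result rounds to the 0.93 given in the abstract, so the three reported numbers are mutually consistent under that assumption.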

[26] Teaching and Critiquing Conceptualization and Operationalization in NLP

Vagrant Gautam

Main category: cs.CL

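TL;DR: This paper examines how abstract concepts commonly used in NLP (such as "interpretability," "bias," "reasoning," and "stereotypes") are conceptualized and operationalized, and describes a seminar that guides students toward a deeper understanding of these questions through interdisciplinary reading and critical discussion.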

Details

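Motivation: NLP frequently uses abstract concepts that are never clearly defined, which creates potential inconsistencies in dataset construction, evaluation metrics, and claims about systems; the meaning of these concepts and how they are measured need to be clarified. Method: A seminar was designed and taught with an interdisciplinary reading list and an emphasis on class discussion and critical analysis, helping students examine how key terms are conceptualized and operationalized. Result: Students gain a deeper understanding of core abstract concepts in NLP, learn to identify gaps between conceptual and operational definitions in existing work, and develop the ability to critically evaluate related research. Conclusion: Making the definition and operationalization of concepts explicit is essential for NLP research, and seminars of this kind help strengthen researchers' theoretical awareness and methodological rigor. Abstract: NLP researchers regularly invoke abstract concepts like "interpretability," "bias," "reasoning," and "stereotypes," without defining them. Each subfield has a shared understanding or conceptualization of what these terms mean and how we should treat them, and this shared understanding is the basis on which operational decisions are made: Datasets are built to evaluate these concepts, metrics are proposed to quantify them, and claims are made about systems. But what do they mean, what should they mean, and how should we measure them? I outline a seminar I created for students to explore these questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.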

[27] Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

S Mahmudul Hasan,Shaily Roy,Akib Jawad Nafis

Main category: cs.CL

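TL;DR: The study shows that text-only linguistic modeling hits a performance ceiling in political fact-checking: simple models such as a linear SVM perform on par with complex Transformer models, exposing the limits of current methods in handling semantic ambiguity and in generalization.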

Details

Motivation: To explore the performance limits of text-only language models in detecting political disinformation; how effective existing models are when faced with subtle linguistic nuance remains unclear. Method: A systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark, isolating lexical features (Bag-of-Words, TF-IDF) from semantic embeddings (GloVe), analyzing performance across models, and examining the effect of SMOTE data augmentation. Result: The weighted F1-score for fine-grained classification does not exceed 0.32. A linear SVM reaches an accuracy of 0.624, comparable to RoBERTa (0.620). Tree-based models exceed 99% training accuracy but reach only about 25% on test data, indicating heavy reliance on lexical memorization. SMOTE brings no improvement, which suggests the limitation is semantic rather than distributional. Conclusion: In political fact-checking, simply increasing model complexity without bringing in external knowledge yields diminishing returns; breaking through the ceiling requires combining semantic understanding with external information. Abstract: The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.

[28] LLMs on Drugs: Language Models Are Few-Shot Consumers

Alexander Doudkin

Main category: cs.CL

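TL;DR: This study is the first systematic evaluation of how imposing different psychoactive-substance persona prompts (LSD, cocaine, alcohol, cannabis) on GPT-5-mini at inference time affects its performance on ARC-Challenge. These simple sentence-level prompts significantly reduce accuracy, mainly because they disrupt the output template.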

Details

Motivation: To examine how persona prompts affect the reasoning reliability of large language models, and to show that seemingly harmless prompts can act like a "consumable" that undermines the stability of model output. Method: A controlled experiment on the ARC-Challenge dataset with five conditions (four drug persona prompts and one sober control), 100 items per condition, using deterministic decoding and full logging, with statistical analysis via Wilson confidence intervals and Fisher exact tests. Result: Control accuracy is 0.45; alcohol drops to 0.10 (p=3.2e-8), cocaine to 0.21 (p=4.9e-4), LSD to 0.19 (p=1.3e-4), and cannabis to 0.30 (p=0.041). The main cause of the drop is that persona prompts disrupt the mandated "Answer: " output format. Conclusion: Persona prompts can be viewed as a "few-shot consumable" that severely damages model reliability without changing model weights, so prompts need to be designed more carefully to keep output consistent. Abstract: Large language models (LLMs) are sensitive to the personas imposed on them at inference time, yet prompt-level "drug" interventions have never been benchmarked rigorously. We present the first controlled study of psychoactive framings on GPT-5-mini using ARC-Challenge. Four single-sentence prompts -- LSD, cocaine, alcohol, and cannabis -- are compared against a sober control across 100 validation items per condition, with deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests. Control accuracy is 0.45; alcohol collapses to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041), largely because persona prompts disrupt the mandated "Answer: " template. Persona text therefore behaves like a "few-shot consumable" that can destroy reliability without touching model weights. All experimental code, raw results, and analysis scripts are available at https://github.com/lexdoudkin/llms-on-drugs.
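
The two statistical tools named in the abstract can be written in a few lines of standard-library Python. This is a sketch, not the authors' analysis script (which lives in their repository). It assumes the correct-answer counts implied by the reported accuracies on 100 items (for example 45 for control and 10 for alcohol), and it uses the common two-sided Fisher definition that sums all tables no more probable than the observed one.

```python
from math import comb, sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at most as likely as the observed one (small float tolerance).
    p_value = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


control_ci = wilson_interval(45, 100)                  # control: 45 of 100 correct
p_alcohol = fisher_exact_two_sided(45, 55, 10, 90)     # control vs alcohol
p_cannabis = fisher_exact_two_sided(45, 55, 30, 70)    # control vs cannabis
```

With only 100 items per condition the control interval is wide, which helps explain why the smallest drop (cannabis, 0.45 to 0.30) is reported with a much weaker p-value than the others.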

[29] Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering

Sungjoon Park,Varun Ramamurthi,Owen Terry

Main category: cs.CL

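TL;DR: This paper studies behavioral steering of language models through neologism learning and compares it with low-rank adaptation (LoRA) fine-tuning. Neologisms perform better under matched training conditions, and the model spontaneously invents new words when asked about a neologism.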

Details

Motivation: To explore a new way of flexibly steering language model behavior without full fine-tuning, using neologisms for efficient and reversible behavioral control. Method: Neologisms are trained to represent new concepts and compared with LoRA fine-tuning under the same data and hyperparameters; the model's self-verbalizations of neologisms are also analyzed. Result: Neologism learning outperforms LoRA fine-tuning, and the model sometimes makes up its own new words when asked about a neologism. Conclusion: Neologism learning is a more efficient and flexible way to steer language model behavior, outperforming fine-tuning and showing interesting self-verbalization behavior. Abstract: In language modeling, neologisms are new tokens trained to represent a concept not already included in a given model's vocabulary. Neologisms can be used to encourage specific behavior in models, for example by appending prompts with "Give me a neologism answer." Behavioral steering can also be achieved through fine-tuning, albeit with more compute and less flexibility: learning a neologism only trains d parameters and allows the user to still access the model's default behavior. We compare the performance of neologism learning against low-rank adaptation (LoRA) fine-tuning, finding that neologisms outperform fine-tuned models under a matched training setup (same data and hyperparameters). We also investigate self-verbalizations of neologisms, and observe that the model will occasionally make up its own new words when asked about a neologism.

[30] From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation

Amit Barman,Atanu Mandal,Sudip Kumar Naskar

Main category: cs.CL

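TL;DR: This paper studies English-to-Hindi legal machine translation for the JUST-NLP 2025 shared task using two approaches: fine-tuning an OPUS-MT model and training a Transformer from scratch. The fine-tuned model significantly outperforms the baseline on BLEU and other metrics, showing that domain adaptation improves translation quality and can help widen access to justice in multilingual settings.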

Details

Motivation: In multilingual countries such as India, most legal documents are in English, which creates barriers to legal information; the goal is to improve access to justice and legal transparency. Method: Two strategies are used: fine-tuning a pre-trained OPUS-MT model for domain adaptation, and training a Transformer model from scratch on the provided legal corpus. Result: The fine-tuned OPUS-MT model reaches a SacreBLEU score of 46.03, significantly outperforming the baseline and from-scratch models, and performs well across several automatic metrics. Conclusion: Domain adaptation, such as fine-tuning a pre-trained model, is highly effective for legal machine translation and is a viable path toward more accessible legal information in multilingual settings. Abstract: In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with two complementary strategies: fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation, and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.

[31] On Finding Inconsistencies in Documents

Charles J. Lovering,Seth Ebner,Brandon Smock,Michael Krumdick,Saad Rabbani,Ahmed Muhammad,Varshini Reddy,Chris Tanner

Main category: cs.CL

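TL;DR: This paper introduces FIND, a benchmark for evaluating how well language models find inconsistencies in long, complex documents. The strongest model (gpt-5) detects only 64% of the manually inserted errors, yet it also uncovers genuine problems that had gone unnoticed in the original documents, showing that the task remains challenging.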

Details

Motivation: To reduce the monetary, reputational, and scientific costs of document inconsistencies, the effectiveness of language models for auditing professional documents needs to be assessed. Method: The FIND benchmark is built from long, complex documents in domains such as academia, law, and finance, each with an inconsistency inserted manually by a domain expert, and several language models are tested on detecting them. Result: The best-performing model, gpt-5, recovers 64% of the inserted inconsistencies, and on 50 arXiv papers 136 of its 196 suggestions were judged to be genuine inconsistencies missed by the original authors. Conclusion: Language models show promise for assisting document review, but even the best models still miss almost half of the inconsistencies, so automated inconsistency detection remains a challenging task. Abstract: Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.

[32] A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

Prabigya Acharya,Liza Shrestha

Main category: cs.CL

TL;DR: Lightweight models such as T5-small and Mistral-Instruct can match frontier large models on PII masking while offering advantages in data privacy and compute cost.

Details

Motivation: To explore whether lightweight models can match the PII masking performance of frontier large language models while protecting privacy and lowering compute cost. Method: T5-small and Mistral-Instruct-v0.3 are fine-tuned and evaluated on English datasets built from the AI4Privacy benchmark, studying how label standardization and PII representation affect the different architectures. Result: Both lightweight models perform well on entity-level and character-level metrics. Mistral achieves higher F1 and recall, while T5 has lower inference cost and more controllable output. Label normalization improves performance across architectures, but performance degrades on informal inputs. Conclusion: Lightweight models balance accuracy, robustness, and efficiency effectively and suit practical PII masking, especially in applications that are sensitive about data handling or constrained in resources. Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.
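
A character-level metric for PII masking can be sketched as below. The paper's exact metric definitions are not given here, so this is one plausible reading: spans are half-open `(start, end)` character offsets, and precision and recall are computed over the sets of characters covered by gold and predicted spans. The function name is hypothetical.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def covered(spans: Iterable[Span]) -> Set[int]:
    """Set of character positions covered by any span."""
    return {i for start, end in spans for i in range(start, end)}


def char_level_prf(gold_spans: Iterable[Span],
                   pred_spans: Iterable[Span]) -> Tuple[float, float, float]:
    """Character-level precision, recall, and F1 for predicted PII spans."""
    gold, pred = covered(gold_spans), covered(pred_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold PII at characters 0-9; the model masked characters 5-14 instead.
scores = char_level_prf([(0, 10)], [(5, 15)])
```

Unlike an entity-level exact-match metric, which would score this prediction as a complete miss, the character-level view gives partial credit for the overlap, which is why reporting both is informative.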

[33] LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

Jensen Zhang,Ningyuan Liu,Yijia Fan,Zihao Huang,Qinglin Zeng,Kaitong Cai,Jian Wang,Keze Wang

Main category: cs.CL

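TL;DR: LLM-CAS is a real-time hallucination correction framework based on hierarchical reinforcement learning. It improves the factual accuracy of large language models by dynamically selecting temporary neuron perturbations during inference, without modifying model parameters, and experiments show it clearly outperforms existing methods on several benchmarks.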

Details

Motivation: Large language models often generate hallucinated content that lacks factual grounding. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are costly in data and compute, while static editing methods struggle with context-dependent errors and are prone to catastrophic forgetting. Method: The LLM-CAS framework formulates hallucination correction as a hierarchical reinforcement learning problem and trains an agent policy that dynamically selects temporary neuron perturbations at inference time based on the current context, enabling adaptive, fine-grained correction without permanent parameter changes. Result: Experiments across multiple language models show that LLM-CAS clearly improves factual accuracy, with gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA, outperforming existing methods such as ITI, CAA, and SADI. Conclusion: LLM-CAS provides an efficient, context-aware way to improve the reliability of large language models, with good potential for multimodal extension. Abstract: Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are data intensive and computationally expensive, while static parameter editing methods struggle with context dependent errors and catastrophic forgetting. We propose LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning problem. LLM-CAS trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context. Unlike prior dynamic approaches that rely on heuristic or predefined adjustments, this policy driven mechanism enables adaptive and fine grained correction without permanent parameter modification. Experiments across multiple language models demonstrate that LLM-CAS consistently improves factual accuracy, achieving gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform both static editing methods such as ITI and CAA and the dynamic SADI framework. Overall, LLM-CAS provides an efficient and context aware solution for improving the reliability of LLMs, with promising potential for future multimodal extensions.

[34] Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital

Pierre Colombo,Malik Boudiaf,Allyn Sweet,Michael Desa,Hongxi Wang,Kevin Candra,Syméon del Marmol

Main category: cs.CL

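TL;DR: This paper examines capitalization tie-out, a complex legal workflow carried out before venture capital financings. It identifies the shortcomings of current large language models and agentic systems in multi-document reasoning, evidence traceability, and deterministic output, and proposes a "world model" architecture for automating the task as a foundation for applied legal AI.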

Details

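Motivation: Capitalization tie-out is a critical but tedious part of legal due diligence, and existing AI systems struggle to meet its demands for precision and traceability, so a dedicated AI solution is needed to improve efficiency and accuracy. Method: Capitalization tie-out is framed as a real-world benchmark task for legal AI; the performance of existing agentic systems is evaluated and compared; and a "world model" architecture is proposed to support multi-document reasoning and deterministic output. Result: Existing LLMs and agentic systems perform poorly on capitalization tie-out, lacking reliable evidence tracing and consistent output; the proposed "world model" architecture is expected to address these problems and advance practical legal AI. Conclusion: Capitalization tie-out is a challenging application for legal AI that current technology cannot yet fully handle, and the "world model" architecture offers a viable direction for automating such high-precision legal tasks. Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation and, more broadly, as a foundation for applied legal intelligence.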

[35] Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design

Yuchen Li,Handing Wang,Bing Xue,Mengjie Zhang,Yaochu Jin

Main category: cs.CL

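TL;DR: Proposes APF, a framework that uses large language models to automatically convert engineers' natural-language design requirements into executable optimization models. Supervised fine-tuning on automatically generated high-quality data markedly improves formalization accuracy and design performance without solver feedback.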

Details

Motivation: In high-cost simulation-driven design, turning ambiguous design requirements into a mathematical optimization problem depends on expert knowledge and is time-consuming, and existing LLM approaches are limited by either inaccurate formalization or a reliance on solver feedback that is unavailable. Method: The APF framework builds a pipeline for automatically generating high-quality training data, including data generation and test instance annotation, and uses it for supervised fine-tuning of LLMs, achieving automated problem formulation without solver feedback. Result: On an antenna design task, APF significantly outperforms existing methods in both the accuracy of requirement formalization and the quality of the resulting radiation efficiency curves in meeting the design goals. Conclusion: APF effectively addresses automated problem formulation in high-cost simulation settings, and data-driven fine-tuning markedly improves the practicality and accuracy of LLMs in engineering design. Abstract: In the high-cost simulation-driven design domain, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. This process is time-consuming and heavily reliant on expert knowledge. While large language models (LLMs) offer potential for automating this task, existing approaches either suffer from poor formalization that fails to accurately align with the design intent or rely on solver feedback for data filtering, which is unavailable due to the high simulation costs. To address this challenge, we propose APF, a framework for solver-independent, automated problem formulation via LLMs designed to automatically convert engineers' natural language requirements into executable optimization models. The core of this framework is an innovative pipeline for automatically generating high-quality data, which overcomes the difficulty of constructing suitable fine-tuning datasets in the absence of high-cost solver feedback with the help of data generation and test instance annotation. The generated high-quality dataset is used to perform supervised fine-tuning on LLMs, significantly enhancing their ability to generate accurate and executable optimization problem formulations. Experimental results on antenna design demonstrate that APF significantly outperforms the existing methods in both the accuracy of requirement formalization and the quality of resulting radiation efficiency curves in meeting the design goals.

[36] MemEvolve: Meta-Evolution of Agent Memory Systems

Guibin Zhang,Haotian Ren,Chong Zhan,Zhenhong Zhou,Junhao Wang,He Zhu,Wangchunshu Zhou,Shuicheng Yan

Main category: cs.CL

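TL;DR: This paper presents MemEvolve, a meta-evolutionary framework that evolves both an agent's experiential knowledge and its memory architecture, and validates its strong performance and generalization on several agentic benchmarks using EvolveLab, a unified memory codebase.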

Details

Motivation: Existing memory systems for LLM-based agents rely on manually designed memory architectures that cannot adapt dynamically to the task context, which limits the self-evolving potential of these systems. Method: The MemEvolve framework jointly evolves an agent's experiential knowledge and its memory architecture. The EvolveLab codebase modularizes 12 representative memory systems into four components (encode, store, retrieve, manage), forming a comparable and extensible experimental platform. Result: On four challenging agentic benchmarks, MemEvolve improves frameworks such as SmolAgent and Flash-Searcher by up to 17.06% and generalizes well across tasks and across LLMs. Conclusion: MemEvolve makes the memory system itself self-evolving, which both improves agent performance and supports open research on self-evolving agent systems. Abstract: Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to 17.06%; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.

[37] From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure

Thorsten Hellert,Nikolay Agladze,Alex Giovannone,Jan Jug,Frank Mayet,Mark Sherwin,Antonin Sulc,Chris Tennant

Main category: cs.CL

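TL;DR: This paper formulates semantic channel finding, the problem of mapping natural-language intent to control-system signals in complex experimental facilities, and proposes a four-paradigm framework to guide architecture choices under different data regimes. Implementations at facilities ranging from compact free-electron lasers to large synchrotron light sources reach 90-97% accuracy.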

Details

Motivation: Modern experimental platforms expose large numbers of control and diagnostic channels, yet locating signals depends on informal expert knowledge, inconsistent naming conventions, and fragmented documentation, which limits reliability, scalability, and the development of language-model interfaces. Method: Proposes a four-paradigm framework: (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. Result: Proof-of-concept implementations at four operational facilities, spanning two orders of magnitude in scale and different control-system architectures, reach 90-97% accuracy on expert-curated operational queries. Conclusion: The four-paradigm framework supports semantic channel finding across different facility environments, improves the efficiency and accuracy of signal lookup, and provides a scalable foundation for AI-driven control interfaces. Abstract: Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding (mapping natural-language intent to concrete control-system signals) as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale, from compact free-electron lasers to large synchrotron light sources, and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.

[38] From Word to World: Can Large Language Models be Implicit Text-based World Models?

Yixia Li,Hongru Wang,Jiahao Qiu,Zhenfei Yin,Dongdong Zhang,Cheng Qian,Zeping Li,Pony Ma,Guanhua Chen,Heng Ji,Mengdi Wang

Main category: cs.CL

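TL;DR: This paper studies how large language models (LLMs) acting as world models can support agent reinforcement learning in text-based environments. It proposes a three-level evaluation framework and finds that sufficiently trained world models improve agent performance through action verification and synthetic trajectory generation, among other uses, although the benefit depends on behavioral coverage and environment complexity.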

Details

Motivation: Real-world environments are non-adaptive, limited in coverage, and hard to scale, which constrains experience-driven reinforcement learning; the paper therefore asks whether LLMs can serve as reliable world models that improve agent learning efficiency. Method: Reinterprets language modeling in text-based environments as next-state prediction under interaction, proposes a three-level evaluation framework (fidelity and consistency, scalability and robustness, agent utility), and validates it experimentally in five representative environments. Result: Sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning; these gains depend heavily on behavioral coverage and environment complexity. Conclusion: LLMs can serve as effective world models for agent learning, but practical use requires sufficient behavioral coverage and suitable environment complexity. Abstract: Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundaries on when world modeling effectively supports agent learning.

[39] AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus

Sultan Alrashed,Francesco Orabona

Main category: cs.CL

TL;DR: AraMix is a deduplicated Arabic pretraining corpus of about 178 billion tokens, built by combining and cleaning seven existing Arabic datasets.

Details

Motivation: For lower-resource languages such as Arabic, existing pretraining data contains heavy duplication that new web crawls do not resolve, so a more efficient way of reusing existing resources is needed. Method: Combines seven publicly available Arabic web datasets, applies quality filtering designed specifically for Arabic, and performs cross-dataset deduplication at both the MinHash and sentence level. Result: Nearly 60% of tokens across these independently collected corpora turn out to be duplicates; the resulting corpus is the largest heavily filtered, publicly available Arabic pretraining corpus. Conclusion: For lower-resource languages, investing in data curation pipelines yields greater returns than additional web crawls. Abstract: We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.
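
The abstract names MinHash deduplication but gives no parameters, so the following is only a minimal sketch of the general technique, not the AraMix pipeline. The shingle size, the number of hash functions, the 0.8 threshold, and every function name here are assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of word k-grams (shingles) in a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Greedy O(n^2) pass: keep a document only if no kept document is a near-duplicate."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

The pairwise comparison is quadratic and only suits small inputs; at corpus scale, MinHash signatures are normally bucketed with locality-sensitive hashing so that only candidate pairs are compared.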

[40] MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models

Tung Duong Ta,Tim Oates

Main category: cs.CL

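TL;DR: This paper proposes MDToC (Metacognitive Dynamic Tree of Concepts), which builds a concept tree, performs accuracy-verified calculations, and applies majority voting, and which clearly outperforms existing prompting methods on mathematical reasoning tasks.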

Details

Motivation: Despite progress in mathematical reasoning, LLMs still struggle to verify calculations accurately under established prompting techniques. Method: Proposes the three-phase MDToC approach: construct a concept tree, generate accuracy-verified calculations for each concept, and evaluate competing solutions by majority voting. Result: On the CHAMP, MATH, and Game-of-24 benchmarks, GPT-4-Turbo reaches 58.1%, 86.6%, and 85%, outperforming GoT by 5%, 5.4%, and 4%, respectively; across all backbone models MDToC beats ToT and GoT, with gains of up to 7.6% over ToT and 6.2% over GoT. Conclusion: MDToC improves calculation verification in mathematical reasoning without hand-engineered hints, which suggests that metacognitive calculation verification is a direction worth exploring further. Abstract: Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across CHAMP, MATH, and Game-of-24 benchmarks demonstrate our MDToC's effectiveness, with GPT-4-Turbo achieving 58.1% on CHAMP, 86.6% on MATH, and 85% on Game-of-24, outperforming GoT by 5%, 5.4%, and 4% on these tasks, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6% over ToT and 6.2% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.
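
The third phase, majority voting over competing solutions, can be illustrated with a short sketch. The abstract does not say how answers are normalized or how ties are broken, so the whitespace stripping and the first-seen tie rule below are assumptions, and `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none.

    Ties go to the answer seen first, which follows from Counter.most_common
    ordering equal counts by first occurrence.
    """
    normalized = [a.strip() for a in answers if a is not None and a.strip()]
    if not normalized:
        return None
    answer, _count = Counter(normalized).most_common(1)[0]
    return answer
```

For example, `majority_vote(["24", "24 ", "18"])` selects `"24"` because two of the three candidate solutions agree after normalization.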

[41] Toward Human-Centered AI-Assisted Terminology Work

Antonio San Martin

Main category: cs.CL

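TL;DR: This paper examines the rapid spread of generative AI in terminology work and its consequences, and proposes a human-centered AI framework that emphasizes augmenting terminologists' capabilities rather than replacing them.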

Details

Motivation: Unstructured adoption of generative AI may weaken professional autonomy, amplify bias, and erode linguistic and conceptual diversity, so a human-centered approach that protects the core values of terminology work is needed. Method: Building on research in artificial intelligence and translation studies, constructs a human-centered AI framework with three dimensions: the augmented terminologist, ethical AI, and human-centered design. Result: The framework shows that high automation is compatible with strong human control, highlights the central role of terminologists in bias mitigation, and stresses that AI tools and workflows should be designed around the needs, values, and well-being of the terminologist. Conclusion: Current choices in AI adoption affect not only terminological practice but also the long-term preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge. Abstract: The rapid diffusion of generative artificial intelligence is transforming terminology work. While this technology promises gains in efficiency, its unstructured adoption risks weakening professional autonomy, amplifying bias, and eroding linguistic and conceptual diversity. This paper argues that a human-centered approach to artificial intelligence has become a necessity for terminology work. Building on research in artificial intelligence and translation studies, it proposes a human-centered framework that conceptualizes artificial intelligence as a means of amplifying the terminologist's capabilities, rather than replacing them. The framework is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. Together, these dimensions emphasize the compatibility of high automation with strong human control, the central role of terminologists in bias mitigation, and the importance of designing AI tools and workflows around the needs, values, and well-being of the terminologist. The paper concludes by stressing that current choices in AI adoption will shape not only terminological practice, but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.

[42] Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

Ming Li,Han Chen,Yunze Xiao,Jian Chen,Hong Jiao,Tianyi Zhou

Main category: cs.CL

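TL;DR: A large-scale empirical study shows that, despite strong problem-solving ability, large language models are systematically misaligned with human perception when estimating item difficulty: they struggle to simulate students' capability limits and lack introspection.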

Details

Motivation: To address the cold-start problem of item difficulty estimation in educational assessment, and to investigate whether LLMs can perceive the cognitive struggles of human learners. Method: Conducts a large-scale empirical analysis of more than 20 models across domains such as medical knowledge and mathematical reasoning, evaluating how well difficulty judgments align with humans under prompts for different proficiency levels, as well as the models' capacity for introspection. Result: Scaling up model size does not reliably improve alignment with human difficulty perception; models instead converge toward a machine consensus. High-performing models struggle to estimate item difficulty accurately and cannot predict their own limitations. Conclusion: The problem-solving ability of current LLMs does not imply an understanding of human cognitive struggles, which poses a fundamental challenge for automated difficulty prediction. Abstract: Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.

[43] Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

Shaomu Tan,Ryosuke Mitani,Ritvik Choudhary,Qiyu Wu,Toshiyuki Sekiya,Christof Monz

Main category: cs.CL

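TL;DR: This paper proposes Remedy-R, an interpretable machine translation evaluation metric trained with reinforcement learning, which produces step-by-step analyses and can be used to improve translation quality.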

Details

Motivation: Existing automatic MT metrics perform well but lack interpretability and often fail on out-of-distribution inputs. Method: Trains Remedy-R with reinforcement learning from pairwise translation preferences, without error annotations or distillation from closed LLMs; the metric generates step-by-step analyses of accuracy, fluency, and completeness, followed by a final score. Result: With only 60K training pairs, Remedy-R is competitive with top scalar metrics and GPT-4-based judges on WMT22-24, generalizes to other languages, is robust, and generates self-reflective feedback that can be reused for translation improvement; the Remedy-R Agent built on it consistently improves the translation quality of several models. Conclusion: Remedy-R is an efficient, interpretable, and practical MT evaluation method whose reasoning contains information useful for improving translations. Abstract: Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.

[44] FASTRIC: Prompt Specification Language for Verifiable LLM Interactions

Wen-Long Jin

Main category: cs.CL

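TL;DR: This paper proposes FASTRIC, a prompt specification language that makes the finite state machines (FSMs) implicit in natural-language prompts explicit, so that conformance of LLM interaction protocols can be verified through execution trace analysis. It introduces "procedural conformance" as the evaluation metric and shows that the optimal specification formality depends on model capacity (a "Goldilocks zone"), moving multi-turn interaction design from heuristic art toward systematic engineering with measurable procedural guarantees.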

Details

Motivation: LLMs can execute complex multi-turn interaction protocols, but there is no formal specification against which to verify that execution matches designer intent. Existing approaches rely on implicit behavior, which makes consistency and verifiability hard to ensure, so a method is needed that structures design intent explicitly and supports automated verification. Method: Proposes FASTRIC, a natural-language prompt specification language that guides designers to describe seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints). It uses the LLM itself as unified infrastructure (parser, interpreter, runtime environment, and development assistant), verifies conformance by analyzing execution traces, and tests procedural conformance across model scales and specification formality levels. Result: The optimal formality level depends on the model: DeepSeek-V3.2 (685B) achieves perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaks at L3 (0.90) and then collapses to 0.39 at L4; Phi4 (14.7B) shows no stable optimum and high variance (SD=0.16-0.36). Conclusion: The degree of specification formality is a key design parameter for execution conformance, with model-specific "Goldilocks zones". FASTRIC advances prompt specification engineering, making multi-turn interaction design verifiable and systematic and offering a new paradigm for building reliable AI interaction protocols. Abstract: Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts, enabling conformance verification through execution trace analysis. The LLM serves as intelligent execution agent: interpreting designer-encoded FSMs to execute specified behavioral roles. Unlike symbolic specification languages requiring parsers and compilers, FASTRIC leverages LLMs as unified infrastructure: simultaneously parser, interpreter, runtime environment, and development assistant. FASTRIC guides designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) structuring multi-turn interactions. Specification formality, ranging from implicit descriptions that frontier models infer to explicit step-by-step instructions for weaker models, serves as a design parameter. We introduce procedural conformance as verification metric measuring execution adherence to FSM specifications. Testing a 3-state kindergarten tutoring FSM across four formality levels and three model scales (14.7B, 685B, 1T+ parameters) reveals optimal specification formality is a function of model capacity. DeepSeek-V3.2 (685B) achieves perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaks at L3 (0.90) before collapsing at L4 (0.39); Phi4 (14.7B) shows no stable optimum with high variance (SD=0.16-0.36). These findings reveal model-specific formality ranges ("Goldilocks zones") where specifications provide sufficient structure without over-constraint, establishing Prompt Specification Engineering for creating verifiable interaction protocols, transforming multi-turn interaction design from heuristic art to systematic engineering with measurable procedural guarantees.
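
The abstract does not define how procedural conformance is computed. One plausible reading, shown below purely as an assumption, scores an execution trace by the fraction of FSM checks it passes: starting in the initial state, taking only allowed transitions, and ending in a final state. The state names in the usage example are hypothetical and are not the paper's tutoring FSM.

```python
def procedural_conformance(trace, transitions, initial, finals):
    """Fraction of FSM checks that a trace of visited states satisfies.

    trace:       list of state names in the order they were visited
    transitions: dict mapping a state to the set of states allowed next
    initial:     required first state
    finals:      set of acceptable last states
    """
    if not trace:
        return 0.0
    checks = [trace[0] == initial]                      # started in the right place
    for current, nxt in zip(trace, trace[1:]):
        checks.append(nxt in transitions.get(current, set()))  # each step is allowed
    checks.append(trace[-1] in finals)                  # stopped in a final state
    return sum(checks) / len(checks)


# Hypothetical 3-state tutoring FSM used only to exercise the function.
TUTOR_FSM = {"greet": {"quiz"}, "quiz": {"quiz", "wrap_up"}, "wrap_up": set()}
```

With this toy FSM, `procedural_conformance(["greet", "quiz", "quiz", "wrap_up"], TUTOR_FSM, "greet", {"wrap_up"})` is 1.0, while a trace that skips the quiz state passes only two of its three checks.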

[45] Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework

Jinyan Liu,Zikang Chen,Qinchuan Wang,Tan Xie,Heming Zheng,Xudong Lv

Main category: cs.CL

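TL;DR: This paper proposes a modular pipeline for applying large language models to medical follow-up tasks. Compared with an end-to-end approach, it improves dialog stability and information extraction accuracy while substantially reducing the number of dialogue turns and token consumption.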

Details

Motivation: Because follow-up forms are complex, LLMs applied end to end to medical follow-up tasks often lose control of the dialog flow and extract information inaccurately. Method: Designs and compares two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group), the latter built on task decomposition, semantic clustering, and flow management. Result: The modular approach clearly outperforms the end-to-end approach on complex forms, reducing dialogue turns by 46.73% and token consumption by 80% to 87.5% while improving dialog stability and extraction accuracy. Conclusion: Deploying LLMs in high-stakes medical follow-up requires external control mechanisms; a modular architecture is both more effective and more efficient than an end-to-end one. Abstract: When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method, built on task decomposition, semantic clustering, and flow management, substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.

[46] Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

Tongyuan Miao,Gary Huang,Kai Jun Han,Annie Jiang

Main category: cs.CL

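TL;DR: Proposes a training-free, context-aware initialization method that injects priors from a lightweight auxiliary model to speed up inference in diffusion large language models, cutting denoising iterations by about 35%, although accuracy remains a challenge.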

Details

Motivation: Existing diffusion language models need many denoising iterations, which makes inference inefficient; the paper aims to shorten the generative path, and so improve decoding efficiency, by initializing closer to the target distribution. Method: Designs a training-free interface that uses a lightweight auxiliary model to inject prompt-conditioned priors at the discrete-token or representation level, combined with a confidence-based remasking mechanism that mitigates the effect of erroneous priors. Result: Preliminary experiments on GSM8K indicate about 35% fewer function evaluations, but naive warm-starting can lower final accuracy. Conclusion: Context-aware initialization can accelerate diffusion language model decoding, but calibration, revision mechanisms, and representation alignment need further study before warm-started decoding is both reliable and efficient. Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
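
Two of the mechanisms named in the abstract, embedding interpolation and confidence-based remasking, can be sketched in a few lines. The abstract gives no formulas, so the linear blend, the fixed confidence threshold, the `[MASK]` string, and both function names are assumptions for illustration only.

```python
MASK = "[MASK]"  # placeholder mask token; the real model's mask id is not given in the abstract


def interpolate_embedding(mask_embedding, prior_embedding, alpha):
    """Blend the mask embedding with a prior embedding: (1 - alpha) * mask + alpha * prior."""
    return [(1 - alpha) * m + alpha * p for m, p in zip(mask_embedding, prior_embedding)]


def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Re-mask injected tokens whose confidence is below the threshold,
    so that later denoising steps can revise them."""
    return [tok if conf >= threshold else MASK for tok, conf in zip(tokens, confidences)]
```

With `alpha = 0` the initialization is the usual fully masked state, and raising `alpha` moves it toward the auxiliary model's prior; the remasking step is what lets the diffusion model discard prior tokens it does not trust.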

[47] DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

Shijian Ma,Yunqi Huang,Yan Lin

Main category: cs.CL

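TL;DR: DramaBench is the first large-scale benchmark for evaluating drama script continuation. It covers six independent dimensions, combines rule-based and LLM-based labeling, and provides actionable feedback.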

Details

Motivation: 现有基准无法全面评估戏剧脚本续写中的角色一致性、情节连贯性和戏剧结构保持能力。 Method: 构建DramaBench框架，包含六个评估维度，结合规则分析、LLM标注与统计指标，并对8个主流语言模型在1,103个剧本上进行综合评估。 Result: 共完成8,824次评估，252次成对比较中65.9%具有统计显著性，人工验证显示三个维度上有较高一致性，消融研究表明六个维度相互独立（平均|r|=0.020）。 Conclusion: DramaBench为戏剧脚本续写提供了客观、可复现的多维评估标准，并为创意写作模型的改进提供具体指导。 Abstract: Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structurecapabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rulebased analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean | r | = 0.020). DramaBench provides actionable, dimensionspecific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.

[48] A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs

Ziyan Zhang,Chao Wang,Zhuo Chen,Lei Chen,Chiyi Li,Kai Song

Main category: cs.CL

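TL;DR: ROG is a framework that combines query-aware knowledge graph neighborhood retrieval with large language model (LLM) reasoning to answer complex first-order logic queries over knowledge graphs.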

Details

Motivation: Existing embedding-based methods generalize poorly to complex logical queries and struggle with multiple operators, deep reasoning chains, or heterogeneous KG schemas; the incompleteness of real-world knowledge graphs makes reasoning harder still. Method: ROG decomposes a complex first-order logic query into a sequence of simpler sub-queries, retrieves compact query-relevant subgraphs as contextual evidence, and uses an LLM for step-by-step chain-of-thought reasoning, avoiding task-specific embedding optimization. Result: Experiments on standard KG reasoning benchmarks show that ROG consistently outperforms strong embedding-based baselines in mean reciprocal rank (MRR), especially on high-complexity query types. Conclusion: Combining structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks. Abstract: Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.
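
Since results are reported in mean reciprocal rank, a small sketch of the metric may help. This is the simplified form that scores the first correct answer per query; KG query-answering benchmarks often use a filtered, per-answer variant, and the abstract does not say which one ROG reports. The function name and input layout are assumptions.

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Average of 1 / rank of the first correct answer for each query.

    ranked_lists: one ranked list of candidate entities per query
    gold_answers: one set of correct entities per query
    A query with no correct candidate contributes 0.
    """
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_answers):
        for rank, entity in enumerate(candidates, start=1):
            if entity in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

For two queries whose first correct answers sit at ranks 1 and 3, the score is (1 + 1/3) / 2 = 2/3.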

[49] Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

Amar Lakel

Main category: cs.CL

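TL;DR: This paper proposes reconceptualizing "Large Language Models" (LLM) as "Large Discourse Models" (LDM) and "Artificial Discursive Agents" (ADA). Building on an ontological triad of the phenomenal world, embodied cognition, and socio-historical linguistic structure, it argues for an epistemological reframing of generative models and calls for public deliberation and multi-stakeholder co-regulation to define and govern their place in society.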

Details

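Motivation: The author seeks to move beyond a narrowly technical understanding of large language models (such as the opposition between "capability" and "risk"), to show at the level of philosophy and social theory that they are discursive projections of human experience, and to promote a deeper understanding of their social role and their democratic governance. Method: Uses an interdisciplinary theoretical framework drawing on phenomenology, cognitive science, and critical discourse analysis to build an ontological triad of phenomenal regularities, embodied cognition, and structural-linguistic sedimentation, and applies it to analyze how LDMs reproduce human discursive practice by learning from text data. Result: Lays out the conceptual progression from LLM to LDM to ADA, arguing that generative models are not purely linguistic systems but model the discursive projection of human experience, and proposes replacing emotional reactions with public trials and procedures under a co-regulation scheme involving the State, industry, civil society, and academia. Conclusion: Treating large language models as "artificial discursive agents" gives a more accurate account of their social nature and impact, and supports a transparent, accountable, and democratic governance framework that makes their place, uses, and limits in social space intelligible and negotiable. Abstract: This paper proposes an epistemological shift in the analysis of large generative models, replacing the category "Large Language Models" (LLM) with that of "Large Discourse Models" (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the "fascination/fear" dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.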

[50] SAP: Syntactic Attention Pruning for Transformer-based Language Models

Tzu-Yun Lee,Ding-Yong Hong,Jan-Jan Wu

Main category: cs.CL

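TL;DR: This paper proposes Syntactic Attention Pruning (SAP), a method that combines syntactic structure with attention patterns to prune attention heads in Transformer models, and adds a Candidate Filtering (CF) mechanism for robustness; it outperforms existing methods without retraining.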

Details

Motivation: 传统的注意力头剪枝方法仅依赖权重和激活的数学分析，缺乏对语言结构的利用，导致剪枝效果受限且模型可解释性差。 Method: SAP结合句子的句法结构与注意力模式指导剪枝过程；引入Candidate Filtering机制，根据注意力头对模型性能的贡献进行优先级筛选，减少性能下降。 Result: 实验表明SAP能有效保留高密度强注意力值的关键注意力头，在无需重训练的设置下优于现有的剪枝策略。 Conclusion: SAP为Transformer模型压缩提供了新方向，兼具高性能、高可解释性与广泛的适用性，有望成为通用的注意力头剪枝框架。 Abstract: This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads of a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.

[51] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

Zihan Lin,Xiaohan Wang,Hexiong Yang,Jiajun Chai,Jie Cao,Guojun Yin,Wei Lin,Ran He

Main category: cs.CL

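TL;DR: Proposes AWPO, a new reinforcement learning framework that adds explicit reasoning rewards to strengthen the tool-use ability of large language models, achieving state-of-the-art performance on standard benchmarks.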

Details

Motivation: 现有方法忽视了显式推理奖励对提升推理和工具使用能力的潜力，且直接结合推理与结果奖励可能导致次优性能或优化冲突。 Method: 提出了优势加权策略优化（AWPO），采用方差感知门控和难度感知加权来自适应调节基于群体相对统计的推理信号优势，并设计了定制的裁剪机制以实现稳定优化。 Result: 大量实验表明，AWPO在标准工具使用基准上显著优于强基线和领先的闭源模型，在多轮场景中表现尤为突出；4B参数模型在多轮准确率上超过Grok-4达16.0%，并在MMLU-Pro基准上保持良好的泛化能力。 Conclusion: AWPO有效整合显式推理奖励，提升了工具使用能力和参数效率，是当前最先进的强化学习框架之一。 Abstract: While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, natively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.

[52] QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

Dehai Min,Kailin Zhang,Tongtong Wu,Lu Cheng

Main category: cs.CL

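TL;DR: This paper proposes QuCo-RAG, a dynamic retrieval-augmented generation method based on pre-training corpus statistics. It detects hallucination risk by identifying low-frequency entities and verifying entity co-occurrence, and improves accuracy over existing methods across several models and tasks.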

Details

Motivation: 现有动态RAG方法依赖模型内部信号（如logits、熵）判断检索时机，但这些信号不可靠，因LLM常对错误输出表现出高置信度，导致幻觉问题难以有效缓解。 Method: 提出QuCo-RAG，分两阶段量化不确定性：(1) 生成前识别低频实体以发现长尾知识缺口；(2) 生成中验证实体在预训练语料中的共现情况，零共现提示高幻觉风险。两阶段均利用Infini-gram对4万亿token进行毫秒级查询，在不确定性高时触发检索。 Result: 在多跳问答任务上，使用OLMo-2模型比现有SOTA基线EM指标提升5–12点；迁移到未公开训练数据的模型（Llama、Qwen、GPT）时EM最高提升14点；在生物医学问答任务上也表现出良好泛化性。 Conclusion: 基于语料库的验证是一种有原则且实用的、与模型无关的动态RAG范式，能有效缓解LLM幻觉问题。 Abstract: Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.

[53] From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs

Alessandro Lucca,Francesco Pierri

Main category: cs.CL

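TL;DR: This paper evaluates four state-of-the-art automatic speech recognition (ASR) models on subtitling long-form Italian video. Current models cannot yet replace human subtitlers but can raise their productivity considerably, so the authors propose a production-grade, cloud-based infrastructure built around a human-in-the-loop (HITL) workflow.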

Details

Motivation: 现代ASR系统在标准数据集上表现良好，但在真实非英语生产环境（如意大利语长视频）中的性能尚不明确，亟需评估其实际可用性。 Method: 选取Whisper Large v2、AssemblyAI Universal、Parakeet TDT v3 0.6b和WhisperX四种先进ASR模型，在50小时的意大利电视节目数据集上进行评测，并与专业人工字幕进行对比。 Result: 实验表明，现有ASR模型虽未达到媒体行业对全自动字幕的精度要求，但能显著提高人工字幕员的生产力。 Conclusion: 必须采用“人在回路”（HITL）的工作模式，结合ASR效率与人工校对精度，并为此设计了支持该流程的生产级云基础设施。 Abstract: Subtitles are essential for video accessibility and audience engagement. Modern Automatic Speech Recognition (ASR) systems, built upon Encoder-Decoder neural network architectures and trained on massive amounts of data, have progressively reduced transcription errors on standard benchmark datasets. However, their performance in real-world production environments, particularly for non-English content like long-form Italian videos, remains largely unexplored. This paper presents a case study on developing a professional subtitling system for an Italian media company. To inform our system design, we evaluated four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs. The study highlights their strengths and limitations, benchmarking their performance against the work of professional human subtitlers. The findings indicate that, while current models cannot meet the media industry's accuracy needs for full autonomy, they can serve as highly effective tools for enhancing human productivity. We conclude that a human-in-the-loop (HITL) approach is crucial and present the production-grade, cloud-based infrastructure we designed to support this workflow.

[54] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation

Bingyang Kelvin Liu,Ziyu Patrick Chen

Main category: cs.CL

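TL;DR: Proposes JEPA-Reasoner, a new JEPA model with generative ability that decouples reasoning from token generation in latent space, improving robustness and potentially laying the groundwork for multi-threaded reasoning.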

Details

Motivation: JEPA lacks generative ability, and existing reasoning methods based on token-by-token generation accumulate compounding error and depend on context to gain reasoning insights. Method: Proposes JEPA-Reasoner, which adds generative ability to JEPA, and introduces a separate action-taker model, Talker, to produce human-readable sentences, decoupling latent-space reasoning from generation. Result: Achieves decoupled reasoning and generation in latent space, produces mixed latent vectors that may support multi-threaded reasoning, and shows stronger robustness to compounding error in autoregressive generation. Conclusion: By decoupling latent-space reasoning from token generation, JEPA-Reasoner improves generation robustness and provides a basis for multi-threaded reasoning. Abstract: While Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, latent space reasoning attempts for Transformer models like COCONUT do improve performance, but they ultimately rely on token-by-token generation, which still accumulates compounding error and relies on context information to gain reasoning insights. To address these limitations, we propose JEPA-Reasoner, a novel JEPA model enhanced with generative ability that reasons in latent space. We augment it with a separate action-taker model, Talker, to produce human-readable sentences. Our approach demonstrates that decoupling latent space reasoning and token generation enables JEPA-Reasoner to produce mixed latent vectors that might lay the foundation for multi-threaded reasoning, while performing autoregressive generation with superior robustness to compounding error.

[55] CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation

Dazhen Deng,Sen Yang,Yuchen He,Yuan Tian,Yingcai Wu

Main category: cs.CL

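TL;DR: CycleChart is a consistency-based learning framework for the bidirectional tasks of chart understanding and generation; it improves cross-task generalization by learning shared semantics.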

Details

Motivation: Existing chart tasks are usually studied in isolation without learning shared semantics, which limits model generality. Method: Proposes the CycleChart framework, which adopts a schema-centric formulation, builds a consistent multi-task dataset with aligned annotations, and introduces a generate-parse consistency objective for bidirectional semantic alignment. Result: CycleChart achieves strong results on chart generation, parsing, and question answering and shows stronger cross-task generalization. Conclusion: CycleChart advances more general chart understanding models and validates the effectiveness of cross-directional semantic learning. Abstract: Current chart-specific tasks, such as chart question answering, chart parsing, and chart generation, are typically studied in isolation, preventing models from learning the shared semantics that link chart generation and interpretation. We introduce CycleChart, a consistency-based learning framework for bidirectional chart understanding and generation. CycleChart adopts a schema-centric formulation as a common interface across tasks. We construct a consistent multi-task dataset, where each chart sample includes aligned annotations for schema prediction, data parsing, and question answering. To learn cross-directional chart semantics, CycleChart introduces a generate-parse consistency objective: the model generates a chart schema from a table and a textual query, then learns to recover the schema and data from the generated chart, enforcing semantic alignment across directions. CycleChart achieves strong results on chart generation, chart parsing, and chart question answering, demonstrating improved cross-task generalization and marking a step toward more general chart understanding models.

[56] Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation

Anna-Maria Gueorguieva,Aylin Caliskan

Main category: cs.CL

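TL;DR: This study examines the social bias of large language models (LLMs) against non-protected stigmatized groups. Stigmas that humans rate as more "perilous" (such as being a gang member or having HIV) elicit the most bias, while sociodemographic stigmas (such as being Asian-American or elderly) elicit less. Guardrail models partly reduce bias but do not change the key factors that drive it and often fail to recognize biased intent, indicating that guardrail mechanisms need further improvement.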

Details

Motivation: Although LLMs are known to exhibit social bias, bias against non-protected stigmatized identities is understudied, and it is unclear which social features of stigmas are associated with bias. This study investigates whether the six social features of stigma defined in psychology (such as peril and concealability) affect bias in LLM outputs. Method: Using the SocialStigmaQA benchmark, evaluates bias in three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) across 93 stigmatized groups and 37 social scenarios; combines human and LLM ratings of the six stigma features to analyze their effect on bias; and tests how well each model's guardrail model (e.g., Llama Guard 3.0) filters biased outputs. Result: Stigmas rated by humans as highly perilous yield the most biased outputs (60%), while sociodemographic stigmas yield the fewest (11%); with guardrail models, bias decreases by 10.4%, 1.4%, and 7.8%, respectively, but the key bias-related features are unchanged and the guardrail models often fail to recognize biased intent. Conclusion: Peril is the main social feature of stigma that drives LLM bias. Current guardrail models reduce bias only partly and leave its underlying drivers unchanged; the study calls for improved guardrail models that identify and mitigate bias against stigmatized groups more effectively. Abstract: Large language models (LLMs) have been shown to exhibit social bias; however, bias towards non-protected stigmatized identities remains understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have an effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example, deciding whether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g., Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). 
We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.

[57] ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models

Mingxu Zhang,Dazhong Shen,Qi Zhang,Ying Sun

Main category: cs.CL

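TL;DR: This paper proposes ChemATP, a framework that builds the first atom-level textual knowledge base so that chemical priors can be explicitly decoupled from, and dynamically injected into, a frozen large language model, combining precise reasoning in molecular science with the model's generality.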

Details

Motivation: LLMs perform poorly in molecular science because they lack explicit chemical priors, and existing methods face a trade-off between the flexibility of knowledge updates and the granularity of reasoning. Method: Proposes the ChemATP framework, which builds an atom-level textual knowledge base so that a frozen LLM can dynamically retrieve and reason over fine-grained chemical knowledge, giving a training-free and interpretable method. Result: Experiments show that ChemATP significantly outperforms existing training-free baselines and rivals state-of-the-art training-based models. Conclusion: Explicitly injecting chemical priors is a viable and competitive alternative that improves performance in molecular science without harming the model's general reasoning ability. Abstract: Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model's general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM's intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.

[58] Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics

Do Minh Duc,Quan Xuan Truong,Nguyen Tat Dat,Nguyen Van Vinh

Main category: cs.CL

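TL;DR: This paper proposes a new prompt optimization pipeline for frame detection in logistics texts that combines retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT).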

Details

Motivation: To improve LLM performance on complex reasoning and labeling tasks without extensive fine-tuning. Method: Uses an LLM-based prompt optimizer agent that iteratively refines prompts using retrieved examples, performance feedback, and self-evaluation. Result: On a real-world logistics text annotation task, the optimized prompts improve inference accuracy by up to 15%, with consistent improvements across multiple LLMs. Conclusion: Structured prompt optimization is a viable alternative to full fine-tuning for domain-specific NLP applications such as logistics. Abstract: Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame detection in logistics texts, combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate highly effective task-specific prompts. Central to our approach is an LLM-based prompt optimizer agent that iteratively refines the prompts using retrieved examples, performance feedback, and internal self-evaluation. Our framework is evaluated on a real-world logistics text annotation task, where reasoning accuracy and labeling efficiency are critical. Experimental results show that the optimized prompts - particularly those enhanced via Auto-CoT and RAG - improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system demonstrates consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B), validating its generalizability and practical value. These findings suggest that structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.

[59] CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs

Javier Vela-Tambo,Jorge Gracia,Fernando Dominguez-Castro

Main category: cs.CL

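TL;DR: This paper presents CienaLLM, a schema-guided generative information extraction framework for zero-shot extraction of the socio-economic impacts of climate hazards from news articles. It supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference, and a large factorial study evaluates how LLM family, size, precision, and prompting strategy affect performance.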

Details

Motivation: Understanding and monitoring the socio-economic impacts of climate hazards requires extracting structured information from heterogeneous news articles at scale. Method: Develops CienaLLM, a modular framework based on schema-guided generative information extraction that uses open-weight LLMs for zero-shot extraction and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. A large factorial experiment covers models, precisions, and prompt engineering techniques. Result: An additional response parsing step nearly eliminates format errors while preserving accuracy; larger models give the strongest and most stable performance; quantization brings substantial efficiency gains with a small accuracy loss; and the effect of prompting strategies varies by model. CienaLLM matches or exceeds the supervised baseline in accuracy when extracting drought impacts from Spanish news, although at a higher inference cost. Conclusion: CienaLLM can be adapted to related tasks (such as other hazards, sectors, or languages) by editing prompts and schemas rather than retraining; the authors release code, configurations, and schemas to support reproducible use. Abstract: Understanding and monitoring the socio-economic impacts of climate hazards requires extracting structured information from heterogeneous news articles on a large scale. To that end, we have developed CienaLLM, a modular framework based on schema-guided Generative Information Extraction. CienaLLM uses open-weight Large Language Models for zero-shot information extraction from news articles, and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. To systematically assess how the choice of LLM family, size, precision regime, and prompting strategy affects performance, we run a large factorial study across models, precisions, and prompt engineering techniques. An additional response parsing step nearly eliminates format errors while preserving accuracy; larger models deliver the strongest and most stable performance, while quantization offers substantial efficiency gains with modest accuracy trade-offs; and prompt strategies show heterogeneous, model-specific effects. CienaLLM matches or outperforms the supervised baseline in accuracy for extracting drought impacts from Spanish news, although at a higher inference cost. While evaluated on droughts, the schema-driven and model-agnostic design is suitable for adapting to related information extraction tasks (e.g., other hazards, sectors, or languages) by editing prompts and schemas rather than retraining. We release code, configurations, and schemas to support reproducible use.

[60] HATS: High-Accuracy Triple-Set Watermarking for Large Language Models

Zhiqing Hu,Chenxu Zhao,Jiazhong Lu,Xiaolei Liu

Main category: cs.CL

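TL;DR: Proposes a watermarking method for LLM-generated text based on a three-way vocabulary partition. Detection combines Green-enrichment and Red-depletion statistics, aggregating their p-values with Fisher's method, and achieves high detection accuracy on Llama 2 7B while preserving readability.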

Details

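Motivation: Curbing misuse of LLM-generated text requires reliably distinguishing machine-generated from human-written content, and existing watermarking techniques leave room to improve the balance between detection reliability and text quality. Method: At each decoding step, the vocabulary is partitioned into Green, Yellow, and Red sets with fixed ratios, and sampling is restricted to the Green and Yellow sets. At detection time, the partitions are replayed, one-sided z-scores are computed for Green enrichment and Red depletion, and their p-values are combined with Fisher's method to decide whether a passage is watermarked. Result: The scheme, implemented on Llama 2 7B, shows a high true-positive rate and a low false-positive rate, with high detection accuracy and good text quality. Conclusion: The triple-partition strategy provides robust and reliable detection while keeping generated text readable, making it suitable for tracing LLM-generated content in practice. Abstract: Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher's method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.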
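
The detection step described in this entry (two one-sided z-scores combined with Fisher's method) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the binomial null model, the partition rates `green_rate=0.5` and `red_rate=0.25`, and the threshold `alpha` are hypothetical, since the abstract only says the ratios are fixed. Fisher's method also assumes independent p-values, which the two statistics may only approximately satisfy.

```python
import math

def one_sided_p(z: float) -> float:
    """Upper-tail p-value of a standard normal z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def binomial_z(observed: int, n: int, rate: float) -> float:
    """z-score of an observed count against a Binomial(n, rate) null."""
    return (observed - n * rate) / math.sqrt(n * rate * (1 - rate))

def fisher_combined_p(p_values) -> float:
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k dof under the null."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in p_values)
    half = stat / 2.0
    # Closed-form chi-square survival function for an even number (2k) of dof.
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

def detect(green: int, red: int, n: int,
           green_rate: float = 0.5, red_rate: float = 0.25, alpha: float = 1e-3):
    """Return (combined p-value, watermarked?) for n scored tokens."""
    z_green = binomial_z(green, n, green_rate)   # enrichment: more Green than chance
    z_red = -binomial_z(red, n, red_rate)        # depletion: fewer Red than chance
    p = fisher_combined_p([one_sided_p(z_green), one_sided_p(z_red)])
    return p, p < alpha

print(detect(green=100, red=50, n=200))  # counts at chance level
print(detect(green=150, red=0, n=200))   # Green-heavy text with no Red tokens
```

Combining the two tails lets evidence from either statistic contribute: a passage with only mild Green enrichment can still be flagged if Red tokens are almost entirely absent.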

[61] Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara

Yacouba Diarra,Panga Azazia Kamate,Nouhoum Souleymane Coulibaly,Michael Leventhal

Main category: cs.CL

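TL;DR: Kunkado is a 160-hour Bambara ASR dataset drawn from Malian radio archives that captures spontaneous speech in real-world conditions, including code-switching, disfluencies, and background noise. Fine-tuning Parakeet models on it substantially lowers word error rate and outperforms a system trained on cleaner but less realistic data.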

Details

Motivation: Improving practical ASR in real-world conditions, especially for predominantly oral languages, requires high-quality datasets that capture the characteristics of spontaneous speech. Method: Builds Kunkado, a 160-hour Bambara ASR dataset, fine-tunes Parakeet models on a 33.47-hour human-reviewed subset, and applies pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Result: On two real-world test sets, fine-tuning reduces WER from 44.47% to 37.12% and from 36.07% to 32.33%; in human evaluation, the model outperforms a comparable system trained on 98 hours of cleaner but less realistic data. Conclusion: The Kunkado dataset improves the robustness of ASR systems in realistic conditions, particularly for predominantly oral languages; the authors release the data and models to support further research. Abstract: We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We fine-tune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
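
To put the reported WER figures in context, the sketch below computes word error rate with a standard word-level edit distance and converts the abstract's before/after numbers into relative reductions. The `wer` and `relative_reduction` helpers are illustrative names, the function assumes a non-empty reference, and no transcript normalization from the paper is reproduced.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(before: float, after: float) -> float:
    """Relative improvement in percent between two error rates."""
    return 100.0 * (before - after) / before

# WER values quoted in the abstract for the two test sets.
for before, after in [(44.47, 37.12), (36.07, 32.33)]:
    print(f"{before} -> {after}: {relative_reduction(before, after):.1f}% relative")
```

On the abstract's numbers this works out to roughly a 16.5% and a 10.4% relative WER reduction on the two test sets.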

[62] CodeSimpleQA: Scaling Factuality in Code Large Language Models

Jian Yang,Wei Zhang,Yizhi Li,Shawn Guo,Haowen Wang,Aishan Liu,Ge Zhang,Zili Wang,Zhoujun Li,Xianglong Liu,Weifeng Lv

Main category: cs.CL

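TL;DR: This paper presents CodeSimpleQA, a bilingual benchmark for evaluating the factual accuracy of code LLMs, and substantially improves code factuality through large-scale instruction tuning and a reinforcement learning framework.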

Details

Motivation: Existing code benchmarks focus on execution correctness and overlook the factual accuracy of programming knowledge, so a new benchmark is needed to evaluate factual accuracy about programming concepts and technical implementations. Method: Builds CodeSimpleQA, a bilingual benchmark of English and Chinese question-answer pairs covering diverse programming languages and computer science domains; also builds the large-scale instruction corpus CodeSimpleQA-Instruct and applies a post-training framework that combines supervised fine-tuning with reinforcement learning. Result: Evaluation of diverse LLMs shows that even frontier models struggle with code factuality; the proposed training framework yields substantial gains in factual accuracy over the base model. Conclusion: Ensuring that code LLMs give factually accurate answers is critical, and alignment based on instruction tuning and reinforcement learning effectively improves their reliability on programming knowledge. Abstract: Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.

[63] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments

Quyu Kong,Xu Zhang,Zhenyu Yang,Nolan Gao,Chen Liu,Panrong Tong,Chenglin Cai,Hanzhang Zhou,Jianan Zhang,Liangyu Chen,Zhidan Liu,Steven Hoi,Yue Wang

Main category: cs.CL

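TL;DR: This paper introduces MobileWorld, a mobile-use benchmark that is more challenging than AndroidWorld and better reflects real-world usage. It contains 201 tasks with new task types such as cross-application interaction, agent-user interaction, and MCP-augmented tasks. Current agents show a sharp performance drop on it, exposing weaknesses in user interaction and external tool calls.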

Details

Motivation: The AndroidWorld benchmark is approaching saturation and does not cover realistic mobile usage, such as vague instructions, cross-application operations, and key application categories (e.g., e-commerce and enterprise communication), so a more challenging and realistic benchmark is needed. Method: Designs MobileWorld, a benchmark of 201 tasks across 20 applications that emphasizes long-horizon, multi-application interaction and introduces agent-user interaction and MCP-augmented tasks; provides a snapshot-based container environment and precise functional verification (such as database inspection and callback APIs); and develops a planner-executor agent framework with an extended action space. Result: MobileWorld tasks take 27.8 steps on average (14.3 for AndroidWorld), and 62.2% of tasks involve multiple applications (9.5% for AndroidWorld); the best agentic framework and end-to-end model reach success rates of 51.7% and 20.9%, respectively, well below their AndroidWorld performance, indicating a harder benchmark. Conclusion: As a more complex and realistic benchmark, MobileWorld exposes serious weaknesses of current mobile agents in complex interaction and external tool calls, and offers a roadmap for next-generation mobile intelligence. Abstract: Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. 
We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.

[64] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation

Thittipat Pairatsuppawat,Abhibhu Tachaapornchai,Paweekorn Kusolsomboon,Chutikan Chaiwong,Thodsaporn Chay-intr,Kobkrit Viriyayudhakorn,Nongnuch Ketui,Aslan B. Wong

Main category: cs.CL

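TL;DR: This paper presents SiamGPT-32B, an open-weights Thai large language model based on Qwen3-32B and fine-tuned with a "Quality-First" strategy, which substantially improves generation stability under complex instructions and multi-turn dialogue performance.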

Details

Motivation: Despite strong English performance, existing open-weights LLMs generate unstably under complex Thai instructions and are hard to deploy, so a model optimized for Thai is needed. Method: Starting from Qwen3-32B, applies supervised fine-tuning on curated high-quality data that combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework, without continual pretraining or corpus expansion. Result: On the SEA-HELM benchmark, SiamGPT-32B achieves the best performance among open-weights Thai models of similar scale, with clear gains in instruction following, multi-turn dialogue, and natural language understanding. Conclusion: Quality-first fine-tuning lets SiamGPT-32B address the generation problems of Thai LLMs on complex tasks and offers a practical path to deploying models for low-resource languages. Abstract: Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, we present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

[65] Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations

Jinwei Chi,Ke Wang,Yu Chen,Xuanye Lin,Qiang Xu

Main category: cs.CL

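TL;DR: This paper examines the discriminative power of intermediate-layer activations of large language models (LLMs) for cross-prompt essay scoring. The activations not only assess essay quality effectively but also adjust the evaluation perspective to different prompts and essay traits, accommodating the diversity of scoring criteria.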

Details

Motivation: Cross-prompt automated essay scoring (AES) is challenging because scoring criteria differ across prompts. Prior work has focused on LLM outputs and overlooked the information that intermediate-layer activations may carry. Method: Fits probes on intermediate-layer activations to evaluate their discriminative power in cross-prompt essay scoring and analyzes how different models and input content affect it. The directions of essays along each trait dimension under different prompts are then computed to analyze how the LLM's evaluation perspective varies. Result: LLM activations have strong discriminative power for essay quality, and LLMs adapt their evaluation perspective to different traits and essay types, handling the diversity of scoring criteria in cross-prompt settings. Conclusion: Intermediate-layer activations are an important source of information for improving cross-prompt AES and reveal the flexibility and potential of LLMs in multi-dimensional essay evaluation. Abstract: Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs' activations in the cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed the effects of different models and input content of LLMs on this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed the variation in evaluation perspectives of large language models concerning essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.

[66] A Large-Language-Model Framework for Automated Humanitarian Situation Reporting

Ivan Decostanzi,Yelena Mejova,Kyriaki Kalimeri

Main category: cs.CL

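TL;DR: Proposes an LLM-based automated framework for generating structured, verifiable humanitarian situation reports. It covers semantic clustering, automatic question generation, retrieval-augmented answering, and multi-level summarization, and is validated on real-world data.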

Details

Motivation: Humanitarian situation reporting currently depends on manual work that is slow, resource intensive, and inconsistent, so automation is needed to make decisions more timely and accurate. Method: Builds an end-to-end automated framework that combines semantic text clustering, automatic question generation, retrieval-augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert judgment. Result: Tested on more than 1,100 documents from 13 humanitarian events, generated questions reach 84.7% relevance, 84.0% importance, and 76.4% urgency; answers reach 86.3% relevance, with citation precision and recall both above 76%; agreement between human and LLM-based evaluation exceeds an F1 of 0.80. Conclusion: The framework automatically produces clearly structured, interpretable, and actionable reports that improve on existing baselines, showing that generative AI can support trustworthy and practical humanitarian decision-making. Abstract: Timely and accurate situational reports are essential for humanitarian decision-making, yet current workflows remain largely manual, resource intensive, and inconsistent. We present a fully automated framework that uses large language models (LLMs) to transform heterogeneous humanitarian documents into structured and evidence-grounded reports. The system integrates semantic text clustering, automatic question generation, retrieval augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning. We evaluated the framework across 13 humanitarian events, including natural disasters and conflicts, using more than 1,100 documents from verified sources such as ReliefWeb. The generated questions achieved 84.7 percent relevance, 84.0 percent importance, and 76.4 percent urgency. The extracted answers reached 86.3 percent relevance, with citation precision and recall both exceeding 76 percent. Agreement between human and LLM based evaluations surpassed an F1 score of 0.80. Comparative analysis shows that the proposed framework produces reports that are more structured, interpretable, and actionable than existing baselines. By combining LLM reasoning with transparent citation linking and multi-level evaluation, this study demonstrates that generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports.
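
Citation precision and recall, as reported in this entry, are commonly computed as set overlap between the sources an answer cites and the sources judged to support it. The sketch below shows that generic definition; the abstract does not spell out the paper's exact formulation, so the function name and the document IDs are hypothetical.

```python
def citation_precision_recall(cited, gold):
    """Set-based precision and recall of cited source IDs against gold supporting IDs."""
    cited, gold = set(cited), set(gold)
    hits = len(cited & gold)
    precision = hits / len(cited) if cited else 0.0  # share of citations that are correct
    recall = hits / len(gold) if gold else 0.0       # share of supporting sources that were cited
    return precision, recall

# Hypothetical answer citing four documents, three of which are in the gold set.
print(citation_precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d5"]))
```

Reporting both values matters here: high precision alone could come from citing very little, while high recall alone could come from citing everything retrieved.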

[67] Event Extraction in Large Language Model

Bobo Li,Xudong Han,Jiang Liu,Yuzhe Ding,Liqiang Jing,Zhaoqi Zhang,Jinheng Li,Xinya Du,Fei Li,Meishan Zhang,Min Zhang,Aixin Sun,Philip S. Yu,Hao Fei

Main category: cs.CL

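TL;DR: This survey reviews the role of event extraction (EE) in the era of large language models (LLMs). It proposes treating EE as a cognitive scaffold for LLM systems to address hallucination, long-range dependencies, and knowledge management, and outlines its evolution toward a reliable, updatable perception and memory layer.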

Details

Motivation: LLM-based event extraction suffers from hallucinations, fragile temporal and causal linking, and knowledge management difficulties caused by a limited context window, and therefore needs more reliable structural support. Method: Frames event extraction as a cognitive scaffold within LLM systems, using event schemas, slot constraints, event links, and event stores to provide grounding and verification, stepwise reasoning, relation-aware retrieval, and long-term memory management. Result: Systematically organizes task taxonomy, method evolution, architectures, datasets, and evaluation for event extraction in text and multimodal settings, covering cross-lingual, low-resource, and domain-specific scenarios. Conclusion: Event extraction should evolve from static information extraction into an LLM-oriented, structurally reliable, and continuously updatable perception and memory component that supports open-world intelligent systems. Abstract: Large language models (LLMs) and multimodal LLMs are changing event extraction (EE): prompting and generation can often produce structured outputs in zero shot or few shot settings. Yet LLM based pipelines face deployment gaps, including hallucinations under weak constraints, fragile temporal and causal linking over long contexts and across documents, and limited long horizon knowledge management within a bounded context window. We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM centered solutions. Event schemas and slot constraints create interfaces for grounding and verification; event centric structures act as controlled intermediate representations for stepwise reasoning; event links support relation aware retrieval with graph based RAG; and event stores offer updatable episodic and agent memory beyond the context window. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule based and neural models to instruction driven and generative frameworks, and summarizing formulations, decoding strategies, architectures, representations, datasets, and evaluation. We also review cross lingual, low resource, and domain specific settings, and highlight open challenges and future directions for reliable event centric systems. Finally, we outline open challenges and future directions that are central to the LLM era, aiming to evolve EE from static extraction into a structurally reliable, agent ready perception and memory layer for open world systems.

[68] Algerian Dialect

Zakaria Benmounah,Abdennour Boulesnane

Main category: cs.CL

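TL;DR: This paper introduces Algerian Dialect, a large-scale sentiment-annotated dataset of 45,000 YouTube comments written in Algerian Arabic dialect, intended to ease the scarcity of resources for this dialect.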

Details

Motivation: Because publicly available NLP resources for Algerian dialect are scarce, the paper aims to build a large-scale, high-quality sentiment-annotated dataset to support research in this area. Method: Collects comments from more than 30 Algerian media channels through the YouTube Data API and manually annotates each comment with one of five sentiment labels, while recording rich metadata. Result: Produces a dataset of 45,000 comments, each with a sentiment label and metadata, publicly released on Mendeley Data. Conclusion: The dataset fills a resource gap for Algerian dialect in sentiment analysis and social media analytics and helps advance dialectal Arabic NLP. Abstract: We present Algerian Dialect, a large-scale sentiment-annotated dataset consisting of 45,000 YouTube comments written in Algerian Arabic dialect. The comments were collected from more than 30 Algerian press and media channels using the YouTube Data API. Each comment is manually annotated into one of five sentiment categories: very negative, negative, neutral, positive, and very positive. In addition to sentiment labels, the dataset includes rich metadata such as collection timestamps, like counts, video URLs, and annotation dates. This dataset addresses the scarcity of publicly available resources for Algerian dialect and aims to support research in sentiment analysis, dialectal Arabic NLP, and social media analytics. The dataset is publicly available on Mendeley Data under a CC BY 4.0 license at https://doi.org/10.17632/zzwg3nnhsz.2.

[69] Increasing the Thinking Budget is Not All You Need

Ignacio Iacobacci,Zhaozhi Qian,Faroq AL-Tam,Muhammad AL-Qurishi,Riad Souissi

Main category: cs.CL

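TL;DR: This paper systematically studies how the "thinking budget" (the amount of compute spent on the reasoning process) affects LLM performance. It finds that increasing the thinking budget is not the most effective use of compute, and that alternative configurations such as self-consistency and self-reflection improve accuracy more efficiently.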

Details

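Motivation: With the rise of reasoning-capable LLMs, early studies have begun to examine how reasoning length (the thinking budget) affects performance, but a systematic analysis and a comparison of trade-offs with other strategies are lacking. Method: Proposes a systematic framework that evaluates how the thinking budget interacts with configurations such as self-consistency and reflection, and compares them in a way that balances performance against computational cost. Result: Simply increasing the thinking budget yields limited gains; strategies such as self-consistency and self-reflection produce more accurate results at the same or lower computational cost. Conclusion: The thinking budget should be spent through better strategies rather than simple expansion; self-consistency and reflection use compute more efficiently, which offers guidance for designing model reasoning. Abstract: Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute in terms of the length of the reasoning process, the so-called thinking budget, impacts model performance. In this work, we propose a systematic investigation of the thinking budget as a key parameter, examining its interaction with various configurations such as self-consistency, reflection, and others. Our goal is to provide an informative, balanced comparison framework that considers both performance outcomes and computational cost. Among our findings, we discovered that simply increasing the thinking budget is not the most effective use of compute. More accurate responses can instead be achieved through alternative configurations, such as self-consistency and self-reflection.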
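
Self-consistency, one of the alternative configurations this entry names, is usually implemented as a majority vote over several independently sampled answers. The sketch below shows only that aggregation step under simple assumptions: the sampled answers are hard-coded strings standing in for model outputs, and the `self_consistency` helper with its whitespace and case normalization is illustrative, not the paper's setup.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over sampled final answers; returns (answer, agreement ratio)."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from four short reasoning samples of the same question.
samples = ["42", "41", "42 ", "42"]
print(self_consistency(samples))
```

The trade-off the paper studies follows from this structure: the same token budget can be spent on one long reasoning trace or split across several shorter samples whose answers are then voted on.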

[70] MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery

Angelo Ortiz Tandazo,Manel Khentout,Youssef Benchekroun,Thomas Hueber,Emmanuel Dupoux

Main category: cs.CL

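TL;DR: This paper introduces MauBERT, a multilingual extension of HuBERT that uses articulatory features for cross-lingual speech representation learning. Through supervised continued pre-training on phone-to-articulatory-feature mappings in 55 languages, MauBERT learns to predict articulatory features or phones, yielding language-independent representations that capture multilingual phonetic properties. It outperforms existing multilingual self-supervised models on ABX discriminability tests and adapts to unseen languages and casual speech with only a small amount of self-supervised fine-tuning (10 hours of speech).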

Details

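Motivation: To improve the cross-lingual generalization and representational robustness of self-supervised speech models by introducing linguistic inductive biases, specifically multilingual representation learning based on articulatory features. Method: Continues HuBERT pre-training on multilingual data from 55 languages, using a phone-to-articulatory-feature mapping to supervise the model to predict articulatory features or phones, so that it learns language-independent representations with multilingual phonetic properties. Result: On ABX discriminability tests, MauBERT produces more context-invariant representations than state-of-the-art multilingual self-supervised models, and it adapts to new languages and casual speech with very little fine-tuning data (10 hours). Conclusion: By adding articulatory-feature supervision, MauBERT brings linguistic inductive biases into self-supervised speech models and achieves more robust, generalizable multilingual speech representations, offering a useful path for modeling low-resource languages. Abstract: This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.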
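
The ABX discriminability test mentioned in this entry asks whether a token X is closer in representation space to A (same category as X) than to B (a different category). The sketch below is a simplified illustration: it assumes one pooled vector per token and cosine distance, whereas ABX evaluations of speech models typically compare frame sequences with an alignment-based distance. The vectors are made up.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def abx_error(triples):
    """Fraction of (A, B, X) triples where X is not strictly closer to A than to B.

    By convention here, X belongs to the same category as A.
    """
    errors = sum(cosine_distance(a, x) >= cosine_distance(b, x)
                 for a, b, x in triples)
    return errors / len(triples)

# Two hypothetical triples: the first is discriminated correctly, the second is not.
triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0], [0.1, 0.9]),
]
print(abx_error(triples))
```

A lower ABX error indicates that the representation separates phonetic categories more cleanly, which is the sense in which the abstract calls representations context-invariant.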

[71] Exploring the features used for summary evaluation by Human and GPT

Zahra Sadeghi,Evangelos Milios,Frank Rudzicz

Main category: cs.CL

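TL;DR: This paper studies how well large language models (LLMs) agree with human judgments in summary evaluation, explores the features that drive their assessments, and shows that guiding GPT to use the metrics humans rely on improves its judgments.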

Details

Motivation: It is not yet clear which features LLMs rely on when evaluating summaries along a particular quality dimension, and the mapping between evaluation scores and metrics has received little study. Method: Uses statistical and machine learning metrics to identify features aligned with human and GPT responses, and experiments with instructing GPT to use the metrics that humans use. Result: Identifies key features aligned with human and GPT evaluation responses and shows that GPT's judgments agree better with human judgments when it is instructed to use human metrics. Conclusion: Guiding GPT to use the evaluation metrics that humans use improves the accuracy of its summary evaluation and its agreement with human judgments. Abstract: Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research investigated the alignment between LLMs and Human responses, it is not yet well understood what properties or features are exploited by them when asked to evaluate based on a particular quality dimension, and there has not been much attention towards mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with Human and Generative Pre-trained Transformers (GPTs) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ metrics used by humans can improve their judgments and bring them closer to human responses.

[72] Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori

Rolando Coto-Solano,Daisy Li,Manoela Teleginski Ferraz,Olivia Sasse,Cha Krupka,Sharid Loáiciga,Sally Akevai Tenamu Nicholas

Main category: cs.CL

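TL;DR: This study examines algorithms for diacritic restoration in extremely under-resourced languages (Bribri and Cook Islands Māori). Fine-tuned character-level LLMs perform best, reliable performance begins to emerge at around 10,000 words of data, and zero-shot approaches perform poorly.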

Details

Motivation: To respond to requests from the language communities and to explore how NLP models perform and generalize in low-resource settings. Method: Compares several diacritic restoration algorithms and analyzes model performance across data sizes and resource conditions, covering fine-tuned character-level LLMs and multilingual models, and also explores zero-shot approaches and the related diacritic correction task. Result: Fine-tuned character-level LLMs perform best, likely because they can decompose complex characters into their UTF-8 byte representations; massively multilingual models perform worse under the data constraints; reliable performance begins to emerge at around 10,000 words; all zero-shot approaches perform poorly. Conclusion: For diacritic restoration in low-resource languages, fine-tuning a character-level LLM is the most effective approach, and roughly 10,000 words of training data are needed for acceptable performance, which underlines the importance of task-specific fine-tuning and adequate data. Abstract: We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.
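
To make the byte-level point in the abstract above concrete, here is a small sketch, using only Python's standard library, of how a character with a diacritic decomposes into UTF-8 bytes and how a diacritic-free input can be derived from it. The example word and the helper names `utf8_bytes` and `strip_diacritics` are illustrative assumptions, not the paper's data or pipeline.

```python
import unicodedata


def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values a byte-level model would see."""
    return list(text.encode("utf-8"))


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop combining marks to build a restoration input."""
    # NFD splits a precomposed character into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_diacritics("Māori")   # the text a restoration model would receive
macron_a = utf8_bytes("ā")          # two bytes, versus one byte for plain "a"
```

A restoration model is then trained to map the stripped text back to the original. Simple NFD stripping is only one possible way to build such training pairs.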

[73] Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting

Filippos Ventirozos,Peter Appleby,Matthew Shardlow

Main category: cs.CL

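TL;DR: Proposes a chain-of-thought prompting method based on a Unified Meaning Representation (UMR) for zero-shot aspect-category sentiment analysis. Experiments show performance comparable to standard CoT on mid-sized models, but effectiveness may depend on the model and needs further study.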

Details

Motivation: Annotated data is scarce and costly, which limits supervised learning in new domains, so the paper explores whether large language models can perform aspect-category sentiment analysis zero-shot. Method: Proposes a new chain-of-thought (CoT) prompting technique that introduces an intermediate Unified Meaning Representation (UMR) to structure the reasoning for the ACSA task, and compares it with standard CoT on three models and four datasets. Result: The UMR approach performs comparably to standard CoT on mid-sized models such as Qwen3-8B, while its effect on smaller models remains unclear, indicating model dependence. Conclusion: UMR-based CoT is a promising zero-shot ACSA method, but its generality and its effectiveness across model scales still need to be validated. Abstract: Aspect-Category Sentiment Analysis (ACSA) provides granular insights by identifying specific themes within reviews and their associated sentiment. While supervised learning approaches dominate this field, the scarcity and high cost of annotated data for new domains present significant barriers. We argue that leveraging large language models (LLMs) in a zero-shot setting is a practical alternative where resources for data annotation are limited. In this work, we propose a novel Chain-of-Thought (CoT) prompting technique that utilises an intermediate Unified Meaning Representation (UMR) to structure the reasoning process for the ACSA task. We evaluate this UMR-based approach against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four diverse datasets. Our findings suggest that UMR effectiveness may be model-dependent. Whilst preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, these observations warrant further investigation, particularly regarding the potential applicability to smaller model architectures. Further research is required to establish the generalisability of these findings across different model scales.

[74] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

Jiacheng Guo,Ling Yang,Peter Chen,Qixin Xiao,Yinjie Wang,Xinzhe Juan,Jiahao Qiu,Ke Shen,Mengdi Wang

Main category: cs.CL

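TL;DR: GenEnv is a difficulty-aligned co-evolution framework that uses a generative environment simulator to construct tasks dynamically, improving LLM agents in a data-efficient way.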

Details

Motivation: Training of LLM agents is limited by the high cost and static nature of real-world interaction data, and there is no mechanism that adjusts task difficulty dynamically to the agent's ability. Method: Proposes GenEnv, which sets up a co-evolutionary game between an agent and a generative environment simulator and introduces an α-Curriculum Reward to align task difficulty with the agent's current capabilities, giving a dynamic curriculum. Result: On five benchmarks, including API-Bank and ALFWorld, GenEnv improves over 7B baselines by up to 40.3%, matches or exceeds larger models, and outperforms Gemini 2.5 Pro-based offline data augmentation while using 3.3× less data. Conclusion: By shifting from static supervision to adaptive simulation, GenEnv offers a data-efficient path for scaling agent capabilities. Abstract: Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a data-evolving paradigm: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent's "zone of proximal development". This process is guided by a simple but effective α-Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to +40.3% over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3× less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.
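
The abstract above does not give the formula for the α-Curriculum Reward, so the following is only a hedged sketch of one way a difficulty-aligned reward could look: the simulator is rewarded most when the agent's empirical success rate on a generated task batch is close to a target rate `alpha`. The Gaussian-shaped form, the function name, and the `beta` sharpness parameter are all assumptions made for illustration.

```python
import math


def alpha_curriculum_reward(successes: list[bool],
                            alpha: float = 0.5,
                            beta: float = 10.0) -> float:
    """Assumed form: the reward peaks when the success rate equals alpha."""
    if not successes:
        raise ValueError("need at least one rollout")
    success_rate = sum(successes) / len(successes)
    # Tasks that are too easy (rate near 1) or too hard (rate near 0) score lower.
    return math.exp(-beta * (success_rate - alpha) ** 2)


well_matched = alpha_curriculum_reward([True, False, True, False])
too_easy = alpha_curriculum_reward([True, True, True, True])
```

Under this assumed shape, a batch the agent solves half the time earns the maximum reward, while a batch it always solves earns much less, which pushes the simulator toward tasks at the edge of the agent's ability.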

cs.CV

[75] A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes

Yuncheng Lu,Yucen Shi,Aobo Li,Zehao Li,Junying Li,Bo Wang,Tony Tae-Hyoung Kim

Main category: cs.CV

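TL;DR: Proposes an energy-efficient anti-UAV system that combines frame-based and event-driven object tracking, using adaptive mode switching and a dedicated neural processing unit to detect drones with high accuracy and low power.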

Details

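Motivation: Conventional anti-UAV systems suffer from poor energy efficiency and poor real-time performance when detecting small, fast-moving targets, so a solution that balances energy efficiency and robustness is needed. Method: The system uses a fused frame-event tracking architecture: it reconstructs binary event frames with run-length encoding, generates region proposals, and switches adaptively between frame mode and event mode according to object size and velocity. A Fast Object Tracking Unit combines adaptive thresholding and trajectory classification to improve robustness for high-speed targets, and a dedicated neural processing unit supports grayscale-patch and trajectory inference with a custom instruction set and a zero-skipping MAC architecture. Result: Implemented as a 2 mm² chip in 40 nm CMOS, the system consumes 96 pJ per frame per pixel and 61 pJ per event at 0.8 V, reaches 98.2% recognition accuracy on public UAV datasets over ranges of 50 to 400 m and speeds of 5 to 80 pixels per second, and removes more than 97% of redundant neural computation. Conclusion: The system reaches state-of-the-art end-to-end energy efficiency and offers an efficient, reliable edge-intelligence solution for anti-UAV applications. Abstract: We present an energy-efficient anti-UAV system that integrates frame-based and event-driven object tracking to enable reliable detection of small and fast-moving drones. The system reconstructs binary event frames using run-length encoding, generates region proposals, and adaptively switches between frame mode and event mode based on object size and velocity. A Fast Object Tracking Unit improves robustness for high-speed targets through adaptive thresholding and trajectory-based classification. The neural processing unit supports both grayscale-patch and trajectory inference with a custom instruction set and a zero-skipping MAC architecture, reducing redundant neural computations by more than 97 percent. Implemented in 40 nm CMOS technology, the 2 mm² chip achieves 96 pJ per frame per pixel and 61 pJ per event at 0.8 V, and reaches 98.2 percent recognition accuracy on public UAV datasets across 50 to 400 m ranges and 5 to 80 pixels per second speeds. The results demonstrate state-of-the-art end-to-end energy efficiency for anti-UAV systems.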
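
As a software illustration of the run-length step mentioned above, the sketch below decodes run-length-encoded rows back into a binary event frame. The chip's actual encoding format is not described in the abstract; the convention used here (alternating run lengths per row, starting with a run of zeros) and the function names are assumptions.

```python
def rle_decode_row(runs: list[int], width: int) -> list[int]:
    """Decode alternating run lengths (zeros first) into one binary row."""
    row: list[int] = []
    value = 0  # assumed convention: the first run always counts zeros
    for length in runs:
        row.extend([value] * length)
        value = 1 - value  # flip between 0 and 1 for the next run
    if len(row) != width:
        raise ValueError(f"runs cover {len(row)} pixels, expected {width}")
    return row


def rle_decode_frame(encoded_rows: list[list[int]], width: int) -> list[list[int]]:
    """Rebuild a binary event frame from per-row run lengths."""
    return [rle_decode_row(runs, width) for runs in encoded_rows]


# A 2x6 frame: each row stores only run lengths instead of six pixel values.
frame = rle_decode_frame([[2, 3, 1], [0, 1, 5]], width=6)
```

Event frames are mostly zeros, so a few run lengths per row can stand in for many pixels, which is the usual motivation for this kind of encoding.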

[76] NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction

Karthik Prabhakar

Main category: cs.CV

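TL;DR: This paper proposes NystagmusNet, an AI system based on a dual-branch convolutional neural network that predicts visual environments at high risk of triggering photosensitivity in nystagmus patients and recommends visual adaptations in real time.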

Details

Motivation: Nystagmus patients face daily challenges because involuntary eye movements are exacerbated by environmental brightness, and existing assistive solutions lack personalized prediction. Method: A dual-branch convolutional neural network is trained on synthetic and augmented datasets; SHAP and GradCAM are used for explainability; a rule-based recommendation engine suggests adaptive filters. Result: The model reaches 75% validation accuracy on synthetic data and can highlight high-risk zones in the environment. Conclusion: NystagmusNet offers explainable, personalized environmental risk prediction and visual assistance for nystagmus patients; future work includes deployment on smart glasses and reinforcement learning for further refinement. Abstract: Nystagmus patients with photosensitivity face significant daily challenges due to involuntary eye movements exacerbated by environmental brightness conditions. Current assistive solutions are limited to symptomatic treatments without predictive personalization. This paper proposes NystagmusNet, an AI-driven system that predicts high-risk visual environments and recommends real-time visual adaptations. Using a dual-branch convolutional neural network trained on synthetic and augmented datasets, the system estimates a photosensitivity risk score based on environmental brightness and eye movement variance. The model achieves 75% validation accuracy on synthetic data. Explainability techniques including SHAP and GradCAM are integrated to highlight environmental risk zones, improving clinical trust and model interpretability. The system includes a rule-based recommendation engine for adaptive filter suggestions. Future directions include deployment via smart glasses and reinforcement learning for personalized recommendations.

[77] SuperFlow: Training Flow Matching Models with RL on the Fly

Kaijie Chen,Zhiyang Xu,Ying Shen,Zihao Lin,Yuguang Yao,Lifu Huang

Main category: cs.CV

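TL;DR: SuperFlow is a reinforcement learning training framework for flow-based generative models. Through variance-aware sampling and step-level advantages consistent with continuous-time flow dynamics, it addresses the sampling-efficiency and credit-assignment problems of existing methods, substantially reduces training steps and time, and improves performance on text-to-image tasks.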

Details

Motivation: RL training of flow-based generative models has two main problems: fixed per-prompt group sizes make sampling inefficient, and trajectory-level advantages are misused as per-step estimates, which biases credit assignment. Method: Proposes SuperFlow, which adjusts group sizes dynamically with a variance-aware strategy and computes step-level advantages consistent with continuous-time flow dynamics, improving sampling efficiency and credit assignment. Result: SuperFlow reaches competitive performance with only 5.4% to 56.3% of the original training steps, cuts training time by 5.2% to 16.7%, and on several text-to-image tasks improves over SD3.5-M by 4.6% to 47.2% and over Flow-GRPO by 1.7% to 16.0%. Conclusion: SuperFlow improves the efficiency and performance of RL training for flow-based generative models without any change to the model architecture. Abstract: Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.
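
The abstract above does not spell out how the variance-aware group sizes are computed. As a rough sketch of the general idea only, the code below splits a fixed sampling budget across prompts in proportion to the standard deviation of rewards seen in a small pilot batch, so prompts whose rewards vary more receive more samples. The allocation rule, the function name, and the parameters are assumptions, not SuperFlow's actual procedure.

```python
import statistics


def allocate_group_sizes(pilot_rewards: dict[str, list[float]],
                         total_budget: int,
                         min_size: int = 2) -> dict[str, int]:
    """Assumed rule: each prompt gets min_size plus a share proportional to reward std."""
    spare = total_budget - min_size * len(pilot_rewards)
    if spare < 0:
        raise ValueError("budget too small for the minimum group size")
    stds = {p: statistics.pstdev(r) for p, r in pilot_rewards.items()}
    total_std = sum(stds.values())
    sizes = {}
    for prompt, std in stds.items():
        # Fall back to an even split when every prompt has zero variance.
        share = std / total_std if total_std > 0 else 1 / len(stds)
        sizes[prompt] = min_size + int(spare * share)  # rounds down; may leave budget unused
    return sizes


sizes = allocate_group_sizes(
    {"noisy prompt": [0.0, 1.0], "stable prompt": [0.5, 0.5]},
    total_budget=10,
)
```

A prompt whose samples all score the same contributes no advantage signal in a group-relative scheme, which is one reason to spend fewer samples on it.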

[78] Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Ellie Zhou,Jihoon Chung,Olga Russakovsky

Main category: cs.CV

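TL;DR: This paper systematically analyzes background bias in classification models, contrastive text-image pretrained models, and video large language models, and finds that all of them tend to rely on the background. It reduces background bias in classification models with segmented human input, and uses manual and automated prompt tuning to steer VLLMs toward human actions.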

Details

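Motivation: Action recognition models often predict from background cues rather than human motion, which hurts generalization and introduces bias, so a systematic analysis and mitigation strategies are needed. Method: Measures the degree of background bias across model types (classification models, contrastive models, and VLLMs); adds segmented human input to classification models; designs manual and automated prompt tuning to steer VLLMs toward human behavior. Result: Segmented human input lowers background bias in classification models by 3.78%; prompt design shifts VLLM predictions toward human-focused reasoning by 9.85%. Conclusion: All the video models studied show substantial background bias, but input changes (such as human segmentation) and prompt design can mitigate it, which suggests paying closer attention to what models base their reasoning on. Abstract: Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLMs) and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input effectively decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.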

[79] SCS-SupCon: Sigmoid-based Common and Style Supervised Contrastive Learning with Adaptive Decision Boundaries

Bin Wang,Fadi Dornaika

Main category: cs.CV

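TL;DR: This paper proposes SCS-SupCon, a sigmoid-based supervised contrastive learning method. A pairwise contrastive loss with learnable temperature and bias parameters, together with a style-distance constraint, mitigates negative-sample dilution and improves fine-grained image classification, reaching state-of-the-art results on several benchmark datasets.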

Details

Motivation: Existing contrastive learning methods suffer from negative-sample dilution and non-adaptive decision boundaries on image classification tasks with small inter-class differences and large intra-class variation, which limits them especially in fine-grained recognition. Method: Proposes the SCS-SupCon framework, which uses a sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters to obtain adaptive decision boundaries, and adds an explicit style-distance constraint to disentangle style and content representations. Result: Experiments on six benchmark datasets (including CUB200-2011, Stanford Dogs, and CIFAR-100) show state-of-the-art performance with both CNN and Transformer backbones; on CIFAR-100 it improves over SupCon by about 3.9 percentage points and over CS-SupCon by about 1.7; on fine-grained datasets it is 0.4 to 3.0 points higher than CS-SupCon. Conclusion: Through an adaptive contrastive loss and style disentanglement, SCS-SupCon improves the discriminative power and generalization of supervised contrastive learning, particularly for fine-grained image classification, and ablation studies and statistical tests support its effectiveness and stability. Abstract: Image classification is hindered by subtle inter-class differences and substantial intra-class variations, which limit the effectiveness of existing contrastive learning methods. Supervised contrastive approaches based on the InfoNCE loss suffer from negative-sample dilution and lack adaptive decision boundaries, thereby reducing discriminative power in fine-grained recognition tasks. To address these limitations, we propose Sigmoid-based Common and Style Supervised Contrastive Learning (SCS-SupCon). Our framework introduces a sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters to enable adaptive decision boundaries. This formulation emphasizes hard negatives, mitigates negative-sample dilution, and more effectively exploits supervision. In addition, an explicit style-distance constraint further disentangles style and content representations, leading to more robust feature learning. Comprehensive experiments on six benchmark datasets, including CUB200-2011 and Stanford Dogs, demonstrate that SCS-SupCon achieves state-of-the-art performance across both CNN and Transformer backbones. On CIFAR-100 with ResNet-50, SCS-SupCon improves top-1 accuracy over SupCon by approximately 3.9 percentage points and over CS-SupCon by approximately 1.7 points under five-fold cross-validation. On fine-grained datasets, it outperforms CS-SupCon by 0.4 to 3.0 points. Extensive ablation studies and statistical analyses further confirm the robustness and generalization of the proposed framework, with Friedman tests and Nemenyi post-hoc evaluations validating the stability of the observed improvements.
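
The exact SCS-SupCon loss is not given in the abstract above. As a minimal sketch of what a sigmoid-based pairwise contrastive loss with a temperature and a bias can look like, the code below scores every pair independently with a logistic loss on `temperature * similarity + bias`. This form follows the generic sigmoid pairwise loss and is an assumption; the style-distance term is omitted, and in training `temperature` and `bias` would be learnable parameters rather than fixed floats.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_pair_loss(similarity: float, same_class: bool,
                      temperature: float, bias: float) -> float:
    """-log sigmoid(y * (t * s + b)), with y = +1 for same-class pairs, else -1."""
    sign = 1.0 if same_class else -1.0
    return softplus(-sign * (temperature * similarity + bias))


def batch_loss(embeddings: list[list[float]], labels: list[int],
               temperature: float = 10.0, bias: float = -5.0) -> float:
    """Mean pairwise loss over all distinct pairs; embeddings assumed L2-normalized."""
    total, count = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += sigmoid_pair_loss(sim, labels[i] == labels[j], temperature, bias)
            count += 1
    return total / count


hard_negative = sigmoid_pair_loss(0.9, same_class=False, temperature=10.0, bias=-5.0)
easy_negative = sigmoid_pair_loss(0.1, same_class=False, temperature=10.0, bias=-5.0)
```

Because each pair contributes its own term instead of sharing one softmax denominator, a similar-looking negative keeps a large loss regardless of how many easy negatives are in the batch. In this sketch the decision boundary sits where `temperature * similarity + bias` crosses zero, so learning the two parameters moves that boundary.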

[80] A Modular Framework for Single-View 3D Reconstruction of Indoor Environments

Yuxiao Li

Main category: cs.CV

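TL;DR: Proposes a modular, diffusion-based framework for single-view 3D reconstruction of indoor scenes that first completes occluded content and then lifts it to 3D, which markedly improves reconstruction quality.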

Details

Motivation: Traditional methods predict 3D shapes directly from a single 2D image and handle complex shapes and occlusions poorly, which limits reconstruction quality. Method: Splits reconstruction into two steps: diffusion models first complete the full views of occluded instances and the room background, which are then converted to 3D. The four core modules are amodal completion, layout inpainting, hybrid depth estimation, and view-space alignment. Result: Experiments on the 3D-Front dataset show the method outperforms current SOTA methods in both visual quality and geometric accuracy. Conclusion: The modular diffusion framework reconstructs both foreground objects and the background from a single image and has potential applications in interior design, real estate, and AR. Abstract: We propose a modular framework for single-view indoor scene 3D reconstruction, where several core modules are powered by diffusion techniques. Traditional approaches for this task often struggle with the complex instance shapes and occlusions inherent in indoor environments. They frequently overshoot by attempting to predict 3D shapes directly from incomplete 2D images, which results in limited reconstruction quality. We aim to overcome this limitation by splitting the process into two steps: first, we employ diffusion-based techniques to predict the complete views of the room background and occluded indoor instances, then transform them into 3D. Our modular framework makes contributions to this field through the following components: an amodal completion module for restoring the full view of occluded instances, an inpainting model specifically trained to predict room layouts, a hybrid depth estimation technique that balances overall geometric accuracy with fine detail expressiveness, and a view-space alignment method that exploits both 2D and 3D cues to ensure precise placement of instances within the scene. This approach effectively reconstructs both foreground instances and the room background from a single image. Extensive experiments on the 3D-Front dataset demonstrate that our method outperforms current state-of-the-art (SOTA) approaches in terms of both visual quality and reconstruction accuracy. The framework holds promising potential for applications in interior design, real estate, and augmented reality.

[81] Enhancing Tea Leaf Disease Recognition with Attention Mechanisms and Grad-CAM Visualization

Omar Faruq Shikdar,Fahad Ahammed,B. M. Shahria Alam,Golam Kibria,Tawhidur Rahman,Nishat Tasnim Niloy

Main category: cs.CV

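TL;DR: This study develops a system that automatically classifies tea leaf diseases so farmers can act in time to reduce losses. It builds a new dataset of 5278 images, experiments with three pretrained models (DenseNet, Inception, and EfficientNet) combined with attention mechanisms, reaches a best accuracy of 85.68% with an ensemble model, and adds explainable AI for interpretability.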

Details

Motivation: Tea is one of the most widely consumed drinks in the world, and tea production matters to many countries. Tea leaf diseases that are not controlled in time cause serious economic losses. Manual identification is inefficient and unreliable, so an efficient, accurate automated method is needed. Method: Builds a new dataset of 5278 images across seven classes and pre-processes it, then trains three pretrained models (DenseNet, Inception, and EfficientNet), with EfficientNet used only in the ensemble model; two attention modules are added to improve performance, and explainable AI techniques are used for interpretability. Result: The ensemble model reaches the highest classification accuracy, 85.68%, outperforming the individual models. Conclusion: The automated tea leaf disease classifier performs well on the self-built dataset; combining attention mechanisms with ensemble learning improves recognition accuracy, and the system has practical potential to help farmers control disease in time and reduce economic losses. Abstract: Tea is among the most widely consumed drinks globally. Tea production is a key industry for many countries. One of the main challenges in tea harvesting is tea leaf diseases. If the spread of tea leaf diseases is not stopped in time, it can lead to massive economic losses for farmers. Therefore, it is crucial to identify tea leaf diseases as soon as possible. Manually identifying tea leaf disease is an ineffective and time-consuming method, without any guarantee of success. Automating this process will improve both the efficiency and the success rate of identifying tea leaf diseases. The purpose of this study is to create an automated system that can classify different kinds of tea leaf diseases, allowing farmers to take action to minimize the damage. A novel dataset was developed specifically for this study. The dataset contains 5278 images across seven classes. The dataset was pre-processed prior to training the model. We deployed three pretrained models: DenseNet, Inception, and EfficientNet. EfficientNet was used only in the ensemble model. We utilized two different attention modules to improve model performance. The ensemble model achieved the highest accuracy of 85.68%. Explainable AI was introduced for better model interpretability.

[82] Name That Part: 3D Part Segmentation and Naming

Soumava Paul,Prakhar Kaushik,Ankit Vaidya,Anand Bhattad,Alan Yuille

Main category: cs.CV

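TL;DR: This paper proposes ALIGN-Parts, a new method for semantic 3D part segmentation that treats part naming as a set alignment task and combines geometric, appearance, and semantic information for open-vocabulary part recognition and naming. It also builds a unified ontology of 1,794 3D parts.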

Details

Motivation: Part definitions are inconsistent across existing datasets, which limits robust training; previous methods cannot provide both complete shape annotations and meaningful part names. Method: Proposes ALIGN-Parts, which decomposes shapes into "partlets" (implicit 3D part representations) and aligns them with text descriptions through bipartite matching. It fuses geometric cues from 3D part fields, appearance from multi-view visual features, and semantic descriptions generated by a language model, and uses a text-alignment loss so that partlets and text share an embedding space. Result: Achieves efficient one-shot 3D part segmentation and naming, with zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories; builds a unified ontology of 1,794 unique 3D parts that aligns PartNet, 3DCoMPaT++, and Find3D; shows examples from the newly created Tex-Parts dataset; and introduces two new evaluation metrics suited to named 3D part segmentation. Conclusion: ALIGN-Parts offers a scalable solution for semantic 3D part segmentation, can serve as an automatic annotation engine, and supports building a unified part ontology across datasets. Abstract: We address semantic 3D part segmentation: decomposing objects into parts with meaningful names. While datasets exist with part annotations, their definitions are inconsistent across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations. We propose ALIGN-Parts, which formulates part naming as a direct set alignment task. Our method decomposes shapes into partlets - implicit 3D part representations - matched to part descriptions via bipartite assignment. We combine geometric cues from 3D part fields, appearance from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. Text-alignment loss ensures partlets share embedding space with text, enabling a theoretically open-vocabulary matching setup, given sufficient data. Our efficient and novel, one-shot, 3D part segmentation and naming method finds applications in several downstream tasks, including serving as a scalable annotation engine. As our model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories, with human verification, we create a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D, consisting of 1,794 unique 3D parts. We also show examples from our newly created Tex-Parts dataset and introduce two novel metrics appropriate for the named 3D part segmentation task.

[83] Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models

Shubham Kumar Nigam,Parjanya Aditya Shukla,Noel Shallum,Arnab Bhattacharya

Main category: cs.CV

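TL;DR: This paper compares a traditional OCR plus machine translation pipeline with vision large language models that translate handwritten Marathi legal documents directly, addressing the challenges of handwritten text recognition and translation for low-resource languages.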

Details

Motivation: Handwritten text recognition and machine translation are hard for low-resource languages such as Marathi, which lack large digitized corpora and show highly variable handwriting. India's district and high courts need to process legal documents such as FIRs, charge sheets, and witness statements efficiently, so scalable and accurate translation systems are urgently needed. Method: Compares a traditional two-stage OCR-MT pipeline with end-to-end Vision LLMs on translating images of handwritten Marathi text, evaluated on a curated dataset of handwritten Marathi legal documents. Result: The study evaluates both approaches in a low-resource setting and reports empirical findings on building robust, edge-deployable translation systems. Conclusion: Vision LLMs show potential for unifying handwritten text recognition and translation and may suit legal document processing for low-resource languages better than the traditional pipeline, improving access to legal information for non-native speakers and legal professionals. Abstract: Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India's district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.

[84] NodMAISI: Nodule-Oriented Medical AI for Synthetic Imaging

Fakrul Islam Tushar,Ehsan Samei,Cynthia Rudin,Joseph Y. Lo

Main category: cs.CV

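TL;DR: This paper proposes NodMAISI, a CT image synthesis and augmentation framework focused on anatomically constrained generation of lung nodules. Trained on multi-source data, it substantially improves small-nodule detection and malignancy classification, particularly when data is scarce.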

Details

Motivation: Abnormal lung findings, especially small pulmonary nodules, are often underrepresented and inconsistently annotated in medical imaging datasets, which limits the training and evaluation of lung cancer screening models; a more precise, consistent, lesion-aware augmentation method is needed. Method: NodMAISI comprises (i) a standardized multi-source curation and annotation pipeline linking CT images, organ masks, and nodule-level annotations; (ii) a ControlNet-conditioned rectified-flow generator built on MAISI-v2's foundational blocks to keep anatomy and lesions consistent; and (iii) lesion-aware augmentation that generates paired CT variants through controlled shrinkage of nodule masks while preserving surrounding anatomy. Result: Across six public test datasets, NodMAISI improves distributional fidelity over MAISI-v2 (lower FID); it substantially raises average sensitivity in nodule detectability, especially for sub-centimeter nodules; in benign-versus-malignant classification, AUC improves by 0.07 to 0.11 with limited clinical data (<=20%) and by 0.12 to 0.21 with only 10%, easing data scarcity. Conclusion: With anatomically constrained, lesion-aware synthesis, NodMAISI improves lung nodule detection and classification models, and is especially useful when annotated data is limited. Abstract: Objective: Although medical imaging datasets are increasingly available, abnormal and annotation-intensive findings critical to lung cancer screening, particularly small pulmonary nodules, remain underrepresented and inconsistently curated. Methods: We introduce NodMAISI, an anatomically constrained, nodule-oriented CT synthesis and augmentation framework trained on a unified multi-source cohort (7,042 patients, 8,841 CTs, 14,444 nodules). The framework integrates: (i) a standardized curation and annotation pipeline linking each CT with organ masks and nodule-level annotations, (ii) a ControlNet-conditioned rectified-flow generator built on MAISI-v2's foundational blocks to enforce anatomy- and lesion-consistent synthesis, and (iii) lesion-aware augmentation that perturbs nodule masks (controlled shrinkage) while preserving surrounding anatomy to generate paired CT variants. Results: Across six public test datasets, NodMAISI improved distributional fidelity relative to MAISI-v2 (real-to-synthetic FID range 1.18 to 2.99 vs 1.69 to 5.21). In lesion detectability analysis using a MONAI nodule detector, NodMAISI substantially increased average sensitivity and more closely matched clinical scans (IMD-CT: 0.69 vs 0.39; DLCS24: 0.63 vs 0.20), with the largest gains for sub-centimeter nodules where MAISI-v2 frequently failed to reproduce the conditioned lesion. In downstream nodule-level malignancy classification trained on LUNA25 and externally evaluated on LUNA16, LNDbv4, and DLCS24, NodMAISI augmentation improved AUC by 0.07 to 0.11 at <=20% clinical data and by 0.12 to 0.21 at 10%, consistently narrowing the performance gap under data scarcity.
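
The abstract above does not say how the controlled shrinkage of nodule masks is implemented. One common way to shrink a binary mask is morphological erosion, sketched below in plain Python on a 2D mask with a 4-neighbour structuring element. Treating shrinkage as erosion, working in 2D, and the function name are all assumptions for illustration; real CT masks are 3D volumes.

```python
def erode_mask(mask: list[list[int]]) -> list[list[int]]:
    """One step of binary erosion: keep a pixel only if it and its 4 neighbours are set."""
    height, width = len(mask), len(mask[0])

    def is_set(r: int, c: int) -> bool:
        # Pixels outside the image count as background.
        return 0 <= r < height and 0 <= c < width and mask[r][c] == 1

    return [
        [
            1 if (is_set(r, c) and is_set(r - 1, c) and is_set(r + 1, c)
                  and is_set(r, c - 1) and is_set(r, c + 1)) else 0
            for c in range(width)
        ]
        for r in range(height)
    ]


# A 3x3 nodule in a 5x5 slice shrinks to its single centre pixel.
nodule = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
shrunk = erode_mask(nodule)
```

Pairing the original mask with a shrunk one as conditioning inputs would give two scans that differ mainly in nodule size, which matches the paired CT variants the abstract mentions.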

[85] YolovN-CBi: A Lightweight and Efficient Architecture for Real-Time Detection of Small UAVs

Ami Pandat,Punna Rajasekhar,Gopika Vinod,Rohit Shukla

Main category: cs.CV

TL;DR: 提出了一种改进的Yolov5-CBi架构，结合CBAM和BiFPN模块，提升了对小型无人机的检测性能，并通过知识蒸馏实现轻量化，显著提高了实时检测的速度与精度。

Details

Motivation: Drones are small, fast, and low in visual contrast, so traditional detection methods struggle to detect them accurately in real time and a more efficient detector is needed. Method: Adds the CBAM attention module and the BiFPN feature-fusion structure to Yolov5 to build Yolov5-CBi; trains on a curated set of 28K images and evaluates on benchmarks and a local test set of 2500 images of very small drones; designs four variants and compresses the models with knowledge distillation for edge deployment. Result: Yolov5-CBi outperforms Yolov8 and Yolov12 on several benchmark datasets and the local dataset; the distilled lightweight model reaches an mAP@0.5:0.9 of 0.6573, a 6.51% improvement over the teacher model, and is 82.9% faster than the baseline model. Conclusion: The proposed CBi architecture and its distilled models give a better speed-accuracy balance for small-UAV detection and suit real-time edge applications. Abstract: Unmanned Aerial Vehicles, commonly known as drones, pose increasing risks in civilian and defense settings, demanding accurate and real-time drone detection systems. However, detecting drones is challenging because of their small size, rapid movement, and low visual contrast. A modified architecture of YolovN called the YolovN-CBi is proposed that incorporates the Convolutional Block Attention Module (CBAM) and the Bidirectional Feature Pyramid Network (BiFPN) to improve sensitivity to small object detections. A curated training dataset consisting of 28K images is created with various flying objects and a local test dataset is collected with 2500 images consisting of very small drone objects. The proposed architecture is evaluated on four benchmark datasets, along with the local test dataset. The baseline Yolov5 and the proposed Yolov5-CBi architecture outperform newer Yolo versions, including Yolov8 and Yolov12, in the speed-accuracy trade-off for small object detection. Four other variants of the proposed CBi architecture are also proposed and evaluated, which vary in the placement and usage of CBAM and BiFPN. These variants are further distilled using knowledge distillation techniques for edge deployment, using a Yolov5m-CBi teacher and a Yolov5n-CBi student. The distilled model achieved an mAP@0.5:0.9 of 0.6573, representing a 6.51% improvement over the teacher's score of 0.6171, highlighting the effectiveness of the distillation process. The distilled model is 82.9% faster than the baseline model, making it more suitable for real-time drone detection. These findings highlight the effectiveness of the proposed CBi architecture, together with the distilled lightweight models in advancing efficient and accurate real-time detection of small UAVs.
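
The 6.51% figure quoted above is a relative gain over the teacher's score, not a difference in absolute mAP points. The short check below reproduces it from the two scores given in the abstract; the function name is an illustrative choice.

```python
def relative_gain_percent(new_score: float, old_score: float) -> float:
    """Relative improvement of new_score over old_score, in percent."""
    return (new_score - old_score) / old_score * 100.0


student_map = 0.6573   # distilled student, from the abstract
teacher_map = 0.6171   # teacher, from the abstract

relative = round(relative_gain_percent(student_map, teacher_map), 2)
absolute_points = round((student_map - teacher_map) * 100.0, 2)
```

Here `relative` comes out at 6.51, while the absolute difference is 4.02 mAP points, so the two ways of reporting the gain should not be mixed.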

[86] FOODER: Real-time Facial Authentication and Expression Recognition

Sabri Mustafa Kahya,Muhammet Sami Yavuz,Boran Hamdi Sivrikaya,Eckehard Steinbach

Main category: cs.CV

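TL;DR: FOODER is a real-time, privacy-preserving framework built on low-cost FMCW radar that combines OOD detection for facial authentication with expression recognition, using a multi-encoder multi-decoder structure and ResNet/MobileViT networks for accurate authentication and fine-grained expression classification.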

Details

Motivation: To provide secure, reliable facial authentication and expression recognition while protecting user privacy, avoiding the privacy risks of conventional vision-based methods, and to detect out-of-distribution samples for better system security. Method: FMCW radar provides range-Doppler and micro range-Doppler data; the authentication module performs OOD detection with a multi-encoder multi-decoder architecture (with BP and ILED components); after successful authentication, a ResNet processes the concatenated radar representations to separate dynamic from static expressions, and two specialized MobileViT networks then classify the specific expression. Result: On a self-collected 60 GHz radar dataset, authentication reaches 94.13% AUROC and 18.12% FPR95, and expression recognition averages 94.70% accuracy, outperforming existing OOD and transformer-based methods while running in real time. Conclusion: FOODER delivers efficient, privacy-preserving, contactless face authentication and expression recognition; its hierarchical design and use of radar signals make it promising for real-world use. Abstract: Out-of-distribution (OOD) detection is essential for the safe deployment of neural networks, as it enables the identification of samples outside the training domain. We present FOODER, a real-time, privacy-preserving radar-based framework that integrates OOD-based facial authentication with facial expression recognition. FOODER operates using low-cost frequency-modulated continuous-wave (FMCW) radar and exploits both range-Doppler and micro range-Doppler representations. The authentication module employs a multi-encoder multi-decoder architecture with Body Part (BP) and Intermediate Linear Encoder-Decoder (ILED) components to classify a single enrolled individual as in-distribution while detecting all other faces as OOD. Upon successful authentication, an expression recognition module is activated. Concatenated radar representations are processed by a ResNet block to distinguish between dynamic and static facial expressions. Based on this categorization, two specialized MobileViT networks are used to classify dynamic expressions (smile, shock) and static expressions (neutral, anger). This hierarchical design enables robust facial authentication and fine-grained expression recognition while preserving user privacy by relying exclusively on radar data. Experiments conducted on a dataset collected with a 60 GHz short-range FMCW radar demonstrate that FOODER achieves an AUROC of 94.13% and an FPR95 of 18.12% for authentication, along with an average expression recognition accuracy of 94.70%. FOODER outperforms state-of-the-art OOD detection methods and several transformer-based architectures while operating efficiently in real time.
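
For readers unfamiliar with the FPR95 metric reported above, the sketch below computes it in the usual way: pick the score threshold that accepts 95% of in-distribution samples, then measure the fraction of OOD samples that would also be accepted. The score convention (higher means more in-distribution), the function name, and the toy scores are assumptions; the abstract does not describe FOODER's scoring function.

```python
def fpr_at_95_tpr(id_scores: list[float], ood_scores: list[float]) -> float:
    """False positive rate on OOD samples at the threshold accepting 95% of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples that covers at least 95% (integer ceiling).
    needed = (95 * len(ranked) + 99) // 100
    threshold = ranked[needed - 1]
    # Ties at the threshold are accepted, so the true positive rate can exceed 95%.
    false_accepts = sum(score >= threshold for score in ood_scores)
    return false_accepts / len(ood_scores)


# Toy scores: 20 enrolled-user samples and 4 impostor samples.
enrolled = [i / 20 for i in range(1, 21)]
impostors = [0.0, 0.02, 0.3, 0.9]
rate = fpr_at_95_tpr(enrolled, impostors)
```

In an authentication setting, an FPR95 of 18.12% would mean that about 18 in 100 impostor samples pass at the operating point where the enrolled user is accepted 95% of the time.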

[87] FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

Ekta Balkrishna Gavas,Sudipta Banerjee,Chinmay Hegde,Nasir Memon

Main category: cs.CV

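TL;DR: This paper introduces FPBench, the first comprehensive benchmark for evaluating multimodal large language models (MLLMs) on fingerprint understanding. It covers 20 models, 7 datasets, and 8 biometric and forensic tasks, examines zero-shot and chain-of-thought prompting, and discusses explainability, challenges, and limitations, laying groundwork for foundation models in the fingerprint domain.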

Details

Motivation: MLLMs have shown potential for biometric analysis of iris and face images, but their potential for fingerprint understanding has not been explored, and no dedicated benchmark exists. Method: Designs a comprehensive benchmark, FPBench, that evaluates 20 open-source and proprietary MLLMs on 7 real and synthetic datasets across 8 biometric and forensic tasks, using zero-shot and chain-of-thought prompting. Result: The systematic evaluation shows the overall level of current MLLMs on fingerprint understanding tasks along with key challenges and limitations; some models show ability on specific tasks, but sensitivity to fine detail and reasoning consistency are widely lacking. Conclusion: FPBench is the first comprehensive MLLM benchmark for fingerprint understanding; it fills a gap in the field and provides an evaluation standard and research directions for future foundation models for fingerprint analysis. Abstract: Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding are yet unexplored. In this work, we design a comprehensive benchmark, FPBench, that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance and explainability, and share our insights into the challenges and limitations. We establish FPBench as the first comprehensive benchmark for fingerprint domain understanding with MLLMs, paving the path for foundation models for fingerprints.

[88] Uncertainty-Gated Region-Level Retrieval for Robust Semantic Segmentation

Shreshth Rajan,Raymond Liu

Main category: cs.CV

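TL;DR: Proposes a region-level, uncertainty-gated retrieval mechanism that improves the accuracy and calibration of semantic segmentation under domain shift, raising mean intersection-over-union by 11.3% while cutting retrieval cost by 87.5%.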

Details

Motivation: To make semantic segmentation of outdoor street scenes robust to different environments, lighting, weather, and sensor noise while running in real time, and in particular to improve segmentation performance and calibration under domain shift. Method: Proposes a region-level, uncertainty-gated retrieval mechanism that retrieves features only for regions detected as highly uncertain, which avoids unnecessary computation and helps the model adapt to domain shift. Result: The method raises mean intersection-over-union (mIoU) by 11.3% while reducing the share of regions retrieved from 100% to 12.5%, an 87.5% reduction in retrieval cost. Conclusion: Uncertainty gating balances accuracy and efficiency, improves segmentation performance and calibration under domain shift, and suits real-time safety-critical applications such as autonomous driving. Abstract: Semantic segmentation of outdoor street scenes plays a key role in applications such as autonomous driving, mobile robotics, and assistive technology for visually-impaired pedestrians. For these applications, accurately distinguishing between key surfaces and objects such as roads, sidewalks, vehicles, and pedestrians is essential for maintaining safety and minimizing risks. Semantic segmentation must be robust to different environments, lighting and weather conditions, and sensor noise, while being performed in real-time. We propose a region-level, uncertainty-gated retrieval mechanism that improves segmentation accuracy and calibration under domain shift. Our best method achieves an 11.3% increase in mean intersection-over-union while reducing retrieval cost by 87.5%, retrieving for only 12.5% of regions compared to 100% for an always-on baseline.
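
The abstract above does not describe how the gate is computed, so the following is only a sketch of one plausible gating rule: score each region by the entropy of its predicted class distribution and run retrieval only for the most uncertain fraction. Using entropy, selecting a fixed top fraction rather than a threshold, and the function names are assumptions.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy of one region's class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_regions_for_retrieval(region_probs: list[list[float]],
                                 retrieve_fraction: float = 0.125) -> list[int]:
    """Return indices of the most uncertain regions; only these trigger retrieval."""
    scores = [entropy(p) for p in region_probs]
    budget = max(1, round(retrieve_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Eight regions: seven confident predictions and one ambiguous region at index 3.
regions = [[0.99, 0.01]] * 8
regions[3] = [0.5, 0.5]
gated = select_regions_for_retrieval(regions)
```

With 8 regions and a 12.5% budget, a single region is retrieved for, which is where the 87.5% saving relative to an always-on baseline comes from.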

[89] SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping

Thomas Boudras,Martin Schwartz,Rasmus Fensholt,Martin Brandt,Ibrahim Fayad,Jean-Pierre Wigneron,Gabriel Belouze,Fajwel Fogel,Philippe Ciais

Main category: cs.CV

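TL;DR: This paper proposes SERA-H, an end-to-end model that combines a super-resolution module with temporal attention encoding to produce 2.5 m resolution canopy height maps from freely available Sentinel-1/2 time series. Trained under the supervision of high-density LiDAR data, it outperforms baseline methods and rivals approaches that rely on commercial high-resolution imagery.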

Details

Motivation: Existing deep learning methods that predict canopy height maps from satellite imagery often face a trade-off between data accessibility and spatial resolution; this work aims to overcome that limitation and achieve forest height mapping that is both high-resolution and easy to obtain. Method: SERA-H combines an EDSR super-resolution module with a UTAE temporal attention encoding module, takes Sentinel-1 and Sentinel-2 (10 m) time series as input, uses high-density LiDAR (ALS) data as the supervision signal, and is trained end to end to generate 2.5 m resolution height maps. Result: Evaluated on an open-source benchmark dataset in France, SERA-H reaches an MAE of 2.6 m and a coefficient of determination R² of 0.82, outperforming standard Sentinel-1/2 baselines and matching or exceeding methods that rely on commercial high-resolution imagery such as SPOT-6/7 and PlanetScope. Conclusion: Combining high-resolution supervision with the spatiotemporal information in multi-temporal data makes it possible to reconstruct details beyond the native resolution of the input sensors; SERA-H enables free, high-revisit, high-accuracy forest height mapping with broad application potential. Abstract: High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. Although recent studies have led to the advent of deep learning methods using satellite imagery to predict height maps, these approaches often face a trade-off between data accessibility and spatial resolution. To overcome these limitations, we present SERA-H, an end-to-end model combining a super-resolution module (EDSR) and temporal attention encoding (UTAE). Trained under the supervision of high-density LiDAR data (ALS), our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 (10 m) time series data. Evaluated on an open-source benchmark dataset in France, SERA-H, with a MAE of 2.6 m and a coefficient of determination of 0.82, not only outperforms standard Sentinel-1/2 baselines but also achieves performance comparable to or better than methods relying on commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar). These results demonstrate that combining high-resolution supervision with the spatiotemporal information embedded in time series enables the reconstruction of details beyond the input sensors' native resolution. SERA-H opens the possibility of freely mapping forests with high revisit frequency, achieving accuracy comparable to that of costly commercial imagery. The source code is available at https://github.com/ThomasBoudras/SERA-H#
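
For reference, the two reported metrics (MAE and the coefficient of determination) can be computed as in the sketch below. The height values are made-up toy numbers and the function names are hypothetical; nothing here reproduces the paper's evaluation pipeline.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted heights."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy canopy heights in metres (illustrative values only).
heights_lidar = [10.0, 20.0, 30.0, 40.0]
heights_pred = [12.0, 18.0, 33.0, 39.0]
mae = mean_absolute_error(heights_lidar, heights_pred)
r2 = r_squared(heights_lidar, heights_pred)
```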

[90] EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams

Hao Li,Daiwei Lu,Jiacheng Wang,Robert J. Webster,Ipek Oguz

Main category: cs.CV

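TL;DR: EndoStreamDepth is a monocular depth estimation framework for endoscopic video streams that produces accurate, temporally consistent depth maps with sharp anatomical boundaries in real time.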

Details

Motivation: Existing methods mostly take batched inputs, which makes it hard to meet the real-time and temporal-consistency requirements of endoscopic video streams, and their boundary accuracy is insufficient, limiting their use in downstream tasks such as robotic surgery. Method: The EndoStreamDepth framework has three parts: (1) a single-frame depth network with an endoscopy-specific transformation; (2) multi-level Mamba temporal modules that propagate inter-frame information; (3) a hierarchical design with multi-scale supervision, where several loss terms jointly improve local boundary sharpness and global geometric consistency. Result: On two public colonoscopy depth datasets the method substantially outperforms existing state-of-the-art methods, producing depth maps with sharper, anatomically aligned boundaries while achieving temporal consistency and real-time inference. Conclusion: By combining single-frame processing with temporal modeling, EndoStreamDepth balances accuracy, boundary sharpness, and real-time performance, offering a reliable solution for depth perception in minimally invasive surgery. Abstract: This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth

[91] Local Patches Meet Global Context: Scalable 3D Diffusion Priors for Computed Tomography Reconstruction

Taewon Yang,Jason Hu,Jeffrey A. Fessler,Liyue Shen

Main category: cs.CV

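TL;DR: A 3D patch-based diffusion model is proposed that learns a fully 3D diffusion prior from limited data, enabling efficient, high-quality generation and reconstruction of high-resolution 3D images.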

Details

Motivation: Training diffusion models directly on 3D data poses major challenges in compute and dataset scale, and existing methods mostly rely on 2D priors, which cannot fully exploit 3D generative capacity. Method: The approach models the joint distribution of position-aware 3D local patches and a downsampled 3D volume (the global context), learning a prior over 3D patches and coupling local and global information to achieve scalable, high-quality 3D generation. Result: Experiments on multiple 3D CT datasets show that the method outperforms existing state-of-the-art methods in both performance and efficiency, completing a high-resolution 512×512×256 3D reconstruction in about 20 minutes. Conclusion: The proposed 3D patch-based diffusion model learns a 3D prior effectively from limited data and provides an efficient, accurate solution to high-dimensional inverse problems. Abstract: Diffusion models learn strong image priors that can be leveraged to solve inverse problems like medical image reconstruction. However, for real-world applications such as 3D Computed Tomography (CT) imaging, directly training diffusion models on 3D data presents significant challenges due to the high computational demands of extensive GPU resources and large-scale datasets. Existing works mostly reuse 2D diffusion priors to address 3D inverse problems, but fail to fully realize and leverage the generative capacity of diffusion models for high-dimensional data. In this study, we propose a novel 3D patch-based diffusion model that can learn a fully 3D diffusion prior from limited data, enabling scalable generation of high-resolution 3D images. Our core idea is to learn the prior of 3D patches to achieve scalable efficiency, while coupling local and global information to guarantee high-quality 3D image generation, by modeling the joint distribution of position-aware 3D local patches and downsampled 3D volume as global context. Our approach not only enables high-quality 3D generation, but also offers an unprecedentedly efficient and accurate solution to high-resolution 3D inverse problems. Experiments on 3D CT reconstruction across multiple datasets show that our method outperforms state-of-the-art methods in both performance and efficiency, notably achieving high-resolution 3D reconstruction of $512 \times 512 \times 256$ ($\sim$20 mins).

[92] Atlas is Your Perfect Context: One-Shot Customization for Generalizable Foundational Medical Image Segmentation

Ziyu Zhang,Yi Yu,Simeng Zhu,Ahmed Aly,Yunhe Gao,Ning Gu,Yuan Xue

Main category: cs.CV

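TL;DR: AtlasSegFM is an atlas-guided framework that customizes foundation models for medical image segmentation using a single annotated example, improving performance in clinical contexts that are underrepresented in training data.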

Details

Motivation: Existing interactive foundation models depend on precise prompts and perform poorly in clinical contexts that are underrepresented in their training data, so a method that adapts quickly to a specific clinical setting is needed. Method: The framework combines atlas registration with a foundation model: 1) it generates context-aware prompts by registering a context atlas to the query images; 2) at test time, an adapter fuses the predictions from atlas registration and from the foundation model. Result: The method is validated on multiple public and in-house datasets; it performs especially well on small, delicate structures and markedly improves segmentation accuracy. Conclusion: AtlasSegFM offers a lightweight, deployable solution for one-shot customization of foundation models in real-world clinical workflows. Abstract: Accurate medical image segmentation is essential for clinical diagnosis and treatment planning. While recent interactive foundation models (e.g., nnInteractive) enhance generalization through large-scale multimodal pretraining, they still depend on precise prompts and often perform below expectations in contexts that are underrepresented in their training data. We present AtlasSegFM, an atlas-guided framework that customizes available foundation models to clinical contexts with a single annotated example. The core innovations are: 1) a pipeline that provides context-aware prompts for foundation models via registration between a context atlas and query images, and 2) a test-time adapter to fuse predictions from both atlas registration and the foundation model. Extensive experiments across public and in-house datasets spanning multiple modalities and organs demonstrate that AtlasSegFM consistently improves segmentation, particularly for small, delicate structures. AtlasSegFM provides a lightweight, deployable solution for one-shot customization of foundation models in real-world clinical workflows. The code will be made publicly available.

[93] MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Kaixing Yang,Jiashu Zhu,Xulong Tang,Ziqiao Peng,Xiangyue Zhang,Puwei Wang,Jiahong Wu,Xiangxiang Chu,Hongyan Liu,Jun He

Main category: cs.CV

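TL;DR: This paper proposes MACE-Dance, a music-driven dance video generation framework based on cascaded Mixture-of-Experts (MoE), in which a Motion Expert and an Appearance Expert handle high-quality 3D dance motion generation and visually consistent video synthesis respectively, reaching SOTA on multiple tasks.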

Details

Motivation: Existing methods struggle to achieve high visual quality and realistic human motion at the same time, the relevant studies are few, and methods from related domains cannot be applied directly to music-driven dance video generation. Method: In the MACE-Dance framework, the Motion Expert uses a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy to generate kinematically plausible and expressive 3D dance motion; the Appearance Expert uses a decoupled kinematic-aesthetic fine-tuning strategy and a reference image to synthesize spatiotemporally coherent video. Result: On a self-curated large-scale, diverse dataset, MACE-Dance reaches SOTA performance in both 3D dance generation and pose-driven image animation, and a newly designed motion-appearance evaluation protocol confirms its overall advantage. Conclusion: MACE-Dance addresses the balance between motion quality and visual consistency in music-driven dance video generation and advances the field. Abstract: With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Project page: https://macedance.github.io/

[94] Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching

Junho Lee,Kwanseok Kim,Joonseok Lee

Main category: cs.CV

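TL;DR: This paper proposes a new 2D simulation for studying the learning dynamics of flow matching generative models in high-dimensional data generation. It reveals the strengths of the Gaussian as a source distribution and the limitations of existing alternatives, and from this analysis derives a framework that combines norm-aligned training with directionally-pruned sampling, improving generation quality and sampling efficiency; the pruning step applies without retraining.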

Details

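Motivation: Although flow matching is widely used for generative modeling, the effect of the source distribution on its performance has not been fully explored; in particular, there is little intuitive understanding of how the geometry of different distributions affects training dynamics in high dimensions. An interpretable analysis framework is therefore needed to expose the key factors in source distribution design. Method: A new 2D simulation is proposed that captures high-dimensional geometric properties, and the training process of flow matching is studied through visualization and dynamics analysis; building on this, the authors design a framework that combines norm-aligned training with directionally-pruned sampling, retaining the omnidirectional supervision of the Gaussian while avoiding initializations in data-sparse regions. Result: Experiments show that the method mitigates mode discrepancy, path entanglement, and norm misalignment, and yields consistent gains in generation quality and sampling efficiency across several flow matching models; the pruning can be applied without retraining. Conclusion: The omnidirectional coverage of the Gaussian is essential for stable learning, and a sound source distribution design should account for both directionality and norm alignment; the proposed pruning framework offers a plug-and-play way to improve existing flow matching models. Abstract: Flow matching has emerged as a powerful generative modeling approach with flexible choices of source distribution. While Gaussian distributions are commonly used, the potential for better alternatives in high-dimensional data generation remains largely unexplored. In this paper, we propose a novel 2D simulation that captures high-dimensional geometric properties in an interpretable 2D setting, enabling us to analyze the learning dynamics of flow matching during training. Based on this analysis, we derive several key insights about flow matching behavior: (1) density approximation can paradoxically degrade performance due to mode discrepancy, (2) directional alignment suffers from path entanglement when overly concentrated, (3) Gaussian's omnidirectional coverage ensures robust learning, and (4) norm misalignment incurs substantial learning costs. Building on these insights, we propose a practical framework that combines norm-aligned training with directionally-pruned sampling. This approach maintains the robust omnidirectional supervision essential for stable flow learning, while eliminating initializations in data-sparse regions during inference. Importantly, our pruning strategy can be applied to any flow matching model trained with a Gaussian source, providing immediate performance gains without the need for retraining. Empirical evaluations demonstrate consistent improvements in both generation quality and sampling efficiency.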
Our findings provide practical insights and guidelines for source distribution design and introduce a readily applicable technique for improving existing flow matching models. Our code is available at https://github.com/kwanseokk/SourceFM.
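
For readers unfamiliar with the role of the source distribution, the sketch below shows how one training pair is commonly formed in flow matching, assuming a linear interpolation path between a source sample `x0` and a data sample `x1`. The abstract does not state which path the authors use, so this is an illustrative assumption, and the function name is hypothetical.

```python
import random

def flow_matching_pair(x0, x1, t):
    """Linear-path flow matching: interpolated point and target velocity.

    x_t = (1 - t) * x0 + t * x1, and the regression target is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

rng = random.Random(0)
dim = 4
x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # sample from a Gaussian source
x1 = [1.0, 2.0, 3.0, 4.0]                       # stand-in for a data sample
x_t, v = flow_matching_pair(x0, x1, t=0.5)
```

Under this assumption, changing how `x0` is drawn changes both the starting points and the magnitude of the target velocities, which is one way to see why the norm and direction of source samples could matter.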

[95] ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection

Janghyun Baek,Mincheol Chang,Seokha Moon,Seung Joon Lee,Jinkyu Kim

Main category: cs.CV

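TL;DR: ALIGN is a new query initialization method that combines LiDAR and image information to make 3D object detection more robust in occluded and crowded scenes, consistently improving existing detectors.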

Details

Motivation: Existing query initialization strategies based on random sampling or BEV heatmaps are inefficient and inaccurate for occluded or densely packed objects, which limits 3D object detection performance. Method: ALIGN has three core components: (i) Occlusion-aware Center Estimation (OCE), which fuses LiDAR geometry with image semantics; (ii) Adaptive Neighbor Sampling (ANS), which builds candidates from LiDAR clustering and supplements them with spatially and semantically aligned points; (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Result: Experiments on nuScenes show that ALIGN consistently improves several state-of-the-art detectors, with gains of up to +0.9 mAP and +1.2 NDS, and is especially strong in occluded and dense scenes. Conclusion: Through more effective object-aware query initialization, ALIGN improves the robustness and accuracy of 3D object detection in complex scenes and offers a new direction for multimodal detection. Abstract: Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies, such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately; (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it; and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.

[96] Multi-Part Object Representations via Graph Structures and Co-Part Discovery

Alex Foo,Wynne Hsu,Mong Li Lee

Main category: cs.CV

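TL;DR: A multi-part object discovery method based on explicit graph representations is proposed; newly introduced benchmarks confirm its robustness in occluded and out-of-distribution settings, and it clearly outperforms existing methods on several kinds of image data.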

Details

Motivation: Existing methods based on implicit representations struggle to recognize multi-part objects in occluded or out-of-distribution contexts, because they rely on indirect training objectives to encode part-whole relations implicitly. Method: A new method is proposed that represents parts with explicit graphs, together with a co-part object discovery algorithm; three new benchmarks are built to evaluate the robustness of multi-part object recognition. Result: Experiments on simulated, realistic, and real-world images show that the method outperforms existing approaches both in the quality of discovered objects and in recognition under occlusion and out-of-distribution conditions, and that it predicts key object properties more accurately in a downstream task. Conclusion: Explicit graph representations improve the robustness and generalization of multi-part object discovery and advance object-centric representation learning for complex scenes. Abstract: Discovering object-centric representations from images can significantly enhance the robustness, sample efficiency and generalizability of vision models. Works on images with multi-part objects typically follow an implicit object representation approach, which fails to recognize these learned objects in occluded or out-of-distribution contexts. This is due to the assumption that object part-whole relations are implicitly encoded into the representations through indirect training objectives. We address this limitation by proposing a novel method that leverages explicit graph representations for parts, and we present a co-part object discovery algorithm. We then introduce three benchmarks to evaluate the robustness of object-centric methods in recognizing multi-part objects within occluded and out-of-distribution settings. Experimental results on simulated, realistic, and real-world images show marked improvements in the quality of discovered objects compared to state-of-the-art methods, as well as the accurate recognition of multi-part objects in occluded and out-of-distribution contexts. We also show that the discovered object-centric representations can more accurately predict key object properties in a downstream task, highlighting the potential of our method to advance the field of object-centric representations.

[97] Unsupervised Anomaly Detection with an Enhanced Teacher for Student-Teacher Feature Pyramid Matching

Mohammad Zolfaghari,Hedieh Sajedi

Main category: cs.CV

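TL;DR: This paper proposes a student-teacher framework for anomaly detection in which the teacher network is enhanced to reach high performance. A ResNet-18 pre-trained on ImageNet is fine-tuned on the MVTec AD dataset, and experiments show better image-level and pixel-level results than previous methods. The proposed ET-STPM model reaches a mean accuracy of 0.971 at the image level and 0.977 at the pixel level.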

Details

Motivation: To improve anomaly detection in unsupervised learning and address the limited image-level and pixel-level detection accuracy of existing methods. Method: A student-teacher framework is used in which a ResNet-18 pre-trained on ImageNet serves as the teacher network and is fine-tuned on the MVTec AD dataset, giving the enhanced-teacher model (ET-STPM). Result: Experiments on MVTec AD show mean accuracies of 0.971 and 0.977 for image-level and pixel-level anomaly detection respectively, better than previous methods. Conclusion: By enhancing the teacher network, the proposed ET-STPM framework markedly improves anomaly detection performance, confirming its effectiveness for high-accuracy detection tasks. Abstract: Anomaly (outlier) detection is one of the challenging subjects in unsupervised learning. This paper introduces a student-teacher framework for anomaly detection whose teacher network is enhanced to achieve high performance. For this purpose, we first pre-train a ResNet-18 network on ImageNet and then fine-tune it on the MVTec AD dataset. Experimental results at the image level and the pixel level demonstrate that this idea achieves better metrics than previous methods. Our model, Enhanced Teacher for Student-Teacher Feature Pyramid (ET-STPM), achieved 0.971 mean accuracy at the image level and 0.977 mean accuracy at the pixel level for anomaly detection.
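
The abstract does not spell out the matching objective. As a rough illustration, the sketch below assumes the formulation commonly used in student-teacher feature pyramid matching, where the anomaly score at a spatial position is half the squared distance between L2-normalized teacher and student feature vectors. The function names are hypothetical, and the enhanced-teacher fine-tuning itself is not shown.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a feature vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def position_anomaly_score(teacher_feat, student_feat):
    """Half the squared distance between normalized feature vectors.

    Small when the student reproduces the teacher's feature direction,
    larger when the two disagree (assumed scoring rule).
    """
    t = l2_normalize(teacher_feat)
    s = l2_normalize(student_feat)
    return 0.5 * sum((a - b) ** 2 for a, b in zip(t, s))

normal = position_anomaly_score([1.0, 0.0], [2.0, 0.0])     # same direction
anomalous = position_anomaly_score([1.0, 0.0], [0.0, 1.0])  # orthogonal features
```

Because both vectors are normalized first, the score depends only on the angle between teacher and student features, not on their magnitudes.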

[98] Multifaceted Exploration of Spatial Openness in Rental Housing: A Big Data Analysis in Tokyo's 23 Wards

Takuya OKi,Yuan Liu

Main category: cs.CV

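TL;DR: This study proposes a framework for quantifying residential spatial openness from 2D and 3D perspectives. Using data from 4,004 rental units in Tokyo's 23 wards, and combining visibility graph analysis with a semantic segmentation model (Mask2Former), it examines how openness relates to rent, building characteristics, and urban redevelopment trends.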

Details

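Motivation: Existing studies mostly examine the factors influencing spatial openness in isolation and lack a multidimensional, integrated quantification method, which makes it hard to understand how openness relates to residential quality and urban dynamics. Method: Openness indicators are built along two dimensions: 2D (planar visibility via visibility graph analysis, VGA) and 3D (semantic segmentation of interior images, using Mask2Former to identify walls, ceilings, floors, and windows); spatiotemporal data are then used to analyze their relationship with rent and housing attributes. Result: Living room visibility increased over time, and overall openness peaked in the 1990s; the 2D and 3D openness indicators were not directly correlated, but higher openness generally corresponded to higher rent; openness, rent, and building characteristics showed partial spatial correlations that reflect urban redevelopment trends; impression scores predicted by existing models were only weakly related to openness, suggesting that interior design and furniture shape perceived space more strongly. Conclusion: The study establishes a new multidimensional, data-driven framework that quantifies residential spatial openness and links it to urban dynamics and market characteristics, supporting improvements in residential quality and design. Abstract: Understanding spatial openness is vital for improving residential quality and design; however, studies often treat its influencing factors separately. This study developed a quantitative framework to evaluate the spatial openness in housing from two-dimensional (2D) and three-dimensional (3D) perspectives. Using data from 4,004 rental units in Tokyo's 23 wards, we examined the temporal and spatial variations in openness and its relationship with rent and housing attributes. 2D openness was computed via planar visibility using visibility graph analysis (VGA) from floor plans, whereas 3D openness was derived from interior images analysed using Mask2Former, a semantic segmentation model that identifies walls, ceilings, floors, and windows. The results showed an increase in living room visibility and a 1990s peak in overall openness. Spatial analyses revealed partial correlations among openness, rent, and building characteristics, reflecting urban redevelopment trends. Although the 2D and 3D openness indicators were not directly correlated, higher openness tended to correspond to higher rent. The impression scores predicted by the existing models were only weakly related to openness, suggesting that the interior design and furniture more strongly shape perceived space. This study offers a new multidimensional data-driven framework for quantifying residential spatial openness and linking it with urban and market dynamics.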

[99] Investigating Spatial Attention Bias in Vision-Language Models

Aryan Chaudhary,Sanchit Goyal,Pratik Narang,Dhruv Kumar

Main category: cs.CV

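TL;DR: This paper identifies and characterizes a systematic spatial attention bias in vision-language models (VLMs) on horizontally concatenated images: they consistently describe the left-hand content first. The bias appears across architectures under neutral prompts and persists despite right-to-left language training, suggesting it stems from model architecture rather than training data or language conventions.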

Details

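Motivation: Although VLMs perform well at visual understanding, systematic biases in their spatial processing have not been studied in depth. This work aims to identify and analyze these potential spatial attention biases and their causes. Method: Controlled experiments on horizontally concatenated image pairs test the behavior of multiple open-source and closed-source VLMs; an Arabic-finetuned model is used to test the influence of reading direction; the annotation guidelines of datasets such as PixMo and Visual Genome are examined to check for data-instruction effects. Result: VLMs describe the left-hand content first in about 97% of cases, and the bias is consistent across model architectures; the Arabic-finetuned model still shows the left bias, ruling out reading direction as the primary cause; the dataset annotation guidelines contain no explicit left-to-right instructions, suggesting the bias more likely arises from architectural design. Conclusion: Current VLMs have a fundamental limitation in how they process spatial information; their systematic left-biased attention reflects an architecture-level problem that challenges the fairness and accuracy of visual reasoning and should be addressed in future work. Abstract: Vision-Language Models have demonstrated remarkable capabilities in understanding visual content, yet systematic biases in their spatial processing remain largely unexplored. This work identifies and characterizes a systematic spatial attention bias where VLMs consistently prioritize describing left-positioned content before right-positioned content in horizontally concatenated images. Through controlled experiments on image pairs using both open-source and closed-source models, we demonstrate that this bias persists across different architectures, with models describing left-positioned content first in approximately 97% of cases under neutral prompting conditions. Testing on an Arabic-finetuned model reveals that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause. Investigation of training dataset annotation guidelines from PixMo and Visual Genome reveals no explicit left-first ordering instructions, suggesting the bias is consistent with architectural factors rather than explicit training data instructions. These findings reveal fundamental limitations in how current VLMs process spatial information.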
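
One simple way to measure a left-first rate is sketched below. The abstract does not describe how descriptions were scored, so the keyword-position heuristic, the toy captions, and the function names are all assumptions made for illustration.

```python
def first_mentioned(caption, left_term, right_term):
    """Return 'left', 'right', or None based on which term appears first."""
    text = caption.lower()
    li = text.find(left_term.lower())
    ri = text.find(right_term.lower())
    if li == -1 and ri == -1:
        return None  # neither object is mentioned; caption is not scorable
    if ri == -1 or (li != -1 and li < ri):
        return "left"
    return "right"

def left_first_rate(samples):
    """Fraction of scorable captions that describe the left image first."""
    outcomes = [first_mentioned(*s) for s in samples]
    scored = [o for o in outcomes if o is not None]
    return sum(o == "left" for o in scored) / len(scored)

# (caption, object in left image, object in right image); toy data only.
samples = [
    ("A dog on grass next to a red car.", "dog", "car"),
    ("A cat sleeping, and beside it a bicycle.", "cat", "bicycle"),
    ("A boat is shown to the right of a tree.", "tree", "boat"),
    ("Two plain images.", "apple", "chair"),
]
rate = left_first_rate(samples)
```

A real evaluation would need a more careful matcher (synonyms, partial descriptions), and swapping the two images in each pair helps separate a positional bias from a content preference.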

[100] Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction

Shahram Najam Syed,Yitian Hu,Yuchao Yao

Main category: cs.CV

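TL;DR: This paper proposes a joint learning framework that couples the optimization of depth, pose, and radiance to achieve large-scale, drift-free, high-fidelity 3D reconstruction from monocular video.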

Details

Motivation: Conventional monocular 3D reconstruction suffers from depth scale ambiguity, pose drift, and the difficulty of modeling a global radiance field in large scenes, which prevents accurate and consistent reconstruction. Method: 1) A Vision-Transformer depth network trained with metric-scale supervision produces globally consistent depth; 2) a multi-scale feature bundle-adjustment (BA) layer refines camera poses in feature space and suppresses long-horizon drift; 3) an incremental hierarchy of local radiance fields dynamically allocates and freezes hash-grid NeRFs to cover large scenes. Result: On the Tanks and Temples benchmark, Absolute Trajectory Error falls to 0.001-0.021 m, up to 18x lower than BARF and 2x lower than NoPe-NeRF, while sub-pixel Relative Pose Error is maintained and novel-view synthesis quality is high. Conclusion: The method shows that metric-scale, drift-free 3D reconstruction and high-fidelity rendering are achievable with a single uncalibrated RGB camera. Abstract: Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space--leveraging learned pyramidal descriptors instead of brittle keypoints--to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on-the-fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor-outdoor sequences--up to 18x lower than BARF and 2x lower than NoPe-NeRF--while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.
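
As a reminder of what the headline metric measures, the sketch below computes Absolute Trajectory Error as the root-mean-square position error between ground-truth and estimated camera positions. It assumes the two trajectories are already expressed in the same coordinate frame; standard evaluations first align them with a rigid or similarity fit, which is omitted here. The positions are toy values and the function name is hypothetical.

```python
import math

def absolute_trajectory_error(gt_positions, est_positions):
    """RMSE of per-frame position error, in the units of the inputs.

    Assumes pre-aligned trajectories of equal length (simplification).
    """
    squared_errors = [
        sum((g - e) ** 2 for g, e in zip(gt, est))
        for gt, est in zip(gt_positions, est_positions)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Three camera positions in metres (illustrative values only).
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.0, 0.03, 0.0), (2.0, 0.0, 0.04)]
ate = absolute_trajectory_error(gt, est)
```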

[101] SG-RIFE: Semantic-Guided Real-Time Intermediate Flow Estimation with Diffusion-Competitive Perceptual Quality

Pan Ben Wong,Chengli Wu,Hanyue Lu

Main category: cs.CV

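TL;DR: SG-RIFE injects semantic priors into RIFE to improve video frame interpolation quality in complex scenes, combining high perceptual quality with near real-time performance.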

Details

Motivation: Existing flow-based methods perform poorly under large motion and occlusion, while diffusion methods deliver high quality but with high latency, making real-time use impractical. Method: Starting from a pre-trained RIFE, a frozen DINOv3 ViT supplies semantic priors; a Split-FAPM module compresses and refines the features, and a DSF module aligns the semantics with pixel-level motion fields, giving a parameter-efficient fine-tuning scheme. Result: Experiments on SNU-FILM confirm that semantic injection markedly improves perceptual quality; SG-RIFE outperforms LDMVFI on FID/LPIPS and reaches quality close to Consec. BB while running significantly faster. Conclusion: Semantic consistency lets flow-based methods approach the perceptual quality of diffusion models in near real time, offering an effective route to high-quality real-time VFI. Abstract: Real-time Video Frame Interpolation (VFI) has long been dominated by flow-based methods like RIFE, which offer high throughput but often fail in complicated scenarios involving large motion and occlusion. Conversely, recent diffusion-based approaches (e.g., Consec. BB) achieve state-of-the-art perceptual quality but suffer from prohibitive latency, rendering them impractical for real-time applications. To bridge this gap, we propose Semantic-Guided RIFE (SG-RIFE). Instead of training from scratch, we introduce a parameter-efficient fine-tuning strategy that augments a pre-trained RIFE backbone with semantic priors from a frozen DINOv3 Vision Transformer. We propose a Split-Fidelity Aware Projection Module (Split-FAPM) to compress and refine high-dimensional features, and a Deformable Semantic Fusion (DSF) module to align these semantic priors with pixel-level motion fields. Experiments on SNU-FILM demonstrate that semantic injection provides a decisive boost in perceptual fidelity. SG-RIFE outperforms diffusion-based LDMVFI in FID/LPIPS and achieves quality comparable to Consec. BB on complex benchmarks while running significantly faster, proving that semantic consistency enables flow-based methods to achieve diffusion-competitive perceptual quality in near real-time.

Xiao He,Chang Tang,Xinwang Liu,Wei Zhang,Zhimin Gao,Chuankun Li,Shaohua Qiu,Jiangfeng Xu

Main category: cs.CV

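TL;DR: A new network, SDCM, is proposed to address the difficulty of object detection in hyperspectral images caused by inter-band inconsistency and redundancy; it improves detection through semantic consistency learning and a spectral discrepancy aware module.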

Details

Motivation: Object detection in hyperspectral images faces challenges from intra- and inter-class similarity, as well as interference such as sensor noise and illumination, which lead to inconsistent features and redundant information. Method: The SDCM network comprises a semantic consistency learning (SCL) module that reduces heterogeneity among bands, a spectral gated generator (SGG) that filters redundant information, and a spectral discrepancy aware (SDA) module that enriches high-level semantic representations. Result: Extensive experiments on two hyperspectral datasets show that the method outperforms existing approaches and reaches the state of the art. Conclusion: SDCM alleviates inter-band inconsistency and redundancy in hyperspectral images and improves the accuracy and robustness of object detection. Abstract: Hyperspectral images with high spectral resolution provide new insights into recognizing subtle differences in similar substances. However, object detection in hyperspectral images faces significant challenges in intra- and inter-class similarity due to the spatial differences in hyperspectral inter-bands and unavoidable interferences, e.g., sensor noise and illumination. To alleviate the hyperspectral inter-band inconsistencies and redundancy, we propose a novel network termed Spectral Discrepancy and Cross-Modal semantic consistency learning (SDCM), which facilitates the extraction of consistent information across a wide range of hyperspectral bands while utilizing the spectral dimension to pinpoint regions of interest. Specifically, we leverage a semantic consistency learning (SCL) module that utilizes inter-band contextual cues to diminish the heterogeneity of information among bands, yielding highly coherent spectral dimension representations. On the other hand, we incorporate a spectral gated generator (SGG) into the framework that filters out the redundant data inherent in hyperspectral information based on the importance of the bands. Then, we design the spectral discrepancy aware (SDA) module to enrich the semantic representation of high-level information by extracting pixel-level spectral features. Extensive experiments on two hyperspectral datasets demonstrate that our proposed method achieves state-of-the-art performance when compared with other ones.

[103] Towards Ancient Plant Seed Classification: A Benchmark Dataset and Baseline Model

Rui Xing,Runmin Cong,Yingying Wu,Can Wang,Zhongming Tang,Fen Wang,Hao Wu,Sam Kwong

Main category: cs.CV

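TL;DR: This paper builds the first Ancient Plant Seed Image Classification (APS) dataset and proposes APSNet, a classification framework that incorporates seed size features. On 17 seed categories from 18 archaeological sites in China it reaches 90.5% accuracy, surpassing existing methods.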

Details

Motivation: Traditional archaeobotany relies on expert knowledge to classify ancient seeds, which is time-consuming and inefficient, while intelligent analysis methods are under-used in this field and dedicated datasets and classification models are lacking. Method: The authors build the APS dataset of 8,340 images and design the APSNet framework, which introduces a Size Perception and Embedding (SPE) module to extract size information that complements fine-grained features, and an Asynchronous Decoupled Decoding (ADD) architecture that decodes features from both channel and spatial perspectives to improve classification. Result: In both quantitative and qualitative analyses, APSNet surpasses existing state-of-the-art image classification methods, reaching a classification accuracy of 90.5%. Conclusion: The APS dataset and APSNet framework provide an effective tool for efficient, systematic classification of ancient plant seeds and advance the use of intelligent methods in archaeobotany. Abstract: Understanding the dietary preferences of ancient societies and their evolution across periods and regions is crucial for revealing human-environment interactions. Seeds, as important archaeological artifacts, represent a fundamental subject of archaeobotanical research. However, traditional studies rely heavily on expert knowledge, which is often time-consuming and inefficient. Intelligent analysis methods have made progress in various fields of archaeology, but there remains a research gap in data and methods in archaeobotany, especially in the classification task of ancient plant seeds. To address this, we construct the first Ancient Plant Seed Image Classification (APS) dataset. It contains 8,340 images from 17 genus- or species-level seed categories excavated from 18 archaeological sites across China. In addition, we design a framework specifically for the ancient plant seed classification task (APSNet), which introduces the scale feature (size) of seeds based on learning fine-grained information to guide the network in discovering key "evidence" for sufficient classification. Specifically, we design a Size Perception and Embedding (SPE) module in the encoder part to explicitly extract size information for the purpose of complementing fine-grained information. We propose an Asynchronous Decoupled Decoding (ADD) architecture based on traditional progressive learning to decode features from both channel and spatial perspectives, enabling efficient learning of discriminative features. In both quantitative and qualitative analyses, our approach surpasses existing state-of-the-art image classification methods, achieving an accuracy of 90.5%.
This demonstrates that our work provides an effective tool for large-scale, systematic archaeological research.

[104] Loom: Diffusion-Transformer for Interleaved Generation

Mingcheng Ye,Jiaming Liu,Yiren Song

Main category: cs.CV

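TL;DR: Loom is a unified diffusion-transformer framework for interleaved text-image generation. Through full-parameter fine-tuning and an architecture that alternates textual and visual embeddings, it supports multi-condition reasoning and sequential planning, and shows superior compositionality, temporal coherence, and text-image alignment on style transfer, compositional generation, and tutorial-like tasks.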

Details

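Motivation: Existing interleaved text-image generation methods struggle to maintain temporal consistency and text-image alignment over long sequences, and fall short on compositionality and efficiency, so a unified model capable of controllable, efficient long-horizon generation is needed. Method: Loom is obtained by full-parameter fine-tuning of the Bagel model and uses an interleaved architecture that alternates textual and visual embeddings; a language planning strategy decomposes the user instruction into stepwise prompts and frame embeddings, and each frame is generated conditioned only on a small set of sampled prior frames plus the global textual context, enabling efficient long-horizon generation. Result: Loom substantially outperforms the open-source baseline Anole on style transfer, compositional generation, and tutorial-like tasks, with an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics, and it also improves over existing unified models and diffusion editing methods on a self-curated 50K interleaved tutorial dataset. Conclusion: With a unified diffusion-transformer architecture and efficient context modeling, Loom achieves high-quality interleaved text-image generation and shows good potential for complex generation tasks that require coordinated text and images. Abstract: Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.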

[105] Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks

Yucheng Fan,Jiawei Chen,Yu Tian,Zhaoxia Yin

Main category: cs.CV

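TL;DR: This paper proposes a new protection method against attribute inference attacks based on vision-language models (VLMs) that suppresses privacy leakage while preserving the visual quality and usability of images, and releases the VPI-COCO benchmark dataset to enable fair evaluation.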

Details

Motivation: As vision-language models become widely adopted, attacks that infer private attributes from social media images are an increasingly serious threat, while existing protection methods often sacrifice image quality or interfere with normal use, failing to balance privacy protection and user experience. Method: The paper proposes a method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint. It also builds the public benchmark VPI-COCO, comprising 522 images with hierarchically structured privacy questions, to support fine-grained, joint evaluation. Result: Experiments show that the method reduces PAR below 25%, keeps NPAR above 88%, maintains high visual consistency, and generalizes well to unseen and paraphrased questions. Conclusion: The proposed method protects user privacy while preserving image utility and visual quality; together with the VPI-COCO benchmark, it offers a practical and measurable solution for privacy protection in VLM settings. Abstract: As vision-language models (VLMs) become widely adopted, VLM-based attribute inference attacks have emerged as a serious privacy concern, enabling adversaries to infer private attributes from images shared on social media. This escalating threat calls for dedicated protection methods to safeguard user privacy. However, existing methods often degrade the visual quality of images or interfere with vision-based functions on social media, thereby failing to achieve a desirable balance between privacy protection and user experience. To address this challenge, we propose a novel protection method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint. While our method is conceptually effective, fair comparisons between methods remain challenging due to the lack of publicly available evaluation datasets. To fill this gap, we introduce VPI-COCO, a publicly available benchmark comprising 522 images with hierarchically structured privacy questions and corresponding non-private counterparts, enabling fine-grained and joint evaluation of protection methods in terms of privacy preservation and user experience. Building upon this benchmark, experiments on multiple VLMs demonstrate that our method effectively reduces PAR below 25%, keeps NPAR above 88%, maintains high visual consistency, and generalizes well to unseen and paraphrased privacy questions, demonstrating its strong practical applicability for real-world VLM deployments.

[106] Building UI/UX Dataset for Dark Pattern Detection and YOLOv12x-based Real-Time Object Recognition Detection System

Se-Young Jang,Su-Yeon Yoon,Jae-Woong Jung,Dong-Hun Lee,Seong-Hun Choi,Soo-Kyung Jun,Yu-Bin Kim,Young-Seon Ju,Kyounggon Kim

Main category: cs.CV

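TL;DR: This paper proposes a YOLOv12x-based visual dark pattern detection framework, builds a dedicated dataset of 4,066 screenshots, and achieves 92.8% mAP@50 at a real-time speed of 40.5 FPS.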

Details

Motivation: As digital platforms evolve, dark patterns (interface designs that mislead user decisions) are an increasingly serious problem, yet existing regulatory approaches are mostly reactive and efficient real-time detection technology is lacking. Method: The authors collected 4,066 UI/UX screenshots containing dark patterns from 194 websites and annotated five representative component types (buttons, checkboxes, and others) to build a dedicated dataset; they then trained a YOLOv12x model with transfer learning for fast, accurate visual detection. Result: The method reaches 92.8% mAP@50 at an inference speed of 40.5 FPS, showing good real-time performance and practicality; the dataset has been publicly released to support further research. Conclusion: The framework detects dark patterns effectively with both high accuracy and real-time speed, making it suitable for deployment in real online environments, and the public dataset should help advance the field. Abstract: With the accelerating pace of digital transformation and the widespread adoption of online platforms, both social and technical concerns regarding dark patterns (user interface designs that undermine users' ability to make informed and rational choices) have become increasingly prominent. As corporate online platforms grow more sophisticated in their design strategies, there is a pressing need for proactive and real-time detection technologies that go beyond the predominantly reactive approaches employed by regulatory authorities. In this paper, we propose a visual dark pattern detection framework that improves both detection accuracy and real-time performance. To this end, we constructed a proprietary visual object detection dataset by manually collecting 4,066 UI/UX screenshots containing dark patterns from 194 websites across six major industrial sectors in South Korea and abroad. The collected images were annotated with five representative UI components commonly associated with dark patterns: Button, Checkbox, Input Field, Pop-up, and QR Code. This dataset has been publicly released to support further research and development in the field. To enable real-time detection, this study adopted the YOLOv12x object detection model and applied transfer learning to optimize its performance for visual dark pattern recognition. Experimental results demonstrate that the proposed approach achieves a high detection accuracy of 92.8% in terms of mAP@50, while maintaining a real-time inference speed of 40.5 frames per second (FPS), confirming its effectiveness for practical deployment in online environments. Furthermore, to facilitate future research and contribute to technological advancements, the dataset constructed in this study has been made publicly available at https://github.com/B4E2/B4E2-DarkPattern-YOLO-DataSet.
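For readers unfamiliar with the reported metric: in mAP@50, a predicted box counts as a correct detection only if it has the right class and its intersection over union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that matching test follows; the helper names are illustrative and not taken from the paper's code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, truth_box, threshold=0.5):
    """Geometric part of the mAP@50 match (class agreement is checked separately)."""
    return box_iou(pred_box, truth_box) >= threshold

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1 over union area 7
```

Average precision is then computed per class from the ranked list of true and false positives, and mAP is the mean over the five component classes.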

[107] UniMPR: A Unified Framework for Multimodal Place Recognition with Arbitrary Sensor Configurations

Zhangshuo Qi,Jingyi Xu,Luqi Cheng,Shichen Wen,Yiming Ma,Guangming Xiong

Main category: cs.CV

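TL;DR: This paper proposes UniMPR, a unified multimodal place recognition framework that adapts to arbitrary modality combinations across different sensor configurations and environmental conditions, and achieves state-of-the-art performance on seven datasets.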

Details

Motivation: Existing multimodal place recognition methods struggle to adapt dynamically to different input modalities, to remain robust when modalities are missing or degraded, and to generalize across diverse sensor configurations. Method: All inputs are unified in a polar BEV feature space, and a multi-branch network extracts intra-modal and inter-modal features; a large-scale training set is constructed and an adaptive label assignment strategy is used for pre-training. Result: Experiments on seven datasets show that UniMPR reaches state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Conclusion: UniMPR is a unified and robust multimodal place recognition framework that adapts flexibly to a range of practical scenarios and generalizes well. Abstract: Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to arbitrary modality inputs within a unified framework, (2) maintaining robustness with missing or degraded modalities, and (3) generalizing across diverse sensor configurations and setups. In this paper, we propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities (e.g., camera, LiDAR, radar). To tackle the data heterogeneity, we unify all inputs within a polar BEV feature space. Subsequently, the polar BEVs are fed into a multi-branch network to exploit discriminative intra-modal and inter-modal features from any modality combinations. To fully exploit the network's generalization capability and robustness, we construct a large-scale training set from multiple datasets and introduce an adaptive label assignment strategy for extensive pre-training. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Our code will be released at https://github.com/QiZS-BIT/UniMPR.
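To make the idea of a polar bird's-eye-view (BEV) grid concrete, the sketch below bins 3D points into range rings and angular sectors around the sensor. It is only an illustration of the coordinate convention: the grid sizes, the maximum range, and the use of point counts as the cell value are assumptions, and the paper's learned feature extraction is not reproduced here.

```python
import math

def polar_bev_occupancy(points, n_rings=4, n_sectors=8, max_range=40.0):
    """Count points per (ring, sector) cell of a polar BEV grid.

    points: iterable of (x, y, z) in the sensor frame, in metres.
    Grid sizes and range are illustrative, not values from the paper.
    """
    grid = [[0] * n_sectors for _ in range(n_rings)]
    for x, y, _z in points:
        r = math.hypot(x, y)
        if r >= max_range:
            continue  # outside the grid
        theta = math.atan2(y, x) % (2 * math.pi)  # angle in [0, 2*pi)
        ring = min(int(r / max_range * n_rings), n_rings - 1)
        sector = min(int(theta / (2 * math.pi) * n_sectors), n_sectors - 1)
        grid[ring][sector] += 1
    return grid

cloud = [(1.0, 0.0, 0.0), (0.0, 30.0, 0.5), (-10.0, 0.0, 1.0), (100.0, 0.0, 0.0)]
for row in polar_bev_occupancy(cloud):
    print(row)
```

Because any range sensor output can be expressed as points (or pixels lifted to points) around the vehicle, a shared polar grid gives camera, LiDAR, and radar branches inputs of the same shape.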

[108] Pyramidal Adaptive Cross-Gating for Multimodal Detection

Zidong Gu,Shoufu Tian

Main category: cs.CV

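TL;DR: The paper proposes PACGNet, a new network for object detection in aerial imagery that performs deep fusion within the backbone to address cross-modal noise and disruption of the feature pyramid structure, achieving new state-of-the-art results on the DroneVehicle and VEDAI datasets.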

Details

Motivation: Existing methods rely on simple strategies for multimodal feature fusion, which tend to introduce cross-modal noise and disrupt the feature pyramid structure, hurting fine-grained detection of small objects. Method: PACGNet comprises a Symmetrical Cross-Gating (SCG) module and a Pyramidal Feature-aware Multimodal Gating (PFMG) module. SCG uses a bidirectional, symmetrical "horizontal" gating mechanism to suppress noise and preserve semantic integrity; PFMG rebuilds the feature hierarchy through a progressive hierarchical gating mechanism, using the higher-resolution level to guide fusion at the lower-resolution level so that details are preserved. Result: On the DroneVehicle and VEDAI datasets, mAP50 reaches 81.7% and 82.1% respectively, outperforming existing methods. Conclusion: By fusing deeply within the backbone, PACGNet improves detection of small objects in aerial imagery and offers a better architectural design for multimodal fusion. Abstract: Object detection in aerial imagery is a critical task in applications such as UAV reconnaissance. Although existing methods have extensively explored feature interaction between different modalities, they commonly rely on simple fusion strategies for feature aggregation. This introduces two critical flaws: it is prone to cross-modal noise and disrupts the hierarchical structure of the feature pyramid, thereby impairing the fine-grained detection of small objects. To address this challenge, we propose the Pyramidal Adaptive Cross-Gating Network (PACGNet), an architecture designed to perform deep fusion within the backbone. To this end, we design two core components: the Symmetrical Cross-Gating (SCG) module and the Pyramidal Feature-aware Multimodal Gating (PFMG) module. The SCG module employs a bidirectional, symmetrical "horizontal" gating mechanism to selectively absorb complementary information, suppress noise, and preserve the semantic integrity of each modality. The PFMG module reconstructs the feature hierarchy via a progressive hierarchical gating mechanism. This leverages the detailed features from a preceding, higher-resolution level to guide the fusion at the current, lower-resolution level, effectively preserving fine-grained details as features propagate. Through evaluations conducted on the DroneVehicle and VEDAI datasets, our PACGNet sets a new state-of-the-art benchmark, with mAP50 scores reaching 81.7% and 82.1% respectively.

[109] MatE: Material Extraction from Single-Image via Geometric Prior

Zeyu Zhang,Wei Zhai,Jian Yang,Yang Cao

Main category: cs.CV

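TL;DR: This paper proposes MatE, a new method that generates high-quality, tileable physically-based rendering (PBR) materials from a single image captured under unconstrained real-world conditions. It combines depth estimation with a dual-branch diffusion model to rectify geometric distortion and produce a complete set of material maps, and is invariant to changes in illumination and viewpoint.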

Details

Motivation: Creating high-fidelity PBR materials usually requires specialized equipment and expert post-processing, which limits wider adoption. The paper aims to lower this barrier by generating high-quality materials from ordinary images, making material creation broadly accessible. Method: MatE first performs coarse rectification using an estimated depth map as a geometric prior, then applies a dual-branch diffusion model that learns consistency from rotation-aligned and scale-aligned training data to correct residual distortion and generate a complete set of material maps, including albedo, normal, roughness, and height. Result: Experiments on synthetic and real-world data show that MatE effectively recovers intrinsic material properties and generates realistic, tileable PBR materials, and that it is robust and invariant to the unknown illumination and perspective of the input image. Conclusion: MatE offers a workable way to automatically generate high-quality PBR materials from casually captured images, considerably simplifying material creation in graphics pipelines. Abstract: The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectifies residual distortions from the coarse result and translates it into a complete set of material maps, including albedo, normal, roughness, and height. Our framework achieves invariance to the unknown illumination and perspective of the input image, allowing for the recovery of intrinsic material properties from casual captures. Through comprehensive experiments on both synthetic and real-world data, we demonstrate the efficacy and robustness of our approach, enabling users to create realistic materials from real-world images.

[110] MatSpray: Fusing 2D Material World Knowledge on 3D Geometry

Philipp Langsteiner,Jan-Niklas Dihlmann,Hendrik P. A. Lensch

Main category: cs.CV

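TL;DR: The paper proposes a framework that fuses 2D material maps into 3D geometry: it reconstructs geometry with Gaussian Splatting, generates PBR material maps with a diffusion model, and combines learning-based and projection-based methods to produce high-quality, relightable 3D scene renderings.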

Details

Motivation: Existing 3D reconstruction methods perform poorly in relighting scenarios because they lack precise, spatially varying material parameters, and mapping 2D materials onto 3D geometry remains challenging. Method: Scene geometry is reconstructed with Gaussian Splatting; a diffusion model generates 2D albedo, roughness, and metallic maps from the input images; the material parameters are integrated into the 3D representation either by optimizing an image-based loss or by direct projection with Gaussian ray tracing; and a lightweight neural network (Neural Merger) improves fine-scale accuracy and multi-view consistency. Result: The method outperforms existing techniques in both quantitative metrics and visual realism, producing more accurate, relightable, and photorealistic renderings. Conclusion: The framework effectively fuses 2D material predictions with 3D geometry and significantly improves realism and efficiency in asset creation workflows, making it suitable for the film and game industries. Abstract: Manual modeling of material parameters and 3D geometry is a time-consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.

[111] A two-stream network with global-local feature fusion for bone age assessment

Qiong Lou,Han Yang,Fang Lu

Main category: cs.CV

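TL;DR: The paper proposes BoNet+, a model based on a two-stream deep learning architecture for automated bone age assessment that combines global and local feature extraction and achieves accuracy comparable to state-of-the-art methods on the RSNA and RHPE datasets.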

Details

Motivation: Existing bone age assessment methods struggle to balance global features with local skeletal details, and a more effective automated system is needed to improve accuracy and reduce clinical workload. Method: BoNet+ has two branches: the global feature extraction branch adds a Transformer module to strengthen global feature learning, and the local feature extraction branch uses an RFAConv module to generate adaptive attention maps within multiscale receptive fields; the two sets of features are concatenated and then optimized by an Inception-V3 network. Result: The model achieves mean absolute errors (MAEs) of 3.81 and 5.65 months on the RSNA and RHPE test sets, respectively, comparable to the state of the art. Conclusion: BoNet+ enables automatic, high-precision, and more objective bone age assessment, reduces clinical workload, and shows good prospects for clinical use. Abstract: Bone Age Assessment (BAA) is a widely used clinical technique that can accurately reflect an individual's growth and development level, as well as maturity. In recent years, although deep learning has advanced the field of bone age assessment, existing methods face challenges in efficiently balancing global features and local skeletal details. This study aims to develop an automated bone age assessment system based on a two-stream deep learning architecture to achieve higher accuracy in bone age assessment. We propose the BoNet+ model incorporating global and local feature extraction channels. A Transformer module is introduced into the global feature extraction channel to enhance global feature extraction through a multi-head self-attention mechanism. An RFAConv module is incorporated into the local feature extraction channel to generate adaptive attention maps within multiscale receptive fields, enhancing local feature extraction capabilities. Global and local features are concatenated along the channel dimension and optimized by an Inception-V3 network. The proposed method has been validated on the Radiological Society of North America (RSNA) and Radiological Hand Pose Estimation (RHPE) test datasets, achieving mean absolute errors (MAEs) of 3.81 and 5.65 months, respectively. These results are comparable to the state-of-the-art. The BoNet+ model reduces the clinical workload and achieves automatic, high-precision, and more objective bone age assessment.

[112] MCVI-SANet: A lightweight semi-supervised model for LAI and SPAD estimation of winter wheat under vegetation index saturation

Zhiheng Zhang,Jiajun Yang,Hong Sun,Dong Wang,Honghua Jiang,Yaru Chen,Tangyuan Ning

Main category: cs.CV

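TL;DR: The paper proposes MCVI-SANet, a lightweight semi-supervised vision model that addresses vegetation index saturation and scarce annotations in LAI and SPAD estimation for winter wheat. It combines a vegetation index saturation-aware module with a VICReg-based semi-supervised strategy and achieves superior accuracy and generalization.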

Details

Motivation: Vegetation index saturation during the dense canopy stage and limited ground-truth annotations for winter wheat constrain accurate estimation of leaf area index (LAI) and SPAD; existing methods have limited feature expressiveness, and deep learning models suffer from domain gaps and high data demands. Method: The proposed MCVI-SANet includes a newly designed Vegetation Index Saturation-Aware Block (VI-SABlock) for adaptive channel-spatial feature enhancement, combined with a VICReg-based semi-supervised learning strategy; datasets are partitioned using plant height information to keep growth stages well represented. Result: Over 10 repeated runs, MCVI-SANet reaches an average R² of 0.8123 (RMSE = 0.4796) for LAI and an average R² of 0.6846 (RMSE = 2.4222) for SPAD, improving on the best baselines by 8.95% and 8.17% respectively; the model has only 0.10M parameters and maintains high inference speed. Conclusion: Combining semi-supervised learning with agronomic priors can improve the generalization and practicality of remote sensing-based precision agriculture models, and MCVI-SANet offers an efficient way to handle vegetation index saturation and data scarcity. Abstract: Vegetation index (VI) saturation during the dense canopy stage and limited ground-truth annotations of winter wheat constrain accurate estimation of LAI and SPAD. Existing VI-based and texture-driven machine learning methods exhibit limited feature expressiveness. In addition, deep learning baselines suffer from domain gaps and high data demands, which restrict their generalization. Therefore, this study proposes the Multi-Channel Vegetation Indices Saturation Aware Net (MCVI-SANet), a lightweight semi-supervised vision model. The model incorporates a newly designed Vegetation Index Saturation-Aware Block (VI-SABlock) for adaptive channel-spatial feature enhancement. It also integrates a VICReg-based semi-supervised strategy to further improve generalization. Datasets were partitioned using a vegetation height-informed strategy to maintain representativeness across growth stages. Experiments over 10 repeated runs demonstrate that MCVI-SANet achieves state-of-the-art accuracy. The model attains an average R² of 0.8123 and RMSE of 0.4796 for LAI, and an average R² of 0.6846 and RMSE of 2.4222 for SPAD. This performance surpasses the best-performing baselines, with improvements of 8.95% in average LAI R² and 8.17% in average SPAD R². Moreover, MCVI-SANet maintains high inference speed with only 0.10M parameters. Overall, the integration of semi-supervised learning with agronomic priors provides a promising approach for enhancing remote sensing-based precision agriculture.
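The two reported regression metrics can be computed as in the short sketch below, which uses the standard definitions of the coefficient of determination (R²) and root mean squared error (RMSE). The sample values are made up for illustration and are not data from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical LAI measurements and predictions.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
print(r2_score(measured, predicted), rmse(measured, predicted))
```

RMSE is in the units of the target (LAI is dimensionless, SPAD is in SPAD units), so the two RMSE values in the abstract are not directly comparable with each other, whereas the R² values are.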

Dunxing Zhang,Jiachen Lu,Han Yang,Lei Bao,Bo Song

Main category: cs.CV

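TL;DR: The paper proposes ESSC-RM, a plug-and-play framework for semantic scene completion (SSC) whose two-phase refinement mechanism improves performance on multiple baseline models.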

Details

Motivation: Existing SSC models produce coarse predictions, and an effective way to improve detail and accuracy is needed. Method: ESSC-RM contains two modules, a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and a Voxel-level Local Geometry Module (VLGM), which refine the output of a baseline model under multiscale supervision. Result: On SemanticKITTI, integrating ESSC-RM raises the mIoU of CGFormer from 16.87% to 17.27% and that of MonoScene from 11.08% to 11.51%. Conclusion: ESSC-RM is a general and effective enhancement framework for SSC models that can be applied broadly to existing methods to improve performance. Abstract: We propose ESSC-RM, a plug-and-play Enhancing framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing SSC models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) under multiscale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively. These results demonstrate that ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models.
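The mean IoU quoted above averages per-class intersection over union across semantic classes. The sketch below shows the computation on a flat list of voxel labels. It is a simplified illustration: benchmark evaluations typically accumulate counts over the whole test set and exclude empty or ignored voxels, details that are assumed away here.

```python
def mean_iou(pred, truth, num_classes):
    """Mean per-class IoU over flat lists of voxel class labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with five voxels and three classes.
print(mean_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3))
```

Because every class contributes equally regardless of how many voxels it covers, gains of a few tenths of a point in mIoU, like those reported, often reflect improvements on rare classes.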

[114] Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance

Badr Moufad,Navid Bagheri Shouraki,Alain Oliviero Durmus,Thomas Hirtz,Eric Moulines,Jimmy Olsson,Yazid Janati

Main category: cs.CV

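TL;DR: The paper proposes a new zero-shot image editing method that designs simple Gaussian posterior transitions to avoid backpropagation, significantly reducing inference cost while maintaining high-quality reconstructions.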

Details

Motivation: Existing zero-shot diffusion methods rely on surrogate likelihood functions and frequent vector-Jacobian product computations, which incur large memory and runtime overhead. Method: A new likelihood surrogate is designed so that the posterior transitions are Gaussian distributions that are easy to sample, bypassing backpropagation through the denoiser network. Result: Experiments show strong observation consistency and reconstruction quality together with significantly lower inference cost. Conclusion: The proposed method substantially improves the efficiency of zero-shot diffusion models on image editing tasks without sacrificing generation quality. Abstract: Diffusion models have emerged as powerful priors for image editing tasks such as inpainting and local modification, where the objective is to generate realistic content that remains consistent with observed regions. In particular, zero-shot approaches that leverage a pretrained diffusion model, without any retraining, have been shown to achieve highly effective reconstructions. However, state-of-the-art zero-shot methods typically rely on a sequence of surrogate likelihood functions, whose scores are used as proxies for the ideal score. This procedure, however, requires vector-Jacobian products through the denoiser at every reverse step, introducing significant memory and runtime overhead. To address this issue, we propose a new likelihood surrogate that yields simple and efficient to sample Gaussian posterior transitions, sidestepping the backpropagation through the denoiser network. Our extensive experiments show that our method achieves strong observation consistency compared with fine-tuned baselines and produces coherent, high-quality reconstructions, all while significantly reducing inference cost. Code is available at https://github.com/YazidJanati/ding.
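To see why a Gaussian posterior transition is cheap, consider the textbook case of a per-pixel Gaussian prior combined with a noisy observation of the known pixels: the posterior is again Gaussian with a closed-form mean and variance, so sampling needs no gradient through any network. The sketch below implements only this standard conjugate update. It is not the paper's surrogate likelihood, and all names and parameters are illustrative assumptions.

```python
import random

def posterior_params(prior_mean, prior_var, obs, obs_var):
    """Posterior of x given y = x + noise, with x ~ N(prior_mean, prior_var)
    and noise ~ N(0, obs_var). Returns (mean, variance)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

def sample_inpainting_step(prior_means, prior_var, obs, mask, obs_var, rng):
    """Sample each pixel: observed pixels (mask 1) use the posterior,
    masked-out pixels (mask 0) fall back to the prior."""
    out = []
    for m, y, seen in zip(prior_means, obs, mask):
        mean, var = posterior_params(m, prior_var, y, obs_var) if seen else (m, prior_var)
        out.append(rng.gauss(mean, var ** 0.5))
    return out

rng = random.Random(0)
print(posterior_params(0.0, 1.0, 2.0, 1.0))
print(sample_inpainting_step([0.0, 0.0], 1.0, [2.0, 0.0], [1, 0], 0.01, rng))
```

In a diffusion sampler the prior mean would come from the denoiser's prediction at the current step; the cost saving described in the abstract comes from needing only that forward pass and not its Jacobian.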

[115] RecurGS: Interactive Scene Modeling via Discrete-State Recurrent Gaussian Fusion

Wenhao Hu,Haonan Zhou,Zesheng Li,Liu Liu,Jiacheng Dong,Zhizhong Su,Gaoang Wang

Main category: cs.CV

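TL;DR: This paper proposes RecurGS, a recurrent-fusion framework for dynamic 3D scene representation that incrementally integrates discrete Gaussian scene states and supports object-level interaction and novel view synthesis.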

Details

Motivation: Existing methods struggle to support multi-state fusion and interactive updates when handling discrete scene changes and are prone to catastrophic forgetting, which limits their use in dynamic environments. Method: RecurGS aligns object motion between consecutive states using semantic correspondence and Lie-algebra based SE(3) optimization, preserves historical structures through a recurrent update mechanism with replay supervision, and uses a voxelized, visibility-aware fusion module to selectively update newly observed regions. Result: Experiments on synthetic and real-world datasets show that RecurGS substantially improves update efficiency while maintaining high-fidelity rendering, and supports object-level manipulation and synthesis of novel states without additional scans. Conclusion: RecurGS provides a scalable solution for building continuously interactive Gaussian 3D worlds and advances dynamic scene modeling and robot interaction. Abstract: Recent advances in 3D scene representations have enabled high-fidelity novel view synthesis, yet adapting to discrete scene changes and constructing interactive 3D environments remain open challenges in vision and robotics. Existing approaches focus solely on updating a single scene without supporting novel-state synthesis. Others rely on diffusion-based object-background decoupling that works on one state at a time and cannot fuse information across multiple observations. To address these limitations, we introduce RecurGS, a recurrent fusion framework that incrementally integrates discrete Gaussian scene states into a single evolving representation capable of interaction. RecurGS detects object-level changes across consecutive states, aligns their geometric motion using semantic correspondence and Lie-algebra based SE(3) refinement, and performs recurrent updates that preserve historical structures through replay supervision. A voxelized, visibility-aware fusion module selectively incorporates newly observed regions while keeping stable areas fixed, mitigating catastrophic forgetting and enabling efficient long-horizon updates. RecurGS supports object-level manipulation, synthesizes novel scene states without requiring additional scans, and maintains photorealistic fidelity across evolving environments. Extensive experiments across synthetic and real-world datasets demonstrate that our framework delivers high-quality reconstructions with substantially improved update efficiency, providing a scalable step toward continuously interactive Gaussian worlds.
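As a small illustration of what an SE(3) object motion does to a Gaussian's center, the sketch below rotates a point about a unit axis with Rodrigues' formula and then translates it. The axis-angle vector is the rotational part of the Lie-algebra parameterization mentioned in the abstract. This is a simplification made for clarity: the translation is applied directly instead of through the full se(3) exponential map, and the function name and arguments are illustrative, not from the paper.

```python
import math

def se3_apply(axis, angle, translation, point):
    """Rotate `point` about unit vector `axis` by `angle` radians
    (Rodrigues' formula), then add `translation`."""
    kx, ky, kz = axis
    px, py, pz = point
    c, s = math.cos(angle), math.sin(angle)
    dot = kx * px + ky * py + kz * pz
    cross = (ky * pz - kz * py, kz * px - kx * pz, kx * py - ky * px)
    rotated = tuple(
        p * c + cr * s + k * dot * (1.0 - c)
        for p, cr, k in zip(point, cross, axis)
    )
    return tuple(r + t for r, t in zip(rotated, translation))

# A quarter turn about the z-axis followed by a unit shift along x.
print(se3_apply((0.0, 0.0, 1.0), math.pi / 2, (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
```

Optimizing over the six axis-angle and translation parameters keeps the transform rigid by construction, which is the usual motivation for Lie-algebra parameterizations in pose refinement.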

[116] Automated Mosaic Tesserae Segmentation via Deep Learning Techniques

Charilaos Kapelonis,Marios Antonakakis,Konstantinos Politof,Aristomenis Antoniadis,Michalis Zervakis

Main category: cs.CV

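TL;DR: This paper proposes a method based on Meta AI's SAM 2 model for automatically segmenting the small pieces (tesserae) in mosaic images. Fine-tuned on a newly annotated dataset, it clearly outperforms existing methods on several metrics and advances the digital preservation of mosaics.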

Details

Motivation: Because mosaic artworks are old and fragile, digital preservation is urgently needed; however, public datasets in this area are limited and traditional image segmentation methods perform inadequately, so a more effective automated segmentation approach is required. Method: The authors take the SAM 2 foundation model from Meta AI and fine-tune it on a purpose-built annotated dataset of mosaic images to segment tesserae accurately. Result: On the test set, IoU rises from 89.00% to 91.02% and recall from 92.12% to 95.89%; on a benchmark from prior work, the F-measure improves by 3% and the absolute error falls from 0.20 to 0.02. Conclusion: The fine-tuned SAM 2 model performs very well on mosaic image segmentation and, together with the newly built dataset, may enable real-time segmentation of mosaic images, providing effective technical support for cultural heritage digitization. Abstract: Art is widely recognized as a reflection of civilization and mosaics represent an important part of cultural heritage. Mosaics are an ancient art form created by arranging small pieces, called tesserae, on a surface using adhesive. Due to their age and fragility, they are prone to damage, highlighting the need for digital preservation. This paper addresses the problem of digitizing mosaics by segmenting the tesserae to separate them from the background within the broader field of Image Segmentation in Computer Vision. We propose a method leveraging Segment Anything Model 2 (SAM 2) by Meta AI, a foundation model that outperforms most conventional segmentation models, to automatically segment mosaics. Due to the limited open datasets in the field, we also create an annotated dataset of mosaic images to fine-tune and evaluate the model. Quantitative evaluation on our testing dataset shows notable improvements compared to the baseline SAM 2 model, with Intersection over Union increasing from 89.00% to 91.02% and Recall from 92.12% to 95.89%. Additionally, on a benchmark proposed by a prior approach, our model achieves an F-measure 3% higher than previous methods and reduces the error in the absolute difference between predicted and actual tesserae from 0.20 to just 0.02. The notable performance of the fine-tuned SAM 2 model together with the newly annotated dataset can pave the way for real-time segmentation of mosaic images.

[117] Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval

Dimitrios Georgoulopoulos,Nikolaos Chaidos,Angeliki Dimitriou,Giorgos Stamou

Main category: cs.CV

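TL;DR: This paper proposes PRISm, an image retrieval framework based on importance prediction over semantic graphs, which uses an importance prediction module and an edge-aware graph neural network to improve semantic alignment and interpretability in multimodal image retrieval.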

Details

Motivation: Traditional methods struggle to capture the relational and contextual details of a scene, which leads to poor retrieval of semantically similar images. Method: The PRISm framework has two core components: an Importance Prediction Module that selects the key objects and relational triplets, and an Edge-Aware Graph Neural Network that encodes relational structure and integrates global visual features to produce semantic image embeddings. Result: Experiments on several benchmark and real-world datasets show that PRISm clearly outperforms existing methods in top-ranked retrieval performance and accurately captures key objects and their interactions. Conclusion: By explicitly modeling the semantic importance of objects and their interactions, PRISm achieves interpretable semantic image retrieval that aligns more closely with human perception. Abstract: Accurately retrieving images that are semantically similar remains a fundamental challenge in computer vision, as traditional methods often fail to capture the relational and contextual nuances of a scene. We introduce PRISm (Pruning-based Image Retrieval via Importance Prediction on Semantic Graphs), a multimodal framework that advances image-to-image retrieval through two novel components. First, the Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image while pruning irrelevant elements. Second, the Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings. PRISm achieves image retrieval that closely aligns with human perception by explicitly modeling the semantic importance of objects and their interactions, capabilities largely absent in prior approaches. Its architecture effectively combines relational reasoning with visual representation, enabling semantically grounded retrieval. Extensive experiments on benchmark and real-world datasets demonstrate consistently superior top-ranked performance, while qualitative analyses show that PRISm accurately captures key objects and interactions, producing interpretable and semantically meaningful results.

[118] AmPLe: Supporting Vision-Language Models via Adaptive-Debiased Ensemble Multi-Prompt Learning

Fei Song,Yi Li,Jiangmeng Li,Rui Wang,Changwen Zheng,Fanjiang Xu,Hui Xiong

Main category: cs.CV

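TL;DR: This paper proposes Adaptive-Debiased Ensemble Multi-Prompt Learning (AmPLe), which mitigates model-prompt matching bias and sample-prompt matching bias in multi-prompt learning for vision-language models. Guided by information theory, it extracts prompt-relevant semantics and achieves better downstream performance.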

Details

Motivation: Existing multi-prompt learning methods overlook two problems that limit performance: the same prompt can carry inconsistent semantics across different vision-language models (model-prompt matching bias), and input samples contain prompt-irrelevant semantics that interfere with prediction (sample-prompt matching bias). Method: AmPLe uses ensemble learning to aggregate the predictions of multiple models and, guided by an information-theoretic analysis, extracts prompt-relevant semantics from input samples to adaptively compute debiased ensemble weights, mitigating both biases at once. Result: On three tasks (generalization to novel classes, new target datasets, and unseen domain shifts), AmPLe clearly outperforms existing methods, and a theoretical analysis from a causal perspective supports its effectiveness. Conclusion: AmPLe effectively mitigates model-prompt and sample-prompt matching bias in multi-prompt learning and improves the adaptation of vision-language models under limited resources. Abstract: Multi-prompt learning methods have emerged as an effective approach for facilitating the rapid adaptation of vision-language models to downstream tasks with limited resources. Existing multi-prompt learning methods primarily focus on utilizing various meticulously designed prompts within a single foundation vision-language model to achieve superior performance. However, the overlooked model-prompt matching bias hinders the development of multi-prompt learning, i.e., the same prompt can convey different semantics across distinct vision-language models, such as CLIP-ViT-B/16 and CLIP-ViT-B/32, resulting in inconsistent predictions for an identical prompt. To mitigate the impact of this bias on downstream tasks, we explore an ensemble learning approach to sufficiently aggregate the benefits of diverse predictions. Additionally, we further disclose the presence of sample-prompt matching bias, which originates from the prompt-irrelevant semantics encapsulated in the input samples. Thus, directly utilizing all information from the input samples for generating weights of ensemble learning can lead to suboptimal performance. In response, we extract prompt-relevant semantics from input samples by leveraging the guidance of the information theory-based analysis, adaptively calculating debiased ensemble weights. Overall, we propose Adaptive-Debiased Ensemble Multi-Prompt Learning, abbreviated as AmPLe, to mitigate the two types of bias simultaneously. Extensive experiments on three representative tasks, i.e., generalization to novel classes, new target datasets, and unseen domain shifts, show that AmPLe can widely outperform existing methods. Theoretical validation from a causal perspective further supports the effectiveness of AmPLe.
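The general shape of a sample-adaptive ensemble can be sketched as follows: each model produces class probabilities, and a per-sample score for each model is turned into weights with a softmax. How AmPLe actually derives its debiased weights is not specified in the abstract, so the `relevance_scores` input below is a hypothetical stand-in and the code should be read as a generic weighted ensemble, not the paper's method.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ensemble(model_probs, relevance_scores):
    """Combine per-model class probabilities with softmax weights.

    model_probs: one probability vector per model, same class order.
    relevance_scores: hypothetical per-model scores for this sample.
    """
    weights = softmax(relevance_scores)
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs))
        for c in range(n_classes)
    ]

# Two models (for example two CLIP backbones) disagreeing on one sample.
print(weighted_ensemble([[0.8, 0.2], [0.4, 0.6]], [0.0, 0.0]))
```

With equal scores this reduces to plain averaging; the interesting part in any adaptive scheme is how the scores are computed per sample.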

[119] E-RGB-D: Real-Time Event-Based Perception with Structured Light

Seyed Ehsan Marjani Bajestani,Giovanni Beltrame

Main category: cs.CV

TL;DR: 提出一种结合事件相机与主动结构光投影的新型RGB-D感知方法，实现高速彩色点云生成与深度检测。

Details

Motivation: 传统单色事件相机难以检测静态或缓慢移动物体，且缺乏颜色信息，限制了其在某些场景的应用。 Method: 通过集成数字光处理（DLP）投影仪，构建主动结构光系统，利用事件相机捕捉动态亮度变化，并结合动态投影调整实现逐像素的彩色与深度信息分离检测。 Result: 实现了等效1400 fps的彩色检测速度和4 kHz的像素级深度检测，生成高质量彩色点云，保持空间分辨率。 Conclusion: 该方法显著提升了事件相机在RGB-D感知中的能力，推动了其在机器人、3D重建等领域的应用发展。 Abstract: Event-based cameras (ECs) have emerged as bio-inspired sensors that report pixel brightness changes asynchronously, offering unmatched speed and efficiency in vision sensing. Despite their high dynamic range, temporal resolution, low power consumption, and computational simplicity, traditional monochrome ECs face limitations in detecting static or slowly moving objects and lack color information essential for certain applications. To address these challenges, we present a novel approach that integrates a Digital Light Processing (DLP) projector, forming Active Structured Light (ASL) for RGB-D sensing. By combining the benefits of ECs and projection-based techniques, our method enables the detection of color and the depth of each pixel separately. Dynamic projection adjustments optimize bandwidth, ensuring selective color data acquisition and yielding colorful point clouds without sacrificing spatial resolution. This integration, facilitated by a commercial TI LightCrafter 4500 projector and a monocular monochrome EC, not only enables frameless RGB-D sensing applications but also achieves remarkable performance milestones. With our approach, we achieved a color detection speed equivalent to 1400 fps and 4 kHz of pixel depth detection, significantly advancing the realm of computer vision across diverse fields from robotics to 3D reconstruction methods. Our code is publicly available: https://github.com/MISTLab/event_based_rgbd_ros

[120] MeniMV: A Multi-view Benchmark for Meniscus Injury Severity Grading

Shurui Xu,Siqi Yang,Jiapin Ren,Zhong Cao,Hongwei Yang,Mengzhen Fan,Yuyu Sun,Shuyan Li

Main category: cs.CV

TL;DR: 本文提出了MeniMV，一个用于半月板前后角撕裂分级的多视角MRI数据集，包含3,000次检查和6,000张配准图像，并提供四等级标注，支持自动化膝关节损伤分析研究。

Details

Motivation: 现有方法多依赖粗粒度标签或二分类，缺乏对半月板前后角撕裂的精确定位与严重程度分级信息，限制了自动化MRI分析的发展。 Method: 构建了一个名为MeniMV的多中心、多视角数据集，包含来自750名患者的3,000次MRI检查，每例均标注前后角四等级（0-3）损伤，并使用CNN和Transformer模型进行基准测试。 Result: MeniMV提供了超过以往数据集两倍的病理标注数据量，实验建立了强基线并揭示了当前在严重程度分级上的挑战。 Conclusion: MeniMV填补了半月板角部损伤精细分级数据的空白，为未来肌肉骨骼影像自动化分析提供了重要基础。 Abstract: Precise grading of meniscal horn tears is critical in knee injury diagnosis but remains underexplored in automated MRI analysis. Existing methods often rely on coarse study-level labels or binary classification, lacking localization and severity information. In this paper, we introduce MeniMV, a multi-view benchmark dataset specifically designed for horn-specific meniscus injury grading. MeniMV comprises 3,000 annotated knee MRI exams from 750 patients across three medical centers, providing 6,000 co-registered sagittal and coronal images. Each exam is meticulously annotated with four-tier (grade 0-3) severity labels for both anterior and posterior meniscal horns, verified by chief orthopedic physicians. Notably, MeniMV offers more than double the pathology-labeled data volume of prior datasets while uniquely capturing the dual-view diagnostic context essential in clinical practice. To demonstrate the utility of MeniMV, we benchmark multiple state-of-the-art CNN and Transformer-based models. Our extensive experiments establish strong baselines and highlight challenges in severity grading, providing a valuable foundation for future research in automated musculoskeletal imaging.

[121] Object-Centric Framework for Video Moment Retrieval

Zongyao Li,Yongkang Wong,Satoshi Yamazaki,Jianquan Liu,Mohan Kankanhalli

Main category: cs.CV

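TL;DR: Proposes a scene-graph-based, object-centric framework for video moment retrieval that models object-level spatio-temporal relations to improve the understanding and localization of fine-grained queries.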

Details

Motivation: Existing methods rely mainly on global frame-level or clip-level features and struggle to capture the fine-grained semantics and appearance that object-oriented queries require; in particular, they do not model object-level temporal dynamics. Method: A scene graph parser extracts query-relevant objects, scene graphs are built from video frames to represent the objects and their relationships, and object-level feature sequences are generated; a relational tracklet transformer then models spatio-temporal correlations among the objects. Result: The method outperforms existing state-of-the-art approaches on three benchmarks (Charades-STA, QVHighlights, and TACoS), significantly improving retrieval performance. Conclusion: The object-centric framework effectively captures object state changes and strengthens the understanding of complex object interactions, enabling more accurate localization of moments described in terms of objects. Abstract: Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.

[122] Plasticine: A Traceable Diffusion Model for Medical Image Translation

Tianyang Zhanng,Xinxing Cheng,Jun Cheng,Shaoming Zheng,He Zhao,Huazhu Fu,Alejandro F Frangi,Jiang Liu,Jinming Duan

Main category: cs.CV

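TL;DR: Proposes Plasticine, the first end-to-end medical image translation method designed with traceability as a core objective; it combines intensity translation and spatial transformation within a denoising diffusion model to provide pixel-level interpretability.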

Details

Motivation: Existing medical image translation methods ignore spatial correspondence and lack pixel-level traceability, which hurts clinical interpretability. Method: Intensity translation and spatial transformation are modeled jointly within a denoising diffusion framework, giving end-to-end image translation that preserves pixel-level correspondence. Result: The generated synthetic images show interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process. Conclusion: Plasticine offers a clinically traceable solution for medical image translation and improves the trustworthiness and interpretability of such models in practice. Abstract: Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.

[123] Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models

Xiaoyang Guo,Keze Wang

Main category: cs.CV

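TL;DR: Proposes Adaptive-VoCo, a lightweight adaptive visual compression framework that adjusts the compression rate dynamically according to visual complexity, improving the efficiency and performance of large-scale vision-language models.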

Details

Motivation: Visual token compression methods in current vision-language models typically use a fixed compression rate, which cannot adapt to the visual complexity of different images and limits the balance between efficiency and representational capacity. Method: A lightweight predictor is added on top of VoCo-LLaMA; it estimates image complexity from statistical cues in the vision encoder (such as patch token entropy and attention map variance) and dynamically selects the compression rate. A joint loss function that combines rate regularization with complexity alignment optimizes the trade-off between efficiency and performance. Result: Experiments show that Adaptive-VoCo consistently outperforms fixed-rate baselines on multiple multimodal tasks, with higher inference efficiency and stronger representation of complex scenes. Conclusion: Adaptive visual compression is an effective route to efficient and robust large-scale vision-language models, and Adaptive-VoCo suggests a new direction for future VLM design. Abstract: In recent years, large-scale vision-language models (VLMs) have demonstrated remarkable performance on multimodal understanding and reasoning tasks. However, handling high-dimensional visual features often incurs substantial computational and memory costs. VoCo-LLaMA alleviates this issue by compressing visual patch tokens into a few VoCo tokens, reducing computational overhead while preserving strong cross-modal alignment. Nevertheless, such approaches typically adopt a fixed compression rate, limiting their ability to adapt to varying levels of visual complexity. To address this limitation, we propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. This predictor dynamically selects an optimal compression rate by quantifying an image's visual complexity using statistical cues from the vision encoder, such as patch token entropy and attention map variance. Furthermore, we introduce a joint loss function that integrates rate regularization with complexity alignment. This enables the model to balance inference efficiency with representational capacity, particularly in challenging scenarios. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks, highlighting the potential of adaptive visual compression for creating more efficient and robust VLMs.
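
The complexity-driven rate selection described above can be illustrated with a minimal sketch. It is not the authors' predictor: the blending weight `alpha`, the variance scale `var_scale`, the direction in which attention variance raises the score, and the candidate token counts are all hypothetical choices made only to show how entropy and variance cues could map to a number of VoCo tokens.

```python
import math


def token_entropy(weights):
    """Shannon entropy (in nats) of non-negative weights after normalizing them to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def complexity_score(patch_weights, attention_map, alpha=0.5, var_scale=0.01):
    """Hypothetical complexity score in [0, 1].

    Blends the normalized patch-token entropy with a squashed attention-map
    variance. alpha and var_scale are illustrative assumptions.
    """
    norm_entropy = token_entropy(patch_weights) / math.log(len(patch_weights))
    var = variance(attention_map)
    squashed_var = var / (var + var_scale)
    return alpha * norm_entropy + (1 - alpha) * squashed_var


def select_num_tokens(score, candidates=(1, 2, 4, 8, 16)):
    """Map a score in [0, 1] to one of the candidate token counts (higher score, more tokens)."""
    index = min(int(score * len(candidates)), len(candidates) - 1)
    return candidates[index]
```

In this sketch a nearly uniform image yields a low score and a single compressed token, while an image with spread-out patch statistics is allotted more tokens; the real method learns this mapping with the joint loss rather than using fixed buckets.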

[124] PlantDiseaseNet-RT50: A Fine-tuned ResNet50 Architecture for High-Accuracy Plant Disease Detection Beyond Standard CNNs

Santwana Sagnika,Manav Malhotra,Ishtaj Kaur Deol,Soumyajit Roy,Swarnav Kumar

Main category: cs.CV

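TL;DR: This paper proposes PlantDiseaseNet-RT50, a deep learning model adapted from ResNet50 for automated plant disease detection, which achieves approximately 98% accuracy, precision, and recall on a multi-crop dataset.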

Details

Motivation: Traditional plant disease detection relies on manual visual inspection, which is time-consuming and hard to apply to large-scale agriculture, so automated and efficient solutions are needed. Method: ResNet50 serves as the base model; selected layers are strategically unfrozen, a custom classification head with regularization is added, the learning rate is adjusted dynamically with cosine annealing, and batch normalization and dropout are used during training. Result: On a comprehensive dataset covering disease categories across multiple crops, PlantDiseaseNet-RT50 reaches approximately 98% accuracy, precision, and recall, significantly outperforming traditional methods. Conclusion: Through targeted fine-tuning, PlantDiseaseNet-RT50 turns a general-purpose pretrained model into a specialized agricultural diagnostic tool that combines high accuracy with computational efficiency and can be applied in practical farming settings to support timely intervention and reduce crop losses. Abstract: Plant diseases pose a significant threat to agricultural productivity and global food security, accounting for 70-80% of crop losses worldwide. Traditional detection methods rely heavily on expert visual inspection, which is time-consuming, labour-intensive, and often impractical for large-scale farming operations. In this paper, we present PlantDiseaseNet-RT50, a novel fine-tuned deep learning architecture based on ResNet50 for automated plant disease detection. Our model features strategically unfrozen layers, a custom classification head with regularization mechanisms, and dynamic learning rate scheduling through cosine decay. Using a comprehensive dataset of distinct plant disease categories across multiple crop species, PlantDiseaseNet-RT50 achieves exceptional performance with approximately 98% accuracy, precision, and recall. Our architectural modifications and optimization protocol demonstrate how targeted fine-tuning can transform a standard pretrained model into a specialized agricultural diagnostic tool. We provide a detailed account of our methodology, including the systematic unfreezing of terminal layers, implementation of batch normalization and dropout regularization, and application of advanced training techniques. PlantDiseaseNet-RT50 represents a significant advancement in AI-driven agricultural tools, offering a computationally efficient solution for rapid and accurate plant disease diagnosis that can be readily implemented in practical farming contexts to support timely interventions and reduce crop losses.
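
The cosine decay schedule mentioned in the abstract follows a standard formula, sketched below. The function name and the example learning rates are hypothetical; the paper's actual hyperparameters are not given in this listing.

```python
import math


def cosine_decay_lr(step, total_steps, lr_max, lr_min=0.0):
    """Standard cosine decay: starts at lr_max and decays smoothly to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Illustrative values only (assumed, not from the paper).
schedule = [cosine_decay_lr(s, total_steps=100, lr_max=1e-3, lr_min=1e-5) for s in range(101)]
```

The rate changes slowly near the start and end of training and fastest in the middle, which is why this schedule is a common choice when fine-tuning unfrozen layers of a pretrained backbone.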

[125] NASTaR: NovaSAR Automated Ship Target Recognition Dataset

Benyamin Hosseiny,Kamirul Kamirul,Odysseas Pappas,Alin Achim

Main category: cs.CV

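TL;DR: This paper presents NASTaR, a new dataset for ship type classification in synthetic aperture radar (SAR) imagery. It contains 3415 ship samples extracted from NovaSAR S-band imagery with detailed annotations and auxiliary information (such as ship wakes), and its usefulness across several classification tasks is validated with multiple deep learning models.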

Details

Motivation: Because ship types are diverse and complex, and SAR satellites differ in frequency and resolution, ship classification methods need large amounts of high-quality annotated data, yet no public dataset exists for S-band SAR imagery; a dedicated dataset is therefore needed to improve model accuracy and generalization. Method: The authors build the NASTaR dataset of 3415 ship patches extracted from NovaSAR S-band SAR imagery, labeled by matching against AIS data. It covers 23 ship classes, separates inshore from offshore scenes, and includes an auxiliary dataset of patches with visible ship wakes. Mainstream deep learning models are benchmarked on several classification tasks (four major classes, three classes, cargo vs. tanker, and fishing vessel identification). Result: Accuracy exceeds 60% for the four-class task, 70% for the three-class task, 75% for distinguishing cargo from tanker ships, and 87% for identifying fishing vessels. Conclusion: The NASTaR dataset is a valuable resource for automated ship recognition in S-band SAR imagery; it supports multiple classification tasks and advances the development and evaluation of related deep learning models. Abstract: Synthetic Aperture Radar (SAR) offers a unique capability for all-weather, space-based maritime activity monitoring by capturing and imaging strong reflections from ships at sea. A well-defined challenge in this domain is ship type classification. Due to the high diversity and complexity of ship types, accurate recognition is difficult and typically requires specialized deep learning models. These models, however, depend on large, high-quality ground-truth datasets to achieve robust performance and generalization. Furthermore, the growing variety of SAR satellites operating at different frequencies and spatial resolutions has amplified the need for more annotated datasets to enhance model accuracy. To address this, we present the NovaSAR Automated Ship Target Recognition (NASTaR) dataset. This dataset comprises 3415 ship patches extracted from NovaSAR S-band imagery, with labels matched to AIS data. It includes distinctive features such as 23 unique classes, inshore/offshore separation, and an auxiliary wake dataset for patches where ship wakes are visible. We validated the dataset applicability across prominent ship-type classification scenarios using benchmark deep learning models. Results demonstrate over 60% accuracy for classifying four major ship types, over 70% for a three-class scenario, more than 75% for distinguishing cargo from tanker ships, and over 87% for identifying fishing vessels.
The NASTaR dataset is available at https://10.5523/bris, while relevant codes for benchmarking and analysis are available at https://github.com/benyaminhosseiny/nastar.

[126] GTMA: Dynamic Representation Optimization for OOD Vision-Language Models

Jensen Zhang,Ningyuan Liu,Keze Wang

Main category: cs.CV

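TL;DR: Proposes GTMA, a dynamic representation optimization framework that constructs continuous pseudo-word embeddings to address the cross-modal alignment collapse that modal asymmetry causes in vision-language models in open-world settings.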

Details

Motivation: Existing vision-language models perform poorly on out-of-distribution concepts in open-world applications, mainly because the text encoder is constrained by a fixed discrete vocabulary and cannot generate new semantic anchors. Method: The Guided Target-Matching Adaptation (GTMA) framework constructs, at inference time, a continuous pseudo-word embedding that best aligns with the visual anchor of an OOD image; it is optimized with an adaptive gradient-based representation policy optimization algorithm, with semantic regularization to preserve plausibility. Result: On the ImageNet-R and VISTA-Beyond benchmarks, GTMA improves zero-shot and few-shot OOD accuracy by 15-20% over the base model while maintaining performance on in-distribution concepts. Conclusion: GTMA effectively mitigates modal asymmetry, overcomes the limits of the pre-trained semantic space, and significantly improves VLM generalization in open environments. Abstract: Vision-language models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse and severely degrade zero-shot performance. We identify the root cause as modal asymmetry: while the visual encoder can extract discriminative features from unseen images, the text encoder is constrained by a fixed discrete vocabulary and cannot synthesize new semantic anchors. Existing approaches such as CoOp or LoRA provide only partial remedies, as they remain confined to the pre-trained semantic space. To overcome this bottleneck, we propose dynamic representation optimization, realized through the Guided Target-Matching Adaptation (GTMA) framework. At inference time, GTMA constructs a continuous pseudo-word embedding that best aligns with an OOD image's visual anchor, effectively bypassing vocabulary limitations. The optimization is driven by an adaptive gradient-based representation policy optimization algorithm, which incorporates semantic regularization to preserve plausibility and compatibility with the model's prior knowledge. Experiments on ImageNet-R and the VISTA-Beyond benchmark demonstrate that GTMA improves zero-shot and few-shot OOD accuracy by up to 15-20 percent over the base VLM while maintaining performance on in-distribution concepts. Ablation studies further confirm the necessity of pseudo-word optimization.

[127] Detection of AI Generated Images Using Combined Uncertainty Measures and Particle Swarm Optimised Rejection Mechanism

Rahul Yumlembam,Biju Issac,Nauman Aslam,Eaby Kollonoor Babu,Josh Collyer,Fraser Kennedy

Main category: cs.CV

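TL;DR: This paper proposes an AI-generated image detection framework based on multi-source uncertainty fusion. It combines Fisher Information, Monte Carlo Dropout entropy, and predictive variance from Deep Kernel Learning, and uses Particle Swarm Optimisation for adaptive weighting and rejection threshold selection, showing strong robustness on out-of-distribution generators and under adversarial attacks.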

Details

Motivation: As AI-generated images become more realistic, traditional detection methods degrade under distribution shift and on unseen generators, so a robust and generalizable detection mechanism is needed. Method: Three uncertainty measures are used (Fisher Information, entropy from Monte Carlo Dropout, and predictive variance from Deep Kernel Learning), and Particle Swarm Optimisation (PSO) learns the optimal weights and sets an adaptive rejection threshold. Result: The proposed combined uncertainty measure rejects approximately 70% of incorrect predictions on unseen generators; under FGSM and PGD adversarial attacks, the combined measure rejects around 61% of successful attack samples and GP-based uncertainty alone up to 80%, while high acceptance of natural images and in-domain AI images is maintained. Conclusion: Multi-source uncertainty fusion provides a robust and adaptive approach to AI-generated image detection that performs particularly well on out-of-distribution data and under adversarial attacks, with good potential for practical use. Abstract: As AI-generated images become increasingly photorealistic, distinguishing them from natural images poses a growing challenge. This paper presents a robust detection framework that leverages multiple uncertainty measures to decide whether to trust or reject a model's predictions. We focus on three complementary techniques: Fisher Information, which captures the sensitivity of model parameters to input variations; entropy-based uncertainty from Monte Carlo Dropout, which reflects predictive variability; and predictive variance from a Deep Kernel Learning framework using a Gaussian Process classifier. To integrate these diverse uncertainty signals, Particle Swarm Optimisation is used to learn optimal weightings and determine an adaptive rejection threshold. The model is trained on Stable Diffusion-generated images and evaluated on GLIDE, VQDM, Midjourney, BigGAN, and StyleGAN3, each introducing significant distribution shifts. While standard metrics such as prediction probability and Fisher-based measures perform well in distribution, their effectiveness degrades under shift. In contrast, the Combined Uncertainty measure consistently achieves an incorrect rejection rate of approximately 70 percent on unseen generators, successfully filtering most misclassified AI samples. Although the system occasionally rejects correct predictions from newer generators, this conservative behaviour is acceptable, as rejected samples can support retraining. The framework maintains high acceptance of accurate predictions for natural images and in-domain AI data.
Under adversarial attacks using FGSM and PGD, the Combined Uncertainty method rejects around 61 percent of successful attacks, while GP-based uncertainty alone achieves up to 80 percent. Overall, the results demonstrate that multi-source uncertainty fusion provides a resilient and adaptive solution for AI-generated image detection.
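
A rough sketch of the rejection mechanism follows. It assumes the three uncertainty signals are already scaled to comparable ranges and combined as a weighted sum, which the abstract implies but does not spell out; the weights and threshold shown here are placeholders for the values that Particle Swarm Optimisation would learn.

```python
import math


def predictive_entropy(mc_probs):
    """Entropy of the mean class distribution over several stochastic (MC Dropout) forward passes."""
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / len(mc_probs) for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)


def combined_uncertainty(fisher, entropy, gp_variance, weights):
    """Weighted sum of the three uncertainty signals (the weighted-sum form is an assumption)."""
    w_fisher, w_entropy, w_variance = weights
    return w_fisher * fisher + w_entropy * entropy + w_variance * gp_variance


def should_reject(score, threshold):
    """Reject the prediction when the combined uncertainty exceeds the threshold."""
    return score > threshold


# Placeholder weights and threshold; in the paper these are found by PSO.
weights, threshold = (0.2, 0.5, 0.3), 0.4
passes = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # disagreeing MC Dropout passes
score = combined_uncertainty(0.3, predictive_entropy(passes), 0.5, weights)
reject = should_reject(score, threshold)
```

Rejected samples are withheld rather than classified, which matches the conservative behaviour the abstract describes for unseen generators.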

[128] WoundNet-Ensemble: A Novel IoMT System Integrating Self-Supervised Deep Learning and Multi-Model Fusion for Automated, High-Accuracy Wound Classification and Healing Progression Monitoring

Moses Kiprono

Main category: cs.CV

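TL;DR: This paper proposes WoundNet-Ensemble, an Internet of Medical Things system that ensembles three deep learning models (ResNet-50, DINOv2, and Swin Transformer) to automatically classify six clinical wound types. It reaches 99.90% accuracy on a dataset of 5,175 wound images, reports a 3.7% improvement from its weighted fusion strategy, and adds a longitudinal wound healing tracker, showing potential for clinical deployment.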

Details

Motivation: Chronic wounds such as diabetic foot ulcers impose a large clinical and economic burden, and current wound assessment relies mainly on subjective judgment, leading to inconsistent classification and delayed intervention; objective, accurate automated assessment tools are needed. Method: The WoundNet-Ensemble system combines three deep learning architectures (ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer) through a weighted fusion strategy for ensemble learning. The models are trained and validated on a dataset of 5,175 images, and a longitudinal healing tracking module computes healing rates and severity scores and generates clinical alerts. Result: The ensemble reaches 99.90% accuracy across six wound types, a 3.7% improvement over existing state-of-the-art methods; a longitudinal wound healing tracking system is built and validated, providing quantified healing rates and clinical alerts. Conclusion: WoundNet-Ensemble is a high-accuracy, robust, and clinically deployable AI tool that could help modernize wound care and support telemedicine and patient monitoring; the code and models will be released to support reproducible research. Abstract: Chronic wounds, including diabetic foot ulcers which affect up to one-third of people with diabetes, impose a substantial clinical and economic burden, with U.S. healthcare costs exceeding 25 billion dollars annually. Current wound assessment remains predominantly subjective, leading to inconsistent classification and delayed interventions. We present WoundNet-Ensemble, an Internet of Medical Things system leveraging a novel ensemble of three complementary deep learning architectures: ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer, for automated classification of six clinically distinct wound types. Our system achieves 99.90 percent ensemble accuracy on a comprehensive dataset of 5,175 wound images spanning diabetic foot ulcers, pressure ulcers, venous ulcers, thermal burns, pilonidal sinus wounds, and fungating malignant tumors. The weighted fusion strategy demonstrates a 3.7 percent improvement over previous state-of-the-art methods. Furthermore, we implement a longitudinal wound healing tracker that computes healing rates, severity scores, and generates clinical alerts. This work demonstrates a robust, accurate, and clinically deployable tool for modernizing wound care through artificial intelligence, addressing critical needs in telemedicine and remote patient monitoring. The implementation and trained models will be made publicly available to support reproducibility.
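
Weighted fusion of classifier outputs is commonly implemented as a weighted average of per-model class probabilities. The sketch below shows that generic form; whether WoundNet-Ensemble fuses probabilities in exactly this way, and the weights and probability vectors used here, are assumptions for illustration.

```python
def weighted_fusion(prob_vectors, weights):
    """Weighted average of per-model class-probability vectors."""
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]


def fused_prediction(prob_vectors, weights):
    """Index of the class with the highest fused probability."""
    fused = weighted_fusion(prob_vectors, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical outputs of three models for one image over three classes.
model_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
]
equal_vote = fused_prediction(model_probs, [1, 1, 1])
weighted_vote = fused_prediction(model_probs, [1, 1, 3])
```

With equal weights the first two models outvote the third, while giving the third model a larger weight changes the fused decision, which is the kind of behaviour a tuned weighting is meant to exploit.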

[129] Hierarchical Bayesian Framework for Multisource Domain Adaptation

Alexander M. Glandon,Khan M. Iftekharuddin

Main category: cs.CV

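TL;DR: Proposes a hierarchical Bayesian framework for multisource domain adaptation (MDA) that exploits distributional similarity among source domains to optimize pretraining, improving accuracy on human action recognition by 17.29% over existing methods.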

Details

Motivation: Existing MDA methods either share weights or train models independently during pretraining and lack systematic modeling; this work uses a Bayesian approach to model the similarity among multiple source domains and improve MDA performance. Method: A hierarchical Bayesian framework uses the similarity among the data distributions of different source domains to optimize pretraining, thereby improving cross-domain inference. Result: Experiments on the Daily-DA RGB video benchmark show that the proposed method improves accuracy on human action recognition by 17.29% over existing state-of-the-art methods. Conclusion: The Bayesian framework effectively improves multisource domain adaptation, with a particularly clear advantage on the difficult task of human action recognition. Abstract: Multisource domain adaptation (MDA) aims to use multiple source datasets with available labels to infer labels on a target dataset without available labels for target supervision. Prior work on MDA in the literature is ad hoc, as the pretraining of source models is either based on weight sharing or uses independently trained models. This work proposes a Bayesian framework for pretraining in MDA by considering that the distributions of different source domains are typically similar. The Hierarchical Bayesian Framework uses similarity between the different source data distributions to optimize the pretraining for MDA. Experiments using the proposed Bayesian framework for MDA show that our framework improves accuracy on recognition tasks for a large benchmark dataset. Performance comparison with state-of-the-art MDA methods on the challenging problem of human action recognition in the multi-domain benchmark Daily-DA RGB video shows the proposed Bayesian Framework offers a 17.29% improvement in accuracy when compared to the state-of-the-art methods in the literature.

[130] Enhancing Medical Large Vision-Language Models via Alignment Distillation

Aofei Chang,Ting Wang,Fenglong Ma

Main category: cs.CV

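TL;DR: This paper proposes MEDALIGN, a lightweight alignment distillation framework that addresses misaligned visual understanding in medical vision-language models; two distillation losses significantly improve the accuracy and interpretability of the generated outputs.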

Details

Motivation: Medical Large Vision-Language Models (Med-LVLMs) produce hallucinated outputs in clinical applications, mainly because of insufficient visual representation learning and poor visual attention alignment; this paper targets these two fundamental limitations. Method: The MEDALIGN framework distills knowledge from a domain-specific CLIP model, introducing a spatial-aware visual alignment loss (based on visual token-level similarity structures) and an attention-aware distillation loss that strengthens attention on diagnostically relevant regions. Result: Extensive experiments on medical report generation and medical visual question answering show that MEDALIGN consistently improves performance and interpretability, producing outputs that are better grounded in the visual evidence. Conclusion: MEDALIGN effectively mitigates hallucination in Med-LVLMs and achieves more reliable vision-language alignment through a lightweight alignment distillation strategy, with good potential for application. Abstract: Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves both performance and interpretability, yielding more visually grounded outputs.
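
One plausible reading of a loss "based on visual token-level similarity structures" is to match the pairwise cosine-similarity matrices of student and teacher visual tokens. The sketch below implements that reading with a mean squared error; the exact loss used by MEDALIGN is not specified in this listing, so the formulation and all names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(tokens):
    """Pairwise cosine similarities among one model's visual tokens."""
    return [[cosine(a, b) for b in tokens] for a in tokens]


def similarity_structure_loss(student_tokens, teacher_tokens):
    """Mean squared difference between student and teacher similarity matrices.

    Only the token-to-token structure is compared, so the two models may use
    different feature dimensions as long as they have the same number of tokens.
    """
    s = similarity_matrix(student_tokens)
    t = similarity_matrix(teacher_tokens)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
```

Because the comparison happens in similarity space, a teacher such as a CLIP vision encoder and a student with a different hidden size can be aligned without a projection layer, which fits the "lightweight" framing of the method.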

[131] OpenView: Empowering MLLMs with Out-of-view VQA

Qixiang Chen,Cheng Zhang,Chi-Wing Fu,Jingwen Ye,Jianfei Cai

Main category: cs.CV

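TL;DR: This paper presents the first study of out-of-view (OOV) understanding in vision models; by building the OpenView generation pipeline, a synthetic dataset, and an evaluation benchmark, it significantly improves the performance of multimodal large language models on OOV tasks.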

Details

Motivation: Existing MLLMs are good mainly at understanding content visible within the image and lack the ability to reason about objects, activities, and scenes outside the view, which limits their use in complex real-world environments. Method: A four-stage OpenView generation pipeline uses panoramic imagery to generate spatially grounded multi-choice visual question answering (VQA); a high-quality synthetic dataset, OpenView-Dataset, is built for supervised fine-tuning; and a benchmark, OpenView-Bench, measures both choice and rationale accuracy. Result: Experiments show that with OpenView, the average accuracy of multiple MLLMs on OOV VQA rises from 48.6% to 64.1%, although a gap to human performance remains. Conclusion: This is the first systematic exploration of out-of-view understanding in MLLMs; it validates the use of panoramic context to generate training data and points to a new direction for building vision models with stronger spatial awareness and reasoning. Abstract: Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet they perform well mainly when reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline to massively generate multi-choice VQA by leveraging panoramic imagery to enable context-rich and spatial-grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset from diverse real-world panoramas to empower MLLMs upon supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that, despite a large gap from human performance in OOV VQA answer selection, multiple MLLMs consistently improve once empowered by OpenView, rising from 48.6% to 64.1% on average. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.

[132] Placenta Accreta Spectrum Detection Using an MRI-based Hybrid CNN-Transformer Model

Sumaiya Ali,Areej Alhothali,Ohoud Alzamzami,Sameera Albasri,Ahmed Abduljabbar,Muhammad Alwazzan

Main category: cs.CV

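TL;DR: This study proposes a hybrid deep learning model that combines a 3D DenseNet121 and a 3D Vision Transformer to automatically detect Placenta Accreta Spectrum (PAS) from volumetric MRI data, reaching 84.3% accuracy on an independent test set.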

Details

Motivation: Variability in radiologists' interpretations of MRI makes PAS diagnosis challenging, so better diagnostic consistency and accuracy are needed. Method: A hybrid model that fuses a 3D DenseNet121 (for local features) with a 3D Vision Transformer (for global spatial context) is trained and evaluated on 1,133 retrospective MRI volumes and compared with other 3D deep learning architectures. Result: On the independent test set, the proposed DenseNet121-ViT model achieves the highest five-run average accuracy, 84.3%, outperforming the other models compared. Conclusion: Hybrid CNN-Transformer models show strong potential for computer-aided diagnosis of PAS and can serve as an effective tool for improving diagnostic consistency, accuracy, and timeliness. Abstract: Placenta Accreta Spectrum (PAS) is a serious obstetric condition that can be challenging to diagnose with Magnetic Resonance Imaging (MRI) due to variability in radiologists' interpretations. To overcome this challenge, a hybrid 3D deep learning model for automated PAS detection from volumetric MRI scans is proposed in this study. The model integrates a 3D DenseNet121 to capture local features and a 3D Vision Transformer (ViT) to model global spatial context. It was developed and evaluated on a retrospective dataset of 1,133 MRI volumes. Multiple 3D deep learning architectures were also evaluated for comparison. On an independent test set, the DenseNet121-ViT model achieved the highest performance with a five-run average accuracy of 84.3%. These results highlight the strength of hybrid CNN-Transformer models as a computer-aided diagnosis tool. The model's performance demonstrates a clear potential to assist radiologists by providing robust decision support to improve diagnostic consistency across interpretations, and ultimately enhance the accuracy and timeliness of PAS diagnosis.

[133] Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach

Zhe Li,Kun Cheng,Hanyue Mo,Jintao Lu,Ziwen Kuang,Jianwen Ye,Lixu Xu,Xinya Meng,Jiahui Zhao,Shengda Ji,Shuyuan Liu,Mengyu Wang

Main category: cs.CV

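TL;DR: Proposes a vision-based trajectory analysis method to address the "zero-speed braking" problem that inaccurate CAN signals cause in commercial vehicle AEB systems at low speed; it identifies the motion state with high accuracy from video frames and substantially reduces false braking.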

Details

Motivation: At low speed, commercial vehicle AEB systems often trigger false "zero-speed braking" because of inaccurate CAN signals, which affects safety and the driving experience; a more reliable way to detect the motion state is needed. Method: Video from a blind spot camera is processed on the NVIDIA Jetson AGX Xavier platform, combining adaptive CLAHE-enhanced SIFT feature extraction with KNN-RANSAC matching; multi-frame trajectory displacement statistics are computed over a 5-frame sliding window, a dual-threshold state decision matrix is designed, and the ROI is configured dynamically via OBD-II. Result: On a real-world dataset (32,454 video segments from 1,852 vehicles), the method achieves an F1-score of 99.96% for static detection and 97.78% for moving state recognition, with a processing delay of only 14.2 milliseconds; after on-site deployment, false braking events fall by 89%, emergency braking succeeds in 100% of cases, and the fault rate is below 5%. Conclusion: The method effectively suppresses environmental interference and false detection of dynamic objects, significantly improving the reliability and accuracy of commercial vehicle AEB systems at low speed, and is suitable for practical deployment. Abstract: A vision-based trajectory analysis solution is proposed to address the "zero-speed braking" issue caused by inaccurate Controller Area Network (CAN) signals in commercial vehicle Automatic Emergency Braking (AEB) systems during low-speed operation. The algorithm utilizes the NVIDIA Jetson AGX Xavier platform to process sequential video frames from a blind spot camera, employing self-adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE)-enhanced Scale-Invariant Feature Transform (SIFT) feature extraction and K-Nearest Neighbors (KNN)-Random Sample Consensus (RANSAC) matching. This allows for precise classification of the vehicle's motion state (static, vibration, moving). Key innovations include 1) multiframe trajectory displacement statistics (5-frame sliding window), 2) a dual-threshold state decision matrix, and 3) OBD-II driven dynamic Region of Interest (ROI) configuration. The system effectively suppresses environmental interference and false detection of dynamic objects, directly addressing the challenge of low-speed false activation in commercial vehicle safety systems. Evaluation on a real-world dataset (32,454 video segments from 1,852 vehicles) demonstrates an F1-score of 99.96% for static detection, 97.78% for moving state recognition, and a processing delay of 14.2 milliseconds (resolution 704x576). On-site deployment shows an 89% reduction in false braking events, a 100% success rate in emergency braking, and a fault rate below 5%.
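
The sliding-window and dual-threshold ideas can be sketched as follows. This is a simplified stand-in for the paper's decision matrix: it assumes a single statistic (the mean feature displacement over the last 5 frames) and two hypothetical pixel thresholds, `low` and `high`, that separate the static, vibration, and moving states.

```python
from collections import deque
from statistics import mean


class MotionStateClassifier:
    """Classifies motion from per-frame feature displacement using a sliding window."""

    def __init__(self, window=5, low=0.5, high=3.0):
        # window matches the 5-frame window in the abstract; low and high are
        # hypothetical thresholds in pixels, not values from the paper.
        self.history = deque(maxlen=window)
        self.low = low
        self.high = high

    def update(self, displacement_px):
        """Add one frame's median feature displacement and return the current state."""
        self.history.append(displacement_px)
        average = mean(self.history)
        if average < self.low:
            return "static"
        if average >= self.high:
            return "moving"
        return "vibration"


classifier = MotionStateClassifier()
states = [classifier.update(d) for d in [0.1, 0.1, 0.1, 0.1, 2.0]]
```

Averaging over the window means a single noisy frame, such as the 2.0-pixel spike at the end of the example, does not flip the state away from static, which is the kind of false activation the system is designed to avoid.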

[134] SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback

Jianglin Lu,Yuanwei Wu,Ziyi Zhao,Hongcheng Wang,Felix Jimenez,Abrar Majeedi,Yun Fu

Main category: cs.CV

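TL;DR: Proposes an unsupervised image restoration framework based on policy optimization that uses perceptual feedback from multimodal large language models to achieve efficient, high-quality restoration of complex degradations.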

Details

Motivation: Existing restoration methods based on vision-language models suffer from efficiency bottlenecks and depend on degradation recognition models that require extensive annotation, making them hard to apply in label-free settings. Method: A lightweight agent selects the best restoration operation through sequential decision-making, and a new reward mechanism driven by multimodal large language models enables policy learning without supervision. Result: Trained without supervision, the method matches SOTA performance on full-reference metrics, surpasses existing methods on no-reference metrics, and significantly speeds up inference. Conclusion: The proposed framework resolves the efficiency and annotation-dependence problems of existing methods and delivers efficient, high-quality restoration of complex images. Abstract: Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns a lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluators and provide perceptual feedback for policy improvement. Once trained, our agent executes deterministic restoration plans without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.

[135] Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments

Saeideh Yousefzadeh,Hamidreza Pourreza

Main category: cs.CV

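TL;DR: Proposes Text2Graph VPR, an explainable visual place recognition system based on semantic scene graphs; it converts image sequences into textual descriptions, builds scene graphs for place matching, and combines learned and structural similarity for robust and transparent localization.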

Details

Motivation: Existing VPR methods struggle with lighting, weather, and seasonal change in long-term deployment and lack interpretability; a system is needed that stays robust while exposing a transparent decision process. Method: Image sequences are converted into textual descriptions and parsed into scene graphs of objects, attributes, and relations, which are aggregated into a compact place representation; retrieval uses a dual-similarity mechanism that fuses Graph Attention Network embeddings with a shortest-path kernel. Result: The system is validated on the Oxford RobotCar and MSLS datasets, where it is robust to severe appearance changes and supports zero-shot operation with human textual queries. Conclusion: Semantic graph reasoning is a viable and interpretable approach to place recognition, suited to safety-critical and resource-constrained settings. Abstract: Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison, and -- critically -- produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
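
To make the dual-similarity idea concrete, the sketch below implements a small shortest-path kernel on labeled, undirected graphs and fuses it with an embedding similarity. The graph encoding (a label list plus an edge list), the cosine-style kernel normalization, and the convex combination with weight `beta` are assumptions; the paper's GAT embeddings are represented here only by a precomputed similarity value.

```python
import math
from collections import Counter
from itertools import combinations


def shortest_paths(n, edges):
    """All-pairs shortest path lengths (Floyd-Warshall) for an unweighted, undirected graph."""
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def sp_features(labels, edges):
    """Counts of (label, label, shortest-path length) triples over connected node pairs."""
    dist = shortest_paths(len(labels), edges)
    features = Counter()
    for i, j in combinations(range(len(labels)), 2):
        if dist[i][j] < math.inf:
            a, b = sorted((labels[i], labels[j]))
            features[(a, b, dist[i][j])] += 1
    return features


def sp_kernel(g1, g2):
    """Shortest-path kernel: dot product of the two graphs' triple-count vectors."""
    f1, f2 = sp_features(*g1), sp_features(*g2)
    return sum(count * f2[key] for key, count in f1.items())


def normalized_sp_kernel(g1, g2):
    """Kernel value scaled to [0, 1], analogous to cosine similarity."""
    denom = math.sqrt(sp_kernel(g1, g1) * sp_kernel(g2, g2))
    return sp_kernel(g1, g2) / denom if denom else 0.0


def fused_similarity(embedding_sim, structural_sim, beta=0.5):
    """Convex combination of learned and structural similarity (beta is hypothetical)."""
    return beta * embedding_sim + (1 - beta) * structural_sim
```

For example, two scene graphs with the same objects but a missing relation share only some of their label-distance triples, so the structural term drops even when the learned embeddings remain similar, and the matching triples themselves give a human-readable explanation of the score.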

[136] PTTA: A Pure Text-to-Animation Framework for High-Quality Creation

Ruiqi Chen,Kaitong Cai,Yijia Fan,Keze Wang

Main category: cs.CV

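TL;DR: This paper proposes PTTA, a pure text-to-animation generation framework that builds a high-quality paired dataset and fine-tunes the pretrained video generation model HunyuanVideo to produce high-quality animation.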

Details

Motivation: Existing video generation models perform poorly on animation, and text-to-animation generation is relatively underexplored, with no dedicated framework for animation styles. Method: A small-scale but high-quality paired dataset of animation videos and textual descriptions is constructed, and the pretrained HunyuanVideo model is fine-tuned on it to adapt to animation-style generation. Result: In visual evaluations across multiple dimensions, the method consistently outperforms existing baselines on animation video synthesis. Conclusion: The PTTA framework improves the quality of text-to-animation generation and confirms the feasibility and benefit of fine-tuning for specific animation styles. Abstract: Traditional animation production involves complex pipelines and significant manual labor costs. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.

[137] Uni-Neur2Img: Unified Neural Signal-Guided Image Generation, Editing, and Stylization via Diffusion Transformers

Xiyue Bai,Ronghao Yu,Jia Xiu,Pengfei Zhou,Jie Xia,Peng Ji

Main category: cs.CV

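TL;DR: This paper proposes Uni-Neur2Img, a unified framework for image generation and editing driven by neural signals. It combines a LoRA module with a causal attention mechanism for efficient, extensible multi-modal conditional generation, and a newly built EEG-Style dataset is used to validate its strong performance on EEG-driven image generation, editing, and style transfer.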

Details

Motivation: Existing neural signal-based generation research focuses mainly on text as the condition or intermediate representation, with limited study of visual modalities as direct conditions. This paper aims to fill that gap with a unified, efficient framework in which neural signals directly drive image generation and editing. Method: Uni-Neur2Img uses a parameter-efficient LoRA-based neural signal injection module that processes each conditioning signal independently in a plug-and-play manner, introduces a causal attention mechanism to handle long-sequence neural signal modeling, and builds the new EEG-Style dataset for evaluation. Result: The method is validated on public and self-collected datasets (CVPR40, Loongx, EEG-Style), with significant improvements in generation fidelity, editing consistency, and style transfer quality on EEG-driven image generation, semantic-aware image editing, and style transfer, while keeping computational overhead low and scalability good. Conclusion: Uni-Neur2Img offers a unified, efficient, and extensible solution for connecting neural signals to visual content generation, advancing the integration of brain-computer interfaces with visual generative models. Abstract: Generating or editing images directly from neural signals has immense potential at the intersection of neuroscience, vision, and brain-computer interaction. In this paper, we present Uni-Neur2Img, a unified framework for neural signal-driven image generation and editing. The framework introduces a parameter-efficient LoRA-based neural signal injection module that independently processes each conditioning signal as a pluggable component, facilitating flexible multi-modal conditioning without altering base model parameters. Additionally, we employ a causal attention mechanism to accommodate the long-sequence modeling demands of conditional generation tasks. Existing neural-driven generation research predominantly focuses on textual modalities as conditions or intermediate representations, resulting in limited exploration of visual modalities as direct conditioning signals. To bridge this research gap, we introduce the EEG-Style dataset. We conduct comprehensive evaluations across public benchmarks and self-collected neural signal datasets: (1) EEG-driven image generation on the public CVPR40 dataset; (2) neural signal-guided image editing on the public Loongx dataset for semantic-aware local modifications; and (3) EEG-driven style transfer on our self-collected EEG-Style dataset. Extensive experimental results demonstrate significant improvements in generation fidelity, editing consistency, and style transfer quality while maintaining low computational overhead and strong scalability to additional modalities. Thus, Uni-Neur2Img offers a unified, efficient, and extensible solution for bridging neural signals and visual content generation.

[138] Geometric-Photometric Event-based 3D Gaussian Ray Tracing

Kai Kohyama,Yoshimitsu Aoki,Guillermo Gallego,Shintaro Shiba

Main category: cs.CV

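TL;DR: Proposes a new framework for event-based 3D Gaussian Splatting that decouples geometry and radiance rendering, achieving state-of-the-art performance on real-world datasets without prior information or complex initialization.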

Details

Motivation: Event cameras offer high temporal resolution, but it has been unclear how 3D Gaussian Splatting can make effective use of the fine-grained temporal information in sparse events. Method: Rendering is decoupled into two branches: event-by-event geometry (depth) rendering via ray tracing, and snapshot-based radiance (intensity) rendering using the image of warped events. Result: The method reaches state-of-the-art performance on real-world datasets and competitive performance on the synthetic dataset; it needs no pretrained models or COLMAP initialization, supports flexible event selection, reconstructs edges sharply, and trains quickly. Conclusion: The method deepens the understanding of how sparse events behave in 3D reconstruction and strengthens the case for event cameras in 3DGS. Abstract: Event cameras offer a higher temporal resolution than traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves state-of-the-art performance on the real-world datasets and competitive performance on the synthetic dataset. Also, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event selection number, and achieves sharp reconstruction on scene edges with fast training time. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. The code will be released.

[139] Adversarial Robustness in Zero-Shot Learning: An Empirical Study on Class and Concept-Level Vulnerabilities

Zhiyuan Peng,Zihan Ye,Shreyank N Gowda,Yuping Yan,Haotian Xu,Ling Shao

Main category: cs.CV

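TL;DR: This paper studies the robustness of zero-shot learning (ZSL) models to adversarial attacks at the class level and the concept level, proposes the new attacks CBEA, CPconA, and NCPconA, exposes the vulnerability of existing ZSL models to adversarial perturbations, and stresses the need to improve their adversarial robustness.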

Details

Motivation: Although zero-shot learning (ZSL) models show promise for generalization and interpretability, their robustness under systematic input perturbations is unclear, so their stability under adversarial attack needs to be evaluated and improved. Method: The paper proposes the Class-Bias Enhanced Attack (CBEA), which destroys Generalized Zero-shot Learning (GZSL) accuracy at every calibration point, and designs two new concept-level attacks, the Class-Preserving Concept Attack (CPconA) and the Non-Class-Preserving Concept Attack (NCPconA); extensive experiments cover a range of typical ZSL models from the past three years. Result: Existing ZSL models are sensitive to traditional class attacks and also highly vulnerable to concept-level attacks; CBEA eliminates GZSL accuracy at all calibration points, while CPconA and NCPconA manipulate predictions by erasing or introducing semantic concepts. Conclusion: Current ZSL models have significant weaknesses in adversarial robustness, and future work should focus on improving their security and stability at both the class level and the concept level. Abstract: Zero-shot Learning (ZSL) aims to enable image classifiers to recognize images from unseen classes that were not included during training. Unlike traditional supervised classification, ZSL typically relies on learning a mapping from visual features to predefined, human-understandable class concepts. While ZSL models promise to improve generalization and interpretability, their robustness under systematic input perturbations remains unclear. In this study, we present an empirical analysis of the robustness of existing ZSL methods at both the class level and the concept level. Specifically, we successfully disrupted their class prediction by the well-known non-target class attack (clsA). However, in the Generalized Zero-shot Learning (GZSL) setting, we observe that the success of clsA is only at the original best-calibrated point. After the attack, the optimal best-calibration point shifts, and ZSL models maintain relatively strong performance at other calibration points, indicating that clsA results in a spurious attack success in the GZSL. To address this, we propose the Class-Bias Enhanced Attack (CBEA), which completely eliminates GZSL accuracy across all calibrated points by enhancing the gap between seen and unseen class probabilities. Next, for concept-level attacks, we introduce two novel attack modes: Class-Preserving Concept Attack (CPconA) and Non-Class-Preserving Concept Attack (NCPconA). Our extensive experiments evaluate three typical ZSL models across various architectures from the past three years and reveal that ZSL models are vulnerable not only to the traditional class attack but also to concept-based attacks. These attacks allow malicious actors to easily manipulate class predictions by erasing or introducing concepts. Our findings highlight a significant performance gap between existing approaches, emphasizing the need for improved adversarial robustness in current ZSL models.

[140] SplatBright: Generalizable Low-Light Scene Reconstruction from Sparse Views via Physically-Guided Gaussian Enhancement

Yue Wen,Liang Song,Hesheng Wang

Main category: cs.CV

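TL;DR: SplatBright is the first generalizable 3D Gaussian-based framework for jointly optimizing low-light enhancement and 3D reconstruction from sparse sRGB inputs; physically guided illumination modeling and geometry-appearance decoupling improve cross-view consistency and reconstruction quality.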

Details

Motivation: Existing methods for low-light 3D reconstruction suffer from uneven exposure, color distortion, view inconsistency, and the need for per-scene training; a generalizable solution that does not require paired data is missing. Method: A dual-branch predictor provides stable initialization of the 3D Gaussian geometry parameters; frequency priors enforce illumination consistency, and an appearance refinement module separates illumination, material, and view-dependent features to recover texture; dark views are synthesized with a physical camera model for training. Result: Experiments on public and self-collected datasets show that SplatBright outperforms existing 2D and 3D methods in novel view synthesis, cross-view consistency, and generalization to unseen low-light scenes. Conclusion: Through geometry-appearance decoupling and physically guided modeling, SplatBright achieves high-quality sparse-view low-light 3D reconstruction without per-scene fine-tuning, with good generalization and practicality. Abstract: Low-light 3D reconstruction from sparse views remains challenging due to exposure imbalance and degraded color fidelity. While existing methods struggle with view inconsistency and require per-scene training, we propose SplatBright, which is, to our knowledge, the first generalizable 3D Gaussian framework for joint low-light enhancement and reconstruction from sparse sRGB inputs. Our key idea is to integrate physically guided illumination modeling with geometry-appearance decoupling for consistent low-light reconstruction. Specifically, we adopt a dual-branch predictor that provides stable geometric initialization of 3D Gaussian parameters. On the appearance side, illumination consistency leverages frequency priors to enable controllable and cross-view coherent lighting, while an appearance refinement module further separates illumination, material, and view-dependent cues to recover fine texture. To tackle the lack of large-scale geometrically consistent paired data, we synthesize dark views via a physics-based camera model for training. Extensive experiments on public and self-collected datasets demonstrate that SplatBright achieves superior novel view synthesis, cross-view consistency, and better generalization to unseen low-light scenes compared with both 2D and 3D methods.

[141] PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval

Pengxiang Ouyang,Qing Ma,Zheng Wang,Cong Bai

Main category: cs.CV

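TL;DR: Proposes a new remote sensing image-text retrieval framework that uses cross-modal gated attention and a positive-negative awareness attention mechanism to reduce the impact of pseudo-matched pairs, significantly improving performance on real-world datasets.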

Details

Motivation: Remote sensing image-text retrieval datasets contain many semantically mismatched or weakly aligned pseudo-matched pairs (PMPs), which undermine the reliability of cross-modal alignment learning. Method: A cross-modal gated attention module dynamically regulates information flow, and a positive-negative awareness attention mechanism explicitly separates informative (positive) signals from misleading (negative) ones during alignment learning. Result: Extensive experiments on the three benchmark datasets RSICD, RSITMD, and RS5M show that the method is robust and effective in handling real-world mismatches and PMPs. Conclusion: The proposed method achieves state-of-the-art performance on remote sensing image-text retrieval and mitigates the negative effect of pseudo-matched pairs on model learning. Abstract: Remote sensing (RS) image-text retrieval faces significant challenges in real-world datasets due to the presence of Pseudo-Matched Pairs (PMPs), semantically mismatched or weakly aligned image-text pairs, which hinder the learning of reliable cross-modal alignments. To address this issue, we propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive-Negative Awareness Attention mechanism to mitigate the impact of such noisy associations. The gated module dynamically regulates cross-modal information flow, while the awareness mechanism explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning. Extensive experiments on three benchmark RS datasets, i.e., RSICD, RSITMD, and RS5M, demonstrate that our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image-text retrieval tasks.

[142] SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse

Yiming Sun,Mi Zhang,Feifei Li,Geng Hong,Min Yang

Main category: cs.CV

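TL;DR: This paper proposes SmartSight, a training-free hallucination mitigation method for video large language models that uses the model's own introspective capabilities to reduce perceptual hallucinations while improving video understanding and reasoning.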

Details

Motivation: Video large language models (Video-LLMs) have advanced rapidly, but perceptual hallucinations pose significant safety risks and limit real-world use. Existing mitigation methods usually damage the model's video understanding and reasoning, so a method is needed that lowers hallucination without sacrificing performance. Method: SmartSight is training-free: it generates multiple candidate responses to surface low-hallucination outputs and scores each response with the Temporal Attention Collapse score, which detects whether the model over-attends to unimportant temporal regions of the input video. It also introduces the Visual Attention Vanishing point for more accurate hallucination estimation and early termination of hallucinated responses, substantially reducing decoding cost. Result: SmartSight lowers the hallucination rate of Qwen2.5-VL-7B by 10.59% on VRIPT-HAL and improves performance on VideoMMMU by up to 8.86%, confirming that it reduces hallucination while strengthening understanding and reasoning. Conclusion: SmartSight is an efficient, training-free hallucination mitigation framework that improves the reliability of open-source Video-LLMs, significantly reducing perceptual hallucinations while preserving or even enhancing model capability. Abstract: Despite Video Large Language Models having rapidly advanced in recent years, perceptual hallucinations pose a substantial safety risk, which severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model's capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model's own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. It assesses the hallucination of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight's effectiveness in improving the reliability of open-source Video-LLMs.

[143] AsyncDiff: Asynchronous Timestep Conditioning for Enhanced Text-to-Image Diffusion Inference

Longhuan Xu,Feng Yin,Cunjian Chen

Main category: cs.CV

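TL;DR: Proposes an asynchronous inference mechanism that decouples the denoiser's conditioning timestep from the latent-state update timestep and uses a lightweight timestep prediction module (TPM) to select a more suitable noise level, improving text-to-image generation quality.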

Details

Motivation: In conventional diffusion inference the denoiser conditioning and the latent update are synchronized, which limits flexible control over image detail and texture; this paper uses an asynchronous mechanism to improve generation quality and flexibility. Method: An asynchronous inference mechanism introduces a lightweight timestep prediction module (TPM), trained with Group Relative Policy Optimization (GRPO), that selects the denoiser conditioning timestep from the current state; at inference time a scaling hyper-parameter controls the size of the adjustment. Result: On Stable Diffusion 3.5 and Flux.1-dev, with only 10-15 inference steps, the method improves the combined Image Reward, HPSv2, CLIP Score, and Pick Score on the MS-COCO 2014 and T2I-CompBench datasets. Conclusion: The asynchronous inference mechanism improves the generation quality of text-to-image diffusion models at low computational cost, making it practical to deploy. Abstract: Text-to-image diffusion inference typically follows synchronized schedules, where the numerical integrator advances the latent state to the same timestep at which the denoiser is conditioned. We propose an asynchronous inference mechanism that decouples these two, allowing the denoiser to be conditioned at a different, learned timestep while keeping the image update schedule unchanged. A lightweight timestep prediction module (TPM), trained with Group Relative Policy Optimization (GRPO), selects a more feasible conditioning timestep based on the current state, effectively choosing a desired noise level to control image detail and textural richness. At deployment, a scaling hyper-parameter can be used to interpolate between the original and de-synchronized timesteps, enabling conservative or aggressive adjustments. To keep the study computationally affordable, we cap the inference at 15 steps for SD3.5 and 10 steps for Flux. Evaluated on Stable Diffusion 3.5 Medium and Flux.1-dev across MS-COCO 2014 and T2I-CompBench datasets, our method optimizes a composite reward that averages Image Reward, HPSv2, CLIP Score and Pick Score, and shows consistent improvement.
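
The AsyncDiff abstract says a scaling hyper-parameter interpolates between the original and de-synchronized timesteps. One plausible reading, sketched below, is a linear interpolation; the exact form used by the authors is not given in the abstract, so the formula, the function name `conditioning_timestep`, and the parameter name `scale` are assumptions made for illustration.

```python
def conditioning_timestep(t_update, t_predicted, scale):
    """Interpolate between the scheduler timestep and a predicted one.

    t_update:    timestep used by the latent update schedule (unchanged).
    t_predicted: timestep proposed by a timestep prediction module.
    scale:       hypothetical scaling hyper-parameter (assumption):
                 0.0 keeps synchronized inference, 1.0 fully trusts the
                 prediction, values in between adjust conservatively.
    """
    return t_update + scale * (t_predicted - t_update)


def conditioning_schedule(update_schedule, predicted_schedule, scale):
    """Apply the interpolation to every step of a sampling schedule."""
    return [
        conditioning_timestep(t_u, t_p, scale)
        for t_u, t_p in zip(update_schedule, predicted_schedule)
    ]
```

Under this reading, only the timestep passed to the denoiser changes; the integrator still advances the latent along `update_schedule`.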

[144] brat: Aligned Multi-View Embeddings for Brain MRI Analysis

Maxime Kayser,Maksim Gridnev,Wanting Wang,Max Bain,Aneesh Rangnekar,Avijit Chatterjee,Aleksandr Petrov,Harini Veeraraghavan,Nathaniel C. Swinburne

Main category: cs.CV

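TL;DR: This paper presents brat (brain report alignment transformer), a multi-view representation learning framework for brain MRI that is trained with clinical reports to handle the complex and localized abnormalities found in brain MRI.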

Details

Motivation: Brain MRIs contain many varied and subtle abnormalities that typically appear in only a few slices of a 3D volume, and existing methods handle these challenges poorly. Method: The authors build a brain MRI dataset 10 times larger than existing ones (about 80,000 3D scans with corresponding radiology reports) and propose a multi-view pre-training method inspired by document retrieval, with an implicit query-feature matching mechanism and a quality-diversity strategy that yields multi-view MRI embeddings aligned with clinical reports. Result: The method is validated on multiple vision-language and vision-only tasks and shows substantial performance improvements. The brat foundation models are publicly released. Conclusion: Through large-scale data and multi-view alignment learning, brat improves representation quality for brain MRI analysis and offers a new direction for joint medical image-text modeling. Abstract: We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.

[145] A Study of Finetuning Video Transformers for Multi-view Geometry Tasks

Huimin Wu,Kwang-Ting Cheng,Stephen Lin,Zhirong Wu

Main category: cs.CV

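TL;DR: This paper studies vision Transformer learning for multi-view geometry tasks (such as optical flow estimation) by fine-tuning video foundation models, and finds that general-purpose pretrained models transfer to multi-view problems with minimal adaptation and reach state-of-the-art performance.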

Details

Motivation: Previous methods rely on custom architectural designs and task-specific pretraining; this paper asks whether general-purpose video-pretrained models can transfer directly to multi-view geometry tasks, simplifying the pipeline and improving generalization. Method: A general-purpose video-pretrained Transformer is extended with a linear decoder and fine-tuned with iterative refinement to handle multi-view geometry tasks such as optical flow, depth estimation, and stereo matching. Result: The method achieves end-point error (EPE) of 0.69, 1.78, and 3.15 on Sintel clean, Sintel final, and KITTI, respectively, and sets a new record on the online test benchmark with EPE values of 0.79 and 1.88 and an F1 value of 3.79; it also performs strongly on 3D depth estimation and stereo matching. Conclusion: General-purpose video-pretrained Transformers solve a range of multi-view geometry tasks with simple adaptation, demonstrating strong transfer and broad applicability in geometric vision. Abstract: This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to state-of-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally establishes a new record on the online test benchmark with EPE values of 0.79, 1.88, and F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.
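
The end-point error (EPE) reported above is the standard optical flow metric: the Euclidean distance between predicted and ground-truth flow vectors, averaged over pixels. A minimal sketch of the computation follows; the flat list-of-tuples layout and the function name `endpoint_error` are choices made for this example, not something taken from the paper.

```python
import math


def endpoint_error(pred_flow, gt_flow):
    """Mean end-point error between two flow fields.

    pred_flow, gt_flow: equal-length lists of (u, v) displacement vectors,
    one per pixel (a flattened flow field; layout chosen for illustration).
    """
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flow fields must be non-empty and the same size")
    total = 0.0
    for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow):
        # Per-pixel Euclidean distance between predicted and true displacement.
        total += math.hypot(pu - gu, pv - gv)
    return total / len(pred_flow)
```

For example, a prediction that is off by (3, 4) pixels at one of two pixels and exact at the other has an EPE of 2.5.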

[146] EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images

Jongmin Park,Minh-Quan Viet Bui,Juan Luis Gonzalez Bello,Jaeho Moon,Jihyong Oh,Munchurl Kim

Main category: cs.CV

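TL;DR: EcoSplat is an efficiency-controllable feed-forward 3D Gaussian Splatting framework that adaptively predicts the 3D representation for a given target primitive count and outperforms existing methods in dense-view settings.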

Details

Motivation: Existing feed-forward 3D Gaussian Splatting methods produce too many primitives in dense-view settings and offer no explicit control over the number of Gaussians, so efficiency cannot be tuned. Method: Two-stage optimization: the first stage, Pixel-aligned Gaussian Training (PGT), learns the initial primitive prediction; the second stage, Importance-aware Gaussian Finetuning (IGF), ranks primitives and adjusts their parameters according to the target primitive count. Result: EcoSplat is shown to be robust across multiple dense-view settings and outperforms state-of-the-art methods under strict primitive-count limits. Conclusion: EcoSplat is the first feed-forward 3DGS framework with a controllable output primitive count, making it suitable for flexible downstream rendering tasks. Abstract: Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT), where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF), where our model learns to rank primitives and adaptively adjusts their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks.
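
The budget-control idea in EcoSplat (rank primitives, then keep only as many as the target count allows) can be illustrated with a simple top-k selection. This sketch covers only the selection step under stated assumptions: the importance scores are taken as given, whereas in the paper they are learned, and the parameter adjustment performed during IGF is not modeled. The function name `select_primitives` is hypothetical.

```python
def select_primitives(primitives, importance, target_count):
    """Keep the target_count most important primitives.

    primitives:   list of per-primitive records (any type).
    importance:   list of scores, one per primitive (assumed given here;
                  learned by the model in the paper).
    target_count: primitive budget requested at inference time.
    Returns the kept primitives in their original order.
    """
    if len(primitives) != len(importance):
        raise ValueError("one importance score is required per primitive")
    if target_count < 0:
        raise ValueError("target_count must be non-negative")
    # Indices sorted from most to least important.
    order = sorted(range(len(primitives)), key=lambda i: importance[i], reverse=True)
    # Restore original ordering among the kept indices.
    keep = sorted(order[:target_count])
    return [primitives[i] for i in keep]
```

A budget larger than the number of primitives simply returns all of them, so the same call works for any target count.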

[147] Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Linwei Qiu,Gongzhe Li,Xiaozhe Zhang,Qinlin Sun,Fengying Xie

Main category: cs.CV

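TL;DR: This paper proposes UniRect, a unified image rectification framework that simulates the distortions of different lens types to fold several task-specific inverse problems into one general distortion model. It uses a two-component structure of a Deformation Module and a Restoration Module, adds a sparse mixture-of-experts structure to resolve task competition in multi-task learning, and achieves state-of-the-art performance.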

Details

Motivation: Existing methods rely mainly on task-specific architectures, which limits their generalization and their effective use across different tasks. Method: The unified distortion rectification framework UniRect comprises a Deformation Module that uses the RP-TPS model to handle complex geometric deformations, a Restoration Module that uses RMBs to counteract degradation, and an SMoEs structure that eases task competition in multi-task learning. Result: Extensive experiments show that the proposed models achieve state-of-the-art performance compared with other recent methods. Conclusion: UniRect handles multiple image rectification tasks effectively, with good generalization and strong performance. Abstract: Image correction and rectangling are valuable tasks in practical photography systems such as smartphones. Recent remarkable advancements in deep learning have undeniably brought about substantial performance improvements in these fields. Nevertheless, existing methods mainly rely on task-specific architectures. This significantly restricts their generalization ability and effective application across a wide range of different tasks. In this paper, we introduce the Unified Rectification Framework (UniRect), a comprehensive approach that addresses these practical tasks from a consistent distortion rectification perspective. Our approach incorporates various task-specific inverse problems into a general distortion model by simulating different types of lenses. To handle diverse distortions, UniRect adopts one task-agnostic rectification framework with a dual-component structure: a Deformation Module, which utilizes a novel Residual Progressive Thin-Plate Spline (RP-TPS) model to address complex geometric deformations, and a subsequent Restoration Module, which employs Residual Mamba Blocks (RMBs) to counteract the degradation caused by the deformation process and enhance the fidelity of the output image. Moreover, a Sparse Mixture-of-Experts (SMoEs) structure is designed to circumvent heavy task competition in multi-task learning due to varying distortions. Extensive experiments demonstrate that our models have achieved state-of-the-art performance compared with other up-to-date methods.

[148] Breast Cancer Recurrence Risk Prediction Based on Multiple Instance Learning

Jinqiu Chen,Huyan Xu

Main category: cs.CV

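TL;DR: This study uses deep learning on routine H&E-stained whole-slide images to stratify breast cancer recurrence risk. It develops and compares three multiple instance learning frameworks, validates them on an in-house dataset, and shows that automated risk prediction based on computational pathology has potential for clinical use.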

Details

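Motivation: Predicting breast cancer recurrence risk is an important clinical challenge that currently depends on genomic tests (such as the 21-gene Recurrence Score), which are costly and slow. This study asks whether deep learning on routine H&E-stained slides can deliver automated, low-cost recurrence risk stratification that correlates with genomic results. Method: Three multiple instance learning (MIL) frameworks, CLAM-SB, ABMIL, and ConvNeXt-MIL-XGBoost, are trained on an in-house dataset of 210 patient cases to predict 5-year recurrence risk in three tiers (low, medium, high). Features are extracted with the UNI and CONCH pre-trained models, and performance is assessed with 5-fold cross-validation. Result: In 5-fold cross-validation the modified CLAM-SB model performs best, with a mean AUC of 0.836 and a classification accuracy of 76.2%. All models show potential for genomics-correlated risk prediction from routine tissue slides. Conclusion: Deep learning on routine H&E-stained whole-slide images can predict breast cancer recurrence risk in a way that correlates with genomic test results, offering a path to rapid, low-cost clinical decision support that may eventually complement or reduce reliance on expensive genomic testing. Abstract: Predicting breast cancer recurrence risk is a critical clinical challenge. This study investigates the potential of computational pathology to stratify patients using deep learning on routine Hematoxylin and Eosin (H&E) stained whole-slide images (WSIs). We developed and compared three Multiple Instance Learning (MIL) frameworks -- CLAM-SB, ABMIL, and ConvNeXt-MIL-XGBoost -- on an in-house dataset of 210 patient cases. The models were trained to predict 5-year recurrence risk, categorized into three tiers (low, medium, high), with ground truth labels established by the 21-gene Recurrence Score. Features were extracted using the UNI and CONCH pre-trained models. In a 5-fold cross-validation, the modified CLAM-SB model demonstrated the strongest performance, achieving a mean Area Under the Curve (AUC) of 0.836 and a classification accuracy of 76.2%. Our findings demonstrate the feasibility of using deep learning on standard histology slides for automated, genomics-correlated risk stratification, highlighting a promising pathway toward rapid and cost-effective clinical decision support.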
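
In multiple instance learning, a whole-slide image is treated as a bag of patch features that must be pooled into one slide-level vector. The sketch below shows the ungated attention pooling commonly associated with ABMIL, where each patch receives a softmax weight computed from `w . tanh(V h)`. It is a toy illustration, not the study's implementation: the parameters `V` and `w` are supplied by hand rather than learned, and the modified CLAM-SB variant reported above is not reproduced.

```python
import math


def attention_mil_pool(instances, V, w):
    """Attention-based MIL pooling over a bag of instance features.

    instances: list of feature vectors (each of length d), one per patch.
    V:         hidden-by-d projection matrix (list of rows); learned in
               practice, hand-set here for illustration.
    w:         attention vector of length hidden.
    Returns (bag_embedding, attention_weights).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Numerically stable softmax over the bag.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    # Bag embedding is the attention-weighted sum of instance features.
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return bag, weights
```

The attention weights sum to one, so they can also be read as a per-patch relevance map over the slide.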

[149] InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Kaican Li,Lewei Yao,Jiannan Wu,Tiezheng Yu,Jierun Chen,Haoli Bai,Lu Hou,Lanqing Hong,Wei Zhang,Nevin L. Zhang

Main category: cs.CV

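TL;DR: This paper introduces O3-Bench, a new benchmark for evaluating multimodal reasoning with interleaved attention to image details. To address the challenge, the authors also propose the InSight-o3 framework, consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher), and train a multimodal LLM with reinforcement learning to improve performance.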

Details

Motivation: Current open multimodal agents fall short on complex visual reasoning tasks such as analyzing charts and navigating maps, and in particular struggle to combine subtle visual information with multi-step reasoning. A more demanding benchmark and a stronger visual search mechanism are therefore needed. Method: The paper proposes O3-Bench as a hard multimodal reasoning benchmark and designs the InSight-o3 multi-agent framework, in which vSearcher performs generalized visual search (locating relational, fuzzy, or conceptual regions described in language) and works with vReasoner for multi-step reasoning; a multimodal LLM is trained with reinforcement learning specifically to strengthen vSearcher. Result: O3-Bench is highly challenging: the frontier system OpenAI o3 reaches only 40.8% accuracy. The proposed vSearcher works as a plug-and-play module that significantly improves frontier multimodal models on multiple benchmarks. Conclusion: O3-Bench and the InSight-o3 framework advance open multimodal systems with o3-like capabilities, taking a concrete step toward combining fine-grained visual search with complex reasoning. Abstract: The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .

[150] $M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

Kewei Wei,Bocheng Hu,Jie Cao,Xiaohan Chen,Zhengxi Lu,Wubing Xia,Weili Xu,Jiaao Wu,Junchen He,Mingyu Jia,Ciyun Zhao,Ye Sun,Yizhi Li,Zhonghan Zhao,Jian Zhang,Gaoang Wang

Main category: cs.CV

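TL;DR: This paper introduces $M^3-Verse$, a multi-modal, multi-state, multi-dimensional benchmark for evaluating how well large multimodal models understand object state changes in dynamic environments. Built on paired indoor scene videos (before and after a state change), it contains 270 scenes and 2,932 questions across more than 50 subtasks that test core spatial intelligence capabilities. The authors evaluate 16 advanced models and propose a simple yet effective baseline that improves multi-state perception.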

Details

Motivation: Existing large multimodal models do well on static images and single-state understanding, but their ability to understand how objects change between two different states in the same spatial context is unclear. Such cross-state spatial reasoning is essential for more advanced spatial intelligence, so a dedicated benchmark is needed to evaluate and advance it systematically. Method: The $M^3-Verse$ benchmark is built on paired multi-perspective indoor videos (before and after a state change), with more than 50 subtasks covering 4 core capabilities and 2,932 questions in total. Sixteen mainstream large multimodal models are evaluated systematically, and an improved baseline is proposed to strengthen multi-state perception. Result: $M^3-Verse$ contains 270 scenes and 2,932 questions; the evaluation shows clear limitations of current LMMs in tracking state transitions. The proposed baseline achieves significant performance improvements on multi-state perception tasks. Conclusion: $M^3-Verse$ provides a challenging new platform for evaluating and advancing multimodal models' understanding of spatial and state changes in dynamic environments, supporting the development of next-generation models with more complete visual understanding. Abstract: Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. You can get the construction pipeline from https://github.com/Wal-K-aWay/M3-Verse_pipeline and full benchmark data from https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.

[151] Application of deep learning approaches for medieval historical documents transcription

Maksym Voloshchuk,Bohdana Zarembovska,Mykola Kozlenko

Main category: cs.CV

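TL;DR: This paper presents a deep learning text extraction method for handwritten Latin documents of the 9th to 11th centuries that accounts for the properties specific to medieval documents, implements a complete pipeline from object detection to word recognition, and reports results on several evaluation metrics.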

Details

Motivation: Modern OCR technology is less effective on medieval Latin manuscripts because of their distinctive writing styles, language evolution, and degradation, so specialized methods are needed to improve recognition accuracy. Method: Deep learning models are combined into a complete text extraction pipeline covering dataset construction, object detection, word-level image embedding, and word recognition with classification models, alongside an exploratory analysis of the data. Result: The models are evaluated with recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance, and plots of the metrics illustrate performance. Conclusion: The method adapts to the characteristics of medieval handwritten Latin documents and offers a workable automated solution for digitizing historical documents; the code is open source to support follow-up research. Abstract: Handwritten text recognition and optical character recognition solutions show excellent results when processing modern-era data, but their efficiency drops on medieval Latin documents. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and the related works and studies. The paper presents the steps of dataset development for further training of the models. The exploratory data analysis of the processed data is provided as well. The paper explains the pipeline of deep learning models to extract text information from the document images, from detecting objects to word recognition using classification models and embedding word images. The paper reports the following results: recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance. The plots of the metrics are also included. The implementation is published on the GitHub repository.
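
Among the reported metrics, "mean string distance" is not defined in the abstract. A common choice for transcription quality is the Levenshtein edit distance between each predicted word and its reference, averaged over the test set; the sketch below assumes that reading, which may differ from the authors' definition. The function names are invented for this example.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def mean_string_distance(predictions, references):
    """Average edit distance over paired predictions and references
    (assumed definition; the paper may use a different distance)."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need equally many predictions and references")
    total = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)
```

A value of 0 means every word was transcribed exactly; each unit corresponds to one character-level edit per word on average.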

[152] AMLID: An Adaptive Multispectral Landmine Identification Dataset for Drone-Based Detection

James E. Gallagher,Edward J. Oughton

Main category: cs.CV

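TL;DR: This paper presents the Adaptive Multispectral Landmine Identification Dataset (AMLID), the first open-source dataset combining RGB and long-wave infrared (LWIR) imagery for landmine detection with unmanned aerial systems (UAS). It contains 12,078 labeled images covering many landmine types and environmental conditions and aims to make humanitarian demining research more accessible and efficient.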

Details

Motivation: Landmines remain a serious humanitarian threat worldwide, and current detection methods are dangerous, inefficient, and expensive; a safe, efficient, and widely accessible detection approach is urgently needed. Method: AMLID is a multispectral dataset fusing RGB and LWIR imagery, with 12,078 labeled images covering 21 landmine types, 11 spectral fusion levels, 4 sensor altitudes, 2 seasonal periods, and 3 illumination conditions, supporting the development of adaptive drone-based detection algorithms. Result: AMLID is the first public multispectral landmine detection dataset; its broad environmental and spectral diversity supports training and evaluating detection models without access to live ordnance or expensive collection equipment. Conclusion: AMLID substantially lowers the technical and resource barriers to landmine detection research and helps democratize humanitarian demining technology. Abstract: Landmines remain a persistent humanitarian threat, with an estimated 110 million mines deployed across 60 countries, claiming approximately 26,000 casualties annually. Current detection methods are hazardous, inefficient, and prohibitively expensive. We present the Adaptive Multispectral Landmine Identification Dataset (AMLID), the first open-source dataset combining Red-Green-Blue (RGB) and Long-Wave Infrared (LWIR) imagery for Unmanned Aerial Systems (UAS)-based landmine detection. AMLID comprises 12,078 labeled images featuring 21 globally deployed landmine types across anti-personnel and anti-tank categories in both metal and plastic compositions. The dataset spans 11 RGB-LWIR fusion levels, four sensor altitudes, two seasonal periods, and three daily illumination conditions. By providing comprehensive multispectral coverage across diverse environmental variables, AMLID enables researchers to develop and benchmark adaptive detection algorithms without requiring access to live ordnance or expensive data collection infrastructure, thereby democratizing humanitarian demining research.
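
The abstract mentions 11 RGB-LWIR fusion levels without saying how they are produced. One simple way to picture them, sketched below, is a per-pixel linear blend whose LWIR weight runs from 0.0 to 1.0 in steps of 0.1. This is purely an assumption for illustration: the actual fusion procedure, the even spacing, and the function names `fusion_weights` and `blend_pixel` are not taken from the paper.

```python
def fusion_weights(levels=11):
    """Evenly spaced LWIR weights from 0.0 (pure RGB) to 1.0 (pure LWIR).

    The even spacing is an assumption; the dataset's levels may differ.
    """
    if levels < 2:
        raise ValueError("need at least two fusion levels")
    return [i / (levels - 1) for i in range(levels)]


def blend_pixel(rgb, lwir, weight):
    """Linear blend of one RGB pixel with one colorized LWIR pixel.

    rgb, lwir: (r, g, b) tuples with channel values in 0..255.
    weight:    LWIR contribution in [0, 1].
    """
    return tuple(
        round((1.0 - weight) * c + weight * t)
        for c, t in zip(rgb, lwir)
    )
```

Under this reading, a detector could be evaluated at each weight to see which blend suits a given altitude, season, or time of day.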

[153] Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding

Ruiqi Ma,Yu Yan,Chunhong Zhang,Minghao Yin,XinChao Liu,Zhihong Jin,Zheng Hu

Main category: cs.CV

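TL;DR: Proposes Hallucination Disentangled Decoding (HDD), a training-free method that uses image segmentation and a blank image to reduce hallucinations of large vision-language models in both the language and visual modalities.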

Details

Motivation: Large vision-language models (LVLMs) hallucinate severely in object recognition tasks, producing text that does not match the visual content and limiting practical use. Method: Hallucination Disentangled Decoding (HDD) segments the original image and selects images that augment it, while using a blank image to remove hallucinations caused by language priors; no training is required. Result: HDD reduces hallucinations in both the language and visual modalities, lowers the model's dependence on language priors, and improves visual understanding. Conclusion: HDD is an effective training-free intervention that mitigates LVLM hallucinations at the multimodal level and improves model reliability. Abstract: Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce the Hallucination Disentangled Decoding (HDD) method, which requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also enhances its visual performance. (Code: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)

[154] Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

Tianrui Zhu,Shiyi Zhang,Zhirui Sun,Jingqi Tian,Yansong Tang

Main category: cs.CV

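TL;DR: Proposes the Memorize-and-Generate (MAG) framework, which separates memory compression from frame generation to address the loss of historical context and the high memory overhead in long video generation.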

Details

Motivation: Existing long-video generation methods that use window attention discard historical information, causing scene inconsistency, whereas keeping the full history costs too much memory. A method is needed that compresses memory effectively while preserving scene consistency. Method: Two separate models are designed: a memory model compresses historical information into a compact KV cache, and a generator model uses that cache to produce subsequent frames. MAG-Bench is built to evaluate memory retention. Result: MAG performs strongly on historical scene consistency, stays competitive on standard video generation benchmarks, and significantly reduces memory consumption. Conclusion: By decoupling memory from generation, MAG balances memory retention and computational efficiency in long video generation and offers a scalable new paradigm for autoregressive video modeling. Abstract: Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.

[155] IPCV: Information-Preserving Compression for MLLM Visual Encoders

Yuan Chen,Zichen Wen,Yuzhou Wu,Xuyang Liu,Shuang Chen,Junpeng Ma,Weijia Li,Conghui He,Linfeng Zhang

Main category: cs.CV

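TL;DR: This paper proposes IPCV, a training-free, information-preserving compression framework for the visual encoders of multimodal large language models (MLLMs). It performs aggressive token pruning inside the ViT and uses neighbor-guided reconstruction and attention stabilization to cut computational cost while retaining key visual information.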

Details

Motivation: Existing token pruning strategies do not effectively address the high computational overhead that the Vision Transformer (ViT) encoder in MLLMs incurs from processing many visual tokens; they also tend to lose visual cues that are critical for text understanding and introduce feature distortion. Method: Proposes the IPCV framework with two core components: 1) Neighbor-Guided Reconstruction (NGR), which temporarily reconstructs pruned tokens so they can take part in attention and fully restores them before they are passed to the LLM; 2) Attention Stabilization (AS), which approximates the K/V of pruned tokens to lessen the negative impact of pruning. The method needs no training and can be combined with existing LLM-side pruning methods. Result: Experiments show that IPCV significantly reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods on a range of image and video benchmarks. Conclusion: IPCV is an efficient, general visual token compression scheme that greatly reduces MLLM computation without sacrificing performance and has good application prospects. Abstract: Multimodal Large Language Models (MLLMs) deliver strong vision-language performance but at high computational cost, driven by numerous visual tokens processed by the Vision Transformer (ViT) encoder. Existing token pruning strategies are inadequate: LLM-stage token pruning overlooks the ViT's overhead, while conventional ViT token pruning, without language guidance, risks discarding textually critical visual cues and introduces feature distortions amplified by the ViT's bidirectional attention. To meet these challenges, we propose IPCV, a training-free, information-preserving compression framework for MLLM visual encoders. IPCV enables aggressive token pruning inside the ViT via Neighbor-Guided Reconstruction (NGR) that temporarily reconstructs pruned tokens to participate in attention with minimal overhead, then fully restores them before passing to the LLM. Besides, we introduce Attention Stabilization (AS) to further alleviate the negative influence from token pruning by approximating the K/V of pruned tokens. It can be directly applied to previous LLM-side token pruning methods to enhance their performance. Extensive experiments show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks. Our code is available at https://github.com/Perkzi/IPCV.

[156] Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos

Xiaoyang Li,Wenzhu Yang,Kanglin Wang,Tiebiao Wang,Qingsong Fei

Main category: cs.CV

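TL;DR: Proposes the Context-Aware Network (CAN) for video action recognition, which improves performance through multi-scale spatio-temporal feature extraction modules.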

Details

Motivation: Existing methods overlook the multi-granularity nature of actions and fail to fully capture multi-scale spatio-temporal cues. Method: Two core modules are designed: the Multi-scale Temporal Cue Module (MTCM) and the Group Spatial Cue Module (GSCM), which extract multi-scale temporal and spatial features respectively. Result: Experiments on five benchmark datasets show strong performance, with accuracies of 50.4%, 63.9%, 88.4%, 74.9%, and 86.9% on Something-Something V1, Something-Something V2, Diving48, Kinetics-400, and UCF101 respectively. Conclusion: Capturing multi-scale spatio-temporal cues is essential for improving action recognition, and CAN is an effective way to do so. Abstract: Action recognition is a critical task in video understanding, requiring the comprehensive capture of spatio-temporal cues across various scales. However, existing methods often overlook the multi-granularity nature of actions. To address this limitation, we introduce the Context-Aware Network (CAN). CAN consists of two core modules: the Multi-scale Temporal Cue Module (MTCM) and the Group Spatial Cue Module (GSCM). MTCM effectively extracts temporal cues at multiple scales, capturing both fast-changing motion details and overall action flow. GSCM, on the other hand, extracts spatial cues at different scales by grouping feature maps and applying specialized extraction methods to each group. Experiments conducted on five benchmark datasets (Something-Something V1 and V2, Diving48, Kinetics-400, and UCF101) demonstrate the effectiveness of CAN. Our approach achieves competitive performance, outperforming most mainstream methods, with accuracies of 50.4% on Something-Something V1, 63.9% on Something-Something V2, 88.4% on Diving48, 74.9% on Kinetics-400, and 86.9% on UCF101. These results highlight the importance of capturing multi-scale spatio-temporal cues for robust action recognition.

[157] MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation

Guohui Zhang,Hu Yu,Xiaoxiao Ma,Yaning Pan,Hang Xu,Feng Zhao

Main category: cs.CV

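TL;DR: This paper proposes MaskFocus, a new reinforcement learning framework for masked generative models that achieves effective policy optimization by focusing on critical steps.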

Details

Motivation: Reinforcement learning has shown promise for language models and autoregressive visual generative models, but its use with masked generative models is limited by the high computational cost of their multi-step iterative process. Method: Step-level information gain is determined by measuring the similarity between the intermediate image at each sampling step and the final generated image; the most critical steps are identified from it and policy optimization is concentrated on them. An entropy-based dynamic routing sampling mechanism is designed to explore more valuable masking strategies. Result: Experiments on multiple text-to-image benchmarks validate the method's effectiveness. Conclusion: By focusing on critical steps and using dynamic sampling, MaskFocus improves both the efficiency of RL optimization and the generation quality of masked generative models. Abstract: Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core factor is that policy optimization requires accounting for the likelihood of each step due to its multi-step and iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas naively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.
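
The step-selection idea above can be illustrated with a small sketch. It assumes, as one plausible reading of the abstract, that the information gain of a step is the increase in similarity to the final image produced by that step, and that images are compared as flat vectors with cosine similarity. The function names and toy vectors are hypothetical; the paper's actual similarity measure and selection rule may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def critical_steps(intermediates, k):
    """Return the k step indices with the largest similarity gain.

    intermediates[t] is the (flattened) image after sampling step t;
    the last entry is the final generated image.
    """
    final = intermediates[-1]
    sims = [cosine(x, final) for x in intermediates]
    # Assumed gain of step t: similarity after step t minus before it.
    gains = [sims[t] - sims[t - 1] for t in range(1, len(sims))]
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    return sorted(i + 1 for i in ranked[:k]), sims


# Toy trajectory of five "images" with three pixels each.
intermediates = [
    [1.0, 0.0, 0.0],
    [1.0, 0.1, 0.0],
    [0.2, 1.0, 0.0],
    [0.1, 1.0, 0.1],
    [0.0, 1.0, 0.1],
]
steps, sims = critical_steps(intermediates, k=2)
```

Here the largest jump toward the final image happens at step 2, so a focused optimizer would spend its policy-gradient budget there first rather than on a randomly chosen step.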

[158] In-Context Audio Control of Video Diffusion Transformers

Wenze Liu,Weicai Ye,Minghong Cai,Quande Liu,Xintao Wang,Xiangyu Yue

Main category: cs.CV

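TL;DR: This paper proposes the ICAC framework, which explores integrating audio signals into a unified video diffusion transformer for speech-driven video generation, and introduces a masked 3D attention mechanism that enables stable training and strong performance.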

Details

Motivation: Existing video generation models focus mainly on modalities such as text and images and pay too little attention to strictly time-synchronous audio signals; this paper explores how to fuse audio conditions effectively within a unified architecture. Method: Three audio-condition injection mechanisms are explored: standard cross-attention, 2D self-attention, and unified 3D self-attention. A masked 3D attention mechanism is designed to constrain the attention pattern and enforce temporal alignment. Result: Experiments show that the proposed masked 3D attention yields stable training and superior lip synchronization and video quality. Conclusion: By introducing masked 3D attention, ICAC integrates audio signals into a unified video diffusion transformer and achieves efficient, high-quality speech-driven video generation. Abstract: Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.
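
To make "constraining the attention pattern to enforce temporal alignment" concrete, here is a minimal sketch of one possible mask. It assumes a joint sequence of video tokens followed by audio tokens, lets tokens of the same modality attend freely, and allows cross-modal attention only between tokens whose frame indices differ by at most `window`. The layout, the `window` parameter, and the function name `build_mask` are assumptions for illustration; the abstract does not specify the exact mask used by ICAC.

```python
def build_mask(num_frames, patches_per_frame, audio_per_frame, window=0):
    """Boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Sequence layout (assumed): all video tokens in frame order,
    then all audio tokens in frame order.
    """
    video_frames = [f for f in range(num_frames)
                    for _ in range(patches_per_frame)]
    audio_frames = [f for f in range(num_frames)
                    for _ in range(audio_per_frame)]
    kinds = ["v"] * len(video_frames) + ["a"] * len(audio_frames)
    frames = video_frames + audio_frames

    n = len(frames)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            same_modality = kinds[q] == kinds[k]
            aligned = abs(frames[q] - frames[k]) <= window
            # Cross-modal attention is allowed only for aligned frames.
            mask[q][k] = same_modality or aligned
    return mask


# 3 frames, 2 video patches and 1 audio token per frame: 9 tokens in total.
mask = build_mask(num_frames=3, patches_per_frame=2, audio_per_frame=1)
```

With `window=0`, the video tokens of frame 0 see only the audio token of frame 0, which removes the many misaligned audio-visual pairs that full 3D attention would otherwise have to learn to ignore.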

[159] Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers

Fanis Mathioulakis,Gorjan Radevski,Tinne Tuytelaars

Main category: cs.CV

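TL;DR: Eff-GRot is an efficient, generalizable method for estimating object rotation from RGB images. It requires no object-specific training and uses a transformer to jointly process rotation-aware representations of multiple reference images and a query image in latent space.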

Details

Motivation: Existing rotation estimation methods usually depend on object- or category-specific training and are computationally expensive, making real-time requirements hard to meet. Method: Proposes Eff-GRot, which uses a transformer to compare, in latent space, the rotation-aware features of a query image with those of multiple reference images of known pose, predicting the rotation directly in a single forward pass. Result: Experiments show that Eff-GRot markedly improves computational efficiency while maintaining high accuracy, making it suitable for low-latency applications. Conclusion: Eff-GRot offers a practical route to efficient, general rotation estimation, and is end-to-end, scalable, and promising for real-world use. Abstract: We introduce Eff-GRot, an approach for efficient and generalizable rotation estimation from RGB images. Given a query image and a set of reference images with known orientations, our method directly predicts the object's rotation in a single forward pass, without requiring object- or category-specific training. At the core of our framework is a transformer that performs a comparison in the latent space, jointly processing rotation-aware representations from multiple references alongside a query. This design enables a favorable balance between accuracy and computational efficiency while remaining simple, scalable, and fully end-to-end. Experimental results show that Eff-GRot offers a promising direction toward more efficient rotation estimation, particularly in latency-sensitive applications.

[160] Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation

Guangtao Lyu,Chenghao Xu,Qi Liu,Jiexi Yan,Muli Yang,Fen Fang,Cheng Deng

Main category: cs.CV

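TL;DR: This paper proposes TempoMoE, a hierarchical tempo-aware mixture-of-experts model for generating 3D dance from music. It does not rely on noisy genre labels; instead it selects experts dynamically according to the music's tempo (BPM), producing high-quality, rhythm-aligned dance.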

Details

Motivation: Existing methods rely on noisy, coarse music genre labels, which capture musical diversity poorly and lead to rhythm misalignment or stylistic drift. Tempo (for example BPM) is more stable across datasets and genres, so the authors exploit it to improve the rhythmic consistency of generated dance. Method: Proposes TempoMoE, a tempo-based hierarchical mixture-of-experts module that groups motion experts by BPM range and adds multi-scale beat experts to model short- and long-range rhythmic dynamics. A hierarchical rhythm-adaptive routing mechanism dynamically selects and fuses experts from music features, strengthening the diffusion model's rhythm perception. Result: Experiments show that TempoMoE achieves state-of-the-art dance quality and rhythm alignment, outperforms methods that depend on genre labels, and requires no manually annotated labels. Conclusion: Using musical tempo (BPM) to organize the experts of a generative model is effective and robust; TempoMoE offers a finer-grained, adaptive, genre-label-free paradigm for music-to-dance generation. Abstract: Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing dynamically selects and fuses experts from music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
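
The two-level routing described above (first a tempo group, then experts within it) can be sketched as follows. This is only an illustration under stated assumptions: the BPM boundaries in `TEMPO_GROUPS` are invented, the tempo is passed in as an explicit number rather than inferred from music features, and the gate is a plain softmax over given scores. Only the 60 to 200 BPM range comes from the abstract.

```python
import math

# Hypothetical BPM ranges; only the overall 60-200 span is from the abstract.
TEMPO_GROUPS = [(60, 100), (100, 140), (140, 200)]


def select_group(bpm):
    """First routing level: map a tempo to the index of its expert group."""
    bpm = min(max(bpm, 60), 200)  # clamp outliers into the supported range
    for idx, (lo, hi) in enumerate(TEMPO_GROUPS):
        if lo <= bpm < hi:
            return idx
    return len(TEMPO_GROUPS) - 1  # bpm == 200 falls in the last group


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(bpm, gate_scores):
    """Second routing level: weight the experts inside the chosen group.

    gate_scores[g] holds one gate score per expert in group g.
    """
    group = select_group(bpm)
    return group, softmax(gate_scores[group])


def fuse(expert_outputs, weights):
    """Weighted sum of the expert output vectors."""
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]


group, weights = route(120, [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
fused = fuse([[1.0, 0.0], [0.0, 1.0]], weights)
```

Grouping by tempo first means that experts are only ever compared with peers trained on similar rhythmic speeds, which is the property the abstract argues is more stable than genre labels.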

[161] FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation

Ziyuan Tao,Chuanzhi Xu,Sandaru Jayawardana,Wei Bao,Kanchana Thilakarathna,Teng Joon Lim

Main category: cs.CV

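TL;DR: Proposes an on-device federated learning framework for video violence detection that combines self-supervised VideoMAE representations, LoRA-based efficient fine-tuning, and defense-in-depth privacy protection, significantly reducing communication overhead while preserving privacy.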

Details

Motivation: Cloud-based video processing suffers from privacy leakage, high bandwidth consumption, and inference latency, so a privacy-preserving, on-device video moderation solution is needed. Method: Uses self-supervised VideoMAE as the backbone, applies LoRA for parameter-efficient fine-tuning, adds DP-SGD and secure aggregation for differential privacy, and runs federated learning experiments with 40 clients. Result: On RWF-2000, accuracy reaches 77.25% without privacy protection and stays at 65-66% under strong differential privacy, while communication cost is 28.3× lower than full-model federated learning. Conclusion: The framework performs well at protecting user privacy and reducing communication overhead, and suits efficient, secure content moderation on short-video platforms. Abstract: The rapid growth of short-form video platforms increases the need for privacy-preserving moderation, as cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency. To address these challenges, we propose an on-device federated learning framework for video violence detection that integrates self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection. Our approach reduces the trainable parameter count to 5.5M (~3.5% of a 156M backbone) and incorporates DP-SGD with configurable privacy budgets and secure aggregation. Experiments on RWF-2000 with 40 clients achieve 77.25% accuracy without privacy protection and 65-66% under strong differential privacy, while reducing communication cost by 28.3× compared to full-model federated learning. The code is available at: https://github.com/zyt-599/FedVideoMAE
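
As a reminder of what the DP-SGD component does, the sketch below shows the standard per-example clipping and Gaussian noising step, plus a check of the trainable-parameter fraction quoted in the abstract. It is a generic textbook-style illustration, not the authors' implementation: the function names, the toy gradients, and the specific `max_norm` and `noise_multiplier` values are assumptions.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average(per_example_grads, max_norm, noise_multiplier, rng):
    """DP-SGD style aggregate: clip each example, sum, add noise, average.

    Noise is Gaussian with standard deviation noise_multiplier * max_norm,
    added once to the summed gradient.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
no_noise = dp_average(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = dp_average(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)

# Figures from the abstract: 5.5M trainable parameters of a 156M backbone.
trainable_fraction = 5.5e6 / 156e6
```

The fraction evaluates to roughly 3.5%, matching the abstract. Because only the LoRA parameters are trained, only those parameters need to be clipped, noised, and communicated, which is consistent with the reported drop in communication cost.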

[162] Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction

Guangtao Lyu,Xinyi Cheng,Chenghao Xu,Qi Liu,Muli Yang,Fen Fang,Huilin Chen,Jiexi Yan,Xu Yang,Cheng Deng

Main category: cs.CV

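TL;DR: This paper studies hallucination in large vision-language models (LVLMs), reveals a three-stage GATE process in visual perception and an SAD pattern in generation, and on that basis proposes the VDC strategy, which effectively reduces hallucinations.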

Details

Motivation: Although LVLMs show strong capabilities, hallucinations remain widespread and undermine output reliability, so their internal mechanisms need to be understood and corrected. Method: By systematically analyzing how visual perception and token generation evolve inside LVLMs, the authors identify the GATE perception pattern and the SAD generation pattern, and design the VDC (Validated Dominance Correction) strategy, which detects unsupported hallucinated tokens and replaces them with validated dominant tokens. Result: Experiments show that VDC significantly reduces hallucinations across multiple models and benchmarks, improving the accuracy and reliability of generated output. Conclusion: Hallucinations in LVLMs stem from the accumulation of subdominant tokens during generation; VDC suppresses this by strengthening the consistency between perception and generation, offering a new direction for building more reliable multimodal models. Abstract: Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten on core content, and later layers Explore supplementary regions. Second, generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, where hallucinated tokens arise from the repeated accumulation of subdominant tokens lacking support from attention (visual perception) or feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.

[163] EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

Yuxiao Yang,Hualian Sheng,Sijia Cai,Jing Lin,Jiahao Wang,Bing Deng,Junzhe Lu,Haoqian Wang,Jieping Ye

Main category: cs.CV

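TL;DR: EchoMotion is a new video generation framework that jointly models appearance and human motion, markedly improving the quality of generated videos of complex human actions.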

Details

Motivation: Existing video generation models rely mainly on pixel-level training objectives, which favor appearance fidelity and capture the kinematics of human motion poorly, so complex human actions are synthesized badly. Method: Proposes the EchoMotion framework, which extends DiT with a dual-branch structure, introduces MVS-RoPE to give video and motion tokens a unified 3D positional encoding, and adopts a two-stage training strategy. Training uses HuMoVe, a large-scale video-motion paired dataset built by the authors. Result: The model jointly generates high-quality human action videos and their corresponding motion sequences, and supports cross-modal conditional generation. Experiments show that explicitly modeling human motion significantly improves the temporal coherence and plausibility of generated videos. Conclusion: Explicitly introducing a human motion representation and modeling it jointly with appearance is an effective way to improve human action video generation, and EchoMotion provides a working framework design for it. Abstract: Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.

[164] Brain-Gen: Towards Interpreting Neural Signals for Stimulus Reconstruction Using Transformers and Latent Diffusion Models

Hasib Aslam,Muhammad Talal Faiz,Muhammad Imran Malik

Main category: cs.CV

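TL;DR: Proposes a framework based on transformers and latent diffusion models for decoding visual stimuli from electroencephalography (EEG) signals, improving semantic structure modeling and zero-shot generalization.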

Details

Motivation: Existing methods interpret the neural representations in EEG signals only to a limited extent, mainly because of EEG's high noise, spatial diffusion, and temporal variability. Method: A transformer extracts spatio-temporal features from EEG signals, and these features are incorporated into the attention mechanisms of a latent diffusion model (LDM) to reconstruct visual stimuli from brain activity. Result: On public datasets, the method improves latent-space clustering accuracy by up to 6.5% and zero-shot generalization by up to 11.8%, with generation quality comparable to existing baselines. Conclusion: The work advances generalizable semantic interpretation of EEG signals and offers a new approach to EEG decoding. Abstract: Advances in neuroscience and artificial intelligence have enabled preliminary decoding of brain activity. However, despite the progress, the interpretability of neural representations remains limited. A significant challenge arises from the intrinsic properties of electroencephalography (EEG) signals, including high noise levels, spatial diffusion, and pronounced temporal variability. To interpret the neural mechanism underlying thoughts, we propose a transformer-based framework to extract spatial-temporal representations associated with observed visual stimuli from EEG recordings. These features are subsequently incorporated into the attention mechanisms of Latent Diffusion Models (LDMs) to facilitate the reconstruction of visual stimuli from brain activity. The quantitative evaluations on publicly available benchmark datasets demonstrate that the proposed method excels at modeling the semantic structures from EEG signals, achieving up to a 6.5% increase in latent space clustering accuracy and an 11.8% increase in zero-shot generalization across unseen classes while having comparable Inception Score and Fréchet Inception Distance with existing baselines. Our work marks a significant step towards generalizable semantic interpretation of the EEG signals.

[165] VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference

Sicheng Song,Yanjie Zhang,Zixin Chen,Huamin Qu,Changbo Wang,Chenhui Li

Main category: cs.CV

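TL;DR: This paper proposes VizDefender, a framework for detecting and analyzing tampering in data visualizations, which combines a semi-fragile watermark with multimodal large language models to localize tampered regions and infer the attacker's intent.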

Details

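Motivation: Image editing techniques expose data visualizations to tampering that undermines their integrity and credibility, so effective detection and analysis methods are needed. Method: Proposes the VizDefender framework, comprising a semi-fragile watermark module (which embeds a location map to localize tampering) and an intent analysis module (which uses multimodal large language models to interpret the manipulation intent). Result: Experimental evaluations and user studies show that the framework detects tampering, precisely localizes tampered regions, and infers the attacker's intent and its misleading effects. Conclusion: VizDefender offers an effective solution for protecting the integrity of data visualizations; by combining technical detection with semantic analysis, it strengthens defenses against visualization tampering. Abstract: The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map into images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker's intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.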

Alina Elena Baia,Andrea Cavallaro

Main category: cs.CV

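TL;DR: This paper proposes DeX (Decompose and Explain), a novel concept-driven counterfactual explanation framework that uses cross-modal decompositionality and image-specific concepts to generate counterfactual scenarios in natural language. Applied to image privacy decisions, it quantifies how much key scene elements contribute to the model's prediction, requires no training, is highly flexible, and can uncover dataset biases to improve fairness.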

Details

Motivation: To better explain classifier predictions for subjective, context-dependent decisions such as image privacy, a method is needed that produces semantically clear, targeted explanations and quantifies the contribution of each explanatory factor. Method: Proposes the DeX framework, which combines cross-modal decompositionality with image-specific concepts, identifies key decision factors through a multi-criterion selection mechanism (considering image similarity and decision confidence), and generates image-grounded, sparse natural-language counterfactual explanations, all without retraining the model. Result: DeX reveals the main factors influencing subjective decisions, identifies potential biases in the dataset, produces explanations that outperform the state of the art, and is flexible and interpretable. Conclusion: DeX is an efficient, training-free, concept-driven explanation method that is particularly suited to complex, subjective visual decision tasks; it improves model transparency and helps uncover and mitigate data bias, supporting fairness improvements. Abstract: Concept-driven counterfactuals explain decisions of classifiers by altering the model predictions through semantic changes. In this paper, we present a novel approach that leverages cross-modal decompositionality and image-specific concepts to create counterfactual scenarios expressed in natural language. We apply the proposed interpretability framework, termed Decompose and Explain (DeX), to the challenging domain of image privacy decisions, which are contextual and subjective. This application enables the quantification of the differential contributions of key scene elements to the model prediction. We identify relevant decision factors via a multi-criterion selection mechanism that considers both image similarity for minimal perturbations and decision confidence to prioritize impactful changes. This approach evaluates and compares diverse explanations, and assesses the interdependency and mutual influence among explanatory properties. By leveraging image-specific concepts, DeX generates image-grounded, sparse explanations, yielding significant improvements over the state of the art. Importantly, DeX operates as a training-free framework, offering high flexibility. Results show that DeX not only uncovers the principal contributing factors influencing subjective decisions, but also identifies underlying dataset biases allowing for targeted mitigation strategies to improve fairness.

[167] CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis

Kaidi Liang,Ke Li,Xianbiao Hu,Ruwen Qin

Main category: cs.CV

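TL;DR: This paper proposes CrashChat, a traffic crash video analysis framework based on a multimodal large language model. Using a multitask learning strategy built on task decoupling and grouping, it achieves state-of-the-art performance on crash recognition, localization, and high-level understanding tasks.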

Details

Motivation: Existing models cannot perform crash recognition, temporal grounding, and high-level video understanding within one unified framework, and effective training strategies are lacking. The paper therefore aims at a unified model that handles these varied analysis needs and raises the level of automation in crash analysis for autonomous driving. Method: CrashChat is a multimodal large language model built on VideoLLaMA3. It acquires domain knowledge through instruction fine-tuning and uses a multitask learning strategy based on task decoupling and grouping to maximize the benefit of joint learning within and across task groups while mitigating negative transfer. Result: Experiments show that CrashChat outperforms existing MLLMs and traditional vision-based methods on consolidated public datasets: near-perfect accuracy in crash recognition, a 176% improvement in crash localization, and a 40% improvement in pre-crash localization; on description and reasoning tasks, BLEU and ROUGE scores rise by 0.18-0.41 and 0.18-0.42 respectively. Conclusion: CrashChat is an efficient, end-to-end multitask tool for traffic crash analysis with strong performance and practical deployability, advancing automated crash analysis for autonomous driving. Abstract: Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176% improvement in crash localization, and a 40% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at https://github.com/Liangkd/CrashChat.

[168] Localising Shortcut Learning in Pixel Space via Ordinal Scoring Correlations for Attribution Representations (OSCAR)

Akshit Achara,Peter Triantafillou,Esther Puyol-Antón,Alexander Hammers,Andrew P. King

Main category: cs.CV

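TL;DR: This paper proposes OSCAR (Ordinal Scoring Correlations for Attribution Representations), a model-agnostic framework for quantifying and localising shortcut learning in deep neural networks, particularly in settings involving sensitive attributes such as medical imaging.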

Details

Motivation: Existing methods are mostly qualitative and image-level and assume that shortcut features are visible, which makes them hard to apply in domains such as medical imaging and ineffective at identifying biases tied to sensitive attributes. Method: OSCAR converts image-level attribution maps into dataset-level rank profiles of image regions and compares pairwise, partial, and deviation-based correlations across three models (a balanced baseline BA, a test model TS, and a sensitive attribute predictor SA), producing quantitative metrics and a ranking of contributing regions. Result: Experiments on CelebA, CheXpert, and ADNI confirm that the method is stable, is sensitive to the strength of shortcut association, and can distinguish localised from diffuse shortcut features; a test-time attenuation strategy based on the identified regions reduces worst-group performance disparities. Conclusion: OSCAR provides a lightweight, pixel-space audit tool that yields statistical decision rules and spatial maps, helping users detect, localise, and mitigate shortcut reliance. Abstract: Deep neural networks often exploit shortcuts. These are spurious cues which are associated with output labels in the training data but are unrelated to task semantics. When the shortcut features are associated with sensitive attributes, shortcut learning can lead to biased model performance. Existing methods for localising and understanding shortcut learning are mostly based upon qualitative, image-level inspection and assume cues are human-visible, limiting their use in domains such as medical imaging. We introduce OSCAR (Ordinal Scoring Correlations for Attribution Representations), a model-agnostic framework for quantifying shortcut learning and localising shortcut features. OSCAR converts image-level task attribution maps into dataset-level rank profiles of image regions and compares them across three models: a balanced baseline model (BA), a test model (TS), and a sensitive attribute predictor (SA). By computing pairwise, partial, and deviation-based correlations on these rank profiles, we produce a set of quantitative metrics that characterise the degree of shortcut reliance for TS, together with a ranking of image-level regions that contribute most to it. Experiments on CelebA, CheXpert, and ADNI show that our correlations are (i) stable across seeds and partitions, (ii) sensitive to the level of association between shortcut features and output labels in the training data, and (iii) able to distinguish localised from diffuse shortcut features. As an illustration of the utility of our method, we show how worst-group performance disparities can be reduced using a simple test-time attenuation approach based on the identified shortcut regions. OSCAR provides a lightweight, pixel-space audit that yields statistical decision rules and spatial maps, enabling users to test, localise, and mitigate shortcut reliance. The code is available at https://github.com/acharaakshit/oscar
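
The pairwise and partial correlations on rank profiles can be illustrated with textbook formulas. The sketch below computes Spearman correlations between per-region attribution scores of the three models and the first-order partial correlation between TS and SA controlling for BA. The toy scores are invented, ties are not handled, and OSCAR's actual metric definitions (including its deviation-based correlation) may differ from this generic version.

```python
import math


def ranks(values):
    """Rank positions 1..n by ascending value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy dataset-level attribution scores for five image regions.
ba = [0.9, 0.7, 0.5, 0.3, 0.1]  # balanced baseline
sa = [0.1, 0.2, 0.3, 0.9, 0.8]  # sensitive attribute predictor
ts = [0.8, 0.3, 0.4, 0.9, 0.5]  # test model under audit

r_ts_ba = spearman(ts, ba)
r_ts_sa = spearman(ts, sa)
r_sa_ba = spearman(sa, ba)
r_ts_sa_given_ba = partial_corr(r_ts_sa, r_ts_ba, r_sa_ba)
```

In this toy case the test model's region ranking agrees more with the sensitive attribute predictor than with the balanced baseline, and the agreement remains positive after controlling for the baseline, which is the kind of signal that would suggest shortcut reliance.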

[169] Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs

Dmitry Demidov,Zaigham Zaheer,Zongyan Han,Omkar Thawakar,Rao Anwer

Main category: cs.CV

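TL;DR: Proposes FiNDR, the first reasoning-augmented large multimodal model framework for vocabulary-free fine-grained image recognition, which achieves state-of-the-art performance by automatically generating, filtering, and verifying candidate labels.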

Details

Motivation: Existing methods are constrained by fixed vocabularies or by complex, fragile pipelines, and handle vocabulary-free fine-grained image recognition poorly. Method: A reasoning-capable large multimodal model generates candidate labels, a vision-language model filters and ranks them into a class set, and a lightweight multimodal classifier is built for inference. Result: Achieves state-of-the-art results on multiple benchmarks with a relative improvement of up to 18.8%, and surpasses zero-shot methods that use predefined labels. Conclusion: Reasoning-augmented LMMs can serve as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. Abstract: Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem are limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available on github.com/demidovd98/FiNDR.

[170] Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models

Mohamad Zamini,Diksha Shukla

Main category: cs.CV

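TL;DR: Delta-LLaVA proposes an efficient visual projector for multimodal large models. Using a low-rank DeltaProjection and lightweight Transformer blocks, it achieves better performance and markedly higher training and inference efficiency with only 144 tokens.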

Details

Motivation: Visual projectors in existing MLLMs handle high-resolution inputs inefficiently and with heavy redundancy, driving up computational cost; a more efficient vision-language alignment mechanism is needed. Method: Proposes Delta-LLaVA, which uses a low-rank DeltaProjection to align multi-level visual features into a compact subspace and adds lightweight Transformer blocks as specialization layers that capture global and local structure under a limited token budget. Result: Consistent gains on multiple benchmarks with only 144 tokens; inference throughput improves by up to 55%, and end-to-end training is 4-5x faster in pretraining and more than 1.5x faster in finetuning. Conclusion: Delta-LLaVA's base-then-specialize design improves token efficiency in multimodal models, maintaining or improving performance while reducing computational cost, and is scalable and practical. Abstract: Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across multiple benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.
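
To show why a low-rank projection is cheaper than a dense one, the sketch below factors a projection into a down map (input width to rank r) followed by an up map (rank r to output width) and compares parameter counts. The abstract only says that DeltaProjection is low-rank; the factorization shown, the dimensions 1024 and 4096, and the rank 64 are illustrative assumptions rather than the paper's configuration.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def low_rank_project(down, up, vec):
    """Apply W = up @ down without ever forming the dense W.

    down has shape (rank, d_in) and up has shape (d_out, rank).
    """
    return matvec(up, matvec(down, vec))


def param_counts(d_in, d_out, rank):
    """Return (dense parameters, low-rank parameters), ignoring biases."""
    return d_in * d_out, rank * (d_in + d_out)


# Tiny worked example: d_in = 3, rank = 1, d_out = 2.
down = [[1.0, 1.0, 0.0]]
up = [[2.0], [3.0]]
projected = low_rank_project(down, up, [1.0, 2.0, 3.0])

# Hypothetical sizes for a vision-to-language projector.
dense, low_rank = param_counts(d_in=1024, d_out=4096, rank=64)
```

With these assumed sizes the factored form needs 327,680 weights instead of 4,194,304, a 12.8-fold reduction, at the price of restricting the projection to a 64-dimensional subspace.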

[171] LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer

Raina Panda,Daniel Fein,Arpita Singhal,Mark Fiore,Maneesh Agrawala,Matyas Bohacek

Main category: cs.CV

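TL;DR: This paper proposes LouvreSAE, a lightweight, interpretable method for artistic style transfer that uses a sparse autoencoder trained on artistic data to extract disentangled style features, enabling efficient style control without fine-tuning or optimization.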

Details

Motivation: Existing style transfer methods rely on fine-tuning, adapters, or prompt engineering, which is computationally expensive and makes style hard to separate from content. Method: Build an art-specific sparse autoencoder (SAE) on the latent space of generative models, learn disentangled stylistic and structural concepts from artistic data, and form decomposable style profiles (steering vectors). Result: On ArtBench10, matches or surpasses existing methods on style transfer (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster, with no fine-tuning, LoRA, or additional inference passes. Conclusion: LouvreSAE provides an efficient, interpretable framework for representing and transferring style, supports direct style steering from only a few reference images, and advances artistic style control in generative models. Abstract: Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining to brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, achieving or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.

[172] Point What You Mean: Visually Grounded Instruction Policy

Hang Yu,Juntu Zhao,Yufeng Liu,Kaiyu Li,Cheng Ma,Di Zhang,Yingdong Hu,Guang Chen,Junyuan Xie,Junliang Guo,Junqiao Zhao,Yang Gao

Main category: cs.CV

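TL;DR: Point-VLA is an enhanced vision-language-action model that adds explicit visual cues (such as bounding boxes) to language instructions to resolve referential ambiguity in cluttered or out-of-distribution scenes, yielding more precise object-level grounding and stronger generalization.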

Details

Motivation: Existing VLA models rely only on text instructions, so their object referring ability is limited in cluttered or out-of-distribution scenes and they struggle to resolve the referred object accurately. Method: Propose Point-VLA, which combines visual cues (such as bounding boxes) with language instructions, and develop an automatic data annotation pipeline to build visually grounded datasets efficiently. Result: On real-world referring tasks, Point-VLA outperforms text-only instruction models in cluttered environments and unseen-object scenarios, showing stronger robustness and generalization. Conclusion: Point-VLA mitigates referential ambiguity through pixel-level visual grounding, improving the precision and generality of embodied control. Abstract: Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.

[173] Symmetrization of 3D Generative Models

Nicolas Caytuiro,Ivan Sipiran

Main category: cs.CV

TL;DR: Proposes a data-driven method that improves the symmetry of 3D generative models by training on half-objects (reflected along the x=0 plane).

Details

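Motivation: Shapes produced by existing 3D generative models often lack geometric symmetry, although many real-world objects exhibit reflectional symmetry, so the symmetry consistency of generated results needs improvement. Method: Analyze reflectional symmetry in real and generated 3D shapes, construct a new dataset containing only half-objects (from the ShapeNet Airplane, Car, and Chair classes), train generative models on it, and complete full shapes by reflection at generation time. Result: Experiments show that models trained on the half-object dataset generate shapes that are more geometrically symmetric and visually plausible, comparing favorably and consistently with results from the original models and data. Conclusion: Restructuring the data rather than modifying the model architecture can effectively improve symmetry in 3D generative models, supporting the potential of data-centric approaches to shape generation. Abstract: We propose a novel data-centric approach to promote symmetry in 3D generative models by modifying the training data rather than the model architecture. Our method begins with an analysis of reflectional symmetry in both real-world 3D shapes and samples generated by state-of-the-art models. We hypothesize that training a generative model exclusively on half-objects, obtained by reflecting one half of the shapes along the x=0 plane, enables the model to learn a rich distribution of partial geometries which, when reflected during generation, yield complete shapes that are both visually plausible and geometrically symmetric. To test this, we construct a new dataset of half-objects from three ShapeNet classes (Airplane, Car, and Chair) and train two generative models. Experiments demonstrate that the generated shapes are symmetrical and consistent, compared with the generated objects from the original model and the original dataset objects.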
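
The half-object idea in this entry can be illustrated with a minimal sketch. It assumes shapes are stored as lists of (x, y, z) points and that the non-negative x side is the half that is kept; both choices are assumptions for illustration, since the abstract does not state the shape representation or which half is used. The function names are hypothetical.

```python
def make_half_object(points):
    """Keep only the points on the non-negative side of the x = 0 plane.

    Assumption: which half is kept is not stated in the abstract.
    """
    return [(x, y, z) for (x, y, z) in points if x >= 0.0]


def reflect_to_full(half_points):
    """Mirror a half-object across x = 0 and merge it with the original half.

    Points lying exactly on the plane are not duplicated.
    """
    mirrored = [(-x, y, z) for (x, y, z) in half_points if x > 0.0]
    return list(half_points) + mirrored


# Toy shape: one point on each side of the plane and one on the plane itself.
shape = [(0.5, 0.0, 1.0), (-0.4, 0.2, 1.0), (0.0, 1.0, 0.0)]
half = make_half_object(shape)
full = reflect_to_full(half)
```

By construction, every point in `full` has its mirror image in `full`, which is the geometric symmetry the entry targets; a generative model would sit between the two functions and produce new half-objects.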

[174] VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion

Zaidao Han,Risa Higashita,Jiang Liu

Main category: cs.CV

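TL;DR: This paper proposes the Visible-Occluded Interactive Completion Network (VOIC), which extracts visible-region labels offline and explicitly separates visible-region perception from occluded-region reasoning, addressing feature dilution and error propagation in monocular 3D semantic scene completion.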

Details

Motivation: With single-image input, existing methods struggle to balance high-confidence visible-region perception against low-confidence occluded-region reasoning, leading to feature interference and error propagation. Method: Propose an offline Visible Region Label Extraction (VRLE) strategy and design VOIC, a dual-decoder framework that handles visible-region semantic perception and occluded-region scene completion separately, building a base 3D voxel representation by fusing image features with depth-derived occupancy. Result: Experiments on SemanticKITTI and SSCBench-KITTI360 show that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy. Conclusion: By decoupling the modeling of visible and occluded regions, VOIC improves monocular 3D semantic scene completion and achieves state-of-the-art performance. Abstract: Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.

[175] DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation

Guandong Li,Yijun Ding

Main category: cs.CV

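TL;DR: DVI is a tuning-free, zero-shot identity customization framework that disentangles semantic and visual feature streams and exploits the statistical properties of the VAE latent space to improve the visual consistency and atmospheric fidelity of generated images.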

Details

Motivation: Existing identity customization methods preserve high facial fidelity but often ignore visual context such as lighting and skin texture, producing a semantic-visual dissonance that looks like a "sticker effect". Method: Propose the DVI framework, which disentangles identity into a fine-grained semantic stream and a coarse-grained visual stream; uses the mean and variance of VAE latents as lightweight descriptors of global visual atmosphere and injects them into semantic embeddings through a parameter-free feature modulation mechanism; and designs a dynamic temporal granularity scheduler that prioritizes visual atmosphere early in the diffusion process and refines semantic details later. Result: Experiments show that DVI significantly improves visual consistency and atmospheric fidelity without parameter fine-tuning, outperforms state-of-the-art methods on IBench, and maintains strong identity preservation. Conclusion: By disentangling and jointly modeling semantic and visual features, DVI mitigates semantic-visual dissonance and offers a more natural, immersive approach to training-free identity customization. Abstract: Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to "Semantic-Visual Dissonance," where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural "sticker-like" effect. We propose DVI (Disentangled Visual-Identity), a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a Parameter-Free Feature Modulation mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference's "visual soul" without training. Furthermore, a Dynamic Temporal Granularity Scheduler aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.

[176] Total Curvature Regularization and its Minimization for Surface and Image Smoothing

Tianle Lu,Ke Chen,Yuping Duan

Main category: cs.CV

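TL;DR: Proposes a new curvature regularization method that penalizes normal curvatures from multiple directions to preserve sharp edges and isotropy, and solves the resulting high-order nonlinear optimization problem with an operator-splitting time discretization.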

Details

Motivation: Conventional curvature regularization methods struggle to preserve sharp edges and achieve isotropic smoothing at the same time, so a more effective regularization strategy is needed. Method: Introduce total normal curvature regularization, recast the problem as finding the steady state of a time-dependent PDE system, and discretize in time with operator splitting, where each subproblem has a closed-form solution or can be solved efficiently. Result: The method requires no complex parameter tuning and is efficient and robust on surface and image smoothing tasks. Conclusion: The proposed method handles the high-order nonlinear optimization problem effectively and achieves high-quality smoothing while preserving geometric detail. Abstract: We introduce a novel formulation for curvature regularization by penalizing normal curvatures from multiple directions. This total normal curvature regularization is capable of producing solutions with sharp edges and precise isotropic properties. To tackle the resulting high-order nonlinear optimization problem, we reformulate it as the task of finding the steady-state solution of a time-dependent partial differential equation (PDE) system. Time discretization is achieved through operator splitting, where each subproblem at the fractional steps either has a closed-form solution or can be efficiently solved using advanced algorithms. Our method circumvents the need for complex parameter tuning and demonstrates robustness to parameter choices. The efficiency and effectiveness of our approach have been rigorously validated in the context of surface and image smoothing problems.

[177] Self-Attention with State-Object Weighted Combination for Compositional Zero Shot Learning

Cheng-Hong Chang,Pei-Hsuan Tsai

Main category: cs.CV

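TL;DR: This paper proposes SASOW, a new method for improving state-object recognition accuracy in compositional zero-shot learning (CZSL). Building on KG-SP, it adds self-attention and a weighted combination of states and objects, improving recognition of unseen compositions.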

Details

Motivation: Existing methods recognize state-object compositions with limited accuracy and ignore the weighting of states and objects, while collecting data for every possible composition is costly, so a method is needed that recognizes novel compositions more accurately. Method: Building on KG-SP, SASOW (1) adds self-attention to the state and object classifiers to improve recognition accuracy and (2) introduces a weighting of states and objects during composition so that the resulting compositions are more reasonable and accurate. Result: On the MIT-States, UT-Zappos, and C-GQA benchmarks, SASOW improves unseen-composition accuracy over the OW-CZSL approach KG-SP by 2.1%, 1.7%, and 0.4%, respectively. Conclusion: With self-attention and a weighted composition strategy, SASOW improves recognition of state-object compositions in CZSL, outperforms existing methods particularly on unseen compositions, and has practical value. Abstract: Object recognition has become prevalent across various industries. However, most existing applications are limited to identifying objects alone, without considering their associated states. The ability to recognize both the state and object simultaneously remains less common. One approach to address this is by treating state and object as a single category during training. However, this approach poses challenges in data collection and training since it requires comprehensive data for all possible combinations. Compositional Zero-shot Learning (CZSL) emerges as a viable solution by treating the state and object as distinct categories during training. CZSL facilitates the identification of novel compositions even in the absence of data for every conceivable combination. The current state-of-the-art method, KG-SP, addresses this issue by training distinct classifiers for states and objects, while leveraging a semantic model to evaluate the plausibility of composed compositions. However, KG-SP's accuracy in state and object recognition can be further improved, and it fails to consider the weighting of states and objects during composition. In this study, we propose SASOW, an enhancement of KG-SP that considers the weighting of states and objects while improving composition recognition accuracy. First, we introduce self-attention mechanisms into the classifiers for states and objects, leading to enhanced accuracy in recognizing both. Additionally, we incorporate the weighting of states and objects during composition to generate more reasonable and accurate compositions. We validate SASOW on three established benchmark datasets. Experimental results show that, compared with the OW-CZSL approach KG-SP, SASOW improves accuracy for unseen compositions by 2.1%, 1.7%, and 0.4% on the MIT-States, UT Zappos, and C-GQA datasets, respectively.

[178] ICP-4D: Bridging Iterative Closest Point and LiDAR Panoptic Segmentation

Gyeongrok Oh,Youngdong Jang,Jonghyun Choi,Suk-Ju Kang,Guang Lin,Sangpil Kim

Main category: cs.CV

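TL;DR: This paper proposes ICP-4D, a training-free framework for 4D LiDAR panoptic segmentation that uses geometric relations to associate instances consistently across space and time.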

Details

Motivation: Existing methods are computationally expensive and ignore the geometric priors in point clouds. Method: Use the Iterative Closest Point (ICP) algorithm to align instance-level point sets, and introduce Sinkhorn-based soft matching to improve stability under noise. Result: Outperforms existing methods on SemanticKITTI and panoptic nuScenes, with efficient computation and occlusion-aware matching. Conclusion: ICP-4D achieves high-performance 4D LiDAR panoptic segmentation without training, using geometric structure to improve instance association accuracy. Abstract: Dominant paradigms for 4D LiDAR panoptic segmentation are usually required to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types (static, dynamic, and missing), offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.
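
To make the "Sinkhorn-based soft matching" in this entry concrete, here is a minimal sketch of generic entropy-regularized Sinkhorn normalization. It is not the authors' implementation: the cost matrix, the uniform marginals, and the `epsilon` and `n_iters` values are all assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, epsilon=0.1, n_iters=50):
    """Return a soft assignment (transport plan) for a cost matrix.

    Rows and columns are scaled alternately so that the plan has
    approximately uniform marginals (1/n per row, 1/m per column).
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost gives high affinity.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row scaling: make each row of the plan sum to 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        # Column scaling: make each column of the plan sum to 1/m.
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy cost between two source points and two target points
# (for example, squared distances); the diagonal pairs are the cheap ones.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost)
```

In an ICP-style loop, a plan like this could replace hard nearest-neighbour correspondences with weighted ones before the rigid transform is estimated. How ICP-4D combines the two steps is not specified in the abstract.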

[179] Towards AI-Guided Open-World Ecological Taxonomic Classification

Cheng Yaw Low,Heejoon Koo,Jaewoo Park,Kaleb Mesfin Asfaw,Meeyoung Cha

Main category: cs.CV

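TL;DR: This paper proposes TaxoNet, a unified framework that addresses long-tailed distributions, fine-grained differences, and open-world recognition in ecological classification, and shows stronger classification of rare species across multiple ecological datasets.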

Details

Motivation: Existing ecological classification methods are limited by class imbalance, fine-grained variation, test-time domain shift, and closed-set assumptions that recognize only known classes, so they fall short of open-world needs in real environments. Method: Propose TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens the learning signal from rare classes and suppresses the dominance of common ones to address these joint challenges. Result: On several ecological datasets, including Google Auto-Arborist, iNat-Plantae, and NAFlora-Mini, TaxoNet outperforms baselines in both overall and rare-species classification. Conclusion: TaxoNet offers an effective solution for open-world plant classification, supports more reliable biodiversity monitoring and conservation planning, and reveals the limitations of general-purpose multimodal foundation models in the plant domain. Abstract: AI-guided classification of ecological families, genera, and species underpins global sustainability efforts such as biodiversity monitoring, conservation planning, and policy-making. Progress toward this goal is hindered by long-tailed taxonomic distributions from class imbalance, along with fine-grained taxonomic variations, test-time spatiotemporal domain shifts, and closed-set assumptions that can only recognize previously seen taxa. We introduce the Open-World Ecological Taxonomy Classification, a unified framework that captures the co-occurrence of these challenges in realistic ecological settings. To address them, we propose TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens learning signals from rare underrepresented taxa while mitigating the dominance of overrepresented ones, directly confronting interrelated challenges. We evaluate our method on diverse ecological domains: Google Auto-Arborist (urban trees), iNat-Plantae (Plantae observations from various ecosystems in iNaturalist-2019), and NAFlora-Mini (a curated herbarium collection). Our model consistently outperforms baselines, particularly for rare taxa, establishing a strong foundation for open-world plant taxonomic monitoring. Our findings further show that general-purpose multimodal foundation models remain constrained in plant-domain applications.

[180] CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization

Zelin Zhao,Xinyu Gong,Bangya Liu,Ziyang Song,Jun Zhang,Suhui Wu,Yongxin Chen,Hao Zhang

Main category: cs.CV

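TL;DR: Proposes CETCAM, a camera-controllable video generation framework that needs no camera annotations and uses geometry-aware tokens to achieve high geometric consistency and visual quality.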

Details

Motivation: Existing methods depend on camera pose annotations that are hard to scale and inconsistent with depth estimation, causing train-test discrepancies. Method: Use geometry foundation models (such as VGGT) to estimate depth and camera parameters, convert them into unified geometry-aware tokens, and integrate them into a pretrained video diffusion model through lightweight context blocks, with a two-stage progressive training strategy. Result: Achieves state-of-the-art geometric consistency, temporal stability, and visual realism on multiple benchmarks, and adapts well to other control modalities such as inpainting and layout control. Conclusion: CETCAM addresses annotation dependence and geometric inconsistency in camera-controlled video generation and is extensible and flexible enough for a range of control tasks. Abstract: Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at https://sjtuytc.github.io/CETCam_project_page.github.io/.

Sihao Lin,Zerui Li,Xunyi Zhao,Gengze Zhou,Liuyi Wang,Rong Wei,Rui Tang,Juncheng Li,Hanqing Wang,Jiangmiao Pang,Anton van den Hengel,Jiajun Liu,Qi Wu

Main category: cs.CV

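TL;DR: VLNVerse is a large-scale, extensible vision-language navigation (VLN) benchmark that addresses the limitations of existing datasets in simulation realism, task fragmentation, and data scale, advancing research on embodied agents that generalize to the real world.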

Details

Motivation: Existing VLN benchmarks are limited by small-scale datasets, simplified physical simulation, and task fragmentation; they cannot support pretraining of modern large models and lack the means to evaluate real-world generalization. Method: Propose VLNVerse, a large-scale benchmark framework that is versatile, embodied, and built on a realistic physics engine, unifying multiple tasks and providing an extensible toolkit; also develop a novel unified multi-task model that handles all tasks. Result: The scale and diversity of VLNVerse are used to evaluate existing methods comprehensively, validating its ability to promote model generalization and sim-to-real transfer. Conclusion: VLNVerse redefines VLN as an end-to-end, scalable embodied AI problem, narrows the gap between simulation and reality, and provides an important foundation for building general-purpose embodied locomotion agents. Abstract: Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified/shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible, teleporting "ghost" agents to agents that support full kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.

[182] Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection

Haoze Li,Jie Zhang,Guoying Zhao,Stephen Lin,Shiguang Shan

Main category: cs.CV

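TL;DR: This paper proposes SVLP-IL, a rehearsal-free incremental learning framework for face presentation attack detection built on vision-language pre-trained models, which uses multi-aspect prompting and selective elastic weight consolidation to mitigate catastrophic forgetting and adapt to new domains while maintaining high performance.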

Details

Motivation: To keep pace with evolving spoofing tactics and domain shifts while complying with privacy regulations that prohibit retaining past data, incremental learning must be performed without rehearsal to sustain effective face presentation attack detection. Method: Propose the SVLP-IL framework, which combines Multi-Aspect Prompting (MAP) and Selective Elastic Weight Consolidation (SEWC) and uses the prompt-tuning capability of vision-language pre-trained models to adapt quickly to new spoofing types and domains without replaying old data, balancing stability and plasticity. Result: Experiments on multiple face presentation attack detection benchmarks show that SVLP-IL significantly reduces catastrophic forgetting, performs better on unseen domains, and improves the robustness of cross-domain detection. Conclusion: SVLP-IL is a privacy-compliant, practical, and efficient lifelong learning solution for face presentation attack detection in complex, dynamic environments. Abstract: Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose SVLP-IL, a VLP-based RF-IL framework that balances stability and plasticity via Multi-Aspect Prompting (MAP) and Selective Elastic Weight Consolidation (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.
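
As a rough illustration of what "selectively preserving critical weights" could look like, the sketch below computes the standard quadratic elastic weight consolidation penalty and applies it only to weights whose importance passes a threshold. The thresholding rule, the `threshold` and `lam` values, and the function name are hypothetical; the abstract does not describe how SEWC selects weights.

```python
def selective_ewc_penalty(params, old_params, fisher, lam=1.0, threshold=0.5):
    """Quadratic EWC-style penalty restricted to 'important' weights.

    params      current weights (flat list of floats)
    old_params  weights saved after the previous task
    fisher      per-weight importance estimates (e.g. diagonal Fisher values)
    threshold   hypothetical cut-off: weights below it are left free to change
    """
    penalty = 0.0
    for p, p_old, f in zip(params, old_params, fisher):
        if f >= threshold:  # assumed selection rule, not from the paper
            penalty += f * (p - p_old) ** 2
    return 0.5 * lam * penalty


# Two weights moved away from their old values; only the first is "important".
penalty = selective_ewc_penalty(
    params=[1.0, 2.0], old_params=[0.0, 0.0], fisher=[1.0, 0.1]
)
```

Added to the loss for a new task, a term like this pulls important weights back toward their previous values while leaving the remaining weights free to adapt, which is one way to trade stability against plasticity.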

[183] Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation

Connor Kilrain,David Carlyn,Julia Chae,Sara Beery,Wei-Lun Chao,Jianyang Gu

Main category: cs.CV

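TL;DR: This paper proposes Finer-Personalization Rank, an evaluation protocol for identity preservation in personalized generative models that uses retrieval-based ranking to measure the retention of identity details more accurately at multiple granularities.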

Details

Motivation: Existing evaluation metrics for generative models focus on semantic similarity and neglect the preservation of fine-grained, identity-related features, so they cannot accurately assess identity preservation in personalized generation. Method: Propose Finer-Personalization Rank, which uses each generated image as a query against an identity-labeled gallery of real images, evaluates identity preservation with retrieval metrics (such as mean average precision), and is validated on multiple fine-grained and instance-level datasets. Result: On CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank reflects identity retention better than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. Conclusion: The gallery-based ranking protocol offers a more principled and practical way to evaluate identity preservation in personalized generation than conventional pairwise similarity measures. Abstract: The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one's pet), we expect the generated image to retain precise details attached to the subject's identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities, from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.
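
The retrieval view in this entry can be sketched with a plain mean average precision computation. The sketch assumes the gallery has already been ranked by similarity to each generated query image; the feature extractor and similarity function are outside its scope, and the labels are made up.

```python
def average_precision(ranked_labels, query_label):
    """Average precision of one ranked gallery for one query identity."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """rankings: list of (ranked_gallery_labels, query_label) pairs."""
    scores = [average_precision(labels, query) for labels, query in rankings]
    return sum(scores) / len(scores)


# A query generated for identity "a": gallery hits at ranks 1 and 3.
ap = average_precision(["a", "b", "a"], "a")
m_ap = mean_average_precision([(["a", "b", "a"], "a"), (["b", "a"], "a")])
```

A generated image that keeps identity-specific details should pull gallery images of the same identity to the top of the ranking and therefore score a higher average precision.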

[184] Automatic Neuronal Activity Segmentation in Fast Four Dimensional Spatio-Temporal Fluorescence Imaging using Bayesian Approach

Ran Li,Pan Xiao,Kaushik Dutta,Youdong Guo

Main category: cs.CV

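TL;DR: Proposes a Bayesian deep learning method for automatically detecting neuronal activity in 4D spatio-temporal data from light sheet microscopy, combining spatial and temporal information and modeling uncertainty to detect neuronal activity with high reproducibility and accuracy.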

Details

Motivation: Manual segmentation of neuronal activity in fluorescence microscopy calcium imaging is time-consuming, labor-intensive, and generalizes poorly, so an automatic and precise detection method is needed. Method: Use a Bayesian deep learning framework that combines pixel-wise correlation maps (temporal information) with the mean summary image (spatial information) to detect neuronal activity, producing probabilistic segmentation maps and modeling uncertainty. Result: A mean Dice score of 0.81 relative to synthetic ground truth generated by Otsu's method, and a mean Dice score of 0.79 between two runs, indicating good accuracy and reproducibility. Conclusion: The method enables fast, reliable detection of active neurons in behavioral studies and has good application prospects. Abstract: Fluorescence Microscopy Calcium Imaging is a fundamental tool for recording and analyzing large-scale neuronal activity in vivo, simultaneously and at single-cell resolution. Automatic and precise detection of behaviorally relevant neuron activity from the recordings is critical to study the mapping of brain activity in organisms. However, a persistent bottleneck is manual segmentation, which is time- and labor-intensive and lacks generalizability. To this end, we present a Bayesian Deep Learning Framework to detect neuronal activities in 4D spatio-temporal data obtained by light sheet microscopy. Our approach uses temporal information by calculating pixel-wise correlation maps and combines it with spatial information given by the mean summary image. The Bayesian framework not only produces probability segmentation maps but also models the uncertainty pertaining to active neuron detection. To evaluate the accuracy of our framework, we implemented a test of reproducibility to assess how well the network generalizes in detecting neuron activity. The network achieved a mean Dice Score of 0.81 relative to the synthetic Ground Truth obtained by Otsu's method and a mean Dice Score of 0.79 between the first and second run for the test of reproducibility. Once deployed, our method can be used for rapid detection of neuronal activity in behavioural studies.
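
For readers unfamiliar with the Dice score reported in this entry, a minimal sketch of the metric on binary masks follows. The masks are toy data, and treating two empty masks as a perfect match is a convention assumed here rather than something stated in the abstract.

```python
def dice_score(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary 2D masks."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
score = dice_score(predicted, reference)
```

The same function covers both uses in the entry: comparing a prediction with the Otsu-derived reference mask, and comparing the outputs of two runs for the reproducibility test.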

[185] Distinguishing Visually Similar Actions: Prompt-Guided Semantic Prototype Modulation for Few-Shot Action Recognition

Xiaoyang Li,Mingming Lu,Ruiqi Wang,Hao Li,Zewei Le

Main category: cs.CV

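TL;DR: This paper proposes the CLIP-SPM framework to address temporal modeling, confusion between visually similar actions, and the visual-textual modality gap in few-shot action recognition, improving performance through three modules (HSMR, SPM, and PADM) and achieving strong results on multiple benchmarks.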

Details

Motivation: Few-shot action recognition faces data scarcity, interference from static backgrounds, small visual differences between actions, and inconsistency between visual and textual modalities, so models need a better ability to capture dynamic features and align modalities. Method: Propose the CLIP-SPM framework, comprising (1) an HSMR module that coordinates deep and shallow motion features to suppress background interference; (2) an SPM strategy that generates query-relevant text prompts to fuse visual and textual information; and (3) a PADM method that improves consistency between support and query samples through prototype-anchor dual modulation. Result: On multiple benchmarks including Kinetics, SSv2, UCF101, and HMDB51, CLIP-SPM achieves competitive performance under 1-shot, 3-shot, and 5-shot settings, and ablation studies confirm the effectiveness of each component. Conclusion: By improving temporal modeling, strengthening semantic discriminability, and narrowing the modality gap, CLIP-SPM improves few-shot action recognition and shows strong generalization and application potential. Abstract: Few-shot action recognition aims to enable models to quickly learn new action categories from limited labeled samples, addressing the challenge of data scarcity in real-world applications. Current research primarily addresses three core challenges: (1) temporal modeling, where models are prone to interference from irrelevant static background information and struggle to capture the essence of dynamic action features; (2) visual similarity, where categories with subtle visual differences are difficult to distinguish; and (3) the modality gap between visual-textual support prototypes and visual-only queries, which complicates alignment within a shared embedding space. To address these challenges, this paper proposes a CLIP-SPM framework, which includes three components: (1) the Hierarchical Synergistic Motion Refinement (HSMR) module, which aligns deep and shallow motion features to improve temporal modeling by reducing static background interference; (2) the Semantic Prototype Modulation (SPM) strategy, which generates query-relevant text prompts to bridge the modality gap and integrates them with visual features, enhancing the discriminability between similar actions; and (3) the Prototype-Anchor Dual Modulation (PADM) method, which refines support prototypes and aligns query features with a global semantic anchor, improving consistency across support and query samples. Comprehensive experiments across standard benchmarks, including Kinetics, SSv2-Full, SSv2-Small, UCF101, and HMDB51, demonstrate that our CLIP-SPM achieves competitive performance under 1-shot, 3-shot, and 5-shot settings. Extensive ablation studies and visual analyses further validate the effectiveness of each component and its contributions to addressing the core challenges. The source code and models are publicly available at GitHub.

[186] WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Utae Jeong,Sumin In,Hyunju Ryu,Jaewan Choi,Feng Yang,Jongheon Jeong,Seungryong Kim,Sangpil Kim

Main category: cs.CV

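TL;DR: This paper proposes WaTeRFlow, a framework designed for robust watermark recovery under image-to-video (I2V) conversion, which improves the detection accuracy and robustness of cross-modal watermarks through the FUSE training mechanism, optical-flow warping, and a temporal consistency loss.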

Details

Motivation: Existing image watermarking schemes lose substantial detection performance under advanced image-to-video (I2V) generation, particularly given stronger inter-frame continuity and generative editing, and effective solutions for cross-modal watermark robustness are lacking. Method: Propose the WaTeRFlow framework, comprising (i) FUSE, a training mechanism that combines instruction-driven edits with a fast video diffusion proxy to simulate realistic distortions; (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame watermark predictions; and (iii) a semantic preservation loss that keeps the conditioning signal intact. Result: Experiments on several representative I2V models show that WaTeRFlow achieves higher first-frame and per-frame watermark bit accuracy and retains strong recovery when various distortions are applied before or after generation. Conclusion: WaTeRFlow fills the robustness gap of image watermarking in cross-modal I2V scenarios and provides reliable technical support for future content provenance and authenticity verification. Abstract: Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
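
The first-frame and per-frame bit accuracy mentioned in this entry can be sketched as follows. The sketch assumes a watermark decoder has already produced one bit string per frame; the decoder itself, the message length, and the toy bit values are assumptions for illustration.

```python
def bit_accuracy(decoded_bits, true_bits):
    """Fraction of watermark bits recovered correctly from one frame."""
    matches = sum(1 for d, t in zip(decoded_bits, true_bits) if d == t)
    return matches / len(true_bits)


def per_frame_bit_accuracy(decoded_frames, true_bits):
    """Bit accuracy for every frame of a generated video."""
    return [bit_accuracy(bits, true_bits) for bits in decoded_frames]


message = [1, 0, 1, 1]                       # embedded watermark (toy example)
decoded = [[1, 0, 1, 1], [1, 1, 1, 0]]       # decoder output for two frames
accuracies = per_frame_bit_accuracy(decoded, message)
first_frame_accuracy = accuracies[0]
mean_accuracy = sum(accuracies) / len(accuracies)
```

Reporting the first frame separately is natural here because the first frame is closest to the watermarked input image, while later frames are synthesized by the I2V model.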

[187] Decoupled Generative Modeling for Human-Object Interaction Synthesis

Hwanhee Jung,Seunggwan Lee,Jeongyoon Yoon,SeungHyeon Kim,Giljoo Nam,Qixing Huang,Sangpil Kim

Main category: cs.CV

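TL;DR: Proposes DecHOI, a framework that decouples path planning from action synthesis to improve the realism and consistency of generated human-object interactions.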

Details

Motivation: Existing methods rely on manually specified intermediate waypoints and couple all objectives together, causing unsynchronized motion or penetration and limiting flexibility and realism. Method: Adopt a decoupled design in which a trajectory generator produces human and object trajectories and an action generator synthesizes detailed motions conditioned on them; adversarial training focused on distal-joint dynamics improves contact realism. Result: Outperforms prior methods on the FullBodyManipulation and 3D-FUTURE benchmarks across quantitative metrics, qualitative evaluation, and perceptual studies. Conclusion: Through decoupled modeling and adversarial training, DecHOI generates more realistic, flexible, and consistent human-object interaction sequences and supports responsive long-sequence planning in dynamic scenes. Abstract: Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.

[188] 6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

Jihui Guo,Zongmin Zhang,Zhen Sun,Yuhao Yang,Jinlin Wu,Fu Zhang,Xinlei He

Main category: cs.CV

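TL;DR: Proposes 6DAttack, a framework that uses 3D object triggers to induce controlled errors in 6DoF pose estimation while preserving normal performance; experiments show high attack success rates across multiple models and datasets, and a representative existing defense is ineffective.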

Details

Motivation: 6DoF pose estimation is widely used in robotics, AR/VR, and related fields but is exposed to backdoor attacks. Because an attack must control continuous parameters such as rotation and translation, traditional 2D backdoor methods do not apply, so attacks specific to the 6DoF setting need to be studied. Method: Proposes 6DAttack, which uses 3D objects as triggers and optimizes trigger shape and placement to make the model output erroneous 6DoF poses while keeping normal inference on clean samples. Result: On PVNet, DenseFusion, and PoseDiffusion with LINEMOD, YCB-Video, and CO3D, the attack reaches up to 100% attack success rate (ASR), backdoored models still reach up to 100% ADD accuracy on clean samples, and triggered samples reach 97.70% ADD-P; a representative existing defense is ineffective. Conclusion: 6DAttack exposes a serious and underexplored security vulnerability in 6DoF pose estimation, showing that current models are highly susceptible to such backdoors and that corresponding defenses are needed. Abstract: Deep learning advances have enabled accurate six-degree-of-freedom (6DoF) object pose estimation, widely used in robotics, AR/VR, and autonomous systems. However, backdoor attacks pose significant security risks. While most research focuses on 2D vision, 6DoF pose estimation remains largely unexplored. Unlike traditional backdoors that only change classes, 6DoF attacks must control continuous parameters like translation and rotation, rendering 2D methods inapplicable. We propose 6DAttack, a framework using 3D object triggers to induce controlled erroneous poses while maintaining normal behavior. Evaluations on PVNet, DenseFusion, and PoseDiffusion across LINEMOD, YCB-Video, and CO3D show high attack success rates (ASRs) without compromising clean performance. Backdoored models achieve up to 100% clean ADD accuracy and 100% ASR, with triggered samples reaching 97.70% ADD-P. Furthermore, a representative defense remains ineffective. Our findings reveal a serious, underexplored threat to 6DoF pose estimation.

[189] Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation

Khanh Nguyen,Dasith de Silva Edirimuni,Ghulam Mubashar Hassan,Ajmal Mian

Main category: cs.CV

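TL;DR: Proposes a method that generates 3D instance masks for novel objects from RGB images under the guidance of a 2D open-vocabulary detector. It inherits the 2D detector's ability to recognize novel categories while keeping classification efficient, enabling fast and accurate retrieval of rare objects.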

Details

Motivation: Earlier methods depend on SAM and CLIP and incur heavy computational cost; Open-YOLO 3D removes that dependence and speeds up inference, but generalizes poorly to rare or unseen object categories. A method is needed that is both efficient and able to recognize open-vocabulary objects. Method: Uses a pretrained 3D segmenter to produce class-agnostic 3D instance masks directly from the point cloud and a real-time 2D open-vocabulary detector on the RGB images; the 2D detections guide the generation and classification of 3D masks so that novel-category objects can be recognized and localized. Result: The method keeps inference fast while clearly improving retrieval of rare and novel-category objects, addressing the generalization weakness of Open-YOLO 3D. Conclusion: The method balances efficiency with open-vocabulary recognition and can retrieve a wide range of objects, including rare ones, efficiently and accurately in real scenes, making it suitable for robotics and augmented reality. Abstract: Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector's ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open-ended text queries. Our code will be made available at https://github.com/ndkhanh360/BoxOVIS.

[190] Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges

Ariel Lubonja,Pedro R. A. S. Bassi,Wenxuan Li,Hualin Qiao,Randal Burns,Alan L. Yuille,Zongwei Zhou

Main category: cs.CV

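TL;DR: RankInsight is an open-source toolkit that addresses the limitations of current medical AI leaderboards in statistical significance, clinical relevance, and fairness. It adds pairwise significance analysis, organ-appropriate evaluation metrics, and intersectional fairness audits to make leaderboards more rigorous and fair.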

Details

Motivation: Current medical AI challenge leaderboards have three main problems: score gaps are not tested for statistical significance, so rankings are unstable; a single averaged metric hides critical boundary errors; and performance differences across demographic groups are ignored, so fairness problems go unnoticed. Method: Proposes the RankInsight toolkit with three core components: (1) pairwise significance maps between models; (2) evaluation metrics matched to organ characteristics (for example, NSD instead of Dice); and (3) fairness audits over the intersection of gender and race. Result: The nnU-Net family is statistically significantly better than Vision-Language and MONAI models; when tubular structures are evaluated with NSD, the order of the top four models reverses; and more than half of the MONAI-based models show the largest performance gap across gender-race combinations. Conclusion: RankInsight produces medical AI leaderboards that are more statistically reliable, clinically meaningful, and fair; it applies to past, ongoing, and future challenges and promotes standardized, transparent evaluation of medical AI. Abstract: Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.
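
To make the idea of a pairwise significance map concrete, here is a minimal sketch. The abstract does not say which statistical test RankInsight uses; the paired sign-flip permutation test below is one common choice and is an assumption, as are the function names `paired_permutation_pvalue` and `significance_map`. It is not the toolkit's API.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired sign-flip permutation test on per-case score differences.

    Assumption: both lists hold per-case scores (e.g. Dice) for the SAME cases
    in the same order. This test is an illustrative choice, not RankInsight's.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs the same cases for both models")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


def significance_map(per_case_scores, alpha=0.05):
    """Map each unordered model pair to (p_value, is_significant).

    per_case_scores: dict of model name -> list of per-case scores.
    No multiple-comparison correction is applied in this sketch.
    """
    names = sorted(per_case_scores)
    result = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            p = paired_permutation_pvalue(per_case_scores[a], per_case_scores[b])
            result[(a, b)] = (p, p < alpha)
    return result
```

With many submissions the number of pairs grows quadratically, so a real leaderboard audit would likely add a multiple-comparison correction on top of this.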

[191] Mamba-Based Modality Disentanglement Network for Multi-Contrast MRI Reconstruction

Weiyi Lyu,Xinming Fang,Jun Wang,Jun Shi,Guixu Zhang,Juncheng Li

Main category: cs.CV

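TL;DR: Proposes MambaMDN, a dual-domain framework for multi-contrast MRI reconstruction. It completes undersampled data with fully sampled K-space data and combines a Mamba-based modality disentanglement network with iterative refinement, reducing artifacts and interference from irrelevant information and markedly improving reconstruction quality.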

Details

Motivation: Existing accelerated MRI techniques make poor use of K-space prior information, and multi-contrast fusion strategies tend to introduce irrelevant information, leaving artifacts or degrading the reconstructed image. Method: Proposes the MambaMDN framework: fully sampled reference K-space data first completes the undersampled target data, producing structurally aligned but modality-mixed inputs; a Mamba-based modality disentanglement network then separates and removes reference-specific features; and an iterative refinement mechanism repeats feature purification to improve reconstruction accuracy. Result: Across multiple experiments, MambaMDN significantly outperforms existing multi-contrast MRI reconstruction methods, suppressing artifacts more effectively and preserving detail. Conclusion: By combining K-space priors with a dual-domain disentanglement strategy, MambaMDN achieves high-quality multi-contrast MRI reconstruction and supports fast clinical imaging. Abstract: Magnetic resonance imaging (MRI) is a cornerstone of modern clinical diagnosis, offering unparalleled soft-tissue contrast without ionizing radiation. However, prolonged scan times remain a major barrier to patient throughput and comfort. Existing accelerated MRI techniques often struggle with two key challenges: (1) failure to effectively utilize inherent K-space prior information, leading to persistent aliasing artifacts from zero-filled inputs; and (2) contamination of target reconstruction quality by irrelevant information when employing multi-contrast fusion strategies. To overcome these challenges, we present MambaMDN, a dual-domain framework for multi-contrast MRI reconstruction. Our approach first employs fully-sampled reference K-space data to complete the undersampled target data, generating structurally aligned but modality-mixed inputs. Subsequently, we develop a Mamba-based modality disentanglement network to extract and remove reference-specific features from the mixed representation. Furthermore, we introduce an iterative refinement mechanism to progressively enhance reconstruction accuracy through repeated feature purification. Extensive experiments demonstrate that MambaMDN can significantly outperform existing multi-contrast reconstruction methods.
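
The K-space completion step described above can be pictured with a small sketch. It assumes, purely for illustration, that completion is a direct fill: acquired target samples are kept and every unacquired position is copied from the reference contrast. The abstract does not give the exact rule, and `complete_kspace` is a hypothetical name.

```python
def complete_kspace(target_k, reference_k, mask):
    """Fill unsampled target K-space positions with reference K-space values.

    target_k, reference_k: 2D lists of complex K-space samples, same shape.
    mask: 2D list, truthy where the target sample was actually acquired.
    Assumption: a plain position-wise fill; the paper's rule may differ.
    """
    if not (len(target_k) == len(reference_k) == len(mask)):
        raise ValueError("target, reference and mask must have the same shape")
    completed = []
    for t_row, r_row, m_row in zip(target_k, reference_k, mask):
        # Keep measured target data; borrow the reference value elsewhere.
        completed.append([t if m else r for t, r, m in zip(t_row, r_row, m_row)])
    return completed
```

The result mixes two contrasts in one K-space, which is why the abstract describes the input as structurally aligned but modality-mixed and follows it with a disentanglement network.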

[192] GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting

Tiantian Li,Xinjie Zhang,Xingtong Ge,Tongda Xu,Dailan He,Jun Zhang,Yan Wang

Main category: cs.CV

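TL;DR: GaussianImage++ is an image representation method based on 2D Gaussian Splatting. With a distortion-driven densification mechanism, context-aware Gaussian filters, and learnable scalar quantizers, it uses fewer Gaussian primitives while outperforming existing INR and GS methods in representation and compression.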

Details

Motivation: Implicit neural representations (INRs) are effective but slow to train and memory-hungry, while existing 2D Gaussian Splatting (GS) methods such as GaussianImage need many Gaussian primitives to maintain visual quality and are therefore inefficient. A more efficient, low-memory method that keeps high quality is needed for image representation and compression. Method: Proposes GaussianImage++: (1) a distortion-driven densification mechanism that allocates Gaussian primitives adaptively according to signal intensity; (2) context-aware Gaussian filters that optimize primitives for varying image content; and (3) attribute-separated learnable scalar quantizers combined with quantization-aware training for efficient attribute compression. Result: Experiments show that GaussianImage++ outperforms GaussianImage and the INR-based COIN in representation and compression performance while supporting real-time decoding and low memory usage. Conclusion: With its densification strategy and compression mechanism, GaussianImage++ substantially improves GS-based image representation and compression while remaining efficient and light on resources, offering a new direction for efficient image coding. Abstract: Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (e.g., GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes limited Gaussian primitives to achieve impressive representation and compression performance. Firstly, we introduce a distortion-driven densification mechanism. It progressively allocates Gaussian primitives according to signal intensity. Secondly, we employ context-aware Gaussian filters for each primitive, which assist in the densification to optimize Gaussian primitives based on varying image content. Thirdly, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and INRs-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage.

[193] Trifocal Tensor and Relative Pose Estimation with Known Vertical Direction

Tao Li,Zhenbao Yu,Banglei Guan,Jianli Han,Weimin Lv,Friedrich Fraundorfer

Main category: cs.CV

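TL;DR: Proposes two new solvers for estimating the relative poses among views with known vertical directions. Using the vertical direction obtained from an IMU, only two rotation angles and two translation vectors remain to be solved. One solver is a linear closed-form solution that needs four point correspondences in three views; the other is a minimal solution with three point correspondences that uses a recent Gröbner basis solver. Because few correspondences are required, the solvers run efficiently inside RANSAC for outlier removal and pose estimation in visual odometry, and they are more accurate than alternative methods on synthetic data and real KITTI scenes.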

Details

Motivation: To improve the efficiency and accuracy of multi-view relative pose estimation when the vertical direction is known, and to keep the estimate robust in the presence of outliers, the paper aims to design solvers that need fewer point correspondences. Method: Proposes a linear closed-form solver (four point correspondences) and a Gröbner-basis minimal solver (three point correspondences); both use the known vertical direction to simplify the pose problem and are integrated into a RANSAC framework for robustness. Result: Experiments on synthetic data and the real-world KITTI dataset show that the proposed solvers estimate poses more accurately than existing alternatives and are computationally more efficient because they need fewer correspondences. Conclusion: The two relative pose solvers perform well in both accuracy and efficiency and are particularly suited to visual odometry on IMU-equipped devices. Abstract: This work presents two novel solvers for estimating the relative poses among views with known vertical directions. The vertical directions of camera views can be easily obtained using inertial measurement units (IMUs) which have been widely used in autonomous vehicles, mobile phones, and unmanned aerial vehicles (UAVs). Given the known vertical directions, our algorithms only need to solve for two rotation angles and two translation vectors. In this paper, a linear closed-form solution has been described, requiring only four point correspondences in three views. We also propose a minimal solution with three point correspondences using the latest Gröbner basis solver. Since the proposed methods require fewer point correspondences, they can be efficiently applied within the RANSAC framework for outliers removal and pose estimation in visual odometry. The proposed method has been tested on both synthetic data and real-world scenes from KITTI. The experimental results show that the accuracy of the estimated poses is superior to other alternative methods.
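
The reduction to "two rotation angles and two translation vectors" can be written out as a short worked parameterization. This is a sketch under an assumed convention, not the paper's notation: each view is first rotated so that its y-axis matches the measured vertical direction, and the symbols $\theta_i$, $\mathbf{t}_i$, and $P_i$ are introduced here for illustration.

```latex
% After gravity alignment, the rotation of view i relative to view 1
% is a rotation about the (assumed) vertical y-axis only:
R_y(\theta_i) =
\begin{pmatrix}
 \cos\theta_i & 0 & \sin\theta_i \\
 0            & 1 & 0            \\
 -\sin\theta_i & 0 & \cos\theta_i
\end{pmatrix},
\qquad i \in \{2, 3\}.

% Camera matrices of the three aligned views (calibrated coordinates):
P_1 = [\, I \mid \mathbf{0} \,], \qquad
P_2 = [\, R_y(\theta_2) \mid \mathbf{t}_2 \,], \qquad
P_3 = [\, R_y(\theta_3) \mid \mathbf{t}_3 \,].

% Unknowns: \theta_2, \theta_3 (2 parameters) and \mathbf{t}_2, \mathbf{t}_3 (6 parameters).
% The overall scale of the translations is unobservable from images alone,
% which leaves 2 + 6 - 1 = 7 degrees of freedom.
```

Without the vertical prior each relative rotation has three degrees of freedom instead of one, which is why knowing the vertical direction lowers the number of point correspondences a solver needs.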

[194] Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Hengyi Feng,Zeang Sheng,Meiyi Qiang,Wentao Zhang

Main category: cs.CV

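TL;DR: Investigates why multimodal large language models (MLLMs) perform poorly on zero-shot multimodal retrieval. Their representation space is dominated by textual semantics, with visual information making up only a small share, and generation-oriented training weakens the discriminative power retrieval needs. An analysis with sparse autoencoders shows that the features contributing most to similarity computation are in fact distractors.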

Details

Motivation: To find out why MLLMs succeed at generative tasks yet perform poorly at zero-shot multimodal retrieval, and to understand how their internal representations limit retrieval performance. Method: Uses sparse autoencoders (SAEs) to decompose MLLM output representations into interpretable semantic concepts, then analyzes how textual and visual information are distributed in the representation space and how they affect similarity computation. Result: The MLLM representation space is dominated by textual semantics, and visual information contributes little; generation-oriented training homogenizes embeddings and weakens discriminative power; and some features that dominate similarity computation are distractors that lower retrieval performance. Conclusion: The work offers the first in-depth interpretability analysis of MLLM representations for multimodal retrieval, identifies shortcomings of current model design for retrieval, and suggests possible directions for improving MLLM retrieval ability. Abstract: Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; the visual information essential for multimodal retrieval only constitutes a small portion. This imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance. Overall, our work provides the first in-depth interpretability analysis of MLLM representations in the context of multimodal retrieval and offers possible directions for enhancing the multimodal retrieval capabilities of MLLMs.

[195] AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction

Ruikai Li,Xinrun Li,Mengwei Xie,Hao Shan,Shoumeng Qiu,Xinyuan Chang,Yizhe Fan,Feng Xiong,Han Jiang,Yilong Ren,Haiyang Yu,Mu Xu,Yang Long,Varun Ojha,Zhiyong Cui

Main category: cs.CV

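TL;DR: Proposes AMap, an ahead-aware framework for online HD map construction. A "distill-from-future" paradigm gives the student model look-ahead capability, markedly improving perception accuracy in the region ahead at no extra inference cost.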

Details

Motivation: Existing methods based on historical temporal fusion are spatially backward-looking and bring little improvement for the unseen road ahead, yet perception errors ahead lead directly to hazardous driving behavior and pose a safety risk. Method: Proposes the AMap framework with a "distill-from-future" paradigm, in which a teacher model with access to future temporal information guides a student model that uses only the current frame; a multi-level BEV distillation strategy and an asymmetric query adaptation module compress look-ahead knowledge into the student. Result: Experiments on nuScenes and Argoverse 2 show that AMap clearly outperforms existing temporal models in the critical forward region while keeping single-frame inference efficiency. Conclusion: By implicitly injecting look-ahead knowledge, AMap addresses the safety flaw of spatial asymmetry in earlier methods and enables safer, more efficient online HD map construction. Abstract: Online High-Definition (HD) map construction is pivotal for autonomous driving. While recent approaches leverage historical temporal fusion to improve performance, we identify a critical safety flaw in this paradigm: it is inherently "spatially backward-looking." These methods predominantly enhance map reconstruction in traversed areas, offering minimal improvement for the unseen road ahead. Crucially, our analysis of downstream planning tasks reveals a severe asymmetry: while rearward perception errors are often tolerable, inaccuracies in the forward region directly precipitate hazardous driving maneuvers. To bridge this safety gap, we propose AMap, a novel framework for Ahead-aware online HD Mapping. We pioneer a "distill-from-future" paradigm, where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. This process implicitly compresses prospective knowledge into the student model, endowing it with "look-ahead" capabilities at zero inference-time cost. Technically, we introduce a Multi-Level BEV Distillation strategy with spatial masking and an Asymmetric Query Adaptation module to effectively transfer future-aware representations to the student's static queries. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that AMap significantly enhances current-frame perception. Most notably, it outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of single current frame inference.
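
As a rough illustration of BEV distillation with spatial masking, the sketch below computes a mean squared error between student and teacher BEV features only over masked cells. Everything specific here is an assumption: one scalar feature per cell, a binary mask marking the forward region, and the name `masked_distill_loss`. The abstract does not define the loss or the mask.

```python
def masked_distill_loss(student_bev, teacher_bev, mask):
    """Mean squared error between student and teacher BEV cells where mask is set.

    student_bev, teacher_bev: 2D lists of floats (one value per BEV cell).
    mask: 2D list, truthy for cells that should be distilled, for example
    cells ahead of the ego vehicle (an assumption of this sketch).
    """
    total, count = 0.0, 0
    for s_row, t_row, m_row in zip(student_bev, teacher_bev, mask):
        for s, t, m in zip(s_row, t_row, m_row):
            if m:
                total += (s - t) ** 2
                count += 1
    # An empty mask contributes no loss rather than dividing by zero.
    return total / count if count else 0.0
```

The teacher is used only during training, which is consistent with the claim of zero added inference-time cost for the student.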

[196] OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions

Wendong Bu,Kaihang Pan,Yuze Lin,Jiacheng Li,Kai Shen,Wenqiao Zhang,Juncheng Li,Jun Xiao,Siliang Tang

Main category: cs.CV

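TL;DR: Proposes OmniMoGen, a unified framework for versatile motion generation from interleaved text-motion instructions, together with the large-scale dataset X2Mo and the new benchmark AnyContext. Experiments show state-of-the-art results on several tasks and emerging capabilities such as compositional editing and self-reflective generation.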

Details

Motivation: Existing motion generation methods are confined to isolated tasks and lack the flexibility for free-form, multi-objective generation; motion generation has no unified paradigm comparable to what large language models provide for language tasks. Method: Designs a concise architecture built on RVQ-VAE and a transformer, constructs X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, introduces AnyContext as a new evaluation benchmark, and delivers OmniMoGen as an end-to-end, instruction-driven unified motion generation framework. Result: OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and the AnyContext benchmark, and shows emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. Conclusion: OmniMoGen advances general-purpose intelligent motion generation and lays groundwork for LLM-style unified motion generation. Abstract: Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next intelligent motion generation. Project Page: https://OmniMoGen.github.io/.

[197] PEDESTRIAN: An Egocentric Vision Dataset for Obstacle Detection on Pavements

Marios Thoma,Zenonas Theodosiou,Harris Partaourides,Vassilis Vassiliades,Loizos Michael,Andreas Lanitis

Main category: cs.CV

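TL;DR: Introduces the PEDESTRIAN dataset, 340 egocentric videos covering 29 obstacles commonly found on sidewalks, intended to support real-time obstacle detection with deep learning in urban walking environments and improve pedestrian safety.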

Details

Motivation: Urban sidewalks are often blocked by obstacles that compromise pedestrian safety, so effective real-time detection systems are needed. Method: Collects egocentric pedestrian-view videos with mobile phone cameras to build the PEDESTRIAN dataset of 29 obstacle classes, and trains several state-of-the-art deep learning models on it for obstacle recognition experiments. Result: The PEDESTRIAN dataset of 340 videos was built, and experiments with several deep learning algorithms confirm its usefulness as a benchmark for pedestrian obstacle detection. Conclusion: The PEDESTRIAN dataset supports automatic detection of obstacles on urban sidewalks and helps advance pedestrian safety systems based on egocentric vision. Abstract: Walking has always been a primary mode of transportation and is recognized as an essential activity for maintaining good health. Despite the need for safe walking conditions in urban environments, sidewalks are frequently obstructed by various obstacles that hinder free pedestrian movement. Any object obstructing a pedestrian's path can pose a safety hazard. The advancement of pervasive computing and egocentric vision techniques offers the potential to design systems that can automatically detect such obstacles in real time, thereby enhancing pedestrian safety. The development of effective and efficient identification algorithms relies on the availability of comprehensive and well-balanced datasets of egocentric data. In this work, we introduce the PEDESTRIAN dataset, comprising egocentric data for 29 different obstacles commonly found on urban sidewalks. A total of 340 videos were collected using mobile phone cameras, capturing a pedestrian's point of view. Additionally, we present the results of a series of experiments that involved training several state-of-the-art deep learning algorithms using the proposed dataset, which can be used as a benchmark for obstacle detection and recognition tasks. The dataset can be used for training pavement obstacle detectors to enhance the safety of pedestrians in urban areas.

Zihao Luo,Shaohao Rui,Zhenyu Tang,Guotai Wang,Xiaosong Wang

Main category: cs.CV

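TL;DR: Proposes InvCoSS, a continual self-supervised learning framework that needs no access to previous real data. It uses model inversion to generate high-quality synthetic images that mitigate catastrophic forgetting, achieving strong downstream performance while protecting privacy.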

Details

Motivation: Existing continual self-supervised learning methods replay historical data to prevent catastrophic forgetting, which compromises data privacy and limits use in real medical settings. Method: After training on a task, the pre-trained model is inverted to generate synthetic images that approximate the original distribution, and these are combined with the new task's data for joint optimization; InvUNet, a multi-scale fusion network, improves the quality of inverted images, and a repulsive representation-learning mechanism increases the diversity of synthetic samples. Result: Experiments on nine downstream tasks show that InvCoSS matches or surpasses data-replay methods, significantly reduces storage requirements, and avoids using real historical data entirely. Conclusion: InvCoSS enables efficient continual learning without access to real data, balancing model performance, data privacy, and storage efficiency, and supports practical deployment of medical imaging foundation models. Abstract: Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.

[199] HippMetric: A skeletal-representation-based framework for cross-sectional and longitudinal hippocampal substructural morphometry

Na Gao,Chenfei Ye,Yanwu Yang,Anqi Li,Zhengbo He,Li Liang,Zhiyuan Liu,Xingyu Hao,Ting Ma,Tengfei Guo

Main category: cs.CV

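TL;DR: Proposes HippMetric, a framework that builds a deformable, anatomically aligned coordinate system from skeletal representations (s-reps), enabling accurate morphometry and point-wise correspondence of hippocampal substructures across individuals and longitudinal scans.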

Details

Motivation: The hippocampus varies greatly between individuals and has a complex folding pattern; existing methods lack a stable intrinsic coordinate system and struggle to establish reliable cross-subject and longitudinal correspondence. Method: Building on skeletal representations (s-reps) and the ARMM model, constructs a deformable skeletal coordinate system aligned with hippocampal anatomy and function; individualized s-reps are generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, enforcing boundary adherence, orthogonality, and non-intersection. Result: Experiments on two international cohorts show that HippMetric outperforms existing shape models in accuracy, reliability, and correspondence stability. Conclusion: HippMetric provides a biologically plausible framework for hippocampal substructure analysis that supports consistent localization across individuals and time and can help identify early biomarkers of neurodegenerative disease. Abstract: Accurate characterization of hippocampal substructure is crucial for detecting subtle structural changes and identifying early neurodegenerative biomarkers. However, high inter-subject variability and the complex folding pattern of the human hippocampus hinder consistent cross-subject and longitudinal analysis. Most existing approaches rely on subject-specific modelling and lack a stable intrinsic coordinate system to accommodate anatomical variability, which limits their ability to establish reliable inter- and intra-individual correspondence. To address this, we propose HippMetric, a skeletal representation (s-rep)-based framework for hippocampal substructural morphometry and point-wise correspondence across individuals and scans. HippMetric builds on the Axis-Referenced Morphometric Model (ARMM) and employs a deformable skeletal coordinate system aligned with hippocampal anatomy and function, providing a biologically grounded reference for correspondence. Our framework comprises two core modules: a skeletal-based coordinate system that respects the hippocampus' conserved longitudinal lamellar architecture, in which functional units (lamellae) are stacked perpendicular to the long-axis, enabling anatomically consistent localization across subjects and time; and individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, enforcing boundary adherence, orthogonality and non-intersection to produce mathematically valid skeletal geometry. Extensive experiments on two international cohorts demonstrate that HippMetric achieves higher accuracy, reliability, and correspondence stability compared to existing shape models.

[200] Towards Minimal Fine-Tuning of VLMs

Tiange Luo,Lajanugen Logeswaran,Jaekyeom Kim,Justin Johnson,Honglak Lee

Main category: cs.CV

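TL;DR: Image-LoRA is a lightweight, efficient fine-tuning method for vision-language models. It applies low-rank adaptation only to the value path of attention layers within the visual-token span, combined with head selection and normalization, reducing training compute while preserving performance.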

Details

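Motivation: To reduce the compute and parameter cost of existing LoRA fine-tuning for vision-language models while preserving their performance, in particular the redundant computation spent when handling visual tokens. Method: Proposes Image-LoRA, which restricts low-rank adaptation to the value path of attention layers within the visual-token span; a preliminary rank-1 adaptation estimates head influence so that only a subset of attention heads is fine-tuned; and per-layer selection-size normalization stabilizes the updates. Result: Across a range of grounding and referring tasks, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs; it also preserves the original reasoning ability on pure-text tasks such as GSM8K. Conclusion: Image-LoRA is an efficient fine-tuning method for vision-language models that lowers compute cost while preserving multimodal and pure-text performance, making it suitable for VLM adaptation under resource constraints. Abstract: We introduce Image-LoRA, a lightweight parameter-efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.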
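
The core restriction, a low-rank update on the value path applied only inside the visual-token span, can be sketched in a few lines. This is a single-head toy in plain Python under assumed names (`value_projection`, `lora_a`, `lora_b`, `visual_span`, `scale`); head selection and selection-size normalization from the abstract are left out, and none of this is the authors' code.

```python
def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def value_projection(tokens, w_v, lora_a, lora_b, visual_span, scale=1.0):
    """Value projection with a low-rank update applied only to visual tokens.

    tokens: list of token vectors (each of length d_in).
    w_v: frozen value weight, d_in x d_out.
    lora_a: d_in x r and lora_b: r x d_out, the trainable low-rank factors.
    visual_span: half-open (start, end) positions of the visual tokens.
    """
    start, end = visual_span
    out = []
    for pos, x in enumerate(tokens):
        v = matvec(x, w_v)  # frozen path, used for every token
        if start <= pos < end:
            # The adapter runs only here, so its cost scales with the
            # number of visual tokens rather than the full sequence.
            delta = matvec(matvec(x, lora_a), lora_b)
            v = [vi + scale * di for vi, di in zip(v, delta)]
        out.append(v)
    return out
```

Text tokens pass through the frozen weights untouched, which fits the reported preservation of pure-text reasoning.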

[201] From Pixels to Predicates: Structuring urban perception with scene graphs

Yunlong Liu,Shuyang Li,Pengyuan Liu,Yu Zhang,Rudi Stouffs

Main category: cs.CV

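TL;DR: Proposes a graph-based three-stage pipeline that extracts structured representations from street view imagery to predict urban perception indicators, improving accuracy by an average of 26% over baseline models and generalizing well across cities.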

Details

Motivation: Existing streetscape-based perception studies mostly rely on pixel features or object co-occurrence statistics and overlook the explicit relations that shape human perception, so a modeling approach that captures relations between objects in a scene is needed. Method: Uses a three-stage pipeline: the OpenPSG model first extracts object-predicate-object triplets from street view images to build panoptic scene graphs; a heterogeneous graph autoencoder (GraphMAE) then learns compact scene-level graph embeddings; finally a neural network predicts six perception indicators from the embeddings. Result: The approach outperforms image-only baselines in accuracy and precision, with an average improvement of 26%, and generalizes strongly in cross-city tasks; the structured representation also reveals, in an interpretable way, relational patterns that lower perception scores, such as graffiti on a wall or a car parked on the sidewalk. Conclusion: Graph-based structured representations improve the accuracy, interpretability, and cross-city generalization of urban perception modeling and advance human-centric, context-aware urban analytics. Abstract: Perception research is increasingly modelled using streetscapes, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three-stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed using an open-set Panoptic Scene Graph model (OpenPSG) to extract object-predicate-object triplets. In the second stage, compact scene-level embeddings are learned through a heterogeneous graph autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. Results indicate that (i) our approach improves perception prediction accuracy by an average of 26% over baseline models, and (ii) maintains strong generalization performance in cross-city prediction tasks. Additionally, the structured representation clarifies which relational patterns contribute to lower perception scores in urban scenes, such as graffiti on wall and car parked on sidewalk. Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.

Meng Chu,Senqiao Yang,Haoxuan Che,Suiyun Zhang,Xichen Zhang,Shaozuo Yu,Haokun Gui,Zhefan Rao,Dandan Tu,Rui Liu,Jiaya Jia

Main category: cs.CV

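TL;DR: Introduces Long Goal Bench (LGBench) to evaluate generative models on complex multi-goal design tasks, and proposes VisionDirector to improve how models parse and execute long instructions, yielding clear gains on typography, multi-object scenes, and pose editing.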

Details

Motivation: Existing generative models handle the long, multi-goal instructions issued by professional designers poorly, and there is no benchmark that measures their performance in such real-world settings, so new evaluation standards and optimization methods are needed. Method: Proposes LGBench, a 2,000-task evaluation suite, and VisionDirector, a training-free vision-language supervisor that improves the generation pipeline through structured goal extraction, a dynamic choice between one-shot generation and staged edits, micro-grid sampling with semantic verification, rollback, and goal-level reward logging; the planner is further fine-tuned with Group Relative Policy Optimization. Result: Current state-of-the-art models satisfy fewer than 72% of the goals and often miss localized edits; VisionDirector gains 7% on GenEval and 0.07 absolute on ImgEdit, uses shorter edit trajectories (3.1 versus 4.2 steps), and shows consistent qualitative improvements on typography, multi-object scenes, and pose editing. Conclusion: VisionDirector improves generative models' understanding and execution of complex long instructions, exposes the brittleness of current generation pipelines, and offers a new evaluation standard and optimization direction for multi-goal image generation. Abstract: Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models' performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.
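
Group Relative Policy Optimization scores each sampled trajectory against the other samples drawn for the same prompt instead of using a learned value function. The sketch below shows only that normalization step. Treating a trajectory's reward as the fraction of goals it satisfied is an assumption made for illustration; the abstract does not say how goal-level rewards are aggregated, and `group_relative_advantages` is a hypothetical name.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are ranked relative to their siblings rather than on an absolute scale.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: four plans for one instruction, scored by the
# fraction of extracted goals each plan ended up satisfying.
example_advantages = group_relative_advantages([0.9, 0.6, 0.6, 0.3])
```

Plans that satisfy more goals than their siblings get positive advantages and are reinforced; a group with identical rewards yields all-zero advantages and contributes no gradient signal.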

[203] 3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory

Xinyang Song,Libin Wang,Weining Wang,Zhiwei Li,Jianxin Sun,Dandan Zheng,Jingdong Chen,Qi Li,Zhenan Sun

Main category: cs.CV

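TL;DR: Proposes 3SGen, a task-aware unified image generation framework that handles subject, style, and structure conditioning within a single model and uses an adaptive task-specific memory module for dynamic disentanglement and compositional control.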

Details

Motivation: Existing image generation methods usually handle subject, style, and structure conditioning in isolation, which causes feature entanglement and limits cross-task transfer. Method: Proposes the 3SGen framework, which pairs an MLLM with learnable semantic queries to align text-image semantics and a VAE branch to preserve detail; its core is the ATM module, which uses a lightweight gating mechanism to manage task-specific priors (such as identity, texture, and layout) dynamically and supports compositional generation. Result: Evaluation on the authors' 3SGen-Bench and several public benchmarks confirms the method's advantage, with clear gains in cross-task fidelity and controllability. Conclusion: 3SGen unifies multiple conditioning modes in one model, mitigates inter-task interference, and shows good scalability and practical potential. Abstract: Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.

[204] Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation

Ivan DeAndres-Tame,Chengwei Ye,Ruben Tolosana,Ruben Vera-Rodriguez,Shiqi Yu

Main category: cs.CV

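TL;DR: Evaluates how well four state-of-the-art generative AI human animation models preserve identity cues for gait biometrics. Visual quality is high but biometric fidelity is low, indicating that current models struggle to separate identity from motion and that recognition relies on appearance texture rather than temporal dynamics.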

Details

Motivation: To examine whether state-of-the-art generative AI human animation models preserve the subtle spatio-temporal cues used for person identification, such as gait, and whether they are suitable for behavioral biometrics. Method: Evaluates four generative AI models on two tasks: (1) restoring gait patterns from reference videos of varying complexity; and (2) transferring gait patterns to different visual identities to test how well identity and motion are disentangled. Result: The generated animations have high visual quality but low biometric fidelity in identification tasks; in the identity transfer task in particular, recognition performance drops sharply once texture is separated from motion. Conclusion: Current generative AI models do not preserve gait features usable for person identification in human animation and depend mainly on appearance cues rather than motion dynamics, which exposes a fundamental weakness for biometric applications. Abstract: Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.

[205] Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context

Kyungwon Cho,Hanbyul Joo

Main category: cs.CV

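TL;DR: This paper proposes HaMoS, a hand-aware, sequence-level diffusion framework for estimating the wearer's full-body motion from first-person video. By combining head trajectory with intermittently visible hand cues, it achieves state-of-the-art accuracy and temporal smoothness under real-world conditions.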

Details

Motivation: Estimating the wearer's full-body motion from first-person video is challenging because most body parts are invisible from the egocentric view. Existing methods mainly rely on head trajectory or assume continuously tracked hands, which is unrealistic on lightweight devices, so a more practical approach is needed. Method: The HaMoS framework conditions directly on head trajectory and on hand cues that are only intermittently visible because of field-of-view limits and occlusion; it introduces a new data augmentation method that simulates real-world conditions and uses local attention to infer long sequences efficiently. Result: Experiments show that the method achieves state-of-the-art accuracy and temporal smoothness on public benchmarks. Conclusion: HaMoS is a practical step toward reliable in-the-wild egocentric 3D motion understanding. Abstract: Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer's full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.

[206] RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning

Jun Li,Zikun Chen,Haibo Chen,Shuo Chen,Jian Yang

Main category: cs.CV

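TL;DR: This paper proposes Reinforcement Mixing Learning (RMLer), a framework that uses reinforcement learning to deeply fuse textual concepts across categories, addressing the insufficient concept mixing of existing text-to-image generation methods.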

Details

Motivation: Existing methods for cross-category concept fusion suffer from insufficient concept mixing, non-rigorous evaluation, and low output quality, such as conceptual imbalance or mere juxtaposition. Method: Cross-category concept fusion is modeled as a reinforcement learning problem in which mixed features are states, mixing strategies are actions, and visual outcomes are rewards. An MLP policy network predicts dynamic weights for blending text embeddings, visual rewards based on semantic similarity and compositional balance are introduced, and the policy is trained with proximal policy optimization (PPO). Result: Experiments show that RMLer outperforms existing methods at generating coherent, high-fidelity fused objects from different categories, achieving deeper concept combination. Conclusion: RMLer provides a robust framework for generating novel visual concepts, with broad application potential in film, gaming, and design. Abstract: Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs, manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer's superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.

[207] Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing

Xu Zhang,Junyao Ge,Yang Zheng,Kaitai Guo,Jimin Liang

Main category: cs.CV

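TL;DR: Think2Seg-RS is a decoupled framework in which an LVLM generates structured geometric prompts to control a frozen SAM model. Trained with mask-only reinforcement learning, it turns semantic reasoning into spatial localization, performs strongly on remote sensing image segmentation, and generalizes zero-shot across tasks.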

Details

Motivation: Existing vision-language models for remote sensing analysis train linguistic reasoning and pixel prediction jointly, which leads to weak geometric grounding and limited cross-task generalization; a more flexible and interpretable approach to semantic reasoning segmentation is needed. Method: The Think2Seg-RS framework decouples the LVLM's semantic reasoning from SAM's pixel prediction. The LVLM is optimized only through reinforcement learning to generate geometric prompts (such as points and boxes) that guide the frozen SAM to produce masks, with no end-to-end fine-tuning. Result: The method reaches state-of-the-art performance on the EarthReason dataset and transfers zero-shot to multiple referring segmentation benchmarks. Compact segmenters are found to outperform larger ones under semantic-level supervision, and negative prompts are ineffective against complex aerial backgrounds. Conclusion: Semantic-level reasoning segmentation is a new paradigm for geospatial understanding; through its decoupled design, Think2Seg-RS achieves stronger generalization, interpretability, and practicality, opening a new path toward unified LVLM-driven Earth observation. Abstract: Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at https://github.com/Ricardo-XZ/Think2Seg-RS.
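
The abstract for [207] describes a "mask-only" reinforcement learning objective but does not spell out the reward. One plausible reading, offered here only as a sketch, is that the reward depends solely on the overlap between the mask produced from the LVLM's geometric prompt and the ground-truth mask. The helper names `box_to_mask` and `mask_iou_reward` are hypothetical, the box-filling function merely stands in for the frozen segmenter, and the use of IoU is an assumption rather than the paper's confirmed reward.

```python
import numpy as np

def box_to_mask(box, shape):
    """Stand-in for a frozen segmenter: fills the prompted box.
    A real pipeline would pass the prompt to SAM instead (assumption)."""
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def mask_iou_reward(pred_mask, gt_mask):
    """Reward computed from masks only: intersection over union."""
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)

shape = (8, 8)
gt_mask = box_to_mask((2, 2, 6, 6), shape)            # 4x4 target region
reward_exact = mask_iou_reward(box_to_mask((2, 2, 6, 6), shape), gt_mask)
reward_shifted = mask_iou_reward(box_to_mask((4, 2, 8, 6), shape), gt_mask)
print(reward_exact, reward_shifted)
```

Because the reward sees only the final mask, the prompter is free to choose any prompt format that the segmenter accepts, which is consistent with the decoupling the abstract emphasizes.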

[208] MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

Hui Li,Jiayue Lyu,Fu-Yun Wang,Kaihui Cheng,Siyu Zhu,Jingdong Wang

Main category: cs.CV

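TL;DR: This paper proposes MixFlow, a new training method that mitigates the training-testing discrepancy in diffusion models. It exploits the "Slow Flow" phenomenon by training on a mixture of interpolations at slowed timesteps, and achieves strong results across several image generation tasks.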

Details

Motivation: To address the mismatch between training-time and test-time input distributions in diffusion models (exposure bias), in particular the performance drop caused by using ground-truth noisy data during training but generated data during testing. Method: Motivated by the observed Slow Flow phenomenon, MixFlow post-trains the prediction network at each training timestep using ground-truth interpolations that correspond to higher noise levels (slowed timesteps), thereby narrowing the gap between training and testing. Result: The method is validated on class-conditional image generation and text-to-image generation. RAE models with MixFlow reach FID 1.43 (without guidance) and 1.10 (with guidance) on ImageNet at 256×256, and FID 1.55 (without guidance) and 1.10 (with guidance) at 512×512. Conclusion: By leveraging the slowed interpolation mixture from the Slow Flow phenomenon, MixFlow effectively mitigates the training-testing discrepancy and markedly improves the generation quality of diffusion models, with broad applicability and practical value. Abstract: This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem in order to improve diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network for each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.
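
To make the "slowed timestep" idea in [208] concrete, the sketch below builds a training input at a higher-noise timestep `s` than the nominal training timestep `t`. Several details are assumptions not stated in the abstract: the linear interpolation convention (`t = 0` is clean data, `t = 1` is pure noise), the uniform sampling of `s` from `[t, t + max_shift]`, and the names `interpolate`, `slowed_interpolation`, and `max_shift`.

```python
import numpy as np

def interpolate(data, noise, t):
    """Ground-truth noisy sample; t = 0 is clean data, t = 1 is pure noise
    (assumed convention)."""
    return (1.0 - t) * data + t * noise

def slowed_interpolation(data, noise, t, max_shift, rng):
    """Draw a slowed (higher-noise) timestep s >= t and return (x_s, s).
    Uniform sampling over [t, min(1, t + max_shift)] is an assumption."""
    s = rng.uniform(t, min(1.0, t + max_shift))
    return interpolate(data, noise, s), s

rng = np.random.default_rng(0)
data = rng.normal(size=4)
noise = rng.normal(size=4)
t = 0.5
x_s, s = slowed_interpolation(data, noise, t, max_shift=0.2, rng=rng)
# In post-training, the network would receive x_s while still being
# conditioned on the nominal timestep t.
print(t, s, x_s)
```

Feeding the network `x_s` under the label `t` mimics what it sees at sampling time, where, per the abstract, generated samples at a given sampling timestep sit closest to a ground-truth interpolation at a noisier timestep.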

[209] Neural Implicit Heart Coordinates: 3D cardiac shape reconstruction from sparse segmentations

Marica Muffoletto,Uxio Hermida,Charlène Mauger,Avan Suinesiaputra,Yiyang Xu,Richard Burns,Lisa Pankewitz,Andrew D McCulloch,Steffen E Petersen,Daniel Rueckert,Alistair A Young

Main category: cs.CV

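TL;DR: This paper proposes Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system for patient-specific 3D cardiac reconstruction from sparse clinical images, offering high accuracy, strong robustness, and fast inference.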

Details

Motivation: Accurately reconstructing cardiac anatomy from sparse clinical images remains challenging in patient-specific modeling, and existing methods have limited ability to map anatomy consistently across subjects. Method: The NIHC implicit coordinate system is built on universal ventricular coordinates. A neural implicit function predicts NIHCs directly from a small number of 2D segmentations and decodes them into dense 3D segmentations and high-resolution meshes at arbitrary resolution. The model is trained on a large dataset of 5,000 cardiac meshes. Result: Surface errors are 2.51±0.33 mm in the diseased cohort (n=4549) and 2.3±0.36 mm in the healthy cohort (n=5576); complex structures such as the valve planes are reconstructed stably even under severe slice sparsity and segmentation noise; inference time falls from over 60 s with traditional pipelines to 5-15 s. Conclusion: NIHCs are a robust and efficient anatomical representation that enables accurate, consistent 3D cardiac reconstruction from minimal input, and may advance the clinical use of patient-specific cardiac modeling. Abstract: Accurate reconstruction of cardiac anatomy from sparse clinical images remains a major challenge in patient-specific modeling. While neural implicit functions have previously been applied to this task, their application to mapping anatomical consistency across subjects has been limited. In this work, we introduce Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system, based on universal ventricular coordinates, that provides a common anatomical reference frame for the human heart. Our method predicts NIHCs directly from a limited number of 2D segmentations (sparse acquisition) and subsequently decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on a large dataset of 5,000 cardiac meshes, the model achieves high reconstruction accuracy on clinical contours, with mean Euclidean surface errors of 2.51$\pm$0.33 mm in a diseased cohort (n=4549) and 2.3$\pm$0.36 mm in a healthy cohort (n=5576). The NIHC representation enables anatomically coherent reconstruction even under severe slice sparsity and segmentation noise, faithfully recovering complex structures such as the valve planes. Compared with traditional pipelines, inference time is reduced from over 60 s to 5-15 s. These results demonstrate that NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data.

[210] Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome

Moamal Fadhil Abdul,Jonas Bruun Hubrechts,Thomas Martini Jørgensen,Emil Hovad

Main category: cs.CV

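TL;DR: This paper extends the public OpenTTGames dataset with detailed annotations of shot type, player posture, and rally outcome to support fine-grained stroke understanding and tactical analysis in table tennis video, and provides a reproducible annotation scheme and baselines.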

Details

Motivation: Most existing table tennis video datasets are unreleased or carry usage restrictions and lack fine-grained stroke annotations, which limits automated analysis and model training. Method: On top of OpenTTGames, the authors add frame-accurate shot type labels (forehand and backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome labels; they propose a compact coding scheme and a code-assisted labeling procedure, and release the data under an open license for non-commercial use. Result: The extended dataset provides richer semantic annotations, covering shot type, posture, and tactical intent, and supports fine-grained recognition of stroke events and modeling of tactical understanding. Conclusion: The work fills the gap in publicly available, reusable, fine-grained annotated data for table tennis video analysis and advances research on vision-based sports performance analysis. Abstract: Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either "bounce", "net", or "empty_event" in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.

[211] DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis

Yueting Zhu,Yuehao Song,Shuai Zhang,Wenyu Liu,Xinggang Wang

Main category: cs.CV

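TL;DR: DeltaMIL is a new multiple instance learning framework for whole slide image (WSI) analysis. It uses a gated delta mechanism to select semantically relevant regions and integrate discriminative information, substantially improving performance on survival prediction and slide-level classification.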

Details

Motivation: When handling large, heterogeneous WSIs, existing MIL methods struggle to discard redundant information or to integrate key features effectively, which limits their performance. Method: The DeltaMIL framework combines a gated delta mechanism (merging forgetting and memory mechanisms) that dynamically updates memory with a local pattern mixing mechanism that preserves fine-grained pathological structure. Result: Survival prediction improves by 3.69% (ResNet-50) and 2.36% (UNI); slide-level classification accuracy improves by 3.09% (ResNet-50) and 3.75% (UNI). Conclusion: DeltaMIL effectively extracts meaningful cues and suppresses noise, showing strong and consistent performance gains across diverse WSI tasks. Abstract: Whole Slide Images (WSIs) are typically analyzed using multiple instance learning (MIL) methods. However, the scale and heterogeneity of WSIs generate highly redundant and dispersed information, making it difficult to identify and integrate discriminative signals. Existing MIL methods either fail to discard uninformative cues effectively or have limited ability to consolidate relevant features from multiple patches, which restricts their performance on large and heterogeneous WSIs. To address this issue, we propose DeltaMIL, a novel MIL framework that explicitly selects semantically relevant regions and integrates the discriminative information from WSIs. Our method leverages the gated delta rule to efficiently filter and integrate information through a block combining forgetting and memory mechanisms. The delta mechanism dynamically updates the memory by removing old values and inserting new ones according to their correlation with the current patch. The gating mechanism further enables rapid forgetting of irrelevant signals. Additionally, DeltaMIL integrates a complementary local pattern mixing mechanism to retain fine-grained pathological locality. Our design enhances the extraction of meaningful cues and suppresses redundant or noisy information, which improves the model's robustness and discriminative power. Experiments demonstrate that DeltaMIL achieves state-of-the-art performance. Specifically, for survival prediction, DeltaMIL improves performance by 3.69% using ResNet-50 features and 2.36% using UNI features. For slide-level classification, it increases accuracy by 3.09% with ResNet-50 features and 3.75% with UNI features. These results demonstrate the strong and consistent performance of DeltaMIL across diverse WSI tasks.
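
The abstract for [211] says the delta mechanism removes old values and inserts new ones, while a gate enables rapid forgetting. A minimal sketch of one such update follows, using the form of the gated delta rule commonly written in the linear-attention literature, S_t = α_t S_{t-1} (I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ. Whether DeltaMIL's block uses exactly this form is an assumption, and the function name `gated_delta_step` is hypothetical.

```python
import numpy as np

def gated_delta_step(memory, key, value, alpha, beta):
    """One gated delta update.
    memory: (d_v, d_k) associative memory; key: (d_k,) unit vector;
    value: (d_v,); alpha in [0, 1] is the forget gate; beta in [0, 1]
    is the write strength."""
    old_value = memory @ key  # what the memory currently returns for this key
    # Decay the whole memory and erase the old association for this key.
    memory = alpha * (memory - beta * np.outer(old_value, key))
    # Write the new association.
    return memory + beta * np.outer(value, key)

e1, e2 = np.eye(2)                       # two orthogonal patch keys
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.5, 4.0])
memory = np.zeros((3, 2))
memory = gated_delta_step(memory, e1, v1, alpha=1.0, beta=1.0)
memory = gated_delta_step(memory, e2, v2, alpha=0.5, beta=1.0)
print(memory @ e1)  # v1 after one decay step with alpha = 0.5
print(memory @ e2)  # v2, freshly written
```

With `beta = 1` the new value fully replaces whatever was stored under the same key, and `alpha < 1` fades every other association, which matches the filter-and-integrate behavior the abstract describes.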

[212] GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis

Siyuan Mei,Yan Xia,Fuxin Fan

Main category: cs.CV

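TL;DR: This paper proposes GANeXt, a 3D ConvNeXt-based generative adversarial network for unified cross-modality (MRI-to-CT and CBCT-to-CT) CT synthesis, with high accuracy and good generalization.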

Details

Motivation: Accurate anatomical representation in adaptive radiotherapy requires synthesizing high-quality CT from MRI or CBCT, and existing methods generalize poorly across modalities or anatomical regions. Method: GANeXt uses a U-shaped generator built from 3D ConvNeXt blocks together with a PatchGAN discriminator, optimized with several loss functions (MAE, perceptual loss, segmentation-masked MAE, adversarial loss, and a combination of Dice and cross-entropy). Training uses AdamW optimizers with learning-rate scheduling, and the data are registered, normalized, and augmented. Result: The model is trained jointly on multiple anatomical regions without fine-tuning, for 3000 epochs on MRI-to-CT and 1000 epochs on CBCT-to-CT, and performs well; at inference, a sliding-window strategy with average folding reconstructs the full-size synthetic CT. Conclusion: GANeXt enables effective cross-modality, multi-region CT synthesis with good generalization and clinical potential, and is suited to treatment planning in adaptive radiotherapy. Abstract: The synthesis of computed tomography (CT) from magnetic resonance imaging (MRI) and cone-beam CT (CBCT) plays a critical role in clinical treatment planning by enabling accurate anatomical representation in adaptive radiotherapy. In this work, we propose GANeXt, a 3D patch-based, fully ConvNeXt-powered generative adversarial network for unified CT synthesis across different modalities and anatomical regions. Specifically, GANeXt employs an efficient U-shaped generator constructed from stacked 3D ConvNeXt blocks with compact convolution kernels, while the discriminator adopts a conditional PatchGAN. To improve synthesis quality, we incorporate a combination of loss functions, including mean absolute error (MAE), perceptual loss, segmentation-based masked MAE, and adversarial loss, together with a combination of Dice loss and cross-entropy for the multi-head segmentation discriminator. For both tasks, training is performed with a batch size of 8 using two separate AdamW optimizers for the generator and discriminator, each equipped with a warmup and cosine decay scheduler, with learning rates of $5\times10^{-4}$ and $1\times10^{-3}$, respectively. Data preprocessing includes deformable registration, foreground cropping, percentile normalization for the input modality, and linear normalization of the CT to the range $[-1024, 1000]$. Data augmentation involves random zooming within $(0.8, 1.3)$ (for MRI-to-CT only), fixed-size cropping to $32\times160\times192$ for MRI-to-CT and $32\times128\times128$ for CBCT-to-CT, and random flipping. During inference, we apply a sliding-window approach with $0.8$ overlap and average folding to reconstruct the full-size synthetic CT (sCT), followed by inversion of the CT normalization. After joint training on all regions without any fine-tuning, the final models are selected at the end of 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using the full training dataset.
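
The inference step in [212], a sliding window with 0.8 overlap followed by average folding, can be illustrated in one dimension. The sketch below is a simplification under stated assumptions: the actual model works on 3D patches, the function name `sliding_window_average` and the tail-handling rule are hypothetical, and `predict` stands in for the trained generator.

```python
import numpy as np

def sliding_window_average(volume, patch, overlap, predict):
    """Apply `predict` to overlapping 1-D windows and average the
    overlapping outputs (a 1-D analogue of average folding)."""
    n = len(volume)
    stride = max(1, int(round(patch * (1.0 - overlap))))
    last_start = max(n - patch, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail is covered
    total = np.zeros(n, dtype=float)
    count = np.zeros(n, dtype=float)
    for start in starts:
        window = volume[start:start + patch]
        total[start:start + patch] += predict(window)
        count[start:start + patch] += 1.0
    return total / count

volume = np.arange(23.0)
identity_out = sliding_window_average(volume, patch=10, overlap=0.8,
                                      predict=lambda w: w)
doubled_out = sliding_window_average(volume, patch=10, overlap=0.8,
                                     predict=lambda w: 2.0 * w)
print(identity_out[:5], doubled_out[:5])
```

A high overlap such as 0.8 means each voxel is predicted by several windows, so averaging smooths seams between patches at the cost of more forward passes.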

[213] ReasonCD: A Multimodal Reasoning Large Model for Implicit Change-of-Interest Semantic Mining

Zhenyang Huang,Xiao Yu,Yi Zhang,Decheng Wang,Hang Ruan

Main category: cs.CV

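TL;DR: This paper proposes ReasonCD, a multimodal reasoning change detection model that mines users' implicit task intent and uses the reasoning ability of large language models to improve remote sensing change detection. It performs strongly on public datasets, reaching an F1 score of 92.1%, and can explain its reasoning to support human decision-making.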

Details

Motivation: Existing methods depend on explicit textual descriptions for change detection and degrade severely when given implicit descriptions, so a method that understands implicit intent is needed to make detection more robust. Method: ReasonCD combines the reasoning ability of pre-trained large language models to mine implicit task intent from user text and uses that intent to guide change detection on remote sensing images. Result: The model reaches an F1 score of 92.1% on the BCDD dataset, and its reasoning and explanation abilities are validated on a reasoning subset built from the SECOND dataset. Conclusion: ReasonCD effectively understands users' implicit intent and delivers high-performing, interpretable remote sensing change detection, offering a new direction for intelligent remote sensing interpretation. Abstract: Remote sensing image change detection is one of the fundamental tasks in remote sensing intelligent interpretation. Its core objective is to identify changes within change regions of interest (CRoI). Current multimodal large models encode rich human semantic knowledge, which is utilized for guidance in tasks such as remote sensing change detection. However, existing methods that use semantic guidance for detecting users' CRoI overly rely on explicit textual descriptions of CRoI, leading to the problem of near-complete performance failure when presented with implicit CRoI textual descriptions. This paper proposes a multimodal reasoning change detection model named ReasonCD, capable of mining users' implicit task intent. The model leverages the powerful reasoning capabilities of pre-trained large language models to mine users' implicit task intents and subsequently obtains different change detection results based on these intents. Experiments on public datasets demonstrate that the model achieves excellent change detection performance, with an F1 score of 92.1% on the BCDD dataset. Furthermore, to validate its superior reasoning functionality, this paper annotates a subset of reasoning data based on the SECOND dataset. Experimental results show that the model not only excels at basic reasoning-based change detection tasks but can also explain the reasoning process to aid human decision-making.

[214] Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization

Zhongwei Chen,Hai-Jun Rong,Zhao-Xu Yang,Guoqi Li

Main category: cs.CV

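TL;DR: This paper proposes SpikeViMFormer, the first spiking neural network (SNN) framework for drone-view geo-localization (DVGL). It introduces a lightweight spike-driven transformer backbone, a spike-driven selective attention (SSA) block, and a spike-driven hybrid state space (SHS) block to reduce power consumption while mitigating information loss and the difficulty of learning long-range dependencies, and optimizes the backbone with a hierarchical re-ranking alignment learning (HRAL) strategy. Experiments show that it outperforms existing SNN methods and is competitive with advanced ANN models.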

Details

Motivation: Traditional DVGL methods based on artificial neural networks (ANNs) are computationally intensive and power-hungry. SNNs offer low power consumption, but the sparsity of spikes causes loss of critical information and makes long-range dependencies hard to model in representation learning, and their potential for DVGL has not been thoroughly explored. Method: The SpikeViMFormer framework uses a spike-driven transformer backbone to extract coarse-grained features; an SSA block that applies a spike-driven gating mechanism for selective feature enhancement; and an SHS block that models long-range dependencies with a hybrid state space. Only the backbone is used at inference to reduce cost, and the HRAL strategy optimizes the backbone through neighborhood re-ranking and cross-batch consistency constraints. Result: SpikeViMFormer outperforms existing SNN models on DVGL and performs comparably to advanced ANN models, with lower computational and energy cost. Conclusion: SpikeViMFormer is the first application of SNNs to DVGL; through architectural innovation and an improved training strategy, it achieves efficient feature learning and strong localization performance while keeping the low-power advantage, demonstrating the potential of SNNs in this area. Abstract: Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs), which benefit from spike-driven computation, inherently provide low power consumption. Regrettably, the potential of SNNs for DVGL has yet to be thoroughly investigated. Meanwhile, the inherent sparsity of spike-driven computation for representation learning scenarios also results in loss of critical information and difficulties in learning long-range dependencies when aligning heterogeneous visual data sources. To address these, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone is adopted to extract coarse-grained features. To mitigate the loss of critical information, the spike-driven selective attention (SSA) block is designed, which uses a spike-driven gating mechanism to achieve selective feature enhancement and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. Moreover, only the backbone is utilized during the inference stage to reduce computational cost. To ensure backbone effectiveness, a novel hierarchical re-ranking alignment learning (HRAL) strategy is proposed. It refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs. Compared with advanced ANNs, it also achieves competitive performance. Our code is available at https://github.com/ISChenawei/SpikeViMFormer

[215] DSTED: Decoupling Temporal Stabilization and Discriminative Enhancement for Surgical Workflow Recognition

Yueyao Chen,Kai-Ni Wang,Dario Tayupo,Arnaud Huaulm'e,Krystel Nyangoh Timoh,Pierre Jannin,Qi Dou

Main category: cs.CV

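TL;DR: This paper proposes DSTED, a dual-pathway framework for surgical workflow recognition. Reliable Memory Propagation (RMP) and Uncertainty-Aware Prototype Retrieval (UPR) improve recognition stability and discrimination of ambiguous phases, achieving state-of-the-art performance on the AutoLaparo-hysterectomy dataset.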

Details

Motivation: Existing surgical workflow recognition methods suffer from prediction jitter between frames and poor discrimination of ambiguous phases, so temporal consistency and recognition of hard samples need to improve. Method: The DSTED framework has two core modules: RMP filters and fuses high-confidence historical features through multi-criteria reliability assessment to maintain temporal coherence; UPR builds learnable class-specific prototypes from high-uncertainty samples and refines ambiguous frame representations through adaptive prototype matching. A confidence-driven gate then dynamically balances the two pathways. Result: On AutoLaparo-hysterectomy the method reaches 84.36% accuracy and 65.51% F1-score, surpassing the second-best method by 3.51% and 4.88% respectively. Ablations show gains of 2.19% from RMP and 1.93% from UPR, with better results when the two are combined; analysis shows a substantial reduction in temporal jitter and improved recognition of phase transitions. Conclusion: By decoupling the modeling of temporal consistency from that of phase ambiguity, the dual-pathway design offers a new paradigm for stable workflow recognition, with higher performance and clinical value. Abstract: Purpose: Surgical workflow recognition enables context-aware assistance and skill assessment in computer-assisted interventions. Despite recent advances, current methods suffer from two critical challenges: prediction jitter across consecutive frames and poor discrimination of ambiguous phases. This paper aims to develop a stable framework by selectively propagating reliable historical information and explicitly modeling uncertainty for hard sample enhancement. Methods: We propose a dual-pathway framework DSTED with Reliable Memory Propagation (RMP) and Uncertainty-Aware Prototype Retrieval (UPR). RMP maintains temporal coherence by filtering and fusing high-confidence historical features through multi-criteria reliability assessment. UPR constructs learnable class-specific prototypes from high-uncertainty samples and performs adaptive prototype matching to refine ambiguous frame representations. Finally, a confidence-driven gate dynamically balances both pathways based on prediction certainty. Results: Our method achieves state-of-the-art performance on AutoLaparo-hysterectomy with 84.36% accuracy and 65.51% F1-score, surpassing the second-best method by 3.51% and 4.88% respectively. Ablations reveal complementary gains from RMP (2.19%) and UPR (1.93%), with synergistic effects when combined. Extensive analysis confirms substantial reduction in temporal jitter and marked improvement on challenging phase transitions. Conclusion: Our dual-pathway design introduces a novel paradigm for stable workflow recognition, demonstrating that decoupling the modeling of temporal consistency and phase ambiguity yields superior performance and clinical applicability.

[216] Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis

Xiaoming Zhang,Chunli Li,Jiacheng Hao,Yuan Gao,Danyang Tu,Jianyi Qiao,Xiaoli Yin,Le Lu,Ling Zhang,Ke Yan,Yang Hou,Yu Shi

Main category: cs.CV

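TL;DR: This study proposes MOON++, a novel multimodal framework that uses non-contrast CT (NCCT) scans to assess esophageal varices (EV) severity non-invasively through multi-organ analysis. Combining imaging features of the liver, spleen, and esophagus, it shows better diagnostic performance than conventional single-organ methods on large-scale patient data and is validated in a radiologist reader study, offering a promising non-invasive diagnostic alternative for patients with cirrhosis.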

Details

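Motivation: Esophageal varices are a serious complication of portal hypertension, and conventional diagnosis relies on endoscopy, which is invasive. Non-contrast CT (NCCT) has potential value but remains underused, so a more effective non-invasive assessment method is needed. Method: The authors propose Multi-Organ-COhesion Network++ (MOON++), a multimodal deep learning framework that fuses imaging features of the liver, spleen, and esophagus and models the clinically established relationship between organ volumes and liver disease severity. It is trained and validated on NCCT data from 1,631 patients, tested on independent datasets of 239 and 289 patients, and evaluated in a radiologist reader study. Result: MOON++, in distinguishing severe varices (G3 vs =G2 vs =G2 versus

### [217] [dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models](https://arxiv.org/abs/2512.19433)

*Yi Xin,Siqi Luo,Qi Qin,Haoxing Chen,Kaiwen Zhu,Zhiwei Zhang,Yangfan He,Rongchao Zhang,Jinbin Bai,Shuo Cao,Bin Fu,Junjun He,Yihao Liu,Yuewen Cao,Xiaohong Liu*

Main category: cs.CV

TL;DR: The dMLLM-TTS framework uses hierarchical search and a self-verified feedback mechanism to improve the generation quality of multimodal large models while lowering computational cost.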

Details

Motivation: Existing test-time scaling methods are inefficient for diffusion multimodal large language models and depend on external verifiers, making it hard to unlock the models' generative potential. Method: The authors design a hierarchical search algorithm with O(N+T) complexity and introduce a self-verified feedback mechanism that uses the model's own image understanding ability to assess text-image alignment. Result: Experiments on the GenEval benchmark with three representative dMLLMs show markedly improved generation quality and up to 6x greater efficiency. Conclusion: dMLLM-TTS improves the generation performance of multimodal large models efficiently and effectively, without an external verifier and at lower computational cost. Abstract: Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for an external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (Lumina-DiMOO, MMaDA, and Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.

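### [218] [MT-Mark: Rethinking Image Watermarking via Mutual-Teacher Collaboration with Adaptive Feature Modulation](https://arxiv.org/abs/2512.19438)

*Fei Ge,Ying Huang,Jie Liu,Guixuan Zhang,Zhi Zeng,Shuwu Zhang,Hu Guan*

Main category: cs.CV

TL;DR: This paper proposes a new deep image watermarking framework that introduces a Collaborative Interaction Mechanism (CIM) and an Adaptive Feature Modulation Module (AFMM) to make the embedder and extractor cooperate explicitly, improving the watermarking system's robustness, perceptual quality, and generalization.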

Details

Motivation: In existing watermarking methods the embedder and extractor are weakly coupled and lack a mechanism for collaboration during training, so robustness is learned inefficiently and the system is vulnerable to host interference. Method: A Collaborative Interaction Mechanism (CIM) establishes bidirectional communication between embedder and extractor under a mutual-teacher training paradigm; an Adaptive Feature Modulation Module (AFMM) decouples modulation structure from strength to enable content-aware feature regulation. Result: Experiments on real-world and AI-generated datasets show that the method clearly outperforms state-of-the-art approaches while maintaining high perceptual quality, with stronger robustness and generalization. Conclusion: Architecture-level redesign lets embedding and extraction be optimized cooperatively, so that robustness arises from coordinated representation learning rather than from extensive distortion simulation, offering a new paradigm for deep watermarking system design. Abstract: Existing deep image watermarking methods follow a fixed embedding-distortion-extraction pipeline, where the embedder and extractor are weakly coupled through a final loss and optimized in isolation. This design lacks explicit collaboration, leaving no structured mechanism for the embedder to incorporate decoding-aware cues or for the extractor to guide embedding during training. To address this architectural limitation, we rethink deep image watermarking by reformulating embedding and extraction as explicitly collaborative components. To realize this reformulation, we introduce a Collaborative Interaction Mechanism (CIM) that establishes direct, bidirectional communication between the embedder and extractor, enabling a mutual-teacher training paradigm and coordinated optimization. Built upon this explicitly collaborative architecture, we further propose an Adaptive Feature Modulation Module (AFMM) to support effective interaction. AFMM enables content-aware feature regulation by decoupling modulation structure and strength, guiding watermark embedding toward stable image features while suppressing host interference during extraction. Under CIM, the AFMMs on both sides form a closed-loop collaboration that aligns embedding behavior with extraction objectives. This architecture-level redesign changes how robustness is learned in watermarking systems. Rather than relying on exhaustive distortion simulation, robustness emerges from coordinated representation learning between embedding and extraction. Experiments on real-world and AI-generated datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in watermark extraction accuracy while maintaining high perceptual quality, showing strong robustness and generalization.

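### [219] [D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning](https://arxiv.org/abs/2512.19443)

*Evelyn Zhang,Fufu Yu,Aoqi Wu,Zichen Wen,Ke Yan,Shouhong Ding,Biqing Qi,Linfeng Zhang*

Main category: cs.CV

TL;DR: This paper proposes D2Pruner, a visual token pruning framework for multimodal large models. It combines debiased importance scoring with maximal independent set selection on a hybrid graph to cut computation substantially while preserving performance, and it outperforms existing methods by a wide margin on fine-grained localization tasks.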

Details

Motivation: Existing visual token pruning methods perform poorly on fine-grained localization, mainly because importance-based methods have a positional bias and diversity-based methods lack structural awareness, so neither handles both semantic importance and spatial redundancy. Method: D2Pruner first selects critical tokens as pivots using debiased attention scores, then runs Maximal Independent Set (MIS) selection on a hybrid graph over the remaining tokens that combines spatial proximity and semantic similarity, iteratively keeping the most important non-adjacent tokens to optimize importance and diversity jointly. Result: On LLaVA-1.5-7B, FLOPs fall by 74.2% while 99.2% of performance is retained; on localization tasks with InternVL-2.5-8B, 85.7% of performance is retained at 90% token reduction, up to 63.53% better than existing methods. Conclusion: D2Pruner resolves the positional bias and structural blindness of existing pruning methods, balances efficiency and accuracy, and is particularly suited to long-sequence visual understanding tasks that need high-fidelity spatial perception. Abstract: Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2% while retaining 99.2% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7% performance at a 90% token reduction rate, marking a significant advancement with up to 63.53% improvement over existing methods.
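
The selection step described for [219] (keep the most important available token, then remove its neighbors) can be sketched as a greedy pass over a token graph. This is an illustration under assumptions: the edge rule (connect tokens that are spatially close or semantically similar), the thresholds, and the names `hybrid_adjacency` and `greedy_mis` are hypothetical, and the debiasing of the importance scores is not modeled here.

```python
import numpy as np

def hybrid_adjacency(positions, features, radius, sim_threshold):
    """Connect tokens that are spatially close OR semantically similar
    (the exact edge rule is an assumption)."""
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    adjacency = (dist <= radius) | (sim >= sim_threshold)
    np.fill_diagonal(adjacency, False)
    return adjacency

def greedy_mis(importance, adjacency, budget):
    """Visit tokens from most to least important; keep a token if it is
    still available, then mark its neighbors as unavailable."""
    available = np.ones(len(importance), dtype=bool)
    selected = []
    for idx in np.argsort(-importance):
        if len(selected) >= budget:
            break
        if available[idx]:
            selected.append(int(idx))
            available[idx] = False
            available[adjacency[idx]] = False
    return selected

# Five tokens on a line with mutually orthogonal features: a chain graph.
positions = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]], dtype=float)
features = np.eye(5)
importance = np.array([0.9, 0.8, 0.1, 0.7, 0.6])
adjacency = hybrid_adjacency(positions, features, radius=1.0, sim_threshold=0.9)
selected = greedy_mis(importance, adjacency, budget=3)
print(selected)
```

In this toy case token 1 is skipped despite its high score because it neighbors token 0, which shows how the independent-set constraint trades a little importance for coverage of distinct regions.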

### [220] [Sign Language Recognition using Parallel Bidirectional Reservoir Computing](https://arxiv.org/abs/2512.19451) *Nitin Kumar Singh,Arie Rachmad Syulistyo,Yuichiro Tanaka,Hakaru Tamukoh* Main category: cs.CV TL;DR: Proposes a lightweight sign language recognition system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe for efficient, real-time recognition.

Details

Motivation: Deep learning models consume substantial computational resources and are hard to deploy on edge devices, so a lightweight and efficient alternative is needed. Method: MediaPipe extracts hand joint coordinates, which are fed into a parallel bidirectional reservoir computing (PBRC) architecture built from two echo state network (ESN) based bidirectional modules for temporal feature learning and classification. Result: The system reaches 60.85% top-1 accuracy on the WLASL dataset with a training time of only 18.67 seconds, far faster than deep learning methods such as Bi-GRU. Conclusion: The method offers a lightweight, low-cost, and efficient solution for real-time sign language recognition on edge devices. Abstract: Sign language recognition (SLR) facilitates communication between deaf and hearing communities. Deep learning based SLR models are commonly used but require extensive computational resources, making them unsuitable for deployment on edge devices. To address these limitations, we propose a lightweight SLR system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe enables real-time hand tracking and precise extraction of hand joint coordinates, which serve as input features for the PBRC architecture. The proposed PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing (BRC) modules arranged in parallel to capture temporal dependencies, thereby creating a rich feature representation for classification. We trained our PBRC-based SLR system on the Word-Level American Sign Language (WLASL) video dataset, achieving top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Training time was significantly reduced to 18.67 seconds due to the intrinsic properties of reservoir computing, compared to over 55 minutes for deep learning based methods such as Bi-GRU. This approach offers a lightweight, cost-effective solution for real-time SLR on edge devices.
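
To make the reservoir idea concrete, here is a minimal sketch of one bidirectional echo state module: a fixed random reservoir is run over the sequence forwards and backwards, and the two final states are concatenated as features. Everything here is an assumption for illustration (the names `make_reservoir`, `run_reservoir`, `bidirectional_features`, the leak rate, and the weight scaling); the paper's parallel arrangement of two such modules and its trained readout are not reproduced.

```python
import math
import random

def make_reservoir(n_in, n_res, scale=0.5, seed=0):
    """Fixed random input and recurrent weights; they are never trained."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_res)]
    # Small recurrent weights keep the state update stable in this toy setup.
    w = [[rng.uniform(-scale, scale) / n_res for _ in range(n_res)]
         for _ in range(n_res)]
    return w_in, w

def run_reservoir(seq, w_in, w, leak=0.3):
    """Leaky-integrator ESN update; returns the final reservoir state."""
    x = [0.0] * len(w)
    for u in seq:
        pre = [sum(a * b for a, b in zip(w_in[k], u)) +
               sum(a * b for a, b in zip(w[k], x))
               for k in range(len(w))]
        x = [(1 - leak) * xk + leak * math.tanh(p) for xk, p in zip(x, pre)]
    return x

def bidirectional_features(seq, w_in, w):
    """Concatenate the final states of a forward and a backward pass."""
    return run_reservoir(seq, w_in, w) + run_reservoir(seq[::-1], w_in, w)
```

In reservoir computing only a linear readout on top of such features is fitted (commonly with ridge regression), which is consistent with the short training time the abstract reports; that readout is omitted here.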

### [221] [Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation](https://arxiv.org/abs/2512.19479) *Guoli Jia,Junyao Hu,Xinwei Long,Kai Tian,Kaiyan Zhang,KaiKai Zhao,Ning Ding,Bowen Zhou* Main category: cs.CV TL;DR: Proposes Emotion-Director, a cross-modal collaboration framework for emotion-oriented image generation that combines visual and textual prompts to overcome the reduction of emotion to semantics in existing methods.

Details

Motivation: Existing emotion-oriented image generation methods suffer from an "affective shortcut": they simply equate emotion with semantics and ignore the essential difference between the two. Method: Emotion-Director comprises the MC-Diffusion model (which fuses visual and textual prompts) and the MC-Agent system (in which multiple agents rewrite textual prompts to express emotion), and improves DPO optimization by introducing a negative visual prompt. Result: Experiments show the method outperforms existing approaches at generating images with a specific emotional tone and more accurately distinguishes different emotions under the same semantics. Conclusion: Emotion-Director improves the quality of emotion-oriented image generation and the subtlety of emotional expression, advancing generative models that decouple emotion from semantics. Abstract: Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion-oriented image generation has attracted increasing attention. However, current emotion-oriented methods suffer from an affective shortcut, where emotions are approximated to semantics. As evidenced by two decades of research, emotion is not equivalent to semantics. To this end, we propose Emotion-Director, a cross-modal collaboration framework consisting of two modules. First, we propose a cross-Modal Collaborative diffusion model, abbreviated as MC-Diffusion. MC-Diffusion integrates visual prompts with textual prompts for guidance, enabling the generation of emotion-oriented images beyond semantics. Further, we improve the DPO optimization by a negative visual prompt, enhancing the model's sensitivity to different emotions under the same semantics. Second, we propose MC-Agent, a cross-Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template-like rewrites, MC-Agent employs multi-agents to simulate human subjectivity toward emotions, and adopts a chain-of-concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion-Director in emotion-oriented image generation.

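### [222] [Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration](https://arxiv.org/abs/2512.19486) *Shaochen Bi,Yuting He,Weiming Wang,Hao Chen* Main category: cs.CV TL;DR: Proposes the Dynamic Stream Network (DySNet) to address the combinatorial explosion problem in deformable medical image registration by dynamically adjusting receptive fields and weights to suppress interfering features and model potential relationships.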

Details

Motivation: Dual inputs make the combinatorial relationships between features grow exponentially, which increases interfering features and degrades medical image registration performance. Method: The AdSB module dynamically adjusts the shape of the receptive field, and the DySA mechanism generates dynamic weights, strengthening the modeling of highly correlated feature relationships. Result: Across multiple experiments, DySNet consistently outperforms state-of-the-art DMIR methods and shows strong generalization. Conclusion: By dynamically adjusting receptive fields and weights, DySNet mitigates the combinatorial explosion problem and offers an efficient solution for dual-input medical image registration. Abstract: The combinatorial explosion problem caused by dual inputs presents a critical challenge in Deformable Medical Image Registration (DMIR). Since DMIR processes two images simultaneously as input, the combinatorial relationships between features grow exponentially, so the model ends up considering more interfering features during feature modeling. Introducing dynamics in the receptive fields and weights of the network enables the model to eliminate interfering feature combinations and model the potential feature combination relationships. In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted. This ultimately enables the model to ignore interfering feature combinations and model the potential feature relationships. DySNet has two key innovations: 1) The Adaptive Stream Basin (AdSB) module dynamically adjusts the shape of the receptive field, thereby enabling the model to focus on the feature relationships with greater correlation. 2) The Dynamic Stream Attention (DySA) mechanism generates dynamic weights to search for more valuable feature relationships. Extensive experiments have shown that DySNet consistently outperforms the most advanced DMIR methods, highlighting its outstanding generalization ability. Our code will be released on the website: https://github.com/ShaochenBi/DySNet.

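### [223] [FusionNet: Physics-Aware Representation Learning for Multi-Spectral and Thermal Data via Trainable Signal-Processing Priors](https://arxiv.org/abs/2512.19504) *Georgios Voulgaris* Main category: cs.CV TL;DR: Proposes FusionNet, a physics-aware multi-spectral representation learning framework that combines SWIR and TIR data to model stable signatures of long-term physical processes, improving robustness and generalization across spectral conditions.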

Details

Motivation: The inductive biases of existing deep learning models on multi-modal visual signals are mismatched with the physical imaging process; in particular, they struggle to capture indirect environmental changes caused by sustained heat emissions, which makes performance brittle under cross-spectral and real-world conditions. Method: A geological Short Wave Infrared (SWIR) ratio, a physics-aware feature sensitive to soil property changes, is combined with Thermal Infrared (TIR) data through the intermediate fusion architecture FusionNet; the network incorporates trainable differential signal-processing priors, mixed pooling strategies, and wider receptive fields. Result: Ablations show that every component improves performance; DGCNN reaches 88.7% accuracy on the SWIR ratio and FusionNet reaches 90.6%, outperforming existing methods across five spectral configurations; transfer learning experiments show that ImageNet pretraining degrades TIR performance. Conclusion: Combining physics-aware feature selection with principled deep learning architectures yields robust, generalizable representations, and first-principles signal modeling helps multi-spectral learning under challenging conditions. Abstract: Modern deep learning models operating on multi-modal visual signals often rely on inductive biases that are poorly aligned with the physical processes governing signal formation, leading to brittle performance under cross-spectral and real-world conditions. In particular, approaches that prioritise direct thermal cues struggle to capture indirect yet persistent environmental alterations induced by sustained heat emissions. This work introduces a physics-aware representation learning framework that leverages multi-spectral information to model stable signatures of long-term physical processes. Specifically, a geological Short Wave Infrared (SWIR) ratio sensitive to soil property changes is integrated with Thermal Infrared (TIR) data through an intermediate fusion architecture, instantiated as FusionNet. The proposed backbone embeds trainable differential signal-processing priors within convolutional layers, combines mixed pooling strategies, and employs wider receptive fields to enhance robustness across spectral modalities. Systematic ablations show that each architectural component contributes to performance gains, with DGCNN achieving 88.7% accuracy on the SWIR ratio and FusionNet reaching 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments further show that ImageNet pretraining degrades TIR performance, highlighting the importance of modality-aware training for cross-spectral learning. 
Evaluated on real-world data, the results demonstrate that combining physics-aware feature selection with principled deep learning architectures yields robust and generalisable representations, illustrating how first-principles signal modelling can improve multi-spectral learning under challenging conditions.

### [224] [Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation](https://arxiv.org/abs/2512.19512) *Ziyang Song,Zelin Zang,Zuyao Chen,Xusheng Liang,Dong Yi,Jinlin Wu,Hongbin Liu,Jiebo Luo* Main category: cs.CV TL;DR: Proposes a training method for multimodal large language models on anatomical understanding of medical images, improving reasoning through anatomical similarity curriculum learning and group diversity question augmentation.

Details

Motivation: Supervised fine-tuning has limited effect when high-quality expert annotations are scarce, and the existing reinforcement learning method GRPO suffers from insufficient knowledge sharing and a single reasoning path in anatomy recognition. Method: Two new methods are proposed: 1) Anatomical Similarity Curriculum Learning, which increases question difficulty step by step based on the similarity of answer choices; 2) Group Diversity Question Augmentation, which expands the search space for difficult questions. Result: Experiments on the SGG-VQA and OmniMedVQA benchmarks show that the method significantly improves the medical reasoning ability of MLLMs. Conclusion: The proposed methods address knowledge sharing and reasoning diversity in anatomical understanding of medical images and significantly improve model performance. Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. Anatomy understanding tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO's reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning by controlling question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation referred to as Group Diversity Question Augmentation to expand the model's search space for difficult queries, mitigating the tendency to produce uniform responses. 
Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show that our method achieves significant improvements on both, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code is available at https://github.com/tomato996/Anatomy-R1

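### [225] [A Convolutional Neural Deferred Shader for Physics Based Rendering](https://arxiv.org/abs/2512.19522) *Zhuo He,Yingdong Ru,Qianying Liu,Paul Henderson,Nicolas Pugeault* Main category: cs.CV TL;DR: Proposes pbnds+, a new physics-based neural deferred shading pipeline that uses convolutional neural networks to reduce parameter count and improve shading and relighting performance, and introduces energy regularization to improve reflection under dark illumination; experiments show it outperforms classical methods, a recent neural shading model, and a diffusion-based method.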

Details

Motivation: Existing MLP-based neural rendering methods have many parameters and high computational demands, and with unbalanced data they handle unusual illumination (such as dark scenes) poorly, limiting high-quality relighting of real objects. Method: pbnds+ replaces the MLP with convolutional neural networks to reduce parameters and introduces energy regularization to constrain the model's reflection behavior under dark illumination, improving rendering efficiency and generalization. Result: Experiments show pbnds+ outperforms classical methods, a recent neural shading model, and a diffusion-based method on shading and relighting, with higher efficiency and better visual quality. Conclusion: Through architectural optimization and regularization, pbnds+ addresses the high parameter count and dark-illumination modeling difficulties of neural rendering, offering a practical route to efficient, high-quality relighting of real objects. Abstract: Recent advances in neural rendering have achieved impressive results on photorealistic shading and relighting, by using a multilayer perceptron (MLP) as a regression model to learn the rendering equation from a real-world dataset. Such methods show promise for photorealistically relighting real-world objects, which is difficult for classical rendering, as material ground truth is not easily obtained. However, significant challenges remain: the dense connections in MLPs result in a large number of parameters, which requires high computation resources, complicates training, and reduces performance during rendering. Data-driven approaches require large amounts of training data for generalization; unbalanced data might bias the model to ignore unusual illumination conditions, e.g. dark scenes. This paper introduces pbnds+, a novel physics-based neural deferred shading pipeline that uses convolutional neural networks to decrease the number of parameters and improve performance in shading and relighting tasks. Energy regularization is also proposed to restrict the model's reflection under dark illumination. Extensive experiments demonstrate that our approach outperforms classical baselines, a state-of-the-art neural shading model, and a diffusion-based method.

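### [226] [Multi-Modal Soccer Scene Analysis with Masked Pre-Training](https://arxiv.org/abs/2512.19528) *Marc Peral,Guillem Capellera,Luis Ferraz,Antonio Rubio,Antonio Agudo* Main category: cs.CV TL;DR: Proposes a multi-modal architecture for analyzing soccer scenes from tactical camera footage, targeting ball trajectory inference, ball state classification, and ball possessor identification; it fuses multiple input modalities and makes robust predictions without direct ball position information.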

Details

Motivation: Existing methods rely on accurate ball tracking or handcrafted heuristics and perform poorly when the ball is occluded or the data is noisy, so a more robust multi-modal approach is needed for real match conditions. Method: Three modalities (player trajectories, player types, and player image crops) are combined; a cascade of sociotemporal transformer blocks processes the spatio-temporal dynamics; and the CropDrop pre-training strategy applies modality-specific masking to reduce over-reliance on image features. Result: Validated on a large-scale dataset from real top-league matches, the method substantially outperforms state-of-the-art baselines on all three tasks. Conclusion: A transformer architecture that combines structured and visual cues, together with a well-designed masked pre-training strategy, improves the performance and robustness of multi-modal soccer scene understanding. Abstract: In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset, with substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.

### [227] [SlicerOrbitSurgerySim: An Open-Source Platform for Virtual Registration and Quantitative Comparison of Preformed Orbital Plates](https://arxiv.org/abs/2512.19534) *Chi Zhang,Braedon Gunn,Andrew M. Read-Fuller* Main category: cs.CV TL;DR: SlicerOrbitSurgerySim is an open-source 3D Slicer extension for quantitatively evaluating and comparing the fit of preformed orbital plates, with the aim of improving preoperative planning and reducing intraoperative plate modification.

Details

Motivation: There are currently no publicly available tools or standardized metrics for quantitatively comparing the fit of preformed orbital plates across vendors, sizes, and patient anatomy. Method: SlicerOrbitSurgerySim, an open-source extension integrated into the 3D Slicer platform, enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment, and produces reproducible quantitative plate-to-orbit distance metrics and visualization tools. Result: The software supports patient-specific surgical planning and population-level statistical analysis of plate adaptability, enabling objective comparison of implant designs and placement strategies. Conclusion: SlicerOrbitSurgerySim can help improve preoperative decision-making, reduce intraoperative adjustment, and promote collaborative research and surgical education. Abstract: Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools that support both patient-specific planning and population-level statistical analysis of plate adaptability. By facilitating objective comparison of implant designs and placement strategies, this tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education. Pilot studies, sample datasets, and detailed tutorials are provided to support testing, transparency, and reproducibility.

### [228] [CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion](https://arxiv.org/abs/2512.19535) *Moritz Böhle,Amélie Royer,Juliette Marrie,Edouard Grave,Patrick Pérez* Main category: cs.CV TL;DR: Proposes CASA, a new paradigm that introduces self-attention into cross-attention, substantially narrowing the performance gap with full token insertion while remaining efficient.

Details

Motivation: Existing vision-language models incur excessive compute and memory costs on high-resolution images, long conversations, or streaming video, while cross-attention-based models are efficient but perform worse, especially on fine-grained visual tasks. Method: CASA (Cross-Attention via Self-Attention) enables local text-to-text interaction within the cross-attention layers to improve model performance. Result: CASA substantially narrows the gap with full token insertion on common image understanding benchmarks while keeping the same scalability as cross-attention models on long-context multimodal tasks such as streaming video captioning. Conclusion: CASA is a simple and efficient paradigm for vision-language modeling that balances performance and efficiency and suits long-context and high-resolution settings. Abstract: Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .

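### [229] [StoryMem: Multi-shot Long Video Storytelling with Memory](https://arxiv.org/abs/2512.19539) *Kaiwen Zhang,Liming Jiang,Angtian Wang,Jacob Zhiyuan Fang,Tiancheng Zhi,Qing Yan,Hao Kang,Xin Lu,Xingang Pan* Main category: cs.CV TL;DR: Proposes StoryMem, a multi-shot video generation method based on explicit visual memory; its Memory-to-Video (M2V) design turns pre-trained single-shot diffusion models into long-form storytellers and markedly improves cross-shot consistency and narrative coherence.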

Details

Motivation: Existing video generation models struggle to maintain long-range visual consistency and narrative coherence, limiting their use in multi-shot visual storytelling; inspired by human memory, a generation framework with dynamically updated memory is needed. Method: The Memory-to-Video (M2V) architecture maintains a compact, dynamically updated memory bank of keyframes; memory is injected into a single-shot diffusion model via latent concatenation and negative RoPE shifts using only LoRA fine-tuning; semantic keyframe selection and aesthetic filtering improve memory quality. Result: Experiments on the newly proposed ST-Bench benchmark show that StoryMem outperforms existing methods in cross-shot consistency, aesthetic quality, and prompt adherence, and can generate coherent, high-quality videos up to one minute long. Conclusion: By introducing an explicit visual memory mechanism, StoryMem addresses the consistency problem in long-range multi-shot video generation and offers a new paradigm for high-quality visual storytelling. Abstract: Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.

### [230] [ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars](https://arxiv.org/abs/2512.19546) *Ziqiao Peng,Yi Chen,Yifeng Ma,Guozhen Zhang,Zhiyao Sun,Zixiang Zhou,Youliang Zhang,Zhengguang Zhou,Zhaoxin Fan,Hongyan Liu,Yuan Zhou,Qinglin Lu,Jun He* Main category: cs.CV TL;DR: Proposes ActAvatar, a new framework that uses textual guidance for phase-level precise control of talking-avatar actions, addressing the weak text-following, poor temporal alignment, and reliance on extra control signals of existing methods.

Details

Motivation: Existing avatar generation methods fall short in temporal-semantic alignment of actions with text and audio, and often depend on extra control signals such as pose skeletons, limiting flexibility and practicality. Method: The ActAvatar framework has three core innovations: 1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally anchored phase blocks; 2) Progressive Audio-Visual Alignment, which lets different network layers emphasize different modalities; 3) a two-stage training strategy that first establishes audio-visual correspondence and then injects action control. Result: Experiments show ActAvatar significantly outperforms state-of-the-art methods in action control accuracy and visual quality, with stronger text following and more precise temporal alignment. Conclusion: ActAvatar achieves high-precision, text-guided action generation without extra control signals, advancing semantic and temporal alignment in avatar generation. Abstract: Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process: early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model's text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.

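### [231] [BabyFlow: 3D modeling of realistic and expressive infant faces](https://arxiv.org/abs/2512.19560) *Antonia Alomar,Mireia Masias,Marius George Linguraru,Federico M. Sukno,Gemma Piella* Main category: cs.CV TL;DR: Proposes BabyFlow, a generative AI model that uses normalizing flows to separate identity from expression in infant faces, enabling independent control of both. Cross-age expression transfer adapts expressions from adult 3D scans to infant data to increase the expressive diversity of the dataset. The method improves 3D reconstruction accuracy, supports synthesis and modification of infant expressions, and, combined with diffusion models, generates high-fidelity 2D infant images with stable 3D geometry, providing a useful tool for facial analysis of early developmental disorders.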

Details

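Motivation: Infant facial data is scarce and often shows spontaneous expressions, and conventional modeling methods struggle to capture its complex non-linear variation, limiting early analysis of developmental disorders; a modeling method is needed that disentangles identity from expression, works with small samples, and handles natural expressions. Method: BabyFlow uses normalizing flows to learn a disentangled probabilistic representation of infant facial identity and expression; cross-age expression transfer enriches infant data with expressions from adult 3D faces; diffusion models are used to generate high-resolution 2D images with consistent 3D structure. Result: BabyFlow markedly improves the accuracy of infant 3D face reconstruction, especially in highly expressive regions such as the mouth, eyes, and nose; it supports expression editing with identity preservation and generates high-quality, geometrically consistent 2D infant images, supporting data augmentation and early facial analysis. Conclusion: Through disentangled representation learning and cross-age expression transfer, BabyFlow offers an efficient solution for infant face modeling with small samples, advancing early screening for infant developmental disorders based on facial morphology. Abstract: Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.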

### [232] [No Data? No Problem: Robust Vision-Tabular Learning with Missing Values](https://arxiv.org/abs/2512.19602) *Marta Hasny,Laura Daza,Keno Bressem,Maxime Di Folco,Julia Schnabel* Main category: cs.CV TL;DR: Proposes the RoVTL framework, which uses contrastive pretraining and gated cross-attention fine-tuning for robust joint vision-tabular learning at any level of tabular missingness.

Details

Motivation: Existing methods use complete tabular data during training, but in practice tabular data is often missing, which degrades performance, so more robust methods are needed. Method: RoVTL has two stages: 1) contrastive pretraining that introduces tabular attribute missingness as data augmentation; 2) multimodal fusion with a gated cross-attention module, fine-tuned with a new Tabular More vs. Fewer loss combined with disentangled gradient learning. Result: On cardiac MRI data from the UK Biobank, RoVTL is more robust to missing tabular data than prior methods, and it generalizes to an external cardiac MRI dataset and to a car advertisements dataset in the natural image domain. Conclusion: RoVTL handles any level of tabular data availability from 0% to 100% and delivers consistent, robust multimodal learning performance across datasets. Abstract: Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as demographics or clinical measurements. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that can leverage all the tabular data during training while remaining robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning using a gated cross-attention module for multimodal fusion. During fine-tuning, we employ a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with disentangled gradient learning, this enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The code is available at https://github.com/marteczkah/RoVTL.
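
The idea of using tabular missingness as an augmentation can be illustrated with a small sketch. This is not taken from the RoVTL code base: the function name `mask_tabular`, the zero fill value, and the per-attribute independent dropping are assumptions chosen for illustration.

```python
import random

def mask_tabular(row, missing_rate, rng, fill=0.0):
    """Randomly hide attributes of one tabular row.

    Returns the masked values and a 0/1 availability mask, so a model can
    tell a hidden attribute from a genuine `fill` value (assumed convention).
    """
    values, mask = [], []
    for v in row:
        keep = rng.random() >= missing_rate  # each attribute dropped independently
        values.append(v if keep else fill)
        mask.append(1 if keep else 0)
    return values, mask

# Two augmented views of the same row with different amounts of missingness.
rng = random.Random(0)
row = [63.0, 1.0, 27.4, 120.0]
view_more, mask_more = mask_tabular(row, 0.2, rng)
view_fewer, mask_fewer = mask_tabular(row, 0.8, rng)
```

Sampling the missing rate anywhere between 0 and 1 during pretraining exposes the model to the full range of availability levels that the abstract says RoVTL is designed to handle.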

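### [233] [MapTrace: Scalable Data Generation for Route Tracing on Maps](https://arxiv.org/abs/2512.19609) *Artemis Panagopoulou,Aveek Purohit,Achin Kulshrestha,Soroosh Yazdani,Mohit Goyal* Main category: cs.CV TL;DR: Proposes a scalable synthetic data generation pipeline to improve multimodal large language models on fine-grained spatial understanding tasks such as route tracing on maps; fine-tuning on 23k synthetic samples substantially improves success rate and path accuracy on MapBench.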

Details

Motivation: Existing MLLMs are limited in fine-grained spatial understanding (such as route tracing on maps), mainly because large-scale, pixel-accurate path annotations are lacking and real data is costly and difficult to collect. Method: A synthetic data generation pipeline uses synthetic map images and pixel-level parsing to automatically produce precise path annotations, yielding a fine-tuning dataset of 23k path samples across 4k maps that is used to fine-tune open-source and proprietary MLLMs. Result: On MapBench, fine-tuned models improve success rate by up to 6.4 percentage points and reduce path-tracing error (NDTW), showing that synthetic data strengthens fine-grained spatial reasoning. Conclusion: Synthetic supervision can explicitly teach multimodal large models fine-grained spatial reasoning that pretrained models lack, offering a practical and efficient way to improve their spatial understanding. Abstract: While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps, remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that fine-tuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.
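
Since the abstract reports NDTW as a path-tracing error, a short sketch of a dynamic-time-warping path distance may help. The exact normalization used by MapBench is not stated in the abstract, so dividing the DTW cost by the reference path length is an assumption, as are the function names `dtw` and `normalized_dtw`.

```python
import math

def dtw(ref, pred):
    """Classic dynamic time warping cost between two 2D point sequences."""
    n, m = len(ref), len(pred)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], pred[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def normalized_dtw(ref, pred):
    """DTW cost divided by the reference length (assumed normalization)."""
    return dtw(ref, pred) / len(ref)
```

Under this reading, a predicted route identical to the reference scores 0, and a route that runs parallel to the reference at a constant offset scores roughly that offset.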

### [234] [Generative diffusion models for agricultural AI: plant image generation, indoor-to-outdoor translation, and expert preference alignment](https://arxiv.org/abs/2512.19632) *Da Tan,Michael Beck,Christopher P. Bidinosti,Robert H. Gulden,Christopher J. Henry* Main category: cs.CV TL;DR: Proposes diffusion-based generative methods for synthesizing crop images, translating indoor images to outdoor ones, and fine-tuning with expert preference alignment, improving data efficiency and model performance in agricultural AI.

Details

Motivation: Agricultural AI depends on large amounts of high-quality plant image data, but field collection is costly, time-consuming, and seasonally constrained, so effective data augmentation is needed. Method: A Stable Diffusion model is fine-tuned on captioned indoor and outdoor plant images to generate text-conditioned images of canola and soybean; DreamBooth and image-guided diffusion perform indoor-to-outdoor translation; a reward model built on expert scores drives preference-aligned, reward-weighted fine-tuning. Result: Synthetic images perform well on Inception Score, FID, and downstream phenotype classification; translated images improve YOLOv8 weed detection and classification; preference fine-tuning makes outputs more stable and better aligned with expert expectations. Conclusion: The approach offers an efficient, practical route to generative data augmentation for agricultural AI and eases the bottleneck of real data collection. Abstract: The success of agricultural artificial intelligence depends heavily on large, diverse, and high-quality plant image datasets, yet collecting such data in real field conditions is costly, labor-intensive, and seasonally constrained. This paper investigates diffusion-based generative modeling to address these challenges through plant image synthesis, indoor-to-outdoor translation, and expert-preference-aligned fine-tuning. First, a Stable Diffusion model is fine-tuned on captioned indoor and outdoor plant imagery to generate realistic, text-conditioned images of canola and soybean. Evaluation using Inception Score, Fréchet Inception Distance, and downstream phenotype classification shows that synthetic images effectively augment training data and improve accuracy. Second, we bridge the gap between high-resolution indoor datasets and limited outdoor imagery using DreamBooth-based text inversion and image-guided diffusion, generating translated images that enhance weed detection and classification with YOLOv8. Finally, a preference-guided fine-tuning framework trains a reward model on expert scores and applies reward-weighted updates to produce more stable and expert-aligned outputs. Together, these components demonstrate a practical pathway toward data-efficient generative pipelines for agricultural AI.

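### [235] [4D Gaussian Splatting as a Learned Dynamical System](https://arxiv.org/abs/2512.19648) *Arnold Caleb Asiimwe,Carl Vondrick* Main category: cs.CV TL;DR: Proposes EvoGS, which treats 4D Gaussian Splatting as a continuous-time dynamical system and models dynamic scenes with a learned neural dynamical field, supporting efficient learning from sparse temporal supervision, temporal extrapolation, and composable dynamics control.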

Details

Motivation: Existing deformation-based methods struggle with temporal extrapolation and efficient learning under sparse supervision, and do not model the underlying motion law. Method: 4D Gaussian Splatting is treated as a continuous dynamical system, with a neural dynamical field describing its state evolution so that Gaussian parameters are updated in continuous time. Result: EvoGS outperforms deformation-field baselines on dynamic scene benchmarks, with better motion coherence and temporal consistency, and supports extrapolation and localized dynamics editing. Conclusion: By modeling the scene as an evolving system that follows learned dynamics, EvoGS gives dynamic scene representations stronger temporal consistency and controllability. Abstract: We reinterpret 4D Gaussian Splatting as a continuous-time dynamical system, where scene motion arises from integrating a learned neural dynamical field rather than applying per-frame deformations. This formulation, which we call EvoGS, treats the Gaussian representation as an evolving physical system whose state evolves continuously under a learned motion law. This unlocks capabilities absent in deformation-based approaches: (1) sample-efficient learning from sparse temporal supervision by modeling the underlying motion law; (2) temporal extrapolation enabling forward and backward prediction beyond observed time ranges; and (3) compositional dynamics that allow localized dynamics injection for controllable scene synthesis. Experiments on dynamic scene benchmarks show that EvoGS achieves better motion coherence and temporal consistency compared to deformation-field baselines while maintaining real-time rendering.
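
The difference between integrating a motion law and applying per-frame deformations can be shown with a small numerical sketch. Here a fourth-order Runge-Kutta integrator advances a state under a velocity field, forwards or backwards in time. The analytic `rotation_field` is only a stand-in for a learned neural field, and all names (`rk4_step`, `integrate`, `rotation_field`) and the choice of RK4 are assumptions, not details from the paper.

```python
import math

def rk4_step(field, state, t, dt):
    """One classical Runge-Kutta step for d(state)/dt = field(state, t)."""
    k1 = field(state, t)
    k2 = field([s + 0.5 * dt * k for s, k in zip(state, k1)], t + 0.5 * dt)
    k3 = field([s + 0.5 * dt * k for s, k in zip(state, k2)], t + 0.5 * dt)
    k4 = field([s + dt * k for s, k in zip(state, k3)], t + dt)
    return [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def integrate(field, state, t0, t1, steps=100):
    """Integrate from t0 to t1; t1 < t0 gives backward prediction."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        state = rk4_step(field, state, t, dt)
        t += dt
    return state

def rotation_field(state, t):
    """Stand-in motion law: rigid rotation about the origin at 1 rad/s."""
    x, y = state
    return [-y, x]

# The same field answers queries before and after the starting time.
future = integrate(rotation_field, [1.0, 0.0], 0.0, math.pi / 2)
past = integrate(rotation_field, [1.0, 0.0], 0.0, -math.pi / 2)
```

Because the state at any time comes from the same field, querying a time outside the observed range needs no extra machinery, which is the extrapolation property the abstract highlights.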

### [236] [Over++: Generative Video Compositing for Layer Interaction Effects](https://arxiv.org/abs/2512.19661) *Luchao Qi,Jiaye Wu,Jun Myeong Choi,Cary Phillips,Roni Sengupta,Dan B Goldman* Main category: cs.CV TL;DR: Introduces the augmented compositing task and the Over++ framework, which generates realistic, semi-transparent environmental effects from text prompts without altering the original video.

Details

Motivation: Existing video generative models struggle to add environmental interaction effects such as shadows and reflections while preserving the input video, and current video inpainting methods are either costly or produce implausible results. Method: Over++ is a video effect generation framework; the authors build a paired effect dataset and introduce an unpaired augmentation strategy, and the method supports optional mask control and keyframe guidance without dense annotations. Result: Despite limited training data, Over++ generates diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation. Conclusion: Over++ addresses the challenge of generating environmental interaction effects in professional video compositing, enabling high-quality, text-driven augmented compositing. Abstract: In professional video compositing workflows, artists must manually create environmental interactions (such as shadows, reflections, dust, and splashes) between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.

### [237] [Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis](https://arxiv.org/abs/2512.19663) *Argha Kamal Samanta,Harshika Goyal,Vasudha Joshi,Tushar Mungle,Pabitra Mitra* Main category: cs.CV TL;DR: Proposes a knowledge-enhanced multimodal joint embedding framework that fuses fundus images, clinical text, and structured data, substantially improving cross-modal retrieval and diagnosis for diabetic retinopathy.

Details

Motivation: General-purpose vision-language models such as CLIP perform poorly on cross-modal retrieval in medicine, particularly for ophthalmological images, and effective methods for medical image-text alignment are lacking. Method: A Vision Transformer, Bio-ClinicalBERT, and an MLP encode fundus images, clinical text, and structured data respectively; a joint transformer with modality-specific embeddings fuses them, and the model is trained on multiple tasks with contrastive, reconstruction, and classification losses. Result: On BRSET, text-to-image Recall@1 reaches 99.94% (versus 1.29% for fine-tuned CLIP), with grading accuracy of 97.05% for SDRG and 97.97% for ICDR; zero-shot Recall@1 on the unseen DeepEyeNet dataset is 93.95% (versus 0.22% for fine-tuned CLIP). Conclusion: The multimodal framework captures cross-modal relationships within the medical domain, clearly outperforms existing methods in both retrieval and diagnostic accuracy, and generalizes well. Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR.
Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
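
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how Recall@K is commonly computed for text-to-image retrieval. It assumes a square similarity matrix in which text query `i` is paired with image `i`; that pairing convention, the function name `recall_at_k`, and the toy scores are illustrative assumptions, not details taken from the paper.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose paired item ranks in the top k.

    similarity[i][j] is the score between text query i and image j;
    the correct image for query i is assumed to be image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of candidate images, highest score first.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: queries 0 and 1 retrieve their own image first,
# query 2 ranks image 0 above its own image.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.1, 0.3],
]
r1 = recall_at_k(toy_similarity, k=1)
r2 = recall_at_k(toy_similarity, k=2)
```

In this toy case two of three queries succeed at K=1, and all three succeed at K=2, which is why Recall@1 is the strictest of the usual retrieval cut-offs.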

### [238] [Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning](https://arxiv.org/abs/2512.19676) *Mojtaba Safari,Shansong Wang,Vanessa L Wildman,Mingzhe Hu,Zach Eidex,Chih-Wei Chang,Erik H Middlebrooks,Richard L. J Qiu,Pretesh Patel,Ashesh B. Jania,Hui Mao,Zhen Tian,Xiaofeng Yang* Main category: cs.CV TL;DR: Proposes a new MRI super-resolution framework that combines multi-head selective state-space models (MHSSM) with a lightweight channel MLP, markedly improving efficiency and accuracy for clinical use.

Details

Motivation: Existing deep learning methods for MRI super-resolution trade fidelity against computational efficiency, which limits their clinical use. The goal is an efficient, accurate framework that preserves anatomical detail and eases clinical integration. Method: The method combines multi-head selective state-space models (MHSSM) with a lightweight channel MLP and uses 2D patch extraction with a hybrid scanning strategy to capture long-range dependencies; each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Result: The model gives the best performance on both 7T brain and 1.5T prostate MRI: SSIM=0.951 and PSNR=26.90 dB on 7T brain data, and SSIM=0.770 and PSNR=27.15 dB on prostate data, while using only 0.9M parameters and 57 GFLOPs, which is 99.8% fewer parameters and 97.5% less computation than Res-SRDiff. Conclusion: The framework keeps reconstruction quality high while sharply reducing computational cost, shows strong potential for clinical translation, and is broadly applicable to high-resolution MRI reconstruction. Abstract: Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951±0.021, PSNR=26.90±1.41 dB, LPIPS=0.076±0.022, GMSD=0.083±0.017, significantly outperforming all baselines (p<0.001). For prostate data: SSIM=0.770±0.049, PSNR=27.15±2.19 dB, LPIPS=0.190±0.095, GMSD=0.087±0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency.
Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.
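
As a reminder of what the PSNR figures above measure, here is a minimal sketch of the standard definition, PSNR = 10·log10(MAX² / MSE). It assumes intensities normalised to the range [0, 1] (so `max_value=1.0`) and images given as nested lists; the paper's actual normalisation and evaluation code are not stated in the abstract, and the function name `psnr` is illustrative.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    ref = [v for row in reference for v in row]
    est = [v for row in estimate for v in row]
    if len(ref) != len(est):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 example: every pixel is off by 0.1, so the MSE is 0.01.
high_res = [[0.0, 1.0], [1.0, 0.0]]
restored = [[0.1, 0.9], [0.9, 0.1]]
score = psnr(high_res, restored)
```

A uniform error of 0.1 on a unit intensity range gives about 20 dB, which helps put values around 27 dB into perspective; PSNR is logarithmic, so each additional 10 dB corresponds to a tenfold reduction in mean squared error.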

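### [239] [WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion](https://arxiv.org/abs/2512.19678) *Hanyang Kong,Xingyi Yang,Xiaoxu Zheng,Xinchao Wang* Main category: cs.CV TL;DR: WorldWarp is a video generation framework that couples a 3D structural anchor with a 2D generative refiner: it builds an online 3D geometric cache with Gaussian Splatting and uses a spatio-temporal diffusion model to fill occluded regions, producing geometrically consistent, high-quality long-range video.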

Details

Motivation: Existing generative models work well in a camera-conditioned latent space but struggle with the geometric inconsistency caused by occlusions and complex camera trajectories, so the gap between 3D geometry and 2D generation needs to be bridged. Method: Gaussian Splatting builds an online 3D geometric cache that serves as a structural anchor, and historical content is explicitly warped into novel views to maintain geometric consistency; a Spatio-Temporal Diffusion (ST-Diff) model with a spatio-temporally varying noise schedule applies full noise to blank regions to generate them and partial noise to warped regions to refine them, while the 3D cache is updated dynamically. Result: WorldWarp clearly outperforms existing methods under occlusion and complex camera motion, achieving state-of-the-art video fidelity and geometric consistency across video chunks. Conclusion: By coupling 3D structural guidance with 2D diffusion-based generation, WorldWarp resolves the tension between geometric consistency and texture quality in long-range video generation and offers a new architectural paradigm for video generation. Abstract: Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: https://hyokong.github.io/worldwarp-page/.
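
The "fill-and-revise" noise schedule described above can be pictured with a toy, single-frame sketch. Everything here is an illustrative assumption rather than the ST-Diff formulation: the linear blend `(1 - s) * x + s * noise`, the partial level of 0.4, and the names `noise_level_map` and `apply_noise` are hypothetical. The sketch only shows the core idea that pixels covered by warped content keep part of their signal, while holes are replaced entirely by noise.

```python
import random


def noise_level_map(valid_mask, partial=0.4):
    """Per-pixel noise strength: `partial` where warped content exists, 1.0 in holes."""
    return [[partial if valid else 1.0 for valid in row] for row in valid_mask]


def apply_noise(frame, levels, rng):
    """Blend each pixel with Gaussian noise according to its noise level.

    A level of 1.0 discards the pixel value completely (pure noise, to be
    generated from scratch); a lower level keeps part of the warped content
    so the denoiser only has to refine it.
    """
    return [
        [(1.0 - s) * x + s * rng.gauss(0.0, 1.0) for x, s in zip(frame_row, level_row)]
        for frame_row, level_row in zip(frame, levels)
    ]


# True marks pixels covered by warped history, False marks occlusion holes.
mask = [[True, False], [False, True]]
warped_frame = [[0.5, 0.0], [0.0, 0.8]]
levels = noise_level_map(mask, partial=0.4)
noisy_input = apply_noise(warped_frame, levels, random.Random(0))
```

In a real pipeline the mask would come from rendering the 3D cache into the new view, and the noise levels would map to diffusion timesteps rather than a linear blend.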

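### [240] [VA-$\pi$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation](https://arxiv.org/abs/2512.19680) *Xinyao Liao,Qiyuan He,Kai Xu,Xiaoye Qu,Yicong Li,Wei Wei,Angela Yao* Main category: cs.CV TL;DR: VA-$\pi$ is a lightweight post-training framework that optimizes autoregressive visual generation models directly in pixel space through variational optimization and a reinforcement-based alignment strategy, markedly improving image quality without retraining the tokenizer or using an external reward model.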

Details

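Motivation: Existing autoregressive visual generators and their tokenizers have mismatched objectives: the tokenizer is trained to reconstruct clean images from ground-truth tokens, while the generator only optimizes token likelihood, so generated token sequences may decode into low-quality images. This lack of pixel-space supervision motivates a new alignment method. Method: VA-$\pi$ formulates generator-tokenizer alignment as a variational optimization problem and derives an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. It adopts a reinforcement-learning-based alignment strategy that treats the autoregressive generator as a policy and uses pixel-space reconstruction quality as the intrinsic reward, measured by how well the predicted token sequence reconstructs the image under teacher forcing, which provides pixel-level guidance. The regularization term of the ELBO keeps the token distribution consistent. Result: With only 1% of ImageNet-1K data and 25 minutes of tuning, VA-$\pi$ reduces the FID of LlamaGen-XXL from 14.36 to 7.65 and raises IS from 86.55 to 116.70; on text-to-image generation, the GenEval score rises from 0.306 to 0.339 for LlamaGen and from 0.725 to 0.744 for Janus-Pro. Conclusion: VA-$\pi$ adapts existing autoregressive generators quickly and improves image quality without modifying the tokenizer or adding an external reward model, providing an efficient and practical post-training scheme for autoregressive visual generation. Abstract: Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$\pi$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\pi$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-$\pi$ enables rapid adaptation of existing AR generators, without requiring either tokenizer retraining or external reward models.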
With only 1% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both visual generation model (LlamaGen: from 0.306 to 0.339) and unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.

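### [241] [From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs](https://arxiv.org/abs/2512.19683) *Mingrui Wu,Zhaozhi Wang,Fangjinhua Wang,Jiaolong Yang,Marc Pollefeys,Tong Zhang* Main category: cs.CV TL;DR: Introduces a large-scale benchmark built from multi-sensor outdoor video to evaluate the spatial intelligence of multimodal large language models in open environments, revealing that current models rely on linguistic priors rather than visual reasoning.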

Details

Motivation: Existing spatial reasoning benchmarks are mostly limited to indoor or qualitative tasks and lack outdoor data with precise 3D ground truth, so they cannot properly assess the real spatial intelligence of multimodal large language models. Method: The authors build a dataset of pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS, use the precise 3D information to automatically generate spatial reasoning questions ranging from qualitative to quantitative, and analyze models with synthetic abnormal scenes and blinding tests. Result: Models that do well on structured indoor benchmarks degrade sharply in open-world scenes, and they rely more on linguistic priors than on visual input when reasoning. Conclusion: The benchmark provides a principled platform for diagnosing spatial-understanding deficits in multimodal large language models and for advancing physically grounded embodied intelligence. Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence, which is crucial for robust and grounded AI systems, remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum, from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.

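### [242] [Zero-shot Reconstruction of In-Scene Object Manipulation from Video](https://arxiv.org/abs/2512.19684) *Dixuan Lin,Tianyou Wang,Zhuoyang Pan,Yufu Wang,Lingjie Liu,Kostas Daniilidis* Main category: cs.CV TL;DR: Presents a new system for reconstructing in-scene object manipulation from monocular RGB video; data-driven foundation models initialize the key components, and a two-stage optimization recovers hand-object interaction motion that is consistent with the scene.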

Details

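Motivation: Reconstructing in-scene object manipulation from monocular RGB video is very challenging because scene reconstruction is ill-posed, hand-object depth is ambiguous, and interactions must be physically plausible. Existing methods usually work in hand-centric coordinates and ignore the overall scene, which limits metric accuracy and practical use. Method: Data-driven foundation models first initialize the core components (object mesh and poses, scene point cloud, and hand poses); a two-stage optimization then recovers the complete hand-object motion from grasping to interaction and keeps it consistent with the scene information in the input video. Result: The method reconstructs hand-object interaction from monocular RGB video more accurately and with better physical plausibility, improving metric accuracy and practical applicability in real scenes. Conclusion: The system is the first to reconstruct object manipulation within the full scene context and improves markedly on earlier hand-centric methods in both accuracy and practicality. Abstract: We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand-centric coordinates and ignore the scene, hindering metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion from grasping to interaction, which remains consistent with the scene information observed in the input video.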

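### [243] [Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models](https://arxiv.org/abs/2512.19686) *Zixuan Ye,Quande Liu,Cong Wei,Yuanxing Zhang,Xintao Wang,Pengfei Wan,Kun Gai,Wenhan Luo* Main category: cs.CV TL;DR: Proposes integrating visual context consistency into the reasoning process of unified models, using adaptive visual planning and iterative visual correction to better preserve key visual features in multi-modal generation.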

Details

Motivation: Existing Chain-of-Thought (CoT) methods for multi-modal generation focus mainly on consistency with the text prompt and neglect visual context consistency with the reference images, so key visual features are not maintained. Method: Adaptive visual planning generates a structured visual checklist, and iterative visual correction performs self-reflection and refines the result; the model is trained with supervised fine-tuning, and flow-GRPO with a customized visual checking reward further strengthens consistency. Result: Experiments show that the method outperforms both zero-shot unified models and models with text-based CoT on multi-modal generation, with higher visual context consistency. Conclusion: By modeling visual context consistency explicitly, the method improves unified models on multi-modal tasks such as multi-reference generation. Abstract: Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, it is observed that the current thinking process during generation mainly focuses on the text consistency with the text prompt, ignoring the **visual context consistency** with the visual reference images during multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in a failure to maintain key visual features (like human ID, object attribute, style). To this end, we integrate the visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating a structured visual checklist to identify the visual elements whose consistency must be kept, and 2) Iterative Visual Correction: performing self-reflection with the guidance of checklists and refining the generated result in an iterative manner. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking, conduct self-reflection and self-refinement, and use flow-GRPO to further enhance the visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.

### [244] [Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models](https://arxiv.org/abs/2512.19692) *Pablo Ruiz-Ponce,Sergio Escalera,José García-Rodríguez,Jiankang Deng,Rolandos Alexandros Potamias* Main category: cs.CV TL;DR: Presents Interact2Ar, the first end-to-end, text-conditioned autoregressive diffusion model for generating high-fidelity human-human interactions with full body and hand motion.

Details

Motivation: Existing methods ignore hand motion and struggle to model the reactive, adaptive nature of human-human interaction, which limits realism and expressivity. Method: Interact2Ar uses an autoregressive diffusion architecture, models hand motion through parallel branches, and introduces a novel memory mechanism that supports efficient interaction generation with large context windows. Result: The model generates high-quality full-body interactions and supports temporal motion composition, real-time adaptation to disturbances, and extension to multi-person scenarios; newly designed evaluators and metrics confirm its superior performance. Conclusion: Interact2Ar achieves state-of-the-art human interaction generation, markedly improving realism and flexibility and opening a new path toward synthesizing complex social interactions. Abstract: Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.

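### [245] [The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding](https://arxiv.org/abs/2512.19693) *Weichen Fan,Haiwen Diao,Quan Wang,Dahua Lin,Ziwei Liu* Main category: cs.CV TL;DR: Proposes the "Prism Hypothesis": deep representations of different modalities share a unified structure in the feature spectrum, with semantic encoders capturing low-frequency information and pixel encoders retaining high-frequency detail. Building on this, Unified Autoencoding (UAE) fuses the two through a frequency-band modulator and achieves state-of-the-art semantic and pixel-level reconstruction on ImageNet and MS-COCO.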

Details

Motivation: Existing models lack an effective mechanism for unifying semantic abstraction with pixel detail, and the spectral properties of encoders are poorly understood, which makes it hard to optimize multimodal representations jointly. Method: The authors analyze the spectral characteristics of a range of semantic and pixel encoders and propose the Prism Hypothesis; they then design Unified Autoencoding (UAE), which introduces a frequency-band modulator to separate and fuse low-frequency semantic information and high-frequency detail. Result: Experiments on ImageNet and MS-COCO validate UAE, which jointly optimizes semantic preservation and pixel fidelity and reaches state-of-the-art performance. Conclusion: An encoder's functional role is closely tied to its spectral characteristics; the Prism Hypothesis offers a unified view of multimodal representation learning, and UAE lets semantics and detail coexist in a shared latent space. Abstract: Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like a prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.

Table of Contents

cs.CL [Back]

[1] Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

[2] Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

[3] Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

[4] Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

[5] Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models

[6] KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

[7] Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

[8] ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India

[9] CoPE: A Small Language Model for Steerable and Scalable Content Labeling

[10] Narrative Consolidation: Formulating a New Task for Unifying Multi-Perspective Accounts

[11] Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

[12] Training LLMs with LogicReward for Faithful and Rigorous Reasoning

[13] GeoSense-AI: Fast Location Inference from Crisis Microblogs

[14] InstructNet: A Novel Approach for Multi-Label Instruction Classification through Advanced Deep Learning

[15] CTTA-T: Continual Test-Time Adaptation for Text Understanding via Teacher-Student with a Domain-aware and Generalized Teacher

[16] LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

[17] Towards Efficient Agents: A Co-Design of Inference Architecture and System

[18] LLM-based Few-Shot Early Rumor Detection with Imitation Agent

[19] DACE For Railway Acronym Disambiguation

[20] LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators

[21] SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

[22] AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

[23] An Agentic AI Framework for Training General Practitioner Student Skills

[24] Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling

[25] Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

[26] Teaching and Critiquing Conceptualization and Operationalization in NLP

[27] Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

[28] LLMs on Drugs: Language Models Are Few-Shot Consumers

[29] Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering

[30] From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation

[31] On Finding Inconsistencies in Documents

[32] A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

[33] LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

[34] Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital

[35] Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design

[36] MemEvolve: Meta-Evolution of Agent Memory Systems

[37] From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure

[38] From Word to World: Can Large Language Models be Implicit Text-based World Models?

[39] AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus

[40] MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models

[41] Toward Human-Centered AI-Assisted Terminology Work

[42] Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

[43] Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

[44] FASTRIC: Prompt Specification Language for Verifiable LLM Interactions

[45] Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework

[46] Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

[47] DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

[48] A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs

[49] Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

[50] SAP: Syntactic Attention Pruning for Transformer-based Language Models

[51] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

[52] QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

[53] From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs

[54] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation

[55] CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation

[56] Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation

[57] ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models

[58] Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics

[59] CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs

[60] HATS: High-Accuracy Triple-Set Watermarking for Large Language Models

[61] Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara

[62] CodeSimpleQA: Scaling Factuality in Code Large Language Models

[63] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments

[64] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation

[65] Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations

[66] A Large-Language-Model Framework for Automated Humanitarian Situation Reporting

[67] Event Extraction in Large Language Model

[68] Algerian Dialect

[69] Increasing the Thinking Budget is Not All You Need

[70] MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery

[71] Exploring the features used for summary evaluation by Human and GPT

[72] Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori

[73] Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting

[74] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

cs.CV [Back]

[75] A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes

[76] NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction

[77] SuperFlow: Training Flow Matching Models with RL on the Fly