cs.CL

[1] Dynamic Prompt Fusion for Multi-Task and Cross-Domain Adaptation in LLMs

Xin Hu,Yue Kang,Guanzi Yao,Tianze Kang,Mengjie Wang,Heyao Liu

Main category: cs.CL

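TL;DR: This paper proposes a unified multi-task learning framework with a dynamic prompt scheduling mechanism, using a prompt pool and a task-aware scheduling strategy to improve the generalization of large language models in multi-task and cross-domain settings.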

Details Motivation: Existing methods such as SPoT rely on fixed prompt templates and struggle to adapt to semantic differences across multi-task and cross-domain settings, which limits generalization. Method: Introduces a prompt pool and a task-aware scheduling strategy that dynamically combine and align prompts for different tasks; uses task embeddings and a gating mechanism to finely control prompt signals, and designs a joint multi-task learning objective with automatically learned scheduling weights. Result: Experiments show the method significantly improves performance on language understanding and knowledge reasoning tasks, is robust to changes in prompt temperature and number of tasks, and effectively mitigates task interference and negative transfer. Conclusion: The proposed dynamic prompt scheduling mechanism is applicable and effective for unified multi-task modeling and cross-domain adaptation. Abstract: This study addresses the generalization limitations commonly observed in large language models under multi-task and cross-domain settings. Unlike prior methods such as SPoT, which depends on fixed prompt templates, our study introduces a unified multi-task learning framework with a dynamic prompt scheduling mechanism. By introducing a prompt pool and a task-aware scheduling strategy, the method dynamically combines and aligns prompts for different tasks. This enhances the model's ability to capture semantic differences across tasks. During prompt fusion, the model uses task embeddings and a gating mechanism to finely control the prompt signals. This ensures alignment between prompt content and task-specific demands. At the same time, it builds flexible sharing pathways across tasks. In addition, the proposed optimization objective centers on joint multi-task learning. It incorporates an automatic learning strategy for scheduling weights, which effectively mitigates task interference and negative transfer. To evaluate the effectiveness of the method, a series of sensitivity experiments were conducted. These experiments examined the impact of prompt temperature parameters and task number variation. The results confirm the advantages of the proposed mechanism in maintaining model stability and enhancing transferability. Experimental findings show that the prompt scheduling method significantly improves performance on a range of language understanding and knowledge reasoning tasks. These results fully demonstrate its applicability and effectiveness in unified multi-task modeling and cross-domain adaptation.
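
The abstract for [1] describes fusing prompts from a pool with task embeddings and a gating mechanism, but it does not give the formulas. The sketch below is one possible reading, not the authors' implementation: it assumes the gate is a temperature-scaled softmax over dot products between a task embedding and per-prompt keys. The names `fuse_prompts`, `prompt_keys`, and `prompt_values` are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    # Numerically stable softmax; lower temperature gives sharper weights.
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_prompts(task_embedding, prompt_keys, prompt_values, temperature=1.0):
    """Return (gate_weights, fused_prompt) for one task.

    Assumption: each prompt in the pool has a key vector (matched against the
    task embedding) and a value vector (the prompt signal that gets mixed).
    """
    scores = [dot(task_embedding, key) for key in prompt_keys]
    weights = softmax(scores, temperature)
    dim = len(prompt_values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, prompt_values)) for i in range(dim)]
    return weights, fused

# Toy pool with two prompts; the task embedding matches the first key.
weights, fused = fuse_prompts([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                              [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
print(weights, fused)
```

In this toy setting, lowering `temperature` concentrates the gate on the best-matching prompt, which is one way a "prompt temperature" parameter could act as a sensitivity knob.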

[2] GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models

Yue Zhang,Jiaxin Zhang,Qiuyu Ren,Tahsin Saffat,Xiaoxuan Liu,Zitong Yang,Banghua Zhu,Yi Ma

Main category: cs.CL

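TL;DR: GAUSS is a benchmark for evaluating the mathematical abilities of large language models. It covers twelve core skill dimensions grouped into three domains (knowledge and understanding, problem solving and communication, and meta-skills and creativity) and builds interpretable ability profiles from fine-grained tasks.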

Details Motivation: Existing evaluation methods struggle to characterize a model's mathematical abilities comprehensively and lack fine-grained analysis of underlying skills. Method: Designs a structured benchmark covering twelve skill dimensions, categorizing problems by cognitive skill to isolate and evaluate specific abilities. Result: Produces skill profiles for GPT-5-thinking and o4-mini-high, revealing their strengths, weaknesses, and differences. Conclusion: Multidimensional, skill-based evaluation reflects a model's mathematical intelligence more accurately, and GAUSS provides an interpretable tool for model analysis. Abstract: We introduce GAUSS (General Assessment of Underlying Structured Skills in Mathematics), a benchmark that evaluates LLMs' mathematical abilities across twelve core skill dimensions, grouped into three domains: knowledge and understanding, problem solving and communication, and meta-skills and creativity. By categorizing problems according to cognitive skills and designing tasks that isolate specific abilities, GAUSS constructs comprehensive, fine-grained, and interpretable profiles of models' mathematical abilities. These profiles faithfully represent their underlying mathematical intelligence. To exemplify how to use the GAUSS benchmark, we have derived the skill profile of GPT-5-thinking, revealing its strengths and weaknesses as well as its differences relative to o4-mini-high, thereby underscoring the value of multidimensional, skill-based evaluation.

[3] Event Causality Identification with Synthetic Control

Haoyu Wang,Fengze Liu,Jiayao Zhang,Dan Roth,Kyle Richardson

Main category: cs.CL

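TL;DR: This paper proposes a new method for event causality identification based on the Rubin Causal Model. It estimates causal relations between events by generating a "synthetic control" that serves as a "twin" of the protagonist, and outperforms existing methods, including GPT-4, on the COPES-hard benchmark.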

Details Motivation: Traditional event causality identification methods rely on linguistic patterns and multi-hop inference, and are prone to false identifications caused by informal usage of causality and specious graphical inference; a more rigorous causal inference framework is needed. Method: Adopts the Rubin Causal Model, treating the earlier event as the treatment and the later event as the outcome; uses text embedding synthesis and inversion to generate a "synthetic twin" from historical data to simulate the effect of the intervention. Result: The method significantly outperforms traditional methods and GPT-4 on the COPES-hard benchmark and identifies causal relations between events more robustly. Conclusion: The synthetic-control-based causal inference framework offers a more reliable approach to event causality identification, overcomes the difficulty of finding real twins, and improves the accuracy of causal judgments. Abstract: Event causality identification (ECI), a process that extracts causal relations between events from text, is crucial for distinguishing causation from correlation. Traditional approaches to ECI have primarily utilized linguistic patterns and multi-hop relational inference, risking false causality identification due to informal usage of causality and specious graphical inference. In this paper, we adopt the Rubin Causal Model to identify event causality: given two temporally ordered events, we see the first event as the treatment and the second one as the observed outcome. Determining their causality involves manipulating the treatment and estimating the resultant change in the likelihood of the outcome. Given that it is only possible to implement manipulation conceptually in the text domain, as a work-around, we try to find a twin for the protagonist from existing corpora. This twin should have identical life experiences to the protagonist before the treatment but undergo an intervention of treatment. However, the practical difficulty of locating such a match limits its feasibility. Addressing this issue, we use the synthetic control method to generate such a "twin" from relevant historical data, leveraging text embedding synthesis and inversion techniques. This approach allows us to identify causal relations more robustly than previous methods, including GPT-4, which is demonstrated on a causality benchmark, COPES-hard.

[4] ZERA: Zero-init Instruction Evolving Refinement Agent - From Zero Instructions to Structured Prompts via Principle-based Optimization

Seungyoun Yi,Minsoo Khang,Sungrae Park

Main category: cs.CL

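TL;DR: Proposes ZERA, a framework that systematically optimizes both system and user prompts, automatically scoring and revising them against eight generalizable criteria. It significantly improves LLM performance across a variety of tasks while converging quickly at low cost.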

Details Motivation: Existing automatic prompt optimization methods usually focus only on user prompts, rely on unstructured feedback, and need large sample sizes and long iteration cycles, making them costly and brittle. Method: ZERA jointly optimizes system and user prompts; it scores prompts on eight generalizable criteria with automatically inferred weights and revises prompts using structured critiques, giving low-overhead, fast-converging prompt optimization. Result: ZERA is validated on five LLMs and nine diverse datasets covering reasoning, summarization, and code generation, consistently outperforming strong baselines; ablation studies confirm each component's contribution to prompt construction. Conclusion: Through structured, low-overhead prompt optimization, ZERA achieves efficient and stable prompt improvement across tasks and models, making automatic prompt optimization more practical. Abstract: Automatic Prompt Optimization (APO) improves large language model (LLM) performance by refining prompts for specific tasks. However, prior APO methods typically focus only on user prompts, rely on unstructured feedback, and require large sample sizes and long iteration cycles, making them costly and brittle. We propose ZERA (Zero-init Instruction Evolving Refinement Agent), a novel framework that jointly optimizes both system and user prompts through principled, low-overhead refinement. ZERA scores prompts using eight generalizable criteria with automatically inferred weights, and revises prompts based on these structured critiques. This enables fast convergence to high-quality prompts using minimal examples and short iteration cycles. We evaluate ZERA across five LLMs and nine diverse datasets spanning reasoning, summarization, and code generation tasks. Experimental results demonstrate consistent improvements over strong baselines. Further ablation studies highlight the contribution of each component to more effective prompt construction. Our implementation including all prompts is publicly available at https://github.com/younatics/zera-agent.

[5] Thinking in a Crowd: How Auxiliary Information Shapes LLM Reasoning

Haodong Zhao,Chenyan Zhao,Yansi Li,Zhuosheng Zhang,Gongshen Liu

Main category: cs.CL

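TL;DR: This paper studies the causal impact of external information on the reasoning of large language models with step-by-step thinking, and introduces the SciAux dataset to test robustness under different kinds of information. Helpful information improves accuracy, but misleading information amplifies errors through the "thinking mode" and causes a sharp performance drop, indicating that the models lack the ability to critically evaluate information.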

Details Motivation: To examine how robustly LLMs reason when given external information that may be helpful, irrelevant, or misleading, and to expose the risks and challenges of applying them in complex real-world scenarios. Method: Builds a new dataset, SciAux (derived from ScienceQA), to systematically test models under different types of auxiliary information and analyze how these affect step-by-step reasoning. Result: The model's "thinking mode" is a double-edged sword: helpful information raises accuracy, but misleading information causes a catastrophic performance drop, and the error is amplified during step-by-step reasoning. Conclusion: Making a model "think" is not enough; the key is to give it the ability to critically evaluate the information it relies on, improving the robustness and reliability of its reasoning. Abstract: The capacity of Large Language Models (LLMs) to reason is fundamental to their application in complex, knowledge-intensive domains. In real-world scenarios, LLMs are often augmented with external information that can be helpful, irrelevant, or even misleading. This paper investigates the causal impact of such auxiliary information on the reasoning process of LLMs with explicit step-by-step thinking capabilities. We introduce SciAux, a new dataset derived from ScienceQA, to systematically test the robustness of the model against these types of information. Our findings reveal a critical vulnerability: the model's deliberative "thinking mode" is a double-edged sword. While helpful context improves accuracy, misleading information causes a catastrophic drop in performance, which is amplified by the thinking process. Instead of conferring robustness, thinking reinforces the degree of error when provided with misinformation. This highlights that the challenge is not merely to make models "think", but to endow them with the critical faculty to evaluate the information upon which their reasoning is based. The SciAux dataset is available at https://huggingface.co/datasets/billhdzhao/SciAux.

[6] SIRAG: Towards Stable and Interpretable RAG with A Process-Supervised Multi-Agent Framework

Junlin Wang,Zehao Wu,Shaowei Lu,Yanlan Li,Xinghao Huang

Main category: cs.CL

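TL;DR: Proposes a process-supervised multi-agent framework with two lightweight agents, a Decision Maker and a Knowledge Selector, and an LLM judge that provides fine-grained rewards, improving coordination between the retriever and generator in retrieval-augmented generation (RAG).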

Details Motivation: In RAG, the retriever and generator are developed independently, so their interaction is inadequate and effectiveness suffers. Method: Introduces two lightweight agents, a Decision Maker and a Knowledge Selector; uses an LLM-as-a-Judge for process supervision of intermediate steps; trains end to end with a tree-structured rollout strategy and PPO. Result: Outperforms standard RAG baselines on single-hop and multi-hop question answering, with higher accuracy, more stable convergence, and better interpretability. Conclusion: The framework is modular and plug-and-play, requires no changes to existing retrievers or generators, and suits practical RAG applications. Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access external knowledge sources, but the effectiveness of RAG relies on the coordination between the retriever and the generator. Since these components are developed independently, their interaction is often suboptimal: the retriever may return irrelevant or redundant documents, while the generator may fail to fully leverage retrieved evidence. In this work, we propose a process-supervised multi-agent framework to bridge the gap between retriever and generator. The framework introduces two lightweight agents: a Decision Maker, which determines when to continue retrieval or stop for answer generation, and a Knowledge Selector, which filters retrieved documents to retain only the most useful evidence. To provide fine-grained supervision, we employ an LLM-as-a-Judge that evaluates each intermediate action with process-level rewards, ensuring more accurate credit assignment than relying solely on final answer correctness. We further adopt a tree-structured rollout strategy to explore diverse reasoning paths, and train both agents with Proximal Policy Optimization (PPO) in an end-to-end manner. Experiments on single-hop and multi-hop question answering benchmarks show that our approach achieves higher accuracy and more stable convergence, and produces more interpretable reasoning trajectories than standard RAG baselines. Importantly, the proposed framework is modular and plug-and-play, requiring no modification to the retriever or generator, making it practical for real-world RAG applications.

[7] ERFC: Happy Customers with Emotion Recognition and Forecasting in Conversation in Call Centers

Aditi Debsharma,Bhushan Jagyasi,Surajit Sen,Priyanka Pandey,Devicharith Dovari,Yuvaraj V. C,Rosalin Parida,Gopali Contractor

Main category: cs.CL

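TL;DR: Proposes ERFC, a new architecture for emotion recognition and forecasting in conversation that predicts the emotions of future utterances to improve customer experience.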

Details Motivation: In settings such as call centers, predicting a customer's future emotions can improve the service experience and raise customer satisfaction. Method: Proposes the ERFC architecture, which is multimodal and accounts for emotion attributes, context, and inter-utterance dependencies, and validates it experimentally on the IEMOCAP dataset. Result: Experiments show ERFC is feasible for emotion forecasting and effectively supports emotion recognition and forecasting tasks. Conclusion: ERFC helps manage customer emotions and has significant business value in applications such as call centers where customer satisfaction matters. Abstract: Emotion Recognition in Conversation has been seen to be widely applicable in call center analytics, opinion mining, finance, retail, healthcare, and other industries. In a call center scenario, the role of the call center agent is not confined to receiving calls but also includes providing a good customer experience by pacifying the frustration or anger of the customers. This can be achieved by maintaining neutral and positive emotion from the agent. As in any conversation, the emotion of one speaker is usually dependent on the emotion of the other speaker. Hence the positive emotion of an agent, accompanied by the right resolution, will help in enhancing customer experience. This can change an unhappy customer to a happy one. Imparting the right resolution at the right time becomes easier if the agent has insight into the emotion of future utterances. To predict the emotions of the future utterances we propose a novel architecture, Emotion Recognition and Forecasting in Conversation. Our proposed ERFC architecture considers multiple modalities, different attributes of emotion, context, and the interdependencies of the utterances of the speakers in the conversation. Our intensive experiments on the IEMOCAP dataset have shown the feasibility of the proposed ERFC. This approach can provide tremendous business value for applications like call centers, where the happiness of the customer is of utmost importance.

[8] Evaluating Large Language Models for Detecting Antisemitism

Jay Patel,Hrudayangam Mehta,Jeremy Blackburn

Main category: cs.CL

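TL;DR: This study evaluates eight open-source large language models (LLMs) on detecting antisemitic content, proposes Guided-CoT, a new chain-of-thought-like prompting method that improves performance across models, and analyzes model errors and semantic divergence.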

Details Motivation: Automated detection of hateful content is important but challenging; existing models need continuous training to keep up with changes in social media, so more effective zero-shot detection methods are worth exploring. Method: Uses eight open-source LLMs with an in-context definition as the policy guideline, compares several prompting techniques, and designs the new Guided-CoT prompt; analyzes errors in model-generated rationales with semantic divergence metrics. Result: Guided-CoT significantly improves detection performance for all models, regardless of decoding configuration, model size, or reasoning capability; Llama 3.1 70B outperforms fine-tuned GPT-3.5; models differ in explainability and reliability and show paradoxical behaviors. Conclusion: Guided-CoT is an effective prompting strategy that strengthens LLMs on hateful content detection, exposes limits in current models' semantic understanding and consistency, and informs model selection and tuning in practice. Abstract: Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model sizes, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability.

[9] Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

Hieu Tran,Zonghai Yao,Hong Yu

Main category: cs.CL

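TL;DR: Proposes TEMPO, a critic-free reinforcement learning algorithm that builds a prefix tree (P2T) to compute nonparametric prefix values and adds branch-gated temporal-difference corrections, outperforming PPO and GRPO on math and medical QA tasks.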

Details Motivation: Token-level credit assignment over long sequences is a bottleneck in reinforcement learning; PPO is complex and prone to overfitting, GRPO ignores branching structure, and a simpler, more effective algorithm is needed. Method: Prefix-to-Tree (P2T) converts multiple responses into a prefix tree and computes nonparametric prefix values; TEMPO builds on it, using the tree structure to supply precise token-level credit at branches without a learned value network. Result: On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution benchmarks such as MATH and MedQA and out-of-distribution benchmarks such as GSM-HARD and AMC23, reaching higher validation accuracy with comparable training time. Conclusion: With tree-estimated mean prefix values, TEMPO achieves efficient policy optimization and is a simple, general, high-performing critic-free reinforcement learning method. Abstract: Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values \(V(s)\) by aggregating descendant outcomes. Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with branch-gated temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers.
On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
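
The P2T idea in [9] (prefix values obtained by aggregating descendant outcomes) can be illustrated with a small sketch. This is an assumption-laden reading of the abstract, not the authors' code: it takes V(s) to be the mean reward of all sampled responses that share prefix s, and takes the TD term for a token to be V(prefix + token) minus V(prefix). The function names are hypothetical.

```python
from collections import defaultdict

def prefix_values(responses, rewards):
    """Map each token prefix (as a tuple) to the mean reward of responses sharing it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        # Every prefix of a response, including the empty root prefix.
        for end in range(len(tokens) + 1):
            prefix = tuple(tokens[:end])
            totals[prefix] += reward
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}

def td_corrections(tokens, values):
    """Per-token V(next prefix) - V(current prefix) along one response."""
    return [values[tuple(tokens[:t + 1])] - values[tuple(tokens[:t])]
            for t in range(len(tokens))]

# Three sampled responses to one prompt; only the first is correct.
responses = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"]]
rewards = [1.0, 0.0, 0.0]
values = prefix_values(responses, rewards)
print(td_corrections(responses[0], values))
```

Under this reading, a token where the tree does not branch has the same set of descendants before and after it, so its TD term is zero automatically; only tokens at branch points ("b" versus "e", "c" versus "d" in the toy data) receive nonzero credit.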

[10] Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning

Saksham Khatwani,He Cheng,Majid Afshar,Dmitriy Dligach,Yanjun Gao

Main category: cs.CL

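TL;DR: This paper explores using large language models (LLMs) as reward models for knowledge graph (KG) reasoning paths to make medical diagnostic reasoning more reliable. Evaluating multiple task formulations and training paradigms, it finds that specific reward optimization methods improve path-judging performance, but transfer to downstream tasks remains limited.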

Details Motivation: LLMs lack reliable knowledge grounding in diagnostic reasoning; knowledge graphs provide structured medical knowledge, but existing methods mostly insert KG content into prompts and do not enable structured reasoning. This paper explores a new paradigm in which the LLM strengthens trustworthy reasoning by judging whether knowledge paths are correct. Method: Treats the LLM as a reward model over KG reasoning paths, training it to judge whether a candidate path leads to the correct diagnosis for a given patient input. Systematically evaluates five task formulations and eight training paradigms, and tests whether the ability generalizes to downstream tasks such as diagnosis summarization and medical question answering. Result: Experiments on three open-source instruction-tuned LLMs show that specific reward optimization and distillation methods significantly improve path-judging performance, but transfer to downstream diagnostic tasks is weak, revealing brittleness. Conclusion: The paper presents the first systematic evaluation of "reward model style" reasoning over clinical KGs and shows how structured, reward-based supervision affects diagnostic reasoning in generative AI for healthcare, pointing to directions for improvement. Abstract: Large language models (LLMs) show promise for diagnostic reasoning but often lack reliable, knowledge-grounded inference. Knowledge graphs (KGs), such as the Unified Medical Language System (UMLS), offer structured biomedical knowledge that can support trustworthy reasoning. Prior approaches typically integrate KGs via retrieval-augmented generation or fine-tuning, inserting KG content into prompts rather than enabling structured reasoning. We explore an alternative paradigm: treating the LLM as a reward model of KG reasoning paths, where the model learns to judge whether a candidate path leads to correct diagnosis for a given patient input. This approach is inspired by recent work that leverages reward training to enhance model reasoning abilities, and grounded in computational theory, which suggests that verifying a solution is often easier than generating one from scratch. It also parallels physicians' diagnostic assessment, where they judge which sequences of findings and intermediate conditions most plausibly support a diagnosis. We first systematically evaluate five task formulations for knowledge path judging and eight training paradigms. Second, we test whether the path judging abilities generalize to downstream diagnostic tasks, including diagnosis summarization and medical question answering. Experiments with three open source instruct-tuned LLMs reveal both promise and brittleness: while specific reward optimization and distillation lead to strong path-judging performance, the transferability to downstream tasks remains weak.
Our findings provide the first systematic assessment of "reward model style" reasoning over clinical KGs, offering insights into how structured, reward-based supervision influences diagnostic reasoning in GenAI systems for healthcare.

[11] Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding

Pei-Shuo Wang,Jian-Jia Chen,Chun-Che Yang,Chi-Chih Chang,Ning-Chi Huang,Mohamed S. Abdelfattah,Kai-Chiang Wu

Main category: cs.CL

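TL;DR: Proposes SubSpec, a training-free, lossless, plug-and-play method that uses low-bit quantized substitute layers and a shared KV-Cache to significantly speed up LLM inference under parameter offloading.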

Details Motivation: LLMs are hard to deploy on memory-limited devices; existing compression and parameter offloading methods either degrade quality or infer slowly, and speculative decoding, though promising, is limited by poor alignment with the target model, which keeps acceptance lengths short. Method: SubSpec generates low-bit quantized substitute layers from the offloaded portions of the target model to build a highly aligned draft model, and shares the GPU-resident layers and KV-Cache to cut memory overhead and improve alignment, enabling efficient speculative decoding. Result: Achieves a 9.1x speedup on Qwen2.5 7B (8GB VRAM) and an average 12.5x speedup on Qwen2.5 32B (24GB VRAM), significantly better than existing methods. Conclusion: SubSpec is an efficient, general, plug-and-play speculative decoding framework that greatly speeds up inference under parameter offloading without losing quality. Abstract: The immense model sizes of large language models (LLMs) challenge deployment on memory-limited consumer GPUs. Although model compression and parameter offloading are common strategies to address memory limitations, compression can degrade quality, and offloading maintains quality but suffers from slow inference. Speculative decoding presents a promising avenue to accelerate parameter offloading, utilizing a fast draft model to propose multiple draft tokens, which are then verified by the target LLM in parallel with a single forward pass. This method reduces the time-consuming data transfers in forward passes that involve offloaded weight transfers. Existing methods often rely on pretrained weights of the same family, but require additional training to align with custom-trained models. Moreover, approaches that involve draft model training usually yield only modest speedups. This limitation arises from insufficient alignment with the target model, preventing higher token acceptance lengths. To address these challenges and achieve greater speedups, we propose SubSpec, a plug-and-play method to accelerate parameter offloading that is lossless and training-free. SubSpec constructs a highly aligned draft model by generating low-bit quantized substitute layers from offloaded target LLM portions. Additionally, our method shares the remaining GPU-resident layers and the KV-Cache, further reducing memory overhead and enhancing alignment.
SubSpec achieves a high average acceptance length, delivering 9.1x speedup for Qwen2.5 7B on MT-Bench (8GB VRAM limit) and an average of 12.5x speedup for Qwen2.5 32B on popular generation benchmarks (24GB VRAM limit).
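
The draft-then-verify loop that [11] builds on can be sketched generically. The code below is not SubSpec; it is a minimal greedy speculative decoding step under stated assumptions: `draft_next` and `target_next` are hypothetical callables that return the next token for a context, and verification is written as a per-position loop for clarity, whereas a real system would verify all draft tokens in one batched forward pass of the target model.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy draft-then-verify step; returns the tokens emitted this step."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft = []
    context = list(prefix)
    for _ in range(k):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # 2. The target model checks each proposal in order.
    emitted = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if expected != token:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(token)
        context.append(token)

    # 3. All k drafts accepted: the target contributes one extra token.
    emitted.append(target_next(context))
    return emitted

# Toy models over integer tokens: the draft agrees with the target only
# while the context is shorter than 4 tokens.
target = lambda ctx: len(ctx) % 3
draft = lambda ctx: len(ctx) % 3 if len(ctx) < 4 else 9
print(speculative_step([0], draft, target, k=4))  # prints [1, 2, 0, 1]
```

Under greedy decoding the emitted tokens are exactly what the target model alone would have produced, which is the sense in which such schemes are lossless. The speedup depends on how many draft tokens are accepted per step, which is why the abstract stresses alignment between draft and target.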

[12] Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents

Chutong Meng,Philipp Koehn

Main category: cs.CL

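TL;DR: Proposes Speech Vecalign, a parallel speech alignment method that monotonically aligns speech segment embeddings without text transcriptions. It is more robust and produces longer alignments than existing methods, and matches or exceeds SpeechMatrix model performance with less data.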

Details Motivation: To align large-scale parallel speech data efficiently without text transcriptions and improve speech-to-speech translation quality. Method: Speech Vecalign monotonically aligns speech segment embeddings to align speech end to end, without relying on text. Result: Yields about 1,000 hours of high-quality alignments from 3,000 hours of VoxPopuli English-German speech; improves over Global Mining by 0.37 and 0.18 ASR-BLEU; matches or exceeds SpeechMatrix performance with 8 times less data. Conclusion: Speech Vecalign is an efficient, robust unsupervised speech alignment method that significantly improves speech-to-speech translation under low-resource conditions. Abstract: We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
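
To make "monotonically aligns speech segment embeddings" in [12] concrete, here is a deliberately simplified dynamic-programming sketch. It is not the Speech Vecalign algorithm: it assumes one-to-one matches only, scores a match by cosine similarity, and allows skipping a segment on either side at a hypothetical `skip_penalty`. The embeddings are toy 2-D vectors rather than real speech embeddings.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def monotonic_align(src, tgt, skip_penalty=0.0):
    """Return index pairs (i, j), increasing in both i and j, maximizing total similarity."""
    n, m = len(src), len(tgt)
    # best[i][j]: best score for aligning src[:i] with tgt[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                options.append((best[i - 1][j - 1] + cosine(src[i - 1], tgt[j - 1]), "match"))
            if i > 0:
                options.append((best[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                options.append((best[i][j - 1] - skip_penalty, "skip_tgt"))
            best[i][j], back[i][j] = max(options)
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0], [1.0, 1.0]]
print(monotonic_align(src, tgt))  # prints [(0, 0), (1, 2), (2, 3)]
```

The monotonicity constraint is what separates document alignment from mining: a pair is only accepted if it keeps both sides in order, so the extra target segment at index 1 is skipped even though it is similar to the last source segment.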

[13] Interactive Real-Time Speaker Diarization Correction with Human Feedback

Xinlu He,Yiwen Guan,Badrivishal Paurana,Zilin Dai,Jacob Whitehill

Main category: cs.CL

TL;DR: Proposes an LLM-assisted speaker diarization correction system that significantly reduces error rates through real-time user feedback.

Details Motivation: Existing automatic speech processing systems lack user feedback, which can lower speaker attribution accuracy, so human involvement is needed to improve performance. Method: Combines streaming ASR and diarization, uses an LLM to generate summaries and solicit brief verbal feedback from users, and corrects errors in real time; applies the split-when-merged (SWM) technique to split multi-speaker segments and performs online speaker enrollment based on user corrections. Result: LLM-driven simulations on the AMI test set show the system reduces DER by 9.92% and speaker confusion error by 44.23%. Conclusion: By combining an LLM with real-time user feedback, the system effectively improves diarization accuracy and has potential for practical use. Abstract: Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what; yet, human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted speaker diarization correction system that lets users fix speaker attribution errors in real time. The pipeline performs streaming ASR and diarization, uses an LLM to deliver concise summaries to the users, and accepts brief verbal feedback that is immediately incorporated without disrupting interactions. Moreover, we develop techniques to make the workflow more effective: First, a split-when-merged (SWM) technique detects and splits multi-speaker segments that the ASR erroneously attributes to just a single speaker. Second, online speaker enrollments are collected based on users' diarization corrections, thus helping to prevent speaker diarization errors from occurring in the future. LLM-driven simulations on the AMI test set indicate that our system substantially reduces DER by 9.92% and speaker confusion error by 44.23%. We further analyze correction efficacy under different settings, including summary vs full transcript display, limits on the number of online enrollments, and correction frequency.

[14] NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery

Minki Hong,Jangho Choi,Jihie Kim

Main category: cs.CL

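TL;DR: This paper presents NormGenesis, a multicultural framework for dialogue generation and annotation. It introduces the Violation-to-Resolution (V2R) dialogue type to model repair after social norm violations, and uses exemplar-guided iterative refinement to improve the pragmatic consistency of multilingual dialogues.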

Details Motivation: Existing dialogue systems cannot model how social norms play out dynamically, lack pragmatic consistency in non-English contexts, and handle culturally sensitive interactions poorly. Method: Proposes the Violation-to-Resolution (V2R) dialogue type and builds multilingual data (English, Chinese, Korean); applies exemplar-based iterative refinement early in generation to align with linguistic, emotional, and sociocultural expectations; annotates at the turn level for norm adherence, speaker intent, and emotional response. Result: Builds a dataset of 10,800 multi-turn dialogues; human and LLM evaluations show the framework outperforms existing datasets in refinement quality, dialogue naturalness, and generalization; models trained on V2R-augmented data show stronger pragmatic competence in ethically sensitive contexts. Conclusion: NormGenesis sets a new benchmark for culturally adaptive dialogue modeling and offers a scalable, norm-aware dialogue generation method for diverse languages and cultures. Abstract: Social norms govern culturally appropriate behavior in communication, enabling dialogue systems to produce responses that are not only coherent but also socially acceptable. We present NormGenesis, a multicultural framework for generating and annotating socially grounded dialogues across English, Chinese, and Korean. To model the dynamics of social interaction beyond static norm classification, we propose a novel dialogue type, Violation-to-Resolution (V2R), which models the progression of conversations following norm violations through recognition and socially appropriate repair. To improve pragmatic consistency in underrepresented languages, we implement an exemplar-based iterative refinement early in the dialogue synthesis process. This design introduces alignment with linguistic, emotional, and sociocultural expectations before full dialogue generation begins. Using this framework, we construct a dataset of 10,800 multi-turn dialogues annotated at the turn level for norm adherence, speaker intent, and emotional response. Human and LLM-based evaluations demonstrate that NormGenesis significantly outperforms existing datasets in refinement quality, dialogue naturalness, and generalization performance. We show that models trained on our V2R-augmented data exhibit improved pragmatic competence in ethically sensitive contexts. Our work establishes a new benchmark for culturally adaptive dialogue modeling and provides a scalable methodology for norm-aware generation across linguistically and culturally diverse languages.

[15] Evaluating the Creativity of LLMs in Persian Literary Text Generation

Armin Tourajmehr,Mohammad Reza Modarres,Yadollah Yaghoobzadeh

Main category: cs.CL

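TL;DR: This paper evaluates the ability of large language models (LLMs) to generate Persian literary text rich in cultural expressions. It builds a dataset of user-generated Persian literary text spanning 20 topics and assesses outputs on four creativity dimensions adapted from the Torrance Tests of Creative Thinking (originality, fluency, flexibility, and elaboration). Using an LLM as an automated judge, validated for agreement with human ratings, the study also analyzes how well models understand and use four core literary devices (simile, metaphor, hyperbole, and antithesis), revealing both strengths and limitations of LLMs in Persian literary generation.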

Details Motivation: Prior research has focused on English literary generation, with little exploration of non-English traditions such as Persian and no standardized method for assessing creativity. This paper aims to fill that gap by systematically evaluating LLM literary creativity in Persian. Method: Builds a Persian literary dataset covering 20 topics; designs a four-dimension creativity evaluation framework (originality, fluency, flexibility, elaboration) based on the Torrance Tests of Creative Thinking; uses an LLM as an automated judge and validates its agreement with human ratings using intraclass correlation coefficients; analyzes model use of simile, metaphor, hyperbole, and antithesis. Result: The LLM judge agrees strongly with human judgments, indicating that automated evaluation is reliable; models show some ability to generate culturally relevant Persian literary text but remain limited in understanding and appropriately using specific literary devices. Conclusion: LLMs show some potential for generating Persian literary text, but their creativity still needs improvement, especially in cultural sensitivity and the use of complex rhetorical devices. Abstract: Large language models (LLMs) have demonstrated notable creative abilities in generating literary texts, including poetry and short stories. However, prior research has primarily centered on English, with limited exploration of non-English literary traditions and without standardized methods for assessing creativity. In this paper, we evaluate the capacity of LLMs to generate Persian literary text enriched with culturally relevant expressions. We build a dataset of user-generated Persian literary texts spanning 20 diverse topics and assess model outputs along four creativity dimensions (originality, fluency, flexibility, and elaboration) by adapting the Torrance Tests of Creative Thinking. To reduce evaluation costs, we adopt an LLM as a judge for automated scoring and validate its reliability against human judgments using intraclass correlation coefficients, observing strong agreement. In addition, we analyze the models' ability to understand and employ four core literary devices: simile, metaphor, hyperbole, and antithesis. Our results highlight both the strengths and limitations of LLMs in Persian literary text generation, underscoring the need for further refinement.

[16] Developing an AI framework to automatically detect shared decision-making in patient-doctor conversations

Oscar J. Ponce-Ponte,David Toro-Tobon,Luis F. Figueroa,Michael Gionfriddo,Megan Branda,Victor M. Montori,Saturnino Luz,Juan P. Brito

Main category: cs.CL

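TL;DR: This study proposes an automated method, based on language models and conversational alignment (CA) scores, for measuring shared decision-making (SDM) in patient-doctor conversations at scale, and shows that the scores are significantly associated with SDM outcome measures.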

Details Motivation: No method currently exists to measure SDM automatically at scale, so a scalable, automated tool is needed to assess how far patients are involved in decisions and to support patient-centered care. Method: Uses 157 recorded patient-doctor conversations transcribed into 42,559 sentences; trains deep learning models and fine-tunes BERT models (NSP task) with context-response pairs and negative sampling, and computes four types of CA scores; analyzes the relationship between CA scores and SDM measures (DCS and OPTION12) with a random-effects model, correcting for multiple comparisons. Result: The fine-tuned BERTbase model achieves the best recall@1 (0.640); the AbsMax and Max CA scores from the DL model without the stylebook strategy are significantly associated with OPTION12, and the Max CA score from BERTbase is significantly associated with DCS; model size does not affect the association between CA scores and SDM. Conclusion: The study develops an explainable, scalable automated method that uses CA scores to quantify SDM in patient-doctor conversations, with potential for evaluating SDM interventions at scale. Abstract: Shared decision-making (SDM) is necessary to achieve patient-centred care. Currently no methodology exists to automatically measure SDM at scale. This study aimed to develop an automated approach to measure SDM by using language modelling and the conversational alignment (CA) score. A total of 157 video-recorded patient-doctor conversations from a randomized multi-centre trial evaluating SDM decision aids for anticoagulation in atrial fibrillation were transcribed and segmented into 42,559 sentences. Context-response pairs and negative sampling were employed to train deep learning (DL) models and fine-tuned BERT models via the next sentence prediction (NSP) task. Each top-performing model was used to calculate four types of CA scores. A random-effects analysis by clinician, adjusting for age, sex, race, and trial arm, assessed the association between CA scores and SDM outcomes: the Decisional Conflict Scale (DCS) and the Observing Patient Involvement in Decision-Making 12 (OPTION12) scores. p-values were corrected for multiple comparisons with the Benjamini-Hochberg method. Among 157 patients (34% female, mean age 70, SD 10.8), clinicians on average spoke more words than patients (1911 vs 773). The DL model without the stylebook strategy achieved a recall@1 of 0.227, while the fine-tuned BERTbase (110M) achieved the highest recall@1 with 0.640. The AbsMax (18.36, SE 7.74, p=0.025) and Max CA (21.02, SE 7.63, p=0.012) scores generated with the DL without stylebook were associated with OPTION12.
The Max CA score generated with the fine-tuned BERTbase (110M) was associated with the DCS score (-27.61, SE 12.63, p=0.037). BERT model sizes did not have an impact on the association between CA scores and SDM. This study introduces an automated, scalable methodology to measure SDM in patient-doctor conversations through explainable CA scores, with potential to evaluate SDM strategies at scale.
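
The abstract for [16] corrects p-values with the Benjamini-Hochberg method. For readers unfamiliar with it, the sketch below shows the standard step-up adjustment: sort the p-values, scale the p-value at rank r by m/r, then enforce monotonicity from the largest rank down. The input values are made up for illustration and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # no larger than the ones ranked after it.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Illustrative raw p-values (hypothetical, not from the paper).
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

A hypothesis is then declared significant at false discovery rate q when its adjusted p-value is at most q.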

[17] CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Daniel Kaiser,Arnoldo Frigessi,Ali Ramezani-Kebrya,Benjamin Ricaud

Main category: cs.CL

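TL;DR: CogniLoad is a new synthetic benchmark grounded in Cognitive Load Theory for precisely analyzing how large language models perform on long-context reasoning. By independently tuning task difficulty, distractor ratio, and task length, it shows that task length is the dominant limiting factor and provides a systematic diagnostic tool.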

Details Motivation: Existing long-context reasoning benchmarks conflate key factors such as task complexity, distractor interference, and task length, so they cannot support fine-grained analysis of why LLMs fail. Method: CogniLoad is designed on the basis of Cognitive Load Theory (CLT). It generates natural-language logic puzzles with tunable parameters that separately control intrinsic difficulty (d), distractor-to-signal ratio (ρ), and task length (N), so that the effect of each cognitive load dimension on LLM reasoning can be assessed independently. Result: An evaluation of 22 SotA reasoning LLMs shows that task length is the dominant constraint on performance, that models differ in their tolerance to intrinsic complexity, and that responses to the distractor ratio are U-shaped. Conclusion: Through factorial control over cognitive load dimensions, CogniLoad offers a reproducible, scalable, and diagnostically rich tool for understanding the limits of LLM reasoning and guiding future model development. Abstract: Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.

[18] LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

Zeyu Liu,Souvik Kundu,Lianghao Jiang,Anni Li,Srikanth Ronanki,Sravan Bodapati,Gourav Datta,Peter A. Beerel

Main category: cs.CL

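TL;DR: LAWCAT is an efficient linear attention framework that transfers the capabilities of pre-trained Transformers to achieve high-performance, long-context model compression and acceleration, substantially reducing reliance on long-sequence training data and compute.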

Details Motivation: The quadratic computational complexity of the Transformer architecture is a bottleneck on long-sequence tasks, while existing linear attention models are expensive to train and cannot easily reuse pre-trained knowledge. Method: The LAWCAT framework adds causal one-dimensional convolution (Conv1D) to strengthen local dependency modeling and uses normalized gated linear attention to improve generalization across context lengths; knowledge is transferred efficiently by distilling a pre-trained model. Result: When Mistral-7B is distilled with only 1K-length sequences, LAWCAT keeps passkey retrieval accuracy above 90% up to 22K tokens. The Llama3.2-1B LAWCAT variant performs well on long-context tasks such as S-NIAH and BABILong while using less than 0.1% of the pre-training tokens. For sequences longer than 8K tokens, prefill is faster than FlashAttention-2. Conclusion: LAWCAT offers an efficient path to high-performance long-context linear models suited to edge deployment, substantially lowering the need for large-scale long-sequence training data and compute. Abstract: Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% of the pre-training tokens compared with pre-trained models. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. 
LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.

[19] Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference

Ben Finkelshtein,Silviu Cucerzan,Sujay Kumar Jauhar,Ryen White

Main category: cs.CL

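TL;DR: This paper systematically evaluates large language models (LLMs) on text-based graph machine learning tasks across interaction modes, dataset domains, and graph structural properties. Code generation performs best overall, especially on graphs with long text or high node degree; all methods remain effective on heterophilic graphs, and code generation can flexibly rely on structure, features, or labels.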

Details Motivation: Although LLMs are widely used for text-rich graph learning tasks, their ability to interact with graph data is not systematically understood; this paper aims to fill that gap. Method: A large-scale, controlled evaluation compares three LLM-graph interaction modes (prompting, tool use, and code generation) along several axes, including data domain, homophilic versus heterophilic graphs, text length, and model size. Reliance on each input type is quantified by truncating features, deleting edges, and removing labels. Result: 1) LLMs used as code generators achieve the strongest overall performance, with clear advantages on long-text or high-degree graphs; 2) all interaction modes remain effective on heterophilic graphs, challenging the assumption that LLM-based methods fail under low homophily; 3) code generation adaptively relies on the most informative input type (structure, features, or labels). Conclusion: The study gives a comprehensive picture of the strengths and limitations of current LLM-graph interaction modes and offers practical design principles for future methods. Abstract: Large language models (LLMs) are increasingly used for text-rich graph machine learning tasks such as node classification in high-impact domains like fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in their interaction with graph data. In this work, we conduct a large-scale, controlled evaluation across several key axes of variability to systematically assess the strengths and weaknesses of LLM-based graph reasoning methods in text-based applications. The axes include the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; structural regimes contrasting homophilic and heterophilic graphs; feature characteristics involving both short- and long-text node attributes; and model configurations with varying LLM sizes and reasoning capabilities. We further analyze dependencies by methodically truncating features, deleting edges, and removing labels to quantify reliance on input types. Our findings provide practical and actionable guidance. (1) LLMs as code generators achieve the strongest overall performance on graph data, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. 
(3) Code generation is able to flexibly adapt its reliance between structure, features, or labels to leverage the most informative input type. Together, these findings provide a comprehensive view of the strengths and limitations of current LLM-graph interaction modes and highlight key design principles for future approaches.

[20] A Rhythm-Aware Phrase Insertion for Classical Arabic Poetry Composition

Mohamad Elzohbi,Richard Zhao

Main category: cs.CL

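TL;DR: Proposes a ByT5-based method that uses a rule-based grapheme-to-beat transformation and conditional denoising training to generate phrases in Arabic poetry that align with a target rhythm.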

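Details Motivation: Classical Arabic poetry follows strict rhythmic patterns, and composing it by hand requires deep knowledge of prosody. Automatic generation must satisfy both rhythmic compliance and semantic coherence, and existing methods struggle to balance linguistic constraints with creative needs. Method: The approach uses ByT5, a byte-level multilingual Transformer model, with a rule-based grapheme-to-beat transformation that extracts rhythm from fully diacritized Arabic text. ByT5 is fine-tuned with a conditional denoising objective so that it reconstructs masked words to match a target rhythm. A curriculum learning strategy pre-trains on a general Arabic dataset before fine-tuning on a poetic dataset, and cross-lingual transfer from English to Arabic is explored. Result: Experiments show that the proposed models achieve strong rhythmic alignment while keeping semantic coherence high, significantly outperforming baseline methods. Cross-lingual transfer and curriculum learning improve generation quality. Conclusion: The method generates Arabic poetic phrases that fit a given rhythm, balancing rhythmic accuracy and semantic consistency, and has potential for co-creative applications that assist the composition of classical Arabic poems. Abstract: This paper presents a methodology for inserting phrases in Arabic poems to conform to a specific rhythm using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on a poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications in the process of composing classical Arabic poems.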

[21] Trace Is In Sentences: Unbiased Lightweight ChatGPT-Generated Text Detector

Mo Mu,Dianqiao Lei,Chang Li

Main category: cs.CL

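TL;DR: Proposes a lightweight framework based on the internal structure of text for detecting both original and lightly paraphrased AI-generated text, addressing the weaknesses of existing methods under paraphrasing and word-level pattern bias.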

Details Motivation: Existing detectors of AI-generated text are easily affected by paraphrasing or simple prompts, carry biases from ChatGPT's word-level patterns and from training data, and degrade on modified text. Method: Sentence embeddings are obtained from pre-trained language models and their relationships are modeled with attention. Contrastive learning reduces the embedding bias introduced by autoregressive generation, and a causal graph combined with counterfactual methods separates structural features from topic-related bias. Result: The method is validated on two curated datasets (abstract comparisons and revised life FAQs), showing good detection performance on both original and paraphrased AI-generated text. Conclusion: By capturing the intrinsic structural features of text, the method detects AI-generated text more robustly than existing detectors that rely on word-level signals. Abstract: The widespread adoption of ChatGPT has raised concerns about its misuse, highlighting the need for robust detection of AI-generated text. Current word-level detectors are vulnerable to paraphrasing or simple prompts (PSP), suffer from biases induced by ChatGPT's word-level patterns (CWP) and training data content, degrade on modified text, and often require large models or online LLM interaction. To tackle these issues, we introduce a novel task to detect both original and PSP-modified AI-generated texts, and propose a lightweight framework that classifies texts based on their internal structure, which remains invariant under word-level changes. Our approach encodes sentence embeddings from pre-trained language models and models their relationships via attention. We employ contrastive learning to mitigate embedding biases from autoregressive generation and incorporate a causal graph with counterfactual methods to isolate structural features from topic-related biases. Experiments on two curated datasets, including abstract comparisons and revised life FAQs, validate the effectiveness of our method.

[22] CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs

Jin Young Kim,Ji Won Yoon

Main category: cs.CL

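TL;DR: This paper proposes CCQA, a new reasoning method based on cycle consistency that outperforms existing state-of-the-art methods on small language models (SLMs), particularly on mathematical and commonsense reasoning tasks. A lightweight Flan-T5 model generates the questions, establishing a new practical baseline for efficient reasoning.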

Details Motivation: Existing inference-time reasoning strategies improve accuracy on large language models, but their effect on small models is unclear and they often fail there, so a new reasoning method that works for small language models is needed. Method: CCQA is inspired by cycle consistency: it generates a question from each reasoning path and answer, scores it by similarity to the original question, and selects the highest-scoring candidate solution as the final answer. Because small models are weak at generating questions, a lightweight Flan-T5 model dedicated to question generation is introduced. Result: Experiments show that CCQA consistently outperforms existing state-of-the-art methods across eight models on mathematical and commonsense reasoning benchmarks, significantly improving the reasoning performance of small language models. Conclusion: CCQA is an effective reasoning method for small language models that improves reasoning accuracy and establishes a new practical baseline for efficient reasoning. Abstract: Recently, inference-time reasoning strategies have further improved the accuracy of large language models (LLMs), but their effectiveness on smaller models remains unclear. Based on the observation that conventional approaches often fail to improve performance in this context, we propose Cycle-Consistency in Question Answering (CCQA), a novel reasoning method that can be effectively applied to SLMs. Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. Since conventional SLMs struggle to generate accurate questions from their own reasoning paths and answers, we employ a lightweight Flan-T5 model specialized for question generation to support this process efficiently. The experimental results verify that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks. Furthermore, our method establishes a new practical baseline for efficient reasoning in SLMs. Source code can be found at https://github.com/scai-research/ccqa_official.
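
The selection step described in the abstract (regenerate a question from each candidate, score it against the original, keep the best) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `regenerate_question` is a hypothetical callable standing in for the Flan-T5 question generator, and token-level Jaccard overlap is a swapped-in similarity measure, since the abstract does not say which similarity score is used.

```python
def token_jaccard(a, b):
    """Assumed similarity: Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_by_cycle_consistency(question, candidates, regenerate_question,
                                similarity=token_jaccard):
    """Pick the candidate whose regenerated question best matches the original.

    candidates: list of (reasoning_path, answer) tuples.
    regenerate_question: hypothetical callable mapping a candidate to a question string.
    """
    return max(
        candidates,
        key=lambda cand: similarity(question, regenerate_question(cand)),
    )
```

In practice the stub generator would be replaced by a call to a question-generation model; passing it in as a parameter keeps the selection logic independent of any particular model API.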

[23] Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

Yeongbin Seo,Gayoung Kim,Jaehyung Kim,Jinyoung Yeo

Main category: cs.CL

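TL;DR: Proposes a fast data filtering method based on token prior probabilities, using corpus-level term frequency statistics in place of costly perplexity filtering. It achieves the best results across 20 downstream tasks and is over 1000x faster.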

Details Motivation: Existing perplexity-based data filtering is computationally expensive and unreliable on noisy or out-of-distribution samples, so a more efficient and stable method is needed. Method: Token priors are estimated from term frequency statistics, and documents are filtered by the mean and standard deviation of their token priors, with no model inference required. Result: The method achieves the highest average performance across 20 downstream benchmarks, filters over 1000x faster than perplexity-based methods, and applies to code, mathematical notation, and multilingual corpora. Conclusion: The prior-based method is a simple, efficient, and scalable data filtering approach and a strong alternative to perplexity filtering. Abstract: As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
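
To make the idea of filtering on the mean and standard deviation of token priors concrete, here is a minimal sketch. It assumes whitespace tokenization and caller-supplied threshold ranges; both are illustrative choices, since the abstract does not specify the tokenizer or how thresholds are set.

```python
from collections import Counter
from statistics import mean, pstdev


def token_priors(corpus):
    """Estimate each token's prior as its relative frequency in the corpus."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def prior_stats(doc, priors):
    """Mean and population standard deviation of the token priors in one document."""
    values = [priors.get(tok, 0.0) for tok in doc.split()]
    if not values:
        return 0.0, 0.0
    return mean(values), pstdev(values)


def keep_document(doc, priors, mean_range, std_range):
    """Keep a document only if both statistics fall inside the given (assumed) ranges."""
    m, s = prior_stats(doc, priors)
    return (mean_range[0] <= m <= mean_range[1]
            and std_range[0] <= s <= std_range[1])
```

A document made of one repeated token has zero spread in its token priors, so a lower bound on the standard deviation is enough to drop it in this toy setup. No model inference is involved at any point, which is the property the abstract credits for the speedup.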

[24] TsqLoRA: Towards Sensitivity and Quality Low-Rank Adaptation for Efficient Fine-Tuning

Yu Chen,Yifei Han,Long Zhang,Yue Du,Bin Li

Main category: cs.CL

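TL;DR: Proposes TsqLoRA, an efficient fine-tuning method that combines data-quality-aware sampling with sensitivity-aware low-rank adaptation, reducing training overhead while maintaining or improving performance.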

Details Motivation: Existing parameter-efficient fine-tuning methods overlook differences in sensitivity across model layers and the importance of training data, which limits both efficiency and performance. Method: A quality-aware sampling mechanism selects highly informative data, and a dynamic rank allocation module adjusts each layer's low-rank adaptation rank according to its sensitivity to parameter updates. Result: Experiments show that TsqLoRA outperforms existing methods on a variety of NLP tasks, markedly improving fine-tuning efficiency with comparable or better performance. Conclusion: By jointly optimizing data selection and layer sensitivity modeling, TsqLoRA improves both the efficiency and the effectiveness of fine-tuning large models. Abstract: Fine-tuning large pre-trained models for downstream tasks has become a fundamental approach in natural language processing. Fully fine-tuning all model parameters is computationally expensive and memory-intensive, especially in resource-constrained environments. Existing parameter-efficient fine-tuning methods reduce the number of trainable parameters but typically overlook the varying sensitivity of different model layers and the importance of training data. In this work, we propose TsqLoRA, a novel method that integrates data-quality-driven selection with sensitivity-aware low-rank adaptation, consisting of two main components: a quality-aware sampling mechanism for selecting the most informative training data, and a dynamic rank allocation module that adjusts the rank of each layer based on its sensitivity to parameter updates. The experimental results demonstrate that TsqLoRA improves fine-tuning efficiency while maintaining or even improving performance on a variety of NLP tasks. Our code will be available at https://github.com/Benjamin-Ricky/TsqLoRA.

[25] UniECG: Understanding and Generating ECG in One Unified Model

Jiarui Jin,Haoyu Wang,Xiang Lan,Jun Li,Gaofeng Cheng,Hongyan Li,Shenda Hong

Main category: cs.CL

TL;DR: Proposes UniECG, the first unified model that can perform both evidence-based ECG interpretation and text-conditioned ECG generation.

Details Motivation: Existing unified models such as GPT-5 perform poorly at understanding ECG signals and producing accurate diagnoses, and they cannot generate ECG signals correctly. Method: A decoupled two-stage training approach is used: the first stage learns ECG-to-text interpretation, and the second stage injects text-to-ECG generation capability through latent space alignment. Result: UniECG autonomously chooses whether to interpret or generate an ECG based on user input, extending the capability boundaries of current ECG models across multiple tasks. Conclusion: UniECG is the first unified ECG model with both interpretation and generation capabilities, significantly advancing intelligent analysis of medical ECGs. Abstract: Recent unified models such as GPT-5 have achieved encouraging progress on vision-language tasks. However, these unified models typically fail to correctly understand ECG signals and provide accurate medical diagnoses, nor can they correctly generate ECG signals. To address these limitations, we propose UniECG, the first unified model for ECG capable of concurrently performing evidence-based ECG interpretation and text-conditioned ECG generation tasks. Through a decoupled two-stage training approach, the model first learns evidence-based interpretation skills (ECG-to-Text), and then injects ECG generation capabilities (Text-to-ECG) via latent space alignment. UniECG can autonomously choose to interpret or generate an ECG based on user input, significantly extending the capability boundaries of current ECG models. Our code and checkpoints will be made publicly available at https://github.com/PKUDigitalHealth/UniECG upon acceptance.

[26] A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users

Nishant Balepur,Matthew Shu,Yoo Yeon Sung,Seraphina Goldfarb-Tarrant,Shi Feng,Fumeng Yang,Rachel Rudinger,Jordan Lee Boyd-Graber

Main category: cs.CL

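TL;DR: Using the Planorama platform, this study finds that user preferences and model-generated plans do not always reflect actual helpfulness. Surface features such as brevity influence preferences but do not predict real outcomes, so the authors recommend that LLM alignment draw on feedback from real user interactions rather than preference data alone.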

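Details Motivation: Existing LLM alignment methods use user preferences as a proxy for helpfulness, but this assumption has not been validated. The study tests whether user preferences truly reflect how helpful plans are and exposes potential flaws in current alignment methods. Method: The Planorama experimental platform collects data from 126 users executing 4388 LLM plans on 300 multi-step questions, together with 5584 plan comparisons. The relationship between task success and preferences is analyzed, and the setup is recreated with agents and reward models to assess how well they simulate helpfulness. Result: 1) User and model preferences and agent success rates do not accurately predict which plans actually help users; 2) the gap is not due to user-specific preferences, since users succeed at similar rates with plans they prefer and plans they disprefer; 3) surface features such as brevity and question similarity strongly influence preferences, but these biases do not predict actual helpfulness. Conclusion: Current preference-based alignment may drift away from actual helpfulness. Future LLM alignment should incorporate performance feedback from real user interactions, not only preferences for what looks helpful, and the study calls on NLP researchers to adopt evaluation protocols closer to real use. Abstract: To assist users in complex tasks, LLMs generate plans: step-by-step instructions towards a goal. While alignment methods aim to ensure LLM plans are helpful, they train (RLHF) or evaluate (ChatbotArena) on what users prefer, assuming this reflects what helps them. We test this with Planorama: an interface where 126 users answer 300 multi-step questions with LLM plans. We get 4388 plan executions and 5584 comparisons to measure plan helpfulness (QA success) and user preferences on plans, and recreate the setup in agents and reward models to see if they simulate or prefer what helps users. We expose: 1) user/model preferences and agent success do not accurately predict which plans help users, so common alignment feedback can misalign with helpfulness; 2) this gap is not due to user-specific preferences, as users are similarly successful when using plans they prefer/disprefer; 3) surface-level cues like brevity and question similarity strongly link to preferences, but such biases fail to predict helpfulness. In all, we argue aligning helpful LLMs needs feedback from real user interactions, not just preferences of what looks helpful, so we discuss the plan NLP researchers can execute to solve this problem.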

[27] Consistency-Aware Parameter-Preserving Knowledge Editing Framework for Multi-Hop Question Answering

Lingwen Deng,Yifei Han,Long Zhang,Yue Du,Bin Li

Main category: cs.CL

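TL;DR: This paper proposes CAPE-KG, a consistency-aware, parameter-preserving knowledge editing framework for multi-hop question answering that improves editing by keeping knowledge graph construction, update, and retrieval consistent.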

Details Motivation: Existing knowledge-graph-based, parameter-preserving knowledge editing methods suffer from inconsistency in multi-hop question answering, leading to knowledge contamination and unstable reasoning that undermine reliability. Method: The CAPE-KG framework introduces task-consistency mechanisms into knowledge graph construction, update, and retrieval so that each stage stays aligned with the needs of multi-hop question answering. Result: Experiments on the MQuAKE benchmark show that CAPE-KG significantly improves knowledge editing accuracy in multi-hop question answering. Conclusion: By strengthening consistency, CAPE-KG improves the performance and reliability of parameter-preserving knowledge editing on multi-hop reasoning tasks. Abstract: Parameter-Preserving Knowledge Editing (PPKE) enables updating models with new or corrected information without retraining or parameter adjustment. Recent PPKE approaches use knowledge graphs (KG) to extend knowledge editing (KE) capabilities to multi-hop question answering (MHQA). However, these methods often lack consistency, leading to knowledge contamination, unstable updates, and retrieval behaviors that fail to reflect the intended edits. Such inconsistencies undermine the reliability of PPKE in multi-hop reasoning. We present CAPE-KG, Consistency-Aware Parameter-Preserving Editing with Knowledge Graphs, a novel consistency-aware framework for PPKE on MHQA. CAPE-KG ensures KG construction, update, and retrieval are always aligned with the requirements of the MHQA task, maintaining coherent reasoning over both unedited and edited knowledge. Extensive experiments on the MQuAKE benchmark show accuracy improvements in PPKE performance for MHQA, demonstrating the effectiveness of addressing consistency in PPKE.

[28] Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction

Huanxin Sheng,Xinyi Liu,Hangfeng He,Jieyu Zhao,Jian Kang

Main category: cs.CL

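TL;DR: This paper proposes a conformal-prediction-based framework for analyzing the uncertainty of large language models (LLMs) used as evaluators. It produces prediction intervals to make evaluation more reliable and proposes a midpoint score as a low-bias alternative.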

Details Motivation: LLM-as-a-judge shows promise for evaluating natural language generation, but the uncertainty of its judgments is underexplored, which affects its trustworthiness and deployment in practice. Method: Conformal prediction is used to construct continuous prediction intervals, with an ordinal boundary adjustment designed for discrete rating tasks; the interval midpoint is also proposed as a new scoring strategy. Result: Experiments show that the method provides valid prediction intervals with coverage guarantees, that midpoint scoring reduces bias, and that judge reprompting helps improve judgment quality. Conclusion: The framework quantifies uncertainty in LLM-based evaluation, improves the reliability and interpretability of evaluation, and provides strong support for using LLMs as evaluators. Abstract: LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluation remains underexplored. This lack of reliability may limit its deployment in many applications. This work presents the first framework to analyze the uncertainty by offering a prediction interval of LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also suggest a midpoint-based score within the interval as a low-bias alternative to raw model score and weighted average. We perform extensive experiments and analysis, which show that conformal prediction can provide valid prediction intervals with coverage guarantees. We also explore the usefulness of interval midpoint and judge reprompting for better judgment.
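
For readers unfamiliar with conformal prediction, the sketch below shows generic split conformal prediction with absolute residuals as the nonconformity score. It is a standard textbook construction swapped in for illustration and is not claimed to be the paper's exact procedure; in particular, clipping the interval to a 1-5 rating scale is an assumed simplification and not the ordinal boundary adjustment the authors design.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest residual.

    residuals: |human score - judge score| on a held-out calibration set.
    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(residuals)[k - 1]


def prediction_interval(score, q, low=1.0, high=5.0):
    """Interval around a judge score, clipped to an assumed 1-5 rating scale."""
    return max(low, score - q), min(high, score + q)


def midpoint(interval):
    """Midpoint of an interval, in the spirit of the midpoint-based score."""
    return (interval[0] + interval[1]) / 2
```

With seven calibration residuals `[0, 1, 0, 2, 0, 1, 0]` and `alpha=0.25`, the quantile is the sixth smallest residual, which is 1, so a judge score of 5 gives the clipped interval (4.0, 5.0) with midpoint 4.5.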

[29] MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service

Yizhe Huang,Yang Liu,Ruiyu Zhao,Xiaolong Zhong,Xingming Yue,Ling Jiang

Main category: cs.CL

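TL;DR: MemOrb is a lightweight, plug-and-play memory module that uses structured reflection to improve the task success rate and cross-trial consistency of LLM agents in customer service.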

Details Motivation: Existing LLM agents in customer service forget across sessions, repeat errors, and lack mechanisms for continual self-improvement, making them unreliable in dynamic settings. Method: MemOrb is a verbal-reinforcement memory layer that distills multi-turn interactions into strategy reflections, stores them in a shared memory bank, and retrieves them to guide decisions, enabling continual learning without fine-tuning. Result: Experiments show that MemOrb significantly improves task success rate (by up to 63 percentage points) and consistency across repeated trials. Conclusion: Structured reflection is an effective mechanism for improving the long-term reliability of frozen LLM agents, especially in customer service settings that require stability and consistency. Abstract: Large Language Model-based agents (LLM-based agents) are increasingly deployed in customer service, yet they often forget across sessions, repeat errors, and lack mechanisms for continual self-improvement. This makes them unreliable in dynamic settings where stability and consistency are critical. To better evaluate these properties, we emphasize two indicators: task success rate as a measure of overall effectiveness, and consistency metrics such as Pass$^k$ to capture reliability across multiple trials. To address the limitations of existing approaches, we propose MemOrb, a lightweight and plug-and-play verbal reinforcement memory layer that distills multi-turn interactions into compact strategy reflections. These reflections are stored in a shared memory bank and retrieved to guide decision-making, without requiring any fine-tuning. Experiments show that MemOrb significantly improves both success rate and stability, achieving up to a 63 percentage-point gain in multi-turn success rate and delivering more consistent performance across repeated trials. Our results demonstrate that structured reflection is a powerful mechanism for enhancing long-term reliability of frozen LLM agents in customer service scenarios.

[30] LOTUSDIS: A Thai far-field meeting corpus for robust conversational ASR

Pattara Tipaksorn,Sumonmas Thatphithakkul,Vataya Chunwijitra,Kwanchiva Thangthai

Main category: cs.CL

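TL;DR: LOTUSDIS is a publicly available Thai meeting corpus containing 114 hours of real far-field conversation, intended to advance research on far-field speech recognition.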

Details Motivation: Existing pre-trained speech recognition models degrade sharply on far-field, multi-speaker speech with natural overlap, and non-English languages such as Thai lack suitable far-field datasets. A Thai speech dataset that reflects real far-field challenges is therefore needed. Method: Spontaneous 15-20 minute meetings with three participants were recorded simultaneously by nine single-channel devices (six microphone types) at distances from 0.12 m to 10 m, yielding 114 hours of speech with natural reverberation, noise, and device variation. Standard splits and a reproducible baseline system are provided. Result: Benchmarks show that off-the-shelf models degrade severely at long distances. After fine-tuning on LOTUSDIS, overall WER drops from 64.3 to 38.3 and far-field WER from 81.6 to 49.5, with especially large gains on the most distant microphones. Conclusion: Distance-diverse training data is essential for robust far-field ASR, and LOTUSDIS is an important resource for reproducible research on Thai and similar settings. Abstract: We present LOTUSDIS, a publicly available Thai meeting corpus designed to advance far-field conversational ASR. The dataset comprises 114 hours of spontaneous, unscripted dialogue collected in 15-20 minute sessions with three participants, where overlapping speech is frequent and natural. Speech was recorded simultaneously by nine independent single-channel devices spanning six microphone types at distances from 0.12 m to 10 m, preserving the authentic effects of reverberation, noise, and device coloration without relying on microphone arrays. We provide standard train, dev, test splits and release a reproducible baseline system. We benchmarked several Whisper variants under zero-shot and fine-tuned conditions. Off-the-shelf models showed strong degradation with distance, confirming a mismatch between pre-training data and Thai far-field speech. Fine-tuning on LOTUSDIS dramatically improved robustness: a Thai Whisper baseline reduced overall WER from 64.3 to 38.3 and far-field WER from 81.6 to 49.5, with especially large gains on the most distant microphones. These results underscore the importance of distance-diverse training data for robust ASR. The corpus is available under CC-BY-SA 4.0. We also release training and evaluation scripts as a baseline system to promote reproducible research in this field.
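
The WER figures quoted above follow the usual definition: word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below computes it as a fraction (the abstract reports percentages). It assumes both strings are already segmented into space-separated words; Thai text is not written with spaces between words, so a word segmentation step, not shown here and not specified in the abstract, would be needed first.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Both inputs are assumed to be pre-tokenized, space-separated word strings.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words (single-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For the reference `a b c d` and hypothesis `a x c`, one substitution and one deletion give a WER of 0.5.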

[31] Global-Recent Semantic Reasoning on Dynamic Text-Attributed Graphs with Large Language Models

Yunan Wang,Jianxin Li,Ziwei Zhang

Main category: cs.CL

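TL;DR: This paper proposes DyGRASP, a new method for dynamic text-attributed graphs (DyTAGs) that combines large language models (LLMs) with temporal graph neural networks (temporal GNNs) to capture both recent and global semantic dynamics of nodes, significantly outperforming existing methods on multiple benchmark tasks.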

Details Motivation: Existing methods mainly target static text-attributed graphs, cannot effectively model the recent-global temporal semantics of dynamic graphs, and are inefficient when processing large volumes of evolving text. Method: A node-centric implicit reasoning method with a sliding window mechanism captures recent semantics; explicit reasoning with tailored prompts and an RNN-like chain structure captures global semantics; updating and merging layers fuse recent and global semantics with dynamic structural information. Result: Experiments on DyTAG benchmark datasets show that DyGRASP improves Hit@10 on the destination node retrieval task by up to 34% and generalizes well across different temporal GNNs and LLMs. Conclusion: DyGRASP addresses the challenge of modeling recent-global semantics in DyTAGs, balances efficiency and performance, and provides an effective new framework for analyzing dynamic text graphs. Abstract: Dynamic Text-Attributed Graphs (DyTAGs), characterized by time-evolving graph interactions and associated text attributes, are prevalent in real-world applications. Existing methods, such as Graph Neural Networks (GNNs) and Large Language Models (LLMs), mostly focus on static TAGs. Extending these existing methods to DyTAGs is challenging as they largely neglect the recent-global temporal semantics: the recent semantic dependencies among interaction texts and the global semantic evolution of nodes over time. Furthermore, applying LLMs to the abundant and evolving text in DyTAGs faces efficiency issues. To tackle these challenges, we propose Dynamic Global-Recent Adaptive Semantic Processing (DyGRASP), a novel method that leverages LLMs and temporal GNNs to efficiently and effectively reason on DyTAGs. Specifically, we first design a node-centric implicit reasoning method together with a sliding window mechanism to efficiently capture recent temporal semantics. In addition, to capture global semantic dynamics of nodes, we leverage explicit reasoning with tailored prompts and an RNN-like chain structure to infer long-term semantics. Lastly, we intricately integrate the recent and global temporal semantics as well as the dynamic graph structural information using updating and merging layers. Extensive experiments on DyTAG benchmarks demonstrate DyGRASP's superiority, achieving up to 34% improvement in Hit@10 for the destination node retrieval task. Besides, DyGRASP exhibits strong generalization across different temporal GNNs and LLMs.
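
Hit@10, the metric quoted for destination node retrieval, is the fraction of queries whose true destination appears among the top 10 ranked candidates. A minimal sketch of that standard definition follows; the node identifiers in the usage note are placeholders, and the benchmark's own candidate sampling protocol is not described in the abstract.

```python
def hit_at_k(ranked_lists, targets, k=10):
    """Fraction of queries whose target appears in the top-k of its ranked list.

    ranked_lists: one list of candidate node ids per query, best first.
    targets: the true destination node id for each query.
    """
    if len(ranked_lists) != len(targets) or not targets:
        raise ValueError("need one non-empty ranked list per target")
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)
```

For two queries with rankings `["a", "b", "c"]` and `["d", "e", "f"]`, targets `"b"` and `"f"`, and `k=2`, only the first query is a hit, so the score is 0.5.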

[32] False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

Julie Kallini,Dan Jurafsky,Christopher Potts,Martijn Bartelds

Main category: cs.CL

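TL;DR: This paper studies how cross-lingual vocabulary overlap in multilingual subword tokenizers affects cross-lingual transfer, finding that overlap helps build cross-lingual semantic relationships and improves transfer performance.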

Details Motivation: The paper asks whether vocabulary overlap in multilingual models facilitates cross-lingual transfer or causes interference between languages, addressing the inconsistent conclusions of prior work caused by differing setups and confounders. Method: In a controlled experiment, bilingual autoregressive models are trained under systematically varied vocabulary overlap settings and analyzed across language pairs, including the effect of the semantic similarity of shared tokens. Result: Models with vocabulary overlap capture cross-lingual semantic relationships more strongly and outperform models without overlap on XNLI and XQuAD, and transfer generally improves as overlap increases. Conclusion: Vocabulary overlap benefits cross-lingual transfer, and keeping a substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers. Abstract: Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders, such as token frequency or subword segmentation granularity. To address this question, we devise a controlled experiment where we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension to understanding how overlap affects transfer: the semantic similarity of tokens shared across languages. We first analyze our models' hidden representations and find that overlap of any kind creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. On XNLI and XQuAD, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.

[33] When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

Yingming Zheng,Hanqi Li,Kai Yu,Lu Chen

Main category: cs.CL

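TL;DR: Long-context supervised fine-tuning (SFT) improves the performance of large language models on short-context tasks, the opposite of what is observed in pretraining. Both multi-head attention and the feed-forward network benefit, and the study reveals a bias in which long-context SFT favors contextual knowledge while short-context SFT favors parametric knowledge; hybrid training mitigates it.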

Details Motivation: The effect of long-context data in continued pretraining has been studied extensively, but its effect on short-context tasks in supervised fine-tuning (SFT) is unclear. This paper systematically investigates how SFT data length influences LLM behavior. Method: Two key components, Multi-Head Attention (MHA) and the Feed-Forward Network (FFN), are decoupled and analyzed independently, and their interaction under long-context SFT is studied to reveal the knowledge preference bias; the effectiveness of hybrid training is then verified. Result: Long-context SFT improves short-context task performance; MHA and FFN each benefit independently from long-context SFT; long-context SFT favors contextual knowledge while short-context SFT favors parametric knowledge, creating a knowledge preference bias; hybrid training mitigates this bias. Conclusion: Long-context SFT benefits short-context tasks, and the knowledge preference bias it introduces can be balanced with a hybrid training strategy, offering explainable guidance for fine-tuning large language models. Abstract: Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data have become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.

[34] Financial Risk Relation Identification through Dual-view Adaptation

Wei-Ning Chiu,Yu-Hsiang Wang,Andy Hsiao,Yu-Shiang Huang,Chuan-Ju Wang

Main category: cs.CL

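TL;DR: Proposes an unsupervised method based on Form 10-K filings that uses natural language processing to extract inter-firm risk relations, fine-tuning on chronological and lexical patterns to build a domain-specific financial encoder and introducing an interpretable risk relation score.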

Details Motivation: Traditional assessment of inter-firm risk relations relies on expert judgment and manual analysis, which is subjective, time-consuming, and hard to scale, so a systematic, scalable, automated method is needed. Method: Using the Form 10-K filings of listed companies as the data source, the approach applies unsupervised fine-tuning based on chronological and lexical patterns to build a domain-specific financial encoder that captures implicit and abstract risk connections, and it proposes a quantitative risk relation score. Result: Experiments show that the method outperforms strong baselines across multiple evaluation settings and identifies inter-firm risk transmission more accurately. Conclusion: The method extracts inter-firm risk relations automatically and interpretably, and shows promise for financial applications such as portfolio management and investment strategy. Abstract: A multitude of interconnected risk events -- ranging from regulatory changes to geopolitical tensions -- can trigger ripple effects across firms. Identifying inter-firm risk relations is thus crucial for applications like portfolio management and investment strategy. Traditionally, such assessments rely on expert judgment and manual analysis, which are, however, subjective, labor-intensive, and difficult to scale. To address this, we propose a systematic method for extracting inter-firm risk relations using Form 10-K filings -- authoritative, standardized financial documents -- as our data source. Leveraging recent advances in natural language processing, our approach captures implicit and abstract risk connections through unsupervised fine-tuning based on chronological and lexical patterns in the filings. This enables the development of a domain-specific financial encoder with a deeper contextual understanding and introduces a quantitative risk relation score for transparent, interpretable analysis. Extensive experiments demonstrate that our method outperforms strong baselines across multiple evaluation settings.

[35] AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Chen Liang,Zhaoqi Huang,Haofen Wang,Fu Chai,Chunying Yu,Huanhuan Wei,Zhengjie Liu,Yanpeng Li,Hongjun Wang,Ruifeng Luo,Xianzhong Zhao

Main category: cs.CL

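TL;DR: This paper presents AECBench, a comprehensive benchmark for evaluating large language models (LLMs) in the Architecture, Engineering, and Construction (AEC) field. It covers 23 tasks within a five-level cognitive framework and includes a 4,800-question dataset written by engineers and reviewed by experts. Using an LLM-as-a-Judge evaluation approach, it assesses nine mainstream LLMs and finds that the models do well at the memorization and understanding levels but fall significantly short in interpreting tabular knowledge, in complex reasoning and calculation, and in generating professional documents, which exposes the limitations of LLMs in safety-critical engineering applications.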

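Details Motivation: To evaluate the robustness and reliability of large language models (LLMs) in Architecture, Engineering, and Construction (AEC), a specialized and safety-sensitive domain where the performance of existing models is still unclear. Method: Build a comprehensive benchmark named AECBench with 23 representative tasks organized in a five-level cognitive framework (Knowledge Memorization, Understanding, Reasoning, Calculation, Application); assemble a 4,800-question multi-format dataset written by engineers and validated in a two-round expert review; and introduce an LLM-as-a-Judge approach based on expert-derived rubrics for scalable, consistent evaluation of long-form responses. Result: An evaluation of nine LLMs shows that the models perform well at the Knowledge Memorization and Understanding levels, but performance drops significantly on higher-level tasks such as interpreting tables in building codes, executing complex reasoning and calculation, and generating professional documents, which exposes key weaknesses of current LLMs in the AEC domain. Conclusion: AECBench offers a systematic tool for evaluating LLM capabilities in the AEC domain, reveals their limitations in safety-critical tasks, and lays the groundwork for future research on the reliability and robustness of LLMs in engineering practice. Abstract: Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from code retrieval to specialized document generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. 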
Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.

[36] Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing

Sabri Boughorbel,Fahim Dalvi,Nadir Durrani,Majd Hawasly

Main category: cs.CL

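TL;DR: This study uses model diffing to compare the mechanistic differences between Gemma-2-9b-it and its SimPO-enhanced variant, finding that SimPO substantially strengthens latent concepts tied to safety, multilingual capabilities, and instruction-following, while reducing emphasis on model self-reference and hallucination management.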

Details Motivation: As fine-tuning becomes the dominant paradigm for improving large language models, traditional benchmarks struggle to explain why one model outperforms another, so finer-grained analysis is needed to understand what changes inside a model during fine-tuning. Method: Apply model diffing, a mechanistic interpretability approach, and use crosscoders to identify and categorize the latent representations that differ between the two models. Result: The latent concepts acquired through SimPO mainly enhance safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%), while emphasis on model self-reference (-44.1%) and hallucination management (-68.5%) is reduced. Model diffing reveals concrete mechanistic changes beyond leaderboard metrics. Conclusion: Model diffing offers a transparent and targeted framework for understanding the specific capability changes that fine-tuning brings to large language models, going beyond the limits of traditional performance metrics. Abstract: As fine-tuning becomes the dominant paradigm for improving large language models (LLMs), understanding what changes during this process is increasingly important. Traditional benchmarking often fails to explain why one model outperforms another. In this work, we use model diffing, a mechanistic interpretability approach, to analyze the specific capability differences between Gemma-2-9b-it and a SimPO-enhanced variant. Using crosscoders, we identify and categorize latent representations that differentiate the two models. We find that SimPO-acquired latent concepts predominantly enhance safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%), while its additional training also reduces emphasis on model self-reference (-44.1%) and hallucination management (-68.5%). Our analysis shows that model diffing can yield fine-grained insights beyond leaderboard metrics, attributing performance gaps to concrete mechanistic capabilities. This approach offers a transparent and targeted framework for comparing LLMs.

[37] MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction

Liting Zhang,Shiwan Zhao,Aobo Kong,Qicheng Li

Main category: cs.CL

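TL;DR: This paper proposes MAPEX, the first framework to bring multi-agent collaboration to keyphrase extraction. Its modular design and dual-path strategy adapt dynamically to document length, and it outperforms existing unsupervised methods and standard LLM baselines on multiple benchmark datasets.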

Details Motivation: Existing prompt-based LLM methods for keyphrase extraction mostly use single-stage inference with uniform prompting, which fails to fully exploit the models' reasoning and generation capabilities, especially across documents of different lengths. Method: Propose the MAPEX framework, with modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing, plus a dual-path strategy: knowledge-driven extraction for short texts and topic-guided extraction for long texts, coordinated across multiple agents. Result: Experiments on six benchmark datasets and three LLMs show that MAPEX improves F1@5 on average by 2.44% over the state-of-the-art unsupervised method and by 4.01% over standard LLM baselines. Conclusion: Through multi-agent collaboration and dynamic adaptation, MAPEX improves keyphrase extraction performance and shows strong generalization and universality. Abstract: Keyphrase extraction is a fundamental task in natural language processing. However, existing unsupervised prompt-based methods for Large Language Models (LLMs) often rely on single-stage inference pipelines with uniform prompting, regardless of document length or LLM backbone. Such one-size-fits-all designs hinder the full exploitation of LLMs' reasoning and generation capabilities, especially given the complexity of keyphrase extraction across diverse scenarios. To address these challenges, we propose MAPEX, the first framework that introduces multi-agent collaboration into keyphrase extraction. MAPEX coordinates LLM-based agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing. A dual-path strategy dynamically adapts to document length: knowledge-driven extraction for short texts and topic-guided extraction for long texts. Extensive experiments on six benchmark datasets across three different LLMs demonstrate its strong generalization and universality, outperforming the state-of-the-art unsupervised method by 2.44% and standard LLM baselines by 4.01% in F1@5 on average. Code is available at https://github.com/NKU-LITI/MAPEX.
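The headline numbers above are reported in F1@5. As a rough illustration of what that metric measures, here is a minimal sketch that assumes exact matching after lowercasing and whitespace normalization; the function name `f1_at_k` and the normalization are assumptions for illustration, and keyphrase benchmarks often add stemming, which is omitted here.

```python
def f1_at_k(predicted, gold, k=5):
    """F1@k between a ranked list of predicted keyphrases and a gold set.

    Assumption: phrases match only if identical after lowercasing and
    collapsing whitespace (no stemming).
    """
    def norm(phrase):
        return " ".join(phrase.lower().split())

    # Keep the first k distinct predictions, preserving rank order.
    top_k = []
    for phrase in predicted:
        p = norm(phrase)
        if p not in top_k:
            top_k.append(p)
        if len(top_k) == k:
            break

    gold_set = {norm(g) for g in gold}
    if not top_k or not gold_set:
        return 0.0

    hits = sum(1 for p in top_k if p in gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

For example, if two of the top five predictions appear in a gold set of three phrases, precision is 2/5, recall is 2/3, and F1@5 is 0.5.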

[38] Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?

Damian Stachura,Joanna Konieczna,Artur Nowak

Main category: cs.CL

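TL;DR: This study compares open-weight large language models with proprietary models on biomedical question answering and finds that, with techniques such as retrieval augmentation, in-context learning, and ensembling, smaller open-weight models can in some cases match or even surpass proprietary models such as GPT-4 and Claude.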

Details Motivation: As open-weight LLMs advance, the question is whether they can effectively replace proprietary LLMs in specialized domains such as biomedical question answering. Method: Apply several techniques to improve question answering, including snippet retrieval based on embedding distance, in-context learning, and structured outputs, and use ensemble approaches in some submissions to combine the outputs of multiple models. Result: Open-weight LLMs perform strongly in BioASQ Task 13B Phase B, with some results surpassing proprietary models such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet, particularly when ensembling strategies are applied. Conclusion: Open-weight LLMs are now comparable to, and in some instances stronger than, proprietary models on biomedical question answering, which shows their potential in specific application scenarios. Abstract: Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at https://github.com/evidenceprime/BioASQ-13b.

[39] Multi-Hierarchical Feature Detection for Large Language Model Generated Text

Luyan Zhang,Xinyu Xie

Main category: cs.CL

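TL;DR: This paper proposes MHFD, an AI text detection method that fuses multi-hierarchical semantic, syntactic, and statistical features, but experiments show only limited gains (0.4-2.6%) at a substantially higher computational cost, suggesting that modern neural models already capture the signals needed for detection efficiently.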

Details Motivation: To test whether multi-feature fusion significantly outperforms single neural models at detecting text generated by modern LLMs, and whether its theoretical advantage holds in practice. Method: Propose the MHFD framework, which integrates DeBERTa-based semantic analysis, syntactic parsing, and statistical probability features through an adaptive fusion strategy. Result: On multiple benchmark datasets, MHFD reaches 89.7% accuracy in in-domain detection and maintains 84.2% in cross-domain detection, an improvement of only 0.4-2.6% over existing methods, with a 4.2x computational overhead. Conclusion: Multi-feature fusion brings limited gains for AI text detection; modern neural models are already efficient on their own, the added complexity rarely pays off, and the practicality of multi-feature approaches should be re-evaluated. Abstract: With the rapid advancement of large language model technology, there is growing interest in whether multi-feature approaches can significantly improve AI text detection beyond what single neural models achieve. While intuition suggests that combining semantic, syntactic, and statistical features should provide complementary signals, this assumption has not been rigorously tested with modern LLM-generated text. This paper provides a systematic empirical investigation of multi-hierarchical feature integration for AI text detection, specifically testing whether the computational overhead of combining multiple feature types is justified by performance gains. We implement MHFD (Multi-Hierarchical Feature Detection), integrating DeBERTa-based semantic analysis, syntactic parsing, and statistical probability features through adaptive fusion. Our investigation reveals important negative results: despite theoretical expectations, multi-feature integration provides minimal benefits (0.4-0.5% improvement) while incurring substantial computational costs (4.2x overhead), suggesting that modern neural language models may already capture most relevant detection signals efficiently. Experimental results on multiple benchmark datasets demonstrate that the MHFD method achieves 89.7% accuracy in in-domain detection and maintains 84.2% stable performance in cross-domain detection, showing modest improvements of 0.4-2.6% over existing methods.

[40] Diversity Boosts AI-Generated Text Detection

Advik Raj Basani,Pin-Yu Chen

Main category: cs.CL

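TL;DR: Proposes DivEye, an interpretable framework for detecting AI-generated text based on surprisal features. It outperforms existing zero-shot detectors on multiple benchmarks and offers interpretable insights into why a text is flagged.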

Details Motivation: Existing detectors of AI-generated text rely on token-level likelihoods or opaque black-box classifiers, which struggle against high-quality generations and offer little interpretability. Method: Capture how lexical and structural unpredictability fluctuates across a text and detect AI-generated text with interpretable, surprisal-based statistical features. Result: On multiple benchmarks the method outperforms existing zero-shot detectors by up to 33.2%, improves existing detectors by up to 18.7% when used as an auxiliary signal, and is robust to paraphrasing and adversarial attacks. Conclusion: DivEye combines strong detection performance with good interpretability and shows that rhythmic unpredictability is a powerful and underexplored signal for LLM detection. Abstract: Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.
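To make "surprisal-based features" concrete, the sketch below computes a few summary statistics over per-token surprisal. It assumes the per-token probabilities have already been obtained from some language model; the function name `surprisal_features` and the particular feature set are illustrative assumptions, not the features DivEye actually uses.

```python
import math
from statistics import mean, pvariance


def surprisal_features(token_probs):
    """Summarize how unpredictability fluctuates across a text.

    token_probs: probabilities (0 < p <= 1) that a language model assigned
    to each observed token. Where they come from is outside this sketch.
    """
    # Surprisal in bits: rarer tokens carry higher surprisal.
    surprisal = [-math.log2(p) for p in token_probs]
    # First differences capture token-to-token swings in unpredictability.
    deltas = [b - a for a, b in zip(surprisal, surprisal[1:])]
    return {
        "mean": mean(surprisal),
        "variance": pvariance(surprisal),
        "mean_abs_delta": mean([abs(d) for d in deltas]) if deltas else 0.0,
        "max": max(surprisal),
    }
```

Under the abstract's observation, human-written text would tend to show larger variance and larger swings than LLM output; a classifier over such features is the natural next step and is not shown here.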

[41] Extractive Fact Decomposition for Interpretable Natural Language Inference in one Forward Pass

Nicholas Popovič,Michael Färber

Main category: cs.CL

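TL;DR: This paper proposes JEDI, an encoder-only architecture that jointly performs extractive atomic fact decomposition and interpretable inference without relying on generative models at inference time. Trained on a large corpus of synthetic rationales, JEDI achieves competitive in-distribution accuracy and improved out-of-distribution robustness.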

Details Motivation: Existing natural language inference and fact-checking approaches rely on resource-intensive generative LLMs for atomic fact decomposition, which limits efficiency and scalability, so a more efficient method that needs no generative model is required. Method: Propose JEDI, an encoder-only architecture trained on synthetic rationale-annotated data to jointly learn extractive atomic fact decomposition and interpretable inference. Result: Experiments show that JEDI achieves competitive in-distribution performance on multiple NLI benchmarks and significantly outperforms models based solely on extractive rationale supervision in out-of-distribution and adversarial settings. Conclusion: Encoder-only architectures combined with synthetic rationales can deliver both interpretability and robust generalization in NLI. Abstract: Recent works in Natural Language Inference (NLI) and related tasks, such as automated fact-checking, employ atomic fact decomposition to enhance interpretability and robustness. For this, existing methods rely on resource-intensive generative large language models (LLMs) to perform decomposition. We propose JEDI, an encoder-only architecture that jointly performs extractive atomic fact decomposition and interpretable inference without requiring generative models during inference. To facilitate training, we produce a large corpus of synthetic rationales covering multiple NLI benchmarks. Experimental results demonstrate that JEDI achieves competitive accuracy in distribution and significantly improves robustness out of distribution and in adversarial settings over models based solely on extractive rationale supervision. Our findings show that interpretability and robust generalization in NLI can be realized using encoder-only architectures and synthetic rationales. Code and data available at https://jedi.nicpopovic.com

[42] DTW-Align: Bridging the Modality Gap in End-to-End Speech Translation with Dynamic Time Warping Alignment

Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis

Main category: cs.CL

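TL;DR: This paper uses Dynamic Time Warping (DTW) to align speech and text embeddings in end-to-end speech translation, narrowing the modality gap, improving alignment accuracy, and outperforming previous methods in low-resource settings.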

Details Motivation: The representation discrepancy between the speech and text modalities creates a modality gap, and existing alignment methods depend on alignment tools or nearest-neighbor search, which either do not cover all languages or align inaccurately. Method: Adapt Dynamic Time Warping (DTW) to align speech and text embeddings during training, without an external alignment tool. Result: Experiments show that the method produces more accurate alignments, outperforms previous work in low-resource settings on 5 out of 6 language directions, and achieves comparable E2E-ST results while being significantly faster. Conclusion: DTW is an efficient and accurate way to align speech and text, bridges the modality gap effectively, and is especially suited to low-resource speech translation. Abstract: End-to-End Speech Translation (E2E-ST) is the task of translating source speech directly into target text bypassing the intermediate transcription step. The representation discrepancy between the speech and text modalities has motivated research on what is known as bridging the modality gap. State-of-the-art methods addressed this by aligning speech and text representations on the word or token level. Unfortunately, this requires an alignment tool that is not available for all languages. Although this issue has been addressed by aligning speech and text embeddings using nearest-neighbor similarity search, it does not lead to accurate alignments. In this work, we adapt Dynamic Time Warping (DTW) for aligning speech and text embeddings during training. Our experiments demonstrate the effectiveness of our method in bridging the modality gap in E2E-ST. Compared to previous work, our method produces more accurate alignments and achieves comparable E2E-ST results while being significantly faster. Furthermore, our method outperforms previous work in low resource settings on 5 out of 6 language directions.
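For readers unfamiliar with DTW, the sketch below shows the classic dynamic-programming form applied to two sequences of embedding vectors. It is a textbook version under stated assumptions (cosine distance as the local cost, hypothetical names `cosine_distance` and `dtw_align`) and is not the training-time adaptation the paper describes.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_align(speech, text, dist=cosine_distance):
    """Return (total_cost, path) for a monotonic alignment of two sequences.

    path is a list of (speech_index, text_index) pairs from start to end.
    """
    n, m = len(speech), len(text)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i frames with the first j tokens.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(speech[i - 1], text[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

Because speech sequences are usually much longer than token sequences, several speech frames typically map to one token, which the many-to-one steps in the path accommodate.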

[43] Investigating Test-Time Scaling with Reranking for Machine Translation

Shaomu Tan,Ryosuke Mitani,Ritvik Choudhary,Toshiyuki Sekiya

Main category: cs.CL

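TL;DR: This paper presents the first systematic study of test-time scaling (TTS) for machine translation. TTS improves translation quality for high-resource languages, and smaller models with large N can match or even surpass larger models; under a fixed compute budget, however, larger models are more efficient, and in low-resource cases TTS can degrade quality because of metric blind spots.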

Details Motivation: Scaling model parameters is the usual way to improve NLP systems but is computationally expensive. Test-time scaling (TTS) offers an alternative that generates multiple candidates at inference and selects the best, yet it has not been systematically studied for machine translation. Method: Use a simple best-of-N framework on WMT24 benchmarks, covering six high-resource language pairs and one low-resource language pair, five model sizes (3B-72B), and various TTS compute budgets (N up to 1024). Result: a) TTS improves translation quality for high-resource languages; b) smaller models with large N can match or surpass larger models at N=1, but at a higher compute cost; c) under fixed compute budgets larger models are typically more efficient, and in low-resource cases TTS can degrade translation quality because of metric blind spots. Conclusion: TTS is an effective way to improve machine translation, particularly for high-resource languages, but it should be used with caution in low-resource cases and paired with more reliable evaluation to avoid blind spots. Abstract: Scaling model parameters has become the de facto strategy for improving NLP systems, but it comes with substantial computational costs. Test-Time Scaling (TTS) offers an alternative by allocating more computation at inference: generating multiple candidates and selecting the best. While effective in tasks such as mathematical reasoning, TTS has not been systematically explored for machine translation (MT). In this paper, we present the first systematic study of TTS for MT, investigating a simple but practical best-of-N framework on WMT24 benchmarks. Our experiments cover six high-resource language pairs and one low-resource language pair, five model sizes (3B-72B), and various TTS compute budgets (N up to 1024). Our results show that a) For high-resource languages, TTS generally improves translation quality according to multiple neural MT evaluation metrics, and our human evaluation confirms these gains; b) Augmenting smaller models with large N can match or surpass larger models at N=1, at a higher compute cost; c) Under fixed compute budgets, larger models are typically more efficient, and TTS can degrade quality due to metric blind spots in low-resource cases.
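The best-of-N framework mentioned above can be sketched in a few lines. The callables `generate` and `score` are hypothetical placeholders: in practice the first would sample a translation from an MT model and the second would be a quality-estimation metric, neither of which is specified here.

```python
def best_of_n(source, generate, score, n=8):
    """Sample n candidate translations and keep the highest-scoring one.

    generate(source, seed) -> candidate string   (hypothetical sampler)
    score(source, candidate) -> float            (hypothetical quality metric)
    """
    candidates = [generate(source, seed) for seed in range(n)]
    scored = [(score(source, cand), cand) for cand in candidates]
    # max() with a key keeps the first candidate among ties.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

The cost grows linearly with N, since both generation and scoring run once per candidate, which is why the fixed-budget comparison in the abstract matters. The selection is also only as good as `score`: if the metric has blind spots, picking its maximum can amplify them.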

[44] Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus

Chiara Alzetta,Serena Auriemma,Alessandro Bondielli,Luca Dini,Chiara Fazzone,Alessio Miaschi,Martina Miliani,Marta Sartor

Main category: cs.CL

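TL;DR: This paper tracks research trends in Italian computational linguistics and natural language processing by analyzing CLiC-it conference papers from 2014 to 2024. It builds the CLiC-it Corpus and analyzes author provenance, gender, affiliations, and paper content to chart how the field has developed.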

Details Motivation: With the rise of Transformer-based large language models, research priorities in computational linguistics and NLP have shifted, and a systematic view of the trends and directions in the Italian research community is needed. Method: Collect the paper metadata and content from the first 10 editions of the CLiC-it conference, compile them into the CLiC-it Corpus, and analyze both the metadata and the topics of the papers. Result: The analysis reveals how the Italian CL/NLP community has evolved in research topics, author composition, and institutional distribution, in particular the growth of language modelling and multimodality. Conclusion: The study gives the Italian and international research communities valuable insights that support informed decisions and future research directions. Abstract: Over the past decade, Computational Linguistics (CL) and Natural Language Processing (NLP) have evolved rapidly, especially with the advent of Transformer-based Large Language Models (LLMs). This shift has transformed research goals and priorities, from Lexical and Semantic Resources to Language Modelling and Multimodality. In this study, we track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it, arguably the leading Italian conference in the field. We compile the proceedings from the first 10 editions of the CLiC-it conference (from 2014 to 2024) into the CLiC-it Corpus, providing a comprehensive analysis of both its metadata, including author provenance, gender, affiliations, and more, as well as the content of the papers themselves, which address various topics. Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time, supporting informed decisions and future directions in the field.

[45] Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering

Alireza Salemi,Cheng Li,Mingyang Zhang,Qiaozhu Mei,Zhuowan Li,Spurthi Amba Hombaiah,Weize Kong,Tao Chen,Hamed Zamani,Michael Bendersky

Main category: cs.CL

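TL;DR: Proposes Pathways of Thoughts (PoT), an inference-stage method that improves the personalization of question answering systems and applies to any large language model without task-specific fine-tuning.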

Details Motivation: Personalized question answering is essential for meeting user-specific information needs, but progress has been limited by the difficulty of inferring preferences from long, noisy, and implicit contexts. Method: Model the reasoning of an LLM as an iterative decision process that dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification; explore multiple reasoning trajectories to produce diverse candidate responses; then aggregate and reweight the candidates according to inferred user preferences to produce the final personalized response. Result: Experiments on the LaMP-QA benchmark show that PoT consistently outperforms competitive baselines, with up to a 13.1% relative improvement; in human evaluation, annotators prefer PoT outputs in 66% of cases and report ties in only 15%. Conclusion: By combining diverse reasoning paths, PoT improves LLM performance on personalized question answering without fine-tuning. Abstract: Personalization is essential for adapting question answering (QA) systems to user-specific information needs, thereby improving both accuracy and user satisfaction. However, personalized QA remains relatively underexplored due to challenges such as inferring preferences from long, noisy, and implicit contexts, and generating responses that are simultaneously correct, contextually appropriate, and aligned with user expectations and background knowledge. To address these challenges, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. The approach models the reasoning of an LLM as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark for personalized QA show that PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement. Human evaluation corroborates these results, with annotators preferring outputs from PoT in 66% of cases and reporting ties in only 15% of cases.

[46] Are most sentences unique? An empirical examination of Chomskyan claims

Hiram Ring

Main category: cs.CL

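TL;DR: This paper uses the NLTK library to analyze large corpora of different genres and tests the linguistic claim that most utterances are brand-new combinations, finding that although unique sentences are often the majority, duplicate sentences make up a non-negligible share of every corpus.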

Details Motivation: To test whether the widespread linguistic claim that most sentences are brand-new combinations actually holds. Method: Parse corpora of several genres with the NLTK Python library and count the exact string matches among sentences in each. Result: Completely unique sentences are often the majority in a corpus, but this is strongly constrained by genre, and duplicate sentences are a non-negligible part of every individual corpus. Conclusion: Sentence uniqueness depends heavily on genre and duplicate sentences play a significant role in language use, so it cannot simply be assumed that most sentences are unprecedented combinations. Abstract: A repeated claim in linguistics is that the majority of linguistic utterances are unique. For example, Pinker (1994: 10), summarizing an argument by Noam Chomsky, states that "virtually every sentence that a person utters or understands is a brand-new combination of words, appearing for the first time in the history of the universe." With the increased availability of large corpora, this is a claim that can be empirically investigated. The current paper addresses the question by using the NLTK Python library to parse corpora of different genres, providing counts of exact string matches in each. Results show that while completely unique sentences are often the majority of corpora, this is highly constrained by genre, and that duplicate sentences are not an insignificant part of any individual corpus.
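The counting procedure described above amounts to tallying exact string matches between sentences. The sketch below illustrates the idea with the standard library only; the paper uses NLTK for parsing, whereas this example assumes a naive regex sentence splitter, and the function name `sentence_uniqueness` is hypothetical.

```python
import re
from collections import Counter


def sentence_uniqueness(text):
    """Count how many sentences in a text occur exactly once.

    Assumption: sentences end at '.', '!' or '?' followed by whitespace.
    A real study would use a proper sentence tokenizer.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(sentences)
    total = len(sentences)
    # Sentences whose exact string appears only once in the corpus.
    unique = sum(1 for c in counts.values() if c == 1)
    return {
        "total": total,
        "distinct": len(counts),
        "unique": unique,
        # Share of sentence occurrences that repeat some other occurrence.
        "duplicate_share": (total - unique) / total if total else 0.0,
    }
```

Short formulaic sentences such as greetings repeat often, which is one plausible reason the duplicate share would vary by genre.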

[47] Human-Annotated NER Dataset for the Kyrgyz Language

Timur Turatali,Anton Alekseev,Gulira Jumalieva,Gulnara Kabaeva,Sergey Nikolenko

Main category: cs.CL

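TL;DR: This paper introduces KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language, comprising 1,499 news articles and 39,075 entity mentions across 27 classes, and evaluates several NER models on it.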

Details Motivation: Kyrgyz, a low-resource language, lacks high-quality named entity recognition data, which holds back its natural language processing technology. Method: Build a large manually annotated NER dataset, design an annotation scheme covering 27 classes, and evaluate several approaches, including CRF-based models and multilingual Transformer models. Result: The multilingual RoBERTa model performs best and other multilingual models yield comparable results; all models still struggle with rare entity categories. Conclusion: Multilingual pretrained models show promise for low-resource language processing, and more granular annotation schemes may further improve Kyrgyz language processing. Abstract: We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the 24.KG news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. We show our annotation scheme, discuss the challenges encountered in the annotation process, and present the descriptive statistics. We also evaluate several named entity recognition models, including traditional sequence labeling approaches based on conditional random fields and state-of-the-art multilingual transformer-based models fine-tuned on our dataset. While all models show difficulties with rare entity categories, models such as the multilingual RoBERTa variant pretrained on a large corpus across many languages achieve a promising balance between precision and recall. These findings emphasize both the challenges and opportunities of using multilingual pretrained models for processing languages with limited resources. Although the multilingual RoBERTa model performed best, other multilingual models yielded comparable results. This suggests that future work exploring more granular annotation schemes may offer deeper insights for Kyrgyz language processing pipelines evaluation.

[48] Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering

Kun Zhu,Lizi Liao,Yuxuan Gu,Lei Huang,Xiaocheng Feng,Bing Qin

Main category: cs.CL

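TL;DR: Proposes an LLM-based, context-aware hierarchical taxonomy generation framework that uses multi-aspect encoding and dynamic clustering to improve the coherence, granularity, and interpretability of scientific literature taxonomies.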

Details Motivation: Existing taxonomy construction methods lack coherence and granularity and cannot effectively organize the rapidly growing scientific literature. Method: Use LLMs to identify the key aspects of each paper (e.g., methodology, dataset, evaluation), generate aspect-specific summaries, then encode and dynamically cluster them along each aspect to build a hierarchical taxonomy. Result: On a new benchmark of 156 expert-crafted taxonomies covering 11.6k papers, the method significantly outperforms prior approaches in taxonomy coherence, granularity, and interpretability, reaching state-of-the-art performance. Conclusion: The proposed framework generates high-quality hierarchical taxonomies of scientific literature and offers a workable solution for organizing and surveying research at scale. Abstract: The rapid growth of scientific literature demands efficient methods to organize and synthesize research findings. Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models (LLMs), often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering. Our method leverages LLMs to identify key aspects of each paper (e.g., methodology, dataset, evaluation) and generates aspect-specific paper summaries, which are then encoded and clustered along each aspect to form a coherent hierarchy. In addition, we introduce a new evaluation benchmark of 156 expert-crafted taxonomies encompassing 11.6k papers, providing the first naturally annotated dataset for this task. Experimental results demonstrate that our method significantly outperforms prior approaches, achieving state-of-the-art performance in taxonomy coherence, granularity, and interpretability.

[49] Anecdoctoring: Automated Red-Teaming Across Language and Place

Alejandro Cuevas,Saloni Dash,Bharat Kumar Nayak,Dan Vann,Madeleine I. G. Daepp

Main category: cs.CL

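TL;DR: Proposes "anecdoctoring", a new method that automatically generates adversarial prompts across languages and cultures to assess the disinformation risks of generative AI.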

Details Motivation: Existing red-teaming datasets are mostly US- and English-centric and lack cross-cultural and multilingual coverage, so they cannot adequately address the global risk of generative AI misuse. Method: Collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India), cluster them into broader narratives, characterize the clusters with knowledge graphs, and use these to augment an attacker LLM that generates adversarial prompts. Result: The method achieves higher attack success rates than few-shot prompting and offers better interpretability, supporting the value of cross-cultural red-teaming. Conclusion: Disinformation mitigations are needed that scale globally, are grounded in real-world adversarial misuse, and support multiple languages and cultures. Abstract: Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose "anecdoctoring", a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.

[50] Measuring AI "Slop" in Text

Chantal Shaib,Tuhin Chakrabarty,Diego Garcia-Olano,Byron C. Wallace

Main category: cs.CL

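TL;DR: Through expert interviews and text annotation, this paper proposes a taxonomy of "slop" in AI-generated text and a set of interpretable dimensions for assessing it, finding that although "slop" judgments are subjective, they still correlate with latent dimensions such as coherence and relevance.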

Details Motivation: There is no agreed definition of low-quality AI-generated text ("slop") and no means of measuring it. Method: Build a taxonomy of "slop" from interviews with experts in NLP, writing, and philosophy, and carry out span-level annotation to analyze its characteristics. Result: The paper proposes interpretable assessment dimensions and finds that binary "slop" judgments are subjective but correlate with latent dimensions such as coherence and relevance. Conclusion: The proposed framework can be used to evaluate the quality of AI-generated text and offers a new view of the linguistic and stylistic factors that shape quality judgments. Abstract: AI "slop" is an increasingly popular term used to describe low-quality AI-generated text, but there is currently no agreed upon definition of this term nor a means to measure its occurrence. In this work, we develop a taxonomy of "slop" through interviews with experts in NLP, writing, and philosophy, and propose a set of interpretable dimensions for its assessment in text. Through span-level annotation, we find that binary "slop" judgments are (somewhat) subjective, but such determinations nonetheless correlate with latent dimensions such as coherence and relevance. Our framework can be used to evaluate AI-generated text in both detection and binary preference tasks, potentially offering new insights into the linguistic and stylistic factors that contribute to quality judgments.

[51] Soft Tokens, Hard Truths

Natasha Butt,Ariel Kwiatkowski,Ismail Labiad,Julia Kempe,Yann Ollivier

Main category: cs.CL

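TL;DR: This paper proposes a scalable way to learn continuous chain-of-thought (continuous CoT) via reinforcement learning, without distilling from discrete CoTs. It uses "soft" tokens and noise on the input embedding to encourage exploration, shows greater diversity on math reasoning tasks, and preserves the base model's predictions better than conventional training.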

Details Motivation: Continuous tokens are believed to offer greater expressivity and more efficient reasoning than discrete tokens, but training difficulties have limited their practical use, so a scalable and efficient training method is needed. Method: Train continuous CoTs with reinforcement learning, using soft tokens (mixtures of tokens) and noise on the input embedding to drive exploration, without relying on distillation from discrete CoTs. Result: On Llama and Qwen models up to 8B, training with continuous CoTs matches discrete CoTs on pass@1 and surpasses them on pass@32, indicating greater diversity of reasoning paths; the best-performing setup trains with continuous tokens and uses discrete tokens at inference; and base-model performance on out-of-domain tasks is better preserved. Conclusion: Continuous CoTs can be trained effectively with reinforcement learning, combine high expressivity with deployment compatibility, and offer a gentler, efficient way to enhance reasoning. Abstract: The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. 
Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
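One way to picture a "soft" token as described above: rather than feeding back the embedding of a single sampled token, feed back the probability-weighted mixture of all token embeddings, plus some noise for exploration. The sketch below is a toy illustration in plain Python; the function names, the temperature parameter, and the Gaussian form and scale of the noise are assumptions rather than the paper's exact recipe.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(logits, embedding_table, noise_std=0.0, temperature=1.0, rng=None):
    """Mix token embeddings by their predicted probabilities, then add noise.

    embedding_table[i] is the embedding of vocabulary item i.
    noise_std is a hypothetical knob for the exploration noise.
    """
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    mixed = [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, noise_std) for x in mixed]
```

When the distribution is sharply peaked, the mixture collapses to the embedding of the most likely token, which is consistent with the abstract's finding that models trained this way can still be run with discrete tokens at inference.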

[52] Online Process Reward Learning for Agentic Reinforcement Learning

Xiaoqian Liu,Ke Wang,Yuchuan Wu,Fei Huang,Yongbin Li,Junge Zhang,Jianbin Jiao

Main category: cs.CL

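TL;DR: This paper proposes Online Process Reward Learning (OPRL), a credit-assignment method for agentic reinforcement learning. An implicit process reward model converts trajectory preferences into step-level rewards, which are combined with outcome rewards for policy updates, and the method outperforms frontier LLMs and strong baselines on multiple benchmarks.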

Details Motivation: When LLMs are trained with reinforcement learning as autonomous agents in long-horizon interactive environments, rewards are sparse and hard to verify, which makes temporal credit assignment difficult, and existing process supervision methods suffer from biased annotation, reward hacking, or high variance, so a more reliable and efficient credit-assignment mechanism is needed. Method: OPRL alternately optimizes an implicit process reward model (PRM) and the agent's policy, uses a trajectory-based DPO objective to turn trajectory preferences into implicit step rewards, computes step-level advantages from those rewards, and combines them with episode-level advantages from outcome rewards for the policy update, without additional rollouts or explicit step labels. Result: On three different agent benchmarks (WebShop, VisualSokoban, and SOTOPIA), OPRL outperforms frontier LLMs and strong RL baselines, with higher sample efficiency, lower training variance, and efficient exploration using fewer actions. Conclusion: OPRL is an effective credit-assignment framework for agentic reinforcement learning that stabilizes training and improves performance; theory guarantees that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, and the method suits complex agent learning in real-world scenarios. Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high variance from overly fine-grained signals, or failures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent's policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for policy update, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. 
Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and lower variance during training. Further analysis also demonstrates the efficient exploration by OPRL using fewer actions, underscoring its potential for agentic learning in real-world scenarios.
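
The abstract describes combining step-level advantages from implicit step rewards with episode-level advantages from outcome rewards, but it gives no formulas. The following is only a minimal sketch under stated assumptions: the step reward is taken to be a scaled log-ratio between the PRM and a reference policy (a form commonly used for implicit reward models), and the two advantages are simply added. The function names, `beta`, and `step_weight` are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def implicit_step_rewards(logp_prm, logp_ref, beta=0.1):
    """Assumed form: reward_t = beta * (log pi_prm(a_t|s_t) - log pi_ref(a_t|s_t))."""
    return [beta * (lp - lr) for lp, lr in zip(logp_prm, logp_ref)]


def normalize(values, eps=1e-8):
    """Standardize a list of numbers to zero mean and (roughly) unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(outcome_rewards, step_rewards, step_weight=1.0):
    """Add an episode-level advantage to every step-level advantage.

    outcome_rewards: one scalar per trajectory in the sampled group.
    step_rewards:    one list of per-step rewards per trajectory.
    """
    # Episode-level advantage: outcome reward standardized across the group.
    episode_adv = normalize(outcome_rewards)
    # Step-level advantage: step rewards standardized over all steps in the group.
    flat = normalize([r for traj in step_rewards for r in traj])
    combined, start = [], 0
    for a_ep, traj in zip(episode_adv, step_rewards):
        step_adv = flat[start:start + len(traj)]
        start += len(traj)
        combined.append([a_ep + step_weight * a for a in step_adv])
    return combined
```

In this sketch every step of a trajectory inherits the same episode-level term, while the step-level term separates individual actions within it.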

[53] Steering Multimodal Large Language Models Decoding for Context-Aware Safety

Zheyuan Liu,Zhangchen Xu,Guangyao Dou,Xiangchi Yuan,Zhaoxuan Tan,Radha Poovendran,Meng Jiang

Main category: cs.CL

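TL;DR: This paper proposes SafeCoDe, a lightweight, model-agnostic decoding framework that uses contrastive decoding and a global-aware token modulation strategy to improve the safety decisions of multimodal large language models in visual contexts.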

Details Motivation: Existing safety methods for multimodal large language models struggle to balance oversensitivity and undersensitivity, and they lack a context-aware safety alignment mechanism. Method: SafeCoDe works in two stages. First, contrastive decoding between the real image and a Gaussian-noised image identifies tokens that are sensitive to visual context. Second, scene-level reasoning is combined with token-level adjustment to modulate generation dynamically. Result: Experiments across multiple MLLM architectures and safety benchmarks show that SafeCoDe reduces both undersensitivity and oversensitivity, improves context-dependent refusal behavior, and preserves model helpfulness. Conclusion: SafeCoDe strengthens the safety of multimodal large language models in complex visual scenes and is both general and practical. Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.
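
The first stage, contrasting next-token scores obtained with the real image against those obtained with a Gaussian-noised image, can be illustrated with a small sketch. The abstract does not state the exact scoring rule, so the code below assumes the linear contrast commonly used in visual contrastive decoding; `alpha` and the function names are hypothetical, and the second (global-aware modulation) stage is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_real, logits_noised, alpha=1.0):
    """Assumed contrast rule: (1 + alpha) * real - alpha * noised.

    Tokens whose score drops when the image is replaced by noise depend on
    the visual context, so the contrast raises them relative to the rest.
    """
    return [(1 + alpha) * r - alpha * n
            for r, n in zip(logits_real, logits_noised)]


# Toy vocabulary of three tokens; token 1 is the only one that needs the image.
real = [2.0, 1.0, 0.0]
noised = [2.0, 0.0, 0.0]
adjusted = contrastive_logits(real, noised, alpha=1.0)
probs_before = softmax(real)
probs_after = softmax(adjusted)
```

With these toy numbers the image-dependent token gains probability mass after the contrast, while tokens that score the same with or without the image keep their original logits.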

[54] Systematic Comparative Analysis of Large Pretrained Language Models on Contextualized Medication Event Extraction

Tariq Abdul-Quddoos,Xishuang Dong,Lijun Qian

Main category: cs.CL

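TL;DR: This paper compares several pre-trained attention-based models on electronic health record (EHR) information extraction, using the CMED dataset from the n2c2 challenge for medication extraction, medical event detection, and multi-dimensional context classification. Models pre-trained on clinical data are better at detecting medications and events, while Bert Base, pre-trained on general-domain data, performs best on context classification.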

Details Motivation: To improve the accuracy of extracting medication-related event information from clinical text, and to explore how well different pre-trained attention-based models suit medical NLP tasks. Method: Bert Base, BioBert, two Bio+Clinical Bert variants, RoBerta, and Clinical Longformer are fine-tuned on the CMED dataset for medication extraction, event detection, and context classification, and evaluated with recall, precision, and F1-Score. Result: For medication and event detection, models pre-trained on clinical data perform better; for context classification, Bert Base, pre-trained on general-domain data, performs best. Conclusion: A model's performance on a given task depends strongly on its pre-training data, so choosing a pre-trained model means weighing domain specificity against generality for the task at hand. Abstract: Attention-based models have become the leading approach in modeling medical language for Natural Language Processing (NLP) in clinical notes. These models outperform traditional techniques by effectively capturing contextual representations of language. In this research, a comparative analysis is carried out among pre-trained attention-based models, namely Bert Base, BioBert, two variations of Bio+Clinical Bert, RoBerta, and Clinical Longformer, on tasks related to Electronic Health Record (EHR) information extraction. The tasks from Track 1 of Harvard Medical School's 2022 National Clinical NLP Challenges (n2c2) are considered for this comparison, with the Contextualized Medication Event Dataset (CMED) provided for these tasks. CMED is a dataset of unstructured EHRs and annotated notes that contain task-relevant information about the EHRs. The goal of the challenge is to develop effective solutions for extracting contextual information related to patient medication events from EHRs using data-driven methods. Each pre-trained model is fine-tuned and applied on CMED to perform medication extraction, medical event detection, and multi-dimensional medication event context classification. Processing methods are also detailed for breaking down EHRs for compatibility with the applied models. Performance analysis has been carried out using a script based on constructing medical terms from the evaluation portion of CMED with metrics including recall, precision, and F1-Score.
The results demonstrate that models pre-trained on clinical data are more effective in detecting medication and medication events, but Bert Base, pre-trained on general-domain data, proved to be the most effective for classifying the context of events related to medications.

[55] CompLLM: Compression for Long Context Q&A

Gabriele Berton,Jayakrishnan Unnikrishnan,Son Tran,Mubarak Shah

Main category: cs.CL

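TL;DR: CompLLM is a soft compression technique for long-context processing in large language models. It compresses segments independently, which gives efficient, scalable, and reusable context compression, substantially speeding up inference and reducing KV cache usage.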

Details Motivation: Existing context compression methods usually compress the input as a whole, which leads to quadratic complexity and prevents reuse of computation across queries, limiting practical use. Method: The input context is divided into segments and each segment is soft-compressed independently, which gives linear-complexity compression, cache reuse, and sharing across queries. Result: At a 2x compression rate and high context lengths, Time To First Token (TTFT) improves by up to 4x and the KV cache shrinks by 50%; on very long sequences, performance exceeds that of the uncompressed context. Conclusion: CompLLM is efficient, scalable, and reusable, suits practical deployment, and supports LLM applications with very long contexts. Abstract: Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
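
The efficiency and reusability claims follow from compressing segments independently: the work grows with the number of segments, and a segment that has been seen before can be served from a cache. The sketch below illustrates only that bookkeeping. The compressor is a toy stand-in (averaging pairs of token ids) for the learned soft compressor, and `SegmentCompressor`, `toy_compress`, and the parameter names are hypothetical.

```python
import hashlib


def split_segments(tokens, segment_len):
    """Cut the token list into consecutive, non-overlapping segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def toy_compress(segment, rate=2):
    """Placeholder for a learned compressor: average each group of `rate` ids."""
    groups = [segment[i:i + rate] for i in range(0, len(segment), rate)]
    return [sum(g) / len(g) for g in groups]


class SegmentCompressor:
    """Compress each segment on its own and cache the result by content."""

    def __init__(self, segment_len=4, rate=2):
        self.segment_len = segment_len
        self.rate = rate
        self.cache = {}
        self.compress_calls = 0  # counts cache misses only

    def _key(self, segment):
        return hashlib.sha256(repr(segment).encode("utf-8")).hexdigest()

    def compress(self, tokens):
        compressed = []
        for segment in split_segments(tokens, self.segment_len):
            key = self._key(segment)
            if key not in self.cache:
                self.cache[key] = toy_compress(segment, self.rate)
                self.compress_calls += 1
            compressed.extend(self.cache[key])
        return compressed


compressor = SegmentCompressor(segment_len=4, rate=2)
document = list(range(8))
first = compressor.compress(document)                    # two cache misses
second = compressor.compress(document + [8, 9, 10, 11])  # one new segment only
```

Reuse in this sketch depends on segment boundaries lining up between queries; a shifted context would produce different segments and miss the cache.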

[56] Reinforcement Learning on Pre-Training Data

Siheng Li,Kejiao Li,Zenan Xu,Guanhua Huang,Evander Yang,Kun Li,Haoyuan Wu,Jiajia Wu,Zihao Zheng,Chenchen Zhang,Kun Shi,Kyrierl Deng,Qi Yi,Ruibin Xiong,Tingqiang Xu,Yuhao Jiang,Jianfeng Yan,Yuyuan Zeng,Guanghui Xu,Jinbao Xue,Zhijiang Xu,Zheng Fang,Shuai Li,Qibin Liu,Xiaoxue Li,Zhuoyu Li,Yangyu Tao,Fei Gao,Cheng Jiang,Bo Chao Wang,Kai Liu,Jianchen Zhu,Wai Lam,Wayyt Wang,Bo Zhou,Di Wang

Main category: cs.CL

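TL;DR: This paper proposes RLPT, a new training paradigm for large language models that uses prediction of the next text segment in pre-training data as the reward signal and improves reasoning through reinforcement learning, without human-annotated feedback.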

Details Motivation: Growth in high-quality text data is limited, which constrains conventional scaling approaches that rely on supervised learning; a method that uses existing pre-training data more efficiently is needed. Method: Reinforcement Learning on Pre-Training data (RLPT) adopts a next-segment reasoning objective that turns pre-training data into reinforcement learning reward signals, letting the policy explore autonomously and learn from rich context. Result: RLPT is validated on multiple general-domain and mathematical reasoning benchmarks. On Qwen3-4B-Base it clearly improves MMLU, AIME, and other metrics, shows favorable compute scaling, and also strengthens other reinforcement learning methods such as RLVR. Conclusion: RLPT offers a scalable reinforcement learning paradigm for large language models that needs no human annotation, extends the boundary of their reasoning ability, and has broad application potential. Abstract: The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively.
The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.

[57] Extracting Conceptual Spaces from LLMs Using Prototype Embeddings

Nitesh Kumar,Usashi Chatterjee,Steven Schockaert

Main category: cs.CL

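TL;DR: Proposes a method for extracting conceptual spaces from large language models by encoding prototype descriptions, with fine-tuning to align the prototype embeddings with conceptual space dimensions; experiments show the method is highly effective.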

Details Motivation: Conceptual spaces have important applications in cognitive science and explainable AI, but they are hard to learn, and existing methods lack an effective way to encode the underlying features. Method: Features (e.g. sweetness) are encoded by embedding the description of a corresponding prototype (e.g. a very sweet food), and the LLM is fine-tuned to align the prototype embeddings with the conceptual space dimensions. Result: Empirical analysis shows that the method extracts meaningful conceptual spaces from LLMs effectively. Conclusion: The proposed strategy offers a practical and effective new route to building interpretable conceptual spaces from large language models. Abstract: Conceptual spaces represent entities and concepts using cognitively meaningful dimensions, typically referring to perceptual features. Such representations are widely used in cognitive science and have the potential to serve as a cornerstone for explainable AI. Unfortunately, they have proven notoriously difficult to learn, although recent LLMs appear to capture the required perceptual features to a remarkable extent. Nonetheless, practical methods for extracting the corresponding conceptual spaces are currently still lacking. While various methods exist for extracting embeddings from LLMs, extracting conceptual spaces also requires us to encode the underlying features. In this paper, we propose a strategy in which features (e.g. sweetness) are encoded by embedding the description of a corresponding prototype (e.g. a very sweet food). To improve this strategy, we fine-tune the LLM to align the prototype embeddings with the corresponding conceptual space dimensions. Our empirical analysis finds this approach to be highly effective.

[58] SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data

Erik Božík,Marek Šuppa

Main category: cs.CL

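TL;DR: This paper introduces SloPalSpeech, a large-scale Slovak ASR dataset containing 2,806 hours of parliamentary recordings, aimed at speech recognition for low-resource languages. Fine-tuning OpenAI Whisper models on the dataset substantially lowers word error rate on standard benchmarks, and the dataset, transcripts, and models are publicly released to support follow-up research.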

Details Motivation: Automatic speech recognition (ASR) for low-resource languages such as Slovak is held back by scarce training data, so large, high-quality datasets are needed to improve model performance. Method: A robust processing pipeline aligns and segments long recordings into 30-second audio-transcript pairs suitable for training, and several OpenAI Whisper models are fine-tuned on SloPalSpeech. Result: On standard Slovak benchmarks such as Common Voice and FLEURS, the word error rate (WER) of the fine-tuned Whisper-small model drops by up to 70%, approaching the baseline performance of the much larger Whisper-large-v3. Conclusion: SloPalSpeech improves ASR performance for a low-resource language, and the released dataset and models should help future research. Abstract: Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model's WER dropped by up to 70%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.
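
Word Error Rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below computes it with a standard dynamic program and also shows how a drop "by up to 70%" would be read as a relative reduction. That reading is an assumption, since the abstract does not say whether the figure is relative or absolute, and the example WER values are illustrative rather than taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed, e.g. 0.7 for a 70% relative drop."""
    return (baseline_wer - new_wer) / baseline_wer


wer = word_error_rate("a b c d", "a x c")    # one substitution, one deletion
drop = relative_reduction(0.40, 0.12)        # illustrative values only
```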

[59] WolBanking77: Wolof Banking Speech Intent Classification Dataset

Abdou Karim Kandji,Frédéric Precioso,Cheikh Ba,Samba Ndiaye,Augustin Ndione

Main category: cs.CL

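TL;DR: This paper introduces WolBanking77, a new intent classification dataset for Wolof, a low-resource language. It contains both text and speech data and aims to advance natural language processing research for regions with high illiteracy rates and few written resources.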

Details Motivation: Existing intent classification research focuses mainly on high-resource languages and overlooks low-resource languages such as Wolof, which is widely spoken in Senegal. Because illiteracy is high there and the language is mostly spoken, datasets suited to such languages are needed. Method: The authors build WolBanking77, a Wolof intent classification dataset with 9,791 banking-domain text sentences and more than 4 hours of speech, and run experiments with a range of text and speech models. Result: Baseline models perform well on WolBanking77; the paper reports F1 scores for NLP models and word error rates for ASR models, along with comparisons between models, and the results are encouraging. Conclusion: WolBanking77 is an important resource for intent classification in low-resource, predominantly spoken languages; the authors plan to maintain and update the dataset and release open-source code. Abstract: Intent classification models have made a lot of progress in recent years. However, previous studies primarily focus on high-resource language datasets, which results in a gap for low-resource languages and for regions with a high rate of illiterate people where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90% of the population, with an illiteracy rate of 42% for the country. Wolof is actually spoken by more than 10 million people in the West African region. To tackle such limitations, we release a Wolof Intent Classification Dataset (WolBanking77), for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including text and voice state-of-the-art models. The results are very promising on this current dataset. This paper also provides detailed analyses of the contents of the data. We report baseline F1-score and word error rate metrics respectively on NLP and ASR models trained on the WolBanking77 dataset, as well as comparisons between models. We plan to share the dataset, maintain and update it, and release open-source code.

[60] DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture

Arijit Maji,Raghvendra Kumar,Akash Ghosh,Anushka,Nemil Shah,Abhilekh Borah,Vanshika Shah,Nishant Mishra,Sriparna Saha

Main category: cs.CL

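TL;DR: DRISHTIKON is the first multimodal, multilingual benchmark focused on Indian culture, covering 15 languages and more than 64,000 text-image pairs, for evaluating the cultural understanding of generative AI.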

Details Motivation: Existing benchmarks are mostly global or generic and lack deep coverage of specific cultures, especially India's diverse cultures, so a benchmark dedicated to evaluating AI cultural understanding is needed. Method: A multimodal dataset of text-image pairs is built, covering all regions of India and many cultural themes, and a range of vision-language models are evaluated in zero-shot and chain-of-thought settings. Result: The evaluation shows clear limitations in current models on low-resource languages and less-documented traditions, and difficulty with culturally grounded reasoning. Conclusion: DRISHTIKON fills a gap in inclusive AI research and provides a strong testbed for developing culturally aware multimodal language technologies. Abstract: We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India's diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, and historical heritage amongst many more. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models' ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.

cs.CV [Back]

[61] PolypSeg-GradCAM: Towards Explainable Computer-Aided Gastrointestinal Disease Detection Using U-Net Based Segmentation and Grad-CAM Visualization on the Kvasir Dataset

Akwasi Asare,Ulas Bagci

Main category: cs.CV

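TL;DR: This study proposes PolypSeg-GradCAM, an explainable deep learning framework that combines U-Net with Grad-CAM for transparent polyp segmentation, showing high accuracy and visual interpretability on the Kvasir-SEG dataset.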

Details Motivation: Colorectal cancer is one of the leading causes of cancer death worldwide, and gastrointestinal polyps are its key precursors. Early, accurate polyp segmentation is essential to prevent progression, but manual annotation is time-consuming and subjective, and existing deep learning methods lack interpretability, which limits clinical use. Method: The PolypSeg-GradCAM framework combines the U-Net architecture with Gradient-weighted Class Activation Mapping (Grad-CAM) and is trained and evaluated on the Kvasir-SEG dataset (1000 annotated endoscopic images) for automated, explainable polyp segmentation. Result: The model reaches a mean IoU of 0.9257 on the test set, with Dice coefficients above 0.96 on the training and validation sets; Grad-CAM visualizations show that predictions focus on clinically relevant regions. Conclusion: By combining high accuracy with interpretability, PolypSeg-GradCAM advances trustworthy AI-assisted colonoscopy and supports earlier prevention of colorectal cancer. Abstract: Colorectal cancer (CRC) remains one of the leading causes of cancer-related morbidity and mortality worldwide, with gastrointestinal (GI) polyps serving as critical precursors according to the World Health Organization (WHO). Early and accurate segmentation of polyps during colonoscopy is essential for reducing CRC progression, yet manual delineation is labor-intensive and prone to observer variability. Deep learning methods have demonstrated strong potential for automated polyp analysis, but their limited interpretability remains a barrier to clinical adoption. In this study, we present PolypSeg-GradCAM, an explainable deep learning framework that integrates the U-Net architecture with Gradient-weighted Class Activation Mapping (Grad-CAM) for transparent polyp segmentation. The model was trained and evaluated on the Kvasir-SEG dataset of 1000 annotated endoscopic images. Experimental results demonstrate robust segmentation performance, achieving a mean Intersection over Union (IoU) of 0.9257 on the test set and consistently high Dice coefficients (F-score > 0.96) on training and validation sets. Grad-CAM visualizations further confirmed that predictions were guided by clinically relevant regions, enhancing transparency and trust in the model's decisions. By coupling high segmentation accuracy with interpretability, PolypSeg-GradCAM represents a step toward reliable, trustworthy AI-assisted colonoscopy and improved early colorectal cancer prevention.
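
The two reported metrics, IoU and the Dice coefficient, are both overlap ratios between a predicted mask and a ground-truth mask. The sketch below computes them for binary masks stored as nested lists; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration, not details from the paper.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks given as lists of 0/1 rows."""
    intersection = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            pred_area += p
            target_area += t
            intersection += p * t
    union = pred_area + target_area - intersection
    # Assumed convention: two empty masks count as a perfect match.
    iou = intersection / union if union else 1.0
    total = pred_area + target_area
    dice = 2 * intersection / total if total else 1.0
    return iou, dice


pred_mask = [[1, 1],
             [0, 0]]
true_mask = [[1, 0],
             [0, 0]]
iou, dice = iou_and_dice(pred_mask, true_mask)
```

For a single pair of masks the two are linked by Dice = 2 * IoU / (1 + IoU), so an IoU of 0.9257 corresponds to a Dice of roughly 0.961. Averages over many images do not convert exactly this way, and the abstract reports the two metrics on different splits.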

[62] PerceptronCARE: A Deep Learning-Based Intelligent Teleopthalmology Application for Diabetic Retinopathy Diagnosis

Akwasi Asare,Isaac Baffour Senkyire,Emmanuel Freeman,Simon Hilary Ayinedenaba Aluze-Ele,Kelvin Kwao

Main category: cs.CV

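TL;DR: This study presents PerceptronCARE, a deep learning-based telemedicine application for automated detection of diabetic retinopathy, reaching 85.4% accuracy and showing promise for clinical and telemedicine use.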

Details Motivation: Diabetic retinopathy is a leading cause of vision loss in adults and a major public health challenge, especially in regions with scarce medical resources, so efficient and scalable screening tools are urgently needed. Method: Several convolutional neural networks, including ResNet-18, EfficientNet-B0, and SqueezeNet, are trained and evaluated on retinal images; the model with the best balance of accuracy and computational efficiency is selected and integrated with a cloud platform and a multi-user framework. Result: The final model classifies disease severity with 85.4% accuracy, supports real-time screening, and offers good scalability and data security. Conclusion: PerceptronCARE shows the potential of AI-driven telemedicine systems to widen access to diabetic retinopathy screening, particularly in remote and resource-constrained areas. Abstract: Diabetic retinopathy is a leading cause of vision loss among adults and a major global health challenge, particularly in underserved regions. This study presents PerceptronCARE, a deep learning-based teleophthalmology application designed for automated diabetic retinopathy detection using retinal images. The system was developed and evaluated using multiple convolutional neural networks, including ResNet-18, EfficientNet-B0, and SqueezeNet, to determine the optimal balance between accuracy and computational efficiency. The final model classifies disease severity with an accuracy of 85.4%, enabling real-time screening in clinical and telemedicine settings. PerceptronCARE integrates cloud-based scalability, secure patient data management, and a multi-user framework, facilitating early diagnosis, improving doctor-patient interactions, and reducing healthcare costs. This study highlights the potential of AI-driven telemedicine solutions in expanding access to diabetic retinopathy screening, particularly in remote and resource-constrained environments.

[63] Self Identity Mapping

Xiuding Cai,Yaoyao Zhu,Linjie Fu,Dong Miao,Yu Yao

Main category: cs.CV

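TL;DR: Proposes Self Identity Mapping (SIM), a simple yet effective data-intrinsic regularization framework that uses an inverse mapping mechanism to strengthen representation learning, reduce information loss during forward propagation, and improve gradient flow. Its efficient instantiation, ρSIM, improves performance across many tasks and network architectures; it is model- and task-agnostic and can be used as a plug-and-play module.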

Details Motivation: Conventional regularization methods rely on heuristic design, which limits their reliability and effectiveness across settings; a more general, data-intrinsic regularizer is needed to improve the generalization of deep models. Method: Self Identity Mapping (SIM) regularizes by reconstructing the input from its transformed output. ρSIM lowers the computational cost by using patch-level feature sampling and a projection-based method for efficient feature reconstruction. Result: ρSIM consistently improves baseline performance on image classification, few-shot prompt learning, and domain generalization. It is also effective on dense prediction and non-visual tasks such as semantic segmentation, image translation, audio classification, and time series anomaly detection, and it is orthogonal to other regularization methods, so they can be combined. Conclusion: SIM is a model-agnostic, task-agnostic, plug-and-play regularization framework that improves the representation learning and generalization of deep neural networks and is broadly applicable. Abstract: Regularization is essential in deep learning to enhance generalization and mitigate overfitting. However, conventional techniques often rely on heuristics, making them less reliable or effective across diverse settings. We propose Self Identity Mapping (SIM), a simple yet effective, data-intrinsic regularization framework that leverages an inverse mapping mechanism to enhance representation learning. By reconstructing the input from its transformed output, SIM reduces information loss during forward propagation and facilitates smoother gradient flow. To address computational inefficiencies, we instantiate SIM as $\rho\text{SIM}$ by incorporating patch-level feature sampling and a projection-based method to reconstruct latent features, effectively lowering complexity. As a model-agnostic, task-agnostic regularizer, SIM can be seamlessly integrated as a plug-and-play module, making it applicable to different network architectures and tasks. We extensively evaluate $\rho\text{SIM}$ across three tasks: image classification, few-shot prompt learning, and domain generalization. Experimental results show consistent improvements over baseline methods, highlighting $\rho\text{SIM}$'s ability to enhance representation learning across various tasks. We also demonstrate that $\rho\text{SIM}$ is orthogonal to existing regularization methods, boosting their effectiveness.
Moreover, our results confirm that $\rho\text{SIM}$ effectively preserves semantic information and enhances performance in dense-to-dense tasks, such as semantic segmentation and image translation, as well as in non-visual domains including audio classification and time series anomaly detection. The code is publicly available at https://github.com/XiudingCai/SIM-pytorch.

[64] MAGIA: Sensing Per-Image Signals from Single-Round Averaged Gradients for Label-Inference-Free Gradient Inversion

Zhanting Zhou,Jinbo Wang,Zeqin Wu,Fengli Zhang

Main category: cs.CV

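TL;DR: Proposes MAGIA, a momentum-based adaptive gradient inversion attack framework that achieves high-fidelity multi-image reconstruction in the single-round averaged gradient setting without label inference or auxiliary information.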

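Details Motivation: In the single-round averaged gradient (SAG) setting, per-sample gradients are mixed into the batch mean, which makes gradient inversion difficult, and existing methods struggle to recover the original data. Method: MAGIA introduces a momentum-based adaptive correction that senses latent per-image signals by probing random subsets. Its objective has two innovations: (1) a closed-form combinatorial rescaling that gives a tighter optimization bound, and (2) a momentum-based mixing of whole-batch and subset losses that improves reconstruction robustness. Result: Experiments show that MAGIA significantly outperforms existing advanced methods in large-batch scenarios, achieving high-fidelity multi-image reconstruction with a computational cost comparable to standard solvers and without any auxiliary information. Conclusion: MAGIA provides an efficient, robust solution for gradient inversion under single-round averaged gradients and extends the reach of privacy attacks that need no label inference. Abstract: We study gradient inversion in the challenging single-round averaged gradient (SAG) regime, where per-sample cues are entangled within a single batch-mean gradient. We introduce MAGIA, a momentum-based adaptive correction on gradient inversion attack, a novel label-inference-free framework that senses latent per-image signals by probing random data subsets. MAGIA's objective integrates two core innovations: (1) a closed-form combinatorial rescaling that creates a provably tighter optimization bound, and (2) a momentum-based mixing of whole-batch and subset losses to ensure reconstruction robustness. Extensive experiments demonstrate that MAGIA significantly outperforms advanced methods, achieving high-fidelity multi-image reconstruction in large-batch scenarios where prior works fail. This is all accomplished with a computational footprint comparable to standard solvers and without requiring any auxiliary information.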

[65] Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

Khalil Hennara,Muhammad Hreden,Mohamed Motasim Hamed,Ahmad Bastati,Zeina Aldallal,Sara Chrouf,Safwan AlModhayan

Main category: cs.CV

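TL;DR: This paper presents Baseer, a vision-language model built specifically for Arabic document OCR and trained on a large-scale dataset of synthetic and real documents. It significantly outperforms existing open-source and commercial solutions, reaching a WER of 0.25 and setting a new reference point for high-accuracy OCR on morphologically rich languages.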

Details Motivation: Arabic document OCR is challenging because of the cursive script, diverse fonts, diacritics, and right-to-left writing direction, and existing multimodal large models perform poorly on Arabic. Method: A decoder-only fine-tuning strategy adapts a pre-trained multimodal large model using a large-scale dataset of synthetic and real-world documents, and Misraj-DocOCR, a high-quality, expert-verified benchmark, is introduced for evaluation. Result: Baseer significantly outperforms existing solutions on Arabic document OCR, reaching a word error rate (WER) of 0.25 and becoming the new state of the art. Conclusion: Domain-specific fine-tuning of general-purpose multimodal large models can substantially improve OCR for complex languages such as Arabic, and Baseer provides a strong baseline for high-accuracy OCR. Abstract: Arabic document OCR remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25 and establishing a new state-of-the-art in the domain of Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.

[66] A Deep Learning Approach for Spatio-Temporal Forecasting of InSAR Ground Deformation in Eastern Ireland

Wendong Yao,Saeed Azadnejad,Binhua Huang,Shane Donohue,Soumyabrata Dev

Main category: cs.CV

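TL;DR: Proposes a new deep learning framework that converts sparse InSAR time-series data into a dense spatio-temporal tensor and combines it with a CNN-LSTM model for accurate ground deformation forecasting.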

Details Motivation: Sparse InSAR time-series data make forecasting future ground deformation difficult, and existing methods struggle to capture spatial patterns and temporal dependencies at the same time. Method: Sparse point measurements are converted into a dense spatio-temporal tensor, and a hybrid CNN-LSTM model applies computer vision architectures for joint spatio-temporal modeling. Result: Experiments on Sentinel-1 data from eastern Ireland show higher forecast accuracy and spatial coherence than strong baselines such as LightGBM and LASSO, and the model avoids a simple persistence bias. Conclusion: The proposed spatio-temporal deep learning framework substantially improves ground deformation forecasting and confirms its effectiveness and potential for high-resolution deformation monitoring. Abstract: Monitoring ground displacement is crucial for urban infrastructure stability and mitigating geological hazards. However, forecasting future deformation from sparse Interferometric Synthetic Aperture Radar (InSAR) time-series data remains a significant challenge. This paper introduces a novel deep learning framework that transforms these sparse point measurements into a dense spatio-temporal tensor. This methodological shift allows, for the first time, the direct application of advanced computer vision architectures to this forecasting problem. We design and implement a hybrid Convolutional Neural Network and Long-Short Term Memory (CNN-LSTM) model, specifically engineered to simultaneously learn spatial patterns and temporal dependencies from the generated data tensor. The model's performance is benchmarked against powerful machine learning baselines, Light Gradient Boosting Machine and LASSO regression, using Sentinel-1 data from eastern Ireland. Results demonstrate that the proposed architecture provides significantly more accurate and spatially coherent forecasts, establishing a new performance benchmark for this task. Furthermore, an interpretability analysis reveals that baseline models often default to simplistic persistence patterns, highlighting the necessity of our integrated spatio-temporal approach to capture the complex dynamics of ground deformation. Our findings confirm the efficacy and potential of spatio-temporal deep learning for high-resolution deformation forecasting.
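
The key preprocessing step is turning sparse point measurements into a dense tensor that a CNN-LSTM can consume. The abstract does not describe how the gridding is done, so the sketch below assumes the simplest option: points are binned into a regular grid, each cell holds the mean displacement of the points inside it at each time step, and empty cells get a fill value. The function name, the `bounds` layout, and the fill strategy are all hypothetical.

```python
def rasterize(points, height, width, bounds, fill=0.0):
    """Bin sparse time series into a dense tensor indexed as [time][row][col].

    points: list of (x, y, series), where series has one value per time step.
    bounds: (x_min, y_min, x_max, y_max) of the study area.
    """
    x_min, y_min, x_max, y_max = bounds
    steps = len(points[0][2])
    sums = [[[0.0] * width for _ in range(height)] for _ in range(steps)]
    counts = [[0] * width for _ in range(height)]
    for x, y, series in points:
        # Map coordinates to cell indices, clamping the upper edge into range.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        counts[row][col] += 1
        for t, value in enumerate(series):
            sums[t][row][col] += value
    return [
        [[sums[t][r][c] / counts[r][c] if counts[r][c] else fill
          for c in range(width)]
         for r in range(height)]
        for t in range(steps)
    ]


# Three points, two time steps, on a 2 x 2 grid over the unit square.
measurements = [
    (0.1, 0.1, [1.0, 2.0]),
    (0.2, 0.2, [3.0, 4.0]),
    (0.9, 0.9, [5.0, 6.0]),
]
tensor = rasterize(measurements, height=2, width=2, bounds=(0.0, 0.0, 1.0, 1.0))
```

A real pipeline would more likely interpolate between points and track a validity mask rather than fill empty cells with a constant; the constant fill here only keeps the sketch short.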

[67] A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts

George Corrêa de Araújo,Helena de Almeida Maia,Helio Pedrini

Main category: cs.CV

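TL;DR: This paper presents the Scrapbook framework for generating large datasets that probe AI models' understanding of basic concepts such as object recognition, positional relations, and attribute identification. Experiments show that current models handle object recognition well but lack understanding and consistency on positional information and on questions with added constraints.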

Details Motivation: To systematically assess how well AI models understand basic concepts, verifying their grasp of fundamental elements before they tackle complex tasks. Method: The Scrapbook framework generates datasets containing many questions about a single concept with wide linguistic variation, and uses them to test model understanding. Result: Current models perform well on object recognition and counting but poorly on absolute and relative position, geometric shapes, and questions with additional constraints. MobileVLM-V2 produced many wrong answers, while other models tended to answer "yes" and lacked consistency. Conclusion: The Scrapbook framework is an effective tool for generating diverse, comprehensive datasets that help assess and improve AI models' understanding of basic concepts. Abstract: In this paper, we present the Scrapbook framework, a novel methodology designed to generate extensive datasets for probing the learned concepts of artificial intelligence (AI) models. The framework focuses on fundamental concepts such as object recognition, absolute and relative positions, and attribute identification. By generating datasets with a large number of questions about individual concepts and a wide linguistic variation, the Scrapbook framework aims to validate the model's understanding of these basic elements before tackling more complex tasks. Our experimental findings reveal that, while contemporary models demonstrate proficiency in recognizing and enumerating objects, they encounter challenges in comprehending positional information and addressing inquiries with additional constraints. Specifically, the MobileVLM-V2 model showed significant answer disagreements and plausible wrong answers, while other models exhibited a bias toward affirmative answers and struggled with questions involving geometric shapes and positional information, indicating areas for improvement in understanding and consistency. The proposed framework offers a valuable instrument for generating diverse and comprehensive datasets, which can be utilized to systematically assess and enhance the performance of AI models.

[68] The Describe-Then-Generate Bottleneck: How VLM Descriptions Alter Image Generation Outcomes

Sai Varun Kodathala,Rakesh Vunnam

Main category: cs.CV

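TL;DR: This study empirically analyzes the describe-then-generate bottleneck in vision-language-vision pipelines and finds substantial loss of perceptual and structural information when visual content is represented through intermediate text.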

Details Motivation: As multimodal AI systems are used more widely in creative workflows, it becomes important to quantify the information lost when visual content passes through a textual intermediate. Method: A describe-then-generate pipeline produces 150 image pairs, and LPIPS, SSIM, and color distance measure how much information is preserved along perceptual, structural, and chromatic dimensions. Result: 99.3% of samples show clear perceptual degradation, and 91.5% show significant structural information loss. Conclusion: The describe-then-generate bottleneck is a measurable and consistent limitation of current multimodal systems. Abstract: With the increasing integration of multimodal AI systems in creative workflows, understanding information loss in vision-language-vision pipelines has become important for evaluating system limitations. However, the degradation that occurs when visual content passes through textual intermediation remains poorly quantified. In this work, we provide empirical analysis of the describe-then-generate bottleneck, where natural language serves as an intermediate representation for visual information. We generated 150 image pairs through the describe-then-generate pipeline and applied existing metrics (LPIPS, SSIM, and color distance) to measure information preservation across perceptual, structural, and chromatic dimensions. Our evaluation reveals that 99.3% of samples exhibit substantial perceptual degradation and 91.5% demonstrate significant structural information loss, providing empirical evidence that the describe-then-generate bottleneck represents a measurable and consistent limitation in contemporary multimodal systems.

[69] AI-Derived Structural Building Intelligence for Urban Resilience: An Application in Saint Vincent and the Grenadines

Isabelle Tingzon,Yoji Toriumi,Caroline Gevaert

Main category: cs.CV

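TL;DR: Proposes an AI-driven workflow that automatically infers rooftop attributes from high-resolution satellite imagery to fill the building data gap in small island developing states.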

Details Motivation: Small island developing states (SIDS) in climate-vulnerable regions often lack detailed structural building data, which hampers disaster risk assessment and urban resilience planning. Method: The study compares geospatial foundation models combined with shallow classifiers against fine-tuned deep learning models for rooftop classification, and assesses the effect of adding training data from neighboring SIDS. Result: The best models reach F1 scores of 0.88 for roof pitch and 0.83 for roof material classification. Conclusion: Combined with local capacity building, the approach can help SIDS use AI and Earth observation data for more efficient, evidence-based urban governance. Abstract: Detailed structural building information is used to estimate potential damage from hazard events like cyclones, floods, and landslides, making it critical for urban resilience planning and disaster risk reduction. However, such information is often unavailable in many small island developing states (SIDS) in climate-vulnerable regions like the Caribbean. To address this data gap, we present an AI-driven workflow to automatically infer rooftop attributes from high-resolution satellite imagery, with Saint Vincent and the Grenadines as our case study. Here, we compare the utility of geospatial foundation models combined with shallow classifiers against fine-tuned deep learning models for rooftop classification. Furthermore, we assess the impact of incorporating additional training data from neighboring SIDS to improve model performance. Our best models achieve F1 scores of 0.88 and 0.83 for roof pitch and roof material classification, respectively. Combined with local capacity building, our work aims to provide SIDS with novel capabilities to harness AI and Earth Observation (EO) data to enable more efficient, evidence-based urban governance.

[70] VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation

Jinyue Bian,Zhaoxing Zhang,Zhengyu Liang,Shiwei Zheng,Shengtao Zhang,Rong Shen,Chen Yang,Anzhou Hou

Main category: cs.CV

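TL;DR: Proposes VLA-LPAF, a lightweight module that improves the adaptability of Vision-Language-Action (VLA) models across camera perspectives, mitigating the performance drop caused by perspective differences using only 2D data.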

Details Motivation: Camera viewpoints and camera counts differ across environments, so VLA models face inconsistent visual features across perspectives, which limits their generalization and calls for better adaptation to multi-view inputs. Method: The VLA-LPAF module is fine-tuned on single-view images and fuses multi-view observations in the latent space to strengthen perspective adaptivity. RoboFlamingo-LPAF is built on RoboFlamingo and fuses multi-view information without adding substantial computational cost. Result: RoboFlamingo-LPAF improves average task success rate by about 8% on CALVIN, 15% on LIBERO, and 30% on a customized simulation benchmark, and its perspective adaptivity is also demonstrated on real-world tasks. Conclusion: VLA-LPAF effectively improves the robustness and generalization of existing VLA models in multi-view environments and has potential for practical use. Abstract: Vision-Language-Action (VLA) models can follow text instructions according to visual observations of the surrounding environment. This ability to map multimodal inputs to actions is derived from the training of the VLA model on extensive standard demonstrations. These visual observations captured by third-person global and in-wrist local cameras are inevitably varied in number and perspective across different environments, resulting in significant differences in the visual features. This perspective heterogeneity constrains the generality of VLA models. In light of this, we first propose the lightweight module VLA-LPAF to foster the perspective adaptivity of VLA models using only 2D data. VLA-LPAF is finetuned using images from a single view and fuses other multiview observations in the latent space, which effectively and efficiently bridges the gap caused by perspective inconsistency. We instantiate our VLA-LPAF framework with the VLA model RoboFlamingo to construct RoboFlamingo-LPAF. Experiments show that RoboFlamingo-LPAF achieves an average task success rate improvement of around 8% on CALVIN, 15% on LIBERO, and 30% on a customized simulation benchmark. We also demonstrate the view-adaptive characteristics of the proposed RoboFlamingo-LPAF through real-world tasks.

[71] URNet: Uncertainty-aware Refinement Network for Event-based Stereo Depth Estimation

Yifeng Cheng,Alois Knoll,Hu Cao

Main category: cs.CV

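TL;DR: Proposes URNet, an uncertainty-aware refinement network for event-based stereo depth estimation that combines a local-global refinement module with KL-divergence-based uncertainty modeling and outperforms existing methods on the DSEC dataset.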

Details Motivation: To improve the accuracy and reliability of depth estimation with event cameras, especially in complex dynamic environments. Method: A local-global refinement module captures fine-grained local details and long-range global context, and a KL-divergence-based uncertainty modeling method improves prediction reliability. Result: Extensive experiments on the DSEC dataset show that URNet consistently outperforms state-of-the-art methods in both qualitative and quantitative evaluations. Conclusion: By combining local-global structure with uncertainty modeling, URNet substantially improves event-based stereo depth estimation. Abstract: Event cameras provide high temporal resolution, high dynamic range, and low latency, offering significant advantages over conventional frame-based cameras. In this work, we introduce an uncertainty-aware refinement network called URNet for event-based stereo depth estimation. Our approach features a local-global refinement module that effectively captures fine-grained local details and long-range global context. Additionally, we introduce a Kullback-Leibler (KL) divergence-based uncertainty modeling method to enhance prediction reliability. Extensive experiments on the DSEC dataset demonstrate that URNet consistently outperforms state-of-the-art (SOTA) methods in both qualitative and quantitative evaluations.

[72] Visionerves: Automatic and Reproducible Hybrid AI for Peripheral Nervous System Recognition Applied to Endometriosis Cases

Giammarco La Barbera,Enzo Bonnot,Thomas Isla,Juan Pablo de la Plata,Joy-Rose Dunoyer de Segonzac,Jennifer Attali,Cécile Lozach,Alexandre Bellucci,Louis Marcellin,Laure Fournier,Sabine Sarnacki,Pietro Gori,Isabelle Bloch

Main category: cs.CV

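TL;DR: Proposes Visionerves, a novel hybrid AI framework for recognizing the peripheral nervous system from multi-gradient DWI and morphological MRI data, with substantial accuracy gains over conventional methods.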

Details Motivation: Endometriosis often causes chronic pelvic pain and may involve nerves, yet imaging the peripheral nerves remains difficult. Method: A deep learning model automatically segments anatomical structures, and symbolic spatial reasoning then performs tractography and nerve recognition; anatomical knowledge is encoded through fuzzy spatial relationships, which avoids manual ROI selection. Result: Applied to the lumbosacral plexus of 10 women with confirmed or suspected endometriosis, the method improves Dice scores by up to 25% and reduces spatial errors to less than 5 mm. Conclusion: This automatic and reproducible approach opens a path to non-invasive diagnosis of endometriosis-related neuropathy and other conditions with nerve involvement. Abstract: Endometriosis often leads to chronic pelvic pain and possible nerve involvement, yet imaging the peripheral nerves remains a challenge. We introduce Visionerves, a novel hybrid AI framework for peripheral nervous system recognition from multi-gradient DWI and morphological MRI data. Unlike conventional tractography, Visionerves encodes anatomical knowledge through fuzzy spatial relationships, removing the need for manual ROI selection. The pipeline comprises two phases: (A) automatic segmentation of anatomical structures using a deep learning model, and (B) tractography and nerve recognition by symbolic spatial reasoning. Applied to the lumbosacral plexus in 10 women with (confirmed or suspected) endometriosis, Visionerves demonstrated substantial improvements over standard tractography, with Dice score improvements of up to 25% and spatial errors reduced to less than 5 mm. This automatic and reproducible approach enables detailed nerve analysis and paves the way for non-invasive diagnosis of endometriosis-related neuropathy, as well as other conditions with nerve involvement.

[73] V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety & Driver Behaviour Modelling

Muhammad Naveed,Nazia Perwaiz,Sidra Sultana,Mohaira Ahmad,Muhammad Moazam Fraz

Main category: cs.CV

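TL;DR: Presents V-SenseDrive, the first privacy-preserving multimodal driver behaviour dataset collected in the Pakistani driving environment. It combines smartphone sensor data with road-facing video to record normal, aggressive, and risky driving, supporting driver behaviour classification and intelligent transportation research.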

Details Motivation: Existing driver behaviour datasets come mainly from developed countries, under-represent the diversity of driving behaviour in the complex traffic of developing countries such as Pakistan, and often raise privacy concerns by recording the driver's face. A dataset is needed that preserves privacy while reflecting real, heterogeneous traffic. Method: A custom Android application collects high-frequency accelerometer, gyroscope, and GPS data from a smartphone together with synchronized road-facing video; three driving behaviours (normal, aggressive, risky) are recorded on several road types, with all sources precisely time-aligned; the dataset is organized into raw, processed, and semantic layers. Result: The resulting V-SenseDrive dataset contains multimodal, time-synchronized sensor and video data covering real driving scenarios in Pakistan, supports multi-class driving behaviour analysis, and provides a data foundation for ADAS, insurance, and fleet management. Conclusion: V-SenseDrive fills a gap in global driver behaviour datasets regarding the complex traffic environments of developing countries and is a valuable resource for locally grounded intelligent transportation and driving behaviour research. Abstract: Road traffic accidents remain a major public health challenge, particularly in countries with heterogeneous road conditions, mixed traffic flow, and variable driving discipline, such as Pakistan. Reliable detection of unsafe driving behaviours is a prerequisite for improving road safety, enabling advanced driver assistance systems (ADAS), and supporting data-driven decisions in insurance and fleet management. Most existing datasets originate from developed countries, with limited representation of the behavioural diversity observed in emerging economies, and recording the driver's face violates privacy. We present V-SenseDrive, the first privacy-preserving multimodal driver behaviour dataset collected entirely within the Pakistani driving environment. V-SenseDrive combines smartphone-based inertial and GPS sensor data with synchronized road-facing video to record three target driving behaviours (normal, aggressive, and risky) on multiple types of roads, including urban arterials, secondary roads, and motorways. Data was gathered using a custom Android application designed to capture high-frequency accelerometer, gyroscope, and GPS streams alongside continuous video, with all sources precisely time-aligned to enable multimodal analysis. The focus of this work is on the data acquisition process, covering participant selection, driving scenarios, environmental considerations, and sensor-video synchronization techniques. The dataset is structured into raw, processed, and semantic layers, ensuring adaptability for future research in driver behaviour classification, traffic safety analysis, and ADAS development. By representing real-world driving in Pakistan, V-SenseDrive fills a critical gap in the global landscape of driver behaviour datasets and lays the groundwork for context-aware intelligent transportation solutions.

[74] Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

Daxiang Dong,Mingming Zheng,Dong Xu,Bairong Zhuang,Wenyu Zhang,Chunhua Luo,Haoran Wang,Zijian Zhao,Jie Li,Yuxuan Li,Hanjun Zhong,Mengyue Liu,Jieting Chen,Shupeng Li,Lun Tian,Yaping Feng,Xin Li,Donggang Jiang,Yong Chen,Yehua Xu,Duohao Qin,Chen Feng,Dan Wang,Henghua Zhang,Jingjing Ha,Jinhui He,Yanfeng Zhai,Chengxin Zheng,Jiayi Mao,Jiacheng Chen,Ruchang Yao,Ziye Yuan,Jianmin Wu,Guangjun Xie,Dou Shen

Main category: cs.CV

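TL;DR: Introduces the Qianfan-VL series of multimodal large language models, which use multi-stage progressive training and a high-precision data synthesis pipeline to reach strong performance on both general and domain-specific tasks, especially OCR and document understanding.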

Details Motivation: To strengthen multimodal large models in specific domains such as OCR and document understanding while keeping strong general performance, and to advance enterprise deployment. Method: Multi-stage progressive training and a high-precision data synthesis pipeline are used, with efficient large-scale training on Baidu's Kunlun P800 chips; some variants add long chain-of-thought reasoning. Result: Qianfan-VL reaches state-of-the-art results on benchmarks such as CCBench, SEEDBench IMG, ScienceQA, and MMStar; it scores 873 on OCRBench and 94.75% on DocVQA; the Qianfan-VL-8B and 70B variants reach 78.6% on MathVista; single-task training achieves over 90% scaling efficiency on 5000 chips. Conclusion: The work presents an effective methodology for developing domain-enhanced multimodal models, shows that domestically developed AI infrastructure can train top-tier multimodal models, and suits diverse enterprise deployment scenarios. Abstract: We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong general performance. Qianfan-VL achieves comparable results to leading open-source models on general benchmarks, with state-of-the-art performance on benchmarks such as CCBench, SEEDBench IMG, ScienceQA, and MMStar. The domain enhancement strategy delivers significant advantages in OCR and document understanding, validated on both public benchmarks (OCRBench 873, DocVQA 94.75%) and in-house evaluations. Notably, Qianfan-VL-8B and 70B variants incorporate long chain-of-thought capabilities, demonstrating superior performance on mathematical reasoning (MathVista 78.6%) and logical inference tasks. All models are trained entirely on Baidu's Kunlun P800 chips, validating the capability of large-scale AI infrastructure to train SOTA-level multimodal models with over 90% scaling efficiency on 5000 chips for a single task. This work establishes an effective methodology for developing domain-enhanced multimodal models suitable for diverse enterprise deployment scenarios.

[75] HazeFlow: Revisit Haze Physical Model as ODE and Non-Homogeneous Haze Generation for Real-World Dehazing

Junseong Shin,Seungwoo Chung,Yunjeong Yang,Tae Hyun Kim

Main category: cs.CV

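TL;DR: Proposes HazeFlow, an ODE-based dehazing framework that reformulates the atmospheric scattering model as an ordinary differential equation and uses Markov Chain Brownian Motion to synthesize non-homogeneous haze, enabling efficient real-world dehazing with a single inference step.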

Details Motivation: Existing deep learning dehazing methods generalize poorly because paired real-world training data is scarce and traditional ASM-based models struggle with complex, varied real-world haze. Method: The Atmospheric Scattering Model (ASM) is reformulated as an ordinary differential equation (ODE) and, drawing on Rectified Flow, an optimal ODE trajectory from hazy to clean images is learned; a non-homogeneous haze generation method based on Markov Chain Brownian Motion (MCBM) produces more realistic paired data. Result: The method achieves state-of-the-art performance on several real-world dehazing benchmarks while needing only a single inference step, showing strong adaptability and efficiency. Conclusion: Through physics-guided ODE modeling and a new haze generation strategy, HazeFlow improves the generalization and practicality of dehazing models in real scenes. Abstract: Dehazing involves removing haze or fog from images to restore clarity and improve visibility by estimating atmospheric scattering effects. While deep learning methods show promise, the lack of paired real-world training data and the resulting domain gap hinder generalization to real-world scenarios. In this context, physics-grounded learning becomes crucial; however, traditional methods based on the Atmospheric Scattering Model (ASM) often fall short in handling real-world complexities and diverse haze patterns. To solve this problem, we propose HazeFlow, a novel ODE-based framework that reformulates ASM as an ordinary differential equation (ODE). Inspired by Rectified Flow (RF), HazeFlow learns an optimal ODE trajectory to map hazy images to clean ones, enhancing real-world dehazing performance with only a single inference step. Additionally, we introduce a non-homogeneous haze generation method using Markov Chain Brownian Motion (MCBM) to address the scarcity of paired real-world data. By simulating realistic haze patterns through MCBM, we enhance the adaptability of HazeFlow to diverse real-world scenarios. Through extensive experiments, we demonstrate that HazeFlow achieves state-of-the-art performance across various real-world dehazing benchmark datasets.

[76] TinyEcoWeedNet: Edge Efficient Real-Time Aerial Agricultural Weed Detection

Omar H. Khater,Abdul Jabbar Siddiqui,Aiman El-Maleh,M. Shamim Hossain

Main category: cs.CV

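TL;DR: Presents a compressed version of EcoWeedNet that uses structured channel pruning, quantization-aware training, and TensorRT acceleration to deliver efficient, accurate agricultural weed detection on resource-constrained edge devices.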

Details Motivation: Deploying deep learning models on edge devices in agriculture is limited by scarce resources, so efficient model compression is needed. Method: Structured channel pruning, quantization-aware training (QAT), and NVIDIA TensorRT acceleration on the Jetson Orin Nano are applied, addressing the difficulty of pruning complex architectures with residual connections, attention mechanisms, concatenations, and CSP blocks. Result: Model size is reduced by up to 68.5% and computation by 3.2 GFLOPs, and inference reaches 184 FPS at FP16, 28.7% faster than the baseline; on the CottonWeedDet12 dataset, EcoWeedNet at a 39.5% pruning ratio outperforms YOLO11n and YOLO12n, reaching 83.7% precision, 77.5% recall, and 85.9% mAP50. Conclusion: The compressed EcoWeedNet keeps high accuracy while substantially improving efficiency, making it suitable for edge deployment in precision agriculture. Abstract: Deploying deep learning models in agriculture is difficult because edge devices have limited resources. This work presents a compressed version of EcoWeedNet using structured channel pruning, quantization-aware training (QAT), and acceleration with NVIDIA's TensorRT on the Jetson Orin Nano. Despite the challenges of pruning complex architectures with residual shortcuts, attention mechanisms, concatenations, and CSP blocks, the model size was reduced by up to 68.5% and computations by 3.2 GFLOPs, while inference speed reached 184 FPS at FP16, 28.7% faster than the baseline. On the CottonWeedDet12 dataset, the pruned EcoWeedNet with a 39.5% pruning ratio outperformed YOLO11n and YOLO12n (with only 20% pruning), achieving 83.7% precision, 77.5% recall, and 85.9% mAP50, showing it to be both efficient and effective for precision agriculture.
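
As a rough illustration of what structured channel pruning at a given ratio involves, the sketch below ranks a layer's output channels by the L1 norm of their weights and keeps the strongest ones. The abstract does not state which importance criterion the authors use, so the L1-norm ranking, the function name `select_channels_to_keep`, and the toy weights are all assumptions made for illustration.

```python
def select_channels_to_keep(filters, pruning_ratio):
    """Return sorted indices of the output channels kept after pruning.

    filters: one flat list of weights per output channel.
    pruning_ratio: fraction of channels to remove, e.g. 0.395.
    """
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    # Assumed importance score: L1 norm of each channel's weights.
    scores = [sum(abs(w) for w in channel) for channel in filters]
    n_prune = int(len(filters) * pruning_ratio)
    # Rank channels from least to most important and drop the first n_prune.
    ranked = sorted(range(len(filters)), key=lambda i: scores[i])
    return sorted(ranked[n_prune:])


# Hypothetical 5-channel layer with 2 weights per channel.
toy_filters = [[0.1, -0.1], [2.0, 1.5], [0.0, 0.05], [-1.0, 0.7], [0.3, 0.3]]
kept = select_channels_to_keep(toy_filters, 0.4)
```

In a real network, removing an output channel also requires removing the matching input channel of every consumer layer, which is where residual shortcuts and concatenations make pruning harder, as the abstract notes.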

[77] Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction

Yi Gu,Kuniaki Saito,Jiaxin Ma

Main category: cs.CV

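TL;DR: Proposes a multimodal learning framework that combines enhanced modality dropout with contrastive learning to handle missing and imbalanced modalities, performing strongly on medical diagnosis tasks and reaching state-of-the-art results in single-modality settings.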

Details Motivation: To address missing and imbalanced modalities in real-world multimodal medical data and to make models more robust and practical when fusing heterogeneous data. Method: Learnable modality tokens enable missingness-aware fusion, and a contrastive objective on fused multimodal representations is added to conventional unimodal contrastive learning, together with an enhanced modality dropout strategy. Result: The method is validated on large-scale clinical datasets, achieves the best performance even when only one modality is available, and is successfully integrated with a recent CT foundation model. Conclusion: The method is efficient, general, and scalable, offering a low-cost and robust multimodal learning solution for clinical practice. Abstract: As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modality dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens for improving missingness-aware fusion of modalities and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks, encompassing both visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in challenging and practical scenarios where only a single modality is available. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our approach for multimodal learning, offering a scalable, low-cost solution with significant potential for real-world clinical applications. The code is available at https://github.com/omron-sinicx/medical-modality-dropout.

[78] Rethinking Pulmonary Embolism Segmentation: A Study of Current Approaches and Challenges with an Open Weight Model

Yixin Zhang,Ryan Chamberlain,Lawrance Ngo,Kevin Kramer,Maciej A. Mazurowski

Main category: cs.CV

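TL;DR: Systematically evaluates nine mainstream 3D segmentation models on pulmonary embolism (PE) segmentation, finding that a 3D U-Net with a ResNet encoder performs best, that CNN models generally outperform Vision Transformers, and that pretraining can hurt segmentation performance.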

Details Motivation: Automatic PE segmentation matters for clinical diagnosis, but deep learning models have not been compared systematically on this task; in particular, the performance gap between CNNs and Vision Transformers and the effect of pretraining remain unclear. Method: A high-quality annotated dataset is built from 490 CTPA scans; nine CNN and Vision Transformer models, initialized with pretrained or random weights, are evaluated under a unified framework, and the best model is validated on an internal test set and on public datasets. Result: The 3D U-Net with a ResNet encoder performs best; CNNs generally outperform ViTs; classification pretraining lowers segmentation performance; performance patterns are consistent across models; central and large emboli are segmented accurately (best mean Dice score 0.7131), while distal emboli remain hard to detect. On 60 test scans the best model detects 181 emboli with 49 false positives and 28 false negatives, and its generalization is validated on public datasets. Conclusion: 3D CNN architectures, especially 3D U-Net, are better suited to PE segmentation; pretraining should be used with care; future work should focus on distal emboli and on higher-quality datasets. Abstract: In this study, we curated a densely annotated in-house dataset comprising 490 CTPA scans. Using this dataset, we systematically evaluated nine widely used segmentation architectures from both the CNN and Vision Transformer (ViT) families, initialized with either pretrained or random weights, under a unified testing framework as a performance audit. Our study leads to several important observations: (1) 3D U-Net with a ResNet encoder remains a highly effective architecture for PE segmentation; (2) 3D models are particularly well-suited to this task given the morphological characteristics of emboli; (3) CNN-based models generally yield superior performance compared to their ViT-based counterparts in PE segmentation; (4) classification-based pretraining, even on large PE datasets, can adversely impact segmentation performance compared to training from scratch, suggesting that PE classification and segmentation may rely on different sets of discriminative features; (5) different model architectures show a highly consistent pattern of segmentation performance when trained on the same data; and (6) while central and large emboli can be segmented with satisfactory accuracy, distal emboli remain challenging due to both task complexity and the scarcity of high-quality datasets. Besides these findings, our best-performing model achieves a mean Dice score of 0.7131 for segmentation. It detects 181 emboli with 49 false positives and 28 false negatives from 60 in-house testing scans. Its generalizability is further validated on public datasets.
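
The reported counts can be turned into detection-level precision, recall, and F1 with a few lines of arithmetic, and the Dice score has a similarly compact definition. The sketch below assumes that the 181 detected emboli are all true positives, which the abstract implies but does not state outright; the function names and the toy voxel sets are illustrative only.

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1 from lesion-level detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
    return precision, recall, f1


def dice_score(predicted_voxels, reference_voxels):
    """Dice overlap between two collections of voxel coordinates."""
    predicted, reference = set(predicted_voxels), set(reference_voxels)
    if not predicted and not reference:
        return 1.0  # Two empty masks agree perfectly by convention.
    return 2 * len(predicted & reference) / (len(predicted) + len(reference))


# Counts from the abstract, read as TP=181, FP=49, FN=28 (an assumption).
precision, recall, f1 = detection_metrics(181, 49, 28)

# Toy masks: one shared voxel out of two voxels each.
toy_dice = dice_score({(0, 0, 0), (0, 0, 1)}, {(0, 0, 1), (0, 1, 1)})
```

Under that reading, precision is 181/230 and recall is 181/209, so the model misses fewer emboli than it falsely reports.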

[79] Improving Handshape Representations for Sign Language Processing: A Graph Neural Network Approach

Alessa Carbo,Eric Nalisnick

Main category: cs.CV

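TL;DR: Proposes a graph neural network approach that separates temporal dynamics from static handshape configurations for handshape recognition in sign language, and establishes the first benchmark for structured handshape recognition, reaching 46% accuracy across 37 handshape classes.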

Details Motivation: Existing computational approaches rarely model handshapes explicitly, which limits both sign language recognition accuracy and linguistic analysis. Method: A graph neural network combines anatomically informed graph structures with contrastive learning and separates temporal dynamics from static handshape configurations. Result: On signing sequences covering 37 handshape classes, the method reaches 46% recognition accuracy, well above the 25% achieved by baseline methods. Conclusion: The method improves handshape recognition and provides a new computational tool for linguistic analysis of signed languages. Abstract: Handshapes serve a fundamental phonological role in signed languages, with American Sign Language employing approximately 50 distinct shapes. However, computational approaches rarely model handshapes explicitly, limiting both recognition accuracy and linguistic analysis. We introduce a novel graph neural network that separates temporal dynamics from static handshape configurations. Our approach combines anatomically-informed graph structures with contrastive learning to address key challenges in handshape recognition, including subtle interclass distinctions and temporal variations. We establish the first benchmark for structured handshape recognition in signing sequences, achieving 46% accuracy across 37 handshape classes (with baseline methods achieving 25%).

[80] Influence of Classification Task and Distribution Shift Type on OOD Detection in Fetal Ultrasound

Chun Kit Wong,Anders N. Christensen,Cosmin I. Bercea,Julia A. Schnabel,Martin G. Tolsgaard,Aasa Feragen

Main category: cs.CV

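TL;DR: Investigates how the classification task itself affects out-of-distribution (OOD) detection in fetal ultrasound images, finding that tasks perform differently under different ID-OOD criteria and that strong OOD detection does not guarantee optimal abstained prediction, so task selection and uncertainty strategy should match the specific medical application.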

Details Motivation: In medical settings such as fetal ultrasound, heterogeneous image characteristics and clinical environments make reliable OOD detection essential for safely deploying deep learning models. Prior work focuses mostly on uncertainty quantification methods, whereas this work examines the effect of the classification task itself. Method: Eight uncertainty quantification methods are tested across four classification tasks, OOD detection performance is evaluated per task, and differences are analyzed under two definitions of shift: image characteristic shift and anatomical feature shift. Result: OOD detection performance depends significantly on the chosen classification task, and the best task depends on the ID-OOD criterion; good OOD detection does not imply optimal abstained prediction. Conclusion: The classification task and uncertainty strategy should be chosen according to the needs of the downstream application to achieve safer, more reliable medical image analysis. Abstract: Reliable out-of-distribution (OOD) detection is important for safe deployment of deep learning models in fetal ultrasound amidst heterogeneous image characteristics and clinical settings. OOD detection relies on estimating a classification model's uncertainty, which should increase for OOD samples. While existing research has largely focused on uncertainty quantification methods, this work investigates the impact of the classification task itself. Through experiments with eight uncertainty quantification methods across four classification tasks, we demonstrate that OOD detection performance significantly varies with the task, and that the best task depends on the defined ID-OOD criteria; specifically, whether the OOD sample is due to: i) an image characteristic shift or ii) an anatomical feature shift. Furthermore, we reveal that superior OOD detection does not guarantee optimal abstained prediction, underscoring the necessity to align task selection and uncertainty strategies with the specific downstream application in medical image analysis.
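
To make the link between classifier uncertainty, OOD detection, and abstained prediction concrete, here is a minimal sketch that scores a prediction by the entropy of its class probabilities and abstains above a threshold. The abstract does not name its eight uncertainty methods, so predictive entropy, the threshold value, and the function names are assumptions chosen for illustration.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predict_or_abstain(probs, threshold):
    """Return the predicted class index, or None to abstain when uncertain."""
    if predictive_entropy(probs) > threshold:
        return None  # Flag as possibly OOD and defer to a clinician.
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical softmax outputs of a 4-class classifier.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
decision_confident = predict_or_abstain(confident, threshold=0.7)
decision_uncertain = predict_or_abstain(uncertain, threshold=0.7)
```

The same score serves two purposes here, ranking samples for OOD detection and deciding when to abstain, and the abstract's point is that doing well at the first does not guarantee doing well at the second.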

[81] OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata

Oussema Dhaouadi,Riccardo Marin,Johannes Meier,Jacques Kaiser,Daniel Cremers

Main category: cs.CV

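TL;DR: Introduces OrthoLoC, the first large-scale dataset pairing UAV images with geospatial data, and AdHoP, a method that substantially improves cross-domain feature matching and localization accuracy.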

Details Motivation: To enable high-precision visual localization where GNSS/GPS is unavailable or network connectivity is limited, and to address the lack of research on lightweight orthographic geodata as an alternative. Method: OrthoLoC, a paired multimodal dataset of 16,425 UAV images, is built to allow image retrieval and feature matching to be evaluated separately; AdHoP, a refinement technique that can be integrated with any feature matcher, is proposed. Result: AdHoP improves matching by up to 95% and reduces translation error by up to 63%; the dataset supports systematic evaluation of how domain shift, resolution, and covisibility affect localization accuracy. Conclusion: OrthoLoC provides an effective benchmark for visual localization with orthographic geodata, and AdHoP markedly improves cross-domain matching robustness, advancing localization in resource-constrained settings. Abstract: Accurate visual localization from aerial views is a fundamental problem with applications in mapping, large-area inspection, and search-and-rescue operations. In many scenarios, these systems require high-precision localization while operating with limited resources (e.g., no internet connection or GNSS/GPS support), making large image databases or heavy 3D models impractical. Surprisingly, little attention has been given to leveraging orthographic geodata as an alternative paradigm, which is lightweight and increasingly available through free releases by governmental authorities (e.g., the European Union). To fill this gap, we propose OrthoLoC, the first large-scale dataset comprising 16,425 UAV images from Germany and the United States with multiple modalities. The dataset addresses domain shifts between UAV imagery and geospatial data. Its paired structure enables fair benchmarking of existing solutions by decoupling image retrieval from feature matching, allowing isolated evaluation of localization and calibration performance. Through comprehensive evaluation, we examine the impact of domain shifts, data resolutions, and covisibility on localization accuracy. Finally, we introduce a refinement technique called AdHoP, which can be integrated with any feature matcher, improving matching by up to 95% and reducing translation error by up to 63%. The dataset and code are available at: https://deepscenario.github.io/OrthoLoC.

[82] A Single Image Is All You Need: Zero-Shot Anomaly Localization Without Training Data

Mehrdad Moradi,Shengzhe Chen,Hao Yan,Kamran Paynabar

Main category: cs.CV

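TL;DR: Proposes SSDnet, a single-image anomaly localization method for zero-shot image anomaly detection. It relies on the inductive bias of convolutional neural networks, learns a deep image prior through patch-level self-reconstruction, uses masking, patch shuffling, and noise to prevent identity mapping, and adopts a perceptual loss based on inner-product similarity to capture structure. SSDnet needs no external training data or labels and outperforms existing methods on MVTec-AD and a fabric dataset.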

Details Motivation: In many practical scenarios there are too few normal samples for training, or no reference image at all, so conventional anomaly detection methods that depend on training data are hard to apply. A zero-shot method that uses only the test image itself is needed. Method: The Single Shot Decomposition Network (SSDnet) exploits the inductive bias of CNNs and learns a deep image prior by self-reconstructing the input image patch by patch; masking, patch shuffling, and Gaussian noise keep the network from learning an identity mapping; a perceptual loss based on inner-product feature similarity improves modeling of structure and texture. Result: SSDnet reaches 0.99 AUROC and 0.60 AUPRC on MVTec-AD and 0.98 AUROC and 0.67 AUPRC on the fabric dataset, outperforming state-of-the-art methods, and is shown to be robust to noise and missing pixels. Conclusion: SSDnet is an effective zero-shot single-image anomaly localization method that needs no external training data or reference images; with its improved self-reconstruction strategy and loss function it accurately detects anomalous regions and is practical and robust. Abstract: Anomaly detection in images is typically addressed by learning from collections of training data or relying on reference samples. In many real-world scenarios, however, such training data may be unavailable, and only the test image itself is provided. We address this zero-shot setting by proposing a single-image anomaly localization method that leverages the inductive bias of convolutional neural networks, inspired by Deep Image Prior (DIP). Our method is named Single Shot Decomposition Network (SSDnet). Our key assumption is that natural images often exhibit unified textures and patterns, and that anomalies manifest as localized deviations from these repetitive or stochastic patterns. To learn the deep image prior, we design a patch-based training framework where the input image is fed directly into the network for self-reconstruction, rather than mapping random noise to the image as done in DIP. To avoid the model simply learning an identity mapping, we apply masking, patch shuffling, and small Gaussian noise. In addition, we use a perceptual loss based on inner-product similarity to capture structure beyond pixel fidelity. Our approach needs no external training data, labels, or references, and remains robust in the presence of noise or missing pixels. SSDnet achieves 0.99 AUROC and 0.60 AUPRC on MVTec-AD and 0.98 AUROC and 0.67 AUPRC on the fabric dataset, outperforming state-of-the-art methods. The implementation code will be released at https://github.com/mehrdadmoradi124/SSDnet
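
The three corruptions named in the abstract (masking, patch shuffling, and small Gaussian noise) can be sketched on a toy grayscale image as follows. How SSDnet orders or parameterizes these steps is not described in the abstract, so the order used here, the function name `corrupt_image`, and all parameter values are assumptions for illustration.

```python
import random


def corrupt_image(image, patch, mask_frac, noise_std, seed=0):
    """Shuffle patches, zero out a random fraction of them, and add Gaussian noise.

    image: 2D list of floats whose height and width are multiples of `patch`.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    # Top-left corner of every patch, in row-major order.
    coords = [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]
    patches = [[row[c:c + patch] for row in image[r:r + patch]] for r, c in coords]
    rng.shuffle(patches)  # Patch shuffling: patches land at new positions.
    out = [[0.0] * width for _ in range(height)]
    for (r, c), content in zip(coords, patches):
        masked = rng.random() < mask_frac  # Masking: drop the whole patch.
        for i in range(patch):
            for j in range(patch):
                if not masked:
                    out[r + i][c + j] = content[i][j] + rng.gauss(0.0, noise_std)
    return out


# Hypothetical 4x4 image split into 2x2 patches.
toy_image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
corrupted = corrupt_image(toy_image, patch=2, mask_frac=0.25, noise_std=0.01)
```

Because the network sees a corrupted input but is asked to reproduce the clean image, copying pixels one-to-one no longer minimizes the loss, which is the purpose the abstract gives for these steps.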

[83] Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning

Riad Ahmed Anonto,Sardar Md. Saffat Zabin,M. Saifur Rahman

Main category: cs.CV

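TL;DR: Proposes a compute-aware Bengali image captioning method that uses LaBSE-verified English-Bengali pairs and synthetic images together with a tri-loss objective (PAL + InfoNCE + OT), substantially improving the grounding of vision-language models in a low-resource language.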

Details Motivation: Vision-language models in low-resource languages often describe the wrong objects, mainly because paired data is scarce, translation pivots break alignment, and English-centric pretraining ignores target-language semantics. Method: The model uses a frozen MaxViT to extract stable visual patches, a Bengali-native mBART-50 decoder, and a lightweight cross-modal bridge; it is trained on LaBSE-verified real bilingual pairs and 110k synthetic images generated from bilingual prompts; a tri-loss combines Patch-Alignment Loss (PAL), an InfoNCE loss for global separation, and a Sinkhorn-based optimal transport (OT) loss for fine-grained matching. Result: The method clearly outperforms strong baselines on Flickr30k-1k and MSCOCO-1k, with BLEU-4 of 12.29 and 12.00, METEOR of 27.98 and 28.14, and BERTScore-F1 of 71.20 and 75.40, respectively; the centroid gap between real and synthetic data shrinks by 41%. Conclusion: The tri-loss synergy improves vision-language alignment and generation quality in a low-resource language and reduces grounding errors under data scarcity. Abstract: Grounding vision-language models in low-resource languages remains challenging, as they often produce fluent text about the wrong objects. This stems from scarce paired data, translation pivots that break alignment, and English-centric pretraining that ignores target-language semantics. We address this with a compute-aware Bengali captioning pipeline trained on LaBSE-verified EN-BN pairs and 110k bilingual-prompted synthetic images. A frozen MaxViT yields stable visual patches, a Bengali-native mBART-50 decodes, and a lightweight bridge links the modalities. Our core novelty is a tri-loss objective: Patch-Alignment Loss (PAL) aligns real and synthetic patch descriptors using decoder cross-attention, InfoNCE enforces global real-synthetic separation, and Sinkhorn-based OT ensures balanced fine-grained patch correspondence. This PAL+InfoNCE+OT synergy improves grounding, reduces spurious matches, and drives strong gains on Flickr30k-1k (BLEU-4 12.29, METEOR 27.98, BERTScore-F1 71.20) and MSCOCO-1k (BLEU-4 12.00, METEOR 28.14, BERTScore-F1 75.40), outperforming strong CE baselines and narrowing the real-synthetic centroid gap by 41%.
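
For readers unfamiliar with the InfoNCE term in the tri-loss, the sketch below computes the standard form of that loss over a small batch of embedding pairs. It shows the generic objective only: how the paper pairs real and synthetic descriptors, its temperature, and its loss weighting are not given in the abstract, so the pairing convention, `temperature=0.1`, and the toy vectors are assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] matches anchors[i], other rows act as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, candidate) / temperature for candidate in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical unit-norm embeddings for two items.
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(toy_anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(toy_anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is most similar to its own partner and large when it is more similar to another item's partner, which is the behavior a contrastive term contributes alongside the patch-level PAL and OT terms.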

[84] TinyBEV: Cross Modal Knowledge Distillation for Efficient Multi Task Bird's Eye View Perception and Planning

Reeshad Khan,John Gauch

Main category: cs.CV

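TL;DR: TinyBEV is a lightweight, camera-only bird's-eye-view framework that uses multi-stage distillation to compress the full-stack capabilities of a large teacher model (UniAD) into a student with only 28M parameters, delivering efficient, real-time 3D detection, motion forecasting, and planning on nuScenes.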

Details Motivation: Existing efficient camera-only BEV models such as the VAD family lack a complete autonomous driving stack, while large multi-modal models are hard to deploy. TinyBEV aims for a real-time full-stack model with high performance and low compute cost. Method: A model-agnostic, multi-stage knowledge distillation strategy combines feature-level, output-level, and adaptive region-aware supervision to transfer knowledge from a high-capacity multi-modal teacher to a lightweight camera-only student; a unified 28M-parameter backbone handles detection, map segmentation, motion forecasting, occupancy prediction, and goal-directed planning. Result: On nuScenes, TinyBEV reaches 39.0 mAP for detection, 1.08 minADE for motion forecasting, and a 0.32 collision rate, runs at 11 FPS (5x faster than UniAD), and uses 78% fewer parameters. Conclusion: TinyBEV shows that knowledge distillation can retain full driving intelligence under resource constraints, narrowing the gap between large perception-planning models and practical deployment. Abstract: We present TinyBEV, a unified, camera-only Bird's Eye View (BEV) framework that distills the full-stack capabilities of a large planning-oriented teacher (UniAD [19]) into a compact, real-time student model. Unlike prior efficient camera-only baselines such as VAD [23] and VADv2 [7], TinyBEV supports the complete autonomy stack (3D detection, HD-map segmentation, motion forecasting, occupancy prediction, and goal-directed planning) within a streamlined 28M-parameter backbone, achieving a 78% reduction in parameters over UniAD [19]. Our model-agnostic, multi-stage distillation strategy combines feature-level, output-level, and adaptive region-aware supervision to effectively transfer high-capacity multi-modal knowledge to a lightweight BEV representation. On nuScenes [4], TinyBEV achieves 39.0 mAP for detection, 1.08 minADE for motion forecasting, and a 0.32 collision rate, while running 5x faster (11 FPS) and requiring only camera input. These results demonstrate that full-stack driving intelligence can be retained in resource-constrained settings, bridging the gap between large-scale, multi-modal perception-planning models and deployment-ready real-time autonomy.

[85] BlurBall: Joint Ball and Motion Blur Estimation for Table Tennis Ball Tracking

Thomas Gossard,Filip Radovic,Andreas Ziegler,Andrea Zell

Main category: cs.CV

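TL;DR: Proposes a new labeling strategy that places the ball at the center of the motion blur and explicitly annotates blur attributes, improving detection performance and trajectory prediction.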

Details Motivation: Existing labeling conventions mark the ball at the leading edge of the blur, ignoring motion cues related to velocity and making detection asymmetric and inaccurate. Method: The ball is labeled at the center of the blur streak with blur attributes annotated, a new table tennis ball detection dataset is built, and the BlurBall model combines multi-frame input with attention mechanisms to jointly estimate ball position and blur attributes. Result: The new labeling improves detection across multiple models, and BlurBall achieves state-of-the-art detection results and more reliable trajectory prediction. Conclusion: Using motion blur information for labeling and modeling improves detection accuracy and helps trajectory prediction in real-time sports analytics. Abstract: Motion blur reduces the clarity of fast-moving objects, posing challenges for detection systems, especially in racket sports, where balls often appear as streaks rather than distinct points. Existing labeling conventions mark the ball at the leading edge of the blur, introducing asymmetry and ignoring valuable motion cues correlated with velocity. This paper introduces a new labeling strategy that places the ball at the center of the blur streak and explicitly annotates blur attributes. Using this convention, we release a new table tennis ball detection dataset. We demonstrate that this labeling approach consistently enhances detection performance across various models. Furthermore, we introduce BlurBall, a model that jointly estimates ball position and motion blur attributes. By incorporating attention mechanisms such as Squeeze-and-Excitation over multi-frame inputs, we achieve state-of-the-art results in ball detection. Leveraging blur not only improves detection accuracy but also enables more reliable trajectory prediction, benefiting real-time sports analytics.

[86] MVP: Motion Vector Propagation for Zero-Shot Video Object Detection

Binhua Huang,Ni Wang,Wendong Yao,Soumyabrata Dev

Main category: cs.CV

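TL;DR: Proposes MVP, a training-free pipeline that runs OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors, giving efficient and accurate open-vocabulary video object detection.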

Details Motivation: To cut the high compute cost of running a large open-vocabulary detector on every video frame while keeping detection accuracy high. Method: OWLv2 runs only on keyframes; compressed-domain motion vectors are aggregated over a 3x3 grid and, together with an area-growth check and an optional single-class switch, propagate detections to intermediate frames. No labels or fine-tuning are needed. Result: On the ILSVRC2015-VID validation set, MVP reaches mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316, stays close to framewise OWLv2-Large at loose IoU thresholds, and outperforms tracker-based propagation. It performs close to the supervised YOLOv12x (mAP@0.5=0.631) without requiring labels. Conclusion: Compressed-domain propagation is a practical way to reduce detector invocations substantially while keeping strong zero-shot detection, suiting efficient open-vocabulary video detection. Abstract: Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch. The method requires no labels, no fine-tuning, and uses the same prompt list for all open-vocabulary methods. On ILSVRC2015-VID (validation dataset), our approach (MVP) attains mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316. At loose intersection-over-union (IoU) thresholds it remains close to framewise OWLv2-Large (0.747/0.721 at 0.2/0.3 versus 0.784/0.780), reflecting that coarse localization is largely preserved. Under the same keyframe schedule, MVP outperforms tracker-based propagation (MOSSE, KCF, CSRT) at mAP@0.5. A supervised reference (YOLOv12x) reaches 0.631 at mAP@0.5 but requires labeled training, whereas our method remains label-free and open-vocabulary. These results indicate that compressed-domain propagation is a practical way to reduce detector invocations while keeping strong zero-shot coverage in videos. Our code and models are available at https://github.com/microa/MVP.
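
One plausible reading of the 3x3 grid aggregation is sketched below: motion vectors are sampled at nine points inside a box, their median gives the translation, and the change in spread of the displaced points gives a uniform scale, which is rejected if the box area would change too much. This is not the authors' implementation; the median aggregation, the spread-based scale estimate, the `max_area_growth` threshold, and the `mv_at` callback are all assumptions.

```python
import math
from statistics import median


def propagate_box(box, mv_at, max_area_growth=1.5):
    """Move and uniformly rescale a box using motion vectors sampled on a 3x3 grid.

    box: (x1, y1, x2, y2); mv_at(x, y) returns the (dx, dy) motion at that pixel.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    # Centres of a 3x3 grid of cells covering the box.
    points = [(x1 + (i + 0.5) * width / 3, y1 + (j + 0.5) * height / 3)
              for j in range(3) for i in range(3)]
    vectors = [mv_at(x, y) for x, y in points]
    # Translation: per-axis median, which tolerates a few outlier vectors.
    dx = median(v[0] for v in vectors)
    dy = median(v[1] for v in vectors)

    def spread(pts):
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        return sum(math.hypot(p[0] - cx, p[1] - cy) for p in pts) / len(pts)

    # Uniform scale: how much the grid expands or contracts under the motion.
    moved = [(x + v[0], y + v[1]) for (x, y), v in zip(points, vectors)]
    base = spread(points)
    scale = spread(moved) / base if base > 0 else 1.0
    # Area-growth check: ignore implausible size changes.
    if not (1.0 / max_area_growth <= scale * scale <= max_area_growth):
        scale = 1.0
    cx, cy = (x1 + x2) / 2 + dx, (y1 + y2) / 2 + dy
    new_w, new_h = width * scale, height * scale
    return (cx - new_w / 2, cy - new_h / 2, cx + new_w / 2, cy + new_h / 2)


# Hypothetical uniform motion field: everything shifts 4 px right and 2 px up.
shifted_box = propagate_box((10.0, 10.0, 40.0, 40.0), lambda x, y: (4.0, -2.0))
```

Since motion vectors are already present in the compressed video stream, an update like this costs a handful of arithmetic operations per box, compared with a full detector pass per frame.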

[87] Improving the color accuracy of lighting estimation models

Zitian Zhang,Joshua Urban Davis,Jeanne Phuong Anh Vu,Jiangtao Kuang,Jean-François Lalonde

Main category: cs.CV

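TL;DR: This paper studies color robustness in single-image high dynamic range (HDR) lighting estimation and proposes preprocessing the input image with a pre-trained white balance network, which significantly improves the color accuracy of existing lighting estimation methods without retraining the model.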

Details Motivation: Realistically blending virtual objects into real scenes requires accurate lighting estimation, and color accuracy is a key factor in visual realism that existing methods often overlook. Method: Builds an HDR dataset with diverse lighting colors, systematically evaluates several adaptation strategies, and focuses on how input preprocessing with a pre-trained white balance network affects existing lighting estimation models. Result: Experiments show that preprocessing the input with a pre-trained white balance network significantly improves color robustness, outperforms the other strategies in all tested scenarios, and generalizes across three state-of-the-art lighting estimation methods. Conclusion: Simple input preprocessing such as white balancing effectively improves the color accuracy of existing HDR lighting estimation models, offering an efficient plug-and-play way to enhance visual realism in AR applications. Abstract: Advances in high dynamic range (HDR) lighting estimation from a single image have opened new possibilities for augmented reality (AR) applications. Predicting complex lighting environments from a single input image allows for the realistic rendering and compositing of virtual objects. In this work, we investigate the color robustness of such methods -- an often overlooked yet critical factor for achieving visual realism. While most evaluations conflate color with other lighting attributes (e.g., intensity, direction), we isolate color as the primary variable of interest. Rather than introducing a new lighting estimation algorithm, we explore whether simple adaptation techniques can enhance the color accuracy of existing models. Using a novel HDR dataset featuring diverse lighting colors, we systematically evaluate several adaptation strategies. Our results show that preprocessing the input image with a pre-trained white balance network improves color robustness, outperforming other strategies across all tested scenarios. Notably, this approach requires no retraining of the lighting estimation model. We further validate the generality of this finding by applying the technique to three state-of-the-art lighting estimation methods from recent literature.

[88] Check Field Detection Agent (CFD-Agent) using Multimodal Large Language and Vision Language Models

Sourav Halder,Jinjun Tong,Xinyu Wu

Main category: cs.CV

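TL;DR: This paper proposes a training-free automated check field detection framework that combines a vision language model (VLM) with a multimodal large language model (MLLM) for zero-shot detection of key check fields; it generalizes well and can generate high-quality labeled data to support the later development of specialized models.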

Details Motivation: Because check data is proprietary and privacy-sensitive, traditional check fraud detection methods that rely on large labeled datasets face a shortage of resources, so a training-free detection approach that adapts to many check formats is needed. Method: Proposes a training-free framework in which a vision language model (VLM) and a multimodal large language model (MLLM) work together to detect and localize key check fields (such as the signature, MICR line, and amounts) in a zero-shot manner. Result: Evaluated on a hand-curated dataset of 110 checks in different formats, the model shows strong detection performance and good generalization, accurately identifying key fields across varied layouts. Conclusion: The method lowers the barrier to deployment in real-world financial settings and can also serve as a tool for generating high-quality labeled data, helping institutions develop customized real-time detection models. Abstract: Checks remain a foundational instrument in the financial ecosystem, facilitating substantial transaction volumes across institutions. However, their continued use also renders them a persistent target for fraud, underscoring the importance of robust check fraud detection mechanisms. At the core of such systems lies the accurate identification and localization of critical fields, such as the signature, magnetic ink character recognition (MICR) line, courtesy amount, legal amount, payee, and payer, which are essential for subsequent verification against reference checks belonging to the same customer. This field-level detection is traditionally dependent on object detection models trained on large, diverse, and meticulously labeled datasets, a resource that is scarce due to proprietary and privacy concerns. In this paper, we introduce a novel, training-free framework for automated check field detection, leveraging the power of a vision language model (VLM) in conjunction with a multimodal large language model (MLLM). Our approach enables zero-shot detection of check components, significantly lowering the barrier to deployment in real-world financial settings. Quantitative evaluation of our model on a hand-curated dataset of 110 checks spanning multiple formats and layouts demonstrates strong performance and generalization capability. Furthermore, this framework can serve as a bootstrap mechanism for generating high-quality labeled datasets, enabling the development of specialized real-time object detection models tailored to institutional needs.

[89] Losing the Plot: How VLM responses degrade on imperfect charts

Philip Wootaek Shin,Jack Sampson,Vijaykrishnan Narayanan,Andres Marquez,Mahantesh Halappanavar

Main category: cs.CV

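TL;DR: This paper introduces CHART NOISe, a new dataset for evaluating the chart understanding of vision language models under noise and occlusion, reveals systematic vulnerabilities of existing models on degraded inputs, and proposes baseline strategies for improving robustness and reliability.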

Details Motivation: Real-world charts often contain distortions and demand complex reasoning, whereas existing benchmarks are mostly built on clean charts and fact-based questions and cannot adequately assess model performance in realistic settings. Method: Builds the CHART NOISe dataset, which combines chart corruptions, occlusions, and multiple-choice questions inspired by the English section of Korea's CSAT, introduces "prompt reverse inconsistency" as an evaluation measure, and evaluates advanced VLMs including ChatGPT 4o, Claude Sonnet 4, and Gemini 2.5 Pro. Result: Model performance drops sharply under noise or occlusion; hallucinations (such as value fabrication and trend misinterpretation) become more frequent while the models remain overconfident; models also contradict themselves when asked to confirm versus deny the same statement. Conclusion: CHART NOISe provides a more rigorous testbed for chart understanding, exposes the robustness and reliability shortcomings of current VLMs, and points toward improvements through strategies such as quality filtering and occlusion detection. Abstract: Vision language models (VLMs) show strong results on chart understanding, yet existing benchmarks assume clean figures and fact-based queries. Real-world charts often contain distortions and demand reasoning beyond simple matching. We evaluate ChatGPT 4o, Claude Sonnet 4, and Gemini 2.5 Pro, finding sharp performance drops under corruption or occlusion, with hallucinations such as value fabrication, trend misinterpretation, and entity confusion becoming more frequent. Models remain overconfident in degraded settings, generating plausible but unsupported explanations. To address this gap, we introduce CHART NOISe (Chart Hallucinations, Answers, and Reasoning Testing on Noisy and Occluded Input Selections), a dataset combining chart corruptions, occlusions, and exam-style multiple-choice questions inspired by Korea's CSAT English section. A key innovation is prompt reverse inconsistency, where models contradict themselves when asked to confirm versus deny the same statement. Our contributions are threefold: (1) benchmarking state-of-the-art VLMs, exposing systematic vulnerabilities in chart reasoning; (2) releasing CHART NOISe, the first dataset unifying corruption, occlusion, and reverse inconsistency; and (3) proposing baseline mitigation strategies such as quality filtering and occlusion detection. Together, these efforts establish a rigorous testbed for advancing robustness and reliability in chart understanding.
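The prompt reverse inconsistency idea described above (a model contradicting itself when asked to confirm versus deny the same statement) can be turned into a simple rate. The sketch below is one plausible way to score it, not the dataset's official metric; the function name and the "yes"/"no" answer encoding are assumptions.

```python
def reverse_inconsistency_rate(answer_pairs):
    """Fraction of statements on which a model contradicts itself.

    answer_pairs: list of (confirm_answer, deny_answer) strings, each "yes" or "no".
      confirm_answer responds to "Is this statement true?"
      deny_answer responds to "Is this statement false?"
    A self-consistent model gives opposite answers to the two prompts,
    so identical answers count as a contradiction.
    """
    if not answer_pairs:
        return 0.0
    contradictions = sum(1 for confirm, deny in answer_pairs if confirm == deny)
    return contradictions / len(answer_pairs)
```

For example, the pairs `("yes", "no")` and `("no", "yes")` are consistent, while `("yes", "yes")` and `("no", "no")` are contradictions, so a list containing one of each yields a rate of 0.5.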

[90] CPT-4DMR: Continuous sPatial-Temporal Representation for 4D-MRI Reconstruction

Xinyang Wu,Muheng Li,Xia Li,Orso Pusterla,Sairos Safai,Philippe C. Cattin,Antony J. Lomax,Ye Zhang

Main category: cs.CV

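TL;DR: Proposes a neural-representation-based 4D-MRI reconstruction framework that captures respiratory motion efficiently and with high fidelity through continuous deformation modeling and two cooperating networks, without phase binning or template scans.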

Details Motivation: Conventional 4D-MRI methods rely on phase binning or template scans, struggle to capture temporal variability in breathing, and involve complex workflows and high computational cost. Method: Proposes a neural representation framework comprising a Spatial Anatomy Network (SAN) and a Temporal Motion Network (TMN). The SAN encodes a continuous 3D anatomical representation, while the TMN uses Transformer-derived respiratory signals to produce temporally consistent deformation fields, fusing image reconstruction and motion modeling end to end. Result: Validated on free-breathing data from 19 volunteers, the method accurately captures both regular and irregular breathing patterns and preserves vessel and bronchial continuity with high anatomical fidelity; processing time drops from about five hours to 15 minutes of training plus under one second of inference per volume. Conclusion: The framework needs neither templates nor phase binning, reconstructs high-quality 3D images at any respiratory state, outperforms conventional methods, and shows strong potential for 4D radiation therapy planning and real-time adaptive treatment. Abstract: Four-dimensional MRI (4D-MRI) is a promising technique for capturing respiratory-induced motion in radiation therapy planning and delivery. Conventional 4D reconstruction methods, which typically rely on phase binning or separate template scans, struggle to capture temporal variability, complicate workflows, and impose heavy computational loads. We introduce a neural representation framework that considers respiratory motion as a smooth, continuous deformation steered by a 1D surrogate signal, completely replacing the conventional discrete sorting approach. The new method fuses motion modeling with image reconstruction through two synergistic networks: the Spatial Anatomy Network (SAN) encodes a continuous 3D anatomical representation, while a Temporal Motion Network (TMN), guided by Transformer-derived respiratory signals, produces temporally consistent deformation fields. Evaluation using a free-breathing dataset of 19 volunteers demonstrates that our template- and phase-free method accurately captures both regular and irregular respiratory patterns, while preserving vessel and bronchial continuity with high anatomical fidelity. The proposed method significantly improves efficiency, reducing the total processing time from approximately five hours required by conventional discrete sorting methods to just 15 minutes of training. Furthermore, it enables inference of each 3D volume in under one second.
The framework accurately reconstructs 3D images at any respiratory state, achieves superior performance compared to conventional methods, and demonstrates strong potential for application in 4D radiation therapy planning and real-time adaptive treatment.

[91] An Analysis of Kalman Filter based Object Tracking Methods for Fast-Moving Tiny Objects

Prithvi Raj Singh,Raju Gottumukkala,Anthony Maida

Main category: cs.CV

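TL;DR: This paper evaluates five state-of-the-art Kalman filter-based tracking methods on fast-moving tiny objects (such as a racquetball) and finds that, although DeepOCSORT is the most accurate, all methods exhibit significant tracking drift, indicating that existing methods are limited in handling unpredictable motion and that specialized techniques are needed.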

Details Motivation: Fast-moving tiny objects have unpredictable motion patterns and small visual footprints, which degrades the performance of conventional Kalman filter-based trackers, so the effectiveness of existing methods in such applications needs to be evaluated and their limitations exposed. Method: Using a custom dataset of 10,000 annotated frames, the study experimentally evaluates five state-of-the-art trackers (OCSORT, DeepOCSORT, ByteTrack, BoTSORT, and StrongSORT), focusing on how inference speed and update frequency per image affect tracking accuracy and reliability. Result: DeepOCSORT has the lowest tracking error, with an average ADE of 31.15 pixels; ByteTrack has the fastest inference, averaging 26.6 ms; but all methods show spatial errors of 3-11 cm, indicating significant tracking drift. Conclusion: Current Kalman filter-based trackers have fundamental limitations on fast-moving tiny objects, with error rates 3-4x higher than on standard benchmarks, and more specialized tracking methods are needed for such applications. Abstract: Unpredictable movement patterns and small visual mark make precise tracking of fast-moving tiny objects like a racquetball one of the challenging problems in computer vision. This challenge is particularly relevant for sport robotics applications, where lightweight and accurate tracking systems can improve robot perception and planning capabilities. While Kalman filter-based tracking methods have shown success in general object tracking scenarios, their performance degrades substantially when dealing with rapidly moving objects that exhibit irregular bouncing behavior. In this study, we evaluate the performance of five state-of-the-art Kalman filter-based tracking methods (OCSORT, DeepOCSORT, ByteTrack, BoTSORT, and StrongSORT) using a custom dataset containing 10,000 annotated racquetball frames captured at 720p-1280p resolution. We focus our analysis on two critical performance factors: inference speed and update frequency per image, examining how these parameters affect tracking accuracy and reliability for fast-moving tiny objects. Our experimental evaluation across four distinct scenarios reveals that DeepOCSORT achieves the lowest tracking error with an average ADE of 31.15 pixels compared to ByteTrack's 114.3 pixels, while ByteTrack demonstrates the fastest processing at 26.6ms average inference time versus DeepOCSORT's 26.8ms.
However, our results show that all Kalman filter-based trackers exhibit significant tracking drift with spatial errors ranging from 3-11cm (ADE values: 31-114 pixels), indicating fundamental limitations in handling the unpredictable motion patterns of fast-moving tiny objects like racquetballs. Our analysis demonstrates that current tracking approaches require substantial improvements, with error rates 3-4x higher than standard object tracking benchmarks, highlighting the need for specialized methodologies for fast-moving tiny object tracking applications.
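The ADE figures quoted in this entry are average displacement errors. As a minimal sketch, assuming ADE is the mean Euclidean distance between predicted and ground-truth ball centers over the evaluated frames (the usual definition; the paper's exact protocol is not given here), it can be computed as follows. The function name is illustrative.

```python
from math import hypot

def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance (in pixels) between matched 2D positions.

    predicted, ground_truth: equal-length lists of (x, y) pixel coordinates,
    one pair per frame.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty position lists of equal length")
    total = sum(hypot(px - gx, py - gy)
                for (px, py), (gx, gy) in zip(predicted, ground_truth))
    return total / len(predicted)
```

The abstract pairs 31-114 pixels with 3-11 cm, which is consistent with a scale of roughly 0.1 cm per pixel in that setup; that factor is inferred from the reported ranges and would differ for other cameras and distances.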

[92] MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition

Binhua Huang,Wendong Yao,Shaowu Chen,Guoxin Wang,Qingyuan Wang,Soumyabrata Dev

Main category: cs.CV

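TL;DR: MoCrop is a training-free, motion-aware adaptive cropping module that uses motion vectors in H.264 video to locate motion-dense regions, improving the efficiency and accuracy of compressed-domain video action recognition.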

Details Motivation: In compressed-domain video action recognition, a key question is how to exploit information that is already available (such as motion vectors) to improve model performance and reduce compute. Existing methods typically require extra training or additional parameters, which limits their efficiency and generality. Method: MoCrop uses H.264 motion vectors and a motion-density submatrix search to find a single best crop region, which is applied to all I-frames; a lightweight pipeline of denoising & merge (DM), Monte Carlo sampling (MCS), and adaptive cropping (AC) yields robust crops at low overhead. Result: On UCF101, ResNet-50 with MoCrop gains +3.5% Top-1 accuracy at equal FLOPs, or +2.4% with 26.5% fewer FLOPs; applied to models such as CoViAR it significantly lowers compute while maintaining high performance, and it generalizes well across several backbones. Conclusion: MoCrop is training-free and adds no parameters, is highly general and cheap to deploy, and effectively improves both the efficiency and the accuracy of compressed-domain video action recognition, making it suitable for real-time applications. Abstract: We introduce MoCrop, a motion-aware adaptive cropping module for efficient video action recognition in the compressed domain. MoCrop uses motion vectors that are available in H.264 video to locate motion-dense regions and produces a single clip-level crop that is applied to all I-frames at inference. The module is training-free, adds no parameters, and can be plugged into diverse backbones. A lightweight pipeline that includes denoising & merge (DM), Monte Carlo sampling (MCS), and adaptive cropping (AC) via a motion-density submatrix search yields robust crops with negligible overhead. On UCF101, MoCrop improves accuracy or reduces compute. With ResNet-50, it delivers +3.5% Top-1 accuracy at equal FLOPs (attention setting), or +2.4% Top-1 accuracy with 26.5% fewer FLOPs (efficiency setting). Applied to CoViAR, it reaches 89.2% Top-1 accuracy at the original cost and 88.5% Top-1 accuracy while reducing compute from 11.6 to 8.5 GFLOPs. Consistent gains on MobileNet-V3, EfficientNet-B1, and Swin-B indicate strong generality and make MoCrop practical for real-time deployment in the compressed domain. Our code and models are available at https://github.com/microa/MoCrop.
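A motion-density submatrix search, as named in the MoCrop entry above, can be illustrated with a summed-area table that scores every candidate window in constant time. This is a simplified sketch, not the released MoCrop code: it assumes a fixed window size and a precomputed 2D density grid, whereas the paper's adaptive cropping presumably also selects the crop size. The name `best_crop` is hypothetical.

```python
def best_crop(density, win_h, win_w):
    """Return ((row, col), total) for the win_h x win_w window with the most motion.

    density: 2D list of non-negative motion magnitudes, one value per block.
    """
    rows, cols = len(density), len(density[0])
    # Summed-area table with an extra zero row and column for easy indexing.
    sat = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (density[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])
    best_total, best_pos = float("-inf"), (0, 0)
    for r in range(rows - win_h + 1):
        for c in range(cols - win_w + 1):
            # Window sum from four table lookups.
            total = (sat[r + win_h][c + win_w] - sat[r][c + win_w]
                     - sat[r + win_h][c] + sat[r][c])
            if total > best_total:
                best_total, best_pos = total, (r, c)
    return best_pos, best_total
```

Because the crop is computed once per clip from the motion vectors alone, a search of this kind adds no learned parameters, which matches the training-free design the abstract describes.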

[93] Codebook-Based Adaptive Feature Compression With Semantic Enhancement for Edge-Cloud Systems

Xinyu Wang,Zikun Zhou,Yingjian Li,Xin An,Hongpeng Wang

Main category: cs.CV

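TL;DR: Proposes CAFC-SE, a codebook-based adaptive feature compression framework that maps continuous visual features to discrete indices via vector quantization, effectively preserving informative visual patterns under low-bitrate conditions and improving analysis performance in edge-cloud systems.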

Details Motivation: Existing image and feature compression methods perform poorly at low bitrates: they either retain too much redundant information or learn over-concentrated symbol distributions, making it hard to balance compression efficiency and analysis accuracy. Method: Proposes the CAFC-SE framework, which uses a codebook for vector quantization (VQ) to map continuous visual features at the edge to discrete indices, selectively transmits them to the cloud, and applies semantic enhancement to strengthen the feature representation. Result: Experiments show that CAFC-SE significantly outperforms existing methods under low-bitrate conditions, in terms of both rate and analysis accuracy. Conclusion: By discretizing and selectively transmitting visual features, CAFC-SE effectively addresses information loss at low bitrates and offers a viable approach to efficient visual analysis in edge-cloud systems. Abstract: Coding images for machines with minimal bitrate and strong analysis performance is key to effective edge-cloud systems. Several approaches deploy an image codec and perform analysis on the reconstructed image. Other methods compress intermediate features using entropy models and subsequently perform analysis on the decoded features. Nevertheless, these methods both perform poorly under low-bitrate conditions, as they retain many redundant details or learn over-concentrated symbol distributions. In this paper, we propose a Codebook-based Adaptive Feature Compression framework with Semantic Enhancement, named CAFC-SE. It maps continuous visual features to discrete indices with a codebook at the edge via Vector Quantization (VQ) and selectively transmits them to the cloud. The VQ operation that projects feature vectors onto the nearest visual primitives enables us to preserve more informative visual patterns under low-bitrate conditions. Hence, CAFC-SE is less vulnerable to low-bitrate conditions. Extensive experiments demonstrate the superiority of our method in terms of rate and accuracy.
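The vector quantization step mentioned above (projecting each feature vector onto its nearest codebook entry and sending only the index) can be sketched in a few lines. This shows plain nearest-neighbor VQ only; the selective transmission and semantic enhancement parts of CAFC-SE are not modeled, and the function names are illustrative.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)),
                key=lambda k: squared_distance(f, codebook[k]))
            for f in features]

def dequantize(indices, codebook):
    """Recover approximate features on the receiving side from the indices."""
    return [codebook[i] for i in indices]
```

With a codebook of K entries, each feature vector costs about log2(K) bits before any entropy coding, which is why index transmission suits low-bitrate links between edge and cloud.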

[94] MK-UNet: Multi-kernel Lightweight CNN for Medical Image Segmentation

Md Mostafijur Rahman,Radu Marculescu

Main category: cs.CV

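TL;DR: This paper proposes MK-UNet, an ultra-lightweight multi-kernel U-shaped convolutional neural network for medical image segmentation that offers low computational cost and high accuracy.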

Details Motivation: To design a medical image segmentation model that is both lightweight and effective, meeting the need for real-time, high-fidelity diagnostics in resource-limited settings. Method: Introduces a multi-kernel depth-wise convolution block (MKDC) and several attention mechanisms (channel, spatial, and grouped gated attention) that improve segmentation performance while keeping the parameter count and FLOPs very low. Result: MK-UNet surpasses state-of-the-art methods on six binary medical imaging benchmarks, achieving a higher DICE score than TransUNet with nearly 333× fewer parameters and 123× fewer FLOPs; compared with UNeXt, it improves DICE by up to 6.7% with 4.7× fewer parameters. Conclusion: MK-UNet is a promising, efficient medical image segmentation solution for resource-limited scenarios such as point-of-care devices. Abstract: In this paper, we introduce MK-UNet, a paradigm shift towards ultra-lightweight, multi-kernel U-shaped CNNs tailored for medical image segmentation. Central to MK-UNet is the multi-kernel depth-wise convolution block (MKDC) we design to adeptly process images through multiple kernels, while capturing complex multi-resolution spatial relationships. MK-UNet also emphasizes the image's salient features through sophisticated attention mechanisms, including channel, spatial, and grouped gated attention. Our MK-UNet network, with a modest computational footprint of only 0.316M parameters and 0.314G FLOPs, represents not only a remarkably lightweight, but also significantly improved segmentation solution that provides higher accuracy over state-of-the-art (SOTA) methods across six binary medical imaging benchmarks. Specifically, MK-UNet outperforms TransUNet in DICE score with nearly 333× and 123× fewer parameters and FLOPs, respectively. Similarly, when compared against UNeXt, MK-UNet exhibits superior segmentation performance, improving the DICE score by up to 6.7% while operating with 4.7× fewer parameters. Our MK-UNet also outperforms other recent lightweight networks, such as MedT, CMUNeXt, EGE-UNet, and Rolling-UNet, with much lower computational resources. This leap in performance, coupled with drastic computational gains, positions MK-UNet as an unparalleled solution for real-time, high-fidelity medical diagnostics in resource-limited settings, such as point-of-care devices. Our implementation is available at https://github.com/SLDGroup/MK-UNet.

[95] BridgeSplat: Bidirectionally Coupled CT and Non-Rigid Gaussian Splatting for Deformable Intraoperative Surgical Navigation

Maximilian Fehrentz,Alexander Winkler,Thomas Heiliger,Nazim Haouchine,Christian Heiliger,Nassir Navab

Main category: cs.CV

TL;DR: Proposes BridgeSplat, a new method that couples intraoperative 3D reconstruction with preoperative CT data to enable deformable surgical navigation.

Details Motivation: To bridge the gap between surgical video and volumetric patient data and improve the accuracy of surgical navigation based on monocular RGB data. Method: Rigs 3D Gaussians to a CT mesh, jointly optimizes Gaussian parameters and mesh deformation through photometric supervision, and keeps the Gaussians aligned with the mesh by parametrizing each one relative to its parent triangle. Result: The method is validated on visceral pig surgeries and on synthetic data of a simulated human liver, showing sensible deformations of the preoperative CT from monocular RGB data. Conclusion: BridgeSplat effectively fuses intraoperative visual information with preoperative CT to model soft-tissue deformation accurately and interpretably, and has potential for clinical use. Abstract: We introduce BridgeSplat, a novel approach for deformable surgical navigation that couples intraoperative 3D reconstruction with preoperative CT data to bridge the gap between surgical video and volumetric patient data. Our method rigs 3D Gaussians to a CT mesh, enabling joint optimization of Gaussian parameters and mesh deformation through photometric supervision. By parametrizing each Gaussian relative to its parent mesh triangle, we enforce alignment between Gaussians and mesh and obtain deformations that can be propagated back to update the CT. We demonstrate BridgeSplat's effectiveness on visceral pig surgeries and synthetic data of a human liver under simulation, showing sensible deformations of the preoperative CT on monocular RGB data. Code, data, and additional resources can be found at https://maxfehrentz.github.io/ct-informed-splatting/ .
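One common way to parametrize a point relative to a parent triangle, as the BridgeSplat entry above describes for Gaussians, is with barycentric weights, so the point follows the triangle when its vertices move. The sketch below covers only the Gaussian center and assumes barycentric coordinates are used; the abstract does not state the exact parametrization, and any normal offset, rotation, or scale handling is omitted.

```python
def barycentric_point(triangle, weights):
    """Position of a point defined by barycentric weights on a triangle.

    triangle: three vertices, each an (x, y, z) tuple.
    weights: (wa, wb, wc), normally summing to 1.
    """
    a, b, c = triangle
    wa, wb, wc = weights
    return tuple(wa * ai + wb * bi + wc * ci for ai, bi, ci in zip(a, b, c))

# The weights are fixed once; only the mesh vertices change during deformation.
weights = (0.5, 0.25, 0.25)
rest_triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
deformed_triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 4.0))
rest_center = barycentric_point(rest_triangle, weights)
deformed_center = barycentric_point(deformed_triangle, weights)
```

Because the center is a function of the vertices, gradients from a photometric loss on the rendered Gaussians can flow back to the mesh vertices, which is the coupling that lets mesh deformation be optimized from images.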

[96] Source-Free Domain Adaptive Semantic Segmentation of Remote Sensing Images with Diffusion-Guided Label Enrichment

Wenjie Liu,Hongmin Liu,Lixin Zhang,Bin Fan

Main category: cs.CV

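TL;DR: Proposes Diffusion-Guided Label Enrichment (DGLE), a new pseudo-label optimization framework for semantic segmentation of remote sensing images when source-domain data is inaccessible; starting from a small number of high-quality pseudo-labels, it uses a diffusion model to propagate them into a complete, high-quality pseudo-label set, significantly improving model performance on the target domain.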

Details Motivation: In practical scenarios where source-domain data is inaccessible, existing unsupervised domain adaptation methods are hard to optimize effectively because of pseudo-label noise, which limits self-training performance; a new method that effectively improves pseudo-label quality is therefore needed. Method: First, a fusion method based on confidence filtering and super-resolution enhancement produces a small number of high-quality pseudo-labels as initial seeds; then, exploiting the diffusion model's strong denoising capability and its capacity to model complex distributions, these incomplete, irregularly distributed seed pseudo-labels are propagated into a complete, high-quality pseudo-label set. Result: The proposed DGLE framework outperforms existing SFDA methods on several remote sensing image datasets, markedly improving pseudo-label quality and segmentation accuracy on the target domain. Conclusion: Through its diffusion-guided pseudo-label enrichment strategy, DGLE avoids the difficulty of directly optimizing the complete pseudo-label set and provides an efficient and reliable solution for source-free domain adaptation. Abstract: Research on unsupervised domain adaptation (UDA) for semantic segmentation of remote sensing images has been extensively conducted. However, research on how to achieve domain adaptation in practical scenarios where source domain data is inaccessible, namely source-free domain adaptation (SFDA), remains limited. Self-training has been widely used in SFDA, which requires obtaining as many high-quality pseudo-labels as possible to train models on target domain data. Most existing methods optimize the entire pseudo-label set to obtain more supervisory information. However, as pseudo-label sets often contain substantial noise, simultaneously optimizing all labels is challenging. This limitation undermines the effectiveness of optimization approaches and thus restricts the performance of self-training. To address this, we propose a novel pseudo-label optimization framework called Diffusion-Guided Label Enrichment (DGLE), which starts from a few easily obtained high-quality pseudo-labels and propagates them to a complete set of pseudo-labels while ensuring the quality of newly generated labels. Firstly, a pseudo-label fusion method based on confidence filtering and super-resolution enhancement is proposed, which utilizes cross-validation of details and contextual information to obtain a small number of high-quality pseudo-labels as initial seeds.
Then, we leverage the diffusion model to propagate incomplete seed pseudo-labels with irregular distributions due to its strong denoising capability for randomly distributed noise and powerful modeling capacity for complex distributions, thereby generating complete and high-quality pseudo-labels. This method effectively avoids the difficulty of directly optimizing the complete set of pseudo-labels, significantly improves the quality of pseudo-labels, and thus enhances the model's performance in the target domain.

[97] Hyperbolic Coarse-to-Fine Few-Shot Class-Incremental Learning

Jiaxin Dai,Xiang Xiang

Main category: cs.CV

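TL;DR: This paper proposes embedding the feature extractor into hyperbolic space (the Poincaré ball model) to improve fine-grained class-incremental learning, introduces a hyperbolic contrastive loss and hyperbolic fully-connected layers, and generates augmented features from a maximum entropy distribution, effectively improving classification performance with few samples.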

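Details Motivation: Conventional Euclidean space is limited in representing hierarchical data, whereas hyperbolic space is better suited to modeling hierarchical structure. To better interpret the "coarse-to-fine" learning paradigm and improve fine-grained classification under few-shot conditions, the authors explore embedding features into hyperbolic space. Method: Uses the Poincaré ball model to build feature representations in hyperbolic space, designs a hyperbolic contrastive loss and hyperbolic fully-connected layers to support optimization and classification, and uses a maximum entropy distribution in hyperbolic space to estimate the probability distribution of fine-class features, from which augmented features are generated to mitigate overfitting when training with few samples. Result: Experiments on C2FSCIL benchmarks show that the method significantly improves both coarse- and fine-class accuracy, performing especially well in few-shot settings. Conclusion: Carrying out feature extraction and classification in hyperbolic space captures hierarchical semantic structure more effectively and, combined with the feature augmentation strategy, significantly improves coarse-to-fine few-shot class-incremental learning. Abstract: In the field of machine learning, hyperbolic space demonstrates superior representation capabilities for hierarchical data compared to conventional Euclidean space. This work focuses on the Coarse-To-Fine Few-Shot Class-Incremental Learning (C2FSCIL) task. Our study follows the Knowe approach, which contrastively learns coarse class labels and subsequently normalizes and freezes the classifier weights of learned fine classes in the embedding space. To better interpret the "coarse-to-fine" paradigm, we propose embedding the feature extractor into hyperbolic space. Specifically, we employ the Poincaré ball model of hyperbolic space, enabling the feature extractor to transform input images into feature vectors within the Poincaré ball instead of Euclidean space. We further introduce hyperbolic contrastive loss and hyperbolic fully-connected layers to facilitate model optimization and classification in hyperbolic space. Additionally, to enhance performance under few-shot conditions, we implement maximum entropy distribution in hyperbolic space to estimate the probability distribution of fine-class feature vectors. This allows generation of augmented features from the distribution to mitigate overfitting during training with limited samples. Experiments on C2FSCIL benchmarks show that our method effectively improves both coarse and fine class accuracies.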
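For readers unfamiliar with the Poincaré ball model used above, the standard geodesic distance for curvature -1 is d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). A minimal sketch follows; it shows only this textbook distance and makes no claim about how the paper's contrastive loss or hyperbolic layers use it.

```python
from math import acosh

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    def squared_norm(x):
        return sum(t * t for t in x)
    nu, nv = squared_norm(u), squared_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = squared_norm([a - b for a, b in zip(u, v)])
    return acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))
```

Distances grow rapidly near the boundary of the ball, which leaves room for many well-separated fine classes around a coarse class placed nearer the origin. This is the usual argument for why hyperbolic space fits tree-like label hierarchies.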

[98] GeoRemover: Removing Objects and Their Causal Visual Artifacts

Zixin Zhu,Haoxiang Li,Xuelu Feng,He Wu,Chunming Qiao,Junsong Yuan

Main category: cs.CV

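TL;DR: Proposes a geometry-aware two-stage framework that decouples geometry removal from appearance rendering to effectively remove target objects together with their causal visual artifacts (such as shadows and reflections), achieving state-of-the-art performance on two benchmarks.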

Details Motivation: Existing methods ignore the causal relationship between an object's geometric presence and its visual effects, so they either fail to fully remove the target object's causal visual artifacts or lack controllability. Method: Uses a two-stage framework: the first stage removes the object in geometry space (e.g., depth) under strictly mask-aligned supervision; the second stage renders a photorealistic RGB image conditioned on the updated geometry, implicitly eliminating causal visual effects. A preference-driven objective based on positive and negative sample pairs guides learning in the geometry removal stage. Result: Experiments on two popular benchmarks show that the method outperforms existing approaches at removing objects and their associated artifacts, achieving state-of-the-art performance. Conclusion: The geometry-aware framework removes objects and their causal visual artifacts effectively and controllably, offering a new solution for intelligent image editing. Abstract: Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these causal effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object's geometric presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The code is available at https://github.com/buxiangzhiren/GeoRemover.

[99] SEGA: A Transferable Signed Ensemble Gaussian Black-Box Attack against No-Reference Image Quality Assessment Models

Yujia Liu,Dingquan Li,Tiejun Huang

Main category: cs.CV

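TL;DR: This paper proposes SEGA, a transferable black-box attack that improves attack transferability against no-reference image quality assessment (NR-IQA) models by approximating the target model's gradient through Gaussian smoothing and gradient ensembling, with a filter mask designed to keep perturbations imperceptible.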

Details Motivation: Existing adversarial attacks on NR-IQA models are effective in white-box settings but transfer poorly in the more realistic black-box scenario, making it hard to attack unknown models, so attack transferability needs to be improved. Method: Proposes the Signed Ensemble Gaussian black-box Attack (SEGA), which approximates the target model's gradient by applying Gaussian smoothing to source models and ensembling their smoothed gradients, and designs a perturbation filter mask that removes inappropriate perturbations to keep them visually imperceptible. Result: Experiments on the CLIVE dataset show that SEGA transfers very well across models when attacking NR-IQA models, clearly outperforming existing methods. Conclusion: SEGA effectively addresses the low transferability of black-box attacks on NR-IQA models and offers a new approach to evaluating and strengthening the robustness of such models. Abstract: No-Reference Image Quality Assessment (NR-IQA) models play an important role in various real-world applications. Recently, adversarial attacks against NR-IQA models have attracted increasing attention, as they provide valuable insights for revealing model vulnerabilities and guiding robust system design. Some effective attacks have been proposed against NR-IQA models in white-box settings, where the attacker has full access to the target model. However, these attacks often suffer from poor transferability to unknown target models in more realistic black-box scenarios, where the target model is inaccessible. This work makes the first attempt to address the challenge of low transferability in attacking NR-IQA models by proposing a transferable Signed Ensemble Gaussian black-box Attack (SEGA). The main idea is to approximate the gradient of the target model by applying Gaussian smoothing to source models and ensembling their smoothed gradients. To ensure the imperceptibility of adversarial perturbations, SEGA further removes inappropriate perturbations using a specially designed perturbation filter mask. Experimental results on the CLIVE dataset demonstrate the superior transferability of SEGA, validating its effectiveness in enabling successful transfer-based black-box attacks against NR-IQA models.

[100] HadaSmileNet: Hadamard fusion of handcrafted and deep-learning features for enhancing facial emotion recognition of genuine smiles

Mohammad Junayed Hasan,Nabeel Mohammed,Shafin Rahman,Philipp Koehn

Main category: cs.CV

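TL;DR: This paper proposes HadaSmileNet, a new feature fusion framework that directly fuses Transformer-based representations with physiologically grounded D-Markers through parameter-free multiplicative interactions, achieving state-of-the-art smile emotion recognition on several benchmark datasets while reducing parameters by 26% and simplifying training.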

Details Motivation: Existing multi-task learning approaches to smile recognition are computationally inefficient, require complex loss balancing, and depend on auxiliary task supervision, which limits their deployment in practice. Method: Proposes HadaSmileNet, which uses Hadamard multiplicative fusion to directly combine deep features extracted by a Transformer model with handcrafted, physiologically grounded D-Markers, enabling efficient feature interaction without extra parameters, and systematically evaluates 15 fusion strategies. Result: Achieves accuracies of 88.7%, 99.7%, 98.5%, and 100% on the UvA-NEMO, MMI, SPOS, and BBC benchmark datasets, respectively, outperforming existing methods; compared with multi-task methods it has 26% fewer parameters and simpler training, and visualizations show more discriminative features. Conclusion: By directly fusing deep features with domain knowledge, HadaSmileNet significantly improves expression recognition while remaining computationally efficient, making it suitable for multimedia data mining applications that require real-time affective computing. Abstract: The distinction between genuine and posed emotions represents a fundamental pattern recognition challenge with significant implications for data mining applications in social sciences, healthcare, and human-computer interaction. While recent multi-task learning frameworks have shown promise in combining deep learning architectures with handcrafted D-Marker features for smile facial emotion recognition, these approaches exhibit computational inefficiencies due to auxiliary task supervision and complex loss balancing requirements. This paper introduces HadaSmileNet, a novel feature fusion framework that directly integrates transformer-based representations with physiologically grounded D-Markers through parameter-free multiplicative interactions. Through systematic evaluation of 15 fusion strategies, we demonstrate that Hadamard multiplicative fusion achieves optimal performance by enabling direct feature interactions while maintaining computational efficiency. The proposed approach establishes new state-of-the-art results for deep learning methods across four benchmark datasets: UvA-NEMO (88.7 percent, +0.8), MMI (99.7 percent), SPOS (98.5 percent, +0.7), and BBC (100 percent, +5.0). Comprehensive computational analysis reveals 26 percent parameter reduction and simplified training compared to multi-task alternatives, while feature visualization demonstrates enhanced discriminative power through direct domain knowledge integration.
The framework's efficiency and effectiveness make it particularly suitable for practical deployment in multimedia data mining applications that require real-time affective computing capabilities.
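Hadamard fusion, as named in this entry, is an element-wise product of two feature vectors. The sketch below shows only that operation and assumes both vectors have already been brought to the same length; how HadaSmileNet aligns the Transformer features with the D-Markers is not described in the abstract, and the function name is illustrative.

```python
def hadamard_fuse(deep_features, handcrafted_features):
    """Element-wise (Hadamard) product of two equal-length feature vectors."""
    if len(deep_features) != len(handcrafted_features):
        raise ValueError("Hadamard fusion needs vectors of equal length")
    return [d * h for d, h in zip(deep_features, handcrafted_features)]
```

The product itself has no learnable weights, which is consistent with the "parameter-free multiplicative interactions" wording: each handcrafted value acts as a per-dimension gate on the corresponding deep feature.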

[101] Event-guided 3D Gaussian Splatting for Dynamic Human and Scene Reconstruction

Xiaoting Yin,Hao Shi,Kailun Yang,Jiajun Zhai,Shangwei Guo,Lin Wang,Kaiwei Wang

Main category: cs.CV

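TL;DR: Proposes an event camera-based framework for jointly reconstructing dynamic humans and static scenes, using 3D Gaussian Splatting and an event-guided loss to achieve high-quality, motion-blur-resistant human-scene reconstruction from a monocular event camera.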

Details Motivation: Conventional RGB video suffers from motion blur under fast motion, making it hard to reconstruct dynamic humans and static scenes accurately; event cameras offer microsecond temporal resolution and richer dynamic information, which motivates exploring them for human-scene reconstruction. Method: Represents human and scene with a unified set of 3D Gaussians, distinguished by a learnable semantic attribute; only human Gaussians are deformed for animation, while scene Gaussians stay static; an event-guided loss matches simulated brightness changes between consecutive renderings with the event stream to improve local fidelity in fast-moving regions. Result: Achieves state-of-the-art performance on the ZJU-MoCap-Blur and MMHPSD-Blur benchmarks, with notable PSNR/SSIM gains over strong baselines and lower LPIPS, especially for high-speed subjects. Conclusion: The method exploits the strengths of event cameras to jointly reconstruct dynamic humans and static scenes simply and efficiently without external human masks, significantly improving reconstruction quality under motion blur. Abstract: Reconstructing dynamic humans together with static scenes from monocular videos remains difficult, especially under fast motion, where RGB frames suffer from motion blur. Event cameras exhibit distinct advantages, e.g., microsecond temporal resolution, making them a superior sensing choice for dynamic human reconstruction. Accordingly, we present a novel event-guided human-scene reconstruction framework that jointly models human and scene from a single monocular event camera via 3D Gaussian Splatting. Specifically, a unified set of 3D Gaussians carries a learnable semantic attribute; only Gaussians classified as human undergo deformation for animation, while scene Gaussians stay static. To combat blur, we propose an event-guided loss that matches simulated brightness changes between consecutive renderings with the event stream, improving local fidelity in fast-moving regions. Our approach removes the need for external human masks and simplifies managing separate Gaussian sets. On two benchmark datasets, ZJU-MoCap-Blur and MMHPSD-Blur, it delivers state-of-the-art human-scene reconstruction, with notable gains over strong baselines in PSNR/SSIM and reduced LPIPS, especially for high-speed subjects.
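An event-guided loss of the kind described above can be sketched using the standard event camera model, in which an event fires whenever the log intensity at a pixel changes by a contrast threshold. Under that model the signed event count at a pixel, multiplied by the threshold, approximates the log-brightness change between two timestamps. The code below is an illustrative mean-squared formulation under these assumptions, not the paper's loss; the function name, `contrast_threshold` value, and `eps` are hypothetical.

```python
from math import log

def event_guided_loss(render_prev, render_next, event_counts,
                      contrast_threshold=0.2, eps=1e-6):
    """Mean squared gap between simulated and observed log-brightness change.

    render_prev, render_next: 2D lists of positive intensities from two
        consecutive renderings.
    event_counts: 2D list of signed event counts accumulated between them
        (positive minus negative polarity events per pixel).
    """
    total, n = 0.0, 0
    for row_prev, row_next, row_events in zip(render_prev, render_next, event_counts):
        for p, q, e in zip(row_prev, row_next, row_events):
            simulated = log(q + eps) - log(p + eps)  # change implied by the renders
            observed = contrast_threshold * e        # change implied by the events
            total += (simulated - observed) ** 2
            n += 1
    return total / n
```

The loss is zero when the rendered brightness change agrees with the events at every pixel, and it grows where the rendering moves differently from what the sensor recorded, which is where blurred RGB supervision is least informative.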

[102] Live-E2T: Real-time Threat Monitoring in Video via Deduplicated Event Reasoning and Chain-of-Thought

Yuhan Wang,Cheng Liu,Zihan Zhao,Weichao Wu

Main category: cs.CV

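TL;DR: This paper proposes Live-E2T, a new framework for monitoring threatening behavior in real-time video streams and assessing it with explanations; through semantic tuple decomposition, online event deduplication and updating, and Chain-of-Thought reasoning with a large language model, it outperforms existing methods in accuracy, real-time performance, and explainability.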

Details Motivation: Existing methods struggle to deliver real-time performance and decision explainability at the same time, which limits their use in real-world scenarios. Method: 1) Decomposes video frames into structured Human-Object-Interaction-Place semantic tuples; 2) designs an efficient online event deduplication and updating mechanism to remove spatio-temporal redundancy; 3) fine-tunes a large language model with a Chain-of-Thought strategy so that it reasons transparently and logically over event sequences and generates threat assessment reports. Result: Experiments on benchmark datasets including XD-Violence and UCF-Crime show that Live-E2T significantly outperforms state-of-the-art methods in threat detection accuracy, real-time efficiency, and explainability. Conclusion: Live-E2T effectively unifies real-time performance and explainability, providing a reliable and understandable solution for real-time threat monitoring. Abstract: Real-time threat monitoring identifies threatening behaviors in video streams and provides reasoning and assessment of threat events through explanatory text. However, prevailing methodologies, whether based on supervised learning or generative models, struggle to concurrently satisfy the demanding requirements of real-time performance and decision explainability. To bridge this gap, we introduce Live-E2T, a novel framework that unifies these two objectives through three synergistic mechanisms. First, we deconstruct video frames into structured Human-Object-Interaction-Place semantic tuples. This approach creates a compact, semantically focused representation, circumventing the information degradation common in conventional feature compression. Second, an efficient online event deduplication and updating mechanism is proposed to filter spatio-temporal redundancies, ensuring the system's real-time responsiveness. Finally, we fine-tune a Large Language Model using a Chain-of-Thought strategy, endowing it with the capability for transparent and logical reasoning over event sequences to produce coherent threat assessment reports. Extensive experiments on benchmark datasets, including XD-Violence and UCF-Crime, demonstrate that Live-E2T significantly outperforms state-of-the-art methods in terms of threat detection accuracy, real-time efficiency, and the crucial dimension of explainability.
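The online deduplication step in the Live-E2T entry above can be pictured as merging repeated Human-Object-Interaction-Place tuples that recur within a short time window, so that downstream reasoning sees one event with a duration instead of many identical frames. The sketch below is an assumed, simplified mechanism: exact-match keys and a fixed `window` in seconds are illustrative choices, and the paper's actual matching and update rules are not specified in the abstract.

```python
def deduplicate_events(observations, window=2.0):
    """Merge repeated semantic tuples seen within `window` seconds of each other.

    observations: time-sorted iterable of (timestamp, tuple_key), where tuple_key
        is a hashable (human, object, interaction, place) tuple.
    Returns a list of dicts with the tuple and its start and end timestamps.
    """
    latest_index = {}  # tuple_key -> index of its most recent merged event
    merged = []
    for timestamp, key in observations:
        idx = latest_index.get(key)
        if idx is not None and timestamp - merged[idx]["end"] <= window:
            # Same event still ongoing: extend it instead of emitting a new one.
            merged[idx]["end"] = timestamp
        else:
            latest_index[key] = len(merged)
            merged.append({"tuple": key, "start": timestamp, "end": timestamp})
    return merged
```

Keeping only the merged events shortens the sequence passed to the language model, which is one way such a mechanism could support the real-time responsiveness the abstract emphasizes.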

[103] The Photographer Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

Daiqing Qi,Handong Zhao,Jing Shi,Simon Jenni,Yifei Fan,Franck Dernoncourt,Scott Cohen,Sheng Li

Main category: cs.CV

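TL;DR: This paper proposes a new approach to strengthening the aesthetic understanding of multimodal large language models (MLLMs): a professional photo-critique dataset (PhotoCritique), a multi-view vision fusion model (PhotoEye), and a professional aesthetic understanding benchmark (PhotoBench).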

Details Motivation: Existing MLLMs show limited aesthetic visual understanding. They stay mostly at the level of basic commonsense and lack expertise in photographic technique and post-processing, so they fall short of the complex demands of real-world scenarios. Method: The authors first build PhotoCritique, a large-scale dataset drawn from discussions among professional photographers and enthusiasts; then propose PhotoEye, a language-guided multi-view vision fusion model; and finally establish PhotoBench, a professional benchmark for aesthetic understanding. Result: On existing benchmarks and on the new PhotoBench, PhotoEye significantly outperforms existing models and shows stronger aesthetic understanding. Conclusion: By introducing a professional dataset, a new model architecture, and a professional evaluation benchmark, this work substantially improves MLLM performance on image aesthetics and brings aesthetic and general visual understanding closer together. Abstract: While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. Photographer and curator, Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component--a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise--including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetics understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by the large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present a novel benchmark, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.

[104] Enhancing Video Object Segmentation in TrackRAD Using XMem Memory Network

Pengchao Deng,Shengqi Chen

Main category: cs.CV

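TL;DR: This paper proposes an XMem-based tumor segmentation framework for real-time MRI-guided radiotherapy, developed for the TrackRAD2025 challenge. The method uses a memory-augmented architecture to segment tumors and track their motion across long dynamic MRI sequences, maintaining high accuracy with limited annotated data while meeting clinical real-time requirements. Because the experimental records were lost, no quantitative results can be reported, but preliminary observations indicate good segmentation performance.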

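Details Motivation: In MRI-guided radiotherapy, accurate real-time tracking of tumor motion is essential for treatment safety and effectiveness. Existing methods are challenged by long dynamic MRI sequences, high annotation cost, and strict real-time requirements, so a method is needed that segments and tracks tumors efficiently and accurately with limited annotated data. Method: The study uses XMem, a memory-augmented deep learning architecture, to segment tumors in long cine-MRI sequences. The framework integrates memory from past frames to track tumor motion continuously while keeping inference real-time. The design balances computational efficiency against segmentation accuracy and targets real-time clinical use. Result: The detailed experimental records have been lost, so no specific quantitative evaluation is available. Preliminary observations during development suggest that the XMem-based framework achieves reasonable segmentation performance, tracks the tumor region stably, and meets the clinical real-time requirement. Conclusion: The proposed XMem-based tumor segmentation framework shows potential in real-world MRI-guided radiotherapy, helping to improve tumor localization and thereby the safety and efficacy of treatment. Future work includes recovering or rerunning the experiments to obtain a complete performance evaluation. Abstract: This paper presents an advanced tumor segmentation framework for real-time MRI-guided radiotherapy, designed for the TrackRAD2025 challenge. Our method leverages the XMem model, a memory-augmented architecture, to segment tumors across long cine-MRI sequences. The proposed system efficiently integrates memory mechanisms to track tumor motion in real-time, achieving high segmentation accuracy even under challenging conditions with limited annotated data. Unfortunately, the detailed experimental records have been lost, preventing us from reporting precise quantitative results at this stage. Nevertheless, from our preliminary impressions during development, the XMem-based framework demonstrated reasonable segmentation performance and satisfied the clinical real-time requirement. Our work contributes to improving the precision of tumor tracking during MRI-guided radiotherapy, which is crucial for enhancing the accuracy and safety of cancer treatments.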

[105] SSCM: A Spatial-Semantic Consistent Model for Multi-Contrast MRI Super-Resolution

Xiaoman Wu,Lubin Gan,Siying Wu,Jing Zhang,Yunwei Ou,Xiaoyan Sun

Main category: cs.CV

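TL;DR: Proposes the Spatial-Semantic Consistent Model (SSCM) for multi-contrast MRI super-resolution. With a Dynamic Spatial Warping Module, a Semantic-Aware Token Aggregation Block, and a Spatial-Frequency Fusion Block, it achieves better spatial-semantic consistency and high-frequency detail recovery.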

Details Motivation: Conventional methods for multi-contrast MRI super-resolution model spatial-semantic consistency insufficiently and underuse frequency-domain information. Method: SSCM comprises a Dynamic Spatial Warping Module for inter-contrast spatial alignment, a Semantic-Aware Token Aggregation Block that maintains long-range semantic consistency, and a Spatial-Frequency Fusion Block that restores fine structures. Result: Experiments on public and private datasets show that SSCM reaches state-of-the-art performance with fewer parameters while producing spatially and semantically consistent reconstructions. Conclusion: SSCM improves the quality of multi-contrast MRI super-resolution, preserving precise anatomical alignment and detail while reducing acquisition time. Abstract: Multi-contrast Magnetic Resonance Imaging super-resolution (MC-MRI SR) aims to enhance low-resolution (LR) contrasts leveraging high-resolution (HR) references, shortening acquisition time and improving imaging efficiency while preserving anatomical details. The main challenge lies in maintaining spatial-semantic consistency, ensuring anatomical structures remain well-aligned and coherent despite structural discrepancies and motion between the target and reference images. Conventional methods insufficiently model spatial-semantic consistency and underuse frequency-domain information, which leads to poor fine-grained alignment and inadequate recovery of high-frequency details. In this paper, we propose the Spatial-Semantic Consistent Model (SSCM), which integrates a Dynamic Spatial Warping Module for inter-contrast spatial alignment, a Semantic-Aware Token Aggregation Block for long-range semantic consistency, and a Spatial-Frequency Fusion Block for fine structure restoration. Experiments on public and private datasets show that SSCM achieves state-of-the-art performance with fewer parameters while ensuring spatially and semantically consistent reconstructions.

[106] OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

Zhuoxiao Chen,Hongyang Yu,Ying Xu,Yadan Luo,Long Duong,Yuan-Fang Li

Main category: cs.CV

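TL;DR: This paper proposes Oracle-educated GRPO (OraPO), combined with a FactScore-based reward (FactS), to generate radiology reports efficiently under constrained resources. With single-stage reinforcement learning and a reward grounded in diagnostic facts, it sets a new SOTA using very little training data.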

Details Motivation: Existing radiology report generation methods depend on large-scale data and compute, making training costly and inefficient and hard to adapt to resource-constrained settings. A more efficient, lower-cost method is needed to improve report quality on clinically challenging cases. Method: The OraPO framework trains with single-stage reinforcement learning and uses a lightweight oracle step to convert failed explorations into direct preference supervision. The FactScore-based reward (FactS) extracts atomic clinical facts and checks their entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Result: The method sets a new SOTA on the CheXpert Plus dataset (F1 of 0.341) using only a small base vision-language model and a small amount of training data (2-3 orders of magnitude less than existing methods), on modest hardware. Conclusion: OraPO with FactS forms a compact yet powerful framework for radiology report generation that markedly improves learning efficiency, is particularly suited to resource-constrained environments, and points to a new direction for low-budget medical AI. Abstract: Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, by multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO (OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2--3 orders of magnitude less training data using a small base VLM on modest hardware.
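
As a rough illustration of the sentence-level reward idea behind FactS, the sketch below scores each generated sentence by the share of its atomic facts that are supported by the ground-truth labels. This is a toy under stated assumptions: `sentence_rewards` is a hypothetical helper, the facts are assumed to be already extracted as strings, and exact string membership stands in for the entailment check, which in the paper is presumably model-based.

```python
def sentence_rewards(sentence_facts, label_facts):
    """Score each sentence by the fraction of its facts found in the labels.

    sentence_facts: one list of fact strings per generated sentence.
    label_facts: set of fact strings derived from ground-truth labels.
    Exact membership is a stand-in for a real entailment check.
    """
    rewards = []
    for facts in sentence_facts:
        if not facts:
            rewards.append(0.0)  # a sentence with no clinical fact earns nothing
            continue
        supported = sum(1 for fact in facts if fact in label_facts)
        rewards.append(supported / len(facts))
    return rewards


report = [
    ["cardiomegaly", "no pleural effusion"],  # both facts supported
    ["pneumothorax"],                         # unsupported fact
    [],                                       # no extractable fact
]
labels = {"cardiomegaly", "no pleural effusion"}
print(sentence_rewards(report, labels))
```

Scoring per sentence rather than per report is what makes the signal dense: a report that is mostly right still receives credit for its correct sentences.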

[107] Training-Free Multi-Style Fusion Through Reference-Based Adaptive Modulation

Xu Liu,Yibo Lu,Xinxian Wang,Xinyu Wu

Main category: cs.CV

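TL;DR: Proposes AMSF, a training-free, reference-based multi-style fusion framework that achieves controllable multi-style fusion in diffusion models through a semantic token decomposition module and a similarity-aware re-weighting module.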

Details Motivation: Existing methods accept only a single style image and lack a mechanism for balancing several stylistic influences, which rules out hybrid aesthetics and scaling to more styles. Method: A semantic token decomposition module encodes multiple style images and text prompts and adaptively injects them into every cross-attention layer of a frozen diffusion model. A similarity-aware re-weighting module dynamically adjusts the attention allocated to each style component at every denoising step. Result: Qualitative and quantitative evaluations show that AMSF produces better multi-style fusion than state-of-the-art methods and scales seamlessly to two or more styles. Conclusion: AMSF offers a practical solution for expressive multi-style generation in diffusion models, with good scalability and user controllability. Abstract: We propose Adaptive Multi-Style Fusion (AMSF), a reference-based training-free framework that enables controllable fusion of multiple reference styles in diffusion models. Most of the existing reference-based methods are limited by (a) acceptance of only one style image, thus prohibiting hybrid aesthetics and scalability to more styles, and (b) lack of a principled mechanism to balance several stylistic influences. AMSF mitigates these challenges by encoding all style images and textual hints with a semantic token decomposition module that is adaptively injected into every cross-attention layer of a frozen diffusion model. A similarity-aware re-weighting module then recalibrates, at each denoising step, the attention allocated to every style component, yielding balanced and user-controllable blends without any fine-tuning or external adapters. Both qualitative and quantitative evaluations show that AMSF produces multi-style fusion results that consistently outperform the state-of-the-art approaches, while its fusion design scales seamlessly to two or more styles. These capabilities position AMSF as a practical step toward expressive multi-style generation in diffusion models.
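
One way to picture similarity-aware re-weighting is a softmax over the similarity between the current feature and each style embedding, optionally scaled by user preferences. The abstract does not give AMSF's actual formula, so this is only a sketch under assumptions: `style_weights`, `user_scale`, and `temperature` are hypothetical names, and cosine similarity with a softmax is a common choice rather than the authors' confirmed design.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def style_weights(query, styles, user_scale=None, temperature=1.0):
    """Return one normalized weight per style, recomputed at each step.

    user_scale holds positive per-style preferences (hypothetical control knob).
    """
    scale = user_scale if user_scale is not None else [1.0] * len(styles)
    logits = [cosine(query, s) / temperature for s in styles]
    top = max(logits)  # subtract the max for numerical stability
    exps = [k * math.exp(l - top) for k, l in zip(scale, logits)]
    total = sum(exps)
    return [e / total for e in exps]


# The style closer to the current feature receives the larger share.
print(style_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

Calling this once per denoising step with the current feature would let the blend shift over time, while `user_scale` biases the result toward a preferred style.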

[108] MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving

Yuzhi Wu,Li Xiao,Jun Liu,Guangfeng Jiang,XiangGen Xia

Main category: cs.CV

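TL;DR: Proposes MLF-4DRCNet, a multi-level fusion framework that combines 4D millimeter-wave radar and camera data for 3D object detection. It fuses features at the point, scene, and proposal levels and significantly improves performance.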

Details Motivation: Existing 4D radar-camera fusion methods reuse fusion paradigms designed for LiDAR, overlook the sparse and incomplete nature of radar point clouds, and fuse only at a coarse level, which limits detection performance. Method: MLF-4DRCNet has three modules: the Enhanced Radar Point Encoder (ERPE), which aligns and densifies radar point clouds; Hierarchical Scene Fusion Pooling (HSFP), which dynamically fuses multi-scale features; and Proposal-Level Fusion Enhancement (PLFE), which refines region proposals. It uses a Triple-Attention Voxel Feature Encoder and deformable attention for multi-modal feature fusion. Result: The model reaches SOTA performance on the View-of-Delft and TJ4DRadSet datasets, and on VoD in particular its performance is comparable to LiDAR-based models. Conclusion: Through multi-level fusion, MLF-4DRCNet overcomes the sparsity and noise of 4D radar point clouds and achieves high-performance 3D object detection with practical potential. Abstract: The emerging 4D millimeter-wave radar, measuring the range, azimuth, elevation, and Doppler velocity of objects, is recognized for its cost-effectiveness and robustness in autonomous driving. Nevertheless, its point clouds exhibit significant sparsity and noise, restricting its standalone application in 3D object detection. Recent 4D radar-camera fusion methods have provided effective perception. Most existing approaches, however, adopt explicit Bird's-Eye-View fusion paradigms originally designed for LiDAR-camera fusion, neglecting radar's inherent drawbacks. Specifically, they overlook the sparse and incomplete geometry of radar point clouds and restrict fusion to coarse scene-level integration. To address these problems, we propose MLF-4DRCNet, a novel two-stage framework for 3D object detection via multi-level fusion of 4D radar and camera images. Our model incorporates the point-, scene-, and proposal-level multi-modal information, enabling comprehensive feature representation. It comprises three crucial components: the Enhanced Radar Point Encoder (ERPE) module, the Hierarchical Scene Fusion Pooling (HSFP) module, and the Proposal-Level Fusion Enhancement (PLFE) module. Operating at the point-level, ERPE densifies radar point clouds with 2D image instances and encodes them into voxels via the proposed Triple-Attention Voxel Feature Encoder. HSFP dynamically integrates multi-scale voxel features with 2D image features using deformable attention to capture scene context and applies pooling to the fused features. PLFE refines region proposals by fusing image features, and further integrates with the pooled features from HSFP. Experimental results on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate that MLF-4DRCNet achieves the state-of-the-art performance. Notably, it attains performance comparable to LiDAR-based models on the VoD dataset.

[109] Prompt-Guided Dual Latent Steering for Inversion Problems

Yichen Wu,Xu Liu,Chenxuan Zhao,Xinyu Wu

Main category: cs.CV

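TL;DR: Proposes PDLS, a training-free dual-latent steering framework for inverting corrupted images more accurately in diffusion models. Coordinated control of a structural path and a semantic path avoids semantic drift and preserves detail.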

Details Motivation: Existing image inversion methods struggle to balance structural fidelity with semantic accuracy, often causing semantic drift or blurred details, so a method is needed that preserves both the structure and the semantics of the source image. Method: Built on Rectified Flow models, the inversion process is decomposed into a structural path and a prompt-guided semantic path. The dual guidance is formulated as an optimal control problem, and a closed-form solution is derived via a Linear Quadratic Regulator (LQR) to steer the generative trajectory dynamically at each step. Result: Across restoration tasks on FFHQ-1K and ImageNet-1K (such as deblurring, super-resolution, and freeform inpainting), PDLS outperforms single-latent baselines in both image fidelity and semantic consistency. Conclusion: With dual-path dynamic control and no per-image optimization, PDLS achieves high-quality reconstruction without semantic drift and markedly improves diffusion models on complex inversion tasks. Abstract: Inverting corrupted images into the latent space of diffusion models is challenging. Current methods, which encode an image into a single latent vector, struggle to balance structural fidelity with semantic accuracy, leading to reconstructions with semantic drift, such as blurred details or incorrect attributes. To overcome this, we introduce Prompt-Guided Dual Latent Steering (PDLS), a novel, training-free framework built upon Rectified Flow models for their stable inversion paths. PDLS decomposes the inversion process into two complementary streams: a structural path to preserve source integrity and a semantic path guided by a prompt. We formulate this dual guidance as an optimal control problem and derive a closed-form solution via a Linear Quadratic Regulator (LQR). This controller dynamically steers the generative trajectory at each step, preventing semantic drift while ensuring the preservation of fine detail without costly, per-image optimization. Extensive experiments on FFHQ-1K and ImageNet-1K under various inversion tasks, including Gaussian deblurring, motion deblurring, super-resolution and freeform inpainting, demonstrate that PDLS produces reconstructions that are both more faithful to the original image and better aligned with the semantic information than single-latent baselines.

[110] Learning neuroimaging models from health system-scale data

Yiwei Lyu,Samir Harake,Asadur Chowdury,Soumyanil Banerjee,Rachel Gologorsky,Shixuan Liu,Anna-Katharina Meissner,Akshay Rao,Chenhui Zhao,Akhil Kondepudi,Cheng Jiang,Xinhai Hou,Rushikesh S. Joshi,Volker Neuschmelting,Ashok Srinivasan,Dawn Kleindorfer,Brian Athey,Vikas Gulani,Aditya Pandey,Honglak Lee,Todd Hollon

Main category: cs.CV

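TL;DR: Prima is the first vision language model for neuroimaging. Trained on more than 220,000 MRI studies, it achieves high diagnostic accuracy in real clinical settings, supports worklist triage and clinical recommendations, and demonstrates algorithmic fairness.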

Details Motivation: Rising global demand for MRI strains health systems, delays diagnoses, and contributes to physician burnout, with patients in low-resource settings affected disproportionately. Method: Using data from a large academic health system, the authors develop Prima, a vision language model (VLM) with a hierarchical vision architecture. It is trained on more than 220,000 MRI studies and evaluated on 52 neurological diagnoses in a one-year, system-wide study covering 30,000 MRI studies. Result: Across the 52 neuroimaging diagnoses, Prima reaches a mean area under the ROC curve of 92.0, outperforming state-of-the-art general and medical AI models. It provides explainable differential diagnoses, worklist prioritization, and referral recommendations, and shows algorithmic fairness across patient groups. Conclusion: As a VLM trained on health system-scale data, Prima shows strong potential to improve the efficiency, fairness, and clinical utility of neuroimaging diagnosis and to advance AI-driven healthcare. Abstract: Neuroimaging is a ubiquitous tool for evaluating patients with neurological diseases. The global demand for magnetic resonance imaging (MRI) studies has risen steadily, placing significant strain on health systems, prolonging turnaround times, and intensifying physician burnout \cite{Chen2017-bt, Rula2024-qp-1}. These challenges disproportionately impact patients in low-resource and rural settings. Here, we utilized a large academic health system as a data engine to develop Prima, the first vision language model (VLM) serving as an AI foundation for neuroimaging that supports real-world, clinical MRI studies as input. Trained on over 220,000 MRI studies, Prima uses a hierarchical vision architecture that provides general and transferable MRI features. Prima was tested in a 1-year health system-wide study that included 30K MRI studies. Across 52 radiologic diagnoses from the major neurologic disorders, including neoplastic, inflammatory, infectious, and developmental lesions, Prima achieved a mean diagnostic area under the ROC curve of 92.0, outperforming other state-of-the-art general and medical AI models. Prima offers explainable differential diagnoses, worklist priority for radiologists, and clinical referral recommendations across diverse patient demographics and MRI systems. Prima demonstrates algorithmic fairness across sensitive groups and can help mitigate health system biases, such as prolonged turnaround times for low-resource populations. These findings highlight the transformative potential of health system-scale VLMs and Prima's role in advancing AI-driven healthcare.

[111] Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation

Yuanhuiyi Lyu,Chi Kit Wong,Chenfei Liao,Lutao Jiang,Xu Zheng,Zexin Lu,Linfeng Zhang,Xuming Hu

Main category: cs.CV

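TL;DR: Proposes Understanding-in-Generation (UiG), a new reasoning framework for unified models that improves text-to-image generation by drawing on the model's understanding capabilities during generation, with notable gains over existing methods on long text prompts.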

Details Motivation: Existing Chain-of-Thought (CoT) methods separate understanding from generation, which limits how well they can guide a unified model's reasoning and leaves its generative weaknesses insufficiently compensated. Method: The Understanding-in-Generation (UiG) framework uses "Image Editing" as a bridge to introduce the model's understanding into the generation process step by step. It first verifies the generated image and converts the understanding result into editing instructions, then refines the image progressively. Result: On the long-prompt setting of the TIIF benchmark, UiG achieves a 3.92% gain over existing text-to-image reasoning methods. Conclusion: UiG bridges the gap between understanding and generation. By continually injecting the model's understanding during reasoning, it significantly improves unified models on complex text-to-image tasks. Abstract: Recent works have made notable advancements in enhancing unified models for text-to-image generation through the Chain-of-Thought (CoT). However, these reasoning methods separate the processes of understanding and generation, which limits their ability to guide the reasoning of unified models in addressing the deficiencies of their generative capabilities. To this end, we propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG), which harnesses the robust understanding capabilities of unified models to reinforce their performance in image generation. The core insight of our UiG is to guide generation with the model's strong understanding capabilities during the reasoning process, thereby mitigating the limitations of generative abilities. To achieve this, we introduce "Image Editing" as a bridge to infuse understanding into the generation process. Initially, we verify the generated image and incorporate the understanding of unified models into the editing instructions. Subsequently, we enhance the generated image step by step, gradually infusing the understanding into the generation process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods, e.g., a 3.92% gain on the long prompt setting of the TIIF benchmark. The project code: https://github.com/QC-LY/UiG

[112] Zero-shot Monocular Metric Depth for Endoscopic Images

Nicolas Toussaint,Emanuele Colleoni,Ricardo Sanchez-Matilla,Joshua Sutcliffe,Vanessa Thompson,Muhammad Asad,Imanol Luengo,Danail Stoyanov

Main category: cs.CV

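TL;DR: This paper presents a depth estimation benchmark for endoscopic images and releases EndoSynth, a new synthetic dataset. Fine-tuning foundation models on the synthetic data significantly improves performance on unseen real data.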

Details Motivation: The lack of robust benchmarks and high-quality datasets for endoscopic images limits how well existing depth estimation models generalize to clinical scenarios. Method: The authors build a comprehensive benchmark that evaluates state-of-the-art relative and metric depth estimation models on real, unseen endoscopic images, and release EndoSynth, a synthetic dataset with ground-truth depth and segmentation masks, for fine-tuning transformer-based foundation models. Result: Models fine-tuned on EndoSynth show significantly higher depth estimation accuracy on most unseen real endoscopic images, supporting the value of synthetic data for practical use. Conclusion: By providing a public benchmark, a synthetic dataset, and trained weights, this work advances depth estimation for endoscopic images and offers an important resource for future research. Abstract: Monocular relative and metric depth estimation has seen a tremendous boost in the last few years due to the sharp advancements in foundation models and in particular transformer-based networks. As we start to see applications to the domain of endoscopic images, there is still a lack of robust benchmarks and high-quality datasets in that area. This paper addresses these limitations by presenting a comprehensive benchmark of state-of-the-art (metric and relative) depth estimation models evaluated on real, unseen endoscopic images, providing critical insights into their generalisation and performance in clinical scenarios. Additionally, we introduce and publish a novel synthetic dataset (EndoSynth) of endoscopic surgical instruments paired with ground truth metric depth and segmentation masks, designed to bridge the gap between synthetic and real-world data. We demonstrate that fine-tuning depth foundation models using our synthetic dataset boosts accuracy on most unseen real data by a significant margin. By providing both a benchmark and a synthetic dataset, this work advances the field of depth estimation for endoscopic images and serves as an important resource for future research. Project page, EndoSynth dataset and trained weights are available at https://github.com/TouchSurgery/EndoSynth.

[113] LEAF-Mamba: Local Emphatic and Adaptive Fusion State Space Model for RGB-D Salient Object Detection

Lanhu Wu,Zilin Gao,Hao Fei,Mong-Li Lee,Wynne Hsu

Main category: cs.CV

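TL;DR: Proposes LEAF-Mamba, a local emphatic and adaptive fusion network based on state space models for RGB-D salient object detection, which outperforms existing methods in both performance and efficiency.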

Details Motivation: For RGB-D salient object detection, existing CNNs are limited by local receptive fields and Vision Transformers by high computational complexity, while applying state space models directly leads to weak local semantics and inadequate cross-modality fusion. Method: The authors design a local emphatic state space module (LE-SSM) to capture multi-scale local dependencies and an SSM-based adaptive fusion module (AFM) for effective cross-modality interaction and fusion. Result: The model outperforms 16 state-of-the-art methods on multiple datasets with both high efficiency and high accuracy, and generalizes well to the RGB-T SOD task. Conclusion: While keeping linear complexity, LEAF-Mamba improves local perception and cross-modality fusion for RGB-D salient object detection, with strong performance and broad application potential. Abstract: RGB-D salient object detection (SOD) aims to identify the most conspicuous objects in a scene with the incorporation of depth cues. Existing methods mainly rely on CNNs, limited by the local receptive fields, or Vision Transformers that suffer from the cost of quadratic complexity, posing a challenge in balancing performance and computational efficiency. Recently, state space models (SSMs) such as Mamba have shown great potential for modeling long-range dependency with linear complexity. However, directly applying SSMs to RGB-D SOD may lead to deficient local semantics as well as inadequate cross-modality fusion. To address these issues, we propose a Local Emphatic and Adaptive Fusion state space model (LEAF-Mamba) that contains two novel components: 1) a local emphatic state space module (LE-SSM) to capture multi-scale local dependencies for both modalities; and 2) an SSM-based adaptive fusion module (AFM) for complementary cross-modality interaction and reliable cross-modality integration. Extensive experiments demonstrate that the LEAF-Mamba consistently outperforms 16 state-of-the-art RGB-D SOD methods in both efficacy and efficiency. Moreover, our method can achieve excellent performance on the RGB-T SOD task, proving a powerful generalization ability.

[114] Lightweight Vision Transformer with Window and Spatial Attention for Food Image Classification

Xinle Gao,Linghui Ye,Zhiyong Xiao

Main category: cs.CV

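TL;DR: Proposes a lightweight food image classification algorithm that combines a Window Multi-Head Attention Mechanism (WMHAM) with a Spatial Attention Mechanism (SAM), achieving high accuracy with fewer parameters and less computation.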

Details Motivation: Vision Transformer models for food image classification have many parameters and high computational complexity, which limits their use in resource-constrained environments. Method: WMHAM captures local and global contextual features efficiently through window partitioning, and SAM adaptively emphasizes key spatial regions to strengthen feature representation. Result: The model reaches 95.24% accuracy on Food-101 and 94.33% on Vireo Food-172 while significantly reducing parameters and FLOPs. Conclusion: The proposed method strikes a good balance between computational efficiency and classification performance and suits food image classification in resource-constrained settings. Abstract: With the rapid development of society and continuous advances in science and technology, the food industry increasingly demands higher production quality and efficiency. Food image classification plays a vital role in enabling automated quality control on production lines, supporting food safety supervision, and promoting intelligent agricultural production. However, this task faces challenges due to the large number of parameters and high computational complexity of Vision Transformer models. To address these issues, we propose a lightweight food image classification algorithm that integrates a Window Multi-Head Attention Mechanism (WMHAM) and a Spatial Attention Mechanism (SAM). The WMHAM reduces computational cost by capturing local and global contextual features through efficient window partitioning, while the SAM adaptively emphasizes key spatial regions to improve discriminative feature representation. Experiments conducted on the Food-101 and Vireo Food-172 datasets demonstrate that our model achieves accuracies of 95.24% and 94.33%, respectively, while significantly reducing parameters and FLOPs compared with baseline methods. These results confirm that the proposed approach achieves an effective balance between computational efficiency and classification performance, making it well-suited for deployment in resource-constrained environments.
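
The cost saving from window partitioning can be made concrete with a small sketch. It is not the paper's WMHAM: `window_partition` and `attention_pairs` are illustrative helpers, and the example assumes non-overlapping square windows on a token grid whose sides are divisible by the window size. Global attention compares every token with every other token, whereas windowed attention compares tokens only within their own window.

```python
def window_partition(grid, window):
    """Split a 2D token grid into non-overlapping window x window groups."""
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid sides must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                grid[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ])
    return windows


def attention_pairs(height, width, window=None):
    """Count query-key pairs for global (window=None) or windowed attention."""
    tokens = height * width
    if window is None:
        return tokens * tokens
    per_window = window * window
    return (tokens // per_window) * per_window ** 2


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
print(window_partition(grid, 2)[0])                    # tokens in the top-left window
print(attention_pairs(4, 4), attention_pairs(4, 4, 2))  # global vs windowed pair counts
```

For a fixed window size the pair count grows linearly with the number of tokens instead of quadratically, which is the general reason window-based attention reduces FLOPs.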

[115] OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery

Siyi Chen,Kai Wang,Weicong Pang,Ruiming Yang,Ziru Chen,Renjun Gao,Alexis Kai Hon Lau,Dasa Gu,Chenchen Zhang,Cheng Li

Main category: cs.CV

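TL;DR: This paper proposes OSDA, a three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. It combines pixel-level accuracy with high-level semantic understanding to identify and semantically explain novel objects in satellite imagery automatically.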

Details Motivation: Open-set remote sensing scenes contain many unseen land-cover categories. Traditional methods depend on manual annotation and struggle to produce semantically interpretable segmentation, so an automated, annotation-free method is needed that delivers both fine spatial localization and open semantic categorization. Method: OSDA has three stages: (1) precise discovery and mask extraction with a promptable, fine-tuned SAM model; (2) semantic attribution and contextual description with a multimodal large language model (MLLM) fine-tuned in two phases; (3) evaluation using an LLM as judge together with manual scoring. Result: The framework performs open-set land-cover analysis on diverse satellite imagery without manual annotation, segments objects of novel categories accurately, and generates interpretable semantic descriptions. The evaluation indicates good generalization and robustness. Conclusion: OSDA provides a scalable, interpretable, label-free solution for dynamic land-cover monitoring, with broad prospects in automated cartographic updating and large-scale earth observation analysis. Abstract: Open-set land-cover analysis in remote sensing requires the ability to achieve fine-grained spatial localization and semantically open categorization. This involves not only detecting and segmenting novel objects without categorical supervision but also assigning them interpretable semantic labels through multimodal reasoning. In this study, we introduce OSDA, an integrated three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. The pipeline consists of: (1) precise discovery and mask extraction with a promptable fine-tuned segmentation model (SAM), (2) semantic attribution and contextual description via a two-phase fine-tuned multimodal large language model (MLLM), and (3) evaluation of the MLLM's outputs via LLM-as-judge and manual scoring. By combining pixel-level accuracy with high-level semantic understanding, OSDA addresses key challenges in open-world remote sensing interpretation. Designed to be architecture-agnostic and label-free, the framework supports robust evaluation across diverse satellite imagery without requiring manual annotation. Our work provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale earth observation analysis.

[116] Overview of PlantCLEF 2021: cross-domain plant identification

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

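TL;DR: The PlantCLEF 2021 challenge assesses how herbarium data can improve automated plant identification in data-poor regions. It focuses on about 1,000 species from the highly diverse Guiana Shield of South America and is framed as a cross-domain classification task that combines herbarium sheets, field photos, and species functional traits, with a test set made up solely of field photos.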

Details Motivation: Existing deep learning models depend on large volumes of field images, but such data are scarce in biodiversity-rich regions such as the tropics. Herbaria, by contrast, have accumulated large numbers of digitised tropical specimen records, and it is worth exploring how this historical data can improve automated plant identification in data-poor regions. Method: The challenge provides a cross-domain training set of several hundred thousand herbarium sheets and a few thousand field photos, adds five morphological and functional traits per species as auxiliary information, and defines a cross-domain classification task in which models learn the correspondence between herbarium and field images. The test set contains only field photos to assess generalization. Result: Several participating teams submitted systems combining multimodal information with cross-domain learning, and some methods clearly improved identification accuracy in data-scarce settings, supporting the feasibility of using herbarium data to aid field identification. Conclusion: Combining digitised herbarium resources with a small number of field images can improve automated plant identification in data-poor regions. Cross-domain learning and multimodal feature fusion are key, and future work should integrate herbarium and field observation data more closely. Abstract: Automated plant identification has improved considerably thanks to recent advances in deep learning and the availability of training data with more and more field photos. However, this profusion of data concerns only a few tens of thousands of species, mainly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have systematically collected, catalogued and stored plant specimens in herbaria, especially in tropical regions, and recent efforts by the biodiversity informatics community have made it possible to put millions of digitised records online. The LifeCLEF 2021 plant identification challenge (or "PlantCLEF 2021") was designed to assess the extent to which automated identification of flora in data-poor regions can be improved by using herbarium collections. It is based on a dataset of about 1,000 species mainly focused on the Guiana Shield of South America, a region known to have one of the highest plant diversities in the world. The challenge was evaluated as a cross-domain classification task where the training set consisted of several hundred thousand herbarium sheets and a few thousand photos to allow learning a correspondence between the two domains. In addition to the usual metadata (location, date, author, taxonomy), the training data also includes the values of 5 morphological and functional traits for each species. The test set consisted exclusively of photos taken in the field. This article presents the resources and evaluations of the assessment carried out, summarises the approaches and systems used by the participating research groups and provides an analysis of the main results.

[117] AGSwap: Overcoming Category Boundaries in Object Fusion via Adaptive Group Swapping

Zedong Zhang,Ying Tai,Jianjun Qian,Jian Yang,Jun Li

Main category: cs.CV

TL;DR: Proposes AGSwap, a new method, and COF, a large-scale cross-category object fusion dataset, to improve object fusion in text-to-image generation.

Details Motivation: Existing cross-category object fusion methods in text-to-image generation often produce semantically inconsistent, visually chaotic results, and the field lacks a comprehensive benchmark dataset. Method: AGSwap has two parts: Group-wise Embedding Swapping, which fuses semantic attributes, and Adaptive Group Updating, which optimizes the fusion dynamically under the guidance of a balance evaluation score. The authors also build the COF dataset on ImageNet-1K and WordNet, with 95 superclasses of 10 subclasses each, supporting more than 450,000 fusion pairs. Result: With both simple and complex prompts, AGSwap outperforms state-of-the-art compositional T2I methods (including GPT-Image-1), and experiments confirm its advantages in fusion consistency and visual quality. Conclusion: Together with the COF dataset, AGSwap offers an effective solution for cross-category object fusion and improves the semantic coherence and visual quality of fused objects in text-to-image generation. Abstract: Fusing cross-category objects to a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose Adaptive Group Swapping (AGSwap), a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce Cross-category Object Fusion (COF), a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1, using both simple and complex prompts.

[118] Overview of LifeCLEF Plant Identification task 2019: diving into data deficient tropical countries

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

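TL;DR: The PlantCLEF 2019 challenge evaluates automated plant identification systems in data-deficient regions, using a dataset of 10,000 species drawn mainly from the Guiana Shield and the Northern Amazon rainforest.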

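Details Motivation: Deep learning has brought major progress in automated plant identification, but training data cover only a few tens of thousands of species out of roughly 369,000 worldwide, and data remain scarce for many biodiversity-rich regions. It is therefore necessary to assess how identification systems perform when data are scarce. Method: The LifeCLEF 2019 plant identification challenge provides a large dataset covering 10,000 species, focused on the Guiana Shield and the Northern Amazon rainforest, and compares the performance of participating systems with that of leading tropical flora experts. Result: The challenge attracted several research teams and showed how different deep learning approaches perform on large-scale, high-diversity plant identification, with some systems approaching or reaching expert level. Conclusion: PlantCLEF 2019 provides an important platform for evaluating automated plant identification in data-deficient regions, encourages progress on related algorithms, and reveals the strengths and remaining difficulties of current systems when facing high species diversity and limited labeled data. Abstract: Automated identification of plants has improved considerably thanks to the recent progress in deep learning and the availability of training data. However, this profusion of data only concerns a few tens of thousands of species, while the planet has nearly 369K species. The LifeCLEF 2019 Plant Identification challenge (or "PlantCLEF 2019") was designed to evaluate automated identification on the flora of data-deficient regions. It is based on a dataset of 10K species mainly focused on the Guiana shield and the Northern Amazon rainforest, an area known to have one of the greatest diversity of plants and animals in the world. As in the previous edition, a comparison of the performance of the systems evaluated with the best tropical flora experts was carried out. This paper presents the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.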

[119] RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

Ke Li,Di Wang,Ting Wang,Fuyu Dong,Yiming Zhang,Luyao Zhang,Xiangyu Wang,Shaofeng Li,Quan Wang

Main category: cs.CV

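TL;DR: Proposes RSVG-ZeroOV, a training-free framework that uses frozen generic foundation models for zero-shot open-vocabulary visual grounding in remote sensing images.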

Details Motivation: Existing methods are restricted to closed-set vocabularies and depend on expensive datasets and time-consuming fine-tuning, which makes them hard to apply in open-world scenarios. Method: The framework has three stages: 1) Overview: a vision-language model provides the semantic correlations between text queries and visual regions; 2) Focus: the fine-grained priors of a diffusion model fill in structural and shape information; 3) Evolve: an attention evolution module suppresses irrelevant activations and produces purified segmentation masks. Result: Across multiple experiments the method clearly outperforms existing weakly-supervised and zero-shot methods, giving an efficient and scalable approach to zero-shot open-vocabulary RSVG. Conclusion: RSVG-ZeroOV needs no task-specific training and effectively taps the potential of frozen foundation models for zero-shot open-vocabulary remote sensing visual grounding. Abstract: Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that aims to explore the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions. (Although decoder-only VLMs use self-attention over all tokens, the paper refers to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.) (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by the VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.

[120] What Makes You Unique? Attribute Prompt Composition for Object Re-Identification

Yingquan Wang,Pingping Zhang,Chong Sun,Dong Wang,Huchuan Lu

Main category: cs.CV

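TL;DR: Proposes Attribute Prompt Composition (APC), a framework based on textual semantics that improves both discrimination and generalization in object re-identification (ReID). Combined with a fast-slow training strategy, it performs strongly on both single-domain and cross-domain ReID tasks.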

Details Motivation: Existing ReID models are limited to either single-domain or cross-domain scenarios; they tend to overfit or to suppress identity-discriminative features, and there is no unified framework that balances discrimination and generalization. Method: An Attribute Prompt Generator (APG), made up of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM), uses textual semantics to generate discriminative features. A Fast-Slow Training Strategy (FSTS) learns ReID-specific discriminative knowledge through a fast update stream, while a slow update stream retains the general representations of the pre-trained vision-language model. Result: Experiments on conventional and domain-generalized ReID datasets show that the method outperforms existing state-of-the-art methods in both discrimination and generalization. Conclusion: By combining textual semantics with a two-stream training strategy, APC balances discrimination and generalization in ReID and improves performance in both cross-domain and single-domain settings. Abstract: Object Re-IDentification (ReID) aims to recognize individuals across non-overlapping camera views. While recent advances have achieved remarkable progress, most existing models are constrained to either single-domain or cross-domain scenarios, limiting their real-world applicability. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies that may inadvertently suppress identity-specific discriminative cues. To address these limitations, we propose an Attribute Prompt Composition (APC) framework, which exploits textual semantics to jointly enhance discrimination and generalization. Specifically, we design an Attribute Prompt Generator (APG) consisting of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM). SAD is an over-complete attribute dictionary that provides rich semantic descriptions, while PCM adaptively composes relevant attributes from SAD to generate discriminative attribute-aware features. In addition, motivated by the strong generalization ability of Vision-Language Models (VLM), we propose a Fast-Slow Training Strategy (FSTS) to balance ReID-specific discrimination and generalizable representation learning. Specifically, FSTS adopts a Fast Update Stream (FUS) to rapidly acquire ReID-specific discriminative knowledge and a Slow Update Stream (SUS) to retain the generalizable knowledge inherited from the pre-trained VLM. Through a mutual interaction, the framework effectively focuses on ReID-relevant features while mitigating overfitting.
Extensive experiments on both conventional and Domain Generalized (DG) ReID datasets demonstrate that our framework surpasses state-of-the-art methods, exhibiting superior performances in terms of both discrimination and generalization. The source code is available at https://github.com/AWangYQ/APC.

[121] Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

Tong Zhang,Kuofeng Gao,Jiawang Bai,Leo Yu Zhang,Xin Yin,Zonghui Wang,Shouling Ji,Wenzhi Chen

Main category: cs.CV

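TL;DR: Proposes OTCCLIP, an optimal transport-based framework that reconstructs image-caption pairs through fine-grained visual-textual feature alignment, defending against data poisoning attacks and improving CLIP's performance on contaminated data.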

Details Motivation: Existing defenses match image-caption pairs using global representations only. Ignoring fine-grained features can introduce incorrect pairs and degrade CLIP pre-training. Method: Proposes an optimal transport-based distance measure, re-assigns captions according to it, and uses optimal transport-based objective functions to encourage fine-grained inter- and intra-modality alignment. Result: Experiments show that OTCCLIP substantially lowers attack success rates and outperforms previous methods on zero-shot and linear probing tasks. Conclusion: Through fine-grained feature alignment, OTCCLIP improves the robustness and performance of CLIP trained on poisoned data. Abstract: Recent studies have shown that Contrastive Language-Image Pre-training (CLIP) models are threatened by targeted data poisoning and backdoor attacks because their massive sets of training image-caption pairs are crawled from the Internet. Previous defense methods correct poisoned image-caption pairs by matching a new caption to each image. However, the matching process relies solely on the global representations of images and captions, overlooking fine-grained visual and textual features. It may introduce incorrect image-caption pairs and harm the CLIP pre-training. To address these limitations, we propose an Optimal Transport-based framework to reconstruct image-caption pairs, named OTCCLIP. We propose a new optimal transport-based distance measure between fine-grained visual and textual feature sets and re-assign new captions based on the proposed optimal transport distance. Additionally, to further reduce the negative impact of mismatched pairs, we encourage inter- and intra-modality fine-grained alignment by employing optimal transport-based objective functions. Our experiments demonstrate that OTCCLIP can successfully decrease the attack success rates of poisoning attacks. Also, compared to previous methods, OTCCLIP significantly improves the zero-shot and linear probing performance of CLIP trained on poisoned datasets.
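
To make the idea of an optimal transport distance between two fine-grained feature sets more concrete, here is a minimal sketch. It is not the OTCCLIP implementation: the cosine cost, uniform weights, entropic (Sinkhorn) solver, and the names `sinkhorn_distance` and `reassign_caption` are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn_distance(X, Y, eps=0.1, n_iters=200):
    """Entropic-regularized OT cost between feature sets X (n, d) and Y (m, d).

    Assumptions for this sketch: cosine cost and uniform mass on every feature.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - Xn @ Yn.T                      # cost matrix, values in [0, 2]
    a = np.full(X.shape[0], 1.0 / X.shape[0])
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])
    K = np.exp(-C / eps)                     # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                 # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # transport plan
    return float((P * C).sum()), P

def reassign_caption(image_feats, caption_feat_sets):
    """Return the index of the candidate caption with the smallest OT distance."""
    dists = [sinkhorn_distance(image_feats, c)[0] for c in caption_feat_sets]
    return int(np.argmin(dists))
```

In this sketch `image_feats` would hold patch-level features and each entry of `caption_feat_sets` token-level features; a caption whose tokens align well with the patches gets a low transport cost and is selected.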

[122] Knowledge Transfer from Interaction Learning

Yilin Gao,Kangyi Chen,Zhongxing Peng,Hengjie Lu,Shugong Xu

Main category: cs.CV

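TL;DR: Proposes Learning from Interactions (LFI), a cognitively inspired framework that explicitly models visual understanding as an interactive process. It addresses the representational discrepancy that arises when visual foundation models (VFMs) ignore interaction processes while transferring knowledge from vision-language models (VLMs).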

Details Motivation: Current VFMs mostly follow a result-oriented paradigm that overlooks the cross-modal interaction processes inside VLMs, which makes knowledge transfer inefficient and limits generalization. Method: Introduces Interaction Queries, which maintain persistent relational structures across layers, and derives interaction-based supervision signals from the cross-modal attention mechanisms of VLMs for more efficient knowledge transfer. Result: Gains on several benchmarks: 3.3 on TinyImageNet classification and 1.6 mAP / 2.4 AP on COCO detection/segmentation; zero-shot improvements of 2.4 and 9.3 on PACS and VLCS, respectively; faster convergence with small parameter overhead; and human evaluations reporting a 2.7-fold advantage over result-oriented methods in semantic consistency. Conclusion: By modeling the interaction process, LFI narrows the representational gap between VLMs and VFMs and improves knowledge transfer efficiency and cross-task, cross-domain generalization, combining cognitive plausibility with practical engineering value. Abstract: Current visual foundation models (VFMs) face a fundamental limitation in transferring knowledge from vision language models (VLMs): while VLMs excel at modeling cross-modal interactions through unified representation spaces, existing VFMs predominantly adopt result-oriented paradigms that neglect the underlying interaction processes. This representational discrepancy hinders effective knowledge transfer and limits generalization across diverse vision tasks. We propose Learning from Interactions (LFI), a cognitive-inspired framework that addresses this gap by explicitly modeling visual understanding as an interactive process. Our key insight is that capturing the dynamic interaction patterns encoded in pre-trained VLMs enables more faithful and efficient knowledge transfer to VFMs. The approach centers on two technical innovations: Interaction Queries, which maintain persistent relational structures across network layers, and interaction-based supervision, derived from the cross-modal attention mechanisms of VLMs. Comprehensive experiments demonstrate consistent improvements across multiple benchmarks, achieving absolute gains of 3.3 on TinyImageNet classification and 1.6 mAP / 2.4 AP on COCO detection/segmentation, with minimal parameter overhead and faster convergence. The framework particularly excels in cross-domain settings, delivering zero-shot improvements of 2.4 and 9.3 on PACS and VLCS, respectively. Human evaluations further confirm its cognitive alignment, outperforming result-oriented methods by 2.7 times in semantic consistency metrics.

[123] HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection

Ruichao Hou,Xingyuan Li,Tongwei Ren,Dongming Zhou,Gangshan Wu,Jinde Cao

Main category: cs.CV

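TL;DR: Proposes HyPSAM, a hybrid prompt-driven Segment Anything Model for RGB-thermal salient object detection (RGB-T SOD). A dynamic fusion network (DFNet) generates high-quality initial saliency maps that serve as visual prompts, and a plug-and-play refinement network (P2RNet) combines text, mask, and box prompts to guide SAM in refining the results. Experiments show state-of-the-art performance on multiple datasets, along with good generality and extensibility.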

Details Motivation: Because cross-modal feature fusion is insufficient and data is scarce, existing methods struggle with precise boundary localization and complete object identification, and cannot fully exploit the complementary information in the RGB and thermal modalities. Method: The HyPSAM framework has two core modules: 1) a dynamic fusion network (DFNet), which uses dynamic convolution and multi-branch decoding for adaptive cross-modal interaction and generates high-quality initial saliency maps; 2) a plug-and-play refinement network (P2RNet), which uses hybrid text, box, and mask prompts to guide SAM toward finer segmentation and more accurate saliency maps. Result: Experiments on three public datasets show state-of-the-art performance on RGB-T SOD, and the method generalizes well: it can be integrated into other RGB-T SOD methods and brings significant performance gains. Conclusion: HyPSAM effectively combines dynamic cross-modal fusion with a hybrid prompting mechanism, shows strong segmentation ability in the zero-shot setting, and demonstrates the potential of prompt engineering for RGB-T salient object detection. Abstract: RGB-thermal salient object detection (RGB-T SOD) aims to identify prominent objects by integrating complementary information from RGB and thermal modalities. However, learning precise boundaries and complete objects remains challenging due to intrinsically insufficient feature fusion and the extrinsic limitation of data scarcity. In this paper, we propose a novel hybrid prompt-driven segment anything model (HyPSAM), which leverages the zero-shot generalization capabilities of the segment anything model (SAM) for RGB-T SOD. Specifically, we first propose a dynamic fusion network (DFNet) that generates high-quality initial saliency maps as visual prompts. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction, overcoming the limitations of fixed-parameter kernels and enhancing multi-modal feature representation. Moreover, we propose a plug-and-play refinement network (P2RNet), which serves as a general optimization strategy to guide SAM in refining saliency maps by using hybrid prompts. The text prompt ensures reliable modality input, while the mask and box prompts enable precise salient object localization. Extensive experiments on three public datasets demonstrate that our method achieves state-of-the-art performance. Notably, HyPSAM has remarkable versatility, seamlessly integrating with different RGB-T SOD methods to achieve significant performance gains, thereby highlighting the potential of prompt engineering in this field.
The code and results of our method are available at: https://github.com/milotic233/HyPSAM.

[124] TriFusion-AE: Language-Guided Depth and LiDAR Fusion for Robust Point Cloud Processing

Susmit Neogi

Main category: cs.CV

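TL;DR: Proposes TriFusion-AE, a multimodal cross-attention autoencoder that fuses textual priors, monocular depth maps, and LiDAR point clouds, markedly improving robustness under strong adversarial attacks and heavy noise.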

Details Motivation: LiDAR point clouds are vulnerable to noise, occlusion, and adversarial perturbations, and existing autoencoders degrade in complex real-world scenes, so more robust reconstruction is needed. Method: TriFusion-AE fuses textual semantics, image depth information, and LiDAR spatial structure through a cross-modal cross-attention mechanism to learn a joint multimodal representation. Result: On the nuScenes-mini dataset, the model clearly outperforms a CNN baseline under heavy noise and adversarial attacks, while gains under mild perturbations are limited. Conclusion: TriFusion-AE makes point cloud reconstruction more robust through multimodal fusion, and the framework is model-agnostic, so it can be integrated into existing CNN-based autoencoders. Abstract: LiDAR-based perception is central to autonomous driving and robotics, yet raw point clouds remain highly vulnerable to noise, occlusion, and adversarial corruptions. Autoencoders offer a natural framework for denoising and reconstruction, but their performance degrades under challenging real-world conditions. In this work, we propose TriFusion-AE, a multimodal cross-attention autoencoder that integrates textual priors, monocular depth maps from multi-view images, and LiDAR point clouds to improve robustness. By aligning semantic cues from text, geometric (depth) features from images, and spatial structure from LiDAR, TriFusion-AE learns representations that are resilient to stochastic noise and adversarial perturbations. Interestingly, while showing limited gains under mild perturbations, our model achieves significantly more robust reconstruction under strong adversarial attacks and heavy noise, where CNN-based autoencoders collapse. We evaluate on the nuScenes-mini dataset to reflect realistic low-data deployment scenarios. Our multimodal fusion framework is designed to be model-agnostic, enabling seamless integration with any CNN-based point cloud autoencoder for joint representation learning.

[125] COLT: Enhancing Video Large Language Models with Continual Tool Usage

Yuyang Liu,Xinyuan Shi,Bang Yang,Peilin Zhou,Jiahua Dong,Long Chen,Ian Reid,Xiaondan Liang

Main category: cs.CV

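TL;DR: Proposes COLT, a method that gives open-source video large language models continual tool-use ability. A learnable tool codebook prevents catastrophic forgetting, and the model automatically acquires tool-use skills from a continuous stream of new tools.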

Details Motivation: Existing methods assume a fixed tool repository and struggle with real-world settings in which tool data keeps changing and streaming in, so a method is needed that can keep learning new tools while retaining old knowledge. Method: The COLT framework introduces a learnable tool codebook as a tool-specific memory system, dynamically selects relevant tools according to the similarity between the user instruction and the tool features in the codebook, and trains on a continual tool stream. Result: Extensive experiments on several video LLM benchmarks and the self-built VideoToolBench dataset show that COLT achieves state-of-the-art performance on tool-use tasks. Conclusion: COLT improves open-source video LLMs when they continually learn new tools, avoids forgetting existing ones, and has good potential for practical use. Abstract: The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering 'catastrophic forgetting' of previously learned tools. Specifically, COLT incorporates a learnable tool codebook as a tool-specific memory system. Relevant tools are then dynamically selected based on the similarity between the user instruction and the tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset, VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
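
The similarity-based tool selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: cosine similarity, a top-k rule, and the hypothetical helpers `select_tools` and `add_tool` are not taken from the paper, and the codebook here is a plain array rather than a learned one.

```python
import numpy as np

def select_tools(instruction_emb, codebook, top_k=2):
    """Rank codebook entries (one row per tool) by cosine similarity to the instruction."""
    q = instruction_emb / np.linalg.norm(instruction_emb)
    keys = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = keys @ q
    order = np.argsort(-sims)[:top_k]        # indices of the most similar tools
    return order.tolist(), sims[order].tolist()

def add_tool(codebook, new_key):
    """Append a new tool entry; existing rows are left untouched."""
    return np.vstack([codebook, new_key[None, :]])
```

Appending rows instead of overwriting them is one simple way to picture how a per-tool memory can grow with a tool stream while older entries stay intact.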

[126] FixingGS: Enhancing 3D Gaussian Splatting via Training-Free Score Distillation

Zhaorui Wang,Yi Gu,Deming Zhou,Renjing Xu

Main category: cs.CV

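TL;DR: Proposes FixingGS, a training-free method for enhancing 3D Gaussian Splatting reconstructions. It exploits diffusion model priors through a distillation strategy to remove artifacts and inpaint missing content consistently across views.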

Details Motivation: Sparse-view 3D Gaussian Splatting reconstructions show noticeable artifacts because visual information is insufficient, and existing generative-prior methods struggle to guarantee multi-view consistency. Method: FixingGS uses a distillation approach to extract more accurate, cross-view coherent diffusion priors, and adds an adaptive progressive enhancement scheme that refines the reconstruction in under-constrained regions. Result: Experiments show that FixingGS surpasses existing state-of-the-art methods in visual quality and reconstruction performance. Conclusion: FixingGS improves the quality of sparse-view 3DGS reconstruction and delivers high-quality, multi-view consistent artifact removal and inpainting. Abstract: Recently, 3D Gaussian Splatting (3DGS) has demonstrated remarkable success in 3D reconstruction and novel view synthesis. However, reconstructing 3D scenes from sparse viewpoints remains highly challenging due to insufficient visual information, which results in noticeable artifacts persisting across the 3D representation. To address this limitation, recent methods have resorted to generative priors to remove artifacts and complete missing content in under-constrained areas. Despite their effectiveness, these approaches struggle to ensure multi-view consistency, resulting in blurred structures and implausible details. In this work, we propose FixingGS, a training-free method that fully exploits the capabilities of an existing diffusion model for sparse-view 3DGS reconstruction enhancement. At the core of FixingGS is our distillation approach, which delivers more accurate and cross-view coherent diffusion priors, thereby enabling effective artifact removal and inpainting. In addition, we propose an adaptive progressive enhancement scheme that further refines reconstructions in under-constrained regions. Extensive experiments demonstrate that FixingGS surpasses existing state-of-the-art methods with superior visual quality and reconstruction performance. Our code will be released publicly.

[127] Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

Junhao Su,Yuanliang Wan,Junwei Yang,Hengyu Shi,Tianyang Han,Junfeng Luo,Yurui Qiu

Main category: cs.CV

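TL;DR: Proposes structured reflection, a method that explicitly models error diagnosis and repair to make tool-augmented large language models more reliable and better at recovering from errors in multi-turn interactions.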

Details Motivation: Existing tool-augmented LLMs rely on supervised imitation or coarse-grained reinforcement learning, and their reflection mechanisms are fragile: they cannot effectively diagnose and repair errors, which leads to repeated failures in multi-turn interactions. Method: A structured reflection framework models the path from error to repair as a trainable action, and combines DAPO and GSPO objectives with a tool-use-oriented reward scheme to optimize the stepwise strategy of reflect, call, then final output. Result: Experiments on BFCL v3 and the newly proposed Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, with fewer redundant calls. Conclusion: Explicitly modeling and directly optimizing the reflection process improves the reliability of tool interaction and gives agents a reproducible path for learning from failure. Abstract: Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to 'think more' instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.

[128] Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models

Xijun Wang,Junyun Huang,Rayyan Abdalla,Chengyuan Zhang,Ruiqi Xian,Dinesh Manocha

Main category: cs.CV

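TL;DR: Proposes Bi-VLM, an ultra-low-bit quantization method for vision-language models that separates weights non-uniformly based on Gaussian quantiles. It clearly outperforms existing techniques across several benchmarks and models, and finds that 90%-99% of image tokens in the quantized models are redundant and can be pruned for further efficiency.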

Details Motivation: Vision-language models (VLMs) have high computational cost and memory requirements, which limits their use in hardware-constrained environments and calls for efficient low-bit quantization. Method: Bi-VLM divides model weights non-uniformly, based on Gaussian quantiles, into an outlier (salient) subset and several inlier (unsalient) subsets. A saliency-aware hybrid quantization algorithm then imposes different constraints on the scaling factors and binary matrices according to each subset's saliency metric and the compression objective. Result: For the language model part of the VLM, Bi-VLM improves on the previous best method by 3%-47% on visual question answering across four benchmarks and three models; for the overall VLM the improvement is 4%-45%. The authors also find that 90%-99% of image tokens in the quantized models are redundant, which supports further pruning. Conclusion: Through non-uniform weight partitioning and saliency-aware quantization, Bi-VLM substantially improves compression and inference efficiency for VLMs at extremely low bitwidths (2 bits or fewer), making it suitable for resource-constrained settings. Abstract: We address the critical gap between the computational demands of vision-language models and the possible ultra-low-bit weight precision (bitwidth $\leq 2$ bits) we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on Gaussian quantiles. Our formulation groups the model weights into outlier (salient) and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four different benchmarks and three different models. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe that 90%-99% of the image tokens in the quantized models are redundant. This helps us to further prune the visual tokens to improve efficiency.
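
As a rough illustration of splitting weights by Gaussian quantiles before binarizing them, the sketch below fits a Gaussian to the weights, bins each weight by how far it lies from the mean, and binarizes every band with its own scale. The band boundaries (central 50% and 90% of the mass), the mean-absolute-deviation scale, and the function names are assumptions for this example, not the Bi-VLM algorithm itself.

```python
import numpy as np
from statistics import NormalDist

def gaussian_band_edges(sigma, central_masses):
    """|w - mu| thresholds inside which a Gaussian holds each given central mass."""
    nd = NormalDist(0.0, sigma)
    return [nd.inv_cdf(0.5 + m / 2.0) for m in central_masses]

def binarize_by_bands(w, central_masses=(0.5, 0.9)):
    """Partition weights into inlier bands plus an outlier band, then binarize per band."""
    mu, sigma = float(w.mean()), float(w.std())
    edges = gaussian_band_edges(sigma, central_masses)
    dev = np.abs(w - mu)
    band = np.digitize(dev, edges)           # 0 = innermost band, last index = outliers
    w_hat = np.empty_like(w)
    for b in range(len(edges) + 1):
        mask = band == b
        if mask.any():
            alpha = dev[mask].mean()         # per-band scale (assumed choice)
            w_hat[mask] = mu + alpha * np.sign(w[mask] - mu)
    return w_hat, band
```

Because every band gets its own scale, the reconstruction error in this toy setup is lower than with a single global scale, which is the intuition behind treating salient outliers separately from inliers.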

[129] VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Hao Wang,Eiki Murata,Lingfang Zhang,Ayako Sato,So Fukuda,Ziqi Yin,Wentao Hu,Keisuke Nakao,Yusuke Nakamura,Sebastian Zwirner,Yi-Chia Chen,Hiroyuki Otomo,Hiroki Ouchi,Daisuke Kawahara

Main category: cs.CV

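TL;DR: Presents VIR-Bench, a new benchmark of 200 travel videos for evaluating how well multimodal large language models understand long-distance geospatial-temporal trajectories. Experiments show that existing models perform poorly, while a travel-planning agent built on insights from the benchmark gives markedly better recommendations.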

Details Motivation: Current video understanding benchmarks focus on indoor scenes or short-range outdoor activities and do not evaluate understanding of long-distance geospatial-temporal trajectories, which holds back multimodal LLMs on real-world tasks such as navigation and embodied AI. Method: VIR-Bench uses itinerary reconstruction as its core task to evaluate the geospatial-temporal intelligence of MLLMs on long-duration, long-distance videos, and a travel-planning agent is developed on top of it as a case study. Result: State-of-the-art MLLMs perform poorly on VIR-Bench, which confirms the difficulty of the task, and the travel-planning agent built on the benchmark's insights shows clearly improved recommendations in practice. Conclusion: VIR-Bench exposes the weaknesses of current MLLMs in long-distance geospatial-temporal understanding and shows that its evaluation protocol translates into performance gains in real applications, pushing future models toward more complex real-world tasks. Abstract: Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

[130] DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision

Azad Singh,Deepak Mishra

Main category: cs.CV

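TL;DR: Proposes DiSSECT, a self-supervised learning framework for medical image representation learning that introduces multi-scale vector quantization to improve generalization and transferability when labels are scarce.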

Details Motivation: Existing self-supervised methods rely on complex architectures or specific priors and are prone to shortcut learning, especially in modalities such as chest X-rays, which limits their scalability and generalization. Method: DiSSECT adds multi-scale vector quantization to the self-supervised pipeline to create a discrete representational bottleneck, forcing the model to learn repeatable, structure-aware features and suppress irrelevant variation. Result: Validated on several public medical imaging datasets, DiSSECT performs strongly on classification and segmentation, is particularly label-efficient in low-label regimes, and needs little or no fine-tuning. Conclusion: By discretizing representations, DiSSECT improves the robustness, transferability, and practicality of self-supervised learning for medical images and outperforms existing state-of-the-art methods. Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for medical image representation learning, particularly in settings with limited labeled data. However, existing SSL methods often rely on complex architectures, anatomy-specific priors, or heavily tuned augmentations, which limit their scalability and generalizability. More critically, these models are prone to shortcut learning, especially in modalities like chest X-rays, where anatomical similarity is high and pathology is subtle. In this work, we introduce DiSSECT (Discrete Self-Supervision for Efficient Clinical Transferable Representations), a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck. This constrains the model to learn repeatable, structure-aware features while suppressing view-specific or low-utility patterns, improving representation transfer across tasks and domains. DiSSECT achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning, and shows particularly high label efficiency in low-label regimes. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability compared to existing state-of-the-art approaches.
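
A discrete bottleneck of the kind mentioned above is typically built from a nearest-codebook lookup. The sketch below shows that lookup and one possible way to apply it at several spatial scales; the average pooling, the pooling factors, and the names `vector_quantize` and `multiscale_quantize` are assumptions for illustration, and the training losses of a real VQ layer are omitted.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Replace each feature vector in z (n, d) with its nearest codebook entry (k, d)."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def multiscale_quantize(fmap, codebooks, factors=(1, 2)):
    """Quantize a feature map (H, W, D) at several scales via average pooling."""
    H, W, D = fmap.shape
    codes = {}
    for s, cb in zip(factors, codebooks):
        cropped = fmap[:H - H % s, :W - W % s]
        pooled = cropped.reshape(H // s, s, W // s, s, D).mean(axis=(1, 3))
        _, idx = vector_quantize(pooled.reshape(-1, D), cb)
        codes[s] = idx.reshape(H // s, W // s)   # grid of discrete code indices
    return codes
```

Each scale yields a grid of integer codes, so downstream layers only see a limited vocabulary of patterns, which is what discourages reliance on view-specific detail.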

[131] ColorBlindnessEval: Can Vision-Language Models Pass Color Blindness Tests?

Zijian Ling,Han Zhang,Yazhuo Zhou,Jiahao Cui

Main category: cs.CV

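TL;DR: Proposes ColorBlindnessEval, a benchmark for evaluating the robustness of vision-language models in visually adversarial scenarios modeled on color blindness tests. Existing models show clear limitations and hallucinations when recognizing numbers embedded in complex patterns.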

Details Motivation: It is still unclear how vision-language models behave in complex or adversarial visual settings, particularly on tasks that depend on color perception such as color blindness tests, so a dedicated robustness benchmark is needed. Method: A dataset of 500 Ishihara-like images covering the numbers 0 to 99 in various color combinations is built; 9 VLMs are evaluated with Yes/No and open-ended prompts and compared with human performance. Result: Current VLMs have limited ability to recognize numbers in adversarial color settings, hallucinate frequently, and perform well below humans. Conclusion: VLMs need to become more robust and reliable in complex visual environments, and ColorBlindnessEval provides a useful tool for benchmarking and improving their accuracy in critical applications. Abstract: This paper presents ColorBlindnessEval, a novel benchmark designed to evaluate the robustness of Vision-Language Models (VLMs) in visually adversarial scenarios inspired by the Ishihara color blindness test. Our dataset comprises 500 Ishihara-like images featuring numbers from 0 to 99 with varying color combinations, challenging VLMs to accurately recognize numerical information embedded in complex visual patterns. We assess 9 VLMs using Yes/No and open-ended prompts and compare their performance with human participants. Our experiments reveal limitations in the models' ability to interpret numbers in adversarial contexts, highlighting prevalent hallucination issues. These findings underscore the need to improve the robustness of VLMs in complex visual environments. ColorBlindnessEval serves as a valuable tool for benchmarking and improving the reliability of VLMs in real-world applications where accuracy is critical.

[132] Real-time Deer Detection and Warning in Connected Vehicles via Thermal Sensing and Deep Learning

Hemanth Puppala,Wayne Sarasua,Srinivas Biyaguda,Farhad Farzinpour,Mashrur Chowdhury

Main category: cs.CV

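TL;DR: Presents a real-time detection and driver warning system that combines thermal imaging, deep learning, and cellular vehicle-to-everything (CV2X) communication to reduce the roughly 2.1 million deer-vehicle collisions that occur in the United States each year. Trained on a custom thermal deer dataset, the system detects deer with high accuracy and performs well in field tests, maintaining 88%-92% detection accuracy in adverse weather, far better than visible-light cameras. End-to-end latency is under 100 milliseconds, so drivers are warned in time, and alerts are broadcast to nearby vehicles over CV2X, which makes the system practical to deploy.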

Details Motivation: Deer-vehicle collisions in the United States cause many casualties, large economic losses, and declining deer populations. Existing visible-light cameras detect poorly at night and in bad weather, so an efficient, robust real-time detection and warning technology is needed to improve road safety. Method: A custom dataset of more than 12,000 thermal deer images is used to train a deep learning model; thermal cameras provide all-weather detection; when detection confidence reaches a threshold, sensor data sharing messages are broadcast over CV2X to surrounding vehicles and roadside units for cooperative warning. Result: Experimental evaluation gives 98.84% mean average precision (mAP), 95.44% precision, and 95.96% recall. Field tests show detection accuracy of 88%-92% across weather conditions, compared with less than 60% for visible-light cameras, and end-to-end latency stays below 100 milliseconds, enabling early warning. Conclusion: The study demonstrates that combining thermal imaging, deep learning, and CV2X communication is a feasible and effective way to reduce deer-vehicle collisions, and offers a practical technical path for wildlife collision prevention in intelligent transportation systems. Abstract: Deer-vehicle collisions represent a critical safety challenge in the United States, causing nearly 2.1 million incidents annually and resulting in approximately 440 fatalities, 59,000 injuries, and 10 billion USD in economic damages. These collisions also contribute significantly to declining deer populations. This paper presents a real-time detection and driver warning system that integrates thermal imaging, deep learning, and vehicle-to-everything communication to help mitigate deer-vehicle collisions. Our system was trained and validated on a custom dataset of over 12,000 thermal deer images collected in Mars Hill, North Carolina. Experimental evaluation demonstrates exceptional performance with 98.84 percent mean average precision, 95.44 percent precision, and 95.96 percent recall. The system was field tested during a follow-up visit to Mars Hill and readily sensed deer, providing the driver with advance warning. Field testing validates robust operation across diverse weather conditions, with thermal imaging maintaining between 88 and 92 percent detection accuracy in challenging scenarios where conventional visible-light cameras achieve less than 60 percent effectiveness. When a high probability threshold is reached, sensor data sharing messages are broadcast to surrounding vehicles and roadside units via cellular vehicle-to-everything (CV2X) communication devices. Overall, our system achieves end-to-end latency consistently under 100 milliseconds from detection to driver alert.
This research establishes a viable technological pathway for reducing deer-vehicle collisions through thermal imaging and connected vehicles.

[133] Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

Guoxin Wang,Jun Zhao,Xinyi Liu,Yanbo Liu,Xuyang Cao,Chao Li,Zhuoyun Liu,Qintian Sun,Fangru Zhou,Haoqiang Xing,Zhenhong Yang

Main category: cs.CV

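TL;DR: Citrus-V is a multimodal medical foundation model that combines image analysis with textual reasoning. It supports lesion localization, structured report generation, and physician-like diagnostic reasoning, and with a novel multimodal training approach it outperforms existing open-source models on multiple benchmarks.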

Details Motivation: Existing medical imaging models generalize poorly and depend on multiple specialized networks, while clinical applications require precise visual grounding, multimodal integration, and chain-of-thought reasoning. Method: Citrus-V integrates detection, segmentation, and multimodal chain-of-thought reasoning, is trained with a novel multimodal training approach, and is released together with an open-source data suite covering multiple tasks. Result: On multiple benchmarks, Citrus-V surpasses existing open-source medical models and expert-level imaging systems, providing a unified pipeline from visual grounding to clinical reasoning. Conclusion: Citrus-V offers end-to-end diagnostic support, from pixel-level lesion localization to automated reporting and reliable second opinions, and has good potential for clinical use. Abstract: Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

[134] Towards Application Aligned Synthetic Surgical Image Synthesis

Danush Kumar Venkatesh,Stefanie Speidel

Main category: cs.CV

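TL;DR: Proposes Surgical Application-Aligned Diffusion (SAADi), a framework that lightly fine-tunes diffusion models so that the images they generate better match downstream task needs, easing the scarcity of surgical data.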

Details Motivation: Annotated surgical data is scarce, which limits deep learning in computer-assisted interventions. Existing diffusion models also suffer from memorization, producing inconsistent or non-diverse samples that can hurt downstream performance. Method: Pairs of synthetic images preferred and non-preferred by the downstream model are constructed, and the diffusion model is lightly fine-tuned to align generation explicitly with the downstream objective; an iterative refinement step further improves sample quality. Result: On three surgical datasets, classification improves by 7%-9% and segmentation by 2%-10%, with considerable gains for rare classes; iterative refinement adds a further 4%-10% without sample degradation. Conclusion: Through task-aware alignment, SAADi overcomes limitations of conventional diffusion models and offers an effective approach to data scarcity in medical imaging, advancing surgical vision applications. Abstract: The scarcity of annotated surgical data poses a significant challenge for developing deep learning systems in computer-assisted interventions. While diffusion models can synthesize realistic images, they often suffer from data memorization, resulting in inconsistent or non-diverse samples that may fail to improve, or even harm, downstream performance. We introduce Surgical Application-Aligned Diffusion (SAADi), a new framework that aligns diffusion models with samples preferred by downstream models. Our method constructs pairs of preferred and non-preferred synthetic images and employs lightweight fine-tuning of diffusion models to align the image generation process with downstream objectives explicitly. Experiments on three surgical datasets demonstrate consistent gains of 7-9% in classification and 2-10% in segmentation tasks, with considerable improvements observed for underrepresented classes. Iterative refinement of synthetic samples further boosts performance by 4-10%. Unlike baseline approaches, our method overcomes sample degradation and establishes task-aware alignment as a key principle for mitigating data scarcity and advancing surgical vision applications.

[135] A Kernel Space-based Multidimensional Sparse Model for Dynamic PET Image Denoising

Kuang Xiaodong,Li Bingxuan,Li Yuan,Rao Fan,Ma Gege,Xie Qingguo,Mok Greta S P,Liu Huafeng,Zhu Wentao

Main category: cs.CV

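TL;DR: Proposes KMDS-Net, a model-based neural network for dynamic PET image denoising that exploits inter-frame spatial correlation and intra-frame structural consistency to substantially improve denoising performance.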

Details Motivation: Short frames in dynamic PET imaging have limited statistics, making image quality hard to guarantee, and existing methods fall short in spatiotemporal resolution and noise suppression. Method: Builds a kernel space-based multidimensional sparse (KMDS) model and combines it with deep learning, replacing the parameter estimation steps with neural networks for end-to-end adaptive optimization. Result: Experiments on simulated and real data show that KMDS-Net denoises better than baseline methods and can effectively improve the temporal and spatial resolution of dynamic PET. Conclusion: KMDS-Net is an efficient method for dynamic PET image denoising with promising applications. Abstract: Achieving high image quality for temporal frames in dynamic positron emission tomography (PET) is challenging due to limited statistics, especially for short frames. Recent studies have shown that deep learning (DL) is useful in a wide range of medical image denoising tasks. In this paper, we propose a model-based neural network for dynamic PET image denoising. The inter-frame spatial correlation and intra-frame structural consistency in dynamic PET are used to establish the kernel space-based multidimensional sparse (KMDS) model. We then substitute the inherent forms of the parameter estimation with neural networks to enable adaptive parameter optimization, forming the end-to-end neural KMDS-Net. Extensive experimental results from simulated and real data demonstrate that the neural KMDS-Net exhibits strong denoising performance for dynamic PET, outperforming previous baseline methods. The proposed method may be used to effectively achieve high temporal and spatial resolution for dynamic PET. Our source code is available at https://github.com/Kuangxd/Neural-KMDS-Net/tree/main.

[136] Surgical Video Understanding with Label Interpolation

Garam Kim,Tae Kyeong Jeong,Juyoun Park

Main category: cs.CV

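TL;DR: Proposes a framework that combines optical flow-based segmentation label interpolation with multi-task learning to address the temporal-spatial imbalance in surgical scene understanding.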

Details Motivation: Prior work focuses mostly on single-task approaches, which struggle to capture complex surgical scenes; meanwhile, the pixel-level segmentation data needed for multi-task learning is scarce because annotation is costly, leading to a temporal-spatial annotation imbalance. Method: Uses optical flow estimated from annotated key frames to propagate labels to adjacent unlabeled frames, interpolating segmentation labels, and trains within a multi-task learning framework. Result: The method alleviates the temporal-spatial annotation imbalance, enriches sparse spatial supervision, and improves the accuracy and efficiency of surgical scene understanding. Conclusion: The framework reduces reliance on manual annotation while improving visual understanding in robot-assisted surgery, which may help its clinical adoption. Abstract: Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal-spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning. Optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.
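The label propagation step described in this abstract can be illustrated with a minimal sketch. It assumes a dense per-pixel flow field is already available from some estimator, uses simple forward warping with rounding, and leaves unfilled pixels as background; the function name `propagate_labels` and these choices are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

LabelMap = List[List[int]]
FlowField = List[List[Tuple[float, float]]]  # per-pixel (dy, dx) displacement


def propagate_labels(labels: LabelMap, flow: FlowField, background: int = 0) -> LabelMap:
    """Forward-warp key-frame segmentation labels to an adjacent frame.

    Hypothetical helper: each labeled pixel is moved by its (rounded) flow
    vector; pixels that land outside the image are dropped.
    """
    height, width = len(labels), len(labels[0])
    warped = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if labels[y][x] == background:
                continue
            dy, dx = flow[y][x]
            ty, tx = y + round(dy), x + round(dx)
            if 0 <= ty < height and 0 <= tx < width:
                warped[ty][tx] = labels[y][x]
    return warped


# Usage: a single instrument pixel moves one column to the right.
key_frame = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
flow = [[(0.0, 1.0)] * 3 for _ in range(3)]
next_frame_labels = propagate_labels(key_frame, flow)
```

Forward warping can leave holes where several source pixels collapse or none arrive; a practical pipeline would likely need hole filling or backward warping, which this sketch omits.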

[137] Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

Yanzuo Lu,Xin Xia,Manlin Zhang,Huafeng Kuang,Jianbin Zheng,Yuxi Ren,Xuefeng Xiao

Main category: cs.CV

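TL;DR: Proposes Hyper-Bagel, a unified acceleration framework for multimodal understanding and generation that uses speculative decoding and multi-stage distillation to deliver large speedups while preserving output quality.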

Details Motivation: As contexts integrate growing numbers of interleaved multimodal tokens, diffusion denoising and autoregressive decoding incur heavy computational overhead, so efficient acceleration is needed. Method: Adopts a divide-and-conquer strategy, using speculative decoding for next-token prediction and multi-stage distillation for the diffusion denoising process. Result: Achieves over a 2x speedup on multimodal understanding; for generation, the 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, and the 1-NFE model supports near real-time interaction. Conclusion: Hyper-Bagel substantially improves the speed and efficiency of multimodal processing and, by combining adversarial distillation with human feedback learning, enables cost-effective real-time multimodal interaction. Abstract: Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
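The abstract does not spell out Hyper-Bagel's speculative decoding scheme, so the following is only a generic greedy sketch of the idea: a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is kept. The toy models and the name `speculative_step` are assumptions made for illustration.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token context to the next token


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """One round of greedy speculative decoding (illustrative sketch).

    The output always equals what the target model would produce greedily,
    so the draft only affects speed, never the result.
    """
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target verifies the proposal. A real system would score all
    #    k positions in one batched forward pass; here it is a plain loop.
    accepted: List[int] = []
    for token in proposal:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)

    # 3. Every draft token was accepted; the target adds one more token.
    accepted.append(target(context + accepted))
    return accepted


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # counts upward modulo 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong after a 3


new_tokens = speculative_step(toy_target, toy_draft, [1], k=4)
```

In this toy run the draft agrees with the target on two tokens and the target corrects the third, so three tokens are produced in one verification round.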

[138] Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography

Gianmarco Spinaci,Lukas Klic,Giovanni Colavizza

Main category: cs.CV

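TL;DR: Evaluates multimodal LLMs (such as GPT-4o and Gemini 2.5) and vision-language models (such as CLIP and SigLIP) on single-label classification of Christian iconography against fine-tuned ResNet50 baselines. Some general-purpose models outperform the traditional approach, especially when class descriptions are added, but few-shot learning brings limited benefit.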

Details Motivation: To explore whether general-purpose multimodal models can handle image classification in the culturally complex domain of Christian iconography without dedicated training, and to assess their potential as metadata curation tools in the digital humanities. Method: Uses three datasets with Iconclass support (ArtDL, ICONCLASS, and Wikidata), filtered to the 10 most frequent classes, and tests models under three conditions: classification with class labels only, classification with Iconclass descriptions, and few-shot learning with five exemplars; results are compared against fine-tuned ResNet50 baselines. Result: Gemini-2.5 Pro and GPT-4o outperform the ResNet50 baselines overall; accuracy drops for all models on the Wikidata dataset, where SigLIP performs best; adding class descriptions generally improves zero-shot performance, while few-shot learning mostly brings no meaningful gain. Conclusion: General-purpose multimodal LLMs can classify images in visually complex cultural heritage domains and can serve as metadata annotation tools in the digital humanities; future work should optimize prompting strategies and extend the study to more models and classification methods. Abstract: This study evaluates the capabilities of Multimodal Large Language Models (LLMs) and Vision Language Models (VLMs) in the task of single-label classification of Christian Iconography. The goal was to assess whether general-purpose VLMs (CLIP and SigLIP) and LLMs, such as GPT-4o and Gemini 2.5, can interpret the Iconography, typically addressed by supervised classifiers, and evaluate their performance. Two research questions guided the analysis: (RQ1) How do multimodal LLMs perform on image classification of Christian saints? And (RQ2), how does performance vary when enriching input with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets supporting Iconclass natively: ArtDL, ICONCLASS, and Wikidata, filtered to include the top 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines. Accuracy dropped significantly on the Wikidata dataset, where SigLIP reached the highest accuracy score, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal increments in accuracy.
We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains. These results support the application of LLMs as metadata curation tools in digital humanities workflows, suggesting future research on prompt optimization and the expansion of the study to other classification strategies and models.

[139] ViG-LRGC: Vision Graph Neural Networks with Learnable Reparameterized Graph Construction

Ismael Elsharkawi,Hossam Sharara,Ahmed Rafea

Main category: cs.CV

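TL;DR: Proposes LRGC, a learnable, hyper-parameter-free graph construction method for Vision Graph Neural Networks (ViG) that uses key-query attention and soft-threshold reparameterization to learn graph structure end to end, outperforming existing ViG models on ImageNet-1k.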

Details Motivation: Existing ViG models build image graphs with non-learnable statistical methods such as k-NN or similarity thresholding, which cannot adaptively select the best neighborhood and require manually set hyper-parameters, limiting performance. Method: Proposes Learnable Reparameterized Graph Construction (LRGC), which applies key-query attention to node pairs in each layer and selects edges differentiably through soft-threshold reparameterization, making graph construction learnable with no preset hyper-parameters. Result: ViG-LRGC outperforms state-of-the-art ViG models of similar size on ImageNet-1k. Conclusion: LRGC gives Vision GNNs a learnable graph construction method that removes the bias of earlier approaches and learns an adaptive threshold per layer, improving image representation. Abstract: Image Representation Learning is an important problem in Computer Vision. Traditionally, images were processed as grids, using Convolutional Neural Networks or as a sequence of visual tokens, using Vision Transformers. Recently, Vision Graph Neural Networks (ViG) have proposed the treatment of images as a graph of nodes, which provides a more intuitive image representation. The challenge is to construct a graph of nodes in each layer that best represents the relations between nodes and does not need a hyper-parameter search. ViG models in the literature depend on non-parameterized and non-learnable statistical methods that operate on the latent features of nodes to create a graph. This might not select the best neighborhood for each node. Starting from k-NN graph construction to HyperGraph Construction and Similarity-Thresholded graph construction, these methods lack the ability to provide a learnable hyper-parameter-free graph construction method. To overcome those challenges, we present the Learnable Reparameterized Graph Construction (LRGC) for Vision Graph Neural Networks. LRGC applies key-query attention between every pair of nodes; then uses soft-threshold reparameterization for edge selection, which allows the use of a differentiable mathematical model for training. Using learnable parameters to select the neighborhood removes the bias that is induced by any clustering or thresholding methods previously introduced in the literature.
In addition, LRGC allows tuning the threshold in each layer to the training data since the thresholds are learnable through training and are not provided as hyper-parameters to the model. We demonstrate that the proposed ViG-LRGC approach outperforms state-of-the-art ViG models of similar sizes on the ImageNet-1k benchmark dataset.
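A rough sketch of the edge selection idea follows. The abstract names key-query attention and soft-threshold reparameterization but gives no formulas, so the scaled dot-product score, the sigmoid-parameterized threshold, and the names `build_graph` and `soft_threshold` are all assumptions made for illustration; in a real model the projections and the threshold parameter would be trained by backpropagation.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def soft_threshold(score: float, s: float) -> float:
    """Shrink a score toward zero by tau = sigmoid(s); small scores become 0.

    `s` stands in for a learnable per-layer parameter (assumed form).
    """
    tau = 1.0 / (1.0 + math.exp(-s))
    return math.copysign(max(abs(score) - tau, 0.0), score)


def build_graph(nodes: List[Vector], w_query: Matrix, w_key: Matrix, s: float) -> Matrix:
    """Return a weighted adjacency matrix; a zero entry means no edge."""
    scale = math.sqrt(len(w_query))
    queries = [matvec(w_query, x) for x in nodes]
    keys = [matvec(w_key, x) for x in nodes]
    adjacency: Matrix = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k)) / scale
            row.append(0.0 if i == j else soft_threshold(score, s))
        adjacency.append(row)
    return adjacency


# Usage: two similar nodes get an edge, an orthogonal node does not.
identity = [[1.0, 0.0], [0.0, 1.0]]
adjacency = build_graph([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], identity, identity, s=0.0)
```

Because the threshold enters through a differentiable shrinkage rather than a hard top-k cut, gradients can reach both the projections and `s`, which is the property the abstract attributes to LRGC.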

[140] Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model

Xueyu Liu,Xiaoyi Zhang,Guangze Shi,Meilin Liu,Yexin Lai,Yongfei Wu,Mingqiang Wei

Main category: cs.CV

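TL;DR: Proposes Point Prompt Defender, a point prompt optimization framework based on adversarial reinforcement learning that uses an attack-defense mechanism to automatically improve prompt quality for SAM.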

Details Motivation: Existing methods rely on heuristic or manually crafted prompts, which limits scalability and generalization, so an automatic way to optimize prompts is needed to make SAM more robust. Method: Builds a task-agnostic environment with a dual-space graph representation and trains attacker and defender agents with Deep Q-Networks; the attacker tries to degrade SAM's segmentation performance and the defender restores accuracy. Only the defender is used at inference to refine prompts. Result: Experiments show the method improves SAM's segmentation performance, robustness, and generalization across a variety of tasks. Conclusion: Point Prompt Defender provides a flexible, interpretable, plug-and-play prompt optimization framework that strengthens prompt-based segmentation. Abstract: Prompt quality plays a critical role in the performance of the Segment Anything Model (SAM), yet existing approaches often rely on heuristic or manually crafted prompts, limiting scalability and generalization. In this paper, we propose Point Prompt Defender, an adversarial reinforcement learning framework that adopts an attack-for-defense paradigm to automatically optimize point prompts. We construct a task-agnostic point prompt environment by representing image patches as nodes in a dual-space graph, where edges encode both physical and semantic distances. Within this environment, an attacker agent learns to activate a subset of prompts that maximally degrade SAM's segmentation performance, while a defender agent learns to suppress these disruptive prompts and restore accuracy. Both agents are trained using Deep Q-Networks with a reward signal based on segmentation quality variation. During inference, only the defender is deployed to refine arbitrary coarse prompt sets, enabling enhanced SAM segmentation performance across diverse tasks without retraining. Extensive experiments show that Point Prompt Defender effectively improves SAM's robustness and generalization, establishing a flexible, interpretable, and plug-and-play framework for prompt-based segmentation.

[141] SmartWilds: Multimodal Wildlife Monitoring Dataset

Jenna Kline,Anirudh Potlapally,Bharath Pillai,Tanishka Wani,Rugved Katole,Vedant Patil,Penelope Covey,Hari Subramoni,Tanya Berger-Wolf,Christopher Stewart

Main category: cs.CV

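TL;DR: Presents the first release of SmartWilds, a multimodal wildlife monitoring dataset comprising drone imagery, camera trap data, and bioacoustic recordings, to support environmental monitoring and conservation research.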

Details Motivation: To provide comprehensive, reproducible multimodal data for endangered species research, conservation ecology, and habitat management. Method: Collects four days of synchronized multimodal data at The Wilds safari park in Ohio, including drone imagery, camera trap photographs and videos, and bioacoustic recordings, and compares the performance of the sensor modalities. Result: Collects and releases multimodal data covering a 220-acre area with a range of native and non-native species, showing the complementary strengths of each sensor for land-use patterns, species detection, behavioral analysis, and habitat monitoring. Conclusion: SmartWilds establishes reproducible protocols for multimodal wildlife monitoring and advances conservation computer vision research; future releases are planned to add GPS tracking, citizen science data, and coverage across multiple seasons. Abstract: We present the first release of SmartWilds, a multimodal wildlife monitoring dataset. SmartWilds is a synchronized collection of drone imagery, camera trap photographs and videos, and bioacoustic recordings collected during summer 2025 at The Wilds safari park in Ohio. This dataset supports multimodal AI research for comprehensive environmental monitoring, addressing critical needs in endangered species research, conservation ecology, and habitat management. Our pilot deployment captured four days of synchronized monitoring across three modalities in a 220-acre pasture containing Pere David's deer, Sichuan takin, Przewalski's horses, as well as species native to Ohio, including bald eagles, white-tailed deer, and coyotes. We provide a comparative analysis of sensor modality performance, demonstrating complementary strengths for land-use patterns, species detection, behavioral analysis, and habitat monitoring. This work establishes reproducible protocols for multimodal wildlife monitoring while contributing open datasets to advance conservation computer vision research. Future releases will include synchronized GPS tracking data from tagged individuals, citizen science data, and expanded temporal coverage across multiple seasons.

[142] RS3DBench: A Comprehensive Benchmark for 3D Spatial Perception in Remote Sensing

Jiayu Wang,Ruizhi Wang,Jie Song,Haofei Zhang,Mingli Song,Zunlei Feng,Li Sun

Main category: cs.CV

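TL;DR: Introduces RS3DBench, a benchmark of 54,951 pairs of remote sensing images and pixel-level aligned depth maps with textual descriptions, intended to advance large-scale 3D vision models for remote sensing, and proposes a Stable Diffusion-based depth estimation model that achieves state-of-the-art performance on it.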

Details Motivation: Existing remote sensing datasets generally lack accurate depth information or precise alignment between images and depth data, which holds back 3D visual perception models, so a high-quality, large-scale 3D understanding benchmark is needed. Method: Builds RS3DBench, a multimodal remote sensing 3D understanding benchmark containing images, depth maps, and textual descriptions, and develops a remote sensing depth estimation model based on Stable Diffusion with multimodal fusion capabilities. Result: RS3DBench contains 54,951 aligned samples covering diverse geographical scenes; the proposed depth estimation model achieves state-of-the-art performance on the dataset. Conclusion: RS3DBench is an important resource for 3D spatial understanding of remote sensing imagery and advances geographic artificial intelligence and general-purpose large-scale 3D vision models. Abstract: In this paper, we introduce a novel benchmark designed to propel the advancement of general-purpose, large-scale 3D vision models for remote sensing imagery. While several datasets have been proposed within the realm of remote sensing, many existing collections either lack comprehensive depth information or fail to establish precise alignment between depth data and remote sensing images. To address this deficiency, we present a visual Benchmark for 3D understanding of Remotely Sensed images, dubbed RS3DBench. This dataset encompasses 54,951 pairs of remote sensing images and pixel-level aligned depth maps, accompanied by corresponding textual descriptions, spanning a broad array of geographical contexts. It serves as a tool for training and assessing 3D visual perception models within remote sensing image spatial understanding tasks. Furthermore, we introduce a remotely sensed depth estimation model derived from stable diffusion, harnessing its multimodal fusion capabilities, thereby delivering state-of-the-art performance on our dataset. Our endeavor seeks to make a profound contribution to the evolution of 3D visual perception models and the advancement of geographic artificial intelligence within the remote sensing domain. The dataset, models and code will be available at https://rs3dbench.github.io.

[143] DeblurSplat: SfM-free 3D Gaussian Splatting with Event Camera for Robust Deblurring

Pengteng Li,Yunfan Lu,Pinhao Song,Weiyu Guo,Huizai Yao,F. Richard Yu,Hui Xiong

Main category: cs.CV

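TL;DR: Proposes DeblurSplat, a Structure-from-Motion (SfM)-free deblurring 3D Gaussian Splatting method that uses an event camera and a pretrained dense stereo module (DUSt3R) to obtain accurate initial point clouds directly, and refines the scene with fine-grained supervision from the event stream, improving deblurred 3D reconstruction quality and rendering efficiency.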

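Details Motivation: Existing SfM-based 3D Gaussian Splatting methods accumulate errors from inaccurate camera pose estimation when handling motion-blurred images, which degrades point cloud quality; they are also insensitive to dynamic scene changes, making high-quality deblurred reconstruction difficult. Method: (1) Uses the pretrained dense stereo module DUSt3R to obtain accurate initial point clouds directly from blurred images, avoiding error propagation from SfM camera pose estimation; (2) introduces the event stream from an event camera, which is highly sensitive to dynamic change, and decodes latent sharp images from the event stream and blurred images to provide fine-grained supervision for scene reconstruction optimization. Result: Experiments show that DeblurSplat generates high-fidelity novel views across a range of scenes, clearly outperforms state-of-the-art methods on deblurring 3D-GS, and renders more efficiently. Conclusion: DeblurSplat achieves SfM-free deblurring 3D Gaussian Splatting by combining the event stream with a pretrained stereo model, improving reconstruction quality and efficiency in motion-blurred scenes and suggesting a direction for real-time sharp visual reconstruction. Abstract: In this paper, we propose the first Structure-from-Motion (SfM)-free deblurring 3D Gaussian Splatting method via event camera, dubbed DeblurSplat. We address the motion-deblurring problem in two ways. First, we leverage the pretrained capability of the dense stereo module (DUSt3R) to directly obtain accurate initial point clouds from blurred images. Without calculating camera poses as an intermediate result, we avoid the cumulative error transfer from inaccurate camera poses to the initial point clouds' positions. Second, we introduce the event stream into the deblur pipeline for its high sensitivity to dynamic change. By decoding the latent sharp images from the event stream and blurred images, we can provide a fine-grained supervision signal for scene reconstruction optimization. Extensive experiments across a range of scenes demonstrate that DeblurSplat not only excels in generating high-fidelity novel views but also achieves significant rendering efficiency compared to the SOTAs in deblur 3D-GS.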

[144] MoiréNet: A Compact Dual-Domain Network for Image Demoiréing

Shuwei Guo,Simin Luan,Yan Ke,Zeyd Boukhers,John See,Cong Yang

Main category: cs.CV

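TL;DR: Proposes MoiréNet, a U-Net-based convolutional network that combines frequency-domain and spatial-domain features to remove moiré patterns arising between displays and camera sensors, achieving leading performance with far fewer parameters.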

Details Motivation: Moiré patterns are caused by spectral aliasing between display pixel lattices and camera sensor grids and are anisotropic and multi-scale, so traditional demoiréing methods struggle with these complex artifacts. Method: Proposes MoiréNet, which includes a Directional Frequency-Spatial Encoder (DFSE) to identify moiré orientation and a Frequency-Spatial Adaptive Selector (FSAS) for feature-adaptive suppression, jointly using frequency-domain and spatial-domain information. Result: Achieves state-of-the-art results on public, actively used datasets with only 5.513M parameters, a 48% reduction compared to ESDNet-L, while delivering better restoration quality. Conclusion: MoiréNet balances demoiréing performance and parameter efficiency, making it suitable for resource-constrained settings such as smartphone photography, industrial imaging, and augmented reality. Abstract: Moiré patterns arise from spectral aliasing between display pixel lattices and camera sensor grids, manifesting as anisotropic, multi-scale artifacts that pose significant challenges for digital image demoiréing. We propose MoiréNet, a convolutional neural U-Net-based framework that synergistically integrates frequency and spatial domain features for effective artifact removal. MoiréNet introduces two key components: a Directional Frequency-Spatial Encoder (DFSE) that discerns moiré orientation via directional difference convolution, and a Frequency-Spatial Adaptive Selector (FSAS) that enables precise, feature-adaptive suppression. Extensive experiments demonstrate that MoiréNet achieves state-of-the-art performance on public and actively used datasets while being highly parameter-efficient. With only 5.513M parameters, representing a 48% reduction compared to ESDNet-L, MoiréNet combines superior restoration quality with parameter efficiency, making it well-suited for resource-constrained applications including smartphone photography, industrial imaging, and augmented reality.

[145] Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation

Yunzhe Shen,Kai Peng,Leiye Liu,Wei Ji,Jingjing Li,Miao Zhang,Yongri Piao,Huchuan Lu

Main category: cs.CV

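TL;DR: Proposes FAVS, an audio-visual segmentation framework that resolves the inherent contradictions between modalities through frequency-domain decomposition and recomposition. Its FDED and SCMC modules strengthen cross-modal consistency while preserving modality-specific features, achieving state-of-the-art performance on multiple benchmark datasets.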

Details Motivation: Existing audio-visual segmentation methods overlook the fundamental frequency-domain differences between audio and visual modalities, in particular the conflict between interfering noise in high-frequency audio and rich structural detail in high-frequency visual signals, which limits performance. Method: Reformulates AVS as a frequency-domain decomposition and recomposition problem and proposes the FAVS framework with two core modules: the FDED module uses residual-based iterative frequency decomposition to separate modality-specific semantics from structural features, and the SCMC module uses a mixture-of-experts architecture with dynamic expert routing to reinforce semantic consistency and feature preservation. Result: In extensive experiments on three benchmark datasets, FAVS achieves state-of-the-art performance, and qualitative visualizations confirm the effectiveness of the FDED and SCMC modules. Conclusion: Through its frequency-aware design, FAVS resolves the high-frequency conflict between audio and visual modalities and improves multimodal fusion, offering a new technical route for audio-visual segmentation. Abstract: Audio-visual segmentation (AVS) plays a critical role in multimodal machine learning by effectively integrating audio and visual cues to precisely segment objects or regions within visual scenes. Recent AVS methods have demonstrated significant improvements. However, they overlook the inherent frequency-domain contradictions between audio and visual modalities: the pervasively interfering noise in audio high-frequency signals vs. the structurally rich details in visual high-frequency signals. Ignoring these differences can result in suboptimal performance. In this paper, we rethink the AVS task from a deeper perspective by reformulating the AVS task as a frequency-domain decomposition and recomposition problem. To this end, we introduce a novel Frequency-Aware Audio-Visual Segmentation (FAVS) framework consisting of two key modules: the Frequency-Domain Enhanced Decomposer (FDED) module and the Synergistic Cross-Modal Consistency (SCMC) module. The FDED module employs a residual-based iterative frequency decomposition to discriminate modality-specific semantics and structural features, and the SCMC module leverages a mixture-of-experts architecture to reinforce semantic consistency and modality-specific feature preservation through dynamic expert routing. Extensive experiments demonstrate that our FAVS framework achieves state-of-the-art performance on three benchmark datasets, and abundant qualitative visualizations further verify the effectiveness of the proposed FDED and SCMC modules. The code will be released as open source upon acceptance of the paper.

[146] xAI-CV: An Overview of Explainable Artificial Intelligence in Computer Vision

Nguyen Van Tu,Pham Nguyen Hai Long,Vo Hoai Viet

Main category: cs.CV

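TL;DR: Surveys four representative explainable AI (xAI) approaches for visual perception tasks: saliency maps, Concept Bottleneck Models (CBM), prototype-based methods, and hybrid approaches. It analyzes their mechanisms, strengths, limitations, and evaluation metrics to guide future research and applications.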

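Details Motivation: Deep learning performs well in image analysis, but its "black-box" nature makes decisions hard to interpret, raising reliability concerns in critical applications, so explainable AI (xAI) is needed to help humans understand how models reach decisions. Method: A survey that systematically analyzes four representative xAI approaches (saliency maps, Concept Bottleneck Models, prototype-based methods, and hybrid approaches), compares their mechanisms, strengths, and limitations, and discusses common evaluation metrics. Result: Lays out how the four families of xAI methods work and where they apply, summarizes their strengths and challenges, and collects existing evaluation approaches, giving a broad view of the current state of xAI. Conclusion: Each xAI approach has trade-offs, and the choice should weigh interpretability against performance for the application at hand; future work should improve explanation quality, establish unified evaluation standards, and integrate xAI into real systems. Abstract: Deep learning has become the de facto standard and dominant paradigm in image analysis tasks, achieving state-of-the-art performance. However, this approach often results in "black-box" models, whose decision-making processes are difficult to interpret, raising concerns about reliability in critical applications. To address this challenge and give humans a way to understand how AI models process information and make decisions, the field of xAI has emerged. This paper surveys four representative approaches in xAI for visual perception tasks: (i) Saliency Maps, (ii) Concept Bottleneck Models (CBM), (iii) Prototype-based methods, and (iv) Hybrid approaches. We analyze their underlying mechanisms, strengths and limitations, as well as evaluation metrics, thereby providing a comprehensive overview to guide future research and applications.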

[147] LiDAR Point Cloud Image-based Generation Using Denoising Diffusion Probabilistic Models

Amirhesam Aghanouri,Cristina Olaverri-Monreal

Main category: cs.CV

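TL;DR: Proposes an enhanced denoising diffusion probabilistic model (DDPM) with new noise scheduling and time-step embedding techniques that generates high-quality synthetic LiDAR point cloud data to improve autonomous vehicle perception.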

Details Motivation: Collecting real-world LiDAR data is time-consuming and the data is often noisy and sparse, which limits autonomous driving systems in complex environments, so high-quality, diverse synthetic data is needed to make perception models more robust. Method: Applies a denoising diffusion probabilistic model (DDPM) with a novel noise scheduling strategy and time-step embedding technique that improve the model's temporal awareness, and generates more realistic 3D point clouds based on the projection. Result: Evaluated extensively on the IAMCV and KITTI-360 datasets with four performance metrics against SOTA methods, the approach handles noisy and sparse LiDAR data well and produces diverse point clouds with rich spatial relationships and structural detail. Conclusion: The method improves the quality and diversity of synthetic LiDAR data for autonomous driving perception, mitigates noise and sparsity in real data, and outperforms most existing baselines. Abstract: Autonomous vehicles (AVs) are expected to revolutionize transportation by improving efficiency and safety. Their success relies on 3D vision systems that effectively sense the environment and detect traffic agents. Among the sensors AVs use to create a comprehensive view of their surroundings, LiDAR provides high-resolution depth data enabling accurate object detection, safe navigation, and collision avoidance. However, collecting real-world LiDAR data is time-consuming and often affected by noise and sparsity due to adverse weather or sensor limitations. This work applies a denoising diffusion probabilistic model (DDPM), enhanced with novel noise scheduling and time-step embedding techniques to generate high-quality synthetic data for augmentation, thereby improving performance across a range of computer vision tasks, particularly in AV perception. These modifications impact the denoising process and the model's temporal awareness, allowing it to produce more realistic point clouds based on the projection. The proposed method was extensively evaluated under various configurations using the IAMCV and KITTI-360 datasets, with four performance metrics compared against state-of-the-art (SOTA) methods. The results demonstrate the model's superior performance over most existing baselines and its effectiveness in mitigating the effects of noisy and sparse LiDAR data, producing diverse point clouds with rich spatial relationships and structural detail.
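For context on what a noise schedule controls, here is a minimal sketch of the standard DDPM forward (noising) process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. It uses the common linear beta schedule as a stand-in; the paper's novel schedule and time-step embedding are not given in the abstract, so nothing below should be read as the authors' design. The flat list `x0` stands for a flattened projected LiDAR image, which is also an assumption.

```python
import math
import random
from typing import List


def linear_betas(steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Baseline linear schedule (assumed stand-in for the paper's schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative products abar_t = prod_{i<=t} (1 - beta_i)."""
    result, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        result.append(product)
    return result


def q_sample(x0: List[float], t: int, abar: List[float], rng: random.Random) -> List[float]:
    """Draw x_t from q(x_t | x_0) in closed form."""
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    return [signal * value + noise * rng.gauss(0.0, 1.0) for value in x0]


betas = linear_betas(1000)
abar = alpha_bars(betas)
noisy = q_sample([0.5, -0.5, 0.0], t=999, abar=abar, rng=random.Random(0))
```

The schedule determines how quickly `abar` decays from near 1 to near 0, that is, how fast signal is replaced by noise; changing it alters what the denoiser must learn at each step.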

[148] Advancing Metallic Surface Defect Detection via Anomaly-Guided Pretraining on a Large Industrial Dataset

Chuni Liu,Hongjie Li,Jiaqi Du,Yangyang Hou,Qian Sun,Lei Jin,Ke Xu

Main category: cs.CV

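TL;DR: Proposes Anomaly-Guided Self-Supervised Pretraining (AGSSP), which uses anomaly priors to improve representation learning for metallic surface defect detection and clearly outperforms ImageNet-pretrained models on several metrics.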

Details Motivation: Existing pretraining methods work poorly for metallic surface defect detection, either because of the domain gap or because they cannot separate subtle defects from complex backgrounds. Method: Uses a two-stage framework: stage one pretrains the backbone by distilling knowledge from anomaly maps, and stage two pretrains the detector with pseudo-defect boxes derived from those maps; a knowledge-enhanced method generates high-quality anomaly maps, and a large-scale industrial image dataset is collected. Result: Performance improves consistently across settings; compared with ImageNet-pretrained models, mAP@0.5 rises by up to 10% and mAP@0.5:0.95 by up to 11.4%. Conclusion: AGSSP mitigates data scarcity and the domain gap, offering a better self-supervised pretraining paradigm for industrial defect detection. Abstract: The pretraining-finetuning paradigm is a crucial strategy in metallic surface defect detection for mitigating the challenges posed by data scarcity. However, its implementation presents a critical dilemma. Pretraining on natural image datasets such as ImageNet faces a significant domain gap. Meanwhile, naive self-supervised pretraining on in-domain industrial data is often ineffective due to the inability of existing learning objectives to distinguish subtle defect patterns from complex background noise and textures. To resolve this, we introduce Anomaly-Guided Self-Supervised Pretraining (AGSSP), a novel paradigm that explicitly guides representation learning through anomaly priors. AGSSP employs a two-stage framework: (1) it first pretrains the model's backbone by distilling knowledge from anomaly maps, encouraging the network to capture defect-salient features; (2) it then pretrains the detector using pseudo-defect boxes derived from these maps, aligning it with localization tasks. To enable this, we develop a knowledge-enhanced method to generate high-quality anomaly maps and collect a large-scale industrial dataset of 120,000 images. Additionally, we present two small-scale, pixel-level labeled metallic surface defect datasets for validation. Extensive experiments demonstrate that AGSSP consistently enhances performance across various settings, achieving up to a 10% improvement in mAP@0.5 and 11.4% in mAP@0.5:0.95 compared to ImageNet-based models. All code, pretrained models, and datasets are publicly available at https://clovermini.github.io/AGSSP-Dev/.
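One plausible way to turn an anomaly map into pseudo-defect boxes, as stage two requires, is to threshold the map and take the bounding box of each connected region. The abstract does not say how the boxes are derived, so the fixed threshold, 4-connectivity, and the name `pseudo_boxes` below are assumptions for illustration only.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def pseudo_boxes(anomaly_map: List[List[float]], threshold: float = 0.5) -> List[Box]:
    """Bounding boxes of 4-connected regions whose score is >= threshold."""
    height, width = len(anomaly_map), len(anomaly_map[0])
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or anomaly_map[y][x] < threshold:
                continue
            # Breadth-first search over one connected anomalous region.
            seen[y][x] = True
            queue = deque([(y, x)])
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and anomaly_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Usage: an L-shaped blob in the top-left corner and one isolated hot pixel.
example_map = [
    [0.9, 0.9, 0.1, 0.0],
    [0.9, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
]
boxes = pseudo_boxes(example_map)
```

Boxes obtained this way are noisy labels, which fits their use for pretraining a detector rather than for final evaluation.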

[149] Audio-Driven Universal Gaussian Head Avatars

Kartik Teotia,Helge Rhodin,Mohit Mendiratta,Hyeongwoo Kim,Marc Habermann,Christian Theobalt

Main category: cs.CV

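TL;DR: Proposes the first audio-driven universal photorealistic head avatar synthesis method, combining a person-agnostic speech model with a new Universal Head Avatar Prior (UHAP) to generate realistic facial expressions with both geometric and appearance variations at high fidelity.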

Details Motivation: Most existing methods map audio only to geometric deformations, ignore audio-dependent appearance changes, and do not generalize across individuals, so an audio-driven avatar synthesis method that models geometry and appearance jointly and works for different people is needed. Method: Proposes the Universal Head Avatar Prior (UHAP), trained on cross-identity multi-view videos and supervised with neutral scan data to preserve identity detail; designs a universal speech model that maps raw audio directly into UHAP's latent expression space, which covers both geometric and appearance variations; and uses a monocular encoder for efficient personalization to new subjects. Result: The generated avatars show precise lip synchronization and nuanced expressive detail (such as eyebrow movement, gaze shifts, and mouth interior appearance), and the method outperforms existing geometry-only methods on lip-sync accuracy, image quality, and perceptual realism. Conclusion: This is the first generalizable audio-driven avatar generation framework with detailed appearance modeling, and it performs better in realism, synchronization, and expressiveness, improving the quality of synthesized avatars. Abstract: We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, our UHAP is supervised with neutral scan data, enabling it to capture the identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features to geometric deformations only while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes both geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it enables the subsequent model fine-tuning stage to focus exclusively on capturing the subject's global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth interior appearance as well as motion.
Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model that can account for detailed appearance modeling and rendering, but it also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.

[150] SynapFlow: A Modular Framework Towards Large-Scale Analysis of Dendritic Spines

Pamela Osuna-Vargas,Altug Kamacioglu,Dominik F. Aschauer,Petros E. Vlachos,Sercan Alipek,Jochen Triesch,Simon Rumpel,Matthias Kaschube

Main category: cs.CV

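TL;DR: Presents SynapFlow, a modular machine learning-based pipeline that automates detection, time-tracking, and feature extraction of dendritic spines in 3D+time two-photon microscopy data, supporting research into the neural mechanisms of learning and memory.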

Details Motivation: The structural dynamics of dendritic spines are an important indicator for studying the neural basis of learning and memory, but large-scale analysis in 3D+time microscopy data remains challenging and labor-intensive. Method: Builds an end-to-end automated pipeline that combines a transformer-based detection module, a depth-tracking component that integrates spatial features, a time-tracking module that leverages spatial consistency, and a feature extraction unit that quantifies biologically relevant properties. Result: The method is validated on open-source labeled data and on two new annotated datasets released by the authors (one for detection and depth-tracking, one for time-tracking), achieving accurate spine detection and tracking across time. Conclusion: The method provides a reliable baseline for large-scale, scalable analysis of dendritic spine dynamics, and the data, code, and pre-trained weights are released to support follow-up research. Abstract: Dendritic spines are key structural components of excitatory synapses in the brain. Given that the size of dendritic spines provides a proxy for synaptic efficacy, their detection and tracking across time is important for studies of the neural basis of learning and memory. Despite their relevance, large-scale analyses of the structural dynamics of dendritic spines in 3D+time microscopy data remain challenging and labor-intensive. Here, we present a modular machine learning-based pipeline designed to automate the detection, time-tracking, and feature extraction of dendritic spines in volumes chronically recorded with two-photon microscopy. Our approach tackles the challenges posed by biological data by combining a transformer-based detection module, a depth-tracking component that integrates spatial features, a time-tracking module to associate 3D spines across time by leveraging spatial consistency, and a feature extraction unit that quantifies biologically relevant spine properties. We validate our method on open-source labeled spine data, and on two complementary annotated datasets that we publish alongside this work: one for detection and depth-tracking, and one for time-tracking, which, to the best of our knowledge, is the first data of this kind. To encourage future research, we release our data, code, and pre-trained weights at https://github.com/pamelaosuna/SynapFlow, establishing a baseline for scalable, end-to-end analysis of dendritic spine dynamics.

[151] No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning

Matheus Vinícius Todescato,Joel Luís Carbonera

Main category: cs.CV

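TL;DR: Proposes a zero-shot image classification framework that combines a vision-language model (VLM) with a pre-trained visual model and self-learns on unlabeled data through a confidence-based pseudo-labeling strategy, without fine-tuning the VLM or using large language models; it outperforms the baseline on ten diverse datasets.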

Details Motivation: Deep learning depends on large amounts of annotated data, which limits its use where data is scarce, and existing zero-shot methods struggle to fully exploit the complementarity of semantic and visual features. Method: A self-learning cycle is built: the VLM generates initial pseudo-labels from the class names, and high-confidence samples are selected by confidence; a pre-trained visual model (such as a CNN or ViT) extracts and enhances their visual features; these features train a lightweight classifier, which is refined iteratively. The whole process requires no labeled data, no VLM fine-tuning, and no large language model. Result: Experiments on ten different datasets show that the method significantly outperforms the standard zero-shot classification baseline, supporting its effectiveness and generalization ability. Conclusion: By fusing the semantic understanding of the VLM with the representational power of a pre-trained visual model, the framework achieves better zero-shot image classification without manual annotation or model fine-tuning, and shows potential for practical use. Abstract: While deep learning, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has significantly advanced classification performance, its typical reliance on extensive annotated datasets presents a major obstacle in many practical scenarios where such data is scarce. Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle. Requiring only the set of class names and no labeled training data, our method utilizes a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations. These enhanced features then iteratively train the classifier, allowing the system to capture complementary semantic and visual cues without supervision. Notably, our approach avoids VLM fine-tuning and the use of large language models, relying on the visual-only model to reduce the dependence on semantic representation. Experimental evaluations on ten diverse datasets demonstrate that our approach outperforms the baseline zero-shot method.
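
The confidence-based pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_confident`, the 0.9 threshold, and the toy probabilities are assumptions made for illustration only, and the per-class probabilities would in practice come from a VLM.

```python
def select_confident(probs, threshold=0.9):
    """Return (sample_index, pseudo_label) pairs for confident samples.

    probs: list of per-class probability lists, one per unlabeled sample
           (hypothetical stand-in for VLM zero-shot scores).
    threshold: minimum top-class probability to accept a pseudo-label
               (0.9 is an illustrative assumption, not a value from the paper).
    """
    selected = []
    for i, p in enumerate(probs):
        # Index of the most probable class for this sample.
        label = max(range(len(p)), key=p.__getitem__)
        if p[label] >= threshold:
            selected.append((i, label))
    return selected


# Toy scores for three samples over three classes (made-up numbers).
toy_probs = [
    [0.95, 0.03, 0.02],  # confident, class 0
    [0.40, 0.35, 0.25],  # ambiguous, skipped
    [0.05, 0.92, 0.03],  # confident, class 1
]
confident = select_confident(toy_probs)
```

Only the confident pairs would then be used to train the lightweight classifier; the ambiguous sample is left for a later iteration.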

[152] Seeing Through Reflections: Advancing 3D Scene Reconstruction in Mirror-Containing Environments with Gaussian Splatting

Zijing Guo,Yunyang Zhao,Lin Wang

Main category: cs.CV

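TL;DR: This paper introduces the MirrorScene3D dataset and the ReflectiveGS method, which uses mirror reflections as complementary viewpoints to improve 3D reconstruction quality in mirror-containing environments.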

Details Motivation: Existing 3D reconstruction methods perform poorly in mirror-containing environments and overlook the rich information carried by mirror reflections. Method: The authors build the MirrorScene3D dataset and propose ReflectiveGS, which extends 3D Gaussian Splatting to use mirror reflections as additional viewpoints for reconstruction. Result: Experiments on MirrorScene3D show that ReflectiveGS outperforms existing methods in SSIM, PSNR, LPIPS, and training speed. Conclusion: ReflectiveGS effectively improves 3D reconstruction quality in mirror-containing environments and sets a new benchmark for the field. Abstract: Mirror-containing environments pose unique challenges for 3D reconstruction and novel view synthesis (NVS), as reflective surfaces introduce view-dependent distortions and inconsistencies. While cutting-edge methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) excel in typical scenes, their performance deteriorates in the presence of mirrors. Existing solutions mainly focus on handling mirror surfaces through symmetry mapping but often overlook the rich information carried by mirror reflections. These reflections offer complementary perspectives that can fill in absent details and significantly enhance reconstruction quality. To advance 3D reconstruction in mirror-rich environments, we present MirrorScene3D, a comprehensive dataset featuring diverse indoor scenes, 1256 high-quality images, and annotated mirror masks, providing a benchmark for evaluating reconstruction methods in reflective settings. Building on this, we propose ReflectiveGS, an extension of 3D Gaussian Splatting that utilizes mirror reflections as complementary viewpoints rather than simple symmetry artifacts, enhancing scene geometry and recovering absent details. Experiments on MirrorScene3D show that ReflectiveGS outperforms existing methods in SSIM, PSNR, LPIPS, and training speed, setting a new benchmark for 3D reconstruction in mirror-rich environments.
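
The geometric idea behind treating a reflection as an extra viewpoint can be sketched as follows. The sketch only shows the standard reflection of a 3D point (for example a camera center) across a planar mirror; it is not the ReflectiveGS algorithm, and the function name `reflect_point` and the example plane are assumptions for illustration.

```python
def reflect_point(x, plane_point, normal):
    """Reflect 3D point x across the plane through plane_point with the given normal.

    Uses x' = x - 2 * ((x - p0) . n) * n with n normalized to unit length.
    """
    norm = sum(c * c for c in normal) ** 0.5
    if norm == 0:
        raise ValueError("normal must be non-zero")
    n = [c / norm for c in normal]
    # Signed distance from x to the mirror plane.
    d = sum((xi - pi) * ni for xi, pi, ni in zip(x, plane_point, n))
    return [xi - 2.0 * d * ni for xi, ni in zip(x, n)]


# Hypothetical mirror lying in the plane z = 0, camera center at (1, 2, 3).
virtual_camera = reflect_point([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

Reflecting twice returns the original point, which is a quick sanity check for any implementation of this mapping.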

[153] Generative data augmentation for biliary tract detection on intraoperative images

Cristina Iacono,Mariarosaria Meola,Federica Conte,Laura Mecozzi,Umberto Bracale,Pietro Falco,Fanny Ficuciello

Main category: cs.CV

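TL;DR: This paper proposes a deep-learning approach that uses the YOLO detection algorithm and synthetic data generated by a generative adversarial network (GAN) to localize the biliary tract in intraoperative white-light images, with the aim of reducing the risk of bile duct injury during laparoscopic cholecystectomy.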

Details Motivation: Laparoscopic cholecystectomy offers fast recovery and good cosmetic results but carries a higher risk of bile duct injury, which affects patients' quality of life and survival, so intraoperative visualization of the bile duct needs to improve to avoid such injuries. Method: An image database is built and annotated to train the YOLO detection algorithm, and classical data augmentation techniques are used together with a GAN that generates synthetic training data to improve model performance. Result: The experimental results suggest that the proposed method has potential for biliary tract localization; related ethical issues are also discussed. Conclusion: Synthetic data generated by a GAN can effectively augment the training set and improve the accuracy of deep-learning models for intraoperative biliary tract localization, helping to lower the risk of bile duct injury. Abstract: Cholecystectomy is one of the most frequently performed procedures in gastrointestinal surgery, and the laparoscopic approach is the gold standard for symptomatic cholecystolithiasis and acute cholecystitis. Alongside the advantages of a significantly faster recovery and better cosmetic results, the laparoscopic approach bears a higher risk of bile duct injury, which has a significant impact on quality of life and survival. To avoid bile duct injury, it is essential to improve the intraoperative visualization of the bile duct. This work aims to address this problem by leveraging a deep-learning approach for the localization of the biliary tract from white-light images acquired during the surgical procedures. To this end, an image database was constructed and annotated to train the YOLO detection algorithm. Besides classical data augmentation techniques, the paper proposes a Generative Adversarial Network (GAN) for the generation of a synthetic portion of the training dataset. Experimental results are discussed along with ethical considerations.

[154] Prompt-DAS: Annotation-Efficient Prompt Learning for Domain Adaptive Semantic Segmentation of Electron Microscopy Images

Jiabao Chen,Shan Xiong,Jialin Peng

Main category: cs.CV

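TL;DR: Proposes Prompt-DAS, a promptable multitask framework for domain adaptive segmentation of organelle instances in electron microscopy images, supporting unsupervised, weakly supervised, and interactive segmentation.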

Details Motivation: To achieve annotation-efficient domain adaptive segmentation and reduce reliance on large amounts of annotated data. Method: Introduces a point-prompt mechanism that accepts a variable number of prompts, combined with an auxiliary center-point detection task and prompt-guided contrastive learning to strengthen feature discrimination. Result: Experiments on several challenging benchmarks show that the method outperforms UDA, WDA, and SAM-based approaches. Conclusion: Prompt-DAS is flexible and efficient, supports multiple segmentation modes under different prompt configurations, and markedly improves cross-domain segmentation performance. Abstract: Domain adaptive segmentation (DAS) of numerous organelle instances from large-scale electron microscopy (EM) is a promising way to enable annotation-efficient learning. Inspired by SAM, we propose a promptable multitask framework, namely Prompt-DAS, which is flexible enough to utilize any number of point prompts during the adaptation training stage and testing stage. Thus, with varying prompt configurations, Prompt-DAS can perform unsupervised domain adaptation (UDA) and weakly supervised domain adaptation (WDA), as well as interactive segmentation during testing. Unlike the foundation model SAM, which necessitates a prompt for each individual object instance, Prompt-DAS is only trained on a small dataset and can utilize full points on all instances, sparse points on partial instances, or even no points at all, facilitated by the incorporation of an auxiliary center-point detection task. Moreover, a novel prompt-guided contrastive learning is proposed to enhance discriminative feature learning. Comprehensive experiments conducted on challenging benchmarks demonstrate the effectiveness of the proposed approach over existing UDA, WDA, and SAM-based approaches.

[155] Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards

Honghao Chen,Xingzhou Lou,Xiaokun Feng,Kaiqi Huang,Xinlong Wang

Main category: cs.CV

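TL;DR: This paper proposes a fine-grained chain-of-step reasoning framework for vision-language models that makes the reasoning process evaluable through step-level reward modeling and, combined with reinforcement learning, substantially improves performance on multimodal tasks.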

Details Motivation: Existing vision-language reasoning methods mostly use coarse-grained reasoning chains, which makes fine-grained structured reasoning difficult and the quality of intermediate steps hard to evaluate. A finer, evaluable reasoning mechanism is therefore needed. Method: Proposes the Chain of Step Reasoning framework: step-level reasoning data is constructed, a process reward model (PRM) is trained to assess the quality of each reasoning step, and the fine-grained rewards are used for reinforcement learning training and inference-time scaling. Result: The method sets strong baselines on several challenging vision-language benchmarks with consistent performance gains, and ablation studies reveal the impact of each component and several intriguing properties of inference-time scaling. Conclusion: The work provides a simple, effective, and transparent framework for fine-grained reasoning and evaluation in vision-language models, which improves performance and offers an evaluable, optimizable basis for complex multimodal reasoning. Abstract: Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggle to perform fine-grained structured reasoning and, more importantly, make it difficult to evaluate the reward and quality of intermediate reasoning. In this work, we delve into chain of step reasoning for vision-language models, enabling accurate assessment of reasoning step quality and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at https://github.com/baaivision/CoS.

[156] Weakly Supervised Food Image Segmentation using Vision Transformers and Segment Anything Model

Ioannis Sarafis,Alexandros Papadopoulos,Anastasios Delopoulos

Main category: cs.CV

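TL;DR: Proposes a weakly supervised semantic segmentation method that uses SAM and the attention mechanisms of ViTs to generate segmentation masks for food images, trained with image-level annotations only and without pixel-level annotations.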

Details Motivation: To reduce reliance on pixel-level annotations while improving food image segmentation, the work uses zero-shot capabilities and the attention mechanisms of Vision Transformers to improve weakly supervised segmentation. Method: A Swin Transformer produces class activation maps (CAMs) that serve as prompts for SAM; image preprocessing and single-mask and multi-mask generation strategies are combined to refine SAM's output. Result: On the FoodSeg103 dataset the method generates an average of 2.4 masks per image and reaches an mIoU of 0.54 in the multi-mask scenario. Conclusion: The method can be used to accelerate food image annotation or be integrated into food and nutrition tracking applications. Abstract: In this paper, we propose a weakly supervised semantic segmentation approach for food images which takes advantage of the zero-shot capabilities and promptability of the Segment Anything Model (SAM) along with the attention mechanisms of Vision Transformers (ViTs). Specifically, we use class activation maps (CAMs) from ViTs to generate prompts for SAM, resulting in masks suitable for food image segmentation. The ViT model, a Swin Transformer, is trained exclusively using image-level annotations, eliminating the need for pixel-level annotations during training. Additionally, to enhance the quality of the SAM-generated masks, we examine the use of image preprocessing techniques in combination with single-mask and multi-mask SAM generation strategies. The methodology is evaluated on the FoodSeg103 dataset, generating an average of 2.4 masks per image (excluding background), and achieving an mIoU of 0.54 for the multi-mask scenario. We envision the proposed approach as a tool to accelerate food image annotation tasks or as an integrated component in food and nutrition tracking applications.
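
For readers unfamiliar with the mIoU metric reported above, here is a minimal sketch of one common way to compute it on label maps. The exact evaluation protocol used for FoodSeg103 (for example how background or absent classes are handled) is not stated in the abstract, so the averaging rule below and the name `mean_iou` are assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for two 2D label maps.

    Classes absent from both prediction and target are skipped
    (one common convention; evaluation protocols differ).
    """
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                in_p, in_t = p == c, t == c
                inter += in_p and in_t   # pixel belongs to class c in both maps
                union += in_p or in_t    # pixel belongs to class c in either map
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Tiny 2x2 example with two classes (made-up labels).
toy_pred = [[0, 1], [1, 1]]
toy_target = [[0, 1], [0, 1]]
toy_miou = mean_iou(toy_pred, toy_target, num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is their average.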

[157] A DyL-Unet framework based on dynamic learning for Temporally Consistent Echocardiographic Segmentation

Jierui Qu,Jianchun Zhao

Main category: cs.CV

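TL;DR: This paper proposes DyL-UNet, a dynamic learning framework for temporally stable and precise segmentation of cardiac anatomy in echocardiography. By constructing an Echo-Dynamics Graph (EDG) and introducing Cardiac Phase-Dynamics Attention (CPDA), the method markedly improves temporal consistency while maintaining high segmentation accuracy.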

Details Motivation: Echocardiography is prone to deformation and speckle noise, which causes frame-to-frame segmentation jitter; even when single-frame segmentation is accurate, temporal instability can affect functional assessment and clinical interpretation. A segmentation method that guarantees temporal consistency is therefore needed. Method: Proposes DyL-UNet, which combines multiple Swin-Transformer-based encoder-decoder branches, introduces Cardiac Phase-Dynamics Attention (CPDA) at the skip connections, and uses the Echo-Dynamics Graph (EDG) to extract dynamic information from videos to strengthen temporal consistency. Result: Experiments on the CAMUS and EchoNet-Dynamic datasets show that DyL-UNet markedly improves temporal stability while keeping segmentation accuracy comparable to other methods. Conclusion: DyL-UNet effectively improves the temporal consistency of echocardiographic segmentation and offers a reliable solution for automated clinical analysis. Abstract: Accurate segmentation of cardiac anatomy in echocardiography is essential for cardiovascular diagnosis and treatment. Yet echocardiography is prone to deformation and speckle noise, causing frame-to-frame segmentation jitter. Even with high accuracy in single-frame segmentation, temporal instability can weaken functional estimates and impair clinical interpretability. To address these issues, we propose DyL-UNet, a dynamic learning-based temporal consistency U-Net segmentation architecture designed to achieve temporally stable and precise echocardiographic segmentation. The framework constructs an Echo-Dynamics Graph (EDG) through dynamic learning to extract dynamic information from videos. DyL-UNet incorporates multiple Swin-Transformer-based encoder-decoder branches for processing single-frame images. It further introduces Cardiac Phase-Dynamics Attention (CPDA) at the skip connections, which uses EDG-encoded dynamic features and cardiac-phase cues to enforce temporal consistency during segmentation. Extensive experiments on the CAMUS and EchoNet-Dynamic datasets demonstrate that DyL-UNet maintains segmentation accuracy comparable to existing methods while achieving superior temporal consistency, providing a reliable solution for automated clinical echocardiography.

[158] WaveletGaussian: Wavelet-domain Diffusion for Sparse-view 3D Gaussian Object Reconstruction

Hung Nguyen,Runfa Li,An Le,Truong Nguyen

Main category: cs.CV

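TL;DR: Proposes the WaveletGaussian framework, which applies a diffusion model in the wavelet domain and refines the high-frequency subbands with a lightweight network, substantially improving the efficiency of sparse-view 3D Gaussian reconstruction.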

Details Motivation: Existing methods degrade under sparse views and depend on computationally expensive diffusion fine-tuning and repair steps, so a more efficient reconstruction scheme is needed. Method: The diffusion model is applied to the low-frequency LL subband after the wavelet transform, the high-frequency subbands are refined by a lightweight network, and an online random masking strategy builds training pairs efficiently. Result: Experiments on the Mip-NeRF 360 and OmniObject3D datasets show that the method substantially reduces training time while keeping rendering quality competitive. Conclusion: Through wavelet-domain decomposition and an efficient training strategy, WaveletGaussian achieves fast, high-quality sparse-view 3D Gaussian reconstruction. Abstract: 3D Gaussian Splatting (3DGS) has become a powerful representation for image-based object reconstruction, yet its performance drops sharply in sparse-view settings. Prior works address this limitation by employing diffusion models to repair corrupted renders, subsequently using them as pseudo ground truths for later optimization. While effective, such approaches incur heavy computation from the diffusion fine-tuning and repair steps. We present WaveletGaussian, a framework for more efficient sparse-view 3D Gaussian object reconstruction. Our key idea is to shift diffusion into the wavelet domain: diffusion is applied only to the low-resolution LL subband, while high-frequency subbands are refined with a lightweight network. We further propose an efficient online random masking strategy to curate training pairs for diffusion fine-tuning, replacing the commonly used, but inefficient, leave-one-out strategy. Experiments across two benchmark datasets, Mip-NeRF 360 and OmniObject3D, show WaveletGaussian achieves competitive rendering quality while substantially reducing training time.
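
To make the LL subband concrete, the following minimal sketch performs a single-level 2D Haar decomposition in plain Python. The abstract does not say which wavelet the authors use, so the Haar choice, the function name `haar_dwt2`, and the subband naming (conventions for LH and HL vary between libraries) are assumptions. The point is only that LL has half the height and width of the input, which is why running diffusion on it alone is cheaper.

```python
def haar_dwt2(image):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list.

    Returns (LL, LH, HL, HH); each subband has half the rows and columns.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)  # local average (low-pass both ways)
            lh_row.append((a - b + d - e) / 2.0)  # difference across columns
            hl_row.append((a + b - d - e) / 2.0)  # difference across rows
            hh_row.append((a - b - d + e) / 2.0)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


# A 4x4 toy image made of constant 2x2 blocks, so all detail subbands are zero.
toy_image = [
    [1, 1, 3, 3],
    [1, 1, 3, 3],
    [5, 5, 7, 7],
    [5, 5, 7, 7],
]
LL, LH, HL, HH = haar_dwt2(toy_image)
```

Here the 4x4 input yields a 2x2 LL subband, a quarter of the original pixel count.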

[159] 3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference

Alexey Nekrasov,Ali Athar,Daan de Geus,Alexander Hermans,Bastian Leibe

Main category: cs.CV

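TL;DR: Sa2VA-i is an improved version of Sa2VA that fixes inconsistencies between training and inference and achieves new state-of-the-art performance on several video object segmentation benchmarks.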

Details Motivation: Sa2VA was found not to reach its full potential on referring video object segmentation, mainly because of inconsistencies between its training and inference procedures. Method: Proposes Sa2VA-i, which corrects the inconsistencies between training and inference in the original model while building on the same checkpoints. Result: Performance improves markedly on MeViS, Ref-YT-VOS, Ref-DAVIS, and ReVOS, with gains of up to +11.6 J&F on MeViS; the Sa2VA-i-1B model performs on par with the original Sa2VA-26B model. Conclusion: Better consistency at inference can substantially improve model performance, which underlines the importance of implementation details and offers useful insights for referring video segmentation. Abstract: Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i

[160] Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications

Ganesh Mallya,Yotam Gigi,Dahun Kim,Maxim Neumann,Genady Beryozkin,Tomer Shekel,Anelia Angelova

Main category: cs.CV

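TL;DR: Proposes a training-free method that feeds multi-spectral data in zero-shot mode to generalist multimodal models trained only on RGB inputs (such as Gemini2.5); by adapting the input space and injecting domain-specific instructions, it markedly improves land cover and land use classification in remote sensing tasks.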

Details Motivation: Existing multi-spectral image analysis relies on specially trained models, which are costly and hard to combine with powerful generalist multimodal models; at the same time, generalist models cannot interpret multi-spectral signals, which limits their use in remote sensing. Method: A training-free approach adapts multi-spectral inputs to the visual space of a generalist multimodal model and introduces domain knowledge through instruction injection, so that a model trained only on RGB (such as Gemini2.5) can process multi-spectral data. Result: Marked zero-shot performance gains are obtained on several remote sensing benchmarks, demonstrating the quick adaptability of Gemini2.5 to new inputs. Conclusion: The method lets geospatial professionals readily use powerful generalist multimodal models on non-standard sensor data, drawing on their reasoning and contextual understanding to make remote sensing analysis more efficient. Abstract: Multi-spectral imagery plays a crucial role in diverse Remote Sensing applications including land-use classification, environmental monitoring and urban planning. These images are widely adopted because their additional spectral bands correlate strongly with physical materials on the ground, such as ice, water, and vegetation. This allows for more accurate identification, and their public availability from missions, such as Sentinel-2 and Landsat, only adds to their value. Currently, the automatic analysis of such data is predominantly managed through machine learning models specifically trained for multi-spectral input, which are costly to train and support. Furthermore, although providing a lot of utility for Remote Sensing, such additional inputs cannot be used with powerful generalist large multimodal models, which are capable of solving many visual problems, but are not able to understand specialized multi-spectral signals. To address this, we propose a training-free approach which introduces new multi-spectral data in a Zero-Shot-only mode, as inputs to generalist multimodal models, trained on RGB-only inputs. Our approach leverages the multimodal models' understanding of the visual space, and proposes to adapt inputs to that space, and to inject domain-specific information as instructions into the model. We exemplify this idea with the Gemini2.5 model and observe strong Zero-Shot performance gains of the approach on popular Remote Sensing benchmarks for land cover and land use classification and demonstrate the easy adaptability of Gemini2.5 to new inputs. 
These results highlight the potential for geospatial professionals, working with non-standard specialized inputs, to easily leverage powerful multimodal models, such as Gemini2.5, to accelerate their work, benefiting from their rich reasoning and contextual capabilities, grounded in the specialized sensor data.

[161] Investigating Traffic Accident Detection Using Multimodal Large Language Models

Ilhan Skender,Kailin Tong,Selim Solmaz,Daniel Watzenig

Main category: cs.CV

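TL;DR: This study examines the ability of multimodal large language models (MLLMs) to detect traffic accidents from infrastructure camera images in a zero-shot setting, and combines them with visual analytics such as YOLO, Deep SORT, and SAM to improve performance.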

Details Motivation: Because real traffic accident data is scarce and costly to annotate, the study explores zero-shot methods that can detect accidents effectively without large labeled datasets. Method: The DeepAccident dataset, simulated with CARLA, is used to evaluate MLLMs such as Gemini 1.5/2.0, Gemma 3, and Pixtral without fine-tuning, and visual information extracted by YOLO, Deep SORT, and SAM is added to the prompts to improve accuracy. Result: Pixtral performs best with an F1-score of 0.71 and 83% recall; the Gemini models improve in precision (up to 90%) but drop in F1 and recall; Gemma 3 is the most stable. Conclusion: Combining advanced visual analytics can markedly improve the zero-shot detection ability of MLLMs in traffic monitoring and has potential for practical application. Abstract: Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure-based vision sensors offer scalable and efficient solutions for continuous real-time monitoring, facilitating automated detection of accidents directly from captured images. This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure-based accident data through controlled simulations; (2) Comparative performance analysis between Gemini 1.5 and 2.0, Gemma 3 and Pixtral models in accident identification and descriptive capabilities without prior fine-tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi-object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1-score of 0.71 and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. 
These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real-world automated traffic monitoring systems.
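
The abstract notes that a precision gain can coincide with a lower F1-score. A minimal sketch of the standard definitions shows why: F1 is the harmonic mean of precision and recall, so a drop in recall can outweigh a rise in precision. The counts below are hypothetical and are not taken from the paper; the function name `precision_recall_f1` is likewise an illustrative choice.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical detector A: balanced precision and recall.
p_a, r_a, f1_a = precision_recall_f1(tp=8, fp=2, fn=2)

# Hypothetical detector B: higher precision, but it misses many accidents.
p_b, r_b, f1_b = precision_recall_f1(tp=9, fp=1, fn=9)
```

Detector B has the higher precision (0.9 against 0.8) yet the lower F1, because its recall falls to 0.5.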

[162] Track-On2: Enhancing Online Point Tracking with Memory

Görkay Aydemir,Weidi Xie,Fatma Güney

Main category: cs.CV

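TL;DR: This paper presents Track-On2, a transformer-based model for online long-term point tracking that achieves state-of-the-art performance on multiple benchmarks through architectural refinements, more effective use of memory, and synthetic training strategies.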

Details Motivation: To address long-term point tracking, which requires consistently identifying points across video frames under significant appearance changes, motion, and occlusion, in a form suitable for real-time and streaming applications. Method: Extends the earlier Track-On model into a simple, efficient transformer that processes frames causally and maintains temporal coherence through a memory mechanism; at inference it performs coarse patch-level classification followed by refinement. Result: Achieves state-of-the-art results on five synthetic and real-world benchmarks, surpassing prior online trackers and even some strong offline methods that exploit bidirectional context. Conclusion: Shows that causal, memory-based architectures trained only on synthetic data are a scalable solution for real-world point tracking. Abstract: In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across video frames under significant appearance changes, motion, and occlusion. We target the online setting, i.e. tracking points frame-by-frame, making it suitable for real-time and streaming applications. We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies. Unlike prior approaches that rely on full-sequence access or iterative updates, our model processes frames causally and maintains temporal coherence via a memory mechanism, which is key to handling drift and occlusions without requiring future frames. At inference, we perform coarse patch-level classification followed by refinement. Beyond architecture, we systematically study synthetic training setups and their impact on memory behavior, showing how they shape temporal robustness over long sequences. Through comprehensive experiments, Track-On2 achieves state-of-the-art results across five synthetic and real-world benchmarks, surpassing prior online trackers and even strong offline methods that exploit bidirectional context. These results highlight the effectiveness of causal, memory-based architectures trained purely on synthetic data as scalable solutions for real-world point tracking. Project page: https://kuis-ai.github.io/track_on2

[163] KAMERA: Enhancing Aerial Surveys of Ice-associated Seals in Arctic Environments

Adam Romlein,Benjamin X. Hou,Yuval Boss,Cynthia L. Christman,Stacie Koslovsky,Erin E. Moreland,Jason Parham,Anthony Hoogs

Main category: cs.CV

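TL;DR: KAMERA is a comprehensive system for multi-camera, multi-spectral synchronization and real-time detection of seals and polar bears that substantially reduces data processing time, supports metadata annotation and geographic mapping, and is fully open-source.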

Details Motivation: To improve detection efficiency and data processing speed for ice-associated seals and polar bears in aerial surveys of the seas around Alaska. Method: Rigorous calibration and hardware synchronization, combined with multi-spectral imaging, enable real-time object detection in a multi-camera system, and imagery and detections are mapped onto a world plane. Result: Dataset processing time is reduced by up to 80% compared with previous methods; all data carry metadata, and detections can be assessed quickly while surveyed area is estimated accurately. Conclusion: KAMERA improves the efficiency and accuracy of aerial wildlife surveys, and its open-source design may encourage similar applications in the scientific community. Abstract: We introduce KAMERA: a comprehensive system for multi-camera, multi-spectral synchronization and real-time detection of seals and polar bears. Utilized in aerial surveys for ice-associated seals in the Bering, Chukchi, and Beaufort seas around Alaska, KAMERA provides up to an 80% reduction in dataset processing time over previous methods. Our rigorous calibration and hardware synchronization enable using multiple spectra for object detection. All collected data are annotated with metadata so they can be easily referenced later. All imagery and animal detections from a survey are mapped onto a world plane for accurate surveyed area estimates and quick assessment of survey results. We hope KAMERA will inspire other mapping and detection efforts in the scientific community, with all software, models, and schematics fully open-sourced.

[164] NeuCODEX: Edge-Cloud Co-Inference with Spike-Driven Compression and Dynamic Early-Exit

Maurf Hassan,Steven Davy,Muhammad Zawish,Owais Bin Zuber,Nouman Ashraf

Main category: cs.CV

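TL;DR: NeuCODEX is a co-inference architecture for spiking neural networks that jointly optimizes spatial and temporal redundancy to enable efficient inference in edge-cloud systems, substantially reducing data transfer, energy consumption, and latency while maintaining high accuracy.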

Details Motivation: Running full spiking neural network (SNN) inference on edge devices suffers from high latency and energy consumption, while existing edge-cloud co-inference schemes are limited by high transmission cost and latency, so a more efficient co-inference mechanism is needed. Method: Proposes the NeuCODEX architecture, which combines a learned spike-driven compression module to reduce data transmission with a dynamic early-exit mechanism based on output confidence to terminate inference adaptively, thereby reducing spatial and temporal redundancy. Result: Evaluated on CIFAR10, Caltech, CIFAR10-DVS, and N-Caltech, NeuCODEX reduces data transfer by up to 2048x, cuts edge energy consumption by over 90%, and lowers end-to-end latency by up to 3x, with an accuracy loss below 2%. Conclusion: NeuCODEX supports practical, high-performance SNN deployment in resource-constrained environments and offers an energy-efficient solution for edge intelligence. Abstract: Spiking Neural Networks (SNNs) offer significant potential for enabling energy-efficient intelligence at the edge. However, performing full SNN inference at the edge can be challenging due to the latency and energy constraints arising from fixed and high timestep overheads. Edge-cloud co-inference systems present a promising solution, but their deployment is often hindered by high latency and feature transmission costs. To address these issues, we introduce NeuCODEX, a neuromorphic co-inference architecture that jointly optimizes both spatial and temporal redundancy. NeuCODEX incorporates a learned spike-driven compression module to reduce data transmission and employs a dynamic early-exit mechanism to adaptively terminate inference based on output confidence. We evaluated NeuCODEX on both static images (CIFAR10 and Caltech) and neuromorphic event streams (CIFAR10-DVS and N-Caltech). To demonstrate practicality, we prototyped NeuCODEX on ResNet-18 and VGG-16 backbones in a real edge-to-cloud testbed. Our proposed system reduces data transfer by up to 2048x and edge energy consumption by over 90%, while reducing end-to-end latency by up to 3x compared to edge-only inference, all with a negligible accuracy drop of less than 2%. In doing so, NeuCODEX enables practical, high-performance SNN deployment in resource-constrained environments.
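
The idea of terminating inference early based on output confidence can be sketched in a few lines. This is a generic illustration under stated assumptions, not the NeuCODEX mechanism: the per-timestep outputs are made-up numbers standing in for an SNN's output layer, the running mean followed by softmax is one plausible confidence measure, and the 0.8 threshold and the name `early_exit_inference` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_inference(step_outputs, threshold=0.8):
    """Accumulate per-timestep outputs and stop once confidence reaches the threshold.

    Returns (predicted_class, timesteps_used, confidence).
    """
    if not step_outputs:
        raise ValueError("need at least one timestep of output")
    totals = [0.0] * len(step_outputs[0])
    for t, out in enumerate(step_outputs, start=1):
        totals = [a + b for a, b in zip(totals, out)]
        probs = softmax([v / t for v in totals])  # confidence of the running mean
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: skip the remaining timesteps
    return probs.index(confidence), t, confidence


# Made-up outputs for 4 timesteps over 3 classes.
toy_steps = [
    [1.0, 0.8, 0.1],
    [3.0, 0.2, 0.0],
    [4.0, 0.1, 0.0],
    [4.0, 0.0, 0.0],
]
label, steps_used, conf = early_exit_inference(toy_steps)
```

With these toy values the loop stops before the last timestep; raising the threshold makes it run all four.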

[165] RoSe: Robust Self-supervised Stereo Matching under Adverse Weather Conditions

Yun Wang,Junjie Hu,Junhui Hou,Chenghao Zhang,Renwei Yang,Dapeng Oliver Wu

Main category: cs.CV

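TL;DR: This paper proposes RoSe, a self-supervised stereo matching method that is robust under adverse weather conditions; it introduces priors from a visual foundation model and scene correspondence priors to make feature representations and supervisory signals more robust, and validates the approach on synthetic data.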

Details Motivation: Existing self-supervised stereo matching methods degrade markedly under adverse conditions such as night, rain, and fog, mainly because weather degradation hampers CNN feature extraction and breaks the photometric consistency assumption. Method: Robust priors from a visual foundation model strengthen CNN feature extraction; synthetic datasets of clear and adverse image pairs that preserve semantics and disparity are constructed; and a two-step self-supervised training paradigm is proposed, consisting of scene correspondence learning and adverse weather distillation. Result: Experiments show that the method outperforms existing self-supervised stereo matching methods under a variety of adverse weather conditions, with good effectiveness and generalization. Conclusion: Introducing foundation-model priors and scene correspondence constraints can markedly improve the robustness and performance of self-supervised stereo matching models in adverse weather. Abstract: Recent self-supervised stereo matching methods have made significant progress, but their performance significantly degrades under adverse weather conditions such as night, rain, and fog. We identify two primary weaknesses contributing to this performance degradation. First, adverse weather introduces noise and reduces visibility, making CNN-based feature extractors struggle with degraded regions like reflective and textureless areas. Second, these degraded regions can disrupt accurate pixel correspondences, leading to ineffective supervision based on the photometric consistency assumption. To address these challenges, we propose injecting robust priors derived from the visual foundation model into the CNN-based feature extractor to improve feature representation under adverse weather conditions. We then introduce scene correspondence priors to construct robust supervisory signals rather than relying solely on the photometric consistency assumption. Specifically, we create synthetic stereo datasets with realistic weather degradations. These datasets feature clear and adverse image pairs that maintain the same semantic context and disparity, preserving the scene correspondence property. With this knowledge, we propose a robust self-supervised training paradigm, consisting of two key steps: robust self-supervised scene correspondence learning and adverse weather distillation. Both steps aim to align underlying scene results from clean and adverse image pairs, thus improving model disparity estimation under adverse weather effects. 
Extensive experiments demonstrate the effectiveness and versatility of our proposed solution, which outperforms existing state-of-the-art self-supervised methods. Code is available at https://github.com/cocowy1/RoSe-Robust-Self-supervised-Stereo-Matching-under-Adverse-Weather-Conditions.

[166] YOLO-LAN: Precise Polyp Detection via Optimized Loss, Augmentations and Negatives

Siddharth Gupta,Jitin Singla

Main category: cs.CV

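TL;DR: Proposes YOLO-LAN, a YOLO-based pipeline for colorectal polyp detection that combines the M2IoU loss with data augmentation and outperforms existing methods on multiple datasets.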

Details Motivation: Manual polyp detection during colonoscopy is inconsistent and prone to misses, so a more accurate, real-time automatic detection method is needed. Method: A detection pipeline is built on the YOLO architecture and trained with the M2IoU loss, diverse data augmentations, and negative samples to replicate real clinical situations. Result: The pipeline outperforms existing methods on the Kvasir-seg and BKAI-IGH NeoPolyp datasets; YOLOv12 reaches mAP$_{50}$ of 0.9619 and mAP$_{50:95}$ of 0.8599 on Kvasir-seg, a marked improvement in detection precision. Conclusion: YOLO-LAN is accurate and robust, can effectively assist clinical colorectal cancer screening, and has practical value. Abstract: Colorectal cancer (CRC), a lethal disease, begins with abnormal mucosal cell proliferations called polyps in the inner wall of the colon. When left undetected, polyps can become malignant tumors. Colonoscopy is the standard procedure for detecting polyps, as it enables direct visualization and removal of suspicious lesions. Manual detection by colonoscopy can be inconsistent and is subject to oversight. Therefore, object detection based on deep learning offers a better solution for a more accurate and real-time diagnosis during colonoscopy. In this work, we propose YOLO-LAN, a YOLO-based polyp detection pipeline, trained using M2IoU loss, versatile data augmentations and negative data to replicate real clinical situations. Our pipeline outperformed existing methods for the Kvasir-seg and BKAI-IGH NeoPolyp datasets, achieving mAP$_{50}$ of 0.9619, mAP$_{50:95}$ of 0.8599 with YOLOv12 and mAP$_{50}$ of 0.9540, mAP$_{50:95}$ of 0.8487 with YOLOv8 on the Kvasir-seg dataset. A significant increase is achieved in the mAP$_{50:95}$ score, showing the precision of polyp detection. We show robustness based on polyp size and precise location detection, making it clinically relevant in AI-assisted colorectal screening.

[167] The 1st Solution for MOSEv2 Challenge 2025: Long-term and Concept-aware Video Segmentation via SeC

Mingqi Gao,Jingkun Chen,Yunqi Miao,Gengshen Wu,Zhijin Qin,Jungong Han

Main category: cs.CV

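TL;DR: This report analyzes and adapts the SeC framework (built on SAM-2), whose long-term memory and concept-aware memory handle occlusion and distractors in complex semi-supervised video object segmentation, and takes first place in the MOSEv2 challenge with a JF score of 39.89%.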

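Details Motivation: To address the core difficulties of complex semi-supervised video object segmentation, namely object occlusion, reappearance, and background distractors, the model's temporal continuity modeling and semantic discrimination need to improve. Method: Builds on SeC, an enhanced SAM-2 framework, whose long-term memory mechanism maintains temporal consistency across frames and whose concept-aware memory module supplies semantic priors that suppress irrelevant distractors. Result: On the MOSEv2 track of the LSVOS Challenge, the method achieves a JF score of 39.89% on the test set, ranking first. Conclusion: Combining long-term memory with concept-aware memory markedly improves video object segmentation in complex scenes, confirming its effectiveness for occlusion, reappearance, and semantic distractors. Abstract: This technical report explores the MOSEv2 track of the LSVOS Challenge, which targets complex semi-supervised video object segmentation. By analysing and adapting SeC, an enhanced SAM-2 framework, we conduct a detailed study of its long-term memory and concept-aware memory, showing that long-term memory preserves temporal continuity under occlusion and reappearance, while concept-aware memory supplies semantic priors that suppress distractors; together, these traits directly benefit several of MOSEv2's core challenges. Our solution achieves a JF score of 39.89% on the test set, ranking 1st in the MOSEv2 track of the LSVOS Challenge.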

[168] Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

Yueyan Li,Chenggong Zhao,Zeyuan Zang,Caixia Yuan,Xiaojie Wang

Main category: cs.CV

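TL;DR: Inspired by the dual-stream hypothesis of human vision, this paper splits visual processing in vision-language models (VLMs) into object recognition and spatial perception, reveals a two-stage process of image content understanding and the geometric structure of positional representations, and on that basis proposes new methods that improve decoding efficiency and spatial reasoning.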

Details Motivation: Existing VLMs process visual information serially, unlike the parallel mechanism of human vision, and their opaque internals limit both understanding and architectural innovation; this motivates an analysis of VLM internals grounded in human vision. Method: Following the "what" and "where" dual-stream hypothesis, the paper studies object recognition and spatial perception in VLMs separately: it converts images into text token maps to analyze object recognition, theoretically derives and empirically verifies the geometric structure of positional representations, and proposes an instruction-agnostic token compression algorithm and a RoPE scaling technique. Result: Object recognition in VLMs is found to be a two-stage process running from attribute recognition to semantic disambiguation; the geometric structure behind positional representations is revealed; the token compression algorithm improves decoding efficiency and RoPE scaling strengthens spatial reasoning. Conclusion: Deconstructing and analyzing VLM visual processing deepens understanding of the models' inner workings and provides clear principles for designing more efficient VLM architectures with stronger reasoning. Abstract: Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.
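
The abstract mentions a RoPE scaling technique but does not give its form. As a rough illustration only, the sketch below shows standard rotary position embedding on a single 2-D feature pair, with a hypothetical `scale` factor that stretches or compresses position indices before rotation. The function names and the multiplicative form of the scaling are assumptions, not the paper's method.

```python
import math

def rope_rotate(pair, position, theta=1.0, scale=1.0):
    """Rotate one 2-D feature pair by the RoPE angle for `position`.

    `scale` multiplies the position index before rotation. It is a
    hypothetical knob standing in for the paper's RoPE scaling.
    """
    angle = scale * position * theta
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def attention_logit(q, k, q_pos, k_pos, scale=1.0):
    """Dot product of a rotated query and a rotated key."""
    return dot(rope_rotate(q, q_pos, scale=scale),
               rope_rotate(k, k_pos, scale=scale))
```

Because both vectors are rotated, the logit depends only on the relative offset `q_pos - k_pos`, and scaling positions by `s` is equivalent to multiplying every offset by `s`.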

[169] Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

Ioanna Ntinou,Alexandros Xenos,Yassine Ouali,Adrian Bulat,Georgios Tzimiropoulos

Main category: cs.CV

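TL;DR: Proposes a text-to-text retrieval method that needs no vision encoder: it uses VLLM-generated image descriptions for efficient cross-modal retrieval and outperforms traditional dual-encoder models in reducing the modality gap, improving compositionality, and protecting privacy.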

Details Motivation: Contrastively trained vision-language models such as CLIP suffer from shallow language understanding, a large modality gap, high training cost, and privacy concerns, and their dual-encoder structure limits text-to-image retrieval in particular. Method: The traditional text-to-image retrieval paradigm is replaced by a text-to-text retrieval framework built on structured image descriptions generated by a VLLM, using a single-encoder architecture calibrated on a small number of GPUs. Result: The approach matches or surpasses traditional multimodal models on multiple retrieval and compositionality benchmarks (including the newly released subFlickr and subCOCO), achieves state-of-the-art zero-shot performance, needs only a few hours of calibration on two GPUs, and performs well with models as small as 0.3B parameters. Conclusion: Vision encoders are not essential for retrieval; textual image descriptions can narrow the modality gap while improving performance and privacy, pointing toward efficient, lightweight vision-language understanding. Abstract: Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models.
Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters. Code is available at: https://github.com/IoannaNti/LexiCLIP
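
The text-to-text paradigm described above can be sketched as ranking stored image descriptions against a text query with a single encoder. The code below is a minimal illustration under strong assumptions: `embed` is a toy token-count stand-in for the learned text encoder (the paper itself argues against bag-of-words behaviour, so a real system would use a trained model), and the image ids and descriptions are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in encoder: lowercase token counts (NOT the paper's encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, descriptions):
    """Rank image ids by similarity between the query and each description."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), image_id)
              for image_id, desc in descriptions.items()]
    # Highest score first; ties broken by image id for determinism.
    return [image_id for _, image_id in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Only text is stored and compared at query time, which is what makes the pipeline vision-free once the descriptions have been generated.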

[170] Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs

Israfel Salazar,Desmond Elliott,Yova Kementchedjhieva

Main category: cs.CV

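TL;DR: This paper studies the compositional abilities of contrastive vision-language models when understanding long, dense captions. It finds that compositional training and long-caption training reinforce each other, but the effect depends on data quality and model design; high-quality long-caption data helps improve both capabilities at once.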

Details Motivation: Although vision-language models have made progress in binding visual and textual information, understanding long, dense captions remains challenging. The authors hypothesize that compositionality, the ability to reason about object-attribute bindings and inter-object relationships, is the key factor. Method: A range of models targeting compositionality and long-caption understanding are trained and evaluated to analyze how the two interact. Result: Compositional training improves long-caption retrieval, and long-caption training strengthens compositional understanding; however, these gains are sensitive to data quality and model design, and low-quality data or strategies such as freezing positional embeddings limit generalization. Conclusion: Compositional understanding and long-caption understanding are intertwined capabilities that can be learned jointly from high-quality, grounded long-caption data, offering practical guidance for improving VLM generalization. Abstract: Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, but understanding long, dense captions remains an open challenge. We hypothesize that compositionality, the capacity to reason about object-attribute bindings and inter-object relationships, is key to understanding longer captions. In this paper, we investigate the interaction between compositionality and long-caption understanding, asking whether training for one property enhances the other. We train and evaluate a range of models that target each of these capabilities. Our results reveal a bidirectional relationship: compositional training improves performance on long-caption retrieval, and training on long captions promotes compositionality. However, these gains are sensitive to data quality and model design. We find that training on poorly structured captions, or with limited parameter updates, fails to support generalization. Likewise, strategies that aim at retaining general alignment, such as freezing positional embeddings, do not improve compositional understanding. Overall, we find that compositional understanding and long-caption understanding are intertwined capabilities that can be jointly learned through training on dense, grounded descriptions. Despite these challenges, we show that models trained on high-quality, long-caption data can achieve strong performance in both tasks, offering practical guidance for improving VLM generalization.

[171] Enabling Plant Phenotyping in Weedy Environments using Multi-Modal Imagery via Synthetic and Generated Training Data

Earl Ranario,Ismael Mayanja,Heesup Yun,Brian N. Bailey,J. Mason Earles

Main category: cs.CV

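TL;DR: Proposes a framework that combines synthetic RGB imagery, a small number of real annotations, and GAN-based cross-modality alignment to improve the accuracy of semantic plant segmentation in thermal images.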

Details Motivation: Plant segmentation in thermal imagery is difficult outdoors because of low contrast between plants and weeds and frequent occlusions. Method: Models are trained on 1,128 synthetic images, supplemented with as few as five real annotated field images, and CycleGAN-turbo provides RGB-to-thermal cross-modal translation and alignment. Result: Compared with the full real-data baseline, segmentation shows a maximum relative improvement of 22% for the weed class and 17% for the plant class. Conclusion: Synthetic data combined with a few real annotations and cross-domain translation via generative models can significantly improve segmentation of multi-modal imagery in complex field environments. Abstract: Accurate plant segmentation in thermal imagery remains a significant challenge for high-throughput field phenotyping, particularly in outdoor environments where low contrast between plants and weeds and frequent occlusions hinder performance. To address this, we present a framework that leverages synthetic RGB imagery, a limited set of real annotations, and GAN-based cross-modality alignment to enhance semantic segmentation in thermal images. We trained models on 1,128 synthetic images containing complex mixtures of crop and weed plants in order to generate image segmentation masks for crop and weed plants. We additionally evaluated the benefit of integrating as few as five real, manually segmented field images within the training process using various sampling strategies. When combining all the synthetic images with a few labeled real images, we observed a maximum relative improvement of 22% for the weed class and 17% for the plant class compared to the full real-data baseline. Cross-modal alignment was enabled by translating RGB to thermal using CycleGAN-turbo, allowing robust template matching without calibration. Results demonstrated that combining synthetic data with limited manual annotations and cross-domain translation via generative models can significantly boost segmentation performance in complex field environments for multi-modal imagery.

[172] HyKid: An Open MRI Dataset with Expert-Annotated Multi-Structure and Choroid Plexus in Pediatric Hydrocephalus

Yunzhi Xu,Yushuang Ding,Hu Sun,Hongxi Zhang,Li Zhao

Main category: cs.CV

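TL;DR: This paper presents HyKid, an open-source dataset containing high-resolution 3D MRI scans of 48 pediatric hydrocephalus patients with expert-annotated segmentations of brain tissues and the choroid plexus, together with structured information extracted from clinical reports. It finds a strong correlation between choroid plexus volume and total CSF volume that may serve as a biomarker for hydrocephalus evaluation (AUC = 0.87), and it provides a high-quality benchmark for neuroimaging algorithms.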

Details Motivation: Evaluating pediatric hydrocephalus is challenging, and research is held back by the lack of public, expert-annotated medical imaging datasets, especially ones with choroid plexus segmentation. Method: Routine low-resolution MRI scans from 48 pediatric hydrocephalus patients are reconstructed into 1 mm isotropic high-resolution 3D MRI with a slice-to-volume algorithm; an experienced neurologist manually corrects segmentations of white matter, grey matter, lateral ventricle, external CSF, and the choroid plexus; and a Retrieval-Augmented Generation framework extracts structured data from clinical radiology reports. Result: Choroid plexus volume correlates strongly with total CSF volume, and this feature performs well in a predictive model (AUC = 0.87), suggesting a potential biomarker for hydrocephalus evaluation; the HyKid dataset provides a high-quality public benchmark for hydrocephalus research. Conclusion: HyKid fills the gap in public annotated data for pediatric hydrocephalus and highlights the role of the choroid plexus in disease assessment, supporting the development of neuroimaging-based diagnostic algorithms. Abstract: Evaluation of hydrocephalus in children is challenging, and the related research is limited by a lack of publicly available, expert-annotated datasets, particularly those with segmentation of the choroid plexus. To address this, we present HyKid, an open-source dataset from 48 pediatric patients with hydrocephalus. 3D MRIs were provided with 1 mm isotropic resolution, which was reconstructed from routine low-resolution images using a slice-to-volume algorithm. Manually corrected segmentations of brain tissues, including white matter, grey matter, lateral ventricle, external CSF, and the choroid plexus, were provided by an experienced neurologist. Additionally, structured data was extracted from clinical radiology reports using a Retrieval-Augmented Generation framework. The strong correlation between choroid plexus volume and total CSF volume provided a potential biomarker for hydrocephalus evaluation, achieving excellent performance in a predictive model (AUC = 0.87). The proposed HyKid dataset provided a high-quality benchmark for neuroimaging algorithm development, and it revealed the choroid plexus-related features in hydrocephalus assessments. Our datasets are publicly available at https://www.synapse.org/Synapse:syn68544889.

[173] MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation

Tongshuai Wu,Chao Lu,Ze Song,Yunlong Lin,Sizhe Fan,Xuemei Chen

Main category: cs.CV

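TL;DR: This paper proposes a Multi-scale Feature Interaction Network (MsFIN) for early accident anticipation from dashcam videos; through multi-scale feature aggregation, temporal feature processing, and multi-scale feature post fusion, it significantly outperforms existing methods.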

Details Motivation: Existing accident anticipation models struggle to model feature-level interactions among traffic participants (which are often occluded in dashcam views) and to capture the complex, asynchronous multi-temporal behavioral cues that precede accidents. Method: MsFIN comprises three stages: (1) multi-scale feature aggregation, where a Multi-scale Module extracts scene representations at short-, mid-, and long-term temporal scales and a Transformer facilitates feature interactions; (2) temporal feature processing under causal constraints; and (3) multi-scale feature post fusion, which produces a comprehensive risk representation. Result: Experiments on the DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness, and ablation studies confirm the contribution of each module. Conclusion: Through multi-scale feature fusion and contextual interaction modeling, MsFIN improves early accident anticipation from dashcam videos and shows strong practical potential. Abstract: With the widespread deployment of dashcams and advancements in computer vision, developing accident prediction models from the dashcam perspective has become critical for proactive safety interventions. However, two key challenges persist: modeling feature-level interactions among traffic participants (often occluded in dashcam views) and capturing complex, asynchronous multi-temporal behavioral cues preceding accidents. To deal with these two challenges, a Multi-scale Feature Interaction Network (MsFIN) is proposed for early-stage accident anticipation from dashcam videos. MsFIN has three layers for multi-scale feature aggregation, temporal feature processing and multi-scale feature post fusion, respectively. For multi-scale feature aggregation, a Multi-scale Module is designed to extract scene representations at short-term, mid-term and long-term temporal scales. Meanwhile, the Transformer architecture is leveraged to facilitate comprehensive feature interactions. Temporal feature processing captures the sequential evolution of scene and object features under causal constraints. In the multi-scale feature post fusion stage, the network fuses scene and object features across multiple temporal scales to generate a comprehensive risk representation. Experiments on DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness.
Ablation studies validate the effectiveness of each module in MsFIN, highlighting how the network achieves superior performance through multi-scale feature fusion and contextual interaction modeling.

[174] DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces

Tianshuo Zhang,Li Gao,Siran Peng,Xiangyu Zhu,Zhen Lei

Main category: cs.CV

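TL;DR: This paper proposes a continual-learning approach to face forgery detection: a developmental Mixture of Experts (MoE) model built with LoRA handles continually evolving face forgeries while preventing catastrophic forgetting.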

Details Motivation: Face forgery techniques evolve quickly, so existing detectors struggle to keep up with new forgery types, and with limited data and compute it is hard to keep learning new knowledge while retaining the old. Method: A Developmental Mixture of Experts (MoE) architecture uses LoRA modules as experts: a Real-LoRA learns features of genuine faces, multiple Fake-LoRAs each capture incremental information from a different forgery type, and an orthogonal loss that integrates orthogonal gradients keeps new learning directions from interfering with established knowledge. Result: Experiments under dataset-incremental and manipulation-type-incremental protocols show that the method adapts to new forgery types while avoiding forgetting of previously learned forgery patterns, significantly outperforming existing methods. Conclusion: Treating face forgery detection as a continual learning problem allows fast adaptation to new forgeries while preserving old knowledge, offering a workable answer to continually evolving forgery techniques. Abstract: The rise of realistic digital face generation and manipulation poses significant social risks. The primary challenge lies in the rapid and diverse evolution of generation techniques, which often outstrip the detection capabilities of existing models. To defend against the ever-evolving new types of forgery, we need to enable our model to quickly adapt to new domains with limited computation and data while avoiding forgetting previously learned forgery types. In this work, we posit that genuine facial samples are abundant and relatively stable in acquisition methods, while forged faces continuously evolve with the iteration of manipulation techniques. Given the practical infeasibility of exhaustively collecting all forgery variants, we frame face forgery detection as a continual learning problem and allow the model to develop as new forgery types emerge. Specifically, we employ a Developmental Mixture of Experts (MoE) architecture that uses LoRA models as its individual experts. These experts are organized into two groups: a Real-LoRA to learn and refine knowledge of real faces, and multiple Fake-LoRAs to capture incremental information from different forgery types. To prevent catastrophic forgetting, we ensure that the learning direction of Fake-LoRAs is orthogonal to the established subspace. Moreover, we integrate orthogonal gradients into the orthogonal loss of Fake-LoRAs, preventing gradient interference throughout the training process of each task. Experimental results under both the dataset-incremental and manipulation-type-incremental protocols demonstrate the effectiveness of our method.
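
The abstract states that the learning direction of each new Fake-LoRA is kept orthogonal to the established subspace, without giving the exact loss. One common way to express such a constraint is to penalize the squared Frobenius norm of the cross-product between an old basis and a new one. The sketch below illustrates that generic penalty only; the function names and the specific form are assumptions, and the paper's orthogonal loss (which also integrates orthogonal gradients) may differ.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def orthogonality_penalty(old_basis, new_basis):
    """Squared Frobenius norm of old_basis^T @ new_basis.

    Each basis is a (dim x rank) matrix stored as a list of rows.
    The value is zero exactly when every new direction is orthogonal
    to every established direction.
    """
    cross = matmul(transpose(old_basis), new_basis)
    return sum(v * v for row in cross for v in row)
```

Adding a term like this to the task loss would push a new adapter's update directions away from the subspace already used by earlier adapters.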

[175] Lavida-O: Elastic Masked Diffusion Models for Unified Multimodal Understanding and Generation

Shufan Li,Jiuxiang Gu,Kangning Liu,Zhe Lin,Zijun Wei,Aditya Grover,Jason Kuen

Main category: cs.CV

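TL;DR: Lavida-O is a unified multimodal masked diffusion model that supports image understanding and generation, offers new capabilities such as object grounding, image editing, and high-resolution synthesis, and improves generation quality through planning and self-reflection.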

Details Motivation: Existing models support only simple image-level understanding tasks and low-resolution generation, and lack versatility and efficiency. Method: New techniques including an Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling enable unified multimodal understanding and generation. Result: The model reaches state-of-the-art results on benchmarks including RefCOCO, GenEval, and ImgEdit, outperforming Qwen2.5-VL and FluxKontext-dev with faster inference. Conclusion: Lavida-O is the first unified multimodal masked diffusion model to use its own understanding capabilities to improve generation results, combining high performance with high efficiency. Abstract: We propose Lavida-O, a unified multi-modal Masked Diffusion Model (MDM) capable of image understanding and generation tasks. Unlike existing multimodal diffusion language models such as MMaDa and Muddit, which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O exhibits many new capabilities such as object grounding, image-editing, and high-resolution (1024px) image synthesis. It is also the first unified MDM that uses its understanding capabilities to improve image generation and editing results through planning and iterative self-reflection. To allow effective and efficient training and sampling, Lavida-O introduces many novel techniques such as Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks such as RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference.

[176] ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

Benedetta Liberatori,Alessandro Conti,Lorenzo Vaquero,Yiming Wang,Elisa Ricci,Paolo Rota

Main category: cs.CV

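TL;DR: This paper introduces the task of Concept-based Video Similarity estimation (ConViS) and the accompanying benchmark ConViS-Bench, which quantify the similarity between videos along semantic concept dimensions and support fine-grained video comparison closer to human reasoning.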

Details Motivation: Existing video similarity models mostly rely on global similarity scores, which fail to capture how humans judge video similarity from multiple semantic angles, and they lack fine-grained modeling of different concept dimensions. Method: The ConViS task scores video pairs for fine-grained similarity over a predefined set of semantic concepts; the ConViS-Bench benchmark contains annotated video pairs from multiple domains with concept-level similarity scores and textual descriptions, and several state-of-the-art models are evaluated on it. Result: Existing models perform very differently across concepts, indicating that some concepts are more challenging; ConViS-Bench effectively measures how well models agree with human judgments. Conclusion: ConViS and ConViS-Bench offer a new direction and an important resource for language-driven video understanding, advancing fine-grained video similarity research that better matches human cognition. Abstract: What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains. Each pair comes with concept-level similarity scores and textual descriptions of both differences and similarities. Additionally, we benchmark several state-of-the-art models on ConViS, providing insights into their alignment with human judgments. Our results reveal significant performance differences on ConViS, indicating that some concepts present greater challenges for estimating video similarity. We believe that ConViS-Bench will serve as a valuable resource for advancing research in language-driven video understanding.

[177] Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps

Gabriel Maldonado,Narges Rashvand,Armin Danesh Pazho,Ghazal Alinezhad Noghre,Vinit Katariya,Hamed Tabkhi

Main category: cs.CV

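TL;DR: Proposes an adversarially refined VQ-GAN framework that compresses spatio-temporal heatmaps through dense motion tokenization while preserving fine human-motion detail; it clearly outperforms the baseline in SSIM and temporal stability and shows that 2D and 3D motion representations need different token vocabulary sizes.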

Details Motivation: Human motion is high-dimensional and redundant; compressing it efficiently while retaining fine-grained motion information is a key challenge for continuous motion understanding. Method: An adversarially refined VQ-GAN framework combines dense motion tokenization with adversarial training to improve heatmap reconstruction quality and reduce motion smearing and temporal misalignment. Result: On the CMU Panoptic dataset, the method improves SSIM by 9.31% over the dVAE baseline and reduces temporal instability by 37.1%; 2D motion can be represented efficiently with 128 tokens, whereas 3D motion requires 1024. Conclusion: The method performs strongly in motion compression and reconstruction, shows potential for practical deployment, and offers a new angle on analyzing motion complexity. Abstract: Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method's superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion's complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at https://github.com/TeCSAR-UNCC/Pose-Quantization.
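
The 128-token and 1024-token vocabularies mentioned above refer to the size of a vector-quantization codebook. As a minimal sketch of the generic quantization step in VQ-style models (not the authors' implementation), each feature vector is replaced by the index of its nearest codebook entry. The toy codebook and feature values below are invented for illustration.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

def tokenize(features, codebook):
    """Map a sequence of feature vectors to discrete token indices."""
    return [quantize(f, codebook)[0] for f in features]
```

A larger codebook lets the tokens represent finer distinctions at the cost of a bigger vocabulary, which is the trade-off behind the 2D versus 3D finding.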

[178] Graph-Radiomic Learning (GrRAiL) Descriptor to Characterize Imaging Heterogeneity in Confounding Tumor Pathologies

Dheerendranath Battalapalli,Apoorva Safai,Maria Jaramillo,Hyemin Um,Gustavo Adalfo Pineda Ortiz,Ulas Bagci,Manmeet Singh Ahluwalia,Marwa Ismail,Pallavi Tiwari

Main category: cs.CV

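TL;DR: Proposes a new Graph-Radiomic Learning (GrRAiL) method for characterizing intralesional heterogeneity on clinical MRI; it captures complex spatial relationships through clustering and graph-theoretic metrics and significantly outperforms existing methods in several tumor applications.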

Details Motivation: Conventional radiomics often overlooks complex spatial heterogeneity within lesions and cannot reliably distinguish malignant tumors from confounding pathologies. Method: GrRAiL first clusters lesion sub-regions using per-voxel radiomic features, then builds weighted graphs and computes graph-theoretic metrics to quantify spatial associations among clusters, thereby describing intralesional heterogeneity. Result: Validated on 947 patients across three scenarios (glioblastoma, brain metastasis, and pancreatic IPMN), GrRAiL's cross-validation and test accuracies are significantly better than those of existing methods (gains of more than 10% to 13%). Conclusion: GrRAiL effectively captures higher-order spatial heterogeneity within lesions, is clinically feasible, and can improve the accuracy of tumor diagnosis and differential diagnosis. Abstract: A significant challenge in solid tumors is reliably distinguishing confounding pathologies from malignant neoplasms on routine imaging. While radiomics methods seek surrogate markers of lesion heterogeneity on CT/MRI, many aggregate features across the region of interest (ROI) and miss complex spatial relationships among varying intensity compositions. We present a new Graph-Radiomic Learning (GrRAiL) descriptor for characterizing intralesional heterogeneity (ILH) on clinical MRI scans. GrRAiL (1) identifies clusters of sub-regions using per-voxel radiomic measurements, then (2) computes graph-theoretic metrics to quantify spatial associations among clusters. The resulting weighted graphs encode higher-order spatial relationships within the ROI, aiming to reliably capture ILH and disambiguate confounding pathologies from malignancy. To assess efficacy and clinical feasibility, GrRAiL was evaluated in n=947 subjects spanning three use cases: differentiating tumor recurrence from radiation effects in glioblastoma (GBM; n=106) and brain metastasis (n=233), and stratifying pancreatic intraductal papillary mucinous neoplasms (IPMNs) into no+low vs high risk (n=608). In a multi-institutional setting, GrRAiL consistently outperformed state-of-the-art baselines - Graph Neural Networks (GNNs), textural radiomics, and intensity-graph analysis. In GBM, cross-validation (CV) and test accuracies for recurrence vs pseudo-progression were 89% and 78% with >10% test-accuracy gains over comparators. In brain metastasis, CV and test accuracies for recurrence vs radiation necrosis were 84% and 74% (>13% improvement).
For IPMN risk stratification, CV and test accuracies were 84% and 75%, showing >10% improvement.

[179] Moving by Looking: Towards Vision-Driven Avatar Motion Generation

Markos Diomataris,Berat Mert Albaba,Giorgio Becherini,Partha Ghosh,Omid Taheri,Michael J. Black

Main category: cs.CV

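TL;DR: This paper presents CLOPS, the first human avatar model that perceives its environment and navigates using only egocentric vision; by decoupling the learning of low-level motion skills from high-level vision-to-motion control, it generates human-like motion.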

Details Motivation: Existing methods neglect the interdependence between human perception and motion, and large-scale datasets that include scene context are lacking; generating more natural, human-like avatar behavior therefore calls for human-like perception, especially egocentric vision. Method: Training has two stages: first a motion prior model is trained on a large motion capture dataset; then a policy network is trained with Q-learning to map egocentric visual input to high-level control commands for the motion prior. Result: Experiments show that avatars relying only on egocentric vision exhibit human-like motion characteristics, such as walking so as to avoid obstacles in their field of view. Conclusion: Equipping avatars with human-like sensors, egocentric vision in particular, helps generate more realistic and natural human behavior and is a promising direction for future humanoid avatars. Abstract: The way we perceive the world fundamentally shapes how we move, whether it is how we navigate in a room or how we interact with other humans. Current human motion generation methods neglect this interdependency and use task-specific "perception" that differs radically from that of humans. We argue that the generation of human-like avatar behavior requires human-like perception. Consequently, in this work we present CLOPS, the first human avatar that solely uses egocentric vision to perceive its surroundings and navigate. Using vision as the primary driver of motion, however, gives rise to a significant challenge for training avatars: existing datasets have either isolated human motion, without the context of a scene, or lack scale. We overcome this challenge by decoupling the learning of low-level motion skills from learning of high-level control that maps visual input to motion. First, we train a motion prior model on a large motion capture dataset. Then, a policy is trained using Q-learning to map egocentric visual inputs to high-level control commands for the motion prior. Our experiments empirically demonstrate that egocentric vision can give rise to human-like motion characteristics in our avatars. For example, the avatars walk such that they avoid obstacles present in their visual field. These findings suggest that equipping avatars with human-like sensors, particularly egocentric vision, holds promise for training avatars that behave like humans.
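
The second training stage relies on Q-learning. The paper trains a policy over visual inputs, presumably with a neural network; the sketch below shows only the underlying tabular Q-learning update for intuition. The state names, action names, learning rate, and discount factor are all invented for the example.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q_table[next_state].values())
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]
```

In this toy framing, a state would summarize what the avatar sees and an action would be a high-level command handed to the motion prior.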

[180] OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Bingnan Li,Chen-Yu Wang,Haiyang Xu,Xiang Zhang,Ethan Armand,Divyansh Srivastava,Xiaojun Shan,Zeyuan Chen,Jianwen Xie,Zhuowen Tu

Main category: cs.CV

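TL;DR: This paper introduces the OverLayScore metric and the OverLayBench benchmark for assessing the complexity of overlapping bounding boxes in layout-to-image generation, along with CreatiLayout-AM, a model that improves generation in complex overlap scenarios.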

Details Motivation: Existing layout-to-image generation methods perform poorly when bounding boxes overlap heavily, and current benchmarks favor simple cases, so complex overlaps are not evaluated effectively. Method: OverLayScore quantifies overlap complexity, OverLayBench serves as a new benchmark, and CreatiLayout-AM is fine-tuned on an amodal mask dataset. Result: The analysis shows that existing methods degrade at high OverLayScore values, the new benchmark covers more complex overlap scenarios, and CreatiLayout-AM improves results under complex overlaps. Conclusion: OverLayScore and OverLayBench provide effective tools and a foundation for evaluating and improving the robustness of layout-to-image models in realistic, complex overlap scenarios. Abstract: Despite steady progress in layout-to-image generation, current methods still struggle with layouts containing significant overlap between bounding boxes. We identify two primary challenges: (1) large overlapping regions and (2) overlapping instances with minimal semantic distinction. Through both qualitative examples and quantitative analysis, we demonstrate how these factors degrade generation quality. To systematically assess this issue, we introduce OverLayScore, a novel metric that quantifies the complexity of overlapping bounding boxes. Our analysis reveals that existing benchmarks are biased toward simpler cases with low OverLayScore values, limiting their effectiveness in evaluating model performance under more challenging conditions. To bridge this gap, we present OverLayBench, a new benchmark featuring high-quality annotations and a balanced distribution across different levels of OverLayScore. As an initial step toward improving performance on complex overlaps, we also propose CreatiLayout-AM, a model fine-tuned on a curated amodal mask dataset. Together, our contributions lay the groundwork for more robust layout-to-image generation under realistic and challenging scenarios. Project link: https://mlpc-ucsd.github.io/OverLayBench.
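
The abstract does not define OverLayScore. A natural building block for any measure of bounding-box overlap is pairwise intersection-over-union (IoU), sketched below. This is not the OverLayScore formula; it only illustrates how the size of overlapping regions in a layout can be quantified. The box format `(x_min, y_min, x_max, y_max)` is an assumption.

```python
from itertools import combinations

def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pairwise_overlaps(boxes):
    """IoU for every unordered pair of boxes in a layout."""
    return {(i, j): iou(boxes[i], boxes[j])
            for i, j in combinations(range(len(boxes)), 2)}
```

A layout-level score could then aggregate these pairwise values, possibly weighted by how semantically similar the overlapping instances are, since the abstract names both factors as challenges.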

[181] Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

Sherwin Bahmani,Tianchang Shen,Jiawei Ren,Jiahui Huang,Yifeng Jiang,Haithem Turki,Andrea Tagliasacchi,David B. Lindell,Zan Gojcic,Sanja Fidler,Huan Ling,Jun Gao,Xuanchi Ren

Main category: cs.CV

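TL;DR: This paper proposes a self-distillation framework that distills the implicit 3D knowledge of video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation without multi-view training data; it generates 3D scenes from text or a single image, supports real-time rendering, and extends to dynamic 3D scene generation from monocular video.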

Details Motivation: Existing learning-based 3D reconstruction methods depend on real-world multi-view data, which is hard to obtain, while video diffusion models, despite their strong learned priors, are confined to 2D and cannot serve applications such as robotics that must interact with 3D environments. Method: The RGB decoder of a video diffusion model is augmented with a 3DGS decoder that is supervised by the RGB decoder's output; the 3DGS decoder is trained on synthetic data generated by the diffusion model, so 3D scenes can be generated without real multi-view data. Result: The method reaches state-of-the-art performance in both static and dynamic 3D scene generation, produces high-quality 3D scenes from text or a single image, and generates dynamic 3D content from monocular video. Conclusion: The self-distillation framework extends the imagination of 2D video diffusion models to 3D, generating real-time renderable 3D scenes without real multi-view data and offering a new approach to building virtual environments. Abstract: The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits the applications to simulation where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. In this approach, the 3DGS decoder can be purely trained with synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.

[182] VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

Weijie Wang,Yeqing Chen,Zeyu Zhang,Hengyu Liu,Haoxiao Wang,Zhiyuan Feng,Wenkang Qin,Zheng Zhu,Donny Y. Chen,Bohan Zhuang

Main category: cs.CV

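TL;DR: This paper proposes VolSplat, a new feed-forward 3D Gaussian Splatting paradigm based on voxel-aligned Gaussians; it overcomes the limitations of pixel-aligned methods in multi-view consistency, biased density distributions, and dependence on input views, yields more robust and geometrically consistent novel view synthesis, and achieves state-of-the-art results on several benchmarks.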

Details Motivation: Pixel-aligned 3D Gaussian Splatting methods depend heavily on the number of input views, produce view-biased density distributions, and suffer alignment errors under occlusion or low texture, which limits reconstruction quality and multi-view consistency; a more robust paradigm that does not rely on 2D feature matching is needed. Method: VolSplat predicts a 3D voxel grid and generates Gaussians directly from it instead of using 2D pixel alignment; the 3D voxel structure models cross-view consistency, and Gaussian density is controlled adaptively according to scene complexity. Result: On standard datasets such as RealEstate10K and ScanNet, the method achieves state-of-the-art rendering quality and geometric consistency, producing more plausible and view-consistent Gaussian reconstructions with greater robustness and scalability. Conclusion: By moving from pixel-aligned to voxel-aligned Gaussian prediction, VolSplat addresses key shortcomings of existing methods and provides a more reliable, scalable framework for feed-forward 3D reconstruction, advancing the use of Gaussian Splatting in multi-view reconstruction. Abstract: Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the number of input views, leads to view-biased density distributions, and introduces alignment errors, particularly when source views contain occlusions or low texture. To address these challenges, we introduce VolSplat, a new multi-view feed-forward paradigm that replaces pixel alignment with voxel-aligned Gaussians. By directly predicting Gaussians from a predicted 3D voxel grid, it overcomes pixel alignment's reliance on error-prone 2D feature matching, ensuring robust multi-view consistency. Furthermore, it enables adaptive control over Gaussian density based on 3D scene complexity, yielding more faithful Gaussian point clouds, improved geometric consistency, and enhanced novel-view rendering quality. Experiments on widely used benchmarks including RealEstate10K and ScanNet demonstrate that VolSplat achieves state-of-the-art performance while producing more plausible and view-consistent Gaussian reconstructions. In addition to superior results, our approach establishes a more scalable framework for feed-forward 3D reconstruction with denser and more robust representations, paving the way for further research in wider communities.
The video results, code and trained models are available on our project page: https://lhmd.top/volsplat.
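
The abstract does not say how the 3D voxel grid is built. As a generic illustration of what voxel alignment involves, the sketch below buckets 3D points into voxels by integer index, so that points from different views that land in the same region of space share one cell. The function name, the point list, and the voxel size are assumptions for the example, not VolSplat's pipeline.

```python
import math
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group 3D points by the integer index of the voxel that contains them."""
    grid = defaultdict(list)
    for p in points:
        # Floor division by the voxel size gives the voxel index per axis.
        key = tuple(math.floor(c / voxel_size) for c in p)
        grid[key].append(p)
    return dict(grid)
```

Predicting one set of Gaussian parameters per occupied voxel, rather than one per pixel, would make the number of Gaussians depend on scene content instead of on how many input views there are.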

[183] CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching

Chen Chen,Pengsheng Guo,Liangchen Song,Jiasen Lu,Rui Qian,Xinze Wang,Tsu-Jui Fu,Wei Liu,Yinfei Yang,Alex Schwing

Main category: cs.CV

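TL;DR: Proposes CAR-Flow, a condition-aware reparameterization for conditional generative modeling with diffusion and flow models; by relocating the source and target distributions it shortens the probability path the model must learn, speeding up training and improving performance.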

Details Motivation: Existing diffusion and flow models must learn both mass transport and condition injection in conditional generation, which places heavy demands on the model; to lighten the load, condition information is folded into the source or target distribution. Method: Condition-Aware Reparameterization for Flow Matching (CAR-Flow) applies a lightweight, learned shift to the source distribution, the target distribution, or both, so that the model learns a shorter probability path; it is compatible with existing flow-model architectures and integrates seamlessly. Result: On low-dimensional synthetic data, the effects of CAR-Flow on the probability path are visualized and quantified; on ImageNet-256, equipping SiT-XL/2 with CAR-Flow lowers FID from 2.07 to 1.68 with less than 0.6% additional parameters. Conclusion: By relocating distributions in a condition-aware way, CAR-Flow eases learning in flow matching and improves generation quality and training efficiency while keeping the model lightweight. Abstract: Conditional generative modeling aims to learn a conditional data distribution from samples containing data-condition pairs. For this, diffusion and flow-based methods have attained compelling results. These methods use a learned (flow) model to transport an initial standard Gaussian noise that ignores the condition to the conditional data distribution. The model is hence required to learn both mass transport and conditional injection. To ease the demand on the model, we propose Condition-Aware Reparameterization for Flow Matching (CAR-Flow) -- a lightweight, learned shift that conditions the source, the target, or both distributions. By relocating these distributions, CAR-Flow shortens the probability path the model must learn, leading to faster training in practice. On low-dimensional synthetic data, we visualize and quantify the effects of CAR. On higher-dimensional natural image data (ImageNet-256), equipping SiT-XL/2 with CAR-Flow reduces FID from 2.07 to 1.68, while introducing less than 0.6% additional parameters.
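
To see why relocating the source can shorten the path, consider a toy flow-matching setup with a straight-line path x_t = (1 - t) x_0 + t x_1, whose target velocity is x_1 - x_0. The sketch below assumes an additive, condition-dependent offset on the source sample; the additive form, the linear path, and the numeric values are assumptions for illustration, and in CAR-Flow the shift is learned rather than fixed.

```python
def shift(x, offset):
    """Add a condition-dependent offset to a sample (hypothetical form)."""
    return [a + o for a, o in zip(x, offset)]

def interpolate(x0, x1, t):
    """Point on the straight-line path between x0 and x1 at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity(x0, x1):
    """Target velocity of the straight-line path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def path_length(x0, x1):
    """Euclidean length of the straight path from x0 to x1."""
    return sum(v * v for v in velocity(x0, x1)) ** 0.5
```

For example, with a noise sample at `[0, 0]`, a data sample at `[4, 3]`, and a class offset of `[3, 2]`, the unshifted path has length 5, while the shifted source `[3, 2]` leaves only the residual `[1, 1]` for the model to transport.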