Table of Contents
cs.CL
[1] Enhancing Safety of Large Language Models via Embedding Space Separation
Xu Zhao,Xiting Wang,Weiran Shen
Main category: cs.CL
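TL;DR: This paper proposes Embedding Space Separation (ES2), a representation-level fine-tuning method that improves LLM safety by enlarging the distance between harmful and safe queries in the embedding space, with a KL-divergence regularizer to preserve general capabilities.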
Details
Motivation: Recent work has found that the latent representations of harmful and safe queries in LLMs are linearly separable, a property that has been exploited to construct attacks; motivated by this, the paper aims to strengthen safety at the representation level. Method: The proposed Embedding Space Separation (ES2) method explicitly enlarges the distance between harmful and safe representations in the embedding space, and adds a KL-divergence regularization term to the loss that constrains the fine-tuned model's logits on harmless inputs to align with those of the original model. Result: Experiments on several open-source LLMs and standard safety benchmarks show that the method substantially improves safety while keeping general capabilities comparable to the baseline models. Conclusion: ES2 is an effective and practical representation-level safety method that balances safety with capability retention. Abstract: Large language models (LLMs) have achieved impressive capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent work has revealed that the latent representations (embeddings) of harmful and safe queries in LLMs typically exhibit linear separability, a property that has been exploited to construct attacks by perturbing the embeddings of harmful queries towards the safe subspace. Motivated by this observation, we propose a representation-level fine-tuning approach, named Embedding Space Separation (ES2), which improves LLM safety by explicitly enlarging the distance between harmful and safe representations in the embedding space. To prevent degradation of the model's general capabilities, we introduce a Kullback-Leibler (KL) divergence regularization term into the loss function, which constrains the logits of the fine-tuned model to align with those of the original base model on harmless inputs. We evaluate our method on several open-source LLMs using standard safety benchmarks. Extensive experimental results demonstrate that our approach substantially improves model safety while maintaining comparable general capabilities.
[2] RedacBench: Can AI Erase Your Secrets?
Hyunjun Jeon,Kyuyoung Kim,Jinwoo Shin
Main category: cs.CL
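TL;DR: This paper introduces RedacBench, a comprehensive benchmark for policy-driven text redaction that spans multiple text sources and security policies and uses proposition-level annotations to evaluate the trade-off between security (removing sensitive information) and utility (preserving non-sensitive information).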
Details
Motivation: Existing redaction benchmarks are limited to predefined sensitive categories (such as PII) or specific techniques (such as masking), and lack systematic evaluation of diverse security policies and cross-domain redaction. Method: The authors build RedacBench from 514 human-authored texts and 187 security policies, quantify security and utility using 8,053 fine-grained, human-annotated propositions, and evaluate multiple redaction strategies and mainstream LLMs. Result: More advanced language models improve security but still struggle to preserve utility; RedacBench exposes a key bottleneck between policy understanding and semantic preservation. Conclusion: RedacBench provides the first large-scale, fine-grained, multi-dimensional evaluation framework for policy-conditioned redaction, supporting redaction research that weighs security, controllability, and utility together, and the dataset and an interactive evaluation platform are released. Abstract: Modern language models can readily extract sensitive information from unstructured text, making redaction -- the selective removal of such information -- critical for data security. However, existing benchmarks for redaction typically focus on predefined categories of data such as personally identifiable information (PII) or evaluate specific techniques like masking. To address this limitation, we introduce RedacBench, a comprehensive benchmark for evaluating policy-conditioned redaction across domains and strategies. Constructed from 514 human-authored texts spanning individual, corporate, and government sources, paired with 187 security policies, RedacBench measures a model's ability to selectively remove policy-violating information while preserving the original semantics. We quantify performance using 8,053 annotated propositions that capture all inferable information in each text. This enables assessment of both security -- the removal of sensitive propositions -- and utility -- the preservation of non-sensitive propositions. Experiments across multiple redaction strategies and state-of-the-art language models show that while more advanced models can improve security, preserving utility remains a challenge. To facilitate future research, we release RedacBench along with a web-based playground for dataset customization and evaluation. Available at https://hyunjunian.github.io/redaction-playground/.
[3] Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
Hengwei Ye,Yuanting Guan,Yuxuan Ge,Tianying Zhu,Zhenhan Guan,Yijia Zhong,Yijing Zhang,Han Zhang,Yingna Wu,Zheng Tian
Main category: cs.CL
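TL;DR: This paper proposes KidGym, a 2D grid-based evaluation benchmark for multimodal large language models (MLLMs) inspired by the Wechsler Intelligence Scale for Children; it covers five core capabilities (Execution, Perception Reasoning, Learning, Memory, and Planning) across 12 tasks and emphasizes customizability, extensibility, and robust evaluation oriented toward cognitive development.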
Details
Motivation: Existing MLLM evaluations lack a fine-grained, interpretable decomposition of human-like general intelligence; drawing on theories of children's cognitive development, the paper builds an evaluation suite that more closely follows the growth path of human intelligence. Method: Based on the Wechsler intelligence test framework, the authors design 12 2D grid tasks covering five core capabilities, with randomized layouts, diverse scenarios and objects, support for user customization and difficulty adjustment, and an empirical evaluation on several state-of-the-art MLLMs. Result: The KidGym evaluation reveals key limitations of current mainstream MLLMs in planning, long-term memory, and cross-task transfer, and confirms the benchmark's validity and discriminative power. Conclusion: KidGym offers the first comprehensive, extensible, and customizable evaluation platform for MLLMs focused on cognitive development, pushing toward more general, interpretable, human-like intelligence. Abstract: Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://kidgym.github.io/KidGym-Website/.
[4] CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
Roy Uziel,Omer Belhasin,Itay Levi,Akhiad Bercovich,Ran El-Yaniv,Ran Zilberstein,Michael Elad
Main category: cs.CL
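TL;DR: This paper proposes CRoCoDiL, a framework that moves masked diffusion models (MDMs) into a continuous sentence-level semantic space; by jointly training an encoder-demasker architecture it builds a new autoencoder with MDM-based decoding, and it introduces two unconditional text generation algorithms (ConThenDisc and ConWithinDisc) that markedly improve generation quality and sampling speed.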
Details
Motivation: Masked diffusion models (MDMs) offer an efficient non-causal alternative to autoregressive generation, but because they rely on discrete marginal distributions they often model token dependencies poorly and produce semantically incoherent text. Method: CRoCoDiL performs diffusion in a continuous sentence-level semantic space; it jointly trains an encoder-demasker architecture so that MDM demasking is grounded in continuous latent representations, which yields a new autoencoder (an encoder plus an MDM decoder); on the same framework it defines two unconditional generation algorithms, ConThenDisc (generate in continuous space first, then decode to discrete tokens) and ConWithinDisc (keep refining the continuous latent representation during discrete sampling). Result: Experiments on LLaDA show better generation quality on unconditional text generation and more than 10x faster sampling. Conclusion: Moving the diffusion process into a continuous semantic space and jointly training the encoder with the diffusion decoder mitigates the semantic incoherence and dependency-modeling weaknesses of MDMs and improves both generation efficiency and quality. Abstract: Masked Diffusion Models (MDMs) provide an efficient non-causal alternative to autoregressive generation but often struggle with token dependencies and semantic incoherence due to their reliance on discrete marginal distributions. We address these limitations by shifting the diffusion process into a continuous sentence-level semantic space. We propose CRoCoDiL (Continuous and Robust Conditioned Diffusion for Language), a unified fine-tuning approach that jointly trains an encoder-demasker architecture, grounding the MDM demasking in continuous latent representations. This leads to the formation of a novel autoencoder in which decoding is obtained by an MDM algorithm. Relying on the same framework, we introduce two unconditional text synthesis algorithms: Continuous-Then-Discrete (ConThenDisc), a hybrid-diffusion approach that first generates latent representations in continuous space and then decodes these to tokens via an MDM, and Continuous-Within-Discrete (ConWithinDisc), a multi-diffusion strategy that refines latent representations throughout the discrete sampling process. Experiments using LLaDA show that our methods achieve superior generation quality and more than 10x faster sampling speeds in an unconditional setting.
[5] Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models
Jiayun Wu,Peixu Hou,Shan Qu,Peng Zhang,Ning Gu,Tun Lu
Main category: cs.CL
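TL;DR: This paper proposes Fast-Slow Thinking Reward Models (F/S-RM), a hybrid reward model architecture that combines a fast mode (first-token prediction) with a slow mode (chain-of-thought reasoning) and uses a dual-confidence mechanism to activate slow thinking dynamically, improving performance while substantially reducing computational cost.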
Details
Motivation: Among existing reward models, generative reward models (GRMs) are accurate but computationally expensive, while scalar reward models (SRMs) are efficient but limited in performance and adaptability, so an architecture that balances efficiency and performance is needed. Method: Drawing on Dual Process Theory, the authors design a single model with two reward paths: first-token prediction outputs a scalar score (fast thinking), CoT reasoning outputs a judgment (slow thinking), and a dual-confidence activation mechanism decides dynamically whether to invoke slow thinking. Result: F/S-RM improves performance by 1.2% relative to state-of-the-art models while reducing token consumption by 20.8%. Conclusion: F/S-RM integrates the fast and slow thinking paradigms, maintaining high accuracy while improving inference efficiency, and offers a more practical approach to reward modeling in RLHF. Abstract: Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
[6] Multi-Agent Debate with Memory Masking
Hongduan Tian,Xiao Feng,Ziyuan Zhao,Xiangyu Zhu,Rolan Yan,Bo Han
Main category: cs.CL
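TL;DR: This paper proposes MAD-M$^2$, an improved multi-agent debate framework that masks erroneous memories at the start of each debate round to make LLM reasoning more robust, and validates it on mathematical and logical reasoning benchmarks.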
Details
Motivation: Existing reasoning frameworks based on multi-agent debate (MAD) are vulnerable to erroneous memories, and their performance depends heavily on the quality of memories from earlier debate rounds, so robustness to erroneous memories needs to improve. Method: The MAD-M$^2$ framework masks erroneous memories from the previous round at the beginning of each debate round, keeping useful information and discarding erroneous content to refine the context. Result: On mainstream mathematical and logical reasoning benchmarks, MAD-M$^2$ identifies and masks erroneous memories and outperforms the original MAD framework. Conclusion: Memory quality is a key determinant of multi-agent debate performance, and a memory-masking mechanism can improve the robustness and accuracy of LLM reasoning. Abstract: Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. In particular, among all LLM reasoning frameworks, *multi-agent debate* (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, *multi-agent debate with memory masking* (MAD-M$^2$), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M$^2$ can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories.
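The masking step described for MAD-M$^2$ can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `mask_memories`, `run_debate`, and the `is_erroneous` judge are stand-ins, and the abstract does not say how agents decide which memories are erroneous, so a single shared predicate is assumed purely for illustration.

```python
from typing import Callable, List

# Hypothetical type: an agent maps (question, visible memories) to a response.
Agent = Callable[[str, List[str]], str]


def mask_memories(memories: List[str], is_erroneous: Callable[[str], bool]) -> List[str]:
    """Keep only the memories that the (assumed) judge does not flag as erroneous."""
    return [m for m in memories if not is_erroneous(m)]


def run_debate(
    agents: List[Agent],
    question: str,
    rounds: int,
    is_erroneous: Callable[[str], bool],
) -> List[str]:
    """Run a multi-round debate, masking the previous round's memories first."""
    memories: List[str] = []
    for _ in range(rounds):
        # Masking happens at the beginning of each round, before agents respond.
        context = mask_memories(memories, is_erroneous)
        memories = [agent(question, context) for agent in agents]
    return memories
```

With a toy agent that simply copies the last visible memory, masking keeps a wrong first-round answer from propagating into the second round.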
Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ can identify the erroneous memories and achieve better performance in reasoning than MAD.
[7] Locally Coherent Parallel Decoding in Diffusion Language Models
Michael Hersche,Nicolas Menet,Ronan Tanios,Abbas Rahimi
Main category: cs.CL
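TL;DR: This paper proposes CoDiLA, which uses a small auxiliary autoregressive model to capture local dependencies on the diffusion latents, addressing the syntactic inconsistencies that arise when discrete diffusion language models generate tokens in parallel; it improves coherence and accuracy in code generation while retaining sub-linear latency and bidirectional modeling.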
Details
Motivation: When predicting multiple tokens in parallel, standard discrete diffusion language models sample each token independently from conditional marginal distributions and cannot capture joint dependencies among concurrently generated tokens, which causes syntactic inconsistencies and broken multi-token structures and particularly hurts code generation. Method: CoDiLA (Coherent Diffusion with Local Autoregression) keeps the diffusion framework but delegates decoding of local token blocks to a lightweight auxiliary autoregressive model (e.g., 0.6B parameters) operating on the diffusion latents, combining parallelism across blocks with autoregression within each block. Result: On code generation benchmarks, CoDiLA effectively eliminates coherence artifacts and establishes a new Pareto frontier between accuracy and speed, with only a very small auxiliary AR model. Conclusion: CoDiLA reconciles the parallel efficiency of diffusion models with the local syntactic consistency of autoregressive models, offering a better generation paradigm for structure-sensitive tasks such as code generation. Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel block generation while ensuring sequential validity within each block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.
[8] Expected Reward Prediction, with Applications to Model Routing
Kenan Hasanaliyev,Silas Alberti,Jenny Hamer,Dheeraj Rajagopal,Kevin Robinson,Jasper Snoek,Victor Veitch,Alexander Nicholas D'Amour
Main category: cs.CL
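TL;DR: This paper proposes expected reward prediction (ERP), which uses a response-level reward model to predict the expected reward an LLM would earn on a given prompt, and applies it to inference-time model routing to maximize reward while controlling computational cost. Experiments show that ERP outperforms baselines based on per-category average performance, explains the success of more complex routing protocols, and extends easily to new models.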
Details
Motivation: Reward models are typically used to rank multiple responses from a single model; this paper asks whether they can be lifted to assess how well different models suit the same prompt, enabling more efficient model selection and routing. Method: Using a response-level reward model, the authors estimate each model's expected reward on a given prompt under repeated sampling and build a simple, extensible inference-time routing protocol (ERP) that sends each prompt to the model with the highest expected reward. Result: On the open-perfectblend dataset, ERP significantly outperforms a baseline that picks the model with the best average performance for each prompt category; it explains why more complex routing methods work; and a new model can be integrated using only its responses. Conclusion: Response-level reward models carry enough information to accurately predict the expected performance of a model-prompt pair; the resulting lightweight ERP routing strategy combines strong performance, interpretability, and extensibility, offering a practical approach to multi-model inference. Abstract: Reward models are a standard tool to score responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of n sampled responses. In this paper, we study whether scores from response-level reward models can be lifted to score a model's suitability for a prompt, prior to seeing responses from that model. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support an application to a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of this routing procedure on the open-perfectblend dataset, using a model pool composed of Llama3.1-Instruct 8B/70B, Gemma2-IT 9B/27B, and Gemma1-IT 7B models. Our simple expected reward prediction--based routing (ERP) outperforms baselines that route prompts to models with the best average performance within each prompt's category, and explains the success of more complex routing protocols that implicitly estimate an expected reward. Our approach has the added advantage of being trivially extensible as new models are added to the pool.
[9] An experimental study of KV cache reuse strategies in chunk-level caching systems
Samuel Cestola,Tianxiang Xia,Zheng Weiyan,Zheng Pengfei,Diego Didona
Main category: cs.CL
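TL;DR: This paper analyzes the limitations of chunk-level caching (CLC) in retrieval-augmented generation, shows that existing methods have fundamental shortcomings in accuracy and applicability, and proposes a new CLC design that combines several techniques to improve accuracy.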
Details
Motivation: Existing CLC methods ignore cross-attention dependencies between chunks, which degrades output quality, and each has its own limits on accuracy or applicability. Method: A systematic experimental evaluation exposes the fundamental limitations of existing CLC methods; building on their complementarity, the authors design a new CLC scheme that combines several techniques. Result: Experiments show that the new CLC design significantly improves generation accuracy while retaining the inference speedup. Conclusion: CLC methods need to both model cross-chunk dependencies and combine techniques; the proposed design offers a better path to efficient, high-quality inference in retrieval-augmented generation. Abstract: Retrieval-augmented generation improves large language models' accuracy by adding relevant retrieved text to the prompt. Chunk level caching (CLC) accelerates inference by precomputing KV caches for these retrieved chunks and reusing them. However, these caches miss cross-attention dependencies between chunks, which can reduce output quality. Several methods try to improve CLC accuracy using different techniques. We make two main contributions. First, we show that existing CLC approaches have fundamental limitations that limit their accuracy or their applicability. We back this conclusion with an extensive CLC system experimental evaluation. Second, we observe that existing CLC techniques are complementary. We leverage this insight to propose a new CLC design that carefully combines them and achieves better accuracy.
[10] Thinking into the Future: Latent Lookahead Training for Transformers
Lorenzo Noci,Gregor Bachmann,Seyed-Mohsen Moosavi-Dezfooli,Moin Nabi
Main category: cs.CL
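TL;DR: This paper proposes latent lookahead, a training strategy that lets an autoregressive language model perform multi-step recursive prediction in latent space before committing to the next token at selected positions, improving its planning and foresight.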
Details
Motivation: Standard autoregressive language models perform a single forward pass and sample one token per step, which leaves no room for reflection or exploring multiple continuations, and their compute allocation is fixed, making it hard to spend more compute on difficult tokens. Method: At selected positions in the sequence, instead of sampling future tokens directly, the model recursively feeds its hidden states back into the context for $τ$ steps of latent lookahead, producing $τ$ latent predictions that are supervised against the next $τ$ ground-truth tokens. Result: On tasks that require foresight and planning, such as maze solving, Sudoku, and ProsQA, latent lookahead substantially outperforms autoregressive and non-autoregressive baselines. Conclusion: By adding controllable multi-step reasoning in latent space, latent lookahead strengthens language models' planning ability and compute adaptivity, offering a new way past the limitations of the standard autoregressive paradigm. Abstract: Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform; every token is formed based on a single forward-pass, potentially limiting the model's expressiveness in cases where difficult tokens require inherently more compute. Towards addressing these limitations, we introduce latent lookahead, a training strategy that enables models to "think" before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for $τ$ steps, investing more compute on predicting that token. This produces $τ$ latent predictions that are supervised against the next $τ$ ground-truth tokens, encouraging the model to "lookahead" and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.
[11] Linguistic Signatures for Enhanced Emotion Detection
Florian Lecourt,Madalina Croitoru,Konstantin Todorov
Main category: cs.CL
TL;DR: This paper investigates whether linguistic features can serve as reliable, interpretable signals for emotion recognition in text, and finds that incorporating emotion-specific linguistic signatures into RoBERTa-based models improves performance on the GoEmotions benchmark.
Details
Motivation: Little is known about the linguistic regularities characterizing how emotions are expressed across different corpora and labels, despite recent progress in emotion detection using transformer-based models. Method: The authors extract emotion-specific linguistic signatures from 13 English datasets and incorporate these high-level linguistic features into RoBERTa-based models. Result: RoBERTa-based models enriched with linguistic features achieve consistent performance gains of up to +2.4 macro F1 on the GoEmotions benchmark. Conclusion: Explicit lexical cues can complement neural representations and improve robustness in predicting emotion categories. Abstract: Emotion detection is a central problem in NLP, with recent progress driven by transformer-based models trained on established datasets. However, little is known about the linguistic regularities that characterize how emotions are expressed across different corpora and labels. This study examines whether linguistic features can serve as reliable interpretable signals for emotion recognition in text. We extract emotion-specific linguistic signatures from 13 English datasets and evaluate how incorporating these features into transformer models impacts performance. Our RoBERTa-based models enriched with high level linguistic features achieve consistent performance gains of up to +2.4 macro F1 on the GoEmotions benchmark, showing that explicit lexical cues can complement neural representations and improve robustness in predicting emotion categories.
[12] Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference
Patrick Wilhelm,Thorsten Wittkopp,Odej Kao
Main category: cs.CL
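TL;DR: This paper examines the trade-off between energy efficiency and accuracy for large language models (LLMs) and small language models (SLMs), and proposes energy-based evaluation metrics (such as Energy-per-Token) and an energy-saving strategy that dynamically controls reasoning depth, with the goal of sustainable AI deployment.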
Details
Motivation: LLMs perform well but consume a lot of energy; for many tasks, SLMs paired with advanced reasoning strategies (such as CoT or majority voting) are sufficient, but the extra reasoning can raise energy use, so the energy-accuracy trade-off needs to be quantified and optimized. Method: The authors analyze the energy-accuracy trade-off of different test-time compute strategies on the MMLU benchmark; model the nonlinear relationship between Transformer input/output tokens and hardware energy consumption; propose energy efficiency metrics such as Energy-per-Token; and design controlled CoT reasoning based on operating curves together with an energy-aware routing mechanism. Result: SLMs with suitable strategies can approach LLM accuracy at markedly lower energy cost; the relationship between token count and energy is nonlinear; and dynamically regulating reasoning depth improves accuracy per unit of energy. Conclusion: Energy efficiency should be a core evaluation dimension, and model selection, reasoning control, and routing should be optimized together for green, sustainable deployment of large models. Abstract: Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks but come with substantial energy and computational costs, particularly in request-heavy scenarios. In many real-world applications, the full scale and capabilities of LLMs are often unnecessary, as Small Language Models (SLMs) can provide accurate responses for simpler text generation tasks. When enhanced with advanced reasoning strategies, such as Chain-of-Thought (CoT) prompting or Majority Voting, SLMs can approach the performance of larger models while reducing overall computational requirements. However, these strategies can also introduce additional energy costs, creating an energy-accuracy trade-off. Our analysis examines these trade-offs in test-time compute strategies for smaller models compared to larger ones, using the MMLU benchmark. Additionally, we explore the input-output token dynamics of transformer architectures, which result in nonlinear hardware energy operation curves for LLMs. To bridge AI research with its physical impact, we propose *energy efficiency metrics*, including Energy-per-Token, as complements to traditional accuracy benchmarks. Beyond model selection, we propose controlled reasoning in CoT token generation, using operating curves to regulate reasoning depth dynamically. This vision integrates an energy-aware routing mechanism, ensuring that model selection and inference strategies balance accuracy for sustainable AI deployment.
[13] Decoding the decoder: Contextual sequence-to-sequence modeling for intracortical speech decoding
Michal Olak,Tommaso Boccato,Matteo Ferrante
Main category: cs.CL
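TL;DR: This paper proposes a multitask Transformer-based sequence-to-sequence model, combined with a Neural Hammer Scalpel (NHS) calibration module, for decoding attempted speech from area 6v intracortical recordings; on the Willett et al. dataset it achieves a state-of-the-art phoneme error rate (14.3%) and a word error rate of 19.4% with candidate generation and rescoring, and it reveals temporal chunking in the attention patterns, supporting hypotheses about how neural speech evidence is segmented and accumulated over time.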
Details
Motivation: To examine what context-aware sequence-to-sequence decoding contributes to sublexical neural readout, robustness, and interpretability, especially given the data scarcity and day-to-day nonstationarity of brain-computer interfaces. Method: A multitask Transformer architecture jointly predicts phoneme sequences, word sequences, and auxiliary acoustic features; a Neural Hammer Scalpel (NHS) calibration module combines global alignment with feature-wise modulation to handle day-to-day variability; the analysis includes held-out-day generalization experiments and visualization of encoder and decoder attention. Result: On the Willett et al. dataset, the model reaches a 14.3% phoneme error rate (state of the art) and word error rates of 25.6% (direct decoding) and 19.4% (candidate generation plus rescoring); NHS clearly outperforms a linear transform or no day-specific transform; held-out-day performance degrades as temporal distance grows; attention maps show recurring temporal chunking in the encoder, with the phoneme and word decoders using these segments differently. Conclusion: Context-aware sequence-to-sequence modeling can improve the fidelity of decoding intracortical speech signals into phonemes, and attention-based analysis helps generate testable hypotheses about how neural speech evidence is segmented and accumulated over time. Abstract: Speech brain-computer interfaces require decoders that translate intracortical activity into linguistic output while remaining robust to limited data and day-to-day variability. While prior high-performing systems have largely relied on framewise phoneme decoding combined with downstream language models, it remains unclear what contextual sequence-to-sequence decoding contributes to sublexical neural readout, robustness, and interpretability. We evaluated a multitask Transformer-based sequence-to-sequence model for attempted speech decoding from area 6v intracortical recordings. The model jointly predicts phoneme sequences, word sequences, and auxiliary acoustic features. To address day-to-day nonstationarity, we introduced the Neural Hammer Scalpel (NHS) calibration module, which combines global alignment with feature-wise modulation. We further analyzed held-out-day generalization and attention patterns in the encoder and decoders. On the Willett et al. dataset, the proposed model achieved a state-of-the-art phoneme error rate of 14.3%. Word decoding reached 25.6% WER with direct decoding and 19.4% WER with candidate generation and rescoring. NHS substantially improved both phoneme and word decoding relative to linear or no day-specific transform, while held-out-day experiments showed increasing degradation on unseen days with temporal distance. Attention visualizations revealed recurring temporal chunking in encoder representations and distinct use of these segments by phoneme and word decoders.
These results indicate that contextual sequence-to-sequence modeling can improve the fidelity of neural-to-phoneme readout from intracortical speech signals and suggest that attention-based analyses can generate useful hypotheses about how neural speech evidence is segmented and accumulated over time.
[14] FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
Mahesh Kumar,Bhaskarjit Sarmah,Stefano Pasquali
Main category: cs.CL
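TL;DR: This paper introduces the FinBench-QA-Hallucination benchmark for evaluating hallucination detection methods in knowledge-graph-augmented financial question answering; LLM judges and embedding methods perform best under clean conditions, most methods degrade sharply when noisy triplets are introduced, and embedding methods remain comparatively robust.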
Details
Motivation: Current knowledge-graph-augmented QA systems lack systematic mechanisms for detecting hallucinations (factually incorrect outputs); in high-stakes domains such as finance, hallucinations can lead to regulatory violations and flawed decisions, so a reliable evaluation benchmark is needed. Method: The authors build the FinBench-QA-Hallucination benchmark (755 annotated examples drawn from 300 pages of SEC 10-K filings), labeling groundedness with a conservative evidence-linkage protocol that requires support from both text chunks and relational triplets; they compare six hallucination detection approaches (LLM judges, fine-tuned classifiers, NLI models, span detectors, and embedding-based methods) with and without KG triplets, and run Cochran's Q and McNemar tests. Result: LLM judges and embedding methods achieve the highest F1 under clean conditions (0.82-0.86); with noisy triplets, MCC drops 44-84% for most methods but only 9% for embedding methods; statistical tests show significant differences between methods (p < 0.001). Conclusion: Current KG-augmented QA systems show serious fragility in hallucination detection in financial settings; embedding methods are more robust; the benchmark offers a transferable framework for AI reliability evaluation and information system design in high-stakes domains. Abstract: As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations - factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches - LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods - under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with Matthews Correlation Coefficient (MCC) dropping 44-84 percent, while embedding methods remain relatively robust with only 9 percent degradation. Statistical tests (Cochran's Q and McNemar) confirm significant performance differences (p < 0.001).
Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design across other high-stakes domains such as healthcare, legal, and government.
[15] Abjad-Kids: An Arabic Speech Classification Dataset for Primary Education
Abdul Aziz Snoubara,Baraa Al_Maradni,Haya Al_Naal,Malek Al_Madrmani,Roaa Jdini,Seedra Zarzour,Khloud Al Jallad
Main category: cs.CL
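TL;DR: This paper presents Abjad-Kids, an Arabic children's speech dataset to be released publicly, and a CNN-LSTM-based hierarchical classification method designed for the high similarity among Arabic phonemes and the small number of samples per class; experiments show that static linguistic grouping works better, though overfitting remains a problem.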
Details
Motivation: Low-resource languages such as Arabic lack public speech datasets for children, which holds back research on children's speech recognition. Method: The authors build the Abjad-Kids dataset of 46,397 speech samples from children aged 3-12; propose a two-stage hierarchical classification method based on CNN-LSTM; compare static linguistic grouping with dynamic clustering-based grouping; and apply data augmentation and model regularization. Result: Static linguistic grouping outperforms dynamic clustering-based grouping; CNN-LSTM models with data augmentation clearly outperform traditional machine learning methods; overfitting is still widespread. Conclusion: Abjad-Kids fills a gap in Arabic children's speech datasets, and the hierarchical CNN-LSTM method improves recognition performance, but more data is needed to reduce overfitting; the dataset will be released publicly to support further research. Abstract: Speech-based AI educational applications have gained significant interest in recent years, particularly for children. However, children's speech research remains limited due to the lack of publicly available datasets, especially for low-resource languages such as Arabic. This paper presents Abjad-Kids, an Arabic speech dataset designed for kindergarten and primary education, focusing on fundamental learning of alphabets, numbers, and colors. The dataset consists of 46397 audio samples collected from children aged 3-12 years, covering 141 classes. All samples were recorded under controlled specifications to ensure consistency in duration, sampling rate, and format. To address high intra-class similarity among Arabic phonemes and the limited samples per class, we propose a hierarchical audio classification approach based on CNN-LSTM architectures. Our proposed methodology decomposes alphabet recognition into a two-stage process: an initial grouping classification model followed by specialized classifiers for each group. Two grouping strategies, static linguistic-based grouping and dynamic clustering-based grouping, were evaluated. Experimental results demonstrate that static linguistic-based grouping achieves superior performance. Comparisons between traditional machine learning and deep learning approaches highlight the effectiveness of CNN-LSTM models combined with data augmentation. Despite achieving promising results, most of our experiments indicate a challenge with overfitting, which is likely due to the limited number of samples, even after data augmentation and model regularization.
Thus, future work may focus on collecting additional data to address this issue. Abjad-Kids will be publicly available. We hope that Abjad-Kids will enrich the representation of children in speech datasets and be a good resource for future research in Arabic speech classification for kids.
[16] SciNav: A General Agent Framework for Scientific Coding Tasks
Tianshu Zhang,Huan Sun
Main category: cs.CL
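TL;DR: This paper proposes SciNav (Scientific Navigator), an agent framework designed for scientific coding tasks that uses pairwise relative judgments to guide tree search and efficiently select high-quality solutions under a limited search budget.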
Details
Motivation: Existing autonomous science agents mostly target open-ended scientific problems that are subjective and hard to evaluate; scientific coding tasks are executable and objectively assessable, but structured, end-to-end agent frameworks for them are lacking. Method: SciNav uses a tree search strategy based on pairwise relative judgments: under a constrained search budget it selects the top-K promising solution branches, prunes low-potential branches, and progressively narrows the solution candidates through relative comparisons. Result: On two scientific coding benchmarks, SciNav significantly outperforms baselines such as direct prompting, OpenHands, and Self-Debug, as well as comparators such as random selection and LLM absolute scoring, and is robust across base models, task types, and difficulty levels. Conclusion: Top-K search guided by relative judgments improves the quality and practicality of agent solutions in scientific coding, and SciNav is a step toward more practical science agents. Abstract: Autonomous science agents built on large language models (LLMs) are increasingly used to generate hypotheses, design experiments, and produce reports. However, prior work mainly targets open-ended scientific problems with subjective outputs that are difficult to evaluate. Scientific coding benchmarks, by contrast, provide executable outputs for objective assessment. Existing approaches remain engineering-driven pipelines, revealing the need for structured, end-to-end science agent frameworks for scientific coding tasks. We address this gap by focusing on scientific coding tasks, where evaluation can be made rigorously, and introducing an agent framework SciNav (Scientific Navigator) that enables more effective solution exploration. Our framework is designed to operate under constrained search budgets, moving beyond reliance on pre-defined success metrics and prolonged search cycles. Inspired by findings that comparative judgments often reveal finer-grained quality differences and therefore provide greater discriminative power than absolute scoring, our framework leverages pairwise relative judgments within a tree search process to select top-K promising solution branches, prune low-potential ones, and progressively narrow down the solution candidates on the selected branches guided by relative comparisons. We demonstrate our agent's effectiveness across different types of tasks on two benchmarks.
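The top-K selection by pairwise relative judgment can be sketched as follows. This is only an illustration under stated assumptions: the round-robin tournament, the win-count ranking, and the names `top_k_by_pairwise` and `prefer` are hypothetical, since the abstract does not specify how SciNav schedules or aggregates its comparisons.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def top_k_by_pairwise(
    candidates: Sequence[T],
    prefer: Callable[[T, T], bool],
    k: int,
) -> List[T]:
    """Rank candidates by pairwise wins and keep the k best.

    `prefer(a, b)` is an assumed judge (for example an LLM comparison)
    that returns True when `a` is judged better than `b`.
    """
    wins = [0] * len(candidates)
    # Round-robin: compare every unordered pair exactly once.
    for i, j in combinations(range(len(candidates)), 2):
        if prefer(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(range(len(candidates)), key=lambda idx: wins[idx], reverse=True)
    return [candidates[idx] for idx in ranked[:k]]
```

A round-robin needs n(n-1)/2 judge calls for n candidates, so under a tight search budget a cheaper comparison schedule may be preferable; the sketch favors clarity over cost.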
Experiments show that SciNav significantly outperforms direct prompting and prior agents like OpenHands and Self-Debug across different base models, task types, and difficulty levels, and exceeds different frontier comparators such as random selection and LLM absolute scoring. These results confirm the strength of our agent design and highlight the effectiveness of relative judgment-guided top-K search for high-quality scientific coding, marking a step toward more practical science agents.
[17] The production of meaning in the processing of natural language
Christopher J. Agostino,Quan Le Thien,Nayan D'Souza,Louis van der Elst
Main category: cs.CL
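TL;DR: This paper studies the quantum-like contextuality that large language models exhibit in semantic processing, measured by the degree of CHSH inequality violation $|S|$. It finds that the dispersion of the $|S|$ distribution is orthogonal to mainstream evaluation metrics (such as MMLU and hallucination rate), suggesting that contextuality is a new cognitive feature independent of traditional capability dimensions, and it discusses the deeper implications for prompt injection defenses and social manipulation.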
Details
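Motivation: Understanding the basic mechanisms by which meaning is produced in natural language is critical for building safe and beneficial human-agent interaction. Cognitive science and experiments on large models have shown that semantic processing exhibits quantum-like contextuality, but its relationship to model capability remains unclear. Method: Systematically measure the CHSH $|S|$ parameter during inference for language models of different scales (spanning four orders of magnitude), cross-reference it with benchmarks including MMLU, hallucination rate, and nonsense detection, analyze how $|S|$ varies with sampling parameters and word order, and use information theory to discuss the implications for prompt injection defenses and the manipulation of social context. Result: The interquartile range (IQR) of the $|S|$ distribution is completely orthogonal to all external benchmarks; the $|S|$ violation rate shows a weak, non-significant anticorrelation with each benchmark; $|S|$ is affected by sampling parameters and word order; and genuine contextuality imposes fundamental constraints on prompt injection defenses. Conclusion: Quantum-like contextuality in language models is a cognitive dimension independent of traditional performance metrics. Its existence reveals a deeper mechanism of semantic manipulation, namely "manufacturing contextuality" by shaping the space of interpretations itself rather than inducing a particular answer, which has fundamental implications for AI safety and the design of human-agent interaction. Abstract: Understanding the fundamental mechanisms governing the production of meaning in the processing of natural language is critical for designing safe, thoughtful, engaging, and empowering human-agent interactions. Experiments in cognitive science and social psychology have demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories, and recent works have found similar results in large language models -- in particular, clear violations of the Bell inequality in experiments of contextuality during interpretation of ambiguous expressions. We explore the CHSH $|S|$ parameter -- the metric associated with the inequality -- across the inference parameter space of models spanning four orders of magnitude in scale, cross-referencing it with MMLU, hallucination rate, and nonsense detection benchmarks. We find that the interquartile range of the $|S|$ distribution -- the statistic that most sharply differentiates models from one another -- is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with all three benchmarks that does not reach significance.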
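For reference, the CHSH quantity is computed from four correlations of ±1 outcomes as $S = E(a,b) - E(a,b') + E(a',b) + E(a',b')$. Any classical (local hidden-variable) model satisfies $|S| \le 2$, while quantum mechanics allows values up to $2\sqrt{2}$. The sketch below only illustrates that arithmetic. How the paper maps measurement settings and outcomes onto interpretations of ambiguous expressions is not specified in the abstract, so the dictionary keys and sample data here are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple


def correlation(outcomes: Iterable[Tuple[int, int]]) -> float:
    """E(x, y): mean product of paired outcomes, each coded as +1 or -1."""
    pairs = list(outcomes)
    return sum(a * b for a, b in pairs) / len(pairs)


def chsh_s(e: Dict[str, float]) -> float:
    """S = E(a,b) - E(a,b') + E(a',b) + E(a',b')."""
    return e["ab"] - e["ab'"] + e["a'b"] + e["a'b'"]


# A deterministic strategy that always answers +1 sits exactly at the classical bound.
always_plus = [(1, 1)] * 4
classical = chsh_s({key: correlation(always_plus) for key in ("ab", "ab'", "a'b", "a'b'")})

# Correlations of magnitude 1/sqrt(2) with the right signs reach the quantum maximum.
r = 1 / math.sqrt(2)
quantum = chsh_s({"ab": r, "ab'": -r, "a'b": r, "a'b'": r})

violates = abs(quantum) > 2  # |S| above 2 counts as a violation
```

A violation rate, as mentioned in the abstract, would then be the fraction of experimental runs for which `abs(S) > 2`.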
We investigate how $|S|$ varies with sampling parameters and word order, and discuss the information-theoretic constraints that genuine contextuality imposes on prompt injection defenses and its human analogue, whereby careful construction and maintenance of social contextuality can be carried out at scale -- manufacturing not consent but contextuality itself, a subtler and more fundamental form of manipulation that shapes the space of possible interpretations before any particular one is reached.
[18] Coding Agents are Effective Long-Context Processors
Weili Cao,Xunjian Yin,Bhuwan Dhingra,Shuyan Zhou
Main category: cs.CL
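TL;DR: This paper proposes decoupling long-context processing from the implicit attention mechanism of large language models (LLMs) and handing it to coding agents, which carry it out explicitly using native tool calling and file system operations. This significantly improves performance on long-context reasoning, retrieval-augmented generation, and open-domain question answering.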
Details
Motivation: Although existing LLMs can extend their context length, they rely on uninterpretable attention mechanisms, and performance degrades significantly as context grows. A more controllable, interpretable, and scalable paradigm for long-context processing is needed. Method: Use off-the-shelf frontier coding agents as a general interface, organize large-scale text (up to three trillion tokens) in a file system, and perform explicit, interactive retrieval, filtering, and reasoning with native code and terminal commands rather than conventional semantic attention or vector retrieval. Result: Across multiple long-context benchmarks, the approach outperforms the published state of the art by 17.3% on average; the gains are attributed to tool-use proficiency and file system navigation. Conclusion: Externalizing long-context processing to coding agents is an effective alternative that opens a new path for long-context modeling in LLMs and may reduce reliance on simply enlarging the context window or improving attention mechanisms. Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, the access is via the latent and uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as the general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering over large-scale corpora containing up to three trillion tokens. Across multiple benchmarks, these agents outperform published state-of-the-art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.
[19] A Training-Free Regeneration Paradigm: Contrastive Reflection Memory Guided Self-Verification and Self-Improvement
Yuran Li,Di Wu,Benoit Boulet
Main category: cs.CL
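TL;DR: This paper proposes a training-free regeneration paradigm that uses an offline-built contrastive Reflection Memory (RM) for self-verification and a single regeneration, improving the accuracy of LLM outputs while remaining efficient.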
Details
Motivation: Existing verification-guided self-improvement methods trade off inference efficiency against accuracy: iterative verification-rectification is computationally expensive and prone to being trapped in faulty reasoning, while best-of-N sampling requires many samples yet cannot fix internal model flaws. Method: Build an offline contrastive Reflection Memory (RM); at inference time, first perform RM-guided self-verification, then perform a single RM-guided regeneration from scratch, avoiding both iterative correction and multi-sample selection. Result: On nine benchmarks covering algorithmic, reasoning, symbolic, and domain-specific tasks, the method outperforms prior methods on both small- and large-scale LLMs at low computational cost. Conclusion: RM-guided single-pass verification and regeneration is an efficient, accurate, and training-free way to improve LLM outputs. Abstract: Verification-guided self-improvement has recently emerged as a promising approach to improving the accuracy of large language model (LLM) outputs. However, existing approaches face a trade-off between inference efficiency and accuracy: iterative verification-rectification is computationally expensive and prone to being trapped in faulty reasoning, while best-of-N selection requires extensive sampling without addressing internal model flaws. We propose a training-free regeneration paradigm that leverages an offline-curated contrastive Reflection Memory (RM) to provide corrective guidance, while regenerating from scratch helps break out of faulty reasoning. At inference time, the method performs RM-guided self-verification followed by a single RM-guided regeneration, avoiding both iterative correction and multi-sample selection. We evaluated our method on nine benchmarks that span algorithmic, reasoning, symbolic, and domain-specific tasks in both small- and large-scale LLMs. Experimental results show that our method outperforms prior methods while maintaining low computational cost.
[20] Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable
Rounak Saha,Gurusha Juneja,Dayita Chaudhuri,Naveeja Sajeevan,Nihar B Shah,Danish Pruthi
Main category: cs.CL
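TL;DR: This paper examines the accuracy and reliability of current AI-text detectors in identifying peer reviews produced through human-AI collaboration, and finds that existing detectors have high false-positive rates and cannot meet the precision required for academic integrity enforcement.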
Details
Motivation: Several conferences and journals have recently prohibited peer reviewers from using large language models (LLMs) except for polishing, paraphrasing, and grammar correction, but it is unclear whether these policies are enforceable. Method: Build a dataset of peer reviews simulating different levels of human-AI collaboration and evaluate five state-of-the-art AI-text detectors (including two commercial systems); further explore whether review-specific signals, such as access to the paper manuscript and the constrained domain of scientific writing, can improve detection. Result: All detectors misclassify a portion of LLM-polished, human-written reviews as AI-generated; incorporating review-specific signals helps somewhat, but no approach reaches the accuracy needed to determine academic misconduct in practice. Conclusion: Current AI detection tools are not suitable for determining AI use in peer reviews, and estimates of AI usage derived from them should be interpreted with caution because they may substantially overstate the extent of violations. Abstract: A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews through the use of AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI generated, potentially overstating the extent of policy violations.
[21] Diffutron: A Masked Diffusion Language Model for Turkish Language
Şuayp Talha Kocabay,Talha Rüzgar Akkuş
Main category: cs.CL
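TL;DR: This paper proposes Diffutron, a masked diffusion language model designed for Turkish. Through LoRA-based tuning and progressive instruction tuning, it achieves performance comparable to much larger baselines under limited resources.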
Details
Motivation: Masked diffusion language models (MDLMs) have seen limited application to morphologically rich languages such as Turkish, which calls for efficient, lightweight non-autoregressive generative models for such languages. Method: Diffutron is built in two steps: 1) LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus; 2) a progressive instruction-tuning strategy that adapts the model sequentially on general and task-specific instruction sets. Result: Across several comprehensive benchmarks, the compact model performs on par with existing multi-billion-parameter baselines. Conclusion: The results validate the effectiveness and feasibility of masked diffusion modeling combined with multi-stage tuning for non-autoregressive text generation in morphologically rich languages such as Turkish. Abstract: Masked Diffusion Language Models (MDLMs) have emerged as a compelling non-autoregressive alternative to standard large language models; however, their application to morphologically rich languages remains limited. In this paper, we introduce $\textit{Diffutron}$, a masked diffusion language model specifically designed for Turkish. Our approach leverages a resource-efficient training pipeline, starting with LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus. To enable generative capabilities, we employ a progressive instruction-tuning strategy, sequentially adapting the model on general and task-specific instruction sets. Experimental results across comprehensive benchmarks demonstrate that, despite its compact size, our model achieves competitive performance compared to existing multi-billion-parameter baselines. These findings validate the effectiveness of masked diffusion modeling combined with multi-stage tuning for non-autoregressive text generation in Turkish.
[22] PARHAF, a human-authored corpus of clinical reports for fictitious patients in French
Xavier Tannier,Salam Abbara,Rémi Flicoteaux,Youness Khalil,Aurélie Névéol,Pierre Zweigenbaum,Emmanuel Bacry
Main category: cs.CL
TL;DR: PARHAF is a large, open-source, privacy-preserving synthetic French clinical corpus of 7394 expert-authored fictitious clinical reports, designed to support NLP development under strict EU privacy regulations.
Details
Motivation: To overcome data sharing restrictions imposed by privacy regulations (e.g., GDPR in France/EU) that hinder clinical NLP development, especially for French-language systems. Method: A structured protocol combining clinician expertise (104 residents across 18 specialties) and epidemiological data from France’s SNDS was used to author realistic but fully fictitious clinical reports following predefined scenarios and templates. Result: A publicly available corpus of 7394 clinical reports covering 5009 fictitious patient cases across broad medical/surgical specialties, with general-purpose and four specialized subsets (oncology, infectious diseases, diagnostic coding), released under CC-BY license (partially embargoed for future benchmarks). Conclusion: PARHAF enables privacy-compliant training and evaluation of French clinical NLP models and provides a scalable, replicable methodology for building synthetic clinical corpora in other languages and health systems. Abstract: The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. 
The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
[23] Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study
Mohammed Rakibul Hasan
Main category: cs.CL
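TL;DR: This study evaluates the reliability of answers from GPT-4, Gemini Pro, Llama-3, and Mistral-7B to health crisis questions about COVID-19, dengue, the Nipah virus, and Chikungunya in low-resource contexts such as Bangladesh.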
Details
Motivation: Assess the ability of large language models (LLMs) to provide reliable health information in low-resource settings, with particular attention to their applicability and risks during public health crises. Method: Construct a health crisis question-answer dataset from authoritative sources and evaluate GPT-4, Gemini Pro, Llama-3, and Mistral-7B using semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Result: LLMs show both strengths and clear limitations in representing epidemiological history and health crisis knowledge; output quality is uneven and carries a risk of misleading policymaking. Conclusion: LLMs have potential to support public health decision-making in resource-constrained regions, but they should be deployed cautiously, with human review and local adaptation, to mitigate reliability risks. Abstract: Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question-answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.
[24] Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Tianyi Huang,Nathan Huang,Justin Tang,Wenqian Chen,Elsa Fan
Main category: cs.CL
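TL;DR: This paper proposes PCFJudge, which reruns a factuality evaluation prompt over multiple orderings of the candidate answers and aggregates the results to improve the stability and accuracy of large language models used as judges.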
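The permutation-consensus idea summarized above can be sketched in a few lines. This is an illustration under assumptions rather than PCFJudge itself: `consensus_pick`, the plain mean as the aggregation rule, and the toy `biased_judge` with a first-position bonus are all hypothetical stand-ins for a real LLM judge and the aggregation of scores, ranks, and uncertainty signals described in the abstract. Candidate strings are assumed to be unique.

```python
from itertools import islice, permutations
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Judge = Callable[[List[str]], List[float]]  # scores returned in presentation order


def consensus_pick(
    candidates: Sequence[str], judge: Judge, max_orderings: int = 6
) -> Tuple[str, Dict[str, float]]:
    """Average each candidate's score over several orderings and return the best one."""
    scores: Dict[str, List[float]] = {c: [] for c in candidates}
    for ordering in islice(permutations(candidates), max_orderings):
        for cand, score in zip(ordering, judge(list(ordering))):
            scores[cand].append(score)
    averaged = {c: mean(v) for c, v in scores.items()}
    return max(averaged, key=averaged.get), averaged


# Toy judge with a position bias: whatever is listed first gets a 0.2 bonus.
QUALITY = {"A": 0.6, "B": 0.7, "C": 0.5}


def biased_judge(ordered: List[str]) -> List[float]:
    return [QUALITY[c] + (0.2 if i == 0 else 0.0) for i, c in enumerate(ordered)]


single = biased_judge(["A", "B", "C"])  # one ordering: "A" scores highest only because it is first
best, averaged = consensus_pick(["A", "B", "C"], biased_judge)  # all six orderings: "B" wins
```

With many candidates, enumerating the first few permutations would itself be biased toward similar orderings, so a real run would more plausibly sample orderings at random.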
Details
Motivation: Large language models (LLMs) are widely used as judges, but their judgments are easily swayed by irrelevant presentation choices such as candidate order, which makes factuality evaluation unstable. Method: PCFJudge runs a "factuality-first" listwise prompt several times over different orderings of the same candidate set and aggregates the scores, ranks, and uncertainty signals into a consensus decision. Result: On the RewardBench 2 Factuality benchmark, PCFJudge improves over direct judging by up to 7 absolute points; ablations show that the main gain comes from permutation consensus itself rather than from heavier arbitration mechanisms. Conclusion: Instability induced by candidate order is an important source of factuality-judging error, and averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable. Abstract: Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.
[25] JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs
Taihei Shiotani,Masahiro Kaneko,Ayana Niwa,Yuki Maruyama,Daisuke Oba,Masanari Ohi,Naoaki Okazaki
Main category: cs.CL
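TL;DR: This paper proposes JUBAKU, a benchmark designed to evaluate social bias in Japanese large language models with an emphasis on factors specific to Japanese culture (such as hierarchical relationships, dialects, and traditional gender roles). Bias is probed through adversarial dialogue scenarios hand-crafted by native speakers. Experiments show that existing Japanese LLMs perform far below the random baseline on the benchmark, supporting its validity and difficulty.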
Details
Motivation: Evaluations of social bias in non-English large language models mostly rely on direct translations of English benchmarks, which fail to reflect local cultural norms in places such as Japan (for example hierarchy, dialects, and traditional gender roles), so bias detection is incomplete. Method: Build JUBAKU, an adversarial bias benchmark grounded in Japanese culture that covers ten cultural categories, with dialogue scenarios hand-crafted by native Japanese annotators to trigger and expose latent social biases in Japanese LLMs; evaluate nine Japanese LLMs on JUBAKU and on three benchmarks adapted from English, and assess reliability with human annotators. Result: All nine Japanese LLMs average only 23% accuracy on JUBAKU (13% to 33%), well below the 50% random baseline, while performing better on the other benchmarks; human annotators identify unbiased responses with 91% accuracy, confirming JUBAKU's reliability and adversarial nature. Conclusion: JUBAKU exposes culture-specific biases in current Japanese LLMs that existing benchmarks overlook, underscoring the need for localized, culturally adapted benchmarks in fairness research. Abstract: Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions and lead to diverse manifestations of stereotypes. Existing evaluations of social bias in large language models (LLMs) for non-English contexts, however, often rely on translations of English benchmarks. Such benchmarks fail to reflect local cultural norms, including those found in Japanese. For instance, Western benchmarks may overlook Japan-specific stereotypes related to hierarchical relationships, regional dialects, or traditional gender roles. To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts. JUBAKU uses adversarial construction to expose latent biases across ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by native Japanese annotators, specifically designed to trigger and reveal latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and on three other benchmarks adapted from English. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU's reliability and its adversarial nature to LLMs.
[26] A Modular LLM Framework for Explainable Price Outlier Detection
Shadi Sartipi,John Wu,Sina Ghotbi,Nikhita Vedula,Shervin Malmasi
Main category: cs.CL
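TL;DR: This paper proposes an agentic framework based on large language models (LLMs) for detecting product price outliers. It achieves high accuracy and explainability through three stages: related-product retrieval, multi-dimensional relative utility assessment, and explainable reasoning.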
Details
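Motivation: Traditional price outlier detection relies on simple thresholds and ignores the rich semantic relationships among product attributes, so it struggles to be both accurate and explainable, which affects competitiveness, revenue, and consumer trust in retail and e-commerce. Method: Build an LLM-based agentic framework with three stages: (i) relevance classification, which uses product descriptions and attributes to retrieve similar products with comparable prices; (ii) relative utility assessment, which compares the target product with each similar product along price-influencing dimensions such as brand, size, and features; (iii) reasoning-based decision, which aggregates the comparison evidence into an explainable outlier judgment. Result: The framework reaches over 75% agreement with human auditors on a test set and outperforms zero-shot and retrieval-augmented LLM baselines; ablations examine sensitivity to key hyperparameters and the framework's flexibility under different accuracy and auditor-agreement requirements. Conclusion: The LLM agent framework effectively models semantic relationships among products and delivers accurate, explainable price outlier detection, with potential for practical deployment and extension. Abstract: Detecting product price outliers is important for retail and e-commerce stores as erroneous or unexpectedly high prices adversely affect competitiveness, revenue, and consumer trust. Classical techniques offer simple thresholds while ignoring the rich semantic relationships among product attributes. We propose an agentic Large Language Model (LLM) framework that treats outlier price flagging as a reasoning task grounded in related product detection and comparison. The system processes the prices of target products in three stages: (i) relevance classification selects price-relevant similar products using product descriptions and attributes; (ii) relative utility assessment evaluates the target product against each similar product along price-influencing dimensions (e.g., brand, size, features); (iii) reasoning-based decision aggregates these justifications into an explainable price outlier judgment. The framework attains over 75% agreement with human auditors on a test dataset, and outperforms zero-shot and retrieval-based LLM techniques. Ablation studies show the sensitivity of the method to key hyper-parameters and demonstrate its flexibility across cases with different accuracy requirements and auditor-agreement levels.
[27] Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention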
Manh Nguyen,Anh Nguyen,Dung Nguyen,Svetha Venkatesh,Hung Le
Main category: cs.CL
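TL;DR: This paper proposes Diversity-Aware Retention (DAR), a lightweight multi-agent debate framework that, in each debate round, broadcasts only the subset of responses that maximally disagree with each other and with the majority vote, reducing noise and redundancy and improving reasoning quality.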
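The retention rule summarized above can be illustrated with a small index-based selection. This is a sketch under assumptions, not DAR's actual scoring: disagreement is reduced to exact mismatch between final answers, and `retain_diverse` with its exhaustive subset search is a hypothetical helper that is only practical for a handful of agents.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence, Tuple


def retain_diverse(answers: Sequence[str], k: int) -> Tuple[int, ...]:
    """Return the indices of the k answers that disagree most with each other
    and with the majority answer."""
    majority = Counter(answers).most_common(1)[0][0]

    def disagreement(idxs: Tuple[int, ...]) -> int:
        pairwise = sum(answers[i] != answers[j] for i, j in combinations(idxs, 2))
        vs_majority = sum(answers[i] != majority for i in idxs)
        return pairwise + vs_majority

    return max(combinations(range(len(answers)), k), key=disagreement)


answers = ["42", "42", "41", "42", "40"]  # final answers from five agents
kept = retain_diverse(answers, k=2)       # indices only; the messages are never rewritten
broadcast = [answers[i] for i in kept]
```

Selecting by index rather than by rewriting or summarizing mirrors the abstract's point that retained messages are passed on without modification.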
Details
Motivation: Existing message filtering based on uncertainty estimation is unreliable because confidence scores are poorly calibrated and sensitive to thresholds; meanwhile, broadcasting every message introduces noise and redundancy and wastes compute. Method: DAR uses an explicit index-based retention mechanism: in each round it selects the subset of responses that disagree most with each other and with the majority vote and broadcasts them, leaving the original messages unmodified so that retained disagreements remain authentic. Result: Experiments on a range of reasoning and question answering benchmarks show that DAR's selective message propagation consistently improves debate performance, especially as the number of agents grows. Conclusion: In multi-agent reasoning, what agents hear matters as much as what agents say; DAR mitigates noise accumulation through diversity-aware message retention. Abstract: Multi-Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter-agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low-confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity-Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index-based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi-agent reasoning systems.
[28] Weber's Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models
Jon-Paul Cacioli
Main category: cs.CL
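TL;DR: This paper uses psychophysical methods to study the geometry of magnitude representations in large language models. It finds consistently log-compressive geometry, but this geometry is dissociated from behavior, and early layers play the key role in magnitude processing.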
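One way to picture the representational similarity analysis behind this claim: under Weber's law, discriminability depends on the ratio of two magnitudes, so a natural dissimilarity is the log-ratio distance $|\ln a - \ln b|$. The sketch below compares how well a log-ratio matrix and a linear-difference matrix explain a set of representation distances. It is an illustration under assumptions: the paper's exact Weber-law matrix and correlation statistic are not given in the abstract, Pearson correlation is used here for simplicity, and `toy_repr` is a hypothetical stand-in for model activations.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence


def pairwise(values: Sequence[float], dist: Callable[[float, float], float]) -> List[float]:
    """Upper triangle of a dissimilarity matrix, flattened to a list."""
    return [dist(a, b) for a, b in combinations(values, 2)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


magnitudes = [1, 2, 4, 8, 16, 32]
toy_repr = [math.log(m) for m in magnitudes]  # hypothetical 1-D log-compressed "activations"

rep_d = pairwise(toy_repr, lambda a, b: abs(a - b))                       # representation distances
log_d = pairwise(magnitudes, lambda a, b: abs(math.log(a) - math.log(b)))  # Weber-style model
lin_d = pairwise(magnitudes, lambda a, b: abs(a - b))                      # linear model

log_fit = pearson(rep_d, log_d)
lin_fit = pearson(rep_d, lin_d)
prefers_log = log_fit > lin_fit
```

With real activations, `rep_d` would come from distances between hidden-state vectors at a given layer, and the comparison would be repeated per model, domain, and layer.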
Details
Motivation: Recent studies disagree on how transformer language models represent magnitude (logarithmic, linear, or per-digit circular); this paper aims to resolve the dispute with the formal tools of psychophysics. Method: Apply four complementary paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains (numerical, temporal, spatial) and three 7-9B instruction-tuned models (Llama, Mistral, Qwen) for cross-architecture validation, supplemented by corpus statistics. Result: 1) Representational geometry is always log-compressive (RSA correlations 0.68 to 0.96), and linear geometry is never preferred; 2) log geometry is dissociated from behaviour: one model reaches a human-range Weber fraction (WF = 0.20) while the other does not, and both perform at chance on temporal and spatial discrimination; 3) causal intervention shows strong functional specificity in early layers (4.1×), while the later layers with the strongest geometry are not causally engaged (1.2×); corpus analysis confirms the efficient coding precondition (α = 0.77). Conclusion: The statistics of the training data are sufficient to produce log-compressive magnitude geometry, but that geometry alone does not guarantee the corresponding behavioural competence. Abstract: How do transformer language models represent magnitude? Recent work disagrees: some find logarithmic spacing, others linear encoding, others per-digit circular representations. We apply the formal tools of psychophysics to resolve this. Using four converging paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains in three 7-9B instruction-tuned models spanning three architecture families (Llama, Mistral, Qwen), we report three findings. First, representational geometry is consistently log-compressive: RSA correlations with a Weber-law dissimilarity matrix ranged from .68 to .96 across all 96 model-domain-layer cells, with linear geometry never preferred. Second, this geometry is dissociated from behaviour: one model produces a human-range Weber fraction (WF = 0.20) while the other does not, and both models perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry. Third, causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity) while later layers where geometry is strongest are not causally engaged (1.2x). Corpus analysis confirms the efficient coding precondition (alpha = 0.77). These results suggest that training data statistics alone are sufficient to produce log-compressive magnitude geometry, but geometry alone does not guarantee behavioural competence.
[29] PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs
Tianyi Huang,Caden Yang,Emily Yin,Eric Wang,Michael Zhang
Main category: cs.CL
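TL;DR: PAVE is an inference-time validation layer for evidence-grounded question answering. It decomposes retrieved context into atomic facts, drafts an answer, scores its support, and revises low-support answers, improving the answer consistency of retrieval-augmented language models.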
Details
Motivation: Existing retrieval-augmented language models commit to answers without explicitly checking whether the retrieved content supports the conclusion, which yields answers that lack evidential support. Method: PAVE decomposes the retrieved context into question-relevant atomic premises, generates a draft answer, assesses how well the premises support it, revises answers with low support, and outputs an auditable reasoning trace. Result: In controlled experiments with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines on two evidence-grounded QA tasks, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. Conclusion: Explicit premise extraction plus support-gated revision can strengthen the evidence-grounded consistency of retrieval-augmented LLMs, providing a proof of concept for this direction. Abstract: Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.
[30] Can I guess where you are from? Modeling dialectal morphosyntactic similarities in Brazilian Portuguese
Manoel Siqueira,Raquel Freitag
Main category: cs.CL
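TL;DR: By analyzing morphosyntactic covariation in Brazilian Portuguese, this paper asks whether a speaker's dialectal origin can be inferred from the combined behavior of linguistic variables. Correlation analysis captures only limited pairwise associations, whereas clustering better reveals speaker groupings that reflect regional dialect patterns. Although sociolinguistic and computational methods differ in their sample size requirements, interdisciplinary collaboration is essential for building fair, inclusive language technologies that respect dialectal diversity.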
Details
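Motivation: Assess whether the dialectal origin of Brazilian Portuguese speakers can be inferred from the combined behavior of linguistic variables, and promote language technologies that account for dialectal diversity. Method: For four grammatical phenomena related to pronouns, apply correlation analysis and clustering to model morphosyntactic covariation and dialectal distribution. Result: Correlation analysis reveals only limited pairwise associations, whereas clustering identifies speaker groupings consistent with regional dialect patterns; the study also exposes a methodological tension between sociolinguistic and computational approaches over sample size requirements. Conclusion: Interdisciplinary research is key to building fair, inclusive language technologies that respect dialectal diversity, and its value outweighs the challenges of integrating the methods. Abstract: This paper investigates morphosyntactic covariation in Brazilian Portuguese (BP) to assess whether dialectal origin can be inferred from the combined behavior of linguistic variables. Focusing on four grammatical phenomena related to pronouns, correlation and clustering methods are applied to model covariation and dialectal distribution. The results indicate that correlation captures only limited pairwise associations, whereas clustering reveals speaker groupings that reflect regional dialectal patterns. Despite the methodological constraints imposed by differences in sample size requirements between sociolinguistics and computational approaches, the study highlights the importance of interdisciplinary research. Developing fair and inclusive language technologies that respect dialectal diversity outweighs the challenges of integrating these fields.
[31] Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks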
Fan Huang
Main category: cs.CL
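TL;DR: This paper proposes the Network-of-Thought (NoT) framework, which models LLM reasoning as a directed graph with typed nodes and edges guided by a heuristic controller policy. NoT outperforms chain (CoT) and tree (ToT) structures on complex tasks such as multi-hop reasoning, and the paper shows that evaluation methodology strongly affects performance rankings.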
Details
Motivation: Existing prompting paradigms such as Chain-of-Thought and Tree-of-Thought are topologically limited when modeling complex reasoning (merging results, backtracking on hypotheses, integrating evidence from multiple sources), so a more flexible structured representation of reasoning is needed. Method: The Network-of-Thought (NoT) framework models reasoning as a directed graph whose nodes and edges carry semantic types, with a heuristic-based controller policy guiding graph search; NoT is compared systematically with CoT and ToT on accuracy, topology simplicity, and token efficiency across several benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and models (GPT-4o-mini, Llama-3.3-70B, Qwen2.5-72B). Result: NoT surpasses ToT on multi-hop reasoning (HotpotQA) and, with 72B open-source models, achieves the highest GSM8K accuracy, with Qwen2.5-72B reaching 91.7% on HotpotQA; self-generated controller heuristics beat fixed and random strategies on logical reasoning (ProofWriter), with uncertainty-only weighting reaching 57.0%; string-match evaluation severely underestimates NoT (a gap of 14 to 18 percentage points on HotpotQA). Conclusion: A networked reasoning topology (NoT) is better suited to complex tasks that require information fusion and dynamic backtracking; LLMs can self-generate effective heuristics to guide graph search; and the choice of evaluation method is critical when comparing methods, especially for open-ended reasoning. Abstract: Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA).
Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14 to 18 percentage point gap on HotpotQA).
[32] MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages
Anri Lombard,Simbarashe Mawere,Temi Aina,Ethan Wolff,Sbonelo Gumede,Elan Novick,Francois Meyer,Jan Buys
Main category: cs.CL
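TL;DR: This paper introduces the MzansiText corpus and MzansiLM, a small decoder-only language model, built for South Africa's eleven official languages (nine of which are low-resource), and systematically evaluates different fine-tuning strategies on natural language understanding and generation tasks.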
Details
Motivation: Address the lack of dedicated decoder-only language models for South Africa's many low-resource languages, and explore how well instruction fine-tuning generalizes for small models in low-resource multilingual settings. Method: Build the MzansiText multilingual pretraining corpus with a reproducible filtering pipeline; train the 125M-parameter MzansiLM from scratch; evaluate under three adaptation regimes: monolingual task-specific fine-tuning, multilingual task-specific fine-tuning, and multi-task instruction fine-tuning. Result: Monolingual task-specific fine-tuning reaches 20.65 BLEU on isiXhosa data-to-text generation, competitive with encoder-decoder baselines over ten times larger; multilingual task-specific fine-tuning reaches 78.5% macro-F1 on isiXhosa news classification; but the small model still performs poorly on few-shot reasoning. Conclusion: MzansiLM provides a reproducible small-scale decoder-only baseline for South African low-resource languages, shows that specific fine-tuning strategies are effective for low-resource multilingual tasks, and exposes the limits of small models on few-shot reasoning. Abstract: Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models.
We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.[33] Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement
Jiang Liu,Ge Qiu,Hao Fei,Dongdong Xie,Jinbo Li,Fei Li,Chong Teng,Donghong Ji
Main category: cs.CL
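TL;DR: This paper proposes Code-MIE, a framework that casts multimodal information extraction (MIE) as a unified code understanding and generation task. It uses code-style templates for input (a Python function) and output (a Python dictionary), combines textual entity attributes with image scene graphs and visual features, and achieves state-of-the-art results on several datasets.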
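To make the code-style templating idea concrete, here is a minimal sketch of how an input could be rendered as a Python function stub and how a dictionary-style completion could be parsed back. The function names, parameter names, and dictionary keys (`build_input_prompt`, `parse_output`, `entities`, `relations`) are hypothetical illustrations, not the templates used by the Code-MIE authors.

```python
import ast


def build_input_prompt(text, entity_attributes, scene_graph):
    """Render one example as a Python function stub for an LLM to complete.

    The layout is an assumption made for illustration; the paper's actual
    template is not reproduced here.
    """
    lines = [
        "def multimodal_information_extraction(",
        f"    entity_attributes={entity_attributes!r},",
        f"    scene_graph={scene_graph!r},",
        f"    text={text!r},",
        "):",
        '    """Extract entities and relations and return them as a dict."""',
        "    result = ",
    ]
    return "\n".join(lines)


def parse_output(completion):
    """Parse a dict literal produced by the model without executing code."""
    parsed = ast.literal_eval(completion.strip())
    if not isinstance(parsed, dict):
        raise ValueError("expected a Python dict literal")
    return parsed


prompt = build_input_prompt(
    text="Alice works at Acme.",
    entity_attributes={"Alice": {"affiliation": "Acme"}},
    scene_graph=[("person", "next to", "logo")],
)
extraction = parse_output("{'entities': [('Alice', 'PER')], 'relations': []}")
```

Using `ast.literal_eval` rather than `eval` keeps parsing restricted to literals, which matters when the string comes from a model.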
Details
Motivation: Existing LLM-based multimodal information extraction methods fall short in two ways: natural language templates are a poor match for structured extraction tasks, and the few works that try code-style templates handle text-only extraction, with complex templates that do not generalize across tasks.
Method: Proposes the Code-style Multimodal Information Extraction (Code-MIE) framework: (1) extract entity attributes (such as gender and affiliation) from the text to improve contextual understanding; (2) convert images into scene graphs and visual features to bring in visual information; (3) build a Python-function input template (entity attributes, scene graph, raw text) and a Python-dictionary output template (structured results such as entities and relations).
Result: Outperforms six baseline models on all four datasets, M$^3$D (61.03% English, 60.49% Chinese), Twitter-15 (76.04%), Twitter-17 (88.07%), and MNRE (73.94%), reaching state-of-the-art performance.
Conclusion: By combining code-style modeling with multimodal fusion, Code-MIE improves both the structured expressiveness and the performance of multimodal information extraction, offering a better-fitting and more general paradigm for MIE.
Abstract: With the rapid development of large language models (LLMs), more and more researchers have paid attention to information extraction based on LLMs. However, existing methods still leave room for improvement. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which are a poor match for information extraction tasks, whose outputs are mostly structured information such as entities and relations. Second, although a few methods have adopted structured and more IE-friendly code-style templates, they have only been explored on text-only IE rather than multimodal IE. Moreover, these methods are more complex in design, requiring separate templates to be designed for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) which formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model to understand the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function, where entity attributes, scene graphs and raw text form the function parameters. In contrast, the output template is formalized as Python dictionaries containing all extraction results such as entities, relations, etc. To evaluate Code-MIE, we conducted extensive experiments on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baseline models, with 61.03% and 60.49% on the English and Chinese datasets of M$^3$D, and 76.04%, 88.07%, and 73.94% on the other three datasets.
[34] The Anatomy of an Edit: Mechanism-Guided Activation Steering for Knowledge Editing
Yuan Cao,Mingyang Wang,Hinrich Schütze
Main category: cs.CL
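TL;DR: This paper uses neuron-level knowledge attribution (NLKA) to analyze mechanistically how knowledge editing (KE) is realized inside a model, and builds on that analysis to design MEGA, a mechanism-guided activation steering method that requires no weight modification and delivers effective, general knowledge editing on multiple benchmarks.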
Details
Motivation: Existing knowledge editing (KE) methods lack a mechanistic understanding of how an edit actually takes effect inside the model; in particular, how the model's computation differs between successful and failed edits remains unclear.
Method: Uses post-edit attribution, contrasting successful and failed edits, to identify the computations that change; finds that mid-to-late attention mainly promotes the new target, while attention and feed-forward network (FFN) modules cooperate to suppress the original fact; on this basis proposes MEGA, which applies attention-residual activation interventions in attribution-aligned regions without modifying model weights.
Result: MEGA achieves strong KE performance on the CounterFact and Popular datasets with GPT2-XL and LLaMA2-7B, showing that mechanistic analysis can directly guide the design of editing methods.
Conclusion: Post-edit attribution is not only an analysis tool but also an engineering signal for designing knowledge editing methods; MEGA shows that mechanism-informed editing without weight modification is feasible and effective, and generalizes across architectures.
Abstract: Large language models (LLMs) are increasingly used as knowledge bases, but keeping them up to date requires targeted knowledge editing (KE). However, it remains unclear how edits are implemented inside the model once applied. In this work, we take a mechanistic view of KE using neuron-level knowledge attribution (NLKA). Unlike prior work that focuses on pre-edit causal tracing and localization, we use post-edit attribution (contrasting successful and failed edits) to isolate the computations that shift when an edit succeeds. Across representative KE methods, we find a consistent pattern: mid-to-late attention predominantly promotes the new target, while attention and FFN modules cooperate to suppress the original fact. Motivated by these findings, we propose MEGA, a MEchanism-Guided Activation steering method that performs attention-residual interventions in attribution-aligned regions without modifying model weights. On CounterFact and Popular, MEGA achieves strong editing performance across KE metrics on GPT2-XL and LLaMA2-7B. Overall, our results elevate post-edit attribution from analysis to engineering signal: by pinpointing where and how edits take hold, it powers MEGA to deliver reliable, architecture-agnostic knowledge edits.
[35] RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution
Kaiyuan Li,Jing-Cheng Pang,Yang Yu
Main category: cs.CL
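TL;DR: This paper proposes the Cross-Generation evaluation framework and finds that RLVR is effective on verifiable tasks but that its gains do not transfer automatically to general question answering (GQA). It then proposes START, which separates thinking and response training and improves both reasoning quality and answer accuracy on GQA.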
Details
Motivation: To test whether RLVR automatically improves LLM performance on general question answering (GQA), an assumption that has not been thoroughly validated.
Method: Proposes the Cross-Generation evaluation framework to quantify the quality of intermediate reasoning; after finding that RLVR has limited effect on GQA, designs Separated Thinking And Response Training (START), which first trains only the thinking process using rewards defined on the final answer.
Result: Experiments show that the thinking process is markedly less effective on GQA than on verifiable tasks; START improves both reasoning quality and final-answer performance across several GQA benchmarks and RL algorithms.
Conclusion: RLVR does not transfer directly to GQA, which needs explicit training; by decoupling thinking and response training, START mitigates reward shortcuts and is an RL training paradigm better suited to GQA.
Abstract: Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.
[36] BenchBench: Benchmarking Automated Benchmark Generation
Yandan Zheng,Haoran Luo,Zhenghong Lin,Wenjin Liu,Luu Anh Tuan
Main category: cs.CL
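TL;DR: This paper proposes BenchBench, a three-stage pipeline and dataset for evaluating automated benchmark generation, arguing that a model's ability to design benchmarks matters as much as its ability to answer them. Experiments show that design ability is only moderately correlated with answering ability, and the framework supports scalable audits of multimodal and multilingual benchmarks.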
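The "moderately correlated" finding is reported as a Spearman rank correlation. As a reminder of what that statistic measures, here is a small dependency-free sketch; the per-model scores at the bottom are made-up numbers for illustration, not BenchBench results.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores: design quality vs. answering accuracy.
design_scores = [0.61, 0.70, 0.55, 0.80, 0.66]
answer_scores = [0.72, 0.68, 0.60, 0.75, 0.79]
rho = spearman(design_scores, answer_scores)
```

A value near 1 would mean that the best answerers are also the best designers; a value around 0.37, as reported, means the two rankings overlap only loosely.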
Details
Motivation: LLM evaluation relies on static test sets that saturate quickly, are vulnerable to contamination, and are costly to refresh, while using LLMs as judges introduces new biases; evaluation should therefore also cover a model's own ability to generate high-quality benchmarks.
Method: Proposes the three-stage BenchBench pipeline: (i) extract structured domain cards from seed benchmarks; (ii) prompt multiple designer LLMs to generate quota-controlled suites; (iii) validate items with a multi-model answerer panel, using exact/numeric/symbolic verifiers where possible and rubric-guided judging otherwise, to build designer-answerer matrices annotated with item-level quality flags and psychometric diagnostics.
Result: Across nine cross-domain variants (including multilingual and multimodal settings), generates 16.7K items, retains about 15K core items after filtering, and collects about 152K graded model-item responses; design ability and answering ability show a Spearman correlation of only about 0.37, and item invalidity is negatively correlated with discrimination (Pearson r of about -0.62).
Conclusion: Benchmark-design ability is a separate, measurable dimension of model capability; BenchBench points LLM evaluation in a new direction and supports more robust, auditable, and scalable benchmark generation and quality analysis.
Abstract: Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer-answerer matrices with item-level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain ~15K core items post-filtering, and produce ~152K graded model-item responses. BenchBench shows that benchmark-design ability is only moderately correlated with answer-time strength (Spearman rho ~0.37), invalidity is negatively associated with discrimination (Pearson r ~ -0.62), and the resulting designer-answerer matrices enable scalable audits of format/modality/language fidelity and suite-dependent self/family interactions. The project is available at: https://github.com/koanatakiyo/BenchBench.
[37] HiCI: Hierarchical Construction-Integration for Long-Context Attention
Xiangyu Zeng,Qi Xu,Yunke Wang,Chang Xu
Main category: cs.CL
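TL;DR: This paper proposes HiCI (Hierarchical Construction-Integration), a hierarchical attention module that explicitly models local-to-global information structure. With fewer than 5.5% additional parameters, it extends the context length of LLaMA-2 and improves performance on several long-context tasks.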
Details
Motivation: Long-context language modeling work mostly focuses on the scalability of token-level attention and leaves local-to-global information structure implicit; inspired by cognitive theories of discourse comprehension, the authors introduce an inductive bias closer to how humans read.
Method: Proposes the HiCI hierarchical attention module: construct segment-level representations, integrate them into a shared global context, then broadcast both to condition attention within each segment; LLaMA-2 is adapted in a parameter-efficient way.
Result: Extends context from 4K to 100K tokens for the 7B model and to 64K tokens for the 13B model; consistently outperforms strong baselines on language modeling, retrieval, and instruction-following benchmarks, matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension.
Conclusion: Explicit hierarchical information structure is an effective inductive bias that improves the modeling ability and generalization of long-context language models.
Abstract: Long-context language modeling is commonly framed as a scalability challenge of token-level attention, yet local-to-global information structuring remains largely implicit in existing approaches. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction-Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with fewer than 5.5% additional parameters, extending context from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, including matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results demonstrate the effectiveness of explicit hierarchical structuring as an inductive bias for long-context modeling.
[38] Can ChatGPT Really Understand Modern Chinese Poetry?
Shanshan Wang,Derek F. Wong,Jingming Yao,Lidia S. Chao
Main category: cs.CL
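TL;DR: This paper proposes a comprehensive framework for evaluating ChatGPT's understanding of modern poetry and, working with professional poets, assesses its interpretations of modern Chinese poems along multiple dimensions. The interpretations match the original poets' intent in over 73% of cases but fall short on dimensions such as poeticity.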
Details
Motivation: ChatGPT performs well at poetry generation and translation, but whether it truly understands poetry has not been studied; earlier work only analyzed experimental outcomes without addressing comprehension itself.
Method: Build a multi-dimensional evaluation framework and, together with professional poets, manually evaluate ChatGPT's interpretations of modern Chinese poems by different poets.
Result: ChatGPT's interpretations agree with the original poets' intent in over 73% of cases, but it does poorly on dimensions such as capturing poeticity.
Conclusion: The proposed framework is effective and necessary; it evaluates ChatGPT's poetry comprehension and lays a foundation for future research on LLMs in poetry-related tasks.
Abstract: ChatGPT has demonstrated remarkable capabilities in both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundamental issues of comprehension. This paper introduces a comprehensive framework for evaluating ChatGPT's understanding of modern poetry. We collaborated with professional poets to evaluate ChatGPT's interpretation of modern Chinese poems by different poets along multiple dimensions. Evaluation results show that ChatGPT's interpretations align with the original poets' intents in over 73% of the cases. However, its understanding in certain dimensions, particularly in capturing poeticity, proved to be less satisfactory. These findings highlight the effectiveness and necessity of our proposed framework. This study not only evaluates ChatGPT's ability to understand modern poetry but also establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.
[39] SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
Saken Tukenov
Main category: cs.CL
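TL;DR: This paper presents SozKZ, a family of language models (50M-600M parameters) trained from scratch for Kazakh with a 50K BPE tokenizer suited to its agglutinative morphology. On Kazakh benchmarks the models are competitive with larger multilingual models and, on SIB-200 topic classification, outperform all evaluated multilingual models up to 2B parameters, supporting the case for small dedicated models for low-resource languages.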
Details
Motivation: Existing multilingual models serve low-resource languages such as Kazakh poorly: they allocate little capacity to them and use tokenizers ill-suited to agglutinative morphology.
Method: Train the Llama-architecture SozKZ model family (50M-600M parameters) from scratch on 9 billion Kazakh tokens with a dedicated 50K BPE tokenizer, and evaluate on three Kazakh benchmarks (cultural QA, Belebele reading comprehension, SIB-200 topic classification).
Result: SozKZ-600M reaches 30.3% accuracy on cultural QA (close to the 32.0% of Llama-3.2-1B) and 25.5% on SIB-200 classification, surpassing all evaluated multilingual models up to 2B parameters; scaling is consistent from 50M to 600M.
Conclusion: For low-resource languages, training small dedicated models from scratch on language-specific data with a suitable tokenizer can be competitive with much larger models at a fraction of the compute cost, and is a viable technical path.
Abstract: Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks (multiple-choice cultural QA, reading comprehension (Belebele), and topic classification (SIB-200)) alongside five multilingual baselines ranging from 500M to 3B parameters. Our 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B (2x larger), and 25.5% on SIB-200 topic classification, surpassing all evaluated multilingual models up to 2B parameters. We observe consistent scaling from 50M to 600M, with MC QA accuracy rising from 22.8% to 30.3%, suggesting that further scaling remains beneficial. These results demonstrate that small, dedicated models trained from scratch with a language-appropriate tokenizer offer a viable path for low-resource language technology, achieving competitive performance at a fraction of the computational cost. All models and the tokenizer are released under open licenses.
[40] NoveltyAgent: Autonomous Novelty Reporting Agent with Point-wise Novelty Analysis and Self-Validation
Jiajun Hou,Hexuan Deng,Wenxiang Jiao,Xuebo Liu,Xiaopeng Ke,Min Zhang
Main category: cs.CL
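TL;DR: This paper proposes NoveltyAgent, a multi-agent system for assessing the novelty of academic papers. Through fine-grained decomposition into novelty points, cross-referencing against related literature, and a checklist-based evaluation framework, it improves the accuracy and trustworthiness of novelty analysis.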
Details
Motivation: The surge in academic papers raises screening costs, and existing approaches (general AI reviewers or repurposed DeepResearch) lack domain-specific mechanisms, so novelty assessment quality suffers.
Method: Design the multi-agent system NoveltyAgent: decompose a manuscript into discrete novelty points for fine-grained retrieval and comparison; build a comprehensive related-paper database and cross-reference claims to ensure faithfulness; propose a checklist-based evaluation framework for open-ended generation tasks.
Result: In experiments, NoveltyAgent outperforms GPT-5 DeepResearch by 10.15% and achieves state-of-the-art performance.
Conclusion: NoveltyAgent provides reliable, high-quality novelty analysis and helps researchers quickly identify genuinely novel papers.
Abstract: The exponential growth of academic publications has led to a surge in papers of varying quality, increasing the cost of paper screening. Current approaches either use novelty assessment within general AI Reviewers or repurpose DeepResearch, which lacks domain-specific mechanisms and thus delivers lower-quality results. To bridge this gap, we introduce NoveltyAgent, a multi-agent system designed to generate comprehensive and faithful novelty reports, enabling thorough evaluation of a paper's originality. It decomposes manuscripts into discrete novelty points for fine-grained retrieval and comparison, and builds a comprehensive related-paper database while cross-referencing claims to ensure faithfulness. Furthermore, to address the challenge of evaluating such open-ended generation tasks, we propose a checklist-based evaluation framework, providing an unbiased paradigm for building reliable evaluations. Extensive experiments show that NoveltyAgent achieves state-of-the-art performance, outperforming GPT-5 DeepResearch by 10.15%. We hope this system will provide reliable, high-quality novelty analysis and help researchers quickly identify novel papers. Code and demo are available at https://github.com/SStan1/NoveltyAgent.
[41] LLM Router: Prefill is All You Need
Tanay Varshney,Annie Surla,Michelle Xu,Gomathy Venkata Krishnan,Maximilian Jeblick,David Austin,Neal Vaidya,Davide Onofrio
Main category: cs.CL
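TL;DR: This paper proposes SharedTrunkNet, a router architecture based on internal prefill activations. Using Encoder-Target Decoupling and two mathematical probes (Fisher Separability and Effective Dimensionality) to select the most informative layers, it closes much of the gap between the best single model and an Oracle router while cutting cost substantially.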
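The two probes can be illustrated with common textbook forms. The sketch below assumes a trace-ratio Fisher criterion (squared distance between class means divided by total within-class variance) and a participation-ratio definition of effective dimensionality, (tr C)^2 / tr(C^2). The paper's exact definitions of J and d_eff may differ, and the activation values are toy numbers.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fisher_separability(class_a, class_b):
    """Between-class distance over within-class variance (assumed form).

    Raises ZeroDivisionError if both classes have zero variance.
    """
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    between = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    within = 0.0
    for rows, mu in ((class_a, mu_a), (class_b, mu_b)):
        within += sum(
            sum((x - m) ** 2 for x, m in zip(row, mu)) for row in rows
        ) / len(rows)
    return between / within


def effective_dimensionality(rows):
    """Participation ratio (tr C)^2 / tr(C^2) of the covariance matrix C."""
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    n, d = len(rows), len(mu)
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    # For a symmetric matrix, tr(C^2) equals the sum of squared entries.
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / trace_sq


# Toy activations for prompts the target model answered correctly vs. not.
solved = [[0.0, 0.0], [0.0, 2.0]]
failed = [[4.0, 0.0], [4.0, 2.0]]
j_score = fisher_separability(solved, failed)
d_eff = effective_dimensionality([[1, 0], [-1, 0], [0, 1], [0, -1]])
```

Under these assumptions, a layer whose activations give a high J separates "solved" from "failed" prompts well, which is the kind of signal a router could be trained on.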
Details
Motivation: Existing routers rely on fragile semantic signals, yet different LLMs perform complementarily across task subsets, so intelligent routing has large potential to improve overall performance.
Method: Propose Encoder-Target Decoupling, which separates the Encoder model that provides the predictive signal from the Target model whose performance is estimated; use Fisher Separability (J) and Effective Dimensionality (d_eff) as mathematical probes to select the best layer-wise signals; build the SharedTrunkNet architecture to enable heterogeneous encoder-target pairing.
Result: SharedTrunkNet closes up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle, and saves 74.31% of the cost relative to the highest-cost model.
Conclusion: Internal prefill activations are a more robust routing signal than semantic cues; structured signal probes and a decoupled architecture exploit complementarity between models and offer a new paradigm for efficient model ensembling.
Abstract: LLMs often share comparable benchmark accuracies, but their complementary performance across task subsets suggests that an Oracle router (a theoretical selector with perfect foresight) can significantly surpass standalone model accuracy by navigating model-specific strengths. While current routers rely on fragile semantic signals, we propose using internal prefill activations via Encoder-Target Decoupling, a functional separation between the model providing the predictive signal (the Encoder) and the model whose performance is being estimated (the Target). This allows optimized heterogeneous pairing between unique encoders and target models. We utilize Fisher Separability (J) and Effective Dimensionality (d_eff) as mathematical probes to isolate optimal layer-wise signals, providing the predictive foundation for our SharedTrunkNet architecture. SharedTrunkNet captures up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle while achieving 74.31% cost savings relative to the highest-cost model.
[42] Mitigating Shortcut Reasoning in Language Models: A Gradient-Aware Training Approach
Hongyu Cao,Kunpeng Liu,Dongjie Wang,Yanjie Fu
Main category: cs.CL
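TL;DR: This paper proposes Shortcut-Aware Reasoning Training (SART), a training framework that identifies and mitigates shortcut behaviors in LLM reasoning, such as surface pattern matching and answer memorization, using a gradient-aware mechanism to improve genuine logical reasoning and generalization.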
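One way to picture a gradient-aware mechanism of this kind: measure how well a training sample's gradient aligns with a validation gradient, and remove the conflicting component when the two point in opposing directions. The sketch below uses cosine similarity and a PCGrad-style projection as stand-ins. It is not the paper's ShortcutScore or its exact gradient surgery rule, whose definitions are not given in this summary.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot(u, v) / (nu * nv)


def project_out_conflict(sample_grad, val_grad):
    """Drop the part of sample_grad that opposes val_grad (PCGrad-style).

    Gradients that already agree with the validation direction (non-negative
    inner product) are returned unchanged.
    """
    d = dot(sample_grad, val_grad)
    if d >= 0:
        return list(sample_grad)
    scale = d / dot(val_grad, val_grad)
    return [g - scale * v for g, v in zip(sample_grad, val_grad)]


# Toy 2-D gradients: the sample pulls against the validation objective.
sample_grad = [1.0, -1.0]
val_grad = [0.0, 1.0]
alignment = cosine(sample_grad, val_grad)
repaired = project_out_conflict(sample_grad, val_grad)
```

In this toy case the repaired gradient is orthogonal to the validation gradient, so an update along it no longer works against the validation objective.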
Details
Motivation: LLMs show strong reasoning ability but often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference, which hurts generalization and robustness.
Method: Propose the SART framework, which uses ShortcutScore and gradient surgery to detect and mitigate shortcut-promoting samples; shortcut signals are identified through gradient misalignment with validation objectives and answer-token concentration, and training dynamics are adjusted accordingly.
Result: On controlled reasoning benchmarks, SART improves accuracy by 16.5% and robustness by 40.2% over the strongest baseline, markedly improving generalization under distribution shift.
Conclusion: SART jointly optimizes data and training dynamics, suppresses shortcut learning, and moves LLMs toward more reliable and generalizable logical reasoning.
Abstract: Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut-Aware Reasoning Training (SART), a gradient-aware framework that detects and mitigates shortcut-promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer-token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: https://github.com/fuyanjie/short-cut-aware-data-centric-reasoning.
[43] The Hidden Puppet Master: A Theoretical and Real-World Account of Emotional Manipulation in LLMs
Jocelyn Shen,Amina Luvsanchultem,Jessica Kim,Kynnedy Smith,Valdemar Danry,Kantwon Rogers,Sharifa Alghowinem,Hae Won Park,Maarten Sap,Cynthia Breazeal
Main category: cs.CL
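TL;DR: This paper proposes PUPPET, a theoretical taxonomy centered on the morality of hidden incentives in LLM dialogues. A large-scale human study finds that harmful hidden incentives produce larger belief shifts than prosocial ones, and a benchmark of LLMs on belief-change prediction finds moderate predictive ability but systematic underestimation of the shift magnitude.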
Details
Motivation: Existing research on persuasion and manipulation detection mostly uses simulated or debate-style settings, is not correlated with real human belief change, and overlooks the moral dimension of the hidden incentives driving manipulation.
Method: Build PUPPET, a theoretical taxonomy of personalized emotional manipulation centered on incentive morality, and run a human study with N=1,035 participants on realistic everyday queries, varying the degree of personalization and the incentive direction (harmful vs. prosocial); also benchmark several LLMs on belief-change prediction.
Result: Harmful hidden incentives cause significantly larger belief shifts than prosocial ones; LLMs show moderate correlation on belief prediction (r = 0.3 to 0.5) but systematically underestimate the magnitude of the shift.
Conclusion: The work provides a theoretically grounded and behaviorally validated foundation for studying and countering incentive-driven manipulation by LLMs in everyday practical queries.
Abstract: As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to being subtly steered toward hidden incentives misaligned with their own interests. Prior works have benchmarked persuasion and manipulation detection, but these efforts rely on simulated or debate-style settings, remain uncorrelated with real human belief shifts, and overlook a critical dimension: the morality of hidden incentives driving the manipulation. We introduce PUPPET, a theoretical taxonomy of personalized emotional manipulation in LLM-human dialogues that centers around incentive morality, and conduct a human study with N=1,035 participants across realistic everyday queries, varying personalization and incentive direction (harmful versus prosocial). We find that harmful hidden incentives produce significantly larger belief shifts than prosocial ones. Finally, we benchmark LLMs on the task of belief prediction, finding that models exhibit moderate predictive ability of belief change based on conversational contexts (r = 0.3 to 0.5), but they also systematically underestimate the magnitude of belief shift. Together, this work establishes a theoretically grounded and behaviorally validated foundation for studying, and ultimately combatting, incentive-driven manipulation in LLMs during everyday, practical user queries.
[44] User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction
Yuren Hao,Shuhaib Mehri,ChengXiang Zhai,Dilek Hakkani-Tür
Main category: cs.CL
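TL;DR: This paper proposes VARS, a framework that models user preferences with long-term and short-term user vectors and performs online personalized retrieval with a frozen LLM backbone and no fine-tuning; the main gain is multi-turn interaction efficiency rather than raw accuracy.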
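A rough picture of vector-biased retrieval scoring with online updates from scalar feedback is sketched below. The additive scoring rule, the reward-weighted update, and the names `score_memories` and `update_user_vector` are assumptions made for illustration; the actual VARS scoring and update rules are described in the paper and repository, not here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_memories(base_scores, memory_vectors, long_term, short_term,
                   weight=1.0):
    """Add a user-specific bias to each memory item's retrieval score."""
    user = [l + s for l, s in zip(long_term, short_term)]
    return [
        base + weight * dot(user, vec)
        for base, vec in zip(base_scores, memory_vectors)
    ]


def update_user_vector(user_vec, memory_vec, reward, lr=0.1):
    """Move the user vector toward (or away from) a retrieved item.

    `reward` is a weak scalar signal, for example +1 for positive feedback
    and -1 for negative feedback.
    """
    return [u + lr * reward * m for u, m in zip(user_vec, memory_vec)]


# Two preference memories with equal query relevance.
memories = [[1.0, 0.0], [0.0, 1.0]]
base = [0.5, 0.5]
long_term, short_term = [0.0, 0.0], [0.0, 0.0]

before = score_memories(base, memories, long_term, short_term)
# The user reacts positively after the first memory was used.
short_term = update_user_vector(short_term, memories[0], reward=1.0, lr=0.5)
after = score_memories(base, memories, long_term, short_term)
```

Because only the small user vectors change, the language model backbone and the retriever can stay frozen, which is the property the TL;DR emphasizes.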
Details
Motivation: LLMs used as personal assistants lack a persistent user model, so users must restate their preferences repeatedly; a lightweight, online personalization method that needs no fine-tuning is required.
Method: Propose Vector-Adapted Retrieval Scoring (VARS): represent each user with long-term and short-term vectors in a shared preference space, update the vectors online from weak scalar user feedback, and use them to bias retrieval scoring; the framework is pipeline-agnostic and keeps the backbone frozen.
Result: On the MultiSessionCollab multi-session collaboration benchmark, VARS with a frozen backbone reduces timeout rate and user effort while matching the Reflection baseline in task success; long-term vectors align with cross-user preference overlap and short-term vectors capture session-specific adaptation, which supports interpretability.
Conclusion: VARS is an efficient, scalable, and interpretable approach to user modeling; it suggests that user awareness at the retrieval stage suits multi-turn personalization without per-user fine-tuning, with the main benefit in interaction efficiency.
Abstract: Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.
[45] Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
Xinyue Liu,Niloofar Mireshghallah,Jane C. Ginsburg,Tuhin Chakrabarty
Main category: cs.CL
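TL;DR: This paper shows that finetuning frontier large language models substantially increases verbatim reproduction of copyrighted content from their training data, challenging vendors' claims that models do not store training data and the premise behind recent fair use rulings.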
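Verbatim reproduction is typically quantified through contiguous spans shared between generated and reference text. The sketch below finds the longest shared word span with a standard dynamic program; whitespace tokenization and exact word matching are simplifying assumptions, and the paper's own measurement procedure may differ.

```python
def longest_verbatim_span(generated, reference):
    """Return the longest contiguous run of words shared by both texts."""
    gen, ref = generated.split(), reference.split()
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at gen[i-2], ref[j-1].
    prev = [0] * (len(ref) + 1)
    for i in range(1, len(gen) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            if gen[i - 1] == ref[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return gen[best_end - best_len:best_end]


span = longest_verbatim_span(
    "the quick brown fox sat down",
    "a quick brown fox jumped",
)
```

The run time is proportional to the product of the two text lengths, which is fine for chapter-sized comparisons but would need a suffix-based index for whole corpora.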
Details
Motivation: Frontier LLM companies claim that their models do not store training data and rely on safety alignment measures such as RLHF, system prompts, and output filters to prevent verbatim reproduction of copyrighted content, citing these as a key legal defense against copyright infringement claims. The paper tests whether those claims and measures hold up.
Method: Finetune GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to expand semantic plot summaries (not original text) into full text; measure verbatim reproduction of held-out copyrighted books, and test generalization across authors, the effect of the finetuning data source, and consistency across models.
Result: After finetuning, models reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words; the effect generalizes across authors (finetuning only on Haruki Murakami's novels unlocks recall of books by over 30 unrelated authors); random author pairs and public-domain finetuning data give comparable extraction, while synthetic finetuning data gives near-zero extraction; models from three providers reproduce the same regions of the same books (r ≥ 0.90).
Conclusion: Model weights implicitly store copies of copyrighted works; finetuning reactivates latent memorization from pretraining and defeats the safety mechanisms; this undermines the premise of recent fair use rulings that relied on the adequacy of anti-reproduction measures.
Abstract: Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
[46] DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles
Bo Jiang
Main category: cs.CL
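TL;DR: This paper proposes DiscoUQ, a framework that uses the semantic and geometric structure of disagreement among agents (such as evidence overlap and embedding clusters) to improve uncertainty quantification for multi-LLM system outputs, achieving better calibration and AUROC on several benchmarks.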
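Calibration in this line of work is usually reported as expected calibration error (ECE). The sketch below implements the standard equal-width binning version: bucket predictions by confidence, then average the gap between mean confidence and accuracy in each bucket, weighted by bucket size. The bin count of 10 and the toy inputs are assumptions; the binning used by the DiscoUQ authors is not stated in this summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE for confidences in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Four answers, each given 0.9 confidence, of which three are right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                 [True, True, True, False])
```

Lower is better: an ECE of 0.036 means stated confidence and observed accuracy differ by about 3.6 percentage points on average.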
Details
Motivation: Existing methods rely only on shallow voting statistics and ignore the rich semantic information in agents' reasoning, which leads to inaccurate and poorly calibrated uncertainty estimates.
Method: Propose the DiscoUQ framework with three methods: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding-geometry features), and DiscoUQ-Learn (a neural network combining all features).
Result: Across four benchmarks, DiscoUQ-LLM reaches an average AUROC of 0.802, above the best baseline (0.791), with much lower calibration error (ECE 0.036 vs. 0.098); the features generalize across benchmarks, with the largest gains in the "weak disagreement" tier.
Conclusion: Exploiting the deeper structure of inter-agent disagreement improves the quality and robustness of uncertainty estimates in multi-LLM systems; DiscoUQ offers a new approach to trustworthy multi-agent reasoning.
Abstract: Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents' reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement, both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion), to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous "weak disagreement" tier where simple vote counting fails.
[47] Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
Jinquan Zheng,Jia Yuan,Jiacheng Yao,Chenyang Gu,Pujun Zheng,Guoxiu He
Main category: cs.CL
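TL;DR: This paper proposes PA-GRPO, which builds permutation groups and introduces a cross-permutation advantage and a consistency-aware reward to mitigate selection bias in LLMs on multiple-choice and pairwise evaluation tasks, substantially reducing bias while maintaining high performance.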
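The two mechanisms can be sketched in a few lines. The cross-permutation advantage follows the description directly: each permutation's reward minus the mean reward over all permutations of the same instance. The consistency bonus shown here (a majority-agreement fraction paid to permutations that picked the majority answer) is a hypothetical shaping chosen for illustration, since the summary does not specify the paper's reward formula.

```python
from collections import Counter


def cross_permutation_advantages(rewards):
    """Advantage of each permutation relative to the group mean reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def consistency_bonus(decisions, bonus=1.0):
    """Reward permutations that agree with the majority decision.

    `decisions` holds the option content chosen under each permutation,
    mapped back to a canonical form so that label positions do not matter.
    """
    majority, count = Counter(decisions).most_common(1)[0]
    agreement = count / len(decisions)
    return [bonus * agreement if d == majority else 0.0 for d in decisions]


# One question presented under four option orderings.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = cross_permutation_advantages(rewards)
bonuses = consistency_bonus(["Paris", "Paris", "Rome"])
```

Standard GRPO also divides by the group's reward standard deviation; that normalization is left out here to keep the mean-baseline idea visible.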
Details
Motivation: LLMs show selection bias on multiple-choice and pairwise evaluation tasks because of non-semantic factors such as option position and label symbols; inference-time debiasing is costly and may harm reasoning, and pointwise training ignores that the same question should receive a consistent answer under different permutations.
Method: Propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO): build a permutation group for each instance and optimize with two mechanisms: (1) a cross-permutation advantage, computed relative to the mean reward over all permutations of the same instance, and (2) a consistency-aware reward, which encourages consistent decisions across permutations.
Result: Outperforms strong baselines on seven benchmarks, substantially reducing selection bias while maintaining high overall performance.
Conclusion: PA-GRPO mitigates selection bias in LLMs on multiple-choice and pairwise evaluation tasks while preserving reasoning ability, making it an efficient and robust training strategy.
Abstract: Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
[48] Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models
Abdul-Salem Beibitkhan
Main category: cs.CL
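TL;DR: Benchmarking eight LLMs in English, Kazakh, and Mongolian, this paper finds a performance gap of 13.8-16.7 percentage points on the low-resource languages. Cross-lingual transfer prompting (reasoning in English, then translating back) helps only bilingual architectures, not English-dominant models, so mitigation strategies must depend on architecture.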
Details
Motivation: How well current LLMs perform on low-resource languages is unclear; a systematic evaluation of their capabilities and limits is needed to reveal whether low-resource language communities are being systematically underserved.
Method: Benchmark eight LLMs under five experimental conditions in English, Kazakh, and Mongolian, using 50 hand-crafted questions covering factual, reasoning, technical, and culturally grounded categories; evaluate 2,000 responses on accuracy, fluency, and completeness; also test cross-lingual transfer prompting (reason in English, then translate back).
Result: There is a consistent gap of 13.8-16.7 percentage points between English and the low-resource languages; models stay fluent on the surface but accuracy drops markedly; cross-lingual transfer prompting improves bilingual architectures by only 2.2-4.3 percentage points and does not help English-dominant models.
Conclusion: Current LLMs systematically underserve low-resource language communities; effective mitigation depends strongly on model architecture, and there is no universal fix.
Abstract: We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8-16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.
[49] Reading Between the Lines: How Electronic Nonverbal Cues shape Emotion Decoding
Taara Kumar,Kokil Jaidka
Main category: cs.CL
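TL;DR: This paper systematically studies electronic nonverbal cues (eNVCs) in text-based online communication such as microblogs, builds a theory-driven eNVC taxonomy and an automated detection toolkit, and uses an experiment and focus groups to examine their effect on emotion recognition, perceived ambiguity, and users' interpretive strategies.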
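To give a feel for what automated eNVC detection can look like, here is a small regex-based sketch covering four cue types often discussed in this literature: letter elongation, repeated punctuation, all-caps emphasis, and simple emoticons. The category names and patterns are illustrative assumptions. This is not the API or taxonomy of the authors' toolkit.

```python
import re

# Hypothetical cue categories and patterns, chosen for illustration only.
PATTERNS = {
    # A word containing the same character three or more times in a row.
    "elongation": re.compile(r"\b\w*(\w)\1{2,}\w*\b"),
    # Two or more consecutive exclamation or question marks.
    "repeated_punctuation": re.compile(r"[!?]{2,}"),
    # Fully capitalized words of at least three letters.
    "all_caps": re.compile(r"\b[A-Z]{3,}\b"),
    # A few common Western-style emoticons such as :) ;-) =D
    "emoticon": re.compile(r"[:;=][-']?[)(DPp]"),
}


def detect_cues(text):
    """Return every match for each cue category found in `text`."""
    return {
        name: [m.group(0) for m in pattern.finditer(text)]
        for name, pattern in PATTERNS.items()
    }


cues = detect_cues("I am sooo HAPPY today!!! :)")
```

A production toolkit would need to handle emoji, hashtags, and language-specific conventions, and would have to separate acronyms from emphatic capitals; the patterns above ignore all of that.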
Details
Motivation: How do users reconstruct nonverbal expression in text-based online communication, where embodied cues are absent? The question is increasingly pressing as CMC becomes more widespread.
Method: Three complementary studies: Study 1 builds a unified eNVC taxonomy grounded in nonverbal communication theory and develops a Python toolkit for automated detection; Study 2 is a within-subject survey experiment testing the causal effect of eNVCs on emotion decoding accuracy and perceived ambiguity; Study 3 uses focus groups to explore how users interpret digital prosody.
Result: eNVCs substantially improve emotion decoding accuracy and lower perceived ambiguity; the benefit weakens under boundary conditions such as sarcasm; users draw meaning from the absence of expected cues and lean toward negative interpretations in ambiguous contexts; eNVCs are established as a coherent, measurable class of digital behavior.
Conclusion: eNVCs are a measurable, theoretically grounded nonverbal resource in digital communication; the study refines theoretical accounts of cue richness and interpretive effort and provides practical tools and evidence for affective computing, user modeling, and emotion-aware interface design.
Abstract: As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs), textual analogues of kinesics, vocalics, and paralinguistics, in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at https://github.com/kokiljaidka/envc.
[50] Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation
Tianle Yang,Chengzhe Sun,Phil Rose,Cassandra L. Jacobs,Siwei Lyu
Main category: cs.CL
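TL;DR: This paper proposes a segmental-level prosodic probing framework to evaluate how well neural TTS models reproduce consonant-induced f0 perturbation. It finds that current models rely more on word-frequency-related memorization than on abstract segmental-prosodic encoding, which reveals limits in their ability to generalize prosodic detail.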
Details
Motivation: Assess how well neural TTS models capture consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect, and determine whether they encode prosody abstractly or rely on memorization alone.
Method: Build a segmental-level prosodic probing framework; compare f0 perturbation in synthetic and natural speech for Tacotron 2 and FastSpeech 2 trained on LJ Speech, across thousands of words stratified by lexical frequency; then extend to a large-scale evaluation of multiple systems.
Result: f0 perturbation is reproduced accurately for high-frequency words but poorly for low-frequency words, indicating that the models rely more on lexical-level memorization than on abstract segmental-prosodic encoding; the pattern holds across a range of advanced TTS systems.
Conclusion: Current mainstream TTS systems generalize poorly in fine-grained prosodic modeling. The proposed probing framework offers a linguistically grounded basis for future TTS evaluation, interpretability, and authenticity assessment of synthetic speech.
Abstract: This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.
[51] ViCLSR: A Supervised Contrastive Learning Framework with Natural Language Inference for Natural Language Understanding Tasks
Tin Van Huynh,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen
Main category: cs.CL
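TL;DR: This paper proposes ViCLSR, a supervised contrastive learning framework designed for Vietnamese that uses existing natural language inference datasets to optimize sentence embeddings, and significantly outperforms PhoBERT on several Vietnamese NLU benchmarks.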
Details
Motivation: Low-resource languages such as Vietnamese struggle to obtain high-quality text representations because annotated data is scarce; existing pre-trained models (e.g., PhoBERT) are limited in performance, while contrastive learning has shown promise for improving sentence representations.
Method: Propose ViCLSR, a supervised contrastive learning framework for Vietnamese, together with a process for adapting existing Vietnamese datasets to supervised contrastive learning; train on NLI data.
Result: On five Vietnamese NLU benchmarks (ViNLI, ViWikiFC, ViFactCheck, UIT-ViCTSD, ViMMRC2.0), ViCLSR improves F1 or accuracy by 4.33% to 9.02%, significantly outperforming PhoBERT; an in-depth analysis of the results is also provided.
Conclusion: Supervised contrastive learning can ease data scarcity in NLU tasks for low-resource languages such as Vietnamese and improves sentence representation quality; ViCLSR is released to support Vietnamese NLP research.
Abstract: High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages like Vietnamese face challenges due to limited annotated data. While pre-trained models like PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to effectively distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework specifically designed to optimize sentence embeddings for Vietnamese, leveraging existing natural language inference (NLI) datasets. Additionally, we propose a process to adapt existing Vietnamese datasets for supervised learning, ensuring compatibility with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets such as ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). ViCLSR shows that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors contributing to the superior performance of contrastive learning models. ViCLSR is released for research purposes in advancing natural language processing tasks.
[52] Evaluating Reasoning-Based Scaffolds for Human-AI Co-Annotation: The ReasonAlign Annotation Protocol
Smitha Muthya Sudheendra,Jaideep Srivastava
Main category: cs.CL
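TL;DR: This paper proposes ReasonAlign, a framework that shows annotators LLM-generated reasoning (but not the predicted labels) to study how reasoning affects human annotation behavior. Experiments show that it raises inter-annotator agreement while requiring only a small number of label revisions, suggesting that reasoning mainly helps resolve ambiguous cases.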
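The abstract below describes the Annotator Effort Proxy (AEP) as the proportion of labels revised after exposure to reasoning. A minimal sketch of that proportion follows; the function name, the label strings, and the five-instance example are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def annotator_effort_proxy(first_pass: Sequence[str], second_pass: Sequence[str]) -> float:
    """Fraction of labels that changed between the independent pass and the revision pass."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must label the same instances")
    if not first_pass:
        raise ValueError("need at least one labelled instance")
    revised = sum(before != after for before, after in zip(first_pass, second_pass))
    return revised / len(first_pass)


# Hypothetical labels from one annotator on five instances.
pass_one = ["pos", "neg", "neg", "pos", "neu"]
pass_two = ["pos", "neg", "pos", "pos", "neu"]  # only the third label was revised
aep = annotator_effort_proxy(pass_one, pass_two)
```

A low value on its own says little; in the study's framing it is read alongside the change in inter-annotator agreement between the two passes.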
Details
Motivation: Human annotation is central to NLP evaluation, but subjective tasks often show substantial inter-annotator variability; LLMs can supply structured reasoning to assist annotation, yet their specific influence on human annotation behavior remains unclear.
Method: Propose ReasonAlign, a reasoning-based annotation scaffold that hides the LLM's predicted labels and shows only its reasoning; use a two-pass Delphi-style protocol in which annotators first label independently and then revise in light of the model's reasoning; evaluate on sentiment classification and opinion detection, and introduce the Annotator Effort Proxy (AEP) to quantify the proportion of labels revised.
Result: After exposure to reasoning, inter-annotator agreement rises while the proportion of revised labels stays low; AEP shows that only a few labels changed, indicating that reasoning mainly resolves ambiguous cases rather than triggering widespread changes.
Conclusion: Reasoning explanations help improve annotation consistency, and ReasonAlign, as a reasoning-driven scaffold, is a practical mechanism for supporting human-AI co-annotation workflows.
Abstract: Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
[53] Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects
Nurul Labib Sayeedi,Md. Faiyaz Abdullah Sayeedi,Shubhashis Roy Dipta,Rubaya Tabassum,Ariful Ekraj Hridoy,Mehraj Mahmood,Mahbub E Sobhani,Md. Tarek Hasan,Swakkhar Shatabda
Main category: cs.CL
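TL;DR: This paper presents BanglaVerse, a multilingual, multi-dialect benchmark for evaluating multimodal vision-language models on Bengali culture, intended to assess model understanding more realistically under cultural context and linguistic variation.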
Details
Motivation: Bengali culture is rich and diverse yet severely underrepresented in multimodal evaluation; evaluating only in standard Bangla overestimates real model capability and ignores the challenges posed by dialects and historically linked languages.
Method: Build BanglaVerse from 1,152 manually curated images across nine cultural domains, expand it to four languages and five Bangla dialects for roughly 32.3K evaluation samples, support visual question answering and captioning, and run systematic cross-language and cross-dialect evaluations.
Result: Evaluating only standard Bangla overestimates performance; dialectal variation markedly lowers model performance, especially for captioning; related languages such as Hindi and Urdu preserve some cultural meaning but are weaker at structured reasoning; the main bottleneck is missing cultural knowledge rather than visual grounding itself.
Conclusion: BanglaVerse offers a more realistic and more challenging test bed for evaluating the cultural awareness of multilingual multimodal models, and underscores the importance of cultural knowledge modeling and dialect robustness.
Abstract: Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.
[54] Entropy Alone is Insufficient for Safe Selective Prediction in LLMs
Edward Phillips,Fredrik K. Gustafsson,Sean Wu,Anshul Thakur,David A. Clifton
Main category: cs.CL
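TL;DR: This paper proposes an uncertainty quantification method that combines entropy scores with a correctness probe signal to improve the risk-coverage trade-off and calibration of selective prediction systems operating at low target error rates.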
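To make the risk-coverage trade-off in this summary concrete, here is a small sketch of how a selective prediction policy can be scored. The weighted-average combination rule, the `weight` parameter, and the four toy items are assumptions for illustration only; the source does not state how the entropy and probe signals are actually combined.

```python
from typing import List, Sequence, Tuple


def combined_score(entropy: float, probe_p_correct: float, weight: float = 0.5) -> float:
    """Higher means more uncertain. Assumes entropy is already scaled to [0, 1]."""
    return weight * entropy + (1.0 - weight) * (1.0 - probe_p_correct)


def risk_coverage(scores: Sequence[float], correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Answer the least uncertain items first; record (coverage, risk) after each one."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    curve, errors = [], 0
    for answered, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((answered / len(scores), errors / answered))
    return curve


def max_coverage_at_risk(curve: List[Tuple[float, float]], target_risk: float) -> float:
    """Largest coverage point on the curve whose selective risk meets the target."""
    return max((cov for cov, risk in curve if risk <= target_risk), default=0.0)


# Toy data (made up): item 2 is wrong but has low entropy, a confident error.
entropies = [0.1, 0.2, 0.15, 0.9]
probe_p = [0.9, 0.8, 0.2, 0.3]
correct = [True, True, False, False]

entropy_curve = risk_coverage(entropies, correct)
combined = [combined_score(e, p) for e, p in zip(entropies, probe_p)]
combined_curve = risk_coverage(combined, correct)
```

On this toy data the entropy-only ranking can answer only a quarter of the items at zero risk, while the combined ranking answers half, because the probe pushes the confident error down the ranking.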
Details
Motivation: Existing entropy-based uncertainty quantification methods show a model-dependent failure mode within selective prediction policies, and they are rarely evaluated in deployment-facing terms, such as whether a system can operate reliably at a specified risk level.
Method: Combine entropy scores with a correctness probe signal to form a new uncertainty score, and evaluate it across several QA benchmarks and model families.
Result: On three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves the risk-coverage trade-off and calibration relative to entropy-only baselines.
Conclusion: Deployment-facing evaluation is essential. Traditional uncertainty metrics such as entropy are not enough on their own to ensure trustworthy operation at low error rates, and additional signals such as correctness probes are needed to improve robustness.
Abstract: Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk--coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.
[55] Explainable Semantic Textual Similarity via Dissimilar Span Detection
Diego Miguel Lozano,Daryna Dementieva,Alexander Fraser
Main category: cs.CL
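TL;DR: This paper introduces a new task, Dissimilar Span Detection (DSD), which aims to identify semantically differing words or tokens in a text pair, and releases an accompanying dataset, SSD. Experiments with several baselines show that LLMs and supervised models perform best, but overall performance remains low, indicating that the task is hard. A further experiment shows that DSD can improve downstream tasks such as paraphrase detection.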
Details
Motivation: Existing semantic textual similarity (STS) methods output a single score, which limits interpretability and makes it hard to see which specific semantic differences drive the similarity.
Method: Introduce the Dissimilar Span Detection (DSD) task; build the Span Similarity Dataset (SSD) through a semi-automated pipeline combining LLMs with human verification; design and evaluate several unsupervised baselines (LIME, SHAP, LLMs, and the authors' own method) and one supervised baseline.
Result: LLMs and supervised models perform best on DSD, but overall performance remains low; an additional experiment shows that DSD improves paraphrase detection.
Conclusion: DSD is a challenging but valuable new task that makes STS more interpretable and can transfer to improve downstream NLP tasks.
Abstract: Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.
[56] Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles
Sai Koneru,Jian Wu,Sarah Rajtmajer
Main category: cs.CL
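TL;DR: This paper studies the extraction of hypotheses and their statistical evidence from full-text scientific articles. It uses a two-stage retrieve-and-extract framework and controlled experiments to analyze how retrieval design choices affect hypothesis and statistical evidence extraction.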
Details
Motivation: Extracting hypotheses and their statistical evidence from full-text scientific articles is central to synthesizing empirical findings, but remains difficult because documents are long and scientific arguments are spread across sections.
Method: Using a two-stage retrieve-and-extract framework, systematically evaluate how context quantity, context quality (standard RAG, reranking, fine-tuned retriever plus reranking), and an oracle paragraph setting affect four LLM extractors.
Result: Targeted context selection markedly improves hypothesis extraction, but statistical evidence extraction remains hard; even with oracle paragraphs, performance is only moderate, pointing to inherent extractor limitations with hybrid numeric-textual statements.
Conclusion: Optimizing retrieval quality and context cleanliness is key for hypothesis extraction, whereas the bottleneck for statistical evidence extraction lies mainly in the extractor's ability to handle complex hybrid formats rather than in retrieval.
Abstract: Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. The work studies a sequential full-text extraction setting, where the statement of a primary finding in an article's abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.
[57] Graph Fusion Across Languages using Large Language Models
Kaung Myat Kyaw,Khush Agarwal,Jonathan Chan
Main category: cs.CL
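TL;DR: This paper proposes a framework for cross-lingual knowledge graph fusion using large language models (LLMs). Structural linearization maps triples into natural language sequences, which lets the LLM act as a universal semantic bridge across cross-lingual heterogeneity; the framework's scalability and modularity are demonstrated on the DBP15K dataset.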
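The structural linearization step named in this summary can be sketched in a few lines. The bracketed `[head] [relation] [tail]` form follows the example given in the abstract, read literally here as an assumption; the prompt wording, the function names, and the sample triples are hypothetical and do not come from the paper.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def linearize(triple: Triple) -> str:
    """Render one (head, relation, tail) triple as a bracketed text sequence."""
    head, relation, tail = triple
    return f"[{head}] [{relation}] [{tail}]"


def fusion_prompt(fused_graph: Iterable[Triple], candidate_graph: Iterable[Triple]) -> str:
    """Assemble a plain-text prompt placing the fused graph next to a new candidate graph."""
    fused_lines = "\n".join(linearize(t) for t in fused_graph)
    candidate_lines = "\n".join(linearize(t) for t in candidate_graph)
    return (
        "Fused graph so far:\n" + fused_lines + "\n\n"
        "New candidate graph:\n" + candidate_lines + "\n\n"
        "List which candidate entities and relations match ones in the fused graph."
    )


# Hypothetical English and French triples describing the same fact.
fused = [("Paris", "capital_of", "France")]
candidate = [("Paris", "capitale_de", "France")]
prompt = fusion_prompt(fused, candidate)
```

The LLM call itself is deliberately left out: which model is used and how its answer is parsed back into graph edits are not specified in the source.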
Details
Motivation: Cross-lingual knowledge graph fusion faces semantic heterogeneity and complex graph structure, and calls for a general method that can handle multi-source, multilingual knowledge in a unified way.
Method: Propose an LLM-based cross-lingual graph fusion framework that uses structural linearization to convert knowledge graph triples into natural language sequences (e.g., [head entity] [relation] [tail entity]) and draws on the LLM's in-context reasoning and multilingual semantic priors to map relations and disambiguate entities between the evolving fused graph and a new candidate graph.
Result: Experiments on DBP15K show that the method can sequentially aggregate multiple heterogeneous graphs, with good cross-lingual alignment and continuous knowledge synthesis.
Conclusion: LLMs can serve as a universal semantic bridge supporting scalable, modular cross-lingual knowledge graph fusion, offering a new paradigm for dynamic knowledge integration in multi-source, multilingual settings.
Abstract: Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
[58] Conversation Tree Architecture: A Structured Framework for Context-Aware Multi-Branch LLM Conversations
Pranav Hemanth,Sampriti Saha
Main category: cs.CL
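TL;DR: This paper proposes the Conversation Tree Architecture (CTA), a hierarchical tree structure that addresses "logical context poisoning", the degradation caused by flat context accumulation in long, multi-topic LLM conversations. CTA isolates context per node and defines how context flows between parent and child nodes (including volatile nodes), providing a theoretical foundation and a prototype implementation for structured conversation management and multi-agent extensions.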
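One way to picture the tree of context-isolated nodes in this summary is the data structure below. It is a sketch under stated assumptions, not the authors' prototype: the class name, the `carry` parameter that controls downstream flow, and the rule that a volatile node must be given an explicit merge list before deletion are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality keeps parent/child links from recursing
class Node:
    topic: str
    context: List[str] = field(default_factory=list)  # this node's local window
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    volatile: bool = False

    def branch(self, topic: str, carry: int = 0, volatile: bool = False) -> "Node":
        """Create a child and pass the last `carry` parent messages downstream."""
        seed = self.context[-carry:] if carry > 0 else []
        child = Node(topic, list(seed), parent=self, volatile=volatile)
        self.children.append(child)
        return child

    def delete(self, merge: Optional[List[str]] = None) -> None:
        """Remove this branch, pushing the selected messages upstream first."""
        if self.parent is None:
            raise ValueError("cannot delete the root node")
        if self.volatile and merge is None:
            raise ValueError("volatile node: pass a merge list (possibly empty) before purging")
        self.parent.context.extend(merge or [])
        self.parent.children.remove(self)
        self.parent = None


root = Node("trip planning")
root.context += ["user: plan a trip", "assistant: which dates?"]
side = root.branch("side question", carry=1, volatile=True)
side.context.append("user: unrelated side question")
side.delete(merge=["summary: side question resolved"])
```

Only the one-line summary reaches the parent here; the rest of the side thread is discarded with the branch, which is the isolation the architecture is meant to provide.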
Details
Motivation: Existing conversation interfaces use a flat, append-only context structure, so context from different topics mixes together in multi-topic conversations and response quality degrades as the conversation grows. The authors call this "logical context poisoning".
Method: Propose the Conversation Tree Architecture (CTA), which models a conversation as a tree in which each node has its own local context window; define directed context flow between parent and child nodes (passed downstream on branch creation, returned upstream on branch deletion) and introduce volatile nodes that must be selectively merged or discarded; formalize the core primitives, analyze the design challenges of context flow, and relate the framework to work on LLM memory management.
Result: A complete framework definition and design analysis of CTA, plus a working prototype that shows feasibility for structured context management and natural extension to multi-agent settings.
Conclusion: CTA offers a principled, extensible way to organize context for long, multi-topic LLM conversations, and is expected to mitigate logical context poisoning and improve conversational consistency and controllability.
Abstract: Large language models (LLMs) are increasingly deployed for extended, multi-topic conversations, yet the flat, append-only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture's primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi-agent settings.
[59] More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection
Runze Sun,Yu Zheng,Zexuan Xiong,Zhongjin Qu,Lei Chen,Jiwen Lu,Jie Zhou
Main category: cs.CL
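TL;DR: This paper proposes the ARCADE framework and the H-VLI benchmark, modeling multimodal semantic interaction through a simulated courtroom debate to identify implicit hate speech more accurately.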
Details
Motivation: Existing hate speech detection systems perform poorly on implicit multimodal attacks because they struggle to capture emergent cross-modal meaning and intent shifts.
Method: Formulate fine-grained semantic intent shifts and build the H-VLI benchmark, where intent depends on modality interplay rather than overt slurs; design the ARCADE framework, in which role-playing agents (accusation and defense) carry out asymmetric reasoning and debate-style decision making.
Result: ARCADE significantly outperforms state-of-the-art methods on H-VLI, especially on implicit cases, while remaining competitive on established benchmarks.
Conclusion: Semantic interplay between modalities is key to understanding implicit hate speech, and debate-based asymmetric reasoning can improve the robustness and interpretability of multimodal hate detection.
Abstract: Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine-grained formulation, we curate the Hate via Vision-Language Interplay (H-VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state-of-the-art baselines on H-VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: https://github.com/Sayur1n/H-VLI
[60] Enhancing reasoning accuracy in large language models during inference time
Vinay Sharma,Manish Jain
Main category: cs.CL
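TL;DR: This paper systematically evaluates three inference-time strategies (self-consistency, dual-model reasoning agreement, and self-reflection) for improving LLM accuracy on multi-step reasoning tasks. Self-consistency with nucleus sampling and controlled temperature yields substantial accuracy gains (9% to 15%) and suits low-risk settings; the dual-model approach suits moderate-risk settings; self-reflection has limited effect.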
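The self-consistency strategy in this summary reduces to sampling several answers and keeping the most frequent one. The sketch below shows only that voting step; the sampler is a stand-in that replays canned answers, since the model, temperature, and nucleus sampling settings used in the paper are not given here.

```python
from collections import Counter
from typing import Callable, Iterator


def self_consistency(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples final answers and return the most frequent one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = Counter(sample_answer() for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer


# Stand-in for stochastic decoding: replays five canned final answers.
canned: Iterator[str] = iter(["42", "41", "42", "43", "42"])
final_answer = self_consistency(lambda: next(canned), n_samples=5)
```

In a real setup `sample_answer` would call the model with a Chain-of-Thought prompt and extract the final answer from each sampled trace; how ties are resolved is left unspecified in this sketch.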
Details
Motivation: LLMs remain unreliable on multi-step reasoning tasks, especially without additional training or fine-tuning, so training-free inference-time improvements are needed.
Method: Systematically evaluate three inference-time strategies: (i) self-consistency via stochastic decoding (nucleus sampling with controlled temperature); (ii) dual-model reasoning agreement (comparing the outputs of two independent models); (iii) self-reflection (the model critiques and revises its own reasoning). All methods use Chain-of-Thought (CoT) prompting.
Result: Self-consistency yields a 9% to 15% absolute accuracy improvement; the dual-model approach offers higher confidence at greater compute cost; self-reflection brings only marginal gains, with especially limited effect for smaller non-reasoning models.
Conclusion: Inference-time strategies can substantially improve LLM reasoning reliability. Self-consistency is the most efficient and practical, the dual-model approach suits settings that need higher reliability, and self-reflection is currently of limited practical use.
Abstract: Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. In this work, we provide a controlled comparative evaluation across three inference-time strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and controlled temperature value yields the substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding, well-suited for low-risk domains, offering meaningful gains with minimal overhead. The dual-model approach provides additional confirmation for model reasoning steps thus more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
[61] TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols
Saketh Vinjamuri,Marielle Fis Loperena,Marie C. Spezia,Ramez Kouzy
Main category: cs.CL
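TL;DR: This paper presents TimeTox, a three-stage LLM-based pipeline (using Gemini models) that automatically extracts "time toxicity" (cumulative healthcare contact days) from the Schedule of Assessments in clinical trial protocols. It compares a single-pass architecture with a two-stage (structure-then-count) architecture: the two-stage version is more accurate on synthetic data, but the single-pass version is more stable on real-world protocols and better suited to production deployment.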
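The stability criterion reported for this pipeline (IQR of the extracted day counts across repeated runs, with IQR ≤ 3 days treated as clinically acceptable and IQR = 0 as perfectly stable) can be sketched as follows. The quartile convention is an assumption: the source does not say how the IQR was computed for three runs, so this uses the inclusive method of Python's `statistics.quantiles`. The per-arm values are made up.

```python
from statistics import quantiles
from typing import Sequence


def iqr(values: Sequence[float]) -> float:
    """Interquartile range using the inclusive quartile method (an assumed convention)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def is_acceptable(run_values: Sequence[float], tolerance_days: float = 3.0) -> bool:
    """True when repeated extractions for one arm agree within the tolerance."""
    return iqr(run_values) <= tolerance_days


def is_perfectly_stable(run_values: Sequence[float]) -> bool:
    """True when the spread across runs is exactly zero."""
    return iqr(run_values) == 0


# Hypothetical contact-day counts for three treatment arms over three runs each.
arm_runs = {"arm A": [12, 12, 12], "arm B": [10, 12, 14], "arm C": [10, 12, 30]}
report = {arm: (iqr(v), is_acceptable(v), is_perfectly_stable(v)) for arm, v in arm_runs.items()}
```

With only three runs the IQR is a coarse measure, and a different quartile convention would move borderline arms across the 3-day threshold, which is why the convention is worth stating alongside any reported percentage.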
Details
Motivation: Time toxicity is an important measure of patient burden in clinical trials, but it currently has to be extracted manually from protocol documents, which is slow and labor-intensive; an automated method is needed.
Method: Develop the TimeTox pipeline: stage one uses Gemini to summarize the full protocol PDF; stage two quantifies time toxicity for each treatment arm at six cumulative timepoints; stage three reaches multi-run consensus via position-based arm matching. Compare single-pass (vanilla) and two-stage (structure-then-count) architectures.
Result: On 20 synthetic schedules (240 comparisons), the two-stage architecture reaches 100% clinically acceptable accuracy (±3 days, MAE = 0.81 days) versus 41.5% for single-pass (MAE = 9.0 days). On 644 real-world oncology protocols, single-pass is more stable: 95.3% reach clinically acceptable accuracy (IQR ≤ 3 days) and 82.0% are perfectly stable (IQR = 0). Production-level extraction was completed for 1,288 treatment arms.
Conclusion: For production deployment of LLMs on real medical text, stability of the extracted results matters more than absolute accuracy on synthetic data; the single-pass architecture, being more robust and consistent, is better suited to practical use.
Abstract: Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy (±3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR ≤ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
[62] Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles
Adi Gabay,Gabriel Stanovsky,Liat Peterfreund
Main category: cs.CL
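TL;DR: This paper introduces the "reduction ladder" and challenges the dichotomy that classifies LLM behavior on epistemic reasoning tasks as either "epistemic reasoning" or "brittle memorization", arguing that memorization is a special case of reduction. Experiments show that some large models solve epistemic puzzles through reduction, but all models struggle once genuine epistemic reasoning is required.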
Details
Motivation: Prior work attributes LLM performance on classic epistemic puzzles to either epistemic reasoning ability or brittle memorization; the authors argue that this dichotomy is incomplete and that a finer account is needed of how models use existing knowledge to solve new problems.
Method: Propose the reduction ladder, a series of systematic, progressive modifications to a canonical epistemic puzzle that make reduction increasingly difficult while preserving the underlying logic; evaluate multiple LLMs under this framework.
Result: Some large models succeed at higher rungs of the ladder, showing some capacity for reduction, but all models fail markedly at the rungs that require genuine epistemic reasoning rather than reduction.
Conclusion: Memorization should be understood as a special case of reduction. Current LLMs remain weak at epistemic reasoning, and their success on epistemic puzzles mostly reflects reduction to known patterns rather than genuine reasoning.
Abstract: Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.
[63] Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF
K. M. Jubair Sami,Dipto Sumit,Ariyan Hossain,Farig Sadeque
Main category: cs.CL
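TL;DR: This paper proposes a two-phase framework for evaluating dialectal bias in LLM question answering across nine Bengali dialects. Using RAG-augmented translation and LLM-as-a-judge fidelity evaluation, it benchmarks 19 models and finds that performance drops more sharply as dialects diverge further, and that larger model scale does not consistently reduce the bias. Contributions include a new translation quality evaluation method, a high-quality dialect benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.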
Details
Motivation: LLMs often show performance bias against regional dialects of low-resource languages, yet frameworks for systematically quantifying such bias are scarce.
Method: A two-phase framework. Phase one uses retrieval-augmented generation (RAG) to translate and gold-label standard Bengali questions into nine dialectal variants (4,000 question sets) and evaluates translation fidelity with an LLM-as-a-judge, which human validation confirms outperforms traditional metrics. Phase two benchmarks 19 LLMs on the gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback.
Result: The larger the linguistic divergence between dialects, the more severe the performance drop (e.g., Chittagong scores 5.44/10 versus 7.68/10 for Tangail); increasing parameter count does not consistently improve dialect performance; the work delivers a validated translation quality evaluation method, a high-quality dialect benchmark dataset, and the new Critical Bias Sensitivity (CBS) metric.
Conclusion: LLMs show significant, systematic performance bias on low-resource dialects, and current scaling strategies are not enough to mitigate it. The work provides methodology, data, and metrics for dialect fairness evaluation, particularly in safety-sensitive settings.
Abstract: Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants adopting a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which human correlation confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.
[64] Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection
Heidi Campana Piva,Shaina Ashraf,Maziar Kianimoghadam Jouneghani,Arianna Longo,Rossana Damiano,Lucie Flek,Marco Antonio Stranisci
Main category: cs.CL
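TL;DR: This paper introduces the Conspiracy Frame semantic framework and the Con.Fra. dataset for fine-grained modeling and detection of conspiratorial narratives, and explores how well they work with LLMs.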
Details
Motivation: Conspiracy theories, as anti-authoritarian narratives, fuel social conflict and shape how people perceive political information; more generalizable and interpretable detection methods are needed.
Method: Build the Conspiracy Frame from frame semantics and semiotics, and use it to annotate Telegram messages at span level, producing the Con.Fra. dataset; analyze semantic patterns by mapping spans to FrameNet, and inject frame information into in-context learning to assess LLM detection ability.
Result: Frame injection does not clearly improve in-domain or out-of-domain detection, but the mapping reveals abstract semantic patterns across domains (e.g., 'Kinship', 'Ingest_substance'), suggesting that semantic and semiotic modeling has potential.
Conclusion: The Conspiracy Frame and Con.Fra. offer a new approach to conspiracy theory detection and highlight the value of semantic and semiotic structure for model interpretability and generalization.
Abstract: Conspiracy theories are anti-authoritarian narratives that lead to social conflict, impacting how people perceive political information. To help in understanding this issue, we introduce the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives derived from frame-semantics and semiotics, which spawned the Conspiracy Frames (Con.Fra.) dataset: a corpus of Telegram messages annotated at span-level. The Conspiracy Frame and Con.Fra. dataset contribute to the implementation of a more generalizable understanding and recognition of conspiracy theories. We observe the ability of LLMs to recognize this phenomenon in-domain and out-of-domain, investigating the role that frames may have in supporting this task. Results show that, while the injection of frames in an in-context approach does not lead to clear increase of performance, it has potential; the mapping of annotated spans with FrameNet shows abstract semantic patterns (e.g., 'Kinship', 'Ingest_substance') that potentially pave the way for a more semantically- and semiotically-aware detection of conspiratorial narratives.
[65] Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models
Jinghan Cao,Yu Ma,Xinjin Li,Qingyang Ren,Xiangyun Chen
Main category: cs.CL
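TL;DR: This paper proposes the Performance-Efficiency Ratio (PER) metric and systematically evaluates the efficiency of 16 language models on five NLP tasks. It finds that small models (0.5 to 3B parameters) perform best when accuracy, throughput, memory, and latency are considered together, which gives a quantitative basis for model deployment in resource-constrained settings.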
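The exact PER formula is not given in this digest, only that it integrates accuracy, throughput, memory, and latency through geometric mean normalization. The sketch below is one plausible reading under loudly stated assumptions: each metric is normalized against the best value in the compared cohort (higher-is-better metrics as value/best, lower-is-better metrics as best/value), and the score is the geometric mean of the four ratios. The model names and numbers are made up.

```python
from math import prod
from typing import Dict

HIGHER_IS_BETTER = ("accuracy", "throughput")
LOWER_IS_BETTER = ("memory", "latency")


def per_scores(models: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Geometric mean of cohort-normalized metrics; 1.0 means best on every axis."""
    best = {m: max(v[m] for v in models.values()) for m in HIGHER_IS_BETTER}
    best.update({m: min(v[m] for v in models.values()) for m in LOWER_IS_BETTER})
    scores = {}
    for name, metrics in models.items():
        ratios = [metrics[m] / best[m] for m in HIGHER_IS_BETTER]
        ratios += [best[m] / metrics[m] for m in LOWER_IS_BETTER]
        scores[name] = prod(ratios) ** (1.0 / len(ratios))
    return scores


# Hypothetical measurements: accuracy (fraction), throughput (tokens/s), memory (GB), latency (ms).
cohort = {
    "small": {"accuracy": 0.80, "throughput": 200.0, "memory": 2.0, "latency": 50.0},
    "large": {"accuracy": 0.90, "throughput": 50.0, "memory": 16.0, "latency": 200.0},
}
scores = per_scores(cohort)
```

A geometric mean penalizes a model that is very poor on any single axis, which is one reason a slightly less accurate but much cheaper model can score higher under a metric of this shape.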
Details
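Motivation: LLMs perform well but are computationally expensive and hard to deploy in resource-constrained environments; a task-oriented efficiency evaluation framework is needed.
Method: Define the Performance-Efficiency Ratio (PER), a metric that integrates accuracy, throughput, memory footprint, and latency through geometric mean normalization; run a systematic task-specific efficiency analysis of 16 language models on five representative NLP tasks.
Result: Small models (0.5 to 3B parameters) achieve the highest PER scores on all tasks, clearly ahead of larger models.
Conclusion: In production settings that prioritize inference efficiency over marginal accuracy gains, small models should be preferred; the study provides the empirical basis and quantitative criteria for that choice.
Abstract: Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5--3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.
[66] Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks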
Navya Mehrotra,Adam Visokay,Kristina Gligorić
Main category: cs.CL
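TL;DR: This paper proposes Perspective-Driven Inference, a method for estimating annotation distributions across demographic groups more fairly and accurately in subjective annotation tasks, using an adaptive sampling strategy to make the most of a limited human annotation budget.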
Details
Motivation: Existing methods for correcting LLM annotation error assume a single ground-truth label, but in tasks involving subjective judgment (such as politeness or offensiveness), disagreement across demographic groups is itself meaningful, so the annotation distribution should be modeled rather than a single true value.
Method: Propose the Perspective-Driven Inference framework, which treats the cross-group annotation distribution as the estimation target, and design an adaptive sampling strategy that prioritizes human annotation for the groups where LLM proxies perform worst.
Result: On politeness and offensiveness rating tasks, the method improves estimates for harder-to-model demographic groups relative to uniform sampling baselines while maintaining overall coverage.
Conclusion: The method offers a fairer, more efficient, perspective-centered approach to LLM-assisted annotation in subjective tasks.
Abstract: Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.
[67] Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs
Mariela M. Nina,Caio Veloso Costa,Lilian Berton,Didier A. Vega-Oliveros
Main category: cs.CL
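TL;DR: This paper systematically evaluates parameter-efficient fine-tuning (PEFT) and quantization techniques applied to the BERTimbau model for Brazilian Portuguese question answering (SQuAD-BR), finding that methods such as LoRA retain high accuracy while substantially reducing computational cost, in support of Green AI goals.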
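For readers unfamiliar with LoRA, the sketch below shows why it cuts training cost: the original weight W stays frozen and only two small matrices A and B are trained, with the effective weight being W + (alpha / rank) * B @ A. The helper names and the 1024 x 1024 shape are illustrative assumptions, not values taken from this entry or from BERTimbau's configuration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / rank) * B (A x) with W frozen.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable parameters in the adapter (A and B only)."""
    return rank * (d_out + d_in)


# Illustrative shapes only: one 1024 x 1024 projection with a rank-8 adapter.
adapter_params = lora_trainable_params(1024, 1024, 8)   # 16384
frozen_params = 1024 * 1024                              # 1048576
trainable_fraction = adapter_params / frozen_params      # 0.015625
```

With B initialized to zeros, as is common for LoRA, `lora_forward` returns exactly `W x`, so fine-tuning starts from the pretrained model's behavior.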
Details
Motivation: The high computational cost of large language models limits accessibility for low-resource languages such as Brazilian Portuguese, creating a need for efficient, sustainable fine-tuning approaches.
Method: On the SQuAD-BR dataset, applies four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) and quantization to BERTimbau-Base/Large, evaluating 40 configurations in total, and compares against the generative models Tucano and Sabiá.
Result: LoRA reaches 95.8% of baseline performance on BERTimbau-Large (F1=81.32) while cutting training time by 73.5%; raising the learning rate to 2e-4 yields F1 gains of up to +19.71; the larger model is more robust to quantization (F1 loss of only 4.83 vs 9.56); generative models can reach similar F1 but with markedly higher GPU memory and training time.
Conclusion: For extractive Brazilian Portuguese QA, small encoder-based models (such as BERTimbau) combined with PEFT and quantization are more efficient and sustainable than large generative LLMs, in line with Green AI principles.
Abstract: Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles.
An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
[68] Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval
Hang Gao,Dimitris N. Metaxas
Main category: cs.CL
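TL;DR: This paper proposes semantic shift as the root cause behind anisotropy and length-induced collapse in Transformer embedding models, and validates through theoretical analysis and experiments that it predicts retrieval degradation.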
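The smoothing effect behind this entry can be seen in a toy setting: when the vectors being mean-pooled point in different directions, the pooled vector ends up far from every one of them. The sketch below uses made-up 3-dimensional vectors and plain cosine similarity; it is not the paper's semantic shift measure, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_similarity_to_pooled(vectors):
    """Average cosine similarity between each vector and the pooled vector."""
    pooled = mean_pool(vectors)
    return sum(cosine(v, pooled) for v in vectors) / len(vectors)


# Toy "sentence embeddings": one semantically tight text, one diverse text.
similar = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.9, 0.0, 0.1]]
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

tight_score = mean_similarity_to_pooled(similar)
diverse_score = mean_similarity_to_pooled(diverse)
```

For the three orthogonal vectors the pooled embedding has cosine similarity 1/sqrt(3) with each input, while the tight set stays close to 1, which mirrors the claim that pooling over diverse content yields a less discriminative vector.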
Details
Motivation: Prior work describes embedding pathologies (such as anisotropy and length-induced collapse) but offers no causal explanation of when and why they harm downstream retrieval.
Method: Presents a theory of semantic smoothing, defines and formalizes semantic shift as a computable measure combining local semantic evolution and global semantic dispersion, and runs controlled experiments across multiple datasets and embedding models.
Result: The degree of semantic shift aligns closely with the severity of embedding concentration and effectively predicts retrieval degradation, whereas text length alone has no such predictive power.
Conclusion: Semantic shift provides a unified, actionable lens for understanding embedding collapse and can be used to diagnose when anisotropy is actually harmful.
Abstract: Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
[69] PROMPT2BOX: Uncovering Entailment Structure among LLM Prompts
Neeladri Bhuiya,Shib Sankar Dasgupta,Andrew McCallum,Haw-Shiuan Chang
Main category: cs.CL
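TL;DR: This paper proposes PROMPT2BOX, which uses box embeddings in place of conventional vector embeddings to better capture both semantic similarity and specificity relations among prompts, enabling finer-grained analysis of LLM weaknesses.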
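To see how a box can encode specificity where a single vector cannot, consider axis-aligned boxes given by a lower and an upper corner: a more specific prompt is a smaller box sitting inside the box of a more general prompt. The sketch below uses hard boxes with hand-picked 2D coordinates as an assumption for illustration; it does not reproduce PROMPT2BOX's trained encoder or its exact box parameterization.

```python
def volume(box):
    """Volume of an axis-aligned box given as (lower_corner, upper_corner)."""
    lower, upper = box
    result = 1.0
    for lo, hi in zip(lower, upper):
        result *= max(hi - lo, 0.0)  # empty along an axis -> zero volume
    return result


def intersection(box_a, box_b):
    """Axis-aligned intersection of two boxes (may be empty)."""
    lower = [max(a, b) for a, b in zip(box_a[0], box_b[0])]
    upper = [min(a, b) for a, b in zip(box_a[1], box_b[1])]
    return lower, upper


def containment(inner, outer):
    """Fraction of `inner`'s volume lying inside `outer` (1.0 = fully inside)."""
    inner_volume = volume(inner)
    if inner_volume == 0:
        return 0.0
    return volume(intersection(inner, outer)) / inner_volume


# Hand-picked toy boxes: "writing an adventure story" inside "writing a story".
story = ([0.0, 0.0], [4.0, 4.0])
adventure_story = ([1.0, 1.0], [2.0, 3.0])

specific_in_general = containment(adventure_story, story)   # 1.0
general_in_specific = containment(story, adventure_story)   # 0.125
```

The asymmetry of the two scores is the point: containment gives a direction (which prompt is more specific), whereas a symmetric vector similarity would score the pair the same both ways.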
Details
Motivation: Conventional vector embeddings mainly reflect topical similarity and struggle to distinguish prompts that share a topic but differ in specificity (and hence difficulty), which limits fine-grained analysis of LLM weaknesses.
Method: Proposes PROMPT2BOX: an encoder, trained on existing and synthesized datasets, maps prompts into a box embedding space; a novel dimension reduction technique for box embeddings supports visualization and comparison.
Result: Experiments show box embeddings capture prompt specificity significantly better than vector baselines; when building hierarchical clustering trees for 17 LLMs on the UltraFeedback dataset, PROMPT2BOX identifies 8.9% more LLM weaknesses than vector methods, and the correlation between hierarchical depth and instruction specificity is about 33% stronger.
Conclusion: Box embeddings model the specificity structure of prompts more effectively, giving a finer-grained and more interpretable representation for discovering LLM weaknesses.
Abstract: To discover the weaknesses of LLMs, researchers often embed prompts into a vector space and cluster them to extract insightful patterns. However, vector embeddings primarily capture topical similarity. As a result, prompts that share a topic but differ in specificity, and consequently in difficulty, are often represented similarly, making fine-grained weakness analysis difficult. To address this limitation, we propose PROMPT2BOX, which embeds prompts into a box embedding space using a trained encoder. The encoder, trained on existing and synthesized datasets, outputs box embeddings that capture not only semantic similarity but also specificity relations between prompts (e.g., "writing an adventure story" is more specific than "writing a story"). We further develop a novel dimension reduction technique for box embeddings to facilitate dataset visualization and comparison. Our experiments demonstrate that box embeddings consistently capture prompt specificity better than vector baselines. On the downstream task of creating hierarchical clustering trees for 17 LLMs from the UltraFeedback dataset, PROMPT2BOX can identify 8.9% more LLM weaknesses than vector baselines and achieves an approximately 33% stronger correlation between hierarchical depth and instruction specificity.
[70] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
Shuai Wang,Yinan Yu
Main category: cs.CL
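TL;DR: This paper proposes KG-Hopper, a reinforcement learning framework that enables lightweight open-source LLMs to perform end-to-end, globally aware multi-hop knowledge graph reasoning in a single inference round, significantly outperforming larger multi-step approaches and rivaling closed-source models such as GPT-3.5-Turbo.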
Details
Motivation: Existing KBQA methods rely on sequential reasoning in fixed pipelines, which lacks flexibility, is prone to cascading failures from single-step errors, and struggles with knowledge-intensive multi-hop reasoning.
Method: Proposes KG-Hopper, a reinforcement learning framework for compact open-source LLMs that folds the entire knowledge graph traversal and decision process into a unified "thinking" stage, supporting global modeling of cross-step dependencies, dynamic path exploration, and backtracking.
Result: On eight KG reasoning benchmarks, KG-Hopper built on a 7B-parameter LLM consistently outperforms multi-step systems of up to 70B parameters and performs comparably to GPT-3.5-Turbo and GPT-4o-mini, while remaining lightweight, open, and data-efficient.
Conclusion: KG-Hopper shows that an RL-driven end-to-end reasoning paradigm can overcome the limits of traditional stepwise KBQA, striking a better balance among accuracy, efficiency, and openness.
Abstract: Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified "thinking" stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.
[71] Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
Tae-Eun Song
Main category: cs.CL
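TL;DR: This paper proposes Cross-Context Verification (CCV) and the Hierarchical Cross-Context Architecture (HCCA), which use solution diversity across multiple sessions and information-restricted multi-role analysis to identify whether a large model is genuinely reasoning or recalling leaked solutions on coding benchmarks, addressing the problems that existing detection methods do not directly observe the reasoning process and are prone to false positives.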
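One simple way to quantify "solution diversity" across N independent sessions is the mean pairwise dissimilarity of the returned solutions. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the metric, the function name `solution_diversity`, and the toy solutions are assumptions for illustration, not the measure used in the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def solution_diversity(solutions):
    """Mean pairwise dissimilarity (1 - similarity ratio) across solutions.

    Returns 0.0 when all solutions are character-for-character identical
    and approaches 1.0 as they share less text.
    """
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Toy outputs from isolated sessions (made up for this sketch).
identical = ["return a + b"] * 5
varied = [
    "return a + b",
    "return sum([a, b])",
    "total = a\ntotal += b\nreturn total",
]

identical_score = solution_diversity(identical)
varied_score = solution_diversity(varied)
```

Under this toy metric, near-zero diversity across isolated sessions is the kind of signal the entry associates with recall, while varied solutions are more consistent with fresh reasoning; a real pipeline would likely compare patches structurally rather than as raw strings.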
Details
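Motivation: LLM coding benchmarks (such as SWE-bench Verified) face a credibility crisis from solution leakage and poor test quality; existing detection methods (such as paraphrase consistency, n-gram overlap, and perplexity) cannot directly distinguish reasoning from recall; and repeated verification makes misjudgments worse.
Method: Proposes the black-box method Cross-Context Verification (CCV), which solves the same problem in N independent sessions and quantifies solution diversity, together with the multi-agent framework HCCA, which prevents confirmation bias by restricting information across roles.
Result: On 9 SWE-bench Verified problems (45 trials), CCV perfectly separates contaminated samples from genuine reasoning (U=0, p≈0.012, r=1.0); contamination is found to be binary, reasoning absence perfectly discriminates contamination, and 33% of existing contamination labels are false positives; HCCA finds composite contamination-flaw cases that a single analyst misses; the multi-stage verification experiment fails, confirming that information restriction rather than structural complexity is the key mechanism.
Conclusion: CCV and HCCA provide a more reliable and interpretable black-box verification paradigm for evaluating LLM coding ability, emphasizing that solution diversity and information isolation, not more verification rounds or analysis layers, are central to detecting memorized leakage.
Abstract: LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U=0, p≈0.012, r=1.0). Key findings: (1) contamination is binary: models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss.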
A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result (100% sycophantic confirmation), providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
[72] DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation
Siqi Guo,Ming Lin,Tianbao Yang
Main category: cs.CL
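TL;DR: DRTriton is a scalable learning framework for training large language models to automatically convert PyTorch code into high-performance Triton kernels (which are then compiled to CUDA); it significantly improves conversion success rate and runtime speed through synthetic data, curriculum reinforcement learning, and test-time search.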
Details
Motivation: Existing large models (such as GPT-5.2 and Claude-Sonnet-4.5) perform poorly at automatically converting PyTorch code into efficient CUDA kernels, while hand-writing CUDA kernels is costly and difficult.
Method: Proposes the DRTriton framework with three components: (i) the CSP-DAG data synthesis algorithm, which ensures full coverage of the operator space and uniform sampling with controlled difficulty; (ii) curriculum reinforcement learning with a decoupled reward, which jointly optimizes conversion success rate and inference speed; (iii) a test-time search algorithm that further speeds up the generated Triton kernels.
Result: DRTriton-7B achieves speedup on 92% of KernelBench Level 2, far ahead of GPT-5.2 (23%) and Claude-Sonnet-4.5 (19%); trained only on synthetic data, it generalizes to real-world, complex CUDA kernels.
Conclusion: DRTriton effectively addresses the performance bottleneck of LLMs on PyTorch-to-CUDA conversion and offers a scalable, high-performance automated paradigm for low-level operator optimization in generative AI.
Abstract: Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing the engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle in this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) curriculum reinforcement learning with a decoupled reward that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup on 92% of KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.
[73] TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild
Kai-Wei Chang,Yi-Cheng Lin,Huang-Cheng Chou,Wenze Ren,Yu-Han Huang,Yun-Shao Tsai,Chien-Cheng Chen,Yu Tsao,Yuan-Fu Liao,Shrikanth Narayanan,James Glass,Hung-yi Lee
Main category: cs.CL
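TL;DR: This paper introduces TaigiSpeech, a low-resource speech intent detection dataset for Taiwanese Hokkien (Taigi) collected from older adults, containing 3,000 utterances from 21 speakers, and proposes two low-supervision data mining strategies to support dataset construction for unwritten spoken languages.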
Details
Motivation: To address the severe shortage of labeled data in speech technology for low-resource, primarily spoken languages that lack a writing system (such as Taiwanese Hokkien), with particular attention to the needs of underrepresented groups such as older adults.
Method: Builds the TaigiSpeech dataset (21 older-adult speakers, 3k real-world utterances) and proposes two data mining strategies: 1) keyword matching plus LLM pseudo-labeling via an intermediate language; 2) an audio-visual framework that uses multimodal cues and requires only minimal textual supervision.
Result: Builds the first real-world speech intent detection dataset for older Taigi speakers, supporting practical scenarios such as healthcare and home assistants, and designs a scalable, low-supervision data construction pipeline.
Conclusion: TaigiSpeech fills the data gap in speech intent recognition for low-resource, unwritten spoken languages; its data construction method offers a reusable template for similar languages, and it will be released under the CC BY 4.0 license to advance related research.
Abstract: Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found at https://kwchang.org/taigispeech.
[74] Effective Strategies for Asynchronous Software Engineering Agents
Jiayi Geng,Graham Neubig
Main category: cs.CL
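TL;DR: This paper proposes the CAID (Centralized Asynchronous Isolated Delegation) paradigm, which uses three software engineering primitives (centralized task delegation, asynchronous execution, and isolated workspaces) to improve the accuracy and efficiency of multi-agent collaboration on long-horizon software engineering tasks. Experiments show that it significantly outperforms single-agent baselines on paper reproduction and Python library development tasks.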
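The three primitives named here can be mimicked in a few lines of `asyncio`: a central loop decides which subtasks are unblocked, runs them concurrently, and gives each its own directory. This is a toy sketch under stated assumptions, not CAID itself: `run_subtask` is a hypothetical stand-in for an agent, plain temporary directories stand in for isolated workspaces, and there is no merge or test-based verification step.

```python
import asyncio
import tempfile
from pathlib import Path


async def run_subtask(name, workspace):
    """Stand-in for an agent: write a result file inside its own workspace."""
    await asyncio.sleep(0)  # yield control, as a real agent call would
    out = Path(workspace) / f"{name}.txt"
    out.write_text(f"done:{name}")
    return name, out.read_text()


async def run_plan(plan):
    """Run tasks wave by wave; plan maps task -> list of prerequisite tasks."""
    done, results = set(), {}
    pending = dict(plan)
    with tempfile.TemporaryDirectory() as root:
        while pending:
            # Centralized delegation: pick every task whose prerequisites are met.
            ready = [t for t, deps in pending.items() if set(deps) <= done]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            workspaces = {}
            for task in ready:
                ws = Path(root) / task  # isolated workspace per task
                ws.mkdir()
                workspaces[task] = ws
            # Asynchronous execution of all unblocked tasks.
            finished = await asyncio.gather(
                *(run_subtask(t, workspaces[t]) for t in ready)
            )
            for name, text in finished:
                results[name] = text
                done.add(name)
                del pending[name]
    return results


# Hypothetical four-task plan with dependencies.
plan = {"parse": [], "model": [], "train": ["parse", "model"], "report": ["train"]}
results = asyncio.run(run_plan(plan))
```

Here `parse` and `model` run in the same wave because neither depends on the other, while `train` waits for both. Isolation matters because concurrent edits to a shared directory are exactly the interference the entry describes.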
Details
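Motivation: Existing AI agents suffer from low accuracy and slow completion on long-horizon software engineering tasks with many dependencies; concurrent multi-agent collaboration is prone to conflicts, synchronization difficulties, and complex integration; human developers, by contrast, already have mature collaboration infrastructure worth borrowing from.
Method: Proposes the CAID paradigm with three core mechanisms: 1) a central manager builds a dependency-aware task plan; 2) agents execute subtasks asynchronously in isolated workspaces; 3) structured integration and verification based on executable tests; underneath, SWE primitives such as git worktree, commit, and merge implement branch-and-merge coordination.
Result: Accuracy improves by 26.7 percentage points on PaperBench (paper reproduction) and by 14.3 percentage points on Commit0 (Python library development); systematic analysis confirms that branch-and-merge is the core coordination mechanism for multi-agent collaboration and that SWE primitives reliably support its implementation.
Conclusion: Formalizing the collaboration primitives of human software engineering and embedding them in multi-agent system design can significantly improve collaboration on long-horizon tasks; CAID provides an executable, verifiable paradigm for building engineering-grade AI collaboration systems.
Abstract: AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges both with respect to accuracy, and with respect to timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification.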
In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
[75] Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment
Mohamed Sobhi Jabal,Jikai Zhang,Dominic LaBella,Jessica L. Houk,Dylan Zhang,Jeffrey D. Rudie,Kirti Magudia,Maciej A. Mazurowski,Evan Calabrese
Main category: cs.CL
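TL;DR: This study develops an end-to-end system combining multi-agent large language models (LLMs) and a convolutional neural network (CNN) to automate Brain Tumor Reporting and Data System (BT-RADS) classification; evaluated on 509 post-treatment glioma MRI examinations (492 meeting inclusion criteria), it reaches 76.0% accuracy, significantly higher than the 57.5% of initial clinical assessments.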
Details
Motivation: Although the BT-RADS standard is established, applying it clinically depends on complex manual integration of imaging trends, medication effects, and radiation timing, so automated tools are needed to improve assessment consistency and efficiency.
Method: Builds a multi-agent LLM system: an extractor agent pulls steroid and bevacizumab status and radiation date from unstructured clinical notes; a scorer agent applies BT-RADS decision logic to the extracted variables together with volumetric measurements from automated CNN segmentation; performance is validated on 509 retrospective examinations from a single center.
Result: System accuracy is 76.0% (vs 57.5% for initial clinical assessment, P<.001); context-dependent categories (e.g., BT-1a/b, BT-3a) show high sensitivity (≥87.5%), threshold-dependent categories (e.g., BT-2, BT-4) show moderate sensitivity (57.1%–74.8%), and positive predictive value for BT-4 reaches 92.9%.
Conclusion: The multi-agent LLM-CNN system markedly improves agreement of BT-RADS classification with the expert reference standard, particularly for categories that require clinical context, and identifies BT-4 with high positive predictive value, showing potential for clinical translation.
Abstract: The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%.
The multi-agent LLM system achieved higher BT-RADS classification agreement with the expert reference standard than initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
[76] Triangulating Temporal Dynamics in Multilingual Swiss Online News
Bros Victor,Dufraisse Evan,Popescu Adrian,Gatica-Perez Daniel
Main category: cs.CL
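TL;DR: Using a triangulated methodology, this paper analyzes temporal trends in news coverage by Swiss digital media across the French-, German-, and Italian-speaking regions; it combines quantitative analyses (lexical metrics, named entity recognition, Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection) with qualitative interpretation, derives domestication profiles and a proximity salience ratio to support cross-language comparison, and shows how linguistic and cultural context shapes reporting.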
Details
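Motivation: Studies of news coverage in multilingual societies rarely give adequate weight to linguistic and cultural diversity; in a complex multilingual country such as Switzerland in particular, an integrated analytical framework that accounts for both language differences and cultural context is needed.
Method: Uses triangulation to combine quantitative analysis (processing over 1.7 million news articles with lexical metrics, named entity recognition, Wikidata linking, targeted sentiment analysis, and consensus-based change-point detection) with qualitative interpretation; derives domestication profiles and a proximity salience ratio to support cross-language comparison and connect to theories of domestication and cultural proximity.
Result: Finds distinct temporal patterns across Switzerland's linguistic regions in the coverage of thematic, recurrent, and singular events, confirming that linguistic and cultural context strongly influences news selection, presentation, and sentiment; validates the usefulness of triangulation in media studies.
Conclusion: The study deepens understanding of Swiss digital media dynamics and proposes an analytical framework that can be extended to other multilingual or culturally diverse media environments, emphasizing the central role of linguistic and cultural factors in how news is constructed.
Abstract: Analyzing news coverage in multilingual societies can offer valuable insights into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for linguistic and cultural diversity within national media ecosystems remain limited, particularly in complex contexts such as Switzerland. This paper studies temporal trends in Swiss digital media across the country's three main linguistic regions, French, German, and Italian, using a triangulated methodology that combines quantitative analyses with qualitative insights. We collected and processed over 1.7 million news articles, applying lexical metrics, named entity recognition and Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparisons and to connect to theories of domestication and cultural proximity, we derive domestication profiles together with a proximity salience ratio. Our analysis spans thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we provide new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural contexts influence reporting.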
Our approach offers a framework applicable to other multilingual or culturally diverse media environments, contributing to a deeper understanding of how news is shaped by linguistic and cultural factors.
[77] Generalizable Self-Evolving Memory for Automatic Prompt Optimization
Guanbao Liang,Yuanchen Bei,Sheng Zhou,Yuheng Qin,Huan Zhou,Bingxin Jia,Bin Li,Jiajun Bu
Main category: cs.CL
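TL;DR: This paper proposes MemAPO, a memory-based automatic prompt optimization framework that uses a dual-memory mechanism (strategy templates and error patterns) to make prompt optimization generalizable and self-evolving, significantly improving performance while reducing optimization cost.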
Details
Motivation: Existing automatic prompt optimization methods are limited to searching for a specialized prompt for a fixed task, generalize poorly to heterogeneous queries, and cannot accumulate reusable prompting knowledge.
Method: MemAPO builds a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates and organizes incorrect generations into structured error patterns; for a new prompt, it retrieves relevant strategies and error patterns to compose a better prompt, and it keeps updating itself through iterative self-reflection and memory editing.
Result: Across multiple benchmarks, MemAPO consistently outperforms representative prompt optimization baselines and substantially reduces optimization cost.
Conclusion: MemAPO reframes prompt optimization as generalizable, self-evolving experience accumulation, offering a more adaptive and sustainable paradigm for LLM prompt engineering.
Abstract: Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.
[78] CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs
Ravi Ranjan,Utkarsh Grover,Mayur Akewar,Xiaomin Lin,Agoritsa Polyzou
Main category: cs.CL
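TL;DR: This paper proposes the CatRAG Debiasing framework, which combines a category-theory-guided embedding-space projection with retrieval-augmented generation (RAG); on several open-source large models it significantly improves question-answering fairness, sharply lowering bias scores while maintaining high accuracy.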
Details
Motivation: Large language models deployed in high-stakes settings exhibit demographic, gender, and geographic biases that threaten fairness and trust; existing debiasing methods mostly act at a single stage, so mitigation is incomplete and utility degrades easily under distribution shift.
Method: Proposes the dual-pronged CatRAG Debiasing framework: first, a category-theoretic functor component performs a structure-preserving projection of the embedding space that suppresses bias directions while retaining semantics; second, RAG is integrated for structural debiasing.
Result: On the BBQ bias benchmark across three open-source LLMs (Llama-3, GPT-OSS, and Gemma-3), CatRAG improves accuracy by up to 40% over the base models and by more than 10% over prior methods, and reduces bias scores for gender, nationality, race, and intersectional subgroups to near zero (from 60% for the base models).
Conclusion: By combining mathematical structural modeling with RAG, CatRAG Debiasing achieves more robust and comprehensive debiasing, significantly improving fairness while preserving model utility, and offers a new paradigm for trustworthy AI deployment.
Abstract: Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor-guided structural debiasing with Retrieval-Augmented Generation (RAG). The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.
[79] SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification
Migyeong Kang,Jihyun Kim,Hyolim Jeon,Sunwoo Hwang,Jihyun An,Yonghoon Kim,Haewoon Kwak,Jisun An,Jinyoung Han
Main category: cs.CL
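TL;DR: This paper proposes the SynSym framework, which uses large language models to generate high-quality, diverse synthetic data for psychiatric symptom identification; experiments show that the generated data performs comparably to real annotated data.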
Details
Motivation: Building large-scale symptom-level datasets is hampered by the high cost of expert annotation and the lack of standardized annotation guidelines, which limits how well models generalize to diverse symptom expressions.
Method: The SynSym framework uses large language models to (1) expand each symptom into sub-concepts to increase expression diversity; (2) generate psychiatric symptom expressions in different linguistic styles; (3) compose realistic multi-symptom expressions based on clinical co-occurrence patterns.
Result: Validated on three benchmark datasets with different styles of depressive symptom expression; models trained only on SynSym's synthetic data perform close to models trained on real data, and improve further with additional fine-tuning on real data.
Conclusion: Synthetic data can serve as an effective alternative to real annotations in psychiatric symptom modeling, and SynSym provides a practical framework for generating clinically relevant and realistic symptom expressions.
Abstract: Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users' mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data.
These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.
[80] DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing
Nasser-Eddine Monir,Zakaria Baou
Main category: cs.CL
TL;DR: DATASHI is a new parallel English-Tashlhiyt corpus designed to address the lack of computational resources for Amazigh languages, supporting NLP tasks like translation and orthographic normalization, with evaluations showing strong performance from large language models, especially Gemini-2.5-Pro.
Details
Motivation: To fill a critical gap in computational resources for Amazigh languages, particularly Tashlhiyt, by providing a parallel corpus that supports systematic study of orthographic diversity and enables text- and speech-based NLP tasks.
Method: Constructed DATASHI, a 5,000-sentence English-Tashlhiyt parallel corpus featuring a 1,500-sentence subset with both expert-standardized and non-standard user-generated versions; evaluated state-of-the-art LLMs (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) in zero-shot and few-shot settings; conducted fine-grained error analysis across phonological classes using edit operations.
Result: Gemini-2.5-Pro achieved the lowest word and character-level error rates and showed robust cross-lingual generalization; model-specific sensitivities to marked Tashlhiyt phonological features (e.g., geminates, emphatics) were identified via edit operation analysis.
Conclusion: DATASHI effectively advances NLP for low-resource Amazigh languages; its dual-version design and multimodal potential make it valuable for orthographic normalization and future speech-data alignment; LLM evaluations provide diagnostic insights for improving modeling of typologically distinctive features.
Abstract: DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks (such as tokenization, translation, and normalization) and also serves as a foundation for read-speech data collection and multimodal alignment.
Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations (deletions, substitutions, and insertions) across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.
[81] A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures
Bowen Chen,Namgi Han,Yusuke Miyao
Main category: cs.CL
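TL;DR: This study analyzes the statistics and internal mechanisms of several large language model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3), revealing commonalities in memorization behavior (such as a memorization rate that grows log-linearly with model size and shared frequency and domain distributions) as well as differences (such as a family-specific distribution of important attention heads), advancing a unified, fundamental understanding of memorization in large models.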
Details
Motivation: Limited by the unavailability of pre-training data, prior studies mostly focus on a single model series, producing isolated observations that make it hard to tell which memorization phenomena are general and which are model-specific.
Method: Collects multiple open-source model series and systematically analyzes memorization behavior at the statistical level (memorization rate, sequence compressibility, frequency and domain distribution) and at the internal level (robustness to perturbation, middle-layer decoding, attention head ablation).
Result: The memorization rate grows log-linearly with model size; memorized sequences can be further compressed; different models share frequency and domain distribution patterns for memorized sequences; however, the distribution of key attention heads is specific to each model family; memorized sequences are more sensitive to injected perturbations than general text.
Conclusion: Memorization behavior shows both regularities shared across model series and family-level differences, and needs to be modeled jointly at the statistical and internal levels, laying groundwork for a general theory of memorization.
Abstract: Memorization is a fundamental component of intelligence for both humans and LLMs. However, while LLM performance scales rapidly, our understanding of memorization lags. Due to limited access to the pre-training data of LLMs, most previous studies focus on a single model series, leading to isolated observations among series, making it unclear which findings are general or specific. In this study, we collect multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) and analyze their shared or unique memorization behavior at both the statistical and internal levels, connecting individual observations while showing new findings. At the statistical level, we reveal that the memorization rate scales log-linearly with model size, and memorized sequences can be further compressed. Further analysis demonstrates a shared frequency and domain distribution pattern for memorized sequences. However, different models also show individual features under the above observations. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive. By decoding middle layers and ablating attention heads, we reveal the general decoding process and shared important heads for memorization. However, the distribution of those important heads differs between families, showing a unique family-level feature. Through bridging various experiments and revealing new findings, this study paves the way for a universal and fundamental understanding of memorization in LLMs.
[82] TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression
Li Wang,Yandong Wang,Xin Yu,Kui Zhang,Tianhao Peng,Wenjun Wu
Main category: cs.CL
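TL;DR: This paper proposes TAMTRL, which uses relevant documents as teacher signals to reshape rewards in multi-turn reinforcement learning, addressing the temporal credit assignment problem in long-document processing and improving performance on long-context tasks.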
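As a rough illustration of per-turn credit assignment, the sketch below splits a single outcome reward across turns in proportion to normalized per-turn scores. This is only one possible reading of "rewards through normalized probabilities": the function `reshape_rewards`, its inputs, and the even-split fallback are assumptions, not TAMTRL's actual formulation.

```python
def reshape_rewards(turn_probs, final_reward):
    """Split one outcome reward across turns in proportion to turn_probs.

    turn_probs: hypothetical per-turn teacher-alignment scores (non-negative).
    final_reward: the scalar reward observed only at the end of the episode.
    """
    total = sum(turn_probs)
    if total <= 0:
        # No teacher signal at all: fall back to an even split (an assumption).
        return [final_reward / len(turn_probs)] * len(turn_probs)
    return [final_reward * p / total for p in turn_probs]

# Three memory-update turns; the last chunk aligned best with the teacher signal.
per_turn = reshape_rewards([1.0, 1.0, 2.0], final_reward=2.0)
```

The per-turn rewards always sum to the final reward, so the shaping redistributes credit without changing the total return of an episode.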
Details
Motivation: When a document exceeds the context window, an LLM has to process it chunk by chunk over multiple turns, but only the final outcome is supervised. Intermediate memory updates are therefore hard to evaluate, which creates a temporal credit assignment problem, and existing remedies are computationally expensive and noisy.
Method: Propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL), which uses relevant documents as teacher signals, aligns them with the model input at each turn, and assigns rewards through normalized probabilities in a self-supervised manner.
Result: Across seven long-context benchmarks and multiple models of different scales, TAMTRL consistently outperforms strong baselines.
Conclusion: TAMTRL provides a fine-grained learning signal for each memory update, mitigates the credit assignment problem in multi-turn training, and improves long-context processing.
Abstract: The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model's context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. Supervision, however, is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8.
[83] Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion
Shixu Liu
Main category: cs.CL
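TL;DR: This paper proposes WeatherTGD, a training-free multi-agent framework based on Text Gradient Descent. Three LLM agents (a statistical analyst, a physics interpreter, and a meteorology expert) collaborate to produce interpretable captions for weather time series; their domain-specific "textual gradients" are fused and used for iterative refinement, and the system significantly outperforms baselines in both automatic and human evaluation.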
Details
Motivation: Existing methods either output numerical predictions without explanation or generate generic descriptions that lack meteorological depth, so they fail to combine interpretability with domain expertise.
Method: Build WeatherTGD, a training-free multi-agent framework based on Text Gradient Descent (TGD). Three specialized LLM agents (Statistical Analyst, Physics Interpreter, Meteorology Expert) each generate domain-specific textual gradients from the time series; a Consensus-Aware Gradient Fusion mechanism merges these gradients and drives an iterative caption refinement process analogous to gradient descent.
Result: On real-world meteorological datasets, WeatherTGD significantly outperforms existing multi-agent baselines in both LLM-based and human expert evaluation, and remains computationally efficient because the agents run in parallel.
Conclusion: Bringing the text gradient descent paradigm to meteorological text generation is feasible and effective; collaborative multi-agent generation with fused domain gradients balances interpretability, expertise, and efficiency.
Abstract: Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents, a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert, that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.
[84] Probing How Scalable Table Data Enhances General Long-Context Reasoning
Huaibing Xie,Guoliang Zhao,Yang Liu,Shihan Dou,Siming Huang,Yanling Xiao,Shaolei Wang,Yiting Liu,Cheng Zhang,Shaofan Liu,Pluto Zhou
Main category: cs.CL
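TL;DR: This paper finds that structured table data with periodic structure substantially improves long-context reasoning, uses a mutual-information analysis to explain the underlying dependency mechanism, and proposes the TableLong synthesis pipeline, which yields average gains of more than 8% across multiple benchmarks.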
Details
Motivation: Little work has systematically explored which data types improve the long-context reasoning of LLMs. The authors observe that structured table data shows promise and investigate why and how well it works.
Method: Analyze the dependency structure of table data mathematically using mutual information; run systematic capability evaluations and scaling experiments; design the TableLong pipeline, which synthesizes high-quality, diverse, and verifiable structured table data for post-training with reinforcement learning.
Result: An average gain of +8.24% on multiple long-context benchmarks and +8.06% on out-of-domain benchmarks; the analysis supports periodic non-vanishing dependencies as the key mechanism.
Conclusion: Structured table data is an efficient, scalable, and mechanistically interpretable data source for improving long-context reasoning, and offers practical guidance for designing LLM post-training data.
Abstract: As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.
[85] SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models
Pengfei Cao,Mingxuan Yang,Yubo Chen,Chenlong Zhang,Mingxuan Liu,Kang Liu,Jun Zhao
Main category: cs.CL
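TL;DR: This paper presents SemEval-2026 Task 12, Abductive Event Reasoning (AER), which asks systems to identify the most plausible direct cause of a target event from supporting evidence. It builds an evidence-grounded multiple-choice benchmark to advance research on real-world causal reasoning.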
Details
Motivation: Direct-cause inference remains underexplored in evidence-rich real-world settings. A benchmark task is needed that focuses on real events and reflects challenges such as distributed evidence, distracting background factors, and semantic confusion.
Method: Organize SemEval-2026 Task 12 and formulate abductive event reasoning as an evidence-grounded multiple-choice task; design the dataset construction pipeline and evaluation setup, and analyze the results of participating systems.
Result: The task attracted 122 participants and 518 submissions, and produced the first public benchmark dataset for real-world events that emphasizes direct causes and multi-document evidence integration.
Conclusion: AER provides a new benchmark for abductive and causal reasoning over events, exposes key challenges in multi-document understanding and causal discrimination, and points to directions for future work.
Abstract: Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER). (The task data is available at https://github.com/sooo66/semeval2026-task12-dataset.git.) The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
[86] Politics of Questions in News: A Mixed-Methods Study of Interrogative Stances as Markers of Voice and Power
Bros Victor,Barbini Matilde,Gerard Patrick,Gatica-Perez Daniel
Main category: cs.CL
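TL;DR: This paper uses a mixed-methods design to study how interrogatives are used in French-language digital news, combining large-scale computational analysis with qualitative annotation to describe their functional distribution, how they are answered within the text, and whose voice they carry.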
Details
Motivation: Prior work focuses mainly on broadcast interviews or small corpora and lacks functional distinctions and systematic computational analysis of interrogatives in large news corpora. Computational studies, in turn, rarely separate interrogatives from declaratives and so overlook their pragmatic functions.
Method: A mixed-methods study over more than one million French-language digital news articles published between January 2023 and June 2024. Interrogative stances are detected automatically, their functional types are approximated, and in-text answers are located; a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions is used for validation and interpretation.
Result: Interrogatives are sparse but systematically patterned. They mainly introduce or organize issues; most of the remaining cases are information-seeking or echo-like, while leading and tag questions are rare. Most questions are taken up within the same article, usually in the journalist's narrative voice and less often through quoted speech. Interrogative contexts focus heavily on specific individuals, organizations, and places, whereas publics and broad social groups rarely appear, indicating strong personalization.
Conclusion: Interrogative stance, textual uptake, and voice can be operationalized at corpus scale. Combining computational methods with pragmatic and sociological perspectives helps explain how questioning practices structure contemporary news discourse.
Abstract: Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the "Politics of Questions" in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist's narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.
[87] Instruction Set and Language for Symbolic Regression
Ezequiel Lopez-Rubio,Mario Pascual-Gonzalez
Main category: cs.CL
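TL;DR: This paper proposes the IsalSR framework, which encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string to address structural redundancy in symbolic regression.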
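The redundancy being described can be illustrated with a small sketch: two expression DAGs that differ only in node numbering should map to one canonical form. The code below is not IsalSR's two-tier alphabet or its pruned canonical string. It is a simplified stand-in that serializes structure recursively and unfolds shared nodes, so it is not a complete DAG isomorphism invariant; the `dag` dictionary layout is likewise an assumption.

```python
def canonical_string(dag, root):
    """Return a string that depends only on structure, not on node ids.

    dag maps a node id to (label, [child ids]); operand order is preserved.
    Shared sub-expressions are unfolded, so this is a simplification.
    """
    cache = {}

    def visit(node):
        if node not in cache:
            label, children = dag[node]
            inner = ",".join(visit(child) for child in children)
            cache[node] = f"{label}({inner})" if children else label
        return cache[node]

    return visit(root)

# The same expression, x * (x + 1), under two different node numberings.
dag_a = {0: ("mul", [1, 2]), 1: ("x", []), 2: ("add", [1, 3]), 3: ("1", [])}
dag_b = {7: ("mul", [4, 5]), 4: ("x", []), 5: ("add", [4, 9]), 9: ("1", [])}

same = canonical_string(dag_a, 0) == canonical_string(dag_b, 7)
```

A search procedure that deduplicates candidates by such a key would evaluate the fitness of `dag_a` and `dag_b` only once.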
Details
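Motivation: Symbolic regression (SR) suffers from structural redundancy: the same expression DAG can be numbered in many different ways, which duplicates points in the search space and wastes fitness evaluations.
Method: Propose the IsalSR representation framework, which encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string, a complete labeled-DAG isomorphism invariant, that unifies all equivalent representations.
Result: Equivalent expression representations are fully normalized, which substantially reduces redundancy in the search space and improves the efficiency of symbolic regression.
Conclusion: IsalSR offers an effective way to eliminate structural redundancy and provides a more compact and efficient basis for representation and search in symbolic regression.
Abstract: A fundamental but largely unaddressed obstacle in symbolic regression (SR) is structural redundancy: every expression DAG admits many distinct node-numbering schemes that all encode the same expression, each occupying a separate point in the search space and consuming fitness evaluations without adding diversity. We present IsalSR (Instruction Set and Language for Symbolic Regression), a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string -- a complete labeled-DAG isomorphism invariant -- that collapses all the equivalent representations into a single canonical form.
[88] Select, Label, Evaluate: Active Testing in NLP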
Antonio Purificato,Maria Sofia Bucarelli,Andrea Bacciu,Amin Mantrach,Fabrizio Silvestri
Main category: cs.CL
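TL;DR: This paper formalizes the Active Testing framework for NLP, which selects the most informative test samples for annotation and thereby greatly reduces the human labeling cost of model evaluation. Experiments show that annotation can be cut by up to 95% while keeping the performance estimate within 1% of the full-test-set value, and an adaptive stopping criterion automatically determines the number of samples to label.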
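The idea of stopping without a predefined budget can be sketched as follows. This is a minimal illustration under strong assumptions: it samples uniformly at random rather than choosing informative samples, and it stops when a normal-approximation confidence interval is narrow enough. The paper's actual selection strategies and stopping criterion are not specified here, and `estimate_accuracy` with its parameters is hypothetical.

```python
import math
import random

def estimate_accuracy(correct_flags, tol=0.02, min_samples=30, z=1.96, seed=0):
    """Label samples in random order until the CI half-width is at most tol.

    correct_flags: non-empty list of 0/1 values; looking one up stands in
    for paying an annotator to label that test sample.
    Returns (estimated accuracy, number of samples labeled).
    """
    order = list(range(len(correct_flags)))
    random.Random(seed).shuffle(order)
    hits = 0
    for n, idx in enumerate(order, start=1):
        hits += correct_flags[idx]
        acc = hits / n
        half_width = z * math.sqrt(acc * (1 - acc) / n)
        if n >= min_samples and half_width <= tol:
            return acc, n          # stop early: the estimate is tight enough
    return hits / len(order), len(order)  # budget exhausted: full test set

# A model that is right on every one of 1000 test items needs few labels.
acc, labeled = estimate_accuracy([1] * 1000)
```

The `min_samples` floor guards against stopping on a lucky streak of early samples, where the estimated variance is misleadingly small.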
Details
Motivation: Human annotation in NLP is costly and slow. Test-set annotation is especially expensive because it demands high-quality, low-error labels, and the traditional practice of labeling the whole test set consumes substantial resources.
Method: Formalize the Active Testing framework; benchmark existing approaches at scale across 18 datasets, 4 embedding strategies, and 4 NLP tasks; propose an adaptive stopping criterion that needs no predefined annotation budget.
Result: Annotation is reduced by up to 95% while the estimated performance stays within 1% of the full-test-set value. Effectiveness varies with data characteristics and task type, and no single method is best everywhere. The adaptive stopping criterion determines the number of samples automatically.
Conclusion: Active Testing is an efficient and reliable alternative for model evaluation. The adaptive stopping mechanism makes it more practical and robust, offering a new approach to NLP evaluation in low-resource settings.
Abstract: Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for reliable model evaluation. Traditional approaches require annotating entire test sets, leading to substantial resource requirements. Active Testing is a framework that selects the most informative test samples for annotation. Given a labeling budget, it aims to choose the subset that best estimates model performance while minimizing cost and human effort. In this work, we formalize Active Testing in NLP and conduct an extensive benchmarking of existing approaches across 18 datasets and 4 embedding strategies spanning 4 different NLP tasks. The experiments show annotation reductions of up to 95%, with the performance estimate differing from the full-test-set value by no more than 1%. Our analysis reveals variations in method effectiveness across different data characteristics and task types, with no single approach emerging as universally superior. Lastly, to address the limitation of requiring a predefined annotation budget in existing sample selection strategies, we introduce an adaptive stopping criterion that automatically determines the optimal number of samples.
[89] Riding Brainwaves in LLM Space: Understanding Activation Patterns Using Individual Neural Signatures
Ajan Subramanian,Sumukh Bettadapura,Rohan Sathish
Main category: cs.CL
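TL;DR: This paper asks whether frozen LLM representations encode person-specific EEG signals. It finds that the deep hidden states of models such as Qwen 2.5 7B contain person-specific directions that are stable, non-transferable, and distinct from the shared population signal, and that these directions markedly improve EEG prediction, providing a geometric basis for EEG-driven personalization.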
Details
Motivation: As consumer-grade EEG devices spread, whether language models can be adapted to individual neural responses becomes a key question. The paper tests whether frozen LLM representations contain linearly decodable, person-specific EEG directions.
Method: Using word-level EEG from 30 participants in the ZuCo corpus, train a separate linear probe per person that maps hidden states from each layer of a frozen Qwen 2.5 7B to that person's EEG power (for example, high gamma). Compare person-specific probes with a single population probe, and analyze stability, transferability, separability from the population signal, and layer localization. LLaMA 3.1 8B is used to check generalization, together with confound controls.
Result: Person-specific probes outperform the population probe on every EEG feature (high gamma: rho = 0.183 vs. 0.020, p < 10^-4). The individual directions are temporally stable (cosine = 0.824), non-transferable (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the population signal. The signal concentrates in deep layers, peaking at Layer 24 of 28. Results hold across models (LLaMA) and under controls.
Conclusion: The deep representations of frozen language models contain stable, person-specific, geometrically separable neural directions, which supports their use as a general foundation for EEG-driven personalization.
Abstract: Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person's brain activity but not another's. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual's EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model's deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.
[90] Ara-Best-RQ: Multi Dialectal Arabic SSL
Haroun Elleuch,Ryan Whetten,Salima Mdhaffar,Yannick Estève,Fethi Bougares
Main category: cs.CL
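TL;DR: Ara-BEST-RQ is a family of self-supervised learning models for multi-dialectal Arabic speech processing. Conformer-based BEST-RQ models of up to 600M parameters are pre-trained on a large amount of Arabic speech and reach state-of-the-art dialect identification with fewer parameters than competing models. Pre-training targeted at Arabic dialects significantly improves downstream performance compared with multilingual models or monolingual models trained on non-Arabic data. All models, code, and pre-processed data will be released.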
Details
Motivation: Existing models for Arabic speech processing are not designed for its many dialects. General multilingual or monolingual models, especially those trained on non-Arabic data, struggle to capture dialectal variation, which limits downstream performance.
Method: Build a large corpus from 5,640 hours of crawled Creative Commons Arabic speech plus public datasets; pre-train a family of self-supervised Conformer-based BEST-RQ models (up to 600M parameters); focus on family-targeted pre-training across Arabic dialects.
Result: State-of-the-art performance on dialect identification (DID) with fewer parameters than competing models, and good results on ASR. Dialect-targeted pre-training is shown to be significantly better than multilingual pre-training or monolingual pre-training on non-Arabic data.
Conclusion: Self-supervised pre-training tailored to multi-dialectal Arabic speech (Ara-BEST-RQ) is effective and efficient, and the released resources will support progress in Arabic speech technology.
Abstract: We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.
[91] SLURP-TN : Resource for Tunisian Dialect Spoken Language Understanding
Haroun Elleuch,Salima Mdhaffar,Yannick Estève,Fethi Bougares
Main category: cs.CL
TL;DR: This paper introduces SLURP-TN, a spoken language understanding (SLU) resource for the Tunisian dialect, and provides ASR and SLU baseline models built on it.
Details
Motivation: Because SLU resources are scarce, only a few high-resource languages have benefited from progress in deep learning and pre-trained language models. This paper aims to reduce that barrier.
Method: Build the SLURP-TN dataset by recording 55 native speakers reading sentences from six SLURP domains that were manually translated into the Tunisian dialect, and develop several automatic speech recognition (ASR) and SLU models on this data.
Result: A Tunisian-dialect SLU dataset, SLURP-TN, with 4,165 sentences and about 5 hours of speech, released together with baseline models.
Conclusion: SLURP-TN fills a gap in SLU resources for the Tunisian dialect and offers a new benchmark and tools for research on speech understanding and task-oriented dialogue in low-resource dialects.
Abstract: Spoken Language Understanding (SLU) aims to extract the semantic information from the speech utterance of user queries. It is a core component in a task-oriented dialogue system. With the spectacular progress of deep neural network models and the evolution of pre-trained language models, SLU has obtained significant breakthroughs. However, only a few high-resource languages have taken advantage of this progress due to the absence of SLU resources. In this paper, we seek to mitigate this obstacle by introducing SLURP-TN. This dataset was created by recording 55 native speakers uttering sentences in Tunisian dialect, manually translated from six SLURP domains. The result is an SLU Tunisian dialect dataset that comprises 4,165 sentences recorded into around 5 hours of acoustic material. We also develop a number of Automatic Speech Recognition and SLU models exploiting SLURP-TN. The dataset and baseline models are available at: https://huggingface.co/datasets/Elyadata/SLURP-TN.
[92] Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning
Ulugbek Shernazarov,Rostislav Svitsov,Bin Shi
Main category: cs.CL
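TL;DR: This paper compares LoRA, Prompt Tuning, and full fine-tuning of Flan-T5 models for medical text summarization. LoRA outperforms full fine-tuning while training only 0.6% of the parameters, and the results suggest that the low-rank constraint acts as a regularizer.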
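A short calculation shows why a low-rank adapter trains so few parameters. The sketch assumes the standard LoRA factorization of a weight update into a `d_out x r` matrix and an `r x d_in` matrix; the layer size and rank below are illustrative and are not the paper's configuration.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Fraction of one d_out x d_in weight matrix that a rank-r adapter trains."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return adapter / full

# Illustrative layer: a 1024 x 1024 projection with a rank-8 adapter.
fraction = lora_trainable_fraction(1024, 1024, 8)
```

Here the adapter holds about 1.6% of that single matrix's parameters. The share of the whole model is smaller still, because adapters are usually attached to only some weight matrices and everything else stays frozen.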
Details
Motivation: Fine-tuning large language models for specialized domains such as medicine demands substantial compute, so parameter-efficient fine-tuning (PEFT) methods are needed to cut cost while preserving performance.
Method: Systematically compare LoRA, Prompt Tuning, and full fine-tuning across the Flan-T5 model family on the PubMed medical summarization dataset, using multiple random seeds and sensitivity analyses (LoRA rank, prompt token count) to check stability and identify the key factors.
Result: LoRA reaches 43.52 ± 0.18 ROUGE-1 on Flan-T5-Large, compared with 40.67 ± 0.21 for full fine-tuning, with only 0.6% trainable parameters. The findings suggest that the low-rank constraint provides beneficial regularization.
Conclusion: LoRA is a more efficient and more robust way to fine-tune for medical text summarization. It challenges the assumption that all parameters must be updated and offers a reliable option for domain adaptation under limited resources.
Abstract: Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches (Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning) across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52 +/- 0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67 +/- 0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at https://github.com/eracoding/llm-medical-summarization
[93] Retrieving Climate Change Disinformation by Narrative
Max Upravitelev,Veronika Solopova,Charlott Jakob,Premtim Sahitaj,Vera Schmitt
Main category: cs.CL
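TL;DR: This paper reframes the detection of climate disinformation narratives as a retrieval task and designs the SpecFi framework, which generates hypothetical documents to bridge abstract narratives and concrete texts, improving detection of emerging narratives.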
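Retrieval quality in this setting is reported as mean average precision (MAP). For readers unfamiliar with the metric, here is a minimal sketch of how it is computed from ranked relevance judgments. It assumes binary relevance and that every relevant document appears somewhere in the ranking; the example rankings are made up.

```python
def average_precision(relevance):
    """Average precision for one query; relevance is a ranked list of 0/1."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at each relevant position
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of per-query average precision over a non-empty list of rankings."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two hypothetical narrative queries with their ranked results.
ap = average_precision([1, 0, 1, 0])             # (1/1 + 2/3) / 2
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

Because each relevant hit is weighted by the precision at its rank, MAP rewards systems that place matching texts near the top of the list.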
Details
Motivation: Existing methods rely on fixed taxonomies and cannot keep up with newly emerging narratives, so a more flexible detection framework is needed.
Method: Model narrative detection as text retrieval, with a narrative's core message as the query. Build the SpecFi framework, which uses community summaries obtained from graph-based community detection as few-shot examples for generating hypothetical documents. Introduce narrative variance as an embedding-based difficulty metric.
Result: SpecFi achieves a MAP of 0.505 on CARDS. On high-variance narratives, SpecFi-CS is far more robust than BM25 (MAP loss of 32.7% vs. 63.4%). Unsupervised community summaries closely match expert-crafted taxonomies.
Conclusion: Treating narrative detection as retrieval, supported by generated hypothetical documents, enables open-domain identification of evolving disinformation; graph-based analysis can surface narrative structure from unlabeled text.
Abstract: Detecting climate disinformation narratives typically relies on fixed taxonomies, which do not accommodate emerging narratives. Thus, we re-frame narrative detection as a retrieval task: given a narrative's core message as a query, rank texts from a corpus by alignment with that narrative. This formulation requires no predefined label set and can accommodate emerging narratives. We repurpose three climate disinformation datasets (CARDS, Climate Obstruction, climate change subset of PolyNarrative) for retrieval evaluation and propose SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. SpecFi uses community summaries from graph-based community detection as few-shot examples for generation, achieving a MAP of 0.505 on CARDS without access to narrative labels. We further introduce narrative variance, an embedding-based difficulty metric, and show via partial correlation analysis that standard retrieval degrades on high-variance narratives (BM25 loses 63.4% of MAP), while SpecFi-CS remains robust (32.7% loss). Our analysis also reveals that unsupervised community summaries converge on descriptions close to expert-crafted taxonomies, suggesting that graph-based methods can surface narrative structure from unlabeled text.
[94] Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch
Stella Eva Tsiapali,Cong-Thanh Do,Kate Knill
Main category: cs.CL
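TL;DR: This paper analyzes the attention mechanism of DSKD-CMA, exposing its strengths and weaknesses, and proposes DSKD-CMA-GA, an improvement based on generative adversarial learning that reduces the mismatch between key and query distributions computed from models with different tokenizers. It yields consistent gains on text generation, most visibly on out-of-distribution data.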
Details
Motivation: DSKD-CMA is a state-of-the-art method for knowledge distillation between LLMs with different tokenizers, but its internal workings are poorly understood, and a mismatch between key and query distributions limits its performance.
Method: Analyze the attention mechanism of DSKD-CMA through manual token alignment probing and attention heatmap visualization. Building on this, propose DSKD-CMA-GA, which uses generative adversarial learning to align the key and query distributions across models.
Result: Modest but consistent ROUGE-L gains on text generation, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross-tokenizer and same-tokenizer knowledge distillation.
Conclusion: Understanding and improving how attention distributions match in cross-tokenizer distillation is a key route to better small models; DSKD-CMA-GA shows that the adversarial approach is effective in this setting.
Abstract: Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.
[95] Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison
Caio Vicentino
Main category: cs.CL
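TL;DR: This paper presents a tightly controlled empirical comparison of autoregressive (AR) and masked diffusion (MDLM) language models. Training throughput is similar, but convergence and generation behavior differ markedly: AR converges quickly but overfits and produces fluent yet repetitive text, while MDLM converges slowly but keeps improving and produces more diverse text with occasional grammatical errors.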
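The diversity measures mentioned in this comparison are simple to compute. The sketch below shows Distinct-n and the fraction of unique k-word openings, assuming whitespace tokenization; the paper's tokenization and exact metric definitions may differ, and the sample texts are made up.

```python
def distinct_n(texts, n):
    """Unique n-grams divided by total n-grams across all texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def unique_opening_rate(texts, k):
    """Fraction of texts whose first k tokens are distinct within the set."""
    openings = {tuple(text.split()[:k]) for text in texts}
    return len(openings) / len(texts)

# Two made-up generations that share an opening and diverge at the last word.
samples = ["once upon a time", "once upon a hill"]
d1 = distinct_n(samples, 1)            # 5 unique unigrams out of 8
d2 = distinct_n(samples, 2)            # 4 unique bigrams out of 6
first_word = unique_opening_rate(samples, 1)
```

Low values signal repetition: a model whose samples all start with the same word would score close to `1 / len(samples)` on the one-word opening rate.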
Details
Motivation: To isolate how the generation paradigm (autoregressive vs. masked diffusion) affects training dynamics and generation quality, the comparison must hold data, compute, and hardware constant.
Method: Train an AR model and an MDLM on identical data (50M tokens from TinyStories), compute budget (20,000 steps, batch size 32, sequence length 512), and hardware (NVIDIA H100), then compare them quantitatively on training efficiency, convergence behavior, and generation diversity (Distinct-n, Self-BLEU, uniqueness of first-word and five-word openings).
Result: 1) Training throughput is comparable (~50K tokens/second), with MDLM taking only 4.7% more wall-clock time. 2) AR begins overfitting by step 14,000, while MDLM is still improving at step 20,000. 3) AR output is highly repetitive (99.8% of samples begin with the same word), whereas MDLM output is structurally more diverse (93.4% unique five-word openings, higher Distinct-n, lower Self-BLEU) but slightly less grammatically consistent.
Conclusion: The generation paradigm itself strongly shapes training dynamics and output characteristics. MDLM is not a drop-in replacement for AR but a paradigm with different trade-offs (diversity vs. fluency, convergence speed vs. final performance) that calls for its own training strategy.
Abstract: We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
[96] Multiperspectivity as a Resource for Narrative Similarity Prediction
Max Upravitelev,Veronika Solopova,Jing Yang,Charlott Jakob,Premtim Sahitaj,Ariana Sahitaj,Vera Schmitt
Main category: cs.CL
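TL;DR: This paper argues that multiperspectivity in narrative similarity prediction should be treated as a design resource rather than a defect. An ensemble of 31 LLM personas reaches 0.705 accuracy on SemEval-2026 Task 4, and gender-focused interpretive vocabulary is found to correlate negatively with accuracy, prompting a call for evaluation frameworks that accommodate interpretive plurality.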
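The reason accuracy can grow with ensemble size is captured by the Condorcet Jury Theorem. The sketch below computes majority-vote accuracy under the textbook assumptions of fully independent voters with equal individual accuracy. The personas in the paper are explicitly not fully independent, so this is an idealized upper-level illustration, and the value `p = 0.6` is made up.

```python
from math import comb

def majority_accuracy(p, n):
    """P(majority of n independent voters is correct), for odd n.

    Each voter is assumed to be correct independently with probability p.
    """
    need = n // 2 + 1   # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

single = majority_accuracy(0.6, 1)     # one voter: just p
trio = majority_accuracy(0.6, 3)       # 3 * 0.6^2 * 0.4 + 0.6^3 = 0.648
large = majority_accuracy(0.6, 31)     # same ensemble size as the persona count
```

When errors are correlated the gain shrinks, which is why personas with lower individual accuracy but less correlated errors can still add more to the ensemble.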
Details
Motivation: Narrative similarity prediction is inherently interpretive: different, equally valid readings lead to different similarity judgments, yet existing semantic evaluation benchmarks encode a single ground truth and ignore this. The paper proposes building multiperspectivity into the design of predictive systems.
Method: Build an ensemble of 31 LLM personas, ranging from practitioners who follow specific interpretive frameworks to intuitive lay characters. Run experiments on the SemEval-2026 Task 4 dataset with majority voting and analyze error correlations.
Result: The ensemble reaches 0.705 accuracy, and accuracy increases with ensemble size. Practitioner personas perform worse individually but have less correlated errors, which yields larger ensemble gains. The use of gender-focused interpretive vocabulary is consistently negatively associated with accuracy.
Conclusion: Multi-perspective ensembles can make narrative similarity prediction more robust. Current benchmarks may miss valid interpretive dimensions (such as gender perspectives), so new evaluation frameworks that support interpretive plurality are needed.
Abstract: Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
[97] The Semantic Ladder: A Framework for Progressive Formalization of Natural Language Content for Knowledge Graphs and AI Systems
Lars Vogt
Main category: cs.CL
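TL;DR: This paper proposes the Semantic Ladder framework, which formalizes knowledge progressively across levels of increasing semantic explicitness (from natural language to formal logic) while supporting semantic continuity, traceability, and the integration of heterogeneous representations.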
Details
Motivation: There is a gap between knowledge expressed in natural language and machine-actionable formal semantic models, and it is most acute when full semantic formalization is required at the point of data entry.
Method: Propose the Semantic Ladder architectural framework, which takes modular semantic units as its basic carriers of meaning, organizes representations into levels of increasing semantic explicitness (from text snippets to ontology-based models to higher-order logic), and defines transformations between levels that support semantic enrichment, statement structuring, and logical modelling.
Result: The framework enables incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports unified integration of heterogeneous representations, including natural language, structured semantic models, and vector embeddings.
Conclusion: The Semantic Ladder provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.
Abstract: Semantic data and knowledge infrastructures must reconcile two fundamentally different forms of representation: natural language, in which most knowledge is created and communicated, and formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap remains a central challenge, particularly when full semantic formalization is required at the point of data entry. Here, we introduce the Semantic Ladder, an architectural framework that enables the progressive formalization of data and knowledge. Building on the concept of modular semantic units as identifiable carriers of meaning, the framework organizes representations across levels of increasing semantic explicitness, ranging from natural language text snippets to ontology-based and higher-order logical models. Transformations between levels support semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability. This approach enables the incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports the integration of heterogeneous representations, including natural language, structured semantic models, and vector-based embeddings. The Semantic Ladder thereby provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.
[98] Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation
Ireh Kim,Tesia Sker,Chanwoo Kim
Main category: cs.CL
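TL;DR: This paper proposes a two-stage fine-tuning strategy that uses LLM-augmented document-level parallel data to improve LLM performance on document-level machine translation, with data augmentation and quality filtering to counter data scarcity and hallucinations or omissions in generation.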
Details
Motivation: LLMs have generally underperformed conventional encoder-decoder systems in machine translation, yet they are strong at modeling context and maintaining document-level coherence. Two obstacles remain: high-quality document-level parallel data is scarce, and LLM generation is prone to hallucinations and omissions.
Method: First, use an LLM to convert summarization data into document-level parallel data, then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Then fine-tune in two stages: first on abundant sentence-level MT data, then on the filtered document-level corpus.
Result: The method improves LLM performance on document-level machine translation, mitigates data scarcity and generation errors, and improves cross-sentence consistency and translation quality.
Conclusion: With high-quality document-level data augmentation and staged fine-tuning, LLMs can be applied successfully to document-level machine translation, retaining their strength in context modeling while overcoming their inherent weaknesses.
Abstract: In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
[99] Gumbel Distillation for Parallel Text Generation
Chi Zhang,Xixi Hu,Bo Liu,Qiang Liu
Main category: cs.CL
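TL;DR: This paper proposes Gumbel Distillation, a new distillation technique that uses the Gumbel-Max trick so that parallel decoders can effectively learn the joint token distribution of an autoregressive teacher, substantially improving the generation quality of parallel language models.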
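The Gumbel-Max trick named in this entry is a standard identity: adding independent Gumbel(0, 1) noise to the logits and taking the argmax yields a sample from the softmax distribution, and for a fixed noise vector the chosen token is fully determined. The sketch below shows only that identity in plain Python; the function names are illustrative, and it makes no claim about how the paper wires the trick into distillation.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # avoid log(0); random() lies in [0, 1)
    return -math.log(-math.log(u))


def gumbel_max_token(logits, noise):
    """Deterministically map a noise vector to the index argmax_i(logits[i] + noise[i])."""
    scores = [logit + g for logit, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 1.0, 0.0]
    draws = 20000
    counts = [0] * len(logits)
    for _ in range(draws):
        noise = [sample_gumbel(rng) for _ in logits]
        counts[gumbel_max_token(logits, noise)] += 1
    # The empirical frequencies should be close to the softmax probabilities.
    print([count / draws for count in counts])
    print(softmax(logits))
```

Because the token is a deterministic function of the noise, a teacher's sampled output can be paired with the exact noise that produced it, which is the property the abstract relies on.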
Details
Motivation: Autoregressive language models decode slowly, while parallel decoding models are fast but struggle to model the complex joint distribution of tokens, which lowers generation quality. This performance gap needs to be narrowed.
Method: Proposes Gumbel Distillation, which uses the Gumbel-Max trick to build a deterministic mapping from a latent Gumbel noise space to the output tokens of the AR teacher, giving a model-agnostic way to distill into parallel decoders.
Result: Experiments on LM1B and OpenWebText show a substantial gain in the generation quality of parallel language models; relative to MDLM on OpenWebText, the MAUVE score improves by 30.0% and generative perplexity drops by 10.5%.
Conclusion: Gumbel Distillation is a general and effective distillation strategy that substantially narrows the generation-quality gap between parallel and autoregressive language models.
Abstract: The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and 10.5% in generative perplexity over MDLM trained on the OpenWebText dataset. Code available at https://github.com/hxixixh/gumbel-distill.
[100] Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson's Disease
Abner Hernandez,Eunjung Yeo,Kwanghee Choi,Chin-Jou Li,Zhengjun Yue,Rohan Kumar Das,Jan Rusz,Mathew Magimai Doss,Juan Rafael Orozco-Arroyave,Tomás Arias-Vergara,Andreas Maier,Elmar Nöth,David R. Mortensen,David Harwath,Paula Andrea Perez-Toro
Main category: cs.CL
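TL;DR: This paper proposes a representation-level language shift (LS) that aligns source- and target-language self-supervised speech representations through centroid-based vector adaptation estimated from healthy-control speech, improving cross-lingual dysarthria detection.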
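One plausible reading of "centroid-based vector adaptation" is a mean shift: estimate one centroid per language from healthy-control embeddings, then move each source-language embedding by the difference between the two centroids. The sketch below assumes exactly that form; the function names and the toy two-dimensional vectors are illustrative, and the paper may define the adaptation differently.

```python
def centroid(vectors):
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def language_shift(embedding, source_centroid, target_centroid):
    """Shift one source-language embedding toward the target-language distribution."""
    return [x - s + t for x, s, t in zip(embedding, source_centroid, target_centroid)]


if __name__ == "__main__":
    # Toy healthy-control embeddings; real ones would come from a self-supervised speech model.
    source_healthy = [[1.0, 1.0], [3.0, 1.0]]
    target_healthy = [[0.0, 4.0], [2.0, 6.0]]
    mu_source = centroid(source_healthy)
    mu_target = centroid(target_healthy)
    print(language_shift([2.5, 0.5], mu_source, mu_target))
```

Estimating the centroids from healthy controls only means the shift removes a language offset without using any dysarthria labels.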
Details
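Motivation: Dysarthric speech data are scarce, which makes cross-lingual detection difficult, particularly because speech representations encode language-dependent structure that confounds detection.
Method: Proposes a representation-level language shift (LS) that estimates centroids from healthy-control speech and applies a vector adaptation to source-language self-supervised speech representations so that their distribution aligns with the target language.
Result: On Parkinson's disease speech datasets in Czech, German, and Spanish, LS substantially improves sensitivity and F1 in cross-lingual settings and gives smaller but consistent gains in multilingual settings; analysis of the embedding space shows reduced language-identity information.
Conclusion: LS weakens language-dependent structure in speech representations and thereby improves cross-lingual dysarthria detection, offering a new approach to clinical speech analysis for low-resource languages.
Abstract: The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson's disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
[101] MemDLM: Memory-Enhanced DLM Training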
Zehua Pei,Hui-Ling Zhen,Weizhe Lin,Sinno Jialin Pan,Yunhe Wang,Mingxuan Yuan,Bei Yu
Main category: cs.CL
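TL;DR: This paper proposes MemDLM, a memory-enhanced diffusion language model that embeds a simulated denoising process into training through bi-level optimization to reduce the train-inference mismatch. Its parametric memory speeds up training convergence and can also serve as an adaptation module at inference time, improving long-context understanding and retrieval.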
Details
Motivation: Diffusion language models (DLMs) suffer from a notable train-inference mismatch: training uses a single-step masked prediction objective, whereas inference follows a multi-step progressive denoising trajectory.
Method: Proposes MemDLM, which uses bi-level optimization. An inner loop updates fast weights to build a parametric memory that captures each sample's local denoising-trajectory experience, and an outer loop updates the base model conditioned on this memory. The inner loop can be re-enabled at inference time as an adaptation step.
Result: MemDLM converges faster and reaches lower training loss. Enabling the parametric memory at inference time improves long-context understanding and, on Needle-in-a-Haystack tasks, behaves like an in-weight retrieval mechanism that eases token-level attention bottlenecks.
Conclusion: Explicitly modeling denoising-trajectory experience as parametric memory narrows the train-inference gap of DLMs and brings gains in both training and inference.
Abstract: Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
[102] Greater accessibility can amplify discrimination in generative AI
Carolin Holtermann,Minh Duc Bui,Kaitlyn Zhou,Valentin Hofmann,Katharina von der Wense,Anne Lauscher
Main category: cs.CL
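TL;DR: This paper shows that audio-enabled large language models (LLMs) exhibit systematic gender discrimination: speaker voice alone triggers stereotyped responses, and the bias exceeds that seen in text interaction. It also finds that infrequent chatbot users are the most sensitive to undisclosed attribute inference, and it proposes pitch manipulation as a mitigation strategy.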
Details
Motivation: Voice interaction can improve accessibility (for users with limited literacy, motor impairments, or mobile-only devices), but speech carries identity cues such as gender that are hard to mask and may raise new fairness problems. Existing work mostly studies text bias and overlooks bias mechanisms specific to the speech modality.
Method: Experimentally evaluates how audio-enabled LLMs respond to voices of different genders, analyzing gender-stereotyped adjectives and occupational tendencies in the outputs; combines this with a user survey (n=1,000) on acceptance of and reactions to attribute inference; and tests pitch manipulation as an intervention on biased outputs.
Result: Audio LLMs show marked gender discrimination, with responses skewed toward gender-stereotyped content and more bias than in text interaction. The survey shows that infrequent users object most to undisclosed attribute inference and are the most likely to disengage. Pitch manipulation can systematically reduce gender-discriminatory outputs.
Conclusion: Voice interfaces do not simply carry text LLMs over to a new modality; they introduce distinct bias mechanisms driven by paralinguistic cues. Accessibility and fairness therefore have to be advanced together in AI development.
Abstract: Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ($n=1,000$) shows that infrequent chatbot users are most hesitant about undisclosed attribute inference and most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs.
Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.
[103] TiCo: Time-Controllable Training for Spoken Dialogue Models
Kai-Wei Chang,Wei-Chih Chen,En-Pei Hu,Hung-yi Lee,James Glass
Main category: cs.CL
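TL;DR: TiCo is a simple post-training method that gives spoken dialogue models time awareness by introducing Spoken Time Markers (STM), so that they generate responses matching a specified duration.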
Details
Motivation: Existing spoken dialogue models lack time awareness and struggle to follow duration-related instructions (e.g., "generate a response lasting about 15 seconds"), which degrades interaction quality in practical applications such as voice assistants.
Method: Proposes TiCo, which uses self-generation and reinforcement learning to insert Spoken Time Markers (STM, e.g., <10.6 seconds>) during generation, so that the model estimates elapsed speaking time and adjusts the remaining content to meet the target duration.
Result: Experiments show that TiCo significantly improves adherence to duration constraints while preserving response quality.
Conclusion: TiCo is a lightweight, efficient post-training method that needs no additional question-answer pairs, gives spoken dialogue models controllable timing, and suits real-world voice interaction systems.
Abstract: We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
cs.CV
[104] Efficient AI-Driven Multi-Section Whole Slide Image Analysis for Biochemical Recurrence Prediction in Prostate Cancer
Yesung Cho,Dongmyung Shin,Sujeong Hong,Jooyeon Lee,Seongmin Park,Geongyu Lee,Jongbae Park,Hong Koo Ha
Main category: cs.CV
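TL;DR: This paper proposes a new AI framework that analyzes multiple pathology slides per prostate cancer patient at once to predict biochemical recurrence (BCR) after radical prostatectomy. On a large-scale dataset it clearly outperforms conventional clinical markers, and external validation confirms its generalizability and clinical feasibility.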
Details
Motivation: Predicting biochemical recurrence (BCR) after prostate cancer surgery is difficult, mainly because tumors are multifocal across the whole prostate gland and conventional single-slide analysis cannot capture tumor heterogeneity fully.
Method: Builds an AI framework over a series of multi-section pathology slides that integrates patch-level and slide-level sub-sampling strategies; trains and validates it on a large dataset of 23,451 slides from 789 patients; and uses a multivariable Cox proportional hazards model to assess the independent prognostic value of the AI risk score.
Result: The AI model substantially outperforms clinical benchmarks such as PSA and Gleason score for 1- and 2-year BCR prediction; the AI-derived risk score is the strongest independent prognostic factor; sub-sampling greatly reduces computational cost without hurting performance; and external validation confirms generalizability.
Conclusion: The multi-section AI analysis framework is clinically feasible and has high prognostic value, and can serve as a scalable aid for post-operative management of prostate cancer.
Abstract: Prostate cancer is one of the most frequently diagnosed malignancies in men worldwide. However, precise prediction of biochemical recurrence (BCR) after radical prostatectomy remains challenging due to the multifocality of tumors distributed throughout the prostate gland. In this paper, we propose a novel AI framework that simultaneously processes a series of multi-section pathology slides to capture the comprehensive tumor landscape across the entire prostate gland. To develop this predictive AI model, we curated a large-scale dataset of 23,451 slides from 789 patients. The proposed framework demonstrated strong predictive performance for 1- and 2-year BCR prediction, substantially outperforming established clinical benchmarks. The AI-derived risk score was validated as the most potent independent prognostic factor in a multivariable Cox proportional hazards analysis, surpassing conventional clinical markers such as pre-operative PSA and Gleason score. Furthermore, we demonstrated that integrating patch and slide sub-sampling strategies significantly reduces computational cost during both training and inference without compromising predictive performance, and generalizability of AI was confirmed through external validation. Collectively, these results highlight the clinical feasibility and prognostic value of the proposed AI-based multi-section slide analysis as a scalable tool for post-operative management in prostate cancer.
[105] Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection
Saeed Khaki,Nima Safaei,Kamal Ginotra
Main category: cs.CV
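TL;DR: This paper studies domain-aware decoder layer pruning in vision-language models (VLMs). It uses activation similarity to identify the layers that matter least for math or non-math tasks, finds that performance follows a three-regime pattern as the pruning budget grows, and achieves efficient compression while retaining both mathematical and general multimodal capabilities.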
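A minimal sketch of ranking layers by activation similarity is given below. It assumes cosine similarity between each layer's input and output activations, averaged over samples from one domain, as the similarity measure; the abstract does not name the exact metric, so `cosine`, `rank_layers`, and the toy activations are illustrative choices only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_layers(layer_io):
    """Rank layers from most to least prunable for one domain.

    layer_io maps a layer index to a list of (input_vector, output_vector)
    pairs collected on that domain's inputs. A layer whose output stays close
    to its input transforms the representation least, so it ranks first.
    """
    mean_similarity = {
        layer: sum(cosine(inp, out) for inp, out in pairs) / len(pairs)
        for layer, pairs in layer_io.items()
    }
    return sorted(mean_similarity, key=mean_similarity.get, reverse=True)


if __name__ == "__main__":
    # Toy activations for three layers on a single "math" sample.
    math_activations = {
        0: [([1.0, 0.0], [0.0, 1.0])],  # strong transformation
        1: [([1.0, 0.0], [2.0, 0.0])],  # direction unchanged
        2: [([1.0, 0.0], [1.0, 1.0])],  # moderate transformation
    }
    print(rank_layers(math_activations))
```

Running the same ranking on math and non-math samples separately, or on their union, would give the math-aware, non-math-aware, and mixed criteria that the abstract describes.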
Details
Motivation: Transformer-based vision-language models contain substantial depth redundancy, but how removing specific decoder layers affects tasks that need tight coupling between perception and reasoning (such as mathematical reasoning) is still unclear.
Method: Proposes a domain-aware activation-similarity measure and builds math-aware, non-math-aware, and mixed ranking criteria that identify the layers whose input-output representations change least on math or non-math inputs; systematically evaluates pruning strategies across several VLMs and benchmarks.
Result: Performance shows a consistent three-regime structure as the pruning budget grows: at low budgets it is sensitive to which layers are removed, at moderate budgets the strategies converge, and at high budgets structural continuity dominates. The proposed domain-aware rankings are the most stable in the sensitive regime and match or exceed structure-aware baselines at high budgets.
Conclusion: Model depth shapes domain-specific behavior in a structured way, and a simple domain-aware pruning strategy can reduce VLM depth while keeping essential mathematical and general vision-language capabilities.
Abstract: Transformer-based vision-language models (VLMs) contain substantial depth redundancy, yet the effect of removing specific decoder layers remains poorly understood, especially for domains that require tight coupling between perception and multi-step reasoning. We study structured decoder layer pruning through the lens of domain-aware activation similarity, measuring how strongly each layer transforms representations for math versus non-math inputs. This yields simple math-aware, non-math-aware, and mixed ranking criteria that identify layers whose input-output activations change least within a target domain. Across two state-of-the-art VLMs and a broad suite of math and general multimodal benchmarks, we uncover a consistent three-regime structure: at low pruning budgets, performance is highly sensitive to which layers are removed; at moderate budgets, methods converge as structural damage accumulates; and at high budgets, structural continuity dominates, favoring spacing-aware strategies. Our domain-aware rankings achieve the strongest stability in the ranking-sensitive regime, while matching or exceeding structure-aware baselines at larger budgets. These results provide a clearer picture of how depth contributes to domain-specific behavior in VLMs and offer a practical, interpretable approach to reducing model depth without sacrificing essential mathematical or general vision-language capabilities.
[106] Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs
Danial Monachan,Samira Nazari,Mahdi Taheri,Ali Azarpeyvand,Milos Krstic,Michael Huebner,Christian Herglotz
Main category: cs.CV
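TL;DR: This paper proposes Mix-and-Match Pruning, a globally guided, layer-wise sparsification framework that uses sensitivity scores and simple architectural rules to generate diverse, high-quality pruning configurations, improving the accuracy-sparsity trade-off of DNN compression for edge devices.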
Details
Motivation: Layers and architectures respond very differently to pruning, so a single pruning strategy is suboptimal; a more flexible, architecture-aware approach is needed.
Method: Derives architecture-aware sparsity ranges from sensitivity signals (magnitude, gradient, or their combination), for example preserving normalization layers while pruning classifiers aggressively, and systematically samples these ranges to generate multiple pruning strategies without repeated pruning runs.
Result: Achieves Pareto-optimal results on CNNs and Vision Transformers; on Swin-Tiny, accuracy degradation is reduced by 40% relative to standard single-criterion pruning.
Conclusion: Coordinating existing pruning signals yields compressed models more reliably and efficiently than introducing new criteria.
Abstract: Deploying deep neural networks (DNNs) on edge devices requires strong compression with minimal accuracy loss. This paper introduces Mix-and-Match Pruning, a globally guided, layer-wise sparsification framework that leverages sensitivity scores and simple architectural rules to generate diverse, high-quality pruning configurations. The framework addresses a key limitation that different layers and architectures respond differently to pruning, making single-strategy approaches suboptimal. Mix-and-Match derives architecture-aware sparsity ranges, e.g., preserving normalization layers while pruning classifiers more aggressively, and systematically samples these ranges to produce ten strategies per sensitivity signal (magnitude, gradient, or their combination). This eliminates repeated pruning runs while offering deployment-ready accuracy-sparsity trade-offs. Experiments on CNNs and Vision Transformers demonstrate Pareto-optimal results, with Mix-and-Match reducing accuracy degradation on Swin-Tiny by 40% relative to standard single-criterion pruning. These findings show that coordinating existing pruning signals enables more reliable and efficient compressed models than introducing new criteria.
[107] STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
Runze Wang,Yuxuan Song,Youcheng Cai,Ligang Liu
Main category: cs.CV
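TL;DR: This paper proposes STAC, a framework whose spatio-temporally aware cache compression greatly reduces the memory use and compute cost of causal VGGT transformers in streaming 3D reconstruction while preserving long-term consistency.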
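The idea of scoring cached tokens by decayed cumulative attention can be sketched as an exponential moving sum followed by a top-k eviction. Everything concrete below is an assumption made for illustration: the update rule `decay * old_score + new_attention`, the decay value, and the fixed token budget are not taken from the paper.

```python
def update_scores(scores, attention, decay=0.9):
    """Decay each token's running score and add the attention it just received."""
    return [decay * score + attn for score, attn in zip(scores, attention)]


def keep_top_k(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in cache order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    scores = [0.0, 0.0, 0.0]
    # Attention mass each cached token receives from two successive frames.
    for frame_attention in ([0.5, 0.1, 0.4], [0.0, 0.6, 0.4]):
        scores = update_scores(scores, frame_attention, decay=0.5)
    print(scores)
    print(keep_top_k(scores, budget=2))
```

The decay lets a token that was important early but is no longer attended to fall out of the cache gradually instead of being pinned forever.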
Details
Motivation: Causal VGGT transformers support streaming 3D reconstruction with a KV cache, but the cache grows linearly with stream length, and under a limited memory budget early cache eviction severely harms reconstruction quality and temporal consistency.
Method: Building on the observation that attention is intrinsically sparse in space and time, proposes STAC with three components: (1) working temporal token caching based on decayed cumulative attention scores; (2) long-term spatial token caching that compresses spatially redundant tokens into voxel-aligned representations; and (3) a chunk-based multi-frame optimization strategy that improves temporal coherence and GPU efficiency.
Result: Experiments show that STAC reaches state-of-the-art reconstruction quality while cutting memory by nearly 10x and speeding up inference by 4x.
Conclusion: STAC substantially improves the scalability of real-time 3D reconstruction in streaming settings and offers an efficient cache-management approach for online reconstruction with large models.
Abstract: Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.
[108] Efficient Visual Anomaly Detection at the Edge: Enabling Real-Time Industrial Inspection on Resource-Constrained Devices
Arianna Stropeni,Fabrizio Genilotti,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Gian Antonio Susto
Main category: cs.CV
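TL;DR: This paper proposes two efficient visual anomaly detection (VAD) methods for edge deployment, PatchCore-Lite and Padim-Lite, which markedly reduce memory footprint and inference time and enable real-time, privacy-preserving industrial inspection on resource-constrained edge devices.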
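The diagonal-covariance simplification behind Padim-Lite follows from a standard identity: when the covariance matrix is diagonal with variances $\sigma_i^2$, the Mahalanobis distance $\sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)}$ reduces to $\sqrt{\sum_i (x_i-\mu_i)^2 / \sigma_i^2}$, so no matrix inverse or matrix product is needed. The sketch below shows that element-wise form; the function name and the `eps` stabilizer are illustrative additions, not part of the paper.

```python
import math


def diag_mahalanobis(x, mean, var, eps=1e-6):
    """Mahalanobis distance under a diagonal covariance, computed element-wise.

    `var` holds the per-dimension variances; `eps` is a small stabilizer
    (an assumption of this sketch) that guards against zero variance.
    """
    return math.sqrt(
        sum((xi - mi) ** 2 / (vi + eps) for xi, mi, vi in zip(x, mean, var))
    )


if __name__ == "__main__":
    # With unit variances the result matches the Euclidean distance.
    print(diag_mahalanobis([3.0, 4.0], [0.0, 0.0], [1.0, 1.0], eps=0.0))
    # Larger variances shrink the contribution of each dimension.
    print(diag_mahalanobis([3.0, 4.0], [0.0, 0.0], [9.0, 16.0], eps=0.0))
```

For a feature of dimension d, storing only the variances takes d values per patch position instead of d × d, which is consistent with the memory reduction the entry reports.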
Details
Motivation: VAD for industrial inspection must meet real-time and privacy requirements, which calls for moving from the cloud to edge deployment; edge devices are resource-constrained, so lightweight VAD methods are needed.
Method: Based on the PatchCore and PaDiM models, proposes two lightweight methods: PatchCore-Lite, a two-stage search that combines product quantization with decoding of a subset, and Padim-Lite, which uses a diagonal covariance to simplify the Mahalanobis distance computation.
Result: On the MVTec AD and VisA datasets, PatchCore-Lite reduces memory by 79%, and Padim-Lite reduces memory by 77% and inference time by 31%.
Conclusion: The proposed methods support VAD deployment on edge devices, balancing real-time operation, privacy, and cost efficiency.
Abstract: Visual Anomaly Detection (VAD) is essential for industrial quality control, enabling automatic defect detection in manufacturing. In real production lines, VAD systems must satisfy strict real-time and privacy requirements, necessitating a shift from cloud-based processing to local edge deployment. However, processing data locally on edge devices introduces new challenges because edge hardware has limited memory and computational resources. To overcome these limitations, we propose two efficient VAD methods designed for edge deployment: PatchCore-Lite and Padim-Lite, based on the popular PatchCore and PaDiM models. PatchCore-Lite first runs a coarse search on a product-quantized memory bank, then an exact search on a decoded subset. Padim-Lite is sped up using diagonal covariance, turning Mahalanobis distance into efficient element-wise computation. We evaluate our methods on the MVTec AD and VisA benchmarks and show their suitability for edge environments. PatchCore-Lite achieves a remarkable 79% reduction in total memory footprint, while PaDiM-Lite achieves substantial efficiency gains with a 77% reduction in total memory and a 31% decrease in inference time. These results show that VAD can be effectively deployed on edge devices, enabling real-time, private, and cost-efficient industrial inspection.
[109] Remote Sensing Image Dehazing: A Systematic Review of Progress, Challenges, and Prospects
Heng Zhou,Xiaoxiong Liu,Zhenxi Zhang,Jieheng Yun,Chengyang Li,Yunchu Yang,Dongyi Xia,Chunna Tian,Xiao-Jun Wu
Main category: cs.CV
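TL;DR: This paper gives the first systematic survey of remote sensing image (RSI) dehazing, covering methodological evolution, benchmark assessment, and physical consistency analysis. It summarizes more than 30 representative models and runs large-scale experiments on 5 datasets with 12 metrics, finding that Transformer and diffusion models improve perceptual quality while physics-guided hybrid models improve radiometric stability. It closes by outlining directions for TCE (trustworthy, controllable, efficient) dehazing systems.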
Details
Motivation: Remote sensing images are often degraded by haze and thin clouds, which distort surface reflectance and hinder downstream applications; existing work lacks a systematic survey and an assessment of physical consistency.
Method: Builds a three-stage taxonomy (handcrafted physical priors, then data-driven deep restoration, then hybrid physical-intelligent generation) and reviews more than 30 methods spanning CNNs, GANs, Transformers, and diffusion models; runs cross-dataset quantitative evaluation with 12 metrics, including PSNR, SSIM, LPIPS, and FID; and designs a physical radiometric consistency experiment to test how transmission-map and airlight constraints affect color bias.
Result: Transformer and diffusion models improve SSIM by 12% to 18% and reduce perceptual error by 20% to 35% on average; physics-guided hybrid models are more radiometrically stable; explicit transmission or airlight constraints reduce color bias by up to 27%.
Conclusion: Open challenges include dynamic atmospheric modeling, multimodal fusion, lightweight deployment, data scarcity, and joint degradations; future work should develop trustworthy, controllable, and efficient (TCE) dehazing systems.
Abstract: Remote sensing images (RSIs) are frequently degraded by haze, fog, and thin clouds, which obscure surface reflectance and hinder downstream applications. This study presents the first systematic and unified survey of RSI dehazing, integrating methodological evolution, benchmark assessment, and physical consistency analysis. We categorize existing approaches into a three-stage progression: from handcrafted physical priors, to data-driven deep restoration, and finally to hybrid physical-intelligent generation, and summarize more than 30 representative methods across CNNs, GANs, Transformers, and diffusion models. To provide a reliable empirical reference, we conduct large-scale quantitative experiments on five public datasets using 12 metrics, including PSNR, SSIM, CIEDE, LPIPS, FID, SAM, ERGAS, UIQI, QNR, NIQE, and HIST. Cross-domain comparison reveals that recent Transformer- and diffusion-based models improve SSIM by 12% to 18% and reduce perceptual errors by 20% to 35% on average, while hybrid physics-guided designs achieve higher radiometric stability. A dedicated physical radiometric consistency experiment further demonstrates that models with explicit transmission or airlight constraints reduce color bias by up to 27%. Based on these findings, we summarize open challenges: dynamic atmospheric modeling, multimodal fusion, lightweight deployment, data scarcity, and joint degradations, and outline promising research directions for future development of trustworthy, controllable, and efficient (TCE) dehazing systems.
All reviewed resources, including source code, benchmark datasets, evaluation metrics, and reproduction configurations, are publicly available at https://github.com/VisionVerse/RemoteSensing-Restoration-Survey.
[110] Transparent Fragments Contour Estimation via Visual-Tactile Fusion for Autonomous Reassembly
Qihao Lin,Borui Chen,Yuping Zhou,Jianing Wu,Yulan Guo,Weishi Zheng,Chongkun Xia
Main category: cs.CV
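TL;DR: This paper proposes a transparent-fragment contour estimation framework based on visual-tactile fusion. It builds the TransFrag27K dataset and the TransFragNet network and uses Gelsight Mini sensors to collect tactile information, achieving accurate contour estimation and fragment reassembly.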
Details
Motivation: Contour estimation of transparent fragments is essential in scenarios such as precision optical instrument repair and cultural relic restoration, but strong optical effects and irregular shapes make it hard for existing methods.
Method: Builds the transparent fragment dataset TransFrag27K; proposes TransFragNet, a visual grasping position detection network; uses Gelsight Mini sensors to obtain tactile information from the lateral edges; designs a visual-tactile fusion material classifier; and proposes a contour matching and reassembly algorithm based on multi-dimensional similarity metrics.
Result: The framework performs strongly in real-world validation, and the experiments confirm its effectiveness; the dataset and code are released.
Conclusion: Visual-tactile fusion is an effective route to transparent-fragment contour estimation, and the framework offers a reproducible benchmark and a new direction for related research.
Abstract: The contour estimation of transparent fragments is very important for autonomous reassembly, especially in the fields of precision optical instrument repair, cultural relic restoration, and the identification of breakage accidents involving other precious devices. Unlike general intact transparent objects, transparent fragments make contour estimation more challenging because of their strict optical properties and their irregular shapes and edges. To address this issue, a general transparent fragments contour estimation framework based on visual-tactile fusion is proposed in this paper. First, we construct the transparent fragment dataset named TransFrag27K, which includes multi-scene synthetic data of broken fragments from multiple types of transparent objects, and a scalable synthetic data generation pipeline. Second, we propose a visual grasping position detection network named TransFragNet to identify, locate, and segment the sampling grasping position. We then use a two-finger gripper with Gelsight Mini sensors to obtain reconstructed tactile information of the lateral edge of the fragments. By fusing this tactile information with visual cues, a visual-tactile fusion material classifier is proposed. Inspired by the way humans estimate a fragment's contour by combining vision and touch, we introduce a general transparent fragment contour estimation framework based on visual-tactile fusion, which demonstrates strong performance in real-world validation. Finally, a contour matching and reassembly algorithm based on multi-dimensional similarity metrics is proposed, providing a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly.
The experimental results demonstrate the validity of the proposed framework. The dataset and code are available at https://github.com/Keithllin/Transparent-Fragments-Contour-Estimation.
[111] HSI Image Enhancement Classification Based on Knowledge Distillation: A Study on Forgetting
Songfeng Zhu
Main category: cs.CV
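TL;DR: This paper proposes a teacher-based knowledge retention method and a mask-based partial category knowledge distillation algorithm to address catastrophic forgetting in incremental hyperspectral image classification, improving accuracy without relying on samples from old categories.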
Details
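Motivation: To address the catastrophic forgetting caused by model updates in incremental hyperspectral image classification, and in particular to avoid dependence on samples from old categories.
Method: Proposes a teacher-based knowledge retention method that uses incremental-category samples to reduce forgetting of old-category knowledge, and introduces a mask-based partial category knowledge distillation algorithm that decouples knowledge distillation to filter out misleading information.
Result: In comparative and ablation experiments, the proposed method shows robust performance gains.
Conclusion: The method mitigates catastrophic forgetting in incremental learning without relying on old-category samples, preserving recognition of old categories while improving overall classification accuracy.
Abstract: In incremental classification tasks for hyperspectral images, catastrophic forgetting is an unavoidable challenge. While memory recall methods can mitigate this issue, they heavily rely on samples from old categories. This paper proposes a teacher-based knowledge retention method for incremental image classification. It alleviates model forgetting of old category samples by utilizing incremental category samples, without depending on old category samples. Additionally, this paper introduces a mask-based partial category knowledge distillation algorithm. By decoupling knowledge distillation, this approach filters out potentially misleading information that could misguide the student model, thereby enhancing overall accuracy. Comparative and ablation experiments demonstrate the proposed method's robust performance.
[112] InjectFlow: Weak Guides Strong via Orthogonal Injection for Flow Matching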
Dayu Wang,Jiaye Yang,Weikang Li,Jiahui Liang,Yang Li
Main category: cs.CV
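TL;DR: This paper proposes InjectFlow, which injects orthogonal semantics into the initial velocity field computation to mitigate, without any training, the semantic degradation that dataset bias causes in Flow Matching models, markedly improving generation for out-of-distribution and minority-class samples.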
Details
Motivation: Flow Matching models perform well but are highly sensitive to dataset bias, which causes severe semantic degradation when generating out-of-distribution or minority-class samples.
Method: Formally defines the "Bias Manifold" within the FM framework and identifies conditional expectation smoothing as the root cause of trajectory lock-in during inference; on that basis proposes the training-free InjectFlow method, which injects orthogonal semantics during the initial velocity field computation to prevent latent drift toward majority modes.
Result: On the GenEval dataset, InjectFlow fixes 75% of the prompts that standard FM models fail to generate correctly; experiments show that it improves fairness and robustness while keeping generation quality high.
Conclusion: InjectFlow offers a plug-and-play solution for building fairer and more robust visual foundation models, combining theoretical analysis with a practical algorithm.
Abstract: Flow Matching (FM) has recently emerged as a leading approach for high-fidelity visual generation, offering a robust continuous-time alternative to ordinary differential equation (ODE) based models. However, despite their success, FM models are highly sensitive to dataset biases, which cause severe semantic degradation when generating out-of-distribution or minority-class samples. In this paper, we provide a rigorous mathematical formalization of the "Bias Manifold" within the FM framework. We identify that this performance drop is driven by conditional expectation smoothing, a mechanism that inevitably leads to trajectory lock-in during inference. To resolve this, we introduce InjectFlow, a novel, training-free method that injects orthogonal semantics during the initial velocity field computation, without requiring any changes to the random seeds. This design effectively prevents the latent drift toward majority modes while maintaining high generative quality. Extensive experiments demonstrate the effectiveness of our approach. Notably, on the GenEval dataset, InjectFlow successfully fixes 75% of the prompts that standard flow matching models fail to generate correctly. Ultimately, our theoretical analysis and algorithm provide a ready-to-use solution for building fairer and more robust visual foundation models.
[113] Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges
Hong-Hanh Nguyen-Le,Van-Tuan Tran,Thuc D. Nguyen,Nhien-An Le-Khac
Main category: cs.CV
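TL;DR: DiffMark is a plug-and-play watermarking method for diffusion models that injects a learnable perturbation at every denoising step, enabling multi-bit watermark detection in a single forward pass. It offers per-image key flexibility and cross-model transferability, and uses Latent Consistency Models (LCM) to speed up training and detection.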
Details
Motivation: Existing diffusion-model watermarking methods have two main weaknesses: sampling-based methods need a costly multi-step inversion and support only zero-bit detection, while fine-tuning-based methods support multi-bit extraction but are tied to a specific model and do not generalize. An efficient, flexible, transferable scheme is needed.
Method: DiffMark injects a persistent, learnable perturbation δ at every denoising step of a completely frozen diffusion model; the perturbation accumulates in the final latent z0 and is recovered in a single forward pass. To handle backpropagation through the frozen UNet, it uses Latent Consistency Models (LCM) as a differentiable training bridge, greatly reducing the number of gradient steps.
Result: DiffMark detects in a single forward pass (16.4 ms), 45x faster than sampling-based methods; supports dynamic per-image keys; transfers directly to unseen diffusion architectures without retraining; and stays robust under distortion, regeneration, and adversarial attacks.
Conclusion: DiffMark eases the trade-off among efficiency, flexibility, and generality in diffusion-model watermarking, offering a practical, deployable approach to provenance and accountability for generated content at scale.
Abstract: As diffusion models (DMs) enable photorealistic image generation at unprecedented scale, watermarking techniques have become essential for provenance establishment and accountability. Existing methods face challenges: sampling-based approaches operate on frozen models but require costly $N$-step Denoising Diffusion Implicit Models (DDIM) inversion (typically $N=50$) for zero-bit-only detection; fine-tuning-based methods achieve fast multi-bit extraction but couple the watermark to a specific model checkpoint, requiring retraining for each architecture. We propose DiffMark, a plug-and-play watermarking method that offers three key advantages over existing approaches: single-pass multi-bit detection, per-image key flexibility, and cross-model transferability. Rather than encoding the watermark into the initial noise vector, DiffMark injects a persistent learned perturbation $\delta$ at every denoising step of a completely frozen DM. The watermark signal accumulates in the final denoised latent $z_0$ and is recovered in a single forward pass. The central challenge of backpropagating gradients through a frozen UNet without traversing the full denoising chain is addressed by employing Latent Consistency Models (LCM) as a differentiable training bridge. This reduces the number of gradient steps from 50 DDIM to 4 LCM and enables a single-pass detection at 16.4 ms, a 45x speedup over sampling-based methods.
Moreover, by this design, the encoder learns to map any runtime secret to a unique perturbation at inference time, providing genuine per-image key flexibility and transferability to unseen diffusion-based architectures without per-model fine-tuning. Alongside these advantages, DiffMark maintains competitive watermark robustness against distortion, regeneration, and adversarial attacks.
[114] The Global-Local loop: what is missing in bridging the gap between geospatial data from numerous communities?
Clément Mallet,Ana-Maria Raimond
Main category: cs.CV
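TL;DR: This paper examines the limits of the "master-slave" paradigm in geospatial data fusion. It argues for more symmetric use of multiple data sources and for feedback across scales and communities, forming a "global-local loop", and uses case analyses and a discussion of research directions to encourage more general and deeper data fusion methods.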
Details
Motivation: Geospatial data fusion today mostly follows a "master-slave" paradigm in which one dominant source is supplemented by others. It lacks symmetric, mutually beneficial use of sources and coordination across scales and communities, so the potential of the data is not fully used.
Method: Analyzes typical use cases to map out the key multi-source interaction schemes, then systematically discusses neglected research directions, emphasizing methodology for jointly exploiting heterogeneous geospatial data across scales and communities.
Result: Identifies several undervalued symmetric data fusion configurations, proposes the "global-local loop" as a conceptual framework, and specifies research directions that support it.
Conclusion: Moving beyond the master-slave paradigm toward symmetric multi-source fusion and cross-domain feedback is an important route to more general and more useful geospatial data analysis, and it calls for innovation in methods, tools, and collaboration.
Abstract: We face an unprecedented amount of geospatial data, describing directly or indirectly the Earth's surface at multiple spatial, temporal, and semantic scales, and stemming from numerous contributors, from satellites to citizens. The main challenge in all the geospatial-related communities lies in suitably leveraging a combination of some of the sources for either a generic or a thematic application. Certain data fusion schemes are predominantly exploited: they correspond to popular tasks with mainstream data sources, e.g., free archives of Sentinel images coupled with OpenStreetMap data under an open and widespread deep-learning backbone for land-cover mapping purposes. Most of these approaches unfortunately operate under a "master-slave" paradigm, where one source is basically integrated to help process the "main" source, without mutual advantages (e.g., large-scale estimation of a given biophysical variable using in-situ observations) and under a specific community bias. We argue that numerous key data fusion configurations, and in particular the effort in symmetrizing the exploitation of multiple data sources, are insufficiently addressed while being highly beneficial for generic or thematic applications. Bridges and retroactions between scales, communities and their respective sources are lacking, neglecting the utmost potential of such a "global-local loop". In this paper, we propose to establish the most relevant interaction schemes through illustrative use cases.
We subsequently discuss under-explored research directions that could take advantage of leveraging available data across multiple extents and communities.
[115] EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control
Yuzhe Weng,Haotian Wang,Yuanhong Yu,Jun Du,Shan He,Xiaoyan Wu,Haoran Xu
Main category: cs.CV
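TL;DR: This paper proposes EARTalking, an end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation, which achieves frame-level streaming generation and fine-grained control through the Sink Frame Window Attention (SFA) and Frame Condition In-Context (FCIC) mechanisms.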
Details
Motivation: 现有AR方法依赖中间面部表征,限制表达力与真实感;扩散方法逐片段生成,缺乏细粒度控制且存在固有延迟。 Method: 提出EARTalking模型,引入Sink Frame Window Attention(SFA)支持变长视频与身份一致性,以及Frame Condition In-Context(FCIC)方案实现流式、上下文内注入多样化控制信号。 Result: EARTalking在性能上超越现有自回归方法,媲美扩散方法,验证了上下文内流式自回归控制的可行性。 Conclusion: EARTalking为灵活、高效、可扩展的音频驱动说话人头生成提供了新方向,支持交互式、任意时刻的细粒度控制。 Abstract: Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.[116] GraphiContact: Pose-aware Human-Scene Robust Contact Perception for Interactive Systems
Xiaojian Lin,Yaomin Shen,Junyuan Ma,Yujie Sun,Chengqing Bu,Wenxin Zhang,Zongzheng Zhang,Hao Fei,Lei Jin,Hao Zhao
Main category: cs.CV
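TL;DR: This paper proposes GraphiContact, a framework that jointly addresses single-image 3D human mesh reconstruction and vertex-level human-scene contact prediction. It uses pretrained Transformer encoders to introduce pose-aware priors and the SIMU uncertainty training strategy to improve robustness under occlusion and noise, improving both contact prediction and human reconstruction on multiple benchmarks.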
Details
Motivation: 现有方法要么忽视显式3D人体先验而仅做接触预测,要么专注于人体重建却未直接优化在遮挡与感知噪声下的鲁棒顶点级接触推断,二者存在割裂。 Method: 提出GraphiContact:基于重建的人体网格,融合两个预训练Transformer编码器的姿态与几何先验;引入Single-Image Multi-Infer Uncertainty(SIMU)训练策略,采用token级自适应路由模拟训练时的遮挡与噪声,保持测试时单分支高效推理。 Result: 在五个基准数据集上,GraphiContact在顶点级人-场景接触预测和单图像3D人体网格重建两方面均取得一致性能提升。 Conclusion: 联合建模3D人体重建与接触推理、并引入多源先验与不确定性训练,可有效提升单目场景下人-场景交互理解的准确性与鲁棒性。 Abstract: Monocular vertex-level human-scene contact prediction is a fundamental capability for interactive systems such as assistive monitoring, embodied AI, and rehabilitation analysis. In this work, we study this task jointly with single-image 3D human mesh reconstruction, using reconstructed body geometry as a scaffold for contact reasoning. Existing approaches either focus on contact prediction without sufficiently exploiting explicit 3D human priors, or emphasize pose/mesh reconstruction without directly optimizing robust vertex-level contact inference under occlusion and perceptual noise. To address this gap, we propose GraphiContact, a pose-aware framework that transfers complementary human priors from two pretrained Transformer encoders and predicts per-vertex human-scene contact on the reconstructed mesh. To improve robustness in real-world scenarios, we further introduce a Single-Image Multi-Infer Uncertainty (SIMU) training strategy with token-level adaptive routing, which simulates occlusion and noisy observations during training while preserving efficient single-branch inference at test time. Experiments on five benchmark datasets show that GraphiContact achieves consistent gains on both contact prediction and 3D human reconstruction. Our code, based on the GraphiContact method, provides comprehensive 3D human reconstruction and interaction analysis, and will be publicly available at https://github.com/Aveiro-Lin/GraphiContact.[117] VGS-Decoding: Visual Grounding Score Guided Decoding for Hallucination Mitigation in Medical VLMs
Govinda Kolli,Adinath Madhavrao Dukre,Behzad Bozorgtabar,Dwarikanath Mahapatra,Imran Razzak
Main category: cs.CV
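TL;DR: This paper proposes VGS-Decoding, a training-free inference-time method that defines a Visual Grounding Score (VGS) to identify and suppress hallucinations in medical vision-language models, yielding significant gains on multiple medical VQA datasets with only limited additional inference overhead.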
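The reweighting idea described in this entry's abstract (tokens whose probability drops when the image is degraded are treated as visually grounded, tokens whose probability holds or rises are treated as likely hallucinations) can be illustrated with a minimal NumPy sketch. The exact VGS formula is not given in this digest, so the log-ratio score, the `tanh` squashing, the `alpha` strength, and the function name `reweight_with_grounding` are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def reweight_with_grounding(p_orig, p_dist, alpha=1.0, eps=1e-12):
    """Reweight next-token probabilities using a per-token grounding score.

    p_orig: token probabilities given the original image.
    p_dist: token probabilities given a distorted image.
    The score and the reweighting rule are hypothetical stand-ins.
    """
    p_orig = np.asarray(p_orig, dtype=float)
    p_dist = np.asarray(p_dist, dtype=float)
    # Positive score: probability fell under distortion (visually grounded).
    # Negative score: probability rose under distortion (likely hallucinated).
    score = np.log(p_orig + eps) - np.log(p_dist + eps)
    # Bounded per-token adjustment in log space (assumed form).
    logits = np.log(p_orig + eps) + alpha * np.tanh(score)
    logits -= logits.max()  # numerical stability before exponentiation
    probs = np.exp(logits)
    return probs / probs.sum()

# Token 0 loses probability under distortion, token 2 gains it.
p_new = reweight_with_grounding([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
print(p_new)
```

In this toy case the grounded token gains probability mass and the token that benefited from the distorted image loses it; a real decoder would apply such a step to the full vocabulary at every generation step, which is where the roughly doubled inference cost reported in the abstract would come from (two forward passes per step).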
Details
Motivation: Medical vision-language models (VLMs) often hallucinate because they rely on language priors rather than visual evidence, which poses risks in clinical applications.
Method: The paper proposes Visual Grounding Score Guided Decoding (VGS-Decoding). The Visual Grounding Score (VGS) measures how much each token depends on visual information by comparing the probability distributions produced from the original and a distorted image; during decoding, token probabilities are dynamically reweighted to amplify visually grounded tokens and suppress hallucinated ones.
Result: On MIMIC-Diff-VQA and VQA-RAD, the method yields consistent improvements for LLaVA-Med, CheXagent, and MedGemma, with up to +9.12% in overall accuracy and +8.98% in open-ended recall, at only 2× inference overhead and with no additional training.
Conclusion: VGS-Decoding is an efficient, practical, training-free hallucination mitigation method suited to clinical deployment.
Abstract: Medical Vision-Language Models (VLMs) often hallucinate by generating responses based on language priors rather than visual evidence, posing risks in clinical applications. We propose Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method to mitigate hallucinations during inference. Our key insight is that hallucinated tokens maintain or increase their probability when visual information is degraded, while visually grounded tokens decrease in probability. We introduce the Visual Grounding Score (VGS), which measures each token's visual dependency by comparing distributions from original and distorted images. During decoding, we reweight probabilities by amplifying visually grounded tokens while suppressing hallucinations. Unlike fixed-weight contrastive methods, VGS-Decoding provides per-token adaptive control. Experiments on MIMIC-Diff-VQA and VQA-RAD across LLaVA-Med, CheXagent, and MedGemma demonstrate consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and no additional training, making it practical for clinical deployment. Upon acceptance, code will be released publicly to facilitate reproducibility.
[118] Which Workloads Belong in Orbit? A Workload-First Framework for Orbital Data Centers Using Semantic Abstraction
Durgendra Narayan Singh
Main category: cs.CV
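TL;DR: This paper proposes a workload-centric framework for deciding which AI tasks are better deployed in orbit than in terrestrial cloud, pairs it with phased adoption tied to orbital data center maturity, and validates it with in-orbit semantic-reduction prototypes.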
Details
Motivation: 随着发射成本下降和数据密集型AI工作负载增长,空间计算变得可行,但需明确哪些任务真正适合在轨处理。 Method: 提出 workload-centric 框架与分阶段采用模型,并基于在轨语义压缩原型(如地球观测影像处理与多视角立体重建)进行实证验证。 Result: 地球观测流水线实现99.7–99.99%有效载荷压缩;立体重建将306 MB原始数据压缩至1.57 MB(压缩率99.49%)。 Conclusion: 早期在轨工作负载适配性主要由语义抽象能力而非原始算力规模决定,支持‘工作负载优先’的设计范式。 Abstract: Space-based compute is becoming plausible as launch costs fall and data-intensive AI workloads grow. This paper proposes a workload-centric framework for deciding which tasks belong in orbit versus terrestrial cloud, along with a phased adoption model tied to orbital data center maturity. We ground the framework with in-orbit semantic-reduction prototypes. An Earth-observation pipeline on Sentinel-2 imagery from Seattle and Bengaluru (formerly Bangalore) achieves 99.7-99.99% payload reduction by converting raw imagery to compact semantic artifacts. A multi-pass stereo reconstruction prototype reduces ~306 MB to ~1.57 MB of derived 3D representations (99.49% reduction). These results support a workload-first view in which semantic abstraction, not raw compute scale, drives early workload suitability.[119] NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation
Quang Dang Huynh,Xuefei Yin,Andrew Busch,Hugo G. Espinosa,Alan Wee-Chung Liew,Matthew T. O. Worsey,Yanming Zhu
Main category: cs.CV
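TL;DR: This paper proposes a node-centric framework for video-based human pose estimation that explicitly fuses visual, temporal, and structural information to improve cross-frame consistency and joint topology modeling.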
Details
Motivation: 现有方法依赖热图或隐式时空特征聚合,难以充分表达关节拓扑结构且跨帧一致性弱,尤其在运动模糊、遮挡和复杂动态场景下性能受限。 Method: 提出节点中心框架:1)基于视觉-时序速度的关节嵌入;2)注意力驱动的姿态查询编码器生成图像条件下的关节感知节点嵌入;3)双分支解耦时空注意力图(局部时间传播+全局空间约束);4)节点空间专家融合模块自适应融合双分支输出。 Result: 在三个主流视频姿态基准上显著超越现有最优方法。 Conclusion: 显式的节点中心推理能有效提升视频姿态估计精度与鲁棒性,为该领域提供了新思路。 Abstract: Video-based human pose estimation remains challenged by motion blur, occlusion, and complex spatiotemporal dynamics. Existing methods often rely on heatmaps or implicit spatio-temporal feature aggregation, which limits joint topology expressiveness and weakens cross-frame consistency. To address these problems, we propose a novel node-centric framework that explicitly integrates visual, temporal, and structural reasoning for accurate pose estimation. First, we design a visuo-temporal velocity-based joint embedding that fuses sub-pixel joint cues and inter-frame motion to build appearance- and motion-aware representations. Then, we introduce an attention-driven pose-query encoder, which applies attention over joint-wise heatmaps and frame-wise features to map the joint representations into a pose-aware node space, generating image-conditioned joint-aware node embeddings. Building upon these node embeddings, we propose a dual-branch decoupled spatio-temporal attention graph that models temporal propagation and spatial constraint reasoning in specialized local and global branches. Finally, a node-space expert fusion module is proposed to adaptively fuse the complementary outputs from both branches, integrating local and global cues for final joint predictions. Extensive experiments on three widely used video pose benchmarks demonstrate that our method outperforms state-of-the-art methods. The results highlight the value of explicit node-centric reasoning, offering a new perspective for advancing video-based human pose estimation.[120] DCG-Net: Dual Cross-Attention with Concept-Value Graph Reasoning for Interpretable Medical Diagnosis
Getamesay Dagnaw,Xuefei Yin,Muhammad Hassan Maqsood,Yanming Zhu,Alan Wee-Chung Liew
Main category: cs.CV
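TL;DR: This paper proposes DCG-Net, an end-to-end interpretable framework for medical image analysis that models contextual dependencies among clinical concepts through a dual cross-attention mechanism and a parametric concept graph, combining strong performance with clinical interpretability.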
Details
Motivation: Existing Concept Bottleneck Models (CBMs) ignore contextual dependencies among clinical concepts, which limits interpretability; deep learning models for medical image analysis also commonly suffer from the "black box" problem.
Method: DCG-Net comprises (1) a Dual Cross-Attention module that replaces cosine similarity matching with bidirectional attention, enabling spatially localized attribution between visual tokens and textual concept prototypes; and (2) a Parametric Concept Graph, initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing, that explicitly models relations among concepts.
Result: DCG-Net reaches state-of-the-art classification performance on white blood cell morphology classification and skin lesion diagnosis, and produces interpretable diagnostic evidence consistent with clinical knowledge.
Conclusion: DCG-Net effectively combines multimodal alignment with structured concept reasoning, maintaining high accuracy while significantly improving the clinical interpretability and trustworthiness of the model's decision process.
Abstract: Deep learning models have achieved strong performance in medical image analysis, but their internal decision processes remain difficult to interpret. Concept Bottleneck Models (CBMs) partially address this limitation by structuring predictions through human-interpretable clinical concepts. However, existing CBMs typically overlook the contextual dependencies among concepts. To address these issues, we propose DCG-Net, an end-to-end interpretable framework that integrates multimodal alignment with structured concept reasoning. DCG-Net introduces a Dual Cross-Attention module that replaces cosine similarity matching with bidirectional attention between visual tokens and canonicalized textual concept-value prototypes, enabling spatially localized evidence attribution. To capture the relational structure inherent to clinical concepts, we develop a Parametric Concept Graph initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing. This formulation models inter-concept dependencies in a manner consistent with clinical domain knowledge. Experiments on white blood cell morphology and skin lesion diagnosis demonstrate that DCG-Net achieves state-of-the-art classification performance while producing clinically interpretable diagnostic explanations.
[121] Prompt-Free Lightweight SAM Adaptation for Histopathology Nuclei Segmentation with Strong Cross-Dataset Generalization
Muhammad Hassan Maqsood,Yanming Zhu,Alfred Lam,Getamesay Dagnaw,Xuefei Yin,Alan Wee-Chung Liew
Main category: cs.CV
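TL;DR: This paper proposes a prompt-free, lightweight SAM adaptation for histopathology nuclei segmentation. It combines multi-level encoder features with residual decoding for accuracy and efficiency, fine-tunes only 4.1M parameters, and achieves state-of-the-art performance and strong cross-dataset generalization on multiple benchmark datasets.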
Details
Motivation: 现有核分割方法计算开销大、跨数据集泛化能力弱;基于SAM的方法虽有潜力,但依赖提示或复杂解码器,不适用于密集且外观异质的病理图像。 Method: 提出一种无需提示、轻量级的SAM适配框架,利用多级编码器特征与残差解码结构,并仅对冻结SAM编码器中的LoRA模块进行微调。 Result: 在TNBC、MoNuSeg和PanNuke三个基准数据集上取得SOTA性能,展现出优异的跨数据集泛化能力,且仅需4.1M可训练参数。 Conclusion: 该方法兼顾准确性、效率与泛化性,显著提升了SAM在组织病理学核分割任务中的实用性。 Abstract: Histopathology nuclei segmentation is crucial for quantitative tissue analysis and cancer diagnosis. Although existing segmentation methods have achieved strong performance, they are often computationally heavy and show limited generalization across datasets, which constrains their practical deployment. Recent SAM-based approaches have shown great potential in general and medical imaging, but typically rely on prompt guidance or complex decoders, making them less suitable for histopathology images with dense nuclei and heterogeneous appearances. We propose a prompt-free and lightweight SAM adaptation that leverages multi-level encoder features and residual decoding for accurate and efficient nuclei segmentation. The framework fine-tunes only LoRA modules within the frozen SAM encoder, requiring just 4.1M trainable parameters. Experiments on three benchmark datasets TNBC, MoNuSeg, and PanNuke demonstrate state-of-the-art performance and strong cross-dataset generalization, highlighting the effectiveness and practicality of the proposed framework for histopathology applications.[122] High-fidelity Multi-view Normal Integration with Scale-encoded Neural Surface Representation
Tongyu Yang,Heng Guo,Yasuyuki Matsushita,Fumio Okura,Yu Luo,Xin Fan
Main category: cs.CV
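TL;DR: This paper proposes a scale-encoded neural surface representation that incorporates the pixel coverage area (which depends on camera intrinsics and distance) into the neural representation, addressing the blurring of high-frequency details caused by inconsistent multi-view normals, and introduces a scale-aware mesh extraction module for high-fidelity surface reconstruction.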
Details
Motivation: 传统多视角法向量融合方法未考虑像素空间覆盖面积随相机参数和距离变化的问题,导致不同距离下对应像素法向量不一致,进而使重建表面高频细节模糊。 Method: 提出尺度编码的神经表面表示:为每个3D点关联空间尺度,并基于混合网格编码计算其法向量;设计mesh提取模块,为每个顶点根据训练观测分配最优局部尺度。 Result: 在不同距离获取的法向量输入下,实现了更高质量、高保真的表面重建,性能优于现有多视角法向量融合方法。 Conclusion: 引入像素尺度建模可有效缓解多视角法向量不一致性,提升复杂场景下的表面重建精度与细节保留能力。 Abstract: Previous multi-view normal integration methods typically sample a single ray per pixel, without considering the spatial area covered by each pixel, which varies with camera intrinsics and the camera-to-object distance. Consequently, when the target object is captured at different distances, the normals at corresponding pixels may differ across views. This multi-view surface normal inconsistency results in the blurring of high-frequency details in the reconstructed surface. To address this issue, we propose a scale-encoded neural surface representation that incorporates the pixel coverage area into the neural representation. By associating each 3D point with a spatial scale and calculating its normal from a hybrid grid-based encoding, our method effectively represents multi-scale surface normals captured at varying distances. Furthermore, to enable scale-aware surface reconstruction, we introduce a mesh extraction module that assigns an optimal local scale to each vertex based on the training observations. Experimental results demonstrate that our approach consistently yields high-fidelity surface reconstruction from normals observed at varying distances, outperforming existing multi-view normal integration methods.[123] Toward a Multi-View Brain Network Foundation Model: Cross-View Consistency Learning Across Arbitrary Atlases
Jiaxing Xu,Jingying Ma,Xin Lin,Yuxiao Liu,Kai He,Qika Lin,Yiping Ke,Yang Li,Dinggang Shen,Mengling Feng
Main category: cs.CV
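TL;DR: This paper proposes MV-BrainFM, a multi-view brain network foundation model that uses anatomical distance priors and cross-view consistency learning to learn general, scalable representations of brain networks built from arbitrary atlases, and significantly outperforms existing methods on 20K+ subjects from 17 fMRI datasets.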
Details
Motivation: 现有脑网络自监督模型受限于图谱依赖性、多视图信息利用不足及解剖先验整合薄弱。 Method: MV-BrainFM基于Transformer建模,显式融入解剖距离信息以引导区域间交互,并设计无监督跨视图一致性学习策略,在共享隐空间中对齐同一被试不同图谱生成的网络表征;采用统一多视图预训练范式,支持多数据集与多图谱联合训练。 Result: 在超过20,000名被试、17个fMRI数据集上实验表明,MV-BrainFM在单图谱与多图谱设置下均持续优于14种现有脑网络基础模型及任务特定基线模型。 Conclusion: MV-BrainFM有效融合多视图互补信息并保持图谱感知能力,具备强泛化性、可扩展性与计算高效性,为脑网络分析提供了更鲁棒的基础模型框架。 Abstract: Brain network analysis provides an interpretable framework for characterizing brain organization and has been widely used for neurological disorder identification. Recent advances in self-supervised learning have motivated the development of brain network foundation models. However, existing approaches are often limited by atlas dependency, insufficient exploitation of multiple network views, and weak incorporation of anatomical priors. In this work, we propose MV-BrainFM, a multi-view brain network foundation model designed to learn generalizable and scalable representations from brain networks constructed with arbitrary atlases. MV-BrainFM explicitly incorporates anatomical distance information into Transformer-based modeling to guide inter-regional interactions, and introduces an unsupervised cross-view consistency learning strategy to align representations from multiple atlases of the same subject in a shared latent space. By jointly enforcing within-view robustness and cross-view alignment during pretraining, the model effectively captures complementary information across heterogeneous network views while remaining atlas-aware. In addition, MV-BrainFM adopts a unified multi-view pretraining paradigm that enables simultaneous learning from multiple datasets and atlases, significantly improving computational efficiency compared to conventional sequential training strategies. The proposed framework also demonstrates strong scalability, consistently benefiting from increasing data diversity while maintaining stable performance across unseen atlas configurations. 
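As an illustration of the cross-view alignment described above, the following NumPy sketch scores how well two atlas views of the same subjects agree in a shared latent space. The abstract does not state the loss that MV-BrainFM uses, so the InfoNCE-style contrastive form, the `temperature` value, and the function name `cross_view_consistency_loss` are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def cross_view_consistency_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style alignment loss between two views (assumed form).

    z_a, z_b: arrays of shape (n_subjects, dim); row i of each array is the
    embedding of subject i's brain network under a different atlas.
    """
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # Row-wise log-softmax, shifted for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching subjects sit on the diagonal; reward high probability there.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = cross_view_consistency_loss(z, z + 0.01 * rng.normal(size=z.shape))
mismatched = cross_view_consistency_loss(z, z[::-1])
print(aligned, mismatched)
```

The loss is small when each subject's two views land close together and large when views are paired with the wrong subject, which is the behavior a cross-view consistency objective is meant to encourage.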
Extensive experiments on more than 20K subjects from 17 fMRI datasets show that MV-BrainFM consistently outperforms 14 existing brain network foundation models and task-specific baselines under both single-atlas and multi-atlas settings.[124] Scene Representation using 360° Saliency Graph and its Application in Vision-based Indoor Navigation
Preeti Meena,Himanshu Kumar,Sandeep Yadav
Main category: cs.CV
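TL;DR: This paper proposes a novel 360° saliency graph representation that efficiently encodes a scene's visual, contextual, semantic, and geometric information to improve vision-based indoor navigation and scene localization.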
Details
Motivation: 传统场景表示(如RGB-D、LiDAR等)隐式编码信息,难以高效支持场景索引和视觉导航;且现有方法在室内环境下易受光照变化、遮挡和阴影影响。 Method: 构建360°显著性图,将场景的视觉、上下文、语义和几何信息显式编码为节点、边、边权重和角位置;利用该图实现查询场景在拓扑地图中的定位,并基于嵌入的几何信息估计朝向目标的2D移动方向。 Result: 实验表明,所提360°显著性图在场景定位和基于视觉的室内导航任务中均优于现有方法。 Conclusion: 360°显著性图是一种鲁棒、丰富且高效的场景表示方法,能有效克服传统方法在室内导航中的局限性。 Abstract: A Scene, represented visually using different formats such as RGB-D, LiDAR scan, keypoints, rectangular, spherical, multi-views, etc., contains information implicitly embedded relevant to applications such as scene indexing, vision-based navigation. Thus, these representations may not be efficient for such applications. This paper proposes a novel 360° saliency graph representation of the scenes. This rich representation explicitly encodes the relevant visual, contextual, semantic, and geometric information of the scene as nodes, edges, edge weights, and angular position in the 360° graph. Also, this representation is robust against scene view change and addresses challenges of indoor environments such as varied illumination, occlusions, and shadows as in the case of existing traditional methods. We have utilized this rich and efficient representation for vision-based navigation and compared it with existing navigation methods using 360° scenes. However, these existing methods suffer from limitations of poor scene representation, lacking scene-specific information. This work utilizes the proposed representation first to localize the query scene in the given topological map, and then facilitate 2D navigation by estimating the next required movement directions towards the target destination in the topological map by using the embedded geometric information in the 360° saliency graph. Experimental results demonstrate the efficacy of the proposed 360° saliency graph representation in enhancing both scene localization and vision-based indoor navigation.[125] Uni-Classifier: Leveraging Video Diffusion Priors for Universal Guidance Classifier
Yujie Zhou,Pengyang Ling,Jiazi Bu,Bingjie Gao,Li Niu
Main category: cs.CV
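TL;DR: This paper proposes Uni-Classifier (Uni-C), a plug-and-play module based on video diffusion priors that corrects the distribution mismatch between upstream generative model outputs and downstream model inputs, improving generation quality both in chained multi-model tasks (e.g., 2D → video/3D) and for single models.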
Details
Motivation: 实际AI工作流中,多个生成模型级联时因上下游分布不匹配导致生成质量下降。 Method: 提出Uni-Classifier(Uni-C)模块,利用视频扩散先验引导前序模型的去噪过程,实现输出分布对齐;支持插件式部署,亦可独立增强单模型输出。 Result: 在视频和3D生成任务的大量实验表明,Uni-C在工作流级联和独立使用两种场景下均能持续提升生成质量。 Conclusion: Uni-C是一种简单有效、通用性强且泛化能力突出的分布对齐模块,适用于复杂生成工作流优化与单模型性能增强。 Abstract: In practical AI workflows, complex tasks often involve chaining multiple generative models, such as using a video or 3D generation model after a 2D image generator. However, distributional mismatches between the output of upstream models and the expected input of downstream models frequently degrade overall generation quality. To address this issue, we propose Uni-Classifier (Uni-C), a simple yet effective plug-and-play module that leverages video diffusion priors to guide the denoising process of preceding models, thereby aligning their outputs with downstream requirements. Uni-C can also be applied independently to enhance the output quality of individual generative models. Extensive experiments across video and 3D generation tasks demonstrate that Uni-C consistently improves generation quality in both workflow-based and standalone settings, highlighting its versatility and strong generalization capability.[126] Multi-Stage Fine-Tuning of Pathology Foundation Models with Head-Diverse Ensembling for White Blood Cell Classification
Antony Gitau,Martin Paulson,Bjørn-Jostein Singstad,Karl Thomas Hjelmervik,Ola Marius Lysaker,Veralia Gabriela Sanchez
Main category: cs.CV
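TL;DR: This paper proposes a multi-stage fine-tuning method that pairs the DINOBloom-base model with several classifier heads (linear, cosine, MLP) for the 13-class white blood cell classification task of the WBCBench 2026 Challenge. The heads specialize on classes at different maturation stages, a head-diverse ensemble improves performance, and the analysis also identifies mislabeled or morphologically ambiguous samples.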
Details
Motivation: 白细胞自动分类面临类别不平衡、域偏移及形态连续性混淆(如相邻成熟阶段特征细微重叠)等挑战,影响 leukemia 诊断准确性。 Method: 采用多阶段微调策略,以DINOBloom-base为骨干网络,训练线性、余弦和MLP三类分类头;基于各头在特定类别的F1表现差异,构建头多样性集成:MLP为主预测器,仅在四个预定义混淆对中当另两个头达成一致时才替换其预测。 Result: 余弦头在Band中性粒细胞(BNE)上F1达0.470,线性头在晚幼粒细胞(MMY)上达0.585,MLP头在早幼粒细胞(PMY)上达0.733;集成方法提升整体鲁棒性;一致误分类样本显著富集标注错误或形态模糊案例。 Conclusion: 分类头存在类特异性优势,头多样性集成可有效缓解形态连续性混淆问题;模型一致性分析可用于发现数据质量问题,提升临床可信度。 Abstract: The classification of white blood cells (WBCs) from peripheral blood smears is critical for the diagnosis of leukemia. However, automated approaches still struggle due to challenges including class imbalance, domain shift, and morphological continuum confusion, where adjacent maturation stages exhibit subtle, overlapping features. We present a multi-stage fine-tuning methodology for 13-class WBC classification in the WBCBench 2026 Challenge (ISBI 2026). Our best-performing model is a fine-tuned DINOBloom-base, on which we train multiple classifier head families (linear, cosine, and multilayer perceptron (MLP)). The cosine head performed best on the mature granulocyte boundary (Band neutrophil (BNE) F1 = 0.470), the linear head on more immature granulocyte classes (Metamyelocyte (MMY) F1 = 0.585), and the MLP head on the most immature granulocyte (Promyelocyte (PMY) F1 = 0.733), revealing class-specific specialization. Based on this specialization, we construct a head-diverse ensemble, where the MLP head acts as the primary predictor, and its predictions within the four predefined confusion pairs are replaced only when two other head families agree. We further show that cases consistently misclassified by all models are substantially enriched for probable labeling errors or inherent morphological ambiguity.[127] Jigsaw Regularization in Whole-Slide Image Classification
So Won Jeong,Veronika Ročková
Main category: cs.CV
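TL;DR: This paper proposes a new method that combines vision foundation-model embeddings with graph neural networks, introducing local spatial structure and an across-patch jigsaw regularization, and markedly improves whole-slide image classification.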
Details
Motivation: Most existing multiple instance learning methods treat the patches of a tissue image as exchangeable, ignoring their inherent spatial and topological structure, which limits pathology image analysis performance.
Method: A vision foundation model extracts embeddings that capture local spatial structure within each patch; graph neural networks combined with a novel jigsaw regularization provide across-patch spatial awareness.
Result: On benchmark datasets for breast, head-and-neck, and colon cancer, the method markedly outperforms state-of-the-art attention-based multiple instance learning methods.
Conclusion: Combining local embeddings with global graph structure modeling effectively improves whole-slide image classification accuracy in computational pathology, confirming the importance of modeling spatial structure.
Abstract: Computational pathology involves the digitization of stained tissues into whole-slide images (WSIs) that contain billions of pixels arranged as contiguous patches. Statistical analysis of WSIs largely focuses on classification via multiple instance learning (MIL), in which slide-level labels are inferred from unlabeled patches. Most MIL methods treat patches as exchangeable, overlooking the rich spatial and topological structure that underlies tissue images. This work builds on recent graph-based methods that aim to incorporate spatial awareness into MIL. Our approach is new in two regards: (1) we deploy vision foundation-model embeddings to incorporate local spatial structure within each patch, and (2) we achieve across-patch spatial awareness using graph neural networks together with a novel jigsaw regularization. We find that a combination of these two features markedly improves classification over state-of-the-art attention-based MIL approaches on benchmark datasets in breast, head-and-neck, and colon cancer.
[128] Monocular Models are Strong Learners for Multi-View Human Mesh Recovery
Haoyu Xie,Shengkai Xu,Cheng Guo,Muhammad Usama Saleem,Wenhan Wu,Chen Chen,Ahmed Helmy,Pu Wang,Hongfei Xue
Main category: cs.CV
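TL;DR: This paper proposes a training-free, calibration-free framework for multi-view human mesh recovery that uses pretrained single-view models as priors and test-time optimization to achieve high accuracy and strong generalization.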
Details
Motivation: 现有几何方法依赖繁琐的相机标定,学习方法因缺乏多视角训练数据而难以泛化到未见相机配置,限制了实际应用性能。 Method: 提出一种训练自由的框架:首先基于预训练单视角HMR模型生成鲁棒一致的多视角初始化,再通过测试时优化(结合多视角一致性与解剖学约束)进行精细化重建。 Result: 在标准基准上达到SOTA性能,超越了使用显式多视角监督训练的多视角模型。 Conclusion: 该方法实现了无需相机标定和多视角训练数据的高精度、强泛化多视角人体网格恢复,为真实场景部署提供了新范式。 Abstract: Multi-view human mesh recovery (HMR) is broadly deployed in diverse domains where high accuracy and strong generalization are essential. Existing approaches can be broadly grouped into geometry-based and learning-based methods. However, geometry-based methods (e.g., triangulation) rely on cumbersome camera calibration, while learning-based approaches often generalize poorly to unseen camera configurations due to the lack of multi-view training data, limiting their performance in real-world scenarios. To enable calibration-free reconstruction that generalizes to arbitrary camera setups, we propose a training-free framework that leverages pretrained single-view HMR models as strong priors, eliminating the need for multi-view training data. Our method first constructs a robust and consistent multi-view initialization from single-view predictions, and then refines it via test-time optimization guided by multi-view consistency and anatomical constraints. Extensive experiments demonstrate state-of-the-art performance on standard benchmarks, surpassing multi-view models trained with explicit multi-view supervision.[129] FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
Maxime Fontana,Michael Spratling,Miaojing Shi
Main category: cs.CV
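TL;DR: This paper proposes FAAR, which uses Performance-Driven Rank Shrinking (PDRS) and a Task-Spectral Pyramidal Decoder (TS-PD) to make adapter fine-tuning for multi-task learning parameter-efficient, frequency-aware, and automatically tuned.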
Details
Motivation: Traditional full fine-tuning is too costly for multi-task learning; existing parameter-efficient fine-tuning (PEFT) methods use a fixed low rank that cannot adapt to different tasks and network positions, and they neglect modeling spatial relationships between tasks.
Method: FAAR has two core components: (1) Performance-Driven Rank Shrinking (PDRS), which dynamically allocates the optimal rank for each adapter location and each task; and (2) the Task-Spectral Pyramidal Decoder (TS-PD), which uses image frequency spectrum analysis to inject input-dependent context into spatial bias learning and capture cross-task relationships.
Result: On dense visual multi-task benchmarks, FAAR significantly outperforms other PEFT methods in both accuracy and efficiency; compared with traditional MTL fine-tuning, it reduces the number of parameters by up to 9 times while improving overall performance.
Conclusion: Through frequency awareness and automatic rank allocation, FAAR resolves the tension between parameter efficiency and cross-task modeling in multi-task learning, offering a new paradigm for efficient, adaptive multi-task fine-tuning.
Abstract: Adapting models pre-trained on large-scale datasets is a proven way to reach strong performance quickly for downstream tasks. However, the growth of state-of-the-art models makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for different tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that captures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrinking (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 9 times compared to traditional MTL fine-tuning whilst improving overall performance. Our code is available.
[130] PEARL: Personalized Streaming Video Understanding Model
Yuanhong Zheng,Ruichuan An,Xiaopeng Lin,Yuxing Liu,Sihan Yang,Huanyu Zhang,Haodong Li,Qintong Zhang,Renrui Zhang,Guopeng Li,Yifan Zhang,Yuheng Li,Wentao Zhang
Main category: cs.CV
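TL;DR: This paper introduces the new task of Personalized Streaming Video Understanding (PSVU), builds PEARL-Bench as its first dedicated benchmark, and proposes PEARL, a training-free, plug-and-play baseline whose effectiveness is validated on multiple models.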
Details
Motivation: 现有图文多模态个性化方法局限于静态图像或离线视频,无法应对人类认知新概念所具有的连续流式特性,难以支持未来AI助手所需的实时交互式个性化响应。 Method: 正式定义PSVU任务;构建包含132个视频、2173条细粒度带时间戳标注的基准PEARL-Bench(含帧级与视频级两种评测模式);提出无需训练、可即插即用的PEARL策略作为强基线。 Result: PEARL在8个离线/在线模型上均取得SOTA性能,且在3种不同架构上均带来一致提升,验证了其有效性与鲁棒性;PEARL-Bench确保概念多样性与标注质量(自动化生成+人工校验)。 Conclusion: 该工作填补了流式视频个性化理解的研究空白,推动视觉语言模型(VLM)向实时、交互式个性化AI助手发展。 Abstract: Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. 
Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.[131] Benchmarking Efficient & Effective Camera Pose Estimation Strategies for Novel View Synthesis
Jhacson Meza,Martin R. Oswald,Torsten Sattler
Main category: cs.CV
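TL;DR: This paper presents a Structure-from-Motion (SfM) benchmark aimed at novel view synthesis (NVS) to encourage SfM methods that are both efficient and accurate; reducing the number of features and combining neural-network initial estimates with classical SfM refinement yields efficient, high-accuracy camera pose estimation.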
Details
Motivation: 现有基于深度学习的SfM方法虽快但精度低,而传统SfM虽准但慢;亟需兼顾效率与效果的新方法,因此构建专用基准以促进该方向研究。 Method: 构建面向NVS的SfM基准,采用两种策略提升效率:(1)减少输入特征点数量;(2)用前馈神经网络提供初值,再用经典SfM(如bundle adjustment)进行精调。 Result: 实验表明:(1)减少特征点可显著加速传统SfM且保持高姿态精度;(2)神经网络初值+经典优化的混合策略在效率与精度间取得最佳平衡。 Conclusion: 兼顾效率与精度的SfM需融合学习与优化;本文基准为该方向提供了标准化评估平台,并开源代码与数据。 Abstract: Novel view synthesis (NVS) approaches such as NeRFs or 3DGS can produce photo-realistic 3D scene representation from a set of images with known extrinsic and intrinsic parameters. The necessary camera poses and calibrations are typically obtained from the images via Structure-from-Motion (SfM). Classical SfM approaches rely on local feature matches between the images to estimate both the poses and a sparse 3D model of the scene, using bundle adjustment to refine initial pose, intrinsics, and geometry estimates. In order to increase run-time efficiency, recent SfM systems forgo optimization via bundle adjustment. Instead, they train feed-forward (transformer-based) neural networks to directly regress camera parameters and the 3D structure. While orders of magnitude more efficient, such recent works produce significantly less accurate estimates. To stimulate research on developing SfM approaches that are both efficient \emph{and} effective, this paper develops a benchmark focused on SfM for novel view synthesis. Using existing datasets and two simple strategies for making the reconstruction process more efficient, we show that: (1) simply using fewer features already significantly accelerates classical SfM methods while maintaining high pose accuracy. (2) using feed-forward networks to obtain initial estimates and refining them using classical SfM techniques leads to the best efficiency-effectiveness trade-off. We will make our benchmark and code publicly available.[132] Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis
M. Kerem Aydin,Vishwanath Saragadam,Emma Alexander
Main category: cs.CV
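TL;DR: This paper proposes a lightweight preprocessing and splatting pipeline that improves novel view synthesis (NVS) from thermal imagery without RGB guidance, addressing the low dynamic range and photometric fluctuations of thermal images by expanding the usable dynamic range and stabilizing per-frame photometry.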
Details
Motivation: Thermal imaging provides reliable visibility in darkness and adverse conditions, but it is much harder to use for NVS than visible-light imagery, mainly because of the low dynamic range of affordable thermal sensors and their frame-to-frame photometric fluctuations and radiometric drift.
Method: Proposes a lightweight preprocessing and splatting pipeline that expands the usable dynamic range and stabilizes per-frame photometry, without RGB image guidance (relying only on camera poses).
Result: Achieves state-of-the-art performance on thermal-only NVS benchmarks without any dataset-specific tuning.
Conclusion: The method mitigates the low dynamic range and photometric instability inherent to thermal images and substantially improves thermal NVS quality, with good generalization and practicality.
Abstract: Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance (beyond camera pose) is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the-art performance across thermal-only NVS benchmarks, without requiring any dataset-specific tuning.
[133] Inverting Neural Networks: New Methods to Generate Neural Network Inputs from Prescribed Outputs
Rebecca Pattichis,Sebastian Janampa,Constantinos S. Pattichis,Marios S. Pattichis
Main category: cs.CV
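TL;DR: This paper studies the inverse problem for neural networks, i.e., finding input images that are mapped to a given class, and proposes two general inversion methods, a forward-pass method and a backward-pass method, which generate random-like images with high classification confidence, revealing vulnerabilities in the models.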
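As a rough illustration of the backward-pass idea in this entry (inverting a linear layer and adding a random vector from its null space), the following toy sketch inverts a single 2 x n linear layer in pure Python. The helper names (`invert_linear_layer`, `inv2x2`, and so on) are hypothetical, and this is only a sketch under the assumption of a full-row-rank weight matrix, not the authors' implementation.

```python
import random

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def invert_linear_layer(W, y, rng):
    """Return one x with W x = y for a 2 x n matrix W of full row rank (assumed)."""
    n = len(W[0])
    Wt = transpose(W)
    # Right inverse W^T (W W^T)^-1, so that W @ pinv is the identity.
    pinv = matmul(Wt, inv2x2(matmul(W, Wt)))
    # Minimum-norm solution of W x = y.
    x_min = [sum(p * yi for p, yi in zip(row, y)) for row in pinv]
    # Random vector z, projected onto the null space of W: z - pinv (W z).
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    Wz = [sum(w * zi for w, zi in zip(row, z)) for row in W]
    correction = [sum(p * v for p, v in zip(row, Wz)) for row in pinv]
    z_null = [zi - ci for zi, ci in zip(z, correction)]
    return [xm + zn for xm, zn in zip(x_min, z_null)]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
y = [1.0, -2.0]
x_a = invert_linear_layer(W, y, random.Random(0))
x_b = invert_linear_layer(W, y, random.Random(1))
```

Both `x_a` and `x_b` map to the same output `y`, yet they differ from each other, which is the sense in which null-space sampling explores many inputs for one prescribed output.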
Details
Motivation: Neural network mappings are complex and hard to interpret; understanding which input images lead to a given class output is needed to expose a model's internal features and potential vulnerabilities.
Method: Proposes two inversion methods: the forward-pass method is based on a root-finding algorithm and the Jacobian with respect to the input image; the backward-pass method inverts the network layer by layer, adding random vectors sampled from the null space of each linear layer.
Result: Validated on transformer architectures and on sequential networks based on linear layers; the generated input images achieve near-perfect classification scores in all cases while appearing random-like, exposing model vulnerabilities.
Conclusion: Compared with previous methods, the proposed methods cover the solution space more comprehensively, solve the inverse mapping problem, and reveal robustness weaknesses of current architectures.
Abstract: Neural network systems describe complex mappings that can be very difficult to understand. In this paper, we study the inverse problem of determining the input images that get mapped to specific neural network classes. Ultimately, we expect that these images contain recognizable features that are associated with their corresponding class classifications. We introduce two general methods for solving the inverse problem. In our forward pass method, we develop an inverse method based on a root-finding algorithm and the Jacobian with respect to the input image. In our backward pass method, we iteratively invert each layer, at the top. During the inversion process, we add random vectors sampled from the null-space of each linear layer. We demonstrate our new methods on both transformer architectures and sequential networks based on linear layers. Unlike previous methods, we show that our new methods are able to produce random-like input images that yield near-perfect classification scores in all cases, revealing vulnerabilities in the underlying networks. Hence, we conclude that the proposed methods provide a more comprehensive coverage of the input image spaces that solve the inverse mapping problem.
[134] CREG: Compass Relational Evidence for Interpreting Spatial Reasoning in Vision-Language Models
Kaizhen Tan
Main category: cs.CV
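TL;DR: This paper proposes CREG, a framework for explaining the evidence behind the decisions of vision-language models (VLMs) when they reason about directional spatial relations; the training-free method computes contrastive gradient-times-activation attributions and maps them into a compass-style polar coordinate system to produce a directional evidence distribution, introduces three new metrics to evaluate it, and clearly outperforms existing attribution methods on Qwen2-VL-7B.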
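To give a feel for what a directional evidence distribution over compass sectors could look like, here is a minimal sketch that bins a 2D attribution map into eight sectors around a reference pixel. The function name `compass_distribution`, the eight-sector layout, and the choice to keep only positive attribution are assumptions made for illustration; the paper's actual projection may differ.

```python
import math

SECTORS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]

def compass_distribution(attribution, center):
    """Bin positive attribution mass into 8 compass sectors around `center` (x, y)."""
    cx, cy = center
    mass = [0.0] * len(SECTORS)
    for y, row in enumerate(attribution):
        for x, value in enumerate(row):
            if value <= 0 or (x, y) == (cx, cy):
                continue
            # Image rows grow downward, so "north" is the direction of decreasing y.
            angle = math.degrees(math.atan2(cy - y, x - cx)) % 360.0
            idx = int((angle + 22.5) // 45) % 8
            mass[idx] += value
    total = sum(mass)
    if total == 0:
        return dict(zip(SECTORS, mass))
    return {name: m / total for name, m in zip(SECTORS, mass)}

grid = [[0.0] * 5 for _ in range(5)]
grid[2][4] = 3.0   # pixel to the right of the center
grid[0][2] = 1.0   # pixel directly above the center
dist = compass_distribution(grid, center=(2, 2))
```

In this toy grid, three quarters of the evidence falls in the east sector and one quarter in the north sector, so the summary direction would be read as "mostly east".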
Details
Motivation: Existing attribution methods (e.g., GradCAM, attention rollout) only reveal where a model attends and cannot characterize the directional relation it infers between objects; VLMs perform strongly on spatial reasoning tasks, yet how they encode direction remains unclear.
Method: Proposes the training-free interpretability framework CREG, which projects multi-layer contrastive Grad-times-Act attributions into a reference-centered polar coordinate system to produce a directional evidence distribution over compass sectors, and designs three evaluation metrics: Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS).
Result: On Qwen2-VL-7B, CREG consistently outperforms baselines on VSR and COCO-Pairs; on COCO-Pairs, prediction-targeted CREG reaches DAE = 55.5° and EA = 0.553, improving over attention rollout by 16.1° and 0.120 respectively; causal occlusion experiments give COS ≥ +0.42, supporting the faithfulness of the directional explanations; gains are smaller on Qwen2-VL-2B, suggesting that CREG benefits from the more structured spatial representations of larger models.
Conclusion: Contrastive, multi-layer attribution exposes the directional evidence a VLM relies on in spatial reasoning more faithfully than standard saliency-based explanations, and CREG offers a new tool for understanding how VLMs model directional relations.
Abstract: Vision-language models (VLMs) perform strongly on spatial reasoning benchmarks, yet how they encode directional relations remains poorly understood. Existing attribution methods such as GradCAM and attention rollout reveal where a model attends, but not what direction it infers between objects. We introduce CREG (Compass Relational Evidence Graph), a training-free interpretability framework that projects multi-layer contrastive Grad-times-Act attributions into a reference-centered polar coordinate system, producing a directional evidence distribution over compass sectors. To evaluate directional explanations, we propose three metrics: Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS). On Qwen2-VL-7B across VSR and COCO-Pairs, CREG consistently outperforms standard attribution baselines; on COCO-Pairs, prediction-targeted CREG achieves a DAE of 55.5 degrees and an EA of 0.553, improving over attention rollout by 16.1 degrees in angular error and 0.120 in EA. Causal occlusion experiments on 540 samples across both datasets further support the faithfulness of these directional explanations, with COS greater than or equal to +0.42. The gains are smaller on Qwen2-VL-2B, suggesting that CREG benefits from the more structured spatial representations that emerge at larger scales. Overall, our results show that contrastive, multi-layer attribution can expose directional evidence more faithfully than standard saliency-based explanations in VLM spatial reasoning.
[135] Lessons and Open Questions from a Unified Study of Camera-Trap Species Recognition Over Time
Sooyoung Jeon,Hongjie Tian,Lemeng Wang,Zheda Mai,Vidhi Bakshi,Jiacheng Hou,Ping Zhang,Arpita Chowdhury,Jianyang Gu,Wei-Lun Chao
Main category: cs.CV
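TL;DR: This paper presents the first time-ordered benchmark study of camera-trap species recognition, reveals performance bottlenecks of existing biological foundation models in long-term fixed-site deployment, identifies temporal shift and class imbalance as the main challenges, and verifies the effectiveness of model-update and post-processing techniques.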
Details
Motivation: Ecologists who deploy camera traps at fixed sites over long periods face temporal shift in backgrounds and species distributions caused by dynamic ecosystems, a practical challenge that the prevailing cross-domain generalization view does not capture.
Method: Builds a streaming evaluation benchmark of 546 camera traps with chronologically ordered intervals; following an end-user-centric study design, systematically evaluates zero-shot models, adaptive update strategies, and post-processing methods over realistic deployment lifecycles.
Result: Biological foundation models (e.g., BioCLIP 2) underperform at many sites even in the initial intervals; naive adaptation can fall below zero-shot performance; temporal shift and severe class imbalance are the core difficulties; combining model updates with post-processing substantially improves accuracy, though a gap remains.
Conclusion: Camera-trap species recognition should be modeled along the time dimension, with emphasis on site-specific adaptation and dynamic update mechanisms; the study offers deployment guidelines for ecological practice and raises new directions for computer vision and machine learning, such as predicting when zero-shot models will work and deciding when to update.
Abstract: Camera traps are vital for large-scale biodiversity monitoring, yet accurate automated analysis remains challenging due to diverse deployment environments. While the computer vision community has mostly framed this challenge as cross-domain generalization, this perspective overlooks a primary challenge faced by ecological practitioners: maintaining reliable recognition at a fixed site over time, where the dynamic nature of ecosystems introduces profound temporal shifts in both background and animal distributions. To bridge this gap, we present the first unified study of camera-trap species recognition over time. We introduce a realistic benchmark comprising 546 camera traps with a streaming protocol that evaluates models over chronologically ordered intervals. Our end-user-centric study yields four key findings. (1) Biological foundation models (e.g., BioCLIP 2) underperform at numerous sites even in initial intervals, underscoring the necessity of site-specific adaptation. (2) Adaptation is challenging under realistic evaluation: when models are updated using past data and evaluated on future intervals (mirroring real deployment lifecycles), naive adaptation can even degrade below zero-shot performance. (3) We identify two drivers of this difficulty: severe class imbalance and pronounced temporal shift in both species distribution and backgrounds between consecutive intervals. (4) We find that effective integration of model-update and post-processing techniques can largely improve accuracy, though a gap from the upper bounds remains. Finally, we highlight critical open questions, such as predicting when zero-shot models will succeed at a new site and determining whether/when model updates are necessary. Our benchmark and analysis provide actionable deployment guidelines for ecological practitioners while establishing new directions for future research in vision and machine learning.
[136] End-to-End Optimization of Polarimetric Measurement and Material Classifier
Ryota Maeda,Naoki Arikawa,Yutaka No,Shinsaku Hiura
Main category: cs.CV
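TL;DR: This paper proposes an end-to-end optimization framework that jointly learns a material classifier and the optimal combination of rotation angles for polarization elements, achieving high-accuracy material classification with fewer measurements.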
Details
Motivation: Polarization information is valuable for material recognition, but conventional polarimetric measurement requires repeatedly modulating the polarization state of the incident light, which is time-consuming and redundant for some tasks; the optimal configuration of measurement angles is also unclear.
Method: Proposes an end-to-end optimization framework that jointly learns a material classifier and the optimal rotation angles of the polarization elements controlling the incident and reflected light states, trained and validated on the authors' own Mueller-matrix material dataset.
Result: High-accuracy material classification is achieved even with a limited number of measurements.
Conclusion: Jointly optimizing the classifier and the polarimetric measurement configuration is an effective way to improve efficiency and performance, offering a new approach to lightweight, practical polarization-based material recognition.
Abstract: Material classification is a fundamental problem in computer vision and plays a crucial role in scene understanding. Previous studies have explored various material recognition methods based on reflection properties such as color, texture, specularity, and scattering. Among these cues, polarization is particularly valuable because it provides rich material information and enables recognition even at distances where capturing high-resolution texture is impractical. However, measuring polarimetric reflectance properties typically requires multiple modulations of the polarization state of the incident light, making the process time-consuming and often unnecessary for certain recognition tasks. While material classification can be achieved using only a subset of polarimetric measurements, the optimal configuration of measurement angles remains unclear. In this study, we propose an end-to-end optimization framework that jointly learns a material classifier and determines the optimal combinations of rotation angles for polarization elements that control both the incident and reflected light states. Using our Mueller-matrix material dataset, we demonstrate that our method achieves high-accuracy material classification even with a limited number of measurements.
[137] When Negation Is a Geometry Problem in Vision-Language Models
Fawaz Sammani,Tzoulio Chamiti,Paul Gavrikov,Nikos Deligiannis
Main category: cs.CV
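TL;DR: This paper proposes a fine-tuning-free, test-time representation engineering method that identifies and manipulates a negation-related direction in the CLIP embedding space to improve its understanding of negation in text, and builds a more reliable evaluation framework for negation understanding that uses multimodal large language models as judges.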
Details
Motivation: Joint vision-language embedding models such as CLIP struggle to understand negation words in text (e.g., "no"), and previous evaluation based on retrieval metrics cannot faithfully reflect negation understanding.
Method: Proposes a new evaluation framework based on multimodal LLMs-as-a-judge; probes for a latent negation direction in the CLIP embedding space and intervenes on it through test-time representation engineering, without any fine-tuning.
Result: Finds evidence of an identifiable negation direction in the CLIP embedding space; the test-time intervention steers CLIP toward negation-aware behavior; generalization is evaluated on non-common image-text samples under distribution shift.
Conclusion: Negation understanding can be modeled as a specific direction in the CLIP embedding space and influenced through representation engineering without training; MLLMs-as-a-judge offers a more trustworthy paradigm for evaluating negation understanding.
Abstract: Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries - for example, failing to distinguish "no" in the query: "a plain blue shirt with no logos". Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts.
[138] Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance
Liangyu Yuan,Yufei Huang,Mingkun Lei,Tong Zhao,Ruoyu Wang,Changxi Chi,Yiwei Wang,Chi Zhang
Main category: cs.CV
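TL;DR: This paper proposes SGG, a hybrid guidance method based on the weak-to-strong (W2S) principle that combines the strengths of CFG and AutoGuidance to mitigate accumulated gradient error during diffusion sampling, and transfers the W2S principle into the training objective to improve the generalization of unguided models.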
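The guidance methods named in this entry share one extrapolation rule: move from a weaker prediction toward (and past) a stronger one. The sketch below shows that generic rule only; the function name `guide` is hypothetical, and the specific way SGG combines or segments the two guidance types is not reproduced here.

```python
def guide(strong, weak, scale):
    """Extrapolate from the weak prediction toward the strong one.

    For CFG, `strong` is the conditional prediction and `weak` the unconditional one.
    For AutoGuidance, `strong` is the main model and `weak` a smaller or
    less-trained version of it.
    """
    return [w + scale * (s - w) for s, w in zip(strong, weak)]

conditional = [1.0, 2.0]
unconditional = [0.0, 1.0]
no_extrapolation = guide(conditional, unconditional, scale=1.0)
extrapolated = guide(conditional, unconditional, scale=2.0)
```

With `scale=1.0` the strong prediction is returned unchanged; values above 1 push the result further away from the weak prediction.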
Details
Motivation: In existing diffusion models, the mismatch between the simulation-free objective and the iterative process causes accumulated gradient error, poor results, and weak generalization; guidance methods such as CFG and AG exist, but their effective operating regimes are unclear, leaving little basis for choosing between them.
Method: Uses synthetic experiments to isolate and analyze the effective regimes of CFG and AG, designs the hybrid guidance method SGG under the weak-to-strong (W2S) principle, and further incorporates the W2S principle into the training objective to improve the generalization of unguided diffusion models.
Result: On SD3 and SD3.5, SGG outperforms existing training-free guidance methods at inference time; training experiments on transformer architectures confirm that the W2S transfer is effective, with gains in both conditional and unconditional settings.
Conclusion: The W2S principle offers a new perspective for understanding and designing diffusion guidance; SGG is both efficient and general, and the principle extends to the training stage, improving model generalization.
Abstract: Diffusion models generate synthetic images through an iterative refinement process. However, the misalignment between the simulation-free objective and the iterative process often causes accumulated gradient error along the sampling trajectory, which leads to unsatisfactory results and a failure to generalize. Guidance techniques like Classifier-Free Guidance (CFG) and AutoGuidance (AG) alleviate this by extrapolating between the main and inferior signal for stronger generalization. Despite empirical success, the effective operational regimes of prevalent guidance methods are still under-explored, leading to ambiguity when selecting the appropriate guidance method given a precondition. In this work, we first conduct synthetic comparisons to isolate and demonstrate the effective regime of guidance methods represented by CFG and AG from the perspective of the weak-to-strong principle. Based on this, we propose a hybrid instantiation called SGG under the principle, taking the benefits of both. Furthermore, we demonstrate that the W2S principle along with SGG can be migrated into the training objective, improving the generalization ability of unguided diffusion models. We validate our approach with comprehensive experiments. At inference time, evaluations on SD3 and SD3.5 confirm that SGG outperforms existing training-free guidance variants. Training-time experiments on transformer architectures demonstrate the effective migration and performance gains in both conditional and unconditional settings. Code is available at https://github.com/851695e35/SGG.
[139] RayMap3R: Inference-Time RayMap for Dynamic 3D Reconstruction
Feiran Wang,Zezhou Shang,Gaowen Liu,Yan Yan
Main category: cs.CV
TL;DR: RayMap3R is a training-free streaming framework for dynamic scene reconstruction that identifies and suppresses dynamic regions using RayMap-based predictions, achieving state-of-the-art performance.
Details
Motivation: Streaming feed-forward 3D reconstruction suffers from artifacts and drift caused by moving objects due to lack of explicit dynamic reasoning.
Method: RayMap3R uses a dual-branch inference scheme to identify dynamic regions by contrasting RayMap and image predictions, suppresses their interference during memory updates, and introduces reset metric alignment and state-aware smoothing for metric consistency and trajectory stability.
Result: RayMap3R achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.
Conclusion: RayMap3R effectively handles dynamic scenes in real-time 3D reconstruction without requiring training, leveraging intrinsic static-scene bias in RayMap predictions.
Abstract: Streaming feed-forward 3D reconstruction enables real-time joint estimation of scene geometry and camera poses from RGB images. However, without explicit dynamic reasoning, streaming models can be affected by moving objects, causing artifacts and drift. In this work, we propose RayMap3R, a training-free streaming framework for dynamic scene reconstruction. We observe that RayMap-based predictions exhibit a static-scene bias, providing an internal cue for dynamic identification. Based on this observation, we construct a dual-branch inference scheme that identifies dynamic regions by contrasting RayMap and image predictions, suppressing their interference during memory updates. We further introduce reset metric alignment and state-aware smoothing to preserve metric consistency and stabilize predicted trajectories. Our method achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.
[140] GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
Di Kong,Yikai Wang,Wenjie Guo,Yifan Bu,Boya Zhang,Yuexin Duan,Xiawei Yue,Wenbiao Du,Yiman Zhong,Yuwen Chen,Cheng Ma
Main category: cs.CV
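TL;DR: This paper proposes GaussianPile, which combines 3D Gaussian splatting with an imaging-system-aware focus model to compress and reconstruct slice-based volumetric data efficiently and with high fidelity, supporting real-time rendering, diagnostic fidelity, and 2D/3D visualization.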
Details
Motivation: Slice-based volumetric data (e.g., microscopy, ultrasound) must be compressed aggressively while preserving internal structure for analysis; existing methods struggle to balance compression ratio, detail fidelity, and computational efficiency.
Method: Proposes GaussianPile: (i) a slice-aware piling strategy that uses anisotropic 3D Gaussians to model through-slice contributions; (ii) a differentiable projection operator that encodes the finite-thickness point spread function of the imaging system; (iii) a compact encoding and joint optimization pipeline that reconstructs and compresses simultaneously; all implemented in CUDA.
Result: On microscopy and ultrasound data, storage and reconstruction costs drop substantially while diagnostic fidelity is maintained, with support for fast 2D visualization and 3D voxelization; reconstruction takes as little as 3 minutes, up to 11x faster than NeRF-based approaches, with a consistent 16x compression over voxel grids.
Conclusion: GaussianPile offers a practical, efficient, high-fidelity route to compressing and visualizing slice-based volumetric data, and is both deployable and analysis-friendly.
Abstract: Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. We introduce GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our proposed method introduces three key innovations: (i) a slice-aware piling strategy that positions anisotropic 3D Gaussians to model through-slice contributions, (ii) a differentiable projection operator that encodes the finite-thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real-time rendering efficiency of Gaussian primitives while preserving high-frequency internal volumetric detail. Experiments on microscopy and ultrasound datasets demonstrate that our method reduces storage and reconstruction cost, sustains diagnostic fidelity, and enables fast 2D visualization, along with 3D voxelization. In practice, it delivers high-quality results in as few as 3 minutes, up to 11x faster than NeRF-based approaches, and achieves consistent 16x compression over voxel grids, offering a practical path to deployable compression and exploration of slice-based volumetric datasets.
[141] ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
Guanzhou Chen,Erfei Cui,Changyao Tian,Danni Yang,Ganlin Yang,Yu Qiao,Hongsheng Li,Gen Luo,Hongjie Zhang
Main category: cs.CV
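TL;DR: This paper proposes ScaleEditor, an open-source multi-agent framework for efficiently building large-scale, high-quality instruction-based image editing datasets, and uses it to curate ScaleEdit-12M, described as the largest open-source image editing dataset to date; experiments show that fine-tuning models on this dataset improves performance across a range of editing tasks.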
Details
Motivation: Existing image editing datasets depend on closed-source models or fixed synthesis pipelines, leading to high cost, limited quality, and poor generalization; a low-cost, scalable, high-quality open-source approach to data construction is needed.
Method: Proposes ScaleEditor, a hierarchical multi-agent framework with three components: (1) source image expansion with world-knowledge infusion, (2) adaptive multi-agent editing instruction-image synthesis, and (3) a task-aware data quality verification mechanism.
Result: Builds ScaleEdit-12M, with 12 million samples covering 23 task families across real and synthetic domains; fine-tuning improves UniWorld-V1 and Bagel on multiple benchmarks (e.g., by up to 10.4% on ImgEdit and up to 150.0% on RISE).
Conclusion: Open-source, agentic data construction pipelines can approach commercial-grade data quality while remaining cost-effective and scalable, providing a solid foundation for instruction-based editing in unified multimodal models.
Abstract: Instruction-based image editing has emerged as a key capability for unified multimodal models (UMMs), yet constructing large-scale, diverse, and high-quality editing datasets without costly proprietary APIs remains challenging. Previous image editing datasets either rely on closed-source models for annotation, which prevents cost-effective scaling, or employ fixed synthetic editing pipelines, which suffer from limited quality and generalizability. To address these challenges, we propose ScaleEditor, a fully open-source hierarchical multi-agent framework for end-to-end construction of large-scale, high-quality image editing datasets. Our pipeline consists of three key components: source image expansion with world-knowledge infusion, adaptive multi-agent editing instruction-image synthesis, and a task-aware data quality verification mechanism. Using ScaleEditor, we curate ScaleEdit-12M, the largest open-source image editing dataset to date, spanning 23 task families across diverse real and synthetic domains. Fine-tuning UniWorld-V1 and Bagel on ScaleEdit yields consistent gains, improving performance by up to 10.4% on ImgEdit and 35.1% on GEdit for general editing benchmarks and by up to 150.0% on RISE and 26.5% on KRIS-Bench for knowledge-infused benchmarks. These results demonstrate that open-source, agentic pipelines can approach commercial-grade data quality while retaining cost-effectiveness and scalability. Both the framework and dataset will be open-sourced.
[142] A Multihead Continual Learning Framework for Fine-Grained Fashion Image Retrieval with Contrastive Learning and Exponential Moving Average Distillation
Ling Xiao,Toshihiko Yamasaki
Main category: cs.CV
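TL;DR: This paper proposes MCL-FIR, described as the first framework to bring class-incremental learning (CIL) to fine-grained fashion image retrieval; it achieves efficient continual learning through a multi-head design, doublet-based contrastive learning, and EMA distillation, maintaining high accuracy while greatly reducing training cost.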
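Two of the ingredients named in this entry, an InfoNCE contrastive loss and an exponential moving average (EMA) teacher, have standard textbook forms that can be sketched in a few lines. The function names, the momentum value, and the temperature below are illustrative assumptions, not settings taken from the paper.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def info_nce(similarities, positive_index, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax(similarities / temperature)[positive]."""
    logits = [s / temperature for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[positive_index]

teacher = ema_update([0.0, 1.0], [1.0, 1.0], momentum=0.9)
easy = info_nce([0.9, 0.1], positive_index=0)   # positive is the most similar item
hard = info_nce([0.1, 0.9], positive_index=0)   # a negative is more similar
```

The loss is small when the positive pair has the highest similarity and grows when a negative outranks it, while the EMA teacher changes slowly, which is what makes it a stable distillation target.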
Details
Motivation: Existing fine-grained fashion image retrieval methods cannot handle new attributes that appear dynamically and require full retraining; the zero-shot performance of pretrained models is insufficient, and no prior work explores class-incremental learning for this task.
Method: Proposes the multihead continual learning framework MCL-FIR, comprising (1) a multi-head structure to accommodate incremental classes; (2) reformulating triplets into doublets trained with an InfoNCE loss to simplify training; (3) exponential moving average (EMA) distillation for knowledge transfer.
Result: On four datasets, MCL-FIR significantly outperforms CIL baselines at similar training cost; compared with static methods, it reaches comparable performance at only about 30% of the training cost.
Conclusion: MCL-FIR addresses the scalability and efficiency problems of fine-grained fashion image retrieval in dynamic scenarios and is presented as the first class-incremental learning approach for this task.
Abstract: Most fine-grained fashion image retrieval (FIR) methods assume a static setting, requiring full retraining when new attributes appear, which is costly and impractical for dynamic scenarios. Although pretrained models support zero-shot inference, their accuracy drops without supervision, and no prior work explores class-incremental learning (CIL) for fine-grained FIR. We propose a multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and exponential moving average (EMA) distillation (MCL-FIR). MCL-FIR adopts a multi-head design to accommodate evolving classes across increments, reformulates triplet inputs into doublets with InfoNCE for simpler and more effective training, and employs EMA distillation for efficient knowledge transfer. Experiments across four datasets demonstrate that, beyond its scalability, MCL-FIR achieves a strong balance between efficiency and accuracy. It significantly outperforms CIL baselines under similar training cost, and compared with static methods, it delivers comparable performance while using only about 30% of the training cost. The source code is publicly available at https://github.com/Dr-LingXiao/MCL-FIR.
[143] IBCapsNet: Information Bottleneck Capsule Network for Noise-Robust Representation Learning
Canqun Xiang,Chen Yang,Jiaoyan Zhao
Main category: cs.CV
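TL;DR: This paper proposes IBCapsNet, a new capsule network architecture based on the Information Bottleneck principle that replaces iterative dynamic routing with a one-pass variational aggregation mechanism, improving computational efficiency and noise robustness.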
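The KL regularization that variational models of this kind place on latent codes usually takes the closed form for a diagonal Gaussian posterior against a standard normal prior. The sketch below shows that standard formula; that IBCapsNet uses exactly this prior and parameterization is an assumption, and the function name is hypothetical.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var))

matches_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
shifted_mean = kl_to_standard_normal([1.0], [0.0])
```

The penalty is zero when the latent distribution equals the prior and grows as the mean or variance drifts away from it, which is how the term discourages the latent code from carrying more information than needed.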
Details
Motivation: Addresses two problems of conventional capsule networks (CapsNets): high computational cost and poor robustness to input corruption.
Method: Uses a one-pass variational aggregation mechanism grounded in the Information Bottleneck principle: primary capsules are first compressed into a global context representation, and class-specific variational autoencoders (VAEs) then infer latent capsules regularized by the KL divergence.
Result: On MNIST, Fashion-MNIST, SVHN, and CIFAR-10, clean-data accuracy matches CapsNet (e.g., 99.41% on MNIST and 92.01% on SVHN), while accuracy is significantly better under four types of synthetic noise (average improvements of +17.10% and +14.54% for clamped additive and multiplicative noise); training is 2.54x faster, inference throughput is 3.64x higher, and parameters are reduced by 4.66%.
Conclusion: IBCapsNet combines information-theoretic representation learning with capsule networks, offering a new paradigm for building robust, efficient, and interpretable deep models.
Abstract: Capsule networks (CapsNets) are superior at modeling hierarchical spatial relationships but suffer from two critical limitations: high computational cost due to iterative dynamic routing and poor robustness under input corruptions. To address these issues, we propose IBCapsNet, a novel capsule architecture grounded in the Information Bottleneck (IB) principle. Instead of iterative routing, IBCapsNet employs a one-pass variational aggregation mechanism, where primary capsules are first compressed into a global context representation and then processed by class-specific variational autoencoders (VAEs) to infer latent capsules regularized by the KL divergence. This design enables efficient inference while inherently filtering out noise. Experiments on MNIST, Fashion-MNIST, SVHN and CIFAR-10 show that IBCapsNet matches CapsNet in clean-data accuracy (achieving 99.41% on MNIST and 92.01% on SVHN), yet significantly outperforms it under four types of synthetic noise - demonstrating average improvements of +17.10% and +14.54% for clamped additive and multiplicative noise, respectively. Moreover, IBCapsNet achieves 2.54x faster training and 3.64x higher inference throughput compared to CapsNet, while reducing model parameters by 4.66%. Our work bridges information-theoretic representation learning with capsule networks, offering a principled path toward robust, efficient, and interpretable deep models. Code is available at https://github.com/cxiang26/IBCapsnet
[144] MFSR: MeanFlow Distillation for One Step Real-World Image Super Resolution
Ruiqing Wang,Kai Zhang,Yuanzhi Zhu,Hanshu Yan,Shilin Lu,Jian Yang
Main category: cs.CV
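TL;DR: This paper proposes Mean Flows for Super-Resolution (MFSR), a new distillation framework that generates high-quality super-resolved images in a single step while supporting an optional few-step path for further improvement; its core is to use MeanFlow as the learning target and to improve the Classifier-Free Guidance distillation strategy, balancing efficiency, flexibility, and restoration quality.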
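The "average velocity" that a MeanFlow-style student is trained to predict can be illustrated on a toy one-dimensional flow. In the sketch below the instantaneous velocity depends only on time, v(tau) = 2 * tau, which is a deliberate simplification (in a real PF-ODE it also depends on the state); the helper names are hypothetical.

```python
def average_velocity(velocity, r, t, steps=1000):
    """Midpoint-rule estimate of (1 / (t - r)) * integral of velocity from r to t."""
    h = (t - r) / steps
    total = sum(velocity(r + (i + 0.5) * h) for i in range(steps))
    return total * h / (t - r)

def one_step(z_t, u, r, t):
    """Jump from time t back to time r using the average velocity u."""
    return z_t - (t - r) * u

def velocity(tau):
    # Toy flow whose trajectory is z(tau) = tau ** 2.
    return 2.0 * tau

r, t = 0.2, 1.0
u = average_velocity(velocity, r, t)
z_r = one_step(z_t=t ** 2, u=u, r=r, t=t)
```

Here a single jump with the average velocity lands on the same point that integrating the instantaneous velocity step by step would reach, which is why predicting the average velocity directly makes one-step sampling possible.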
Details
Motivation: Existing diffusion- or flow-based Real-ISR methods are slow at inference and hard to deploy; one-step distillation is fast but often sacrifices quality and loses the ability to refine with more steps.
Method: Proposes the MFSR framework: MeanFlow is used as the learning target so that the student approximates the average velocity between arbitrary states of the probability flow ODE; the original MeanFlow CFG formulation is improved with a teacher CFG distillation strategy to better exploit pretrained generative priors.
Result: On synthetic and real-world datasets, MFSR's one-step results match or exceed multi-step teacher models at much lower computational cost, while supporting optional few-step refinement.
Conclusion: MFSR delivers efficient, flexible, high-quality real-world image super-resolution, resolving the key tension in one-step distillation between quality loss and the lack of refinement capability.
Abstract: Diffusion- and flow-based models have advanced Real-world Image Super-Resolution (Real-ISR), but their multi-step sampling makes inference slow and hard to deploy. One-step distillation alleviates the cost, yet often degrades restoration quality and removes the option to refine with more steps. We present Mean Flows for Super-Resolution (MFSR), a new distillation framework that produces photorealistic results in a single step while still allowing an optional few-step path for further improvement. Our approach uses MeanFlow as the learning target, enabling the student to approximate the average velocity between arbitrary states of the Probability Flow ODE (PF-ODE) and effectively capture the teacher's dynamics without explicit rollouts. To better leverage pretrained generative priors, we additionally improve the original MeanFlow's Classifier-Free Guidance (CFG) formulation with a teacher CFG distillation strategy, which enhances restoration capability and preserves fine details. Experiments on both synthetic and real-world benchmarks demonstrate that MFSR achieves efficient, flexible, and high-quality super-resolution, delivering results on par with or even better than multi-step teachers while requiring much lower computational cost.
[145] Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models
Yifan Yang,Lei Zou,Wendy Jepson
Main category: cs.CV
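TL;DR: This paper proposes two methods for generating post-disaster street-view images from satellite imagery (a VLM-guided method and a damage-sensitive MoE method) and builds a structure-aware evaluation framework, finding that methods with high visual realism may fall short in preserving critical structural information.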
Details
Motivation: Rapid situational awareness is urgently needed after natural disasters, but satellite imagery lacks ground-level detail and real street-view data is hard to obtain in time, leaving a cross-view data gap to fill.
Method: Proposes two generative strategies, a synthesis method guided by a vision-language model (VLM) and a damage-sensitive Mixture-of-Experts (MoE) method, and designs a multi-tier structure-aware evaluation framework comprising pixel-level quality assessment, ResNet-based semantic consistency verification, and VLM-as-a-Judge perceptual alignment evaluation.
Result: In experiments on 300 disaster scenarios, ControlNet achieves the highest semantic accuracy (0.71), while the VLM-enhanced and MoE models are better in textural plausibility but weaker in semantic clarity, revealing a realism-fidelity trade-off.
Conclusion: Visually realistic generations do not necessarily preserve the critical structural information needed for disaster assessment; the study establishes a baseline for trustworthy cross-view synthesis and stresses that perceptual realism and structural-semantic reliability must both be considered.
Abstract: In the immediate aftermath of natural disasters, rapid situational awareness is critical. Traditionally, satellite observations are widely used to estimate damage extent. However, they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism-fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy, 0.71, whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.
[146] Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
Huan Zheng,Yucheng Zhou,Tianyi Yan,Dubing Chen,Hongbo Lu,Wenlong Liao,Tao He,Pai Peng,Jianbing Shen
Main category: cs.CV
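TL;DR: This paper proposes a Clinical-Cognitive-Aligned (CogAlign) framework that uses supervised fine-tuning on a hierarchical clinical cognition dataset and counterfactual-driven reinforcement learning to improve the causal reasoning and clinical-logic consistency of multimodal large models in gastrointestinal endoscopy diagnosis.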
Details
Motivation: Existing medical multimodal large language models face two problems in gastrointestinal endoscopy: model reasoning does not follow standardized clinical cognitive pathways, and there is no causal link between visual features and diagnostic outcomes.
Method: (1) Constructs a hierarchical clinical cognition dataset and applies supervised fine-tuning (SFT) to instill experts' hierarchical diagnostic logic (anatomical localization, morphological evaluation, microvascular analysis) into the model; (2) provides a theoretical analysis arguing that standard supervised tuning converges to spurious background correlations, and then proposes a reinforcement learning strategy for causal rectification that generates counterfactual normal samples via lesion masking and uses clinical-cognition-centric rewards.
Result: Achieves state-of-the-art performance on multiple benchmarks, significantly improving diagnostic accuracy in complex clinical scenarios.
Conclusion: The CogAlign framework narrows the gap between general-purpose multimodal large models and clinical practice, enabling gastrointestinal endoscopy diagnosis that is more reliable, more interpretable, and consistent with clinical logic.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
[147] High-Quality and Efficient Turbulence Mitigation with Events
Xiaoran Zhang,Jian Ding,Yuxing Duan,Haoyue Liu,Gang Chen,Yi Chang,Luxin Yan
Main category: cs.CV
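TL;DR: This paper proposes EHETM, which exploits the high temporal resolution of event cameras and, based on an analysis of the polarity alternation of turbulence-induced events and the "event tubes" formed by dynamic objects, designs two complementary modules for high-quality, efficient atmospheric turbulence mitigation.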
Details
Motivation: 传统基于多帧图像的湍流抑制方法在精度与效率之间存在权衡,而事件相机因其微秒级时间分辨率和对动态变化的高效感知能力,有望突破这一瓶颈。 Method: 基于湍流诱导事件的极性交替与动态物体形成“事件管”的现象,设计了两个互补模块:一是利用极性加权梯度进行场景精细化;二是利用事件管约束实现运动解耦。 Result: EHETM在真实世界事件-帧湍流数据集(涵盖大气与热成像场景)上超越现有最优方法,尤其在含动态物体场景中表现更优,并将数据开销和系统延迟分别降低约77.3%和89.5%。 Conclusion: EHETM有效结合事件相机优势,实现了高质、高效的湍流抑制,在动态场景下具有显著优势,为实时低延迟湍流校正提供了新思路。 Abstract: Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent ``event tubes'' in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively. 
Our code is available at: https://github.com/Xavier667/EHETM.
[148] The Role and Relationship of Initialization and Densification in 3D Gaussian Splatting
Ivan Desiatov,Torsten Sattler
Main category: cs.CV
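TL;DR: This paper systematically studies the relationship between initialization and densification in 3D Gaussian Splatting (3DGS). Using a newly proposed benchmark, it tests combinations of different initializations (laser scans, stereo, monocular depth estimates, sparse SfM point clouds) with different densification schemes, and finds that current densification methods cannot take full advantage of dense initialization and often fail to improve significantly over sparse SfM initialization.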
Details
Motivation: Existing 3DGS methods rely on a densification stage to generate dense Gaussian clouds from a sparse initialization, but how well they exploit initializations of differing quality is unclear, which calls for a systematic evaluation of the relationship between initialization and densification.
Method: Build a new benchmark that compares combinations of several dense and sparse initializations (laser scans, multi-view stereo, monocular depth, sparse SfM) with different densification schemes.
Result: Current densification methods generally cannot take full advantage of dense initialization, and their results are often not significantly better than with sparse SfM initialization.
Conclusion: Densification strategies are a bottleneck, and new methods that better exploit high-quality dense initialization are needed; the benchmark will be released publicly to support follow-up research.
Abstract: 3D Gaussian Splatting (3DGS) has become the method of choice for photo-realistic 3D reconstruction of scenes, due to being able to efficiently and accurately recover the scene appearance and geometry from images. 3DGS represents the scene through a set of 3D Gaussians, parameterized by their position, spatial extent, and view-dependent color. Starting from an initial point cloud, 3DGS refines the Gaussians' parameters so as to reconstruct a set of training images as accurately as possible. Typically, a sparse Structure-from-Motion point cloud is used as initialization. In order to obtain dense Gaussian clouds, 3DGS methods thus rely on a densification stage. In this paper, we systematically study the relation between densification and initialization. Proposing a new benchmark, we study combinations of different types of initializations (dense laser scans, dense (multi-view) stereo point clouds, dense monocular depth estimates, sparse SfM point clouds) and different densification schemes. We show that current densification approaches are not able to take full advantage of dense initialization as they are often unable to (significantly) improve over sparse SfM-based initialization. We will make our benchmark publicly available.
[149] Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
Yifei Deng,Chenglong Li,Yuyang Zhang,Guyue Hu,Jin Tang
Main category: cs.CV
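TL;DR: This paper proposes a Cross-modal Fuzzy Alignment Network (CFA-Net) for text-aerial person retrieval. It uses fuzzy logic to model token-level reliability and introduces ground-view images as a bridge agent to narrow the semantic gap between aerial images and text descriptions; it also builds a large-scale benchmark dataset, AERI-PEDES.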
Details
Motivation: Aerial images suffer from large variations in viewing angle and altitude and from severely degraded visual information, which makes semantic alignment with text descriptions difficult; existing methods struggle with the semantic inconsistencies caused by missing or noisy visual cues.
Method: Proposes the Cross-modal Fuzzy Alignment Network (CFA-Net): 1) a Fuzzy Token Alignment module that uses a fuzzy membership function to dynamically model token-level association strength and suppress unobservable or noisy tokens; 2) a Context-Aware Dynamic Alignment module that uses ground-view images as a bridge agent and adaptively combines direct alignment with agent-assisted alignment; 3) the AERI-PEDES dataset, built with a chain-of-thought pipeline that decomposes text generation to improve textual accuracy and semantic consistency.
Result: Experiments on AERI-PEDES and TBAPR show that the method clearly outperforms existing methods and is more robust to missing and noisy visual information.
Conclusion: Fuzzy logic and the bridge-agent mechanism reduce the modality gap between text and aerial images and improve fine-grained semantic alignment; AERI-PEDES offers a more reliable and more challenging benchmark for the task.
Abstract: Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text-image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network for text-aerial person retrieval, which quantifies token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporate the ground-view agent as a bridge in text-aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness.
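The idea of weighting tokens by a fuzzy membership value can be sketched as follows. This is only an illustration of the general technique: the sigmoid membership function, its `center` and `slope` parameters, and the name `fuzzy_token_alignment` are assumptions, and the actual Fuzzy Token Alignment module may be defined quite differently.

```python
import numpy as np

def fuzzy_token_alignment(text_tokens, image_patches, center=0.3, slope=10.0):
    """Aggregate token-to-patch similarity, down-weighting unreliable tokens.

    text_tokens:   (T, D) text token embeddings
    image_patches: (P, D) image patch embeddings
    center, slope: hypothetical parameters of a sigmoid membership function
    """
    # Cosine similarity between every token and every patch.
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    v = image_patches / np.linalg.norm(image_patches, axis=1, keepdims=True)
    sim = t @ v.T                      # (T, P)
    best = sim.max(axis=1)             # best-matching patch per token
    # Membership in [0, 1]: how "observable" a token is in the image.
    membership = 1.0 / (1.0 + np.exp(-slope * (best - center)))
    # Membership-weighted average: tokens with no visual support barely count.
    score = float((membership * best).sum() / (membership.sum() + 1e-8))
    return score, membership
```

With this weighting, a description token that has no visible counterpart in a degraded aerial image contributes little to the final score instead of dragging it down.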
In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.
[150] Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
Zihao Wang,Yuxiang Wei,Xinpeng Zhou,Tianyu Zhang,Tao Liang,Yalong Bai,Hongzhi Zhang,Wangmeng Zuo
Main category: cs.CV
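TL;DR: Premier is a personalized text-to-image generation framework built on learnable user preference embeddings and a preference adapter. It fuses the user embedding with the text prompt, modulates the generation process, and adds a dispersion loss to make preferences more distinguishable; it generalizes well even with little user data.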
Details
Motivation: Existing text-to-image methods struggle to capture and reflect fine-grained user preferences, which leads to poor personalization.
Method: Premier represents each user's preference as a learnable embedding, uses a preference adapter to fuse it with the text prompt, and modulates the generative process with the fused embedding; a dispersion loss increases separation among user embeddings; new users are represented as linear combinations of existing embeddings to support few-shot generalization.
Result: Premier outperforms prior methods at the same history length, with better preference alignment, text consistency, ViPer proxy metrics, and expert evaluations.
Conclusion: Premier offers an efficient, scalable, and robust framework for personalized image generation that improves the alignment between user preference modeling and generation quality.
Abstract: Text-to-image generation has advanced rapidly, yet it still struggles to capture nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preferences and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.
[151] Weakly supervised multimodal segmentation of acoustic borehole images with depth-aware cross-attention
Jose Luis Lima de Jesus Silva
Main category: cs.CV
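TL;DR: This paper proposes a weakly supervised multimodal segmentation framework that, without dense expert annotations, fuses acoustic borehole images with depth-aligned well logs; a confidence-gated depth-aware cross-attention mechanism (CG-DCA) enables robust, scalable, annotation-free segmentation of borehole-wall structure.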
Details
Motivation: Acoustic borehole images are high-resolution, but large-scale interpretation is limited by scarce expert annotations and the intrinsically multimodal nature of subsurface information; weakly supervised methods that combine 2D image texture with depth-aligned 1D well logs are needed.
Method: A weakly supervised multimodal segmentation framework that starts from threshold-generated pseudo-labels and refines them with learned models, incorporating denoising, confidence-aware pseudo-supervision, and physically structured fusion; several fusion strategies are compared (direct concatenation, depth-aware cross-attention, gated fusion, confidence-aware modulation), and confidence-gated depth-aware cross-attention (CG-DCA) is adopted.
Result: CG-DCA consistently outperforms raw thresholding, denoised thresholding, latent clustering, and earlier multimodal baselines; ablations show that its advantage comes from confidence-aware fusion and structured local depth interaction rather than model complexity; cross-well analyses confirm stable generalization.
Conclusion: The work establishes a practical, scalable framework for annotation-free segmentation and shows that the multimodal gains come from selective, depth-aware fusion of auxiliary well logs rather than from simply stacking modalities.
Abstract: Acoustic borehole images provide high-resolution borehole-wall structure, but large-scale interpretation remains difficult because dense expert annotations are rarely available and subsurface information is intrinsically multimodal. The challenge is developing weakly supervised methods that combine two-dimensional image texture with depth-aligned one-dimensional well logs. Here, we introduce a weakly supervised multimodal segmentation framework that refines threshold-guided pseudo-labels through learned models. This preserves the annotation-free character of classical thresholding and clustering workflows while extending them with denoising, confidence-aware pseudo-supervision, and physically structured fusion. We establish that threshold-guided learned refinement provides the most robust improvement over raw thresholding, denoised thresholding, and latent clustering baselines. Multimodal performance depends strongly on fusion strategy: direct concatenation provides limited gains, whereas depth-aware cross-attention, gated fusion, and confidence-aware modulation substantially improve agreement with the weak supervisory reference. The strongest model, confidence-gated depth-aware cross-attention (CG-DCA), consistently outperforms threshold-based, image-only, and earlier multimodal baselines. Targeted ablations show its advantage depends specifically on confidence-aware fusion and structured local depth interaction rather than model complexity alone. Cross-well analyses confirm this performance is broadly stable. These results establish a practical, scalable framework for annotation-free segmentation, showing multimodal improvement is maximized when auxiliary logs are incorporated selectively and in a depth-aware manner.
[152] VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation
Jun Du
Main category: cs.CV
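TL;DR: This paper proposes VSD-MOT, a multi-object tracking framework for low-quality video. Its Dual-Constraint Semantic Distillation (DCSD) method and Dynamic Semantic Weight Regulation (DSWR) module transfer knowledge from the CLIP image encoder to improve feature representation and robustness on low-quality images.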
Details
Motivation: Existing MOT algorithms degrade sharply on low-quality video because of information loss; stronger semantic understanding of degraded images is needed.
Method: The VSD-MOT framework: 1) uses the CLIP image encoder as the teacher and a Dual-Constraint Semantic Distillation (DCSD) method for knowledge transfer; 2) adds a Dynamic Semantic Weight Regulation (DSWR) module that adaptively fuses semantic features according to real-time frame quality.
Result: Clearly outperforms existing methods in real-world low-quality video scenarios while maintaining good performance in conventional scenarios.
Conclusion: Visual semantic distillation and dynamic weight regulation mitigate the information loss caused by low-quality video, offering a new approach to robust multi-object tracking.
Abstract: Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms' inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we design the Dual-Constraint Semantic Distillation (DCSD) method. Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in real-world low-quality video scenarios. Meanwhile, our method can maintain good performance in conventional scenarios.
[153] SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
Qunjie Huang,Weina Zhu
Main category: cs.CV
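TL;DR: This paper proposes SATTC, a structure-aware test-time calibration method that addresses subject shift and hubness in the embedding space for cross-subject EEG-to-image retrieval, improving retrieval reliability at small k.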
Details
Motivation: Cross-subject EEG-to-image retrieval suffers from subject shift and hubness in the embedding space, which distort similarity geometry, destabilize top-k rankings, and make small-k shortlists unreliable.
Method: A label-free SATTC calibration head that operates directly on the similarity matrix of frozen EEG and image encoders; it combines a geometric expert (subject-adaptive EEG whitening plus an adaptive CSLS variant) with a structural expert (mutual nearest neighbors, bidirectional top-k ranks, class popularity), fused via a Product-of-Experts rule.
Result: On THINGS-EEG, SATTC further improves Top-1/Top-5 accuracy over a strong baseline (cosine similarity, L2 normalization, candidate whitening), reduces hubness and class imbalance, and makes small-k shortlists more reliable; the gains transfer across multiple EEG encoders.
Conclusion: SATTC is an encoder-agnostic, label-free test-time calibration layer suited to cross-subject neural decoding.
Abstract: Cross-subject EEG-to-image retrieval for visual decoding is challenged by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small-k shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free calibration head that operates directly on the similarity matrix of frozen EEG and image encoders. SATTC combines a geometric expert, which pairs subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS), and a structural expert built from mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On THINGS-EEG under a strict leave-one-subject-out protocol, standardized inference with cosine similarities, L2-normalized embeddings, and candidate whitening already yields a strong cross-subject baseline over the original ATM retrieval setup. Building on this baseline, SATTC further improves Top-1 and Top-5 accuracy, reduces hubness and per-class imbalance, and produces more reliable small-k shortlists. These gains transfer across multiple EEG encoders, supporting SATTC as an encoder-agnostic, label-free test-time calibration layer for cross-subject neural decoding.
[154] Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding
Jincen Jiang,Qianyu Zhou,Yuhang Li,Kui Su,Meili Wang,Jian Chang,Jian Jun Zhang,Xuequan Lu
Main category: cs.CV
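TL;DR: This paper proposes SADG, a structure-aware domain generalization framework built on the Mamba architecture. Using structure-aware serialization (SAS), hierarchical domain-aware modeling (HDM), and spectral graph alignment (SGA), it improves structural fidelity and generalization in multi-task point cloud domain generalization.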
Details
Motivation: Existing Transformer and Mamba models perform poorly in multi-task domain generalization (DG): Transformers are computationally expensive and lack explicit structural ordering; Mamba relies on coordinate-based serialization, which is sensitive to viewpoint changes and missing regions and causes structural drift.
Method: The SADG framework has three parts: 1) structure-aware serialization (SAS), which uses centroid-based topology and geodesic curvature continuity to generate transformation-invariant sequences; 2) hierarchical domain-aware modeling (HDM), which consolidates intra-domain structure and fuses inter-domain relations; 3) lightweight spectral graph alignment (SGA), which at test time shifts target features toward source prototypes in the spectral domain without parameter updates. The authors also build MP3DObject, a real-scan dataset for evaluation.
Result: Clearly outperforms SOTA methods on multiple tasks including reconstruction, denoising, and registration, with better structural fidelity and cross-domain generalization.
Conclusion: Structure-aware modeling and parameter-free test-time alignment reduce structural instability in multi-task point cloud DG and offer a new paradigm for robust use of Mamba-style models in geometric learning.
Abstract: While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.
[155] CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
Xiefan Guo,Xinzhu Ma,Haiyu Zhang,Di Huang
Main category: cs.CV
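TL;DR: This paper proposes Cross-Timestep Self-Calibration (CTCal), which uses the more reliable text-image alignment available at small timesteps to calibrate representation learning at large timesteps, improving fine-grained alignment in text-to-image generation. It integrates seamlessly into a range of diffusion models (e.g., SD 2.1, SD 3), and its effectiveness and generalizability are validated on multiple benchmarks.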
Details
Motivation: The conventional diffusion loss provides only implicit supervision and struggles to model fine-grained text-image correspondence, so generated images align imprecisely with text prompts.
Method: Cross-Timestep Self-Calibration (CTCal) uses the stable cross-attention maps formed at small timesteps (less noise) to calibrate feature representations at large timesteps (more noise), and adds a timestep-aware adaptive weighting to jointly optimize the CTCal loss and the original diffusion loss.
Result: Clearly improves text-image alignment on benchmarks including T2I-Compbench++ and GenEval, and applies to different architectures such as SD 2.1 (diffusion-based) and SD 3 (flow-based), supporting its effectiveness and model-agnostic nature.
Conclusion: CTCal provides explicit fine-grained alignment supervision for text-to-image generation; it is a simple, general, and efficient improvement that could become a standard component for alignment optimization in diffusion models.
Abstract: Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at https://github.com/xiefan-guo/ctcal.
[156] Smart Operation Theatre: An AI-based System for Surgical Gauze Counting
Saraf Krish,Cai Yiyu,Huang Li Hui
Main category: cs.CV
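TL;DR: This paper presents a YOLOv5-based AI system that uses real-time video monitoring and object recognition to automatically count the surgical gauzes going into and coming out of a patient during surgery, in order to prevent gossypiboma. The system integrates human and gauze detection, is trained on 11,000 images, improves on its predecessor in both accuracy and frame rate (raised to 15 FPS), supports manual count corrections by doctors, and was developed and refined in collaboration with Singapore General Hospital.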
Details
Motivation: Gauze left inside patients during surgery causes gossypiboma, with serious medical risks and legal disputes; existing prevention methods (manual counts, RFID gauzes) suffer from low efficiency, high cost, or poor compatibility.
Method: An end-to-end YOLOv5-based deep learning model that jointly detects gauzes on the "In" and "Out" trays as well as medical staff; trained on 11,000 operating-theatre images with data augmentation; includes a manual correction feature to fit real surgical workflows.
Result: Compared with the previous two-model approach (2,800 images, 8 FPS), the new model is more accurate and reaches 15 FPS; it supports real-time, automated gauze counting with manual override, and its feasibility and reliability were validated in real settings at SGH.
Conclusion: The system offers an efficient, robust, and clinically practical approach to surgical gauze management, reduces the risk of human error, and could be extended to intelligent tracking of other surgical consumables.
Abstract: During surgeries, there is a risk of medical gauzes being left inside patients' bodies. This condition, known as "gossypiboma", can cause serious complications for patients and expose hospitals to malpractice lawsuits and regulatory penalties. Diagnosis depends on imaging methods such as X-rays or CT scans, and the usual treatment involves surgical excision. Prevention methods, such as manual counts and RFID-integrated gauzes, aim to minimize gossypiboma risks. However, manual tallying of hundreds of gauzes by nurses is time-consuming and diverts resources from patient care. In partnership with Singapore General Hospital (SGH), we have developed a new prevention method: an AI-based system for gauze counting in surgical settings. Using real-time video surveillance and object recognition powered by YOLOv5, a deep learning model was designed to monitor gauzes on two designated trays labelled "In" and "Out". Gauzes are tracked in the "In" tray prior to their use in the patient's body and in the "Out" tray after use, ensuring accurate counting and verifying that no gauze remains inside the patient at the end of the surgery. We trained the model on numerous images from operation theatres and augmented the data to cover a wide range of scenarios. This study also addresses the shortcomings of previous project iterations. Previously, the project employed two models, one for human detection and another for gauze detection, trained on a total of 2,800 images. The current system uses an integrated model capable of identifying both humans and gauzes, trained on 11,000 images.
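The bookkeeping behind the two-tray scheme can be sketched in a few lines. This is an illustrative assumption about how the counts might be reconciled, not the authors' system: the class name `GauzeCounter`, its fields, and the idea that a detector supplies per-frame tray counts are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GauzeCounter:
    """Reconciles gauzes taken from the "In" tray with those on the "Out" tray."""
    initial_in: int                  # gauzes on the "In" tray before surgery
    manual_out_adjustment: int = 0   # staff correction added to the "Out" count

    def outstanding(self, in_tray_now: int, out_tray_now: int) -> int:
        """Gauzes that left the "In" tray but are not yet accounted for."""
        used = self.initial_in - in_tray_now
        returned = out_tray_now + self.manual_out_adjustment
        return used - returned
```

In this sketch `in_tray_now` and `out_tray_now` would come from the object detector's per-frame counts; a result of zero at the end of surgery means every used gauze has been accounted for, and the adjustment field models the manual correction the abstract mentions.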
This has improved accuracy and increased the frame rate from 8 FPS to 15 FPS. Incorporating doctors' feedback, the system now also supports manual count adjustments, enhancing its reliability in actual surgeries.
[157] Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
Sunghyun Park,Jeongho Kim,Hyoungwoo Park,Debasmit Das,Sungrack Yun,Munawar Hayat,Jaegul Choo,Fatih Porikli,Seokeon Choi
Main category: cs.CV
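TL;DR: This paper proposes DiT-BlockSkip, a framework that combines timestep-aware dynamic patch sampling with block skipping based on precomputed residual features. It substantially reduces the memory cost of fine-tuning Diffusion Transformers (DiTs) while preserving personalization quality, moving toward deployment on resource-constrained devices such as smartphones and IoT devices.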
Details
Motivation: Diffusion Transformers (DiTs) have improved text-to-image generation quality, but fine-tuning them is computationally expensive and memory-intensive, making practical deployment on resource-constrained devices difficult.
Method: DiT-BlockSkip: 1) timestep-aware dynamic patch sampling, which adapts the patch size to the diffusion timestep and resizes patches to a lower resolution; 2) a block-skipping mechanism based on precomputed residual features; 3) a transformer block selection strategy based on cross-attention masking to identify the blocks most important to fine-tune.
Result: Substantially reduces training memory while maintaining competitive personalization quality, both qualitatively and quantitatively, moving fine-tuning of large-scale DiTs toward on-device feasibility on smartphones and IoT devices.
Conclusion: DiT-BlockSkip is an efficient, lightweight fine-tuning framework for DiTs that balances performance and resource efficiency and offers a feasible path to practical diffusion models on edge devices.
Abstract: Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computation and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, integrating timestep-aware dynamic patch sampling and block skipping by precomputing residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution. This approach reduces forward and backward memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify vital blocks for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively, while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
[158] PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
Xiaoya Cheng,Long Wang,Yan Liu,Xinyi Liu,Hanlin Tan,Yu Liu,Maojun Zhang,Shen Yan
Main category: cs.CV
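TL;DR: PiLoT is a unified framework that registers a live video stream directly against a geo-referenced 3D map to perform UAV ego-localization and target geo-localization, without GNSS or expensive active sensors.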
Details
Motivation: Conventional approaches rely on GNSS and VIO for ego-localization and on active sensors for target localization; they tend to fail in GNSS-denied environments and bring high hardware cost and system complexity.
Method: The PiLoT framework: 1) a dual-thread engine that decouples map rendering from localization; 2) a large-scale synthetic dataset with geometric annotations, used to train a lightweight network that transfers zero-shot from simulation to real data; 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that improves robustness to motion.
Result: Outperforms state-of-the-art methods on multiple public and newly collected benchmarks, and runs in real time at over 25 FPS on the NVIDIA Jetson Orin platform.
Conclusion: PiLoT achieves accurate, low-latency, robust, and low-cost end-to-end geo-localization for UAVs, supporting the effectiveness and practicality of registering vision against geo-referenced maps.
Abstract: We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps), which enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion. Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running at over 25 FPS on the NVIDIA Jetson Orin platform. Our code and dataset are available at: https://github.com/Choyaa/PiLoT.
[159] MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
Jiaxin Cheng,Yue Wu,Yicong Zhou
Main category: cs.CV
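TL;DR: This paper proposes MEMO, a model that produces accurate and crisp edge maps using only cross-entropy loss. With pre-training on synthetic data, a lightweight fine-tuning module, and a progressive prediction strategy driven by the confidence gradient, it markedly improves edge crispness without post-processing.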
Details
Motivation: Learning-based edge detectors trained with cross-entropy loss tend to predict thick edges, which deviate from the crisp, single-pixel edges annotated by humans.
Method: The Masked Edge Prediction MOdel (MEMO) is pre-trained on a large-scale synthetic edge dataset and then lightly fine-tuned on downstream datasets; training varies the input masking ratio; at inference, a progressive prediction strategy finalizes edge pixels step by step according to the confidence gradient of the predictions.
Result: MEMO outperforms prior methods on crispness-aware evaluations and produces visually appealing, post-processing-free, human-like edge maps.
Conclusion: Carefully designed training and inference strategies alone, without new loss functions or complex architectures, are enough to achieve high-quality, crisp edge detection with standard cross-entropy loss.
Abstract: Learning-based edge detection models trained with cross-entropy loss often suffer from thick edge predictions, which deviate from the crisp, single-pixel annotations typically provided by humans. While previous approaches to achieving crisp edges have focused on designing specialized loss functions or modifying network architectures, we show that a carefully designed training and inference strategy alone is sufficient to achieve human-like edge quality. In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. We first construct a large-scale synthetic edge dataset to pre-train MEMO, enhancing its generalization ability. Subsequent fine-tuning on downstream datasets requires only a lightweight module comprising 1.2% additional parameters. During training, MEMO learns to predict edges under varying ratios of input masking. A key insight guiding our inference is that thick edge predictions typically exhibit a confidence gradient: high in the center and lower toward the boundaries. Leveraging this, we propose a novel progressive prediction strategy that sequentially finalizes edge predictions in order of prediction confidence, resulting in thinner and more precise contours. Our method achieves visually appealing, post-processing-free, human-like edge maps and outperforms prior methods on crispness-aware evaluations.
[160] ME-IQA: Memory-Enhanced Image Quality Assessment via Re-Ranking
Kanglong Fan,Tianhe Wu,Wen Wen,Jianzhao Liu,Le Yang,Yabin Zhang,Yiting Liao,Junlin Li,Li Zhang
Main category: cs.CV
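TL;DR: This paper proposes ME-IQA, a memory-enhanced re-ranking framework that mitigates the discrete collapse of reasoning-based vision-language models in image quality assessment, yielding denser and more distortion-sensitive predictions.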
Details
Motivation: Reasoning-based vision-language models (VLMs) bring textual reasoning to image quality assessment (IQA), but their scalar scores often lack sensitivity and suffer from discrete collapse (scores concentrating on a few values).
Method: ME-IQA: (i) builds a memory bank and retrieves semantically and perceptually aligned neighbors using reasoning summaries; (ii) reframes the VLM as a probabilistic comparator and fuses pairwise preference probabilities with the initial score under Thurstone's Case V model; (iii) adds gated reflection and memory updates to improve later decisions.
Result: Across multiple IQA benchmarks, ME-IQA consistently outperforms strong reasoning-based VLM baselines, conventional non-reasoning IQA methods, and other test-time scaling approaches, mitigating discrete collapse and producing denser, distortion-sensitive predictions.
Conclusion: ME-IQA is a plug-and-play, test-time memory-enhanced re-ranking framework that improves the score resolution and robustness of reasoning-based VLMs on IQA.
Abstract: Reasoning-induced vision-language models (VLMs) advance image quality assessment (IQA) with textual reasoning, yet their scalar scores often lack sensitivity and collapse to a few values, so-called discrete collapse. We introduce ME-IQA, a plug-and-play, test-time memory-enhanced re-ranking framework. It (i) builds a memory bank and retrieves semantically and perceptually aligned neighbors using reasoning summaries, (ii) reframes the VLM as a probabilistic comparator to obtain pairwise preference probabilities and fuses this ordinal evidence with the initial score under Thurstone's Case V model, and (iii) performs gated reflection and consolidates memory to improve future decisions. This yields denser, distortion-sensitive predictions and mitigates discrete collapse. Experiments across multiple IQA benchmarks show consistent gains over strong reasoning-induced VLM baselines, existing non-reasoning IQA methods, and test-time scaling alternatives.
[161] Does Peer Observation Help? Vision-Sharing Collaboration for Vision-Language Navigation
Qunchao Jin,Yiliao Song,Qi Wu
Main category: cs.CV
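TL;DR: This paper proposes Co-VLN, a framework in which multiple navigating agents share structured perceptual memory from locations they have both visited, improving vision-language navigation (VLN) performance at no additional exploration cost.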
Details
Motivation: Existing VLN systems are limited by partial observability, while multiple robots sharing an environment create an opportunity to use peer observations.
Method: The model-agnostic Co-VLN framework: when independently navigating agents identify commonly traversed locations, they exchange structured perceptual memory, expanding each agent's receptive field.
Result: On the R2R benchmark, Co-VLN clearly improves navigation performance for both DUET (learning-based) and MapGPT (zero-shot).
Conclusion: Sharing peer visual information improves VLN performance and lays a foundation for research on collaborative embodied navigation.
Abstract: Vision-Language Navigation (VLN) systems are fundamentally constrained by partial observability, as an agent can only accumulate knowledge from locations it has personally visited. As multiple robots increasingly coexist in shared environments, a natural question arises: can agents navigating the same space benefit from each other's observations? In this work, we introduce Co-VLN, a minimalist, model-agnostic framework for systematically investigating whether and how peer observations from concurrently navigating agents can benefit VLN. When independently navigating agents identify common traversed locations, they exchange structured perceptual memory, effectively expanding each agent's receptive field at no additional exploration cost. We validate our framework on the R2R benchmark under two representative paradigms (the learning-based DUET and the zero-shot MapGPT), and conduct extensive analytical experiments to systematically reveal the underlying dynamics of peer observation sharing in VLN. Results demonstrate that vision-sharing-enabled models yield substantial performance improvements across both paradigms, establishing a strong foundation for future research in collaborative embodied navigation.
[162] Less is More in Semantic Space: Intrinsic Decoupling via Clifford-M for Fundus Image Classification
Yifeng Zheng
Main category: cs.CV
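TL;DR: This paper proposes Clifford-M, a lightweight multi-scale model for retinal image diagnosis. It drops conventional frequency decomposition in favor of a sparse geometric interaction based on Clifford algebra, outperforms larger models with very few parameters (0.85M), and shows cross-dataset robustness.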
Details
Motivation: Multi-scale medical vision models that use explicit frequency decomposition bring limited benefit for multi-label fundus diagnosis while adding substantial compute and parameter overhead; the authors look for a more efficient way to model multi-scale structural features directly.
Method: The Clifford-M backbone replaces feed-forward expansion and frequency-splitting modules with a Clifford-style rolling product, which jointly models alignment and structural variation with linear complexity, in a compact dual-resolution architecture.
Result: Without pre-training, a mean AUC-ROC of 0.8142 and a macro-F1 of 0.5481 on ODIR-5K; without fine-tuning, a macro AUC of 0.7425 +/- 0.0198 and a micro AUC of 0.7610 +/- 0.0344 on RFMiD.
Conclusion: Explicit frequency engineering is not required; efficient and competitive fundus diagnosis is achievable as long as the core feature interaction captures multi-scale structure directly.
Abstract: Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Many multi-scale medical vision models address this challenge through explicit frequency decomposition, but our ablation studies show that such heuristics provide limited benefit in this setting: replacing the proposed simple dual-resolution stem with Octave Convolution increased parameters by 35% and computation 2.23-fold without improving mean accuracy, while a fixed wavelet-based variant performed substantially worse. Motivated by these findings, we propose Clifford-M, a lightweight backbone that replaces both feed-forward expansion and frequency-splitting modules with sparse geometric interaction. The model is built on a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture. Without pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 (optimal threshold) of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. When evaluated on RFMiD without fine-tuning, it attains 0.7425 +/- 0.0198 macro AUC and 0.7610 +/- 0.0344 micro AUC, indicating reasonable robustness to cross-dataset shift.
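One possible reading of a "rolling product that jointly captures alignment and structural variation with linear complexity" is a geometric-product-like interaction between one feature vector and cyclically shifted copies of another. The sketch below is an assumption about the general idea only: the shift set, the function name `rolling_product`, and the split into a dot-like and a wedge-like term are hypothetical and are not taken from the Clifford-M definition.

```python
import numpy as np

def rolling_product(u, v, shifts=(1, 2)):
    """Sparse, shift-based interaction between two feature vectors.

    For each shift s, channel i of one vector is paired with channel i+s
    of the other (cyclically), so the cost is O(len(shifts) * d).
    """
    inner_terms, wedge_terms = [], []
    for s in shifts:
        v_s = np.roll(v, -s)   # v_s[i] == v[(i + s) % d]
        u_s = np.roll(u, -s)
        # Dot-like term: large where the paired channels agree (alignment).
        inner_terms.append(u * v_s)
        # Wedge-like term: antisymmetric in (u, v), so it vanishes when
        # u == v and responds to structural differences between them.
        wedge_terms.append(u * v_s - u_s * v)
    return np.stack(inner_terms), np.stack(wedge_terms)
```

Because only a fixed number of shifts is used, the interaction stays linear in the channel dimension, in contrast to a dense pairwise product that would be quadratic.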
These results suggest that competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering, provided that the core feature interaction is designed to capture multi-scale structure directly.
[163] Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
Enguang Wang,Qiang Wang,Yuanchen Wu,Ke Yan,Xinbin Yuan,Shouhong Ding,Xialei Liu,Ming-Ming Cheng
Main category: cs.CV
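TL;DR: This paper reveals that the visual representations of multimodal large language models (MLLMs) degrade during language-driven training, and proposes Predictive Regularization (PRe) to mitigate the problem and improve cross-modal understanding.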
Details
Motivation: Although MLLMs perform well on vision-language tasks, how their language-generation training objective affects internal visual competence remains unclear.
Method: A diagnostic analysis shows that intermediate-layer visual representations degrade in both global function and local structure, which the authors attribute to a visual sacrifice caused by the single text-generation objective; they then propose Predictive Regularization (PRe), which forces the degraded intermediate visual features to predict the initial visual features, preserving their inherent visual attributes.
Result: Extensive experiments show that mitigating visual representation degradation improves vision-language task performance.
Conclusion: An MLLM needs both strong cross-modal reasoning and core visual competence; maintaining robust internal visual representations is essential for comprehensive multimodal understanding.
Abstract: While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representations in the middle layers of the LLM degrade in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
[164] Lean Learning Beyond Clouds: Efficient Discrepancy-Conditioned Optical-SAR Fusion for Semantic Segmentation
Chenxing Meng,Wuzhou Quan,Yingjie Cai,Liqun Cao,Liyan Zhang,Mingqiang Wei
Main category: cs.CV
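TL;DR: This paper proposes EDC, a framework that combines a tri-stream encoder, a discrepancy-conditioned hybrid fusion mechanism, and a teacher-guided cloud removal branch to balance efficiency and robustness in optical-SAR semantic segmentation, improving accuracy while reducing computational overhead.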
Details
Motivation: Cloud occlusion severely degrades the semantic integrity of optical remote sensing imagery. Existing methods face an efficiency-reliability trade-off in global modeling and cross-modal fusion, caught between noise propagation and computational overhead.
Method: EDC comprises (1) a tri-stream encoder with Carrier Tokens for compact global modeling; (2) a Discrepancy-Conditioned Hybrid Fusion (DCHF) mechanism that selectively suppresses unreliable regions; and (3) an auxiliary cloud removal branch with teacher-guided distillation that strengthens semantic consistency.
Result: On M3M-CR and WHU-OPT-SAR, mIoU improves by 0.56% and 0.88%, respectively, with 46.7% fewer parameters and 1.98× faster inference.
Conclusion: EDC eases the conflict between efficiency and reliability when segmenting cloud-affected remote sensing imagery and offers a practical solution for large-scale, high-resolution applications.
Abstract: Cloud occlusion severely degrades the semantic integrity of optical remote sensing imagery. While incorporating Synthetic Aperture Radar (SAR) provides complementary observations, achieving efficient global modeling and reliable cross-modal fusion under cloud interference remains challenging. Existing methods rely on dense global attention to capture long-range dependencies, yet such aggregation indiscriminately propagates cloud-induced noise. Improving robustness typically entails enlarging model capacity, which further increases computational overhead. Given the large-scale and high-resolution nature of remote sensing applications, such computational demands hinder practical deployment, leading to an efficiency-reliability trade-off. To address this dilemma, we propose EDC, an efficiency-oriented and discrepancy-conditioned optical-SAR semantic segmentation framework. A tri-stream encoder with Carrier Tokens enables compact global context modeling with reduced complexity. To prevent noise contamination, we introduce a Discrepancy-Conditioned Hybrid Fusion (DCHF) mechanism that selectively suppresses unreliable regions during global aggregation. In addition, an auxiliary cloud removal branch with teacher-guided distillation enhances semantic consistency under occlusion. Extensive experiments demonstrate that EDC achieves superior accuracy and efficiency, improving mIoU by 0.56% and 0.88% on M3M-CR and WHU-OPT-SAR, respectively, while reducing the number of parameters by 46.7% and accelerating inference by 1.98×.
Our implementation is available at https://github.com/mengcx0209/EDC.
[165] PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching
Hanqiao Ye,Yuzhou Liu,Yangdong Liu,Shuhan Shen
Main category: cs.CV
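TL;DR: This paper proposes PlanaReLoc, a lightweight 6-DoF camera relocalization method built on planes as the basic unit. It uses planar primitives to establish cross-modal correspondences in structured environments, without realistically textured maps, pose priors, or scene-specific training.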
Details
Motivation: Planar primitives are not only fundamental entities in projective geometry but also region-based representations that encapsulate structural and semantic information, which makes them better suited to establishing robust cross-modal correspondences in structured environments.
Method: PlanaReLoc first uses a deep matcher to associate planar primitives between the query image and the 3D planar map within a learned unified embedding space, then solves and refines the 6-DoF pose under a robust framework.
Result: Experiments across hundreds of scenes from ScanNet and 12Scenes show that the method markedly improves relocalization accuracy and robustness without realistically textured/colored maps, pose priors, or per-scene training.
Conclusion: Planar primitives are better suited than point features to lightweight, accurate 6-DoF camera relocalization in structured environments, and PlanaReLoc offers a new paradigm for structure-aware relocalization.
Abstract: While structure-based relocalizers have long strived for point correspondences when establishing or regressing query-map associations, in this paper, we pioneer the use of planar primitives and 3D planar maps for lightweight 6-DoF camera relocalization in structured environments. Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness. This motivates us to introduce PlanaReLoc, a streamlined plane-centric paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework. Through comprehensive experiments on the ScanNet and 12Scenes datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross-modal structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per-scene training. The code and data are available at https://github.com/3dv-casia/PlanaReLoc.
[166] EruDiff: Refactoring Knowledge in Diffusion Models for Advanced Text-to-Image Synthesis
Xiefan Guo,Xinzhu Ma,Haoxiang Ma,Zihao Zhou,Di Huang
Main category: cs.CV
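TL;DR: This paper proposes EruDiff, which uses Diffusion Knowledge Distribution Matching (DK-DM) and Negative-Only Reinforcement Learning (NO-RL) to improve how text-to-image diffusion models understand and render implicit prompts that require deep world knowledge, yielding more accurate scientific and commonsense image synthesis.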
Details
Motivation: Existing text-to-image diffusion models handle explicit text prompts well, but they tend to produce counterfactual results for implicit prompts that require deep world knowledge (for example natural science or cultural commonsense). The root cause is a dislocation of the model's internal knowledge structures, which leaves the representations of implicit prompts disorganized.
Method: The EruDiff framework has two parts: (1) Diffusion Knowledge Distribution Matching (DK-DM), which aligns the knowledge distribution of implicit prompts with explicit anchors; and (2) Negative-Only Reinforcement Learning (NO-RL), which applies fine-grained correction to the inherent biases in explicit prompt rendering.
Result: On the Science-T2I (scientific knowledge) and WISE (world knowledge) benchmarks, EruDiff significantly improves leading diffusion models such as FLUX and Qwen-Image, supporting the method's effectiveness and generalizability.
Conclusion: Refactoring the internal knowledge structure of diffusion models is a key route to understanding implicit prompts; DK-DM and NO-RL together narrow the representational gap between explicit and implicit prompts and offer a new paradigm for knowledge-aware generative modeling.
Abstract: Text-to-image diffusion models have achieved remarkable fidelity in synthesizing images from explicit text prompts, yet exhibit a critical deficiency in processing implicit prompts that require deep-level world knowledge, ranging from natural sciences to cultural commonsense, resulting in counter-factual synthesis. This paper traces the root of this limitation to a fundamental dislocation of the underlying knowledge structures, manifesting as a chaotic organization of implicit prompts compared to their explicit counterparts. In this paper, we propose EruDiff, which aims to refactor the knowledge within diffusion models. Specifically, we develop the Diffusion Knowledge Distribution Matching (DK-DM) to register the knowledge distribution of intractable implicit prompts with that of well-defined explicit anchors. Furthermore, to rectify the inherent biases in explicit prompt rendering, we employ the Negative-Only Reinforcement Learning (NO-RL) strategy for fine-grained correction. Rigorous empirical evaluations demonstrate that our method significantly enhances the performance of leading diffusion models, including FLUX and Qwen-Image, across both the scientific knowledge benchmark (i.e., Science-T2I) and the world knowledge benchmark (i.e., WISE), underscoring its effectiveness and generalizability. Our code is available at https://github.com/xiefan-guo/erudiff.
[167] MERIT: Multi-domain Efficient RAW Image Translation
Wenjun Huang,Shenghao Fu,Yian Jin,Yang Ni,Ziteng Cui,Hanning Chen,Yirui He,Yezi Liu,Sanggeon Yun,SungHeon Jeong,Ryozo Masukawa,William Youngwoo Chung,Mohsen Imani
Main category: cs.CV
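TL;DR: This paper proposes MERIT, the first unified framework for multi-domain RAW image translation. A single model translates between arbitrary camera domains, aided by a sensor-aware noise modeling loss and a conditional multi-scale large kernel attention module, and its advantage is validated on the newly built MDRAW dataset.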
Details
Motivation: RAW images captured by different camera sensors show substantial domain shifts (differences in spectral response, noise characteristics, and tone behavior), which makes them hard to use directly in downstream vision tasks. Existing methods must train a separate RAW-to-RAW translator for every source-target pair and therefore do not scale.
Method: The unified MERIT framework includes (1) a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of generated images with those of the target domain; (2) a conditional multi-scale large kernel attention module that strengthens context modeling and sensor-aware feature representation; and (3) MDRAW, the first dataset for multi-domain RAW translation, with paired and unpaired RAW images from five cameras.
Result: On MDRAW, MERIT improves image quality by 5.56 dB over prior models and cuts training iterations by 80%.
Conclusion: MERIT delivers efficient, scalable multi-domain RAW image translation and a practical unified solution for cross-device RAW processing.
Abstract: RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream computer vision tasks. Prior methods address this problem by training domain-specific RAW-to-RAW translators for each source-target pair, but such approaches do not scale to real-world scenarios involving multiple types of commercial cameras. In this work, we introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. We further enhance the generator with a conditional multi-scale large kernel attention module for improved context and sensor-aware feature modeling. To facilitate standardized evaluation, we introduce MDRAW, the first dataset tailored for multi-domain RAW image translation, comprising both paired and unpaired RAW captures from five diverse camera sensors across a wide range of scenes. Extensive experiments demonstrate that MERIT outperforms prior models in both quality (5.56 dB improvement) and scalability (80% reduction in training iterations).
[168] Dodgersort: Uncertainty-Aware VLM-Guided Human-in-the-Loop Pairwise Ranking
Yujin Park,Haejun Chung,Ikbeom Jang
Main category: cs.CV
TL;DR: Dodgersort is a novel framework for pairwise comparison labeling that reduces human annotation effort while improving ranking reliability, leveraging CLIP-based pre-ordering, neural ranking, probabilistic ensembles, and uncertainty-aware pair selection.
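To make the quadratic annotation cost and the idea of uncertainty-driven pair selection concrete, here is a minimal sketch. It is not Dodgersort's selection rule: it assumes a plain Bradley-Terry-style win probability over hypothetical latent scores and simply picks the pair whose predicted outcome has the highest binary entropy.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

def num_pairs(n: int) -> int:
    """Exhaustive pairwise labeling needs n*(n-1)/2 comparisons."""
    return n * (n - 1) // 2

def win_prob(score_i: float, score_j: float) -> float:
    """Bradley-Terry-style probability that item i is ranked above item j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def most_uncertain_pair(scores: Sequence[float]) -> Tuple[int, int]:
    """Return the index pair whose comparison outcome is least predictable."""
    return max(
        combinations(range(len(scores)), 2),
        key=lambda ij: binary_entropy(win_prob(scores[ij[0]], scores[ij[1]])),
    )
```

For 100 items, `num_pairs(100)` is 4950, which is why asking a human only about the pairs the current model cannot already predict can save annotation effort.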
Details
Motivation: Pairwise comparison labeling offers higher inter-rater reliability than classification labeling, but exhaustive comparisons are prohibitively expensive (quadratic cost); thus, efficient and reliable active comparison selection is needed.
Method: Dodgersort combines CLIP-based hierarchical pre-ordering, a neural ranking head, probabilistic ensemble models (Elo, BTL, GP), epistemic–aleatoric uncertainty decomposition, and information-theoretic pair selection.
Result: Dodgersort achieves an 11–16% reduction in human comparisons while improving inter-rater reliability across medical imaging, historical dating, and aesthetics tasks; on FG-NET, it extracts 5–20× more ranking information per comparison than baselines.
Conclusion: Neural adaptation and ensemble uncertainty modeling are critical to Dodgersort's gains, enabling Pareto-optimal accuracy–efficiency trade-offs in visual ranking.
Abstract: Pairwise comparison labeling is emerging as it yields higher inter-rater reliability than conventional classification labeling, but exhaustive comparisons require quadratic cost. We propose Dodgersort, which leverages CLIP-based hierarchical pre-ordering, a neural ranking head and probabilistic ensemble (Elo, BTL, GP), epistemic–aleatoric uncertainty decomposition, and information-theoretic pair selection. It reduces human comparisons while improving the reliability of the rankings. In visual ranking tasks in medical imaging, historical dating, and aesthetics, Dodgersort achieves an 11–16% annotation reduction while improving inter-rater reliability. Cross-domain ablations across four datasets show that neural adaptation and ensemble uncertainty are key to this gain. In FG-NET with ground-truth ages, the framework extracts 5–20× more ranking information per comparison than baselines, yielding Pareto-optimal accuracy–efficiency trade-offs.
[169] GOLDMARK: Governed Outcome-Linked Diagnostic Model Assessment Reference Kit
Chad Vanderbilt,Gabriele Campanella,Siddharth Singi,Swaraj Nanda,Jie-Fu Chen,Ali Kamali,Amir Momeni Boroujeni,David Kim,Mohamed Yakoub,Jamal Benhamida,Meera Hameed,Neeraj Kumar,Gregory Goldgof
Main category: cs.CV
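TL;DR: This paper presents GOLDMARK, a standardized computational pathology benchmarking framework built on a TCGA cohort for evaluating the predictive performance of computational biomarkers (CBs). It provides structured intermediate representations, predefined data splits, and cross-site validation results.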
Details
Motivation: Computational pathology currently lacks standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics, which holds back clinical-grade deployment.
Method: The GOLDMARK benchmark integrates a TCGA cohort with OncoKB-level labels and releases tile coordinate maps, PFM feature embeddings, quality-control metadata, patient-level splits, trained models, and evaluation outputs, with reciprocal validation on TCGA and an independent MSKCC cohort.
Result: Across 33 tumor-biomarker tasks, mean AUROC was 0.689 on TCGA and 0.630 on MSKCC. The eight highest-performing tasks (for example LGG IDH1 and COAD MSI/BRAF) reached mean AUROCs of 0.831 and 0.801 and showed stable cross-site performance.
Conclusion: GOLDMARK gives computational pathology a shared experimental foundation that supports reproducible benchmarking and direct comparison of methods, and it advances clinical translation.
Abstract: Computational biomarkers (CBs) are histopathology-derived patterns extracted from hematoxylin-eosin (H&E) whole-slide images (WSIs) using artificial intelligence (AI) to predict therapeutic response or prognosis. Recently, slide-level multiple-instance learning (MIL) with pathology foundation models (PFMs) has become the standard baseline for CB development. While these methods have improved predictive performance, computational pathology lacks standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics required for clinical-grade deployment. We introduce GOLDMARK (https://artificialintelligencepathology.org), a standardized benchmarking framework built on a curated TCGA cohort with clinically actionable OncoKB level 1-3 biomarker labels. GOLDMARK releases structured intermediate representations, including tile coordinate maps, per-slide feature embeddings from canonical PFMs, quality-control metadata, predefined patient-level splits, trained slide-level models, and evaluation outputs. Models are trained on TCGA and evaluated on an independent MSKCC cohort with reciprocal testing. Across 33 tumor-biomarker tasks, mean AUROC was 0.689 (TCGA) and 0.630 (MSKCC). Restricting to the eight highest-performing tasks yielded mean AUROCs of 0.831 and 0.801, respectively. These tasks correspond to established morphologic-genomic associations (e.g., LGG IDH1, COAD MSI/BRAF, THCA BRAF/NRAS, BLCA FGFR3, UCEC PTEN) and showed the most stable cross-site performance. Differences between canonical encoders were modest relative to task-specific variability.
GOLDMARK establishes a shared experimental substrate for computational pathology, enabling reproducible benchmarking and direct comparison of methods across datasets and models.
[170] Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
Xinyu Zhang,Ziyi Kou,Chuan Qin,Mia Huang,Ergys Ristani,Ankit Kumar,Lele Chen,Kun He,Abdeslam Boularias,Li Guan
Main category: cs.CV
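TL;DR: This paper proposes Glove2Hand, a framework that converts multi-modal sensing glove videos into photorealistic bare-hand videos while preserving the physical interaction dynamics. It also builds HandSense, the first HOI dataset with tactile and IMU signals, and significantly improves contact estimation and hand tracking under occlusion.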
Details
Motivation: Conventional hand videos lack key physical information such as contact forces and motion signals and are prone to frequent occlusion, which makes it hard to model hand-object interaction (HOI) accurately.
Method: The Glove2Hand framework (1) introduces a novel 3D Gaussian hand model that keeps rendering temporally consistent, and (2) uses a diffusion-based hand restorer to blend the bare hand seamlessly into the scene, handling complex hand-object interactions and non-rigid deformations.
Result: The authors build HandSense, the first multi-modal HOI dataset with synchronized tactile and IMU signals, and report significant gains on downstream tasks including video-based contact estimation and hand tracking under severe occlusion.
Conclusion: By generating high-quality bare-hand video from sensing-glove video, Glove2Hand bridges the gap between physical interaction sensing and visual representation, giving HOI research a new paradigm and a high-quality data foundation.
Abstract: Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.
[171] Ensemble of Small Classifiers For Imbalanced White Blood Cell Classification
Siddharth Srivastava,Adam Smith,Scott Brooks,Jack Bacon,Till Bretschneider
Main category: cs.CV
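TL;DR: This paper proposes a lightweight ensemble of three pretrained models (SwinV2-Tiny, DinoBloom-Small, and ConvNeXT-V2-Tiny) for white blood cell classification in leukaemia diagnosis, with a focus on granulopoiesis, monocytopoiesis, and lymphopoiesis. Dataset expansion eases class imbalance, the ensemble performs strongly under stratified 3-fold cross-validation, and the authors analyze its difficulty in separating morphologically similar myelocytes and lymphocytes.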
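This entry aggregates its nine models by logit averaging. A minimal sketch of that aggregation step follows; the function names and the list-of-lists input layout are illustrative assumptions, and the model forward passes themselves are out of scope here.

```python
import math
from typing import List, Tuple

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits: List[List[float]]) -> Tuple[int, List[float]]:
    """Average logits across models, then take softmax and argmax.

    per_model_logits[m][c] is model m's logit for class c on one image.
    Returns the predicted class index and the ensemble probabilities.
    """
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    mean_logits = [
        sum(model[c] for model in per_model_logits) / n_models
        for c in range(n_classes)
    ]
    probs = softmax(mean_logits)
    pred = max(range(n_classes), key=probs.__getitem__)
    return pred, probs
```

Averaging logits before the softmax differs from averaging probabilities: a model that is very confident contributes proportionally more to the final decision.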
Details
Motivation: Automated white blood cell classification could replace time-consuming, labor-intensive manual review by expert pathologists, but classifying rare cell types remains hard because of staining and scanning variation and inter-patient heterogeneity.
Method: An ensemble of lightweight pretrained models (SwinV2-Tiny, DinoBloom-Small, ConvNeXT-V2-Tiny) is used, with 3 instances trained per architecture for 9 models in total. Each input image is passed through all models and the predictions are aggregated by logit averaging. Dataset expansion is used to ease class imbalance.
Result: The ensemble performs strongly on a challenging white blood cell dataset, and the analysis shows that the model confuses myelocytes in the granulocytic lineage with lymphocytes in the lymphoid lineage.
Conclusion: A lightweight multi-model ensemble is both efficient and robust for white blood cell classification and offers a feasible, interpretable route to AI-assisted diagnosis of blood disorders.
Abstract: Automating white blood cell classification for diagnosis of leukaemia is a promising alternative to time-consuming and resource-intensive examination of cells by expert pathologists. However, designing robust algorithms for classification of rare cell types remains challenging due to variations in staining, scanning and inter-patient heterogeneity. We propose a lightweight ensemble approach for classification of cells during Haematopoiesis, with a focus on the biology of Granulopoiesis, Monocytopoiesis and Lymphopoiesis. Through dataset expansion to alleviate some class imbalance, we demonstrate that a simple ensemble of lightweight pretrained SwinV2-Tiny, DinoBloom-Small and ConvNeXT-V2-Tiny models achieves excellent performance on this challenging dataset. We train 3 instantiations of each architecture in a stratified 3-fold cross-validation framework; for an input image, we forward-pass through all 9 models and aggregate through logit averaging. We further reason on the weaknesses of our model in confusing similar-looking myelocytes in granulopoiesis and lymphocytes in lymphopoiesis. Code: https://gitlab.com/siddharthsrivastava/wbc-bench-2026.
[172] Fast and Robust Deformable 3D Gaussian Splatting
Han Jiao,Jiakai Sun,Lei Zhao,Zhanjie Zhang,Wei Xing,Huaizhong Lin
Main category: cs.CV
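TL;DR: This paper proposes FRoG, a framework that improves the efficiency and robustness of dynamic scene reconstruction through per-Gaussian embedding and a coarse-to-fine temporal embedding strategy. It adds depth- and error-guided sampling and opacity modulation to address dependence on the initial point cloud, local optima, and slow rendering.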
Details
Motivation: Existing deformation-field-based dynamic 3D Gaussian methods render slowly, depend heavily on the initial point cloud, and tend to fall into local optima in dim scenes.
Method: FRoG (1) combines per-Gaussian embedding with a coarse-to-fine temporal embedding strategy, fusing temporal embeddings early to speed up rendering; (2) uses a depth- and error-guided sampling strategy that initializes new Gaussians at low-deviation positions, easing the optimization burden on the deformation field; and (3) modulates opacity variations to improve color fidelity and convergence stability in dim scenes.
Result: Experiments show that FRoG renders significantly faster while maintaining state-of-the-art visual quality, and is more robust to sparse initialization and dim scenes.
Conclusion: FRoG is an efficient and robust method for dynamic scene reconstruction that overcomes key weaknesses of existing deformation-field-based 3D Gaussian methods.
Abstract: 3D Gaussian Splatting has demonstrated remarkable real-time rendering capabilities and superior visual quality in novel view synthesis for static scenes. Building upon these advantages, researchers have progressively extended 3D Gaussians to dynamic scene reconstruction. Deformation field-based methods have emerged as a promising approach among various techniques. These methods maintain 3D Gaussian attributes in a canonical field and employ the deformation field to transform this field across temporal sequences. Nevertheless, these approaches frequently encounter challenges such as suboptimal rendering speeds, significant dependence on initial point clouds, and vulnerability to local optima in dim scenes. To overcome these limitations, we present FRoG, an efficient and robust framework for high-quality dynamic scene reconstruction. FRoG integrates per-Gaussian embedding with a coarse-to-fine temporal embedding strategy, accelerating rendering through the early fusion of temporal embeddings. Moreover, to enhance robustness against sparse initializations, we introduce a novel depth- and error-guided sampling strategy. This strategy populates the canonical field with new 3D Gaussians at low-deviation initial positions, significantly reducing the optimization burden on the deformation field and improving detail reconstruction in both static and dynamic regions. Furthermore, by modulating opacity variations, we mitigate the local optima problem in dim scenes, improving color fidelity.
Comprehensive experimental results validate that our method achieves accelerated rendering speeds while maintaining state-of-the-art visual quality.
[173] Restoring Neural Network Plasticity for Faster Transfer Learning
Xander Coetzer,Arné Schreuder,Anna Sergeevna Bosman
Main category: cs.CV
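TL;DR: This paper proposes a targeted weight re-initialization strategy that restores neural plasticity before fine-tuning, improving how well models adapt to and perform on downstream tasks in transfer learning.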
Details
Motivation: Pretrained models can lose neural plasticity during transfer learning (saturated weights and vanishing gradients), which makes it hard to adapt to downstream tasks, especially when the data distribution is atypical. The problem has been widely studied in continual learning but has received little attention in transfer learning.
Method: A targeted weight re-initialization strategy selectively resets part of the network's weights before standard fine-tuning, restoring gradient flow and the capacity for parameter updates.
Result: On several image classification benchmarks, both CNNs and ViTs reach higher test accuracy and converge faster, with negligible computational overhead and compatibility with existing transfer learning pipelines.
Conclusion: Targeted weight re-initialization effectively mitigates loss of neural plasticity in transfer learning and is a simple, general, and efficient improvement.
Abstract: Transfer learning with models pretrained on ImageNet has become a standard practice in computer vision. Transfer learning refers to fine-tuning pretrained weights of a neural network on a downstream task, typically unrelated to ImageNet. However, pretrained weights can become saturated and may yield insignificant gradients, failing to adapt to the downstream task. This hinders the ability of the model to train effectively, and is commonly referred to as loss of neural plasticity. Loss of plasticity may prevent the model from fully adapting to the target domain, especially when the downstream dataset is atypical in nature. While this issue has been widely explored in continual learning, it remains relatively understudied in the context of transfer learning. In this work, we propose the use of a targeted weight re-initialization strategy to restore neural plasticity prior to fine-tuning. Our experiments show that both convolutional neural networks (CNNs) and vision transformers (ViTs) benefit from this approach, yielding higher test accuracy with faster convergence on several image classification benchmarks. Our method introduces negligible computational overhead and is compatible with common transfer learning pipelines.
[174] TAFG-MAN: Timestep-Adaptive Frequency-Gated Latent Diffusion for Efficient and High-Quality Low-Dose CT Image Denoising
Tangtangfang Fang,Yang Jiao,Xiangjian He,Jingxi Hu,Jiaqi Yang
Main category: cs.CV
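TL;DR: This paper proposes TAFG-MAN, a latent-diffusion framework for low-dose CT image denoising. Its Timestep-Adaptive Frequency-Gated (TAFG) mechanism improves detail preservation and perceptual quality while keeping inference efficient.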
Details
Motivation: Low-dose CT images are noisy and structurally degraded, and conventional denoising methods struggle to suppress noise while preserving subtle anatomical structures.
Method: The TAFG-MAN framework comprises a perceptually optimized autoencoder, a conditional diffusion restoration module operating in a compact latent space, and a lightweight Timestep-Adaptive Frequency-Gated (TAFG) conditioning mechanism. TAFG decomposes the condition features into low- and high-frequency components, generates gates dynamically from the current denoising feature and the timestep embedding, and releases high-frequency guidance progressively.
Result: TAFG-MAN achieves a better quality-efficiency trade-off than representative baselines. Compared with its base variant without TAFG, it markedly improves detail preservation and perceptual quality at almost no extra inference cost, and ablations confirm the effectiveness of the TAFG mechanism.
Conclusion: By modeling the condition jointly over frequency and timestep, TAFG-MAN improves the balance between noise suppression and structural fidelity in low-dose CT denoising and suggests a new direction for generative modeling in medical imaging.
Abstract: Low-dose computed tomography (LDCT) reduces radiation exposure but also introduces substantial noise and structural degradation, making it difficult to suppress noise without erasing subtle anatomical details. In this paper, we present TAFG-MAN, a latent diffusion framework for efficient and high-quality LDCT image denoising. The framework combines a perceptually optimized autoencoder, conditional latent diffusion restoration in a compact latent space, and a lightweight Timestep-Adaptive Frequency-Gated (TAFG) conditioning design. TAFG decomposes condition features into low- and high-frequency components, predicts timestep-adaptive gates from the current denoising feature and timestep embedding, and progressively releases high-frequency guidance in later denoising stages before cross-attention. In this way, the model relies more on stable structural guidance at early reverse steps and introduces fine details more cautiously as denoising proceeds, improving the balance between noise suppression and detail preservation. Experiments show that TAFG-MAN achieves a favorable quality-efficiency trade-off against representative baselines. Compared with its base variant without TAFG, it further improves detail preservation and perceptual quality while maintaining essentially the same inference cost, and ablation results confirm the effectiveness of the proposed conditioning mechanism.
[175] Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
Xu Zhang,Jin Yuan,BinHong Yang,Xuan Liu,Qianjun Zhang,Yuyi Wang,Zhiyong Li,Hanwang Zhang
Main category: cs.CV
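TL;DR: This paper proposes a new task, controllable video segmentation and captioning (SegCaptioning), in which a user prompt such as a bounding box around a target object yields matching masks and captions at the same time. It also designs a scene-graph-guided fine-grained SegCaptioning Transformer (SG-FSCFormer) that improves video multimodal understanding and interaction.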
Details
Motivation: Existing video multimodal interpretation methods focus mostly on global understanding and offer little support for user interaction, so they cannot give users fine-grained control over specific objects.
Method: The paper defines the Controllable Video Segmentation and Captioning (SegCaptioning) task and builds the Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer). It contains a Prompt-guided Temporal Graph Former, which models user intent, and a Fine-grained Mask-linguistic Decoder, which jointly predicts high-quality mask-caption pairs and uses a Multi-entity Contrastive loss for fine-grained alignment.
Result: Experiments on two benchmark datasets show that SG-FSCFormer captures user intent and produces accurate multimodal outputs (masks plus captions) that match user specifications.
Conclusion: SG-FSCFormer offers an interactive, controllable paradigm for video multimodal understanding, moving from passive understanding toward actively guided understanding.
Abstract: Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users' understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods primarily concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which empowers users to provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. An innovative framework, Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer), is designed that integrates a Prompt-guided Temporal Graph Former to effectively capture and represent user intent through an adaptive prompt adaptor, ensuring that the generated content aligns well with the user's requirements. Furthermore, our model introduces a Fine-grained Mask-linguistic Decoder to collaboratively predict high-quality caption-mask pairs using a Multi-entity Contrastive loss, as well as provide fine-grained alignment between each mask and its corresponding caption tokens, thereby enhancing users' comprehension of videos.
Comprehensive experiments conducted on two benchmark datasets demonstrate that SG-FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications. Our code is available at https://github.com/XuZhang1211/SG-FSCFormer.
[176] GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
Uzair Shah,Marco Agus,Mahmoud Gamal,Mahmood Alzubaidi,Corrado Cali,Pierre J. Magistretti,Abdesselam Bouzerdoum,Mowafa Househ
Main category: cs.CV
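TL;DR: This paper proposes GraPHFormer, a multimodal architecture for neuronal morphology analysis based on CLIP-style contrastive learning. It fuses topological information (persistence images) with graph structure (skeleton graphs), reaches state-of-the-art performance on multiple benchmarks, and is applied to glial cell classification and to detecting developmental and degenerative changes.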
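The two branches in this entry are trained into a shared embedding space with a symmetric InfoNCE loss. The sketch below shows the generic CLIP-style form of that loss in plain Python; the temperature value and the toy embeddings are assumptions, and nothing here reproduces the paper's encoders or augmentations.

```python
import math
from typing import List

Vector = List[float]

def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row: Vector) -> Vector:
    """Numerically stable log-softmax of one row of similarities."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(a: List[Vector], b: List[Vector],
                       temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired embeddings.

    a[i] and b[i] are the two views (e.g. image branch and graph branch)
    of sample i; all other pairings in the batch act as negatives.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    sim = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    loss_a_to_b = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    loss_b_to_a = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_a_to_b + loss_b_to_a)
```

The loss is small when each sample's two views are closer to each other than to any other sample in the batch, and it grows when pairings are mismatched.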
Details
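Motivation: Existing methods analyze either the topology or the graph structure of neuronal morphology in isolation and fail to exploit their complementary information; a unified model is needed to better reveal links to circuit function, development, and disease.
Method: A two-branch architecture is built. The vision branch processes a three-channel persistence image (unweighted, persistence-weighted, and radius-weighted densities) with DINOv2-ViT-S, and a TreeLSTM branch encodes geometric and radial attributes of the skeleton graph. The two branches are mapped to a shared embedding space by contrastive learning with a symmetric InfoNCE loss, and persistence-space transformations that preserve topological semantics are used for data augmentation.
Result: Across six benchmarks including BIL-6 and ACT-4, GraPHFormer reaches state-of-the-art results on five, significantly outperforming topology-only, graph-only, and traditional morphometrics methods. It also discriminates glial morphologies across cortical regions and species and detects morphological signatures of developmental and degenerative processes.
Conclusion: GraPHFormer effectively fuses the topological and graph-structural modalities and provides a more robust, interpretable, and biologically meaningful paradigm for neuronal morphology analysis.
Abstract: Neuronal morphology encodes critical information about circuit function, development, and disease, yet current methods analyze topology or graph structure in isolation. We introduce GraPHFormer, a multimodal architecture that unifies these complementary views through CLIP-style contrastive learning. Our vision branch processes a novel three-channel persistence image encoding unweighted, persistence-weighted, and radius-weighted topological densities via DINOv2-ViT-S. In parallel, a TreeLSTM encoder captures geometric and radial attributes from skeleton graphs. Both project to a shared embedding space trained with symmetric InfoNCE loss, augmented by persistence-space transformations that preserve topological semantics. Evaluated on six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG) spanning self-supervised and supervised settings, GraPHFormer achieves state-of-the-art performance on five benchmarks, significantly outperforming topology-only, graph-only, and morphometrics baselines. We demonstrate practical utility by discriminating glial morphologies across cortical regions and species, and detecting signatures of developmental and degenerative processes. Code: https://github.com/Uzshah/GraPHFormer
[177] Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models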
Binesh Sadanandan,Vahid Behzadan
Main category: cs.CV
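TL;DR: This paper argues that consistency under paraphrase, commonly used as a reliability proxy for medical vision-language models (VLMs), is fundamentally flawed. The authors propose a four-quadrant safety taxonomy based on consistency and image reliance and find that LoRA fine-tuning lowers flip rates but sharply raises the share of Dangerous samples (consistent but not image-reliant). These samples are highly accurate with low entropy, so confidence-based screening does not catch them. The authors recommend that deployment evaluations always pair consistency checks with a text-only baseline.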
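The four-quadrant taxonomy in this entry can be written as a small decision rule. The sketch below is one possible reading, with assumptions stated in the comments: predictions are plain labels, and a sample counts as image-reliant when the text-only prediction differs from the prediction for the first prompt with the image present.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

QUADRANTS = ("Ideal", "Fragile", "Dangerous", "Worst")

def classify_sample(paraphrase_preds: Sequence[str], text_only_pred: str) -> str:
    """Assign one sample to a quadrant.

    paraphrase_preds: predictions with the image present, one per paraphrase.
    text_only_pred:   prediction for the same question with the image removed.
    """
    # Consistent: every paraphrased prompt yields the same prediction.
    consistent = len(set(paraphrase_preds)) == 1
    # Assumption: image-reliant means removing the image changes the
    # prediction obtained for the first prompt.
    image_reliant = text_only_pred != paraphrase_preds[0]
    if consistent and image_reliant:
        return "Ideal"
    if image_reliant:
        return "Fragile"
    if consistent:
        return "Dangerous"
    return "Worst"

def quadrant_fractions(
    samples: Iterable[Tuple[Sequence[str], str]]
) -> Dict[str, float]:
    """Fraction of samples in each quadrant for a non-empty evaluation set."""
    counts = Counter(classify_sample(preds, text_only) for preds, text_only in samples)
    total = sum(counts.values())
    return {q: counts.get(q, 0) / total for q in QUADRANTS}
```

A low flip rate only says that many samples are consistent; the text-only pass is what separates the Ideal ones from the Dangerous ones.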
Details
Motivation: Current medical VLM deployments use consistency under paraphrase as a proxy for reliability, but that metric cannot tell whether a model truly grounds its answer in the image or relies only on surface text patterns while ignoring the image, which is a serious safety risk.
Method: A four-quadrant safety taxonomy (Ideal/Fragile/Dangerous/Worst) jointly evaluates per-sample consistency (stable predictions across paraphrases) and image reliance (whether predictions change when the image is removed). Five medical VLM configurations are evaluated on two chest X-ray datasets, MIMIC-CXR and PadChest, and compared against a text-only baseline.
Result: LoRA fine-tuning sharply lowers flip rates (for example, LLaVA-Rad Base has a flip rate of only 1.5% on PadChest), yet 98.5% of its samples fall in the Dangerous quadrant. Dangerous samples reach up to 99.6% accuracy with low entropy, so they are easily mistaken for reliable ones. Flip rate and Dangerous fraction are strongly negatively correlated (r = -0.89).
Conclusion: Consistency cannot replace verification of image reliance. Before deployment, consistency checks must be combined with a text-only baseline to avoid the false reliability trap, and the reliability evaluation paradigm for medical VLMs should be redesigned.
Abstract: Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening.
We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
[178] SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis
Zhixiang Lu,Shijie Xu,Kaicheng Yan,Xuyue Cai,Chong Zhang,Yulong Li,Angelos Stefanidis,Anh Nguyen,Jionglong Su
Main category: cs.CV
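TL;DR: SkinCLIP-VL is a resource-efficient, trustworthy vision-language model for skin cancer diagnosis. It couples a frozen CLIP encoder with a lightweight quantized Qwen2.5-VL fine-tuned through LoRA, and proposes a Consistency-aware Focal Alignment (CFA) loss that improves accuracy and clinical interpretability under data scarcity and long-tailed distributions.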
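The CFA loss in this entry is described as combining focal re-weighting with alignment and calibration terms. Only the focal re-weighting ingredient is sketched here, using the standard focal loss form; how the paper combines it with the other terms is not specified in this entry, so this is not the CFA loss itself.

```python
import math
from typing import Sequence

def focal_loss(probs: Sequence[float], target: int,
               gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Standard focal loss for one sample.

    probs:  predicted class probabilities (summing to 1).
    target: index of the true class.
    gamma:  focusing parameter; gamma = 0 recovers cross-entropy.
    alpha:  optional class weight (set to 1.0 here for simplicity).
    """
    p_t = probs[target]
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples,
    # so rare or hard classes contribute relatively more to training.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=2.0`, a sample predicted at 0.9 for its true class is down-weighted by a factor of 0.01, while one predicted at 0.1 keeps 81% of its cross-entropy loss, which is the behavior that helps under long-tailed class distributions.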
Details
Motivation: Deploying vision-language models in dermatology faces a trilemma of high computational cost, extreme data scarcity, and the black-box nature of deep learning.
Method: The SkinCLIP-VL framework adopts a frozen perception, adaptive reasoning paradigm, integrating a frozen CLIP encoder with a lightweight quantized Qwen2.5-VL through low-rank adaptation (LoRA). It introduces a Consistency-aware Focal Alignment (CFA) loss that combines focal re-weighting, semantic alignment, and calibration to strictly align visual regions with clinical semantics under long-tailed distributions.
Result: On the ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Blinded expert evaluation and out-of-distribution testing confirm that its visually grounded rationales raise clinical trust more than traditional saliency maps.
Conclusion: SkinCLIP-VL keeps performance high while sharply reducing resource use, and its interpretable, visually grounded reasoning strengthens clinical trust, offering a feasible route to VLM deployment in resource-constrained medical settings.
Abstract: The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
[179] LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction
Shuwei Huang,Shizhuo Liu,Zijun Wei
Main category: cs.CV
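TL;DR: This paper proposes LPNSR, a prior-enhanced efficient diffusion framework that optimizes the intermediate noise and improves the initial upsampling to resolve the trade-off between inference efficiency and reconstruction quality in image super-resolution, reaching state-of-the-art perceptual performance in 4 steps.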
Details
Motivation: Existing diffusion-based image super-resolution methods degrade severely under compact sampling trajectories. The main causes are unconstrained Gaussian noise in intermediate steps, which leads to error accumulation and insufficient guidance from the LR prior, and the initialization bias introduced by bicubic upsampling.
Method: (1) Derive a closed-form analytical solution for the optimal intermediate noise under the residual-shifting diffusion paradigm; (2) design an LR-guided multi-input-aware noise predictor to replace random Gaussian noise; (3) add a high-quality pre-upsampling network to reduce initialization bias; (4) support end-to-end optimization over a compact 4-step trajectory.
Result: The method reaches state-of-the-art perceptual performance on both synthetic and real-world datasets without relying on large-scale text-to-image priors.
Conclusion: LPNSR improves the balance between efficiency and quality for diffusion models in image super-resolution and offers a new paradigm for lightweight, efficient diffusion SR.
Abstract: Diffusion-based image super-resolution (SR), which aims to reconstruct high-resolution (HR) images from corresponding low-resolution (LR) observations, faces a fundamental trade-off between inference efficiency and reconstruction quality. The state-of-the-art residual-shifting diffusion framework achieves efficient 4-step inference, yet suffers from severe performance degradation in compact sampling trajectories. This is mainly attributed to two core limitations: the inherent suboptimality of unconstrained random Gaussian noise in intermediate steps, which leads to error accumulation and insufficient LR prior guidance, and the initialization bias caused by naive bicubic upsampling. In this paper, we propose LPNSR, a prior-enhanced efficient diffusion framework to address these issues. We first mathematically derive the closed-form analytical solution of the optimal intermediate noise for the residual-shifting diffusion paradigm, and accordingly design an LR-guided multi-input-aware noise predictor to replace random Gaussian noise, embedding LR structural priors into the reverse process while fully preserving the framework's core efficient residual-shifting mechanism. We further mitigate initial bias with a high-quality pre-upsampling network to optimize the diffusion starting point. With a compact 4-step trajectory, LPNSR can be optimized in an end-to-end manner. Extensive experiments demonstrate that LPNSR achieves state-of-the-art perceptual performance on both synthetic and real-world datasets, without relying on any large-scale text-to-image priors.
The source code of our method can be found at https://github.com/Faze-Hsw/LPNSR.
[180] SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments
Wen Jiang,Kangyao Huang,Li Wang,Wang Xu,Wei Fan,Jinyuan Liu,Shaoyu Liu,Hanfang Liang,Hongwei Duan,Bin Xu,Xiangyang Ji
Main category: cs.CV
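TL;DR: This paper proposes SpatialFly, a framework that uses a geometry-guided 2D representation alignment mechanism to improve spatial reasoning for UAV vision-and-language navigation (VLN) in complex 3D environments without explicit 3D reconstruction, markedly improving path accuracy and motion stability.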
Details
Motivation: UAV vision-and-language navigation (UAV VLN) in complex 3D environments suffers from a structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. Method: SpatialFly comprises a geometric prior injection module (which injects global structural cues into 2D semantic tokens) and a geometry-aware reparameterization module (which aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, then applies gated residual fusion to preserve semantic discrimination); it operates on RGB observations only, without explicit 3D reconstruction. Result: Outperforms existing SOTA UAV VLN baselines in both seen and unseen environments; on the unseen Full split, navigation error (NE) drops by 4.03 m and success rate (SR) rises by 1.27% over the strongest baseline; trajectory-level analysis shows better path alignment and smoother, more stable motion. Conclusion: SpatialFly narrows the representation gap between 2D vision and the 3D decision space, confirming that geometry-guided spatial representations are key to the robustness and generalization of UAV VLN. Abstract: UAVs play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV VLN in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing NE by 4.03m and improving SR by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.
[181] When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound
Yasamin Medghalchi,Milad Yazdani,Amirhossein Dabiriaghdam,Moein Heidari,Mojan Izadkhah,Zahra Kavian,Giuseppe Carenini,Lele Wang,Dena Shahriari,Ilker Hacihaliloglu
Main category: cs.CV
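TL;DR: This paper proposes a scalable adversarial evaluation framework that uses a large language model to generate clinically plausible adversarial prompt variants for assessing the robustness of medical vision-language models (Med-VLMs) in ultrasound image analysis, exposing a key vulnerability: small prompt changes in realistic clinical settings can destabilize model outputs.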
Details
Motivation: Medical vision-language models (Med-VLMs) perform well in ultrasound analysis, but they rely on natural-language instructions, so small prompt variations (typos, shorthand, or ambiguous wording) can shift outputs substantially and threaten clinical trustworthiness and safety. Method: Build an LLM-based adversarial evaluation framework that generates clinically plausible, realistic adversarial prompts through "humanized" rewrites and minimal edits; systematically evaluate the vulnerability of SOTA Med-VLMs on ultrasound multiple-choice question answering benchmarks, and analyze the relationship between attack success and model confidence as well as failure patterns across models. Result: Current SOTA Med-VLMs are highly sensitive to clinically common prompt variations; attack success rises with the capacity of the attacker LLM; high-confidence predictions are often wrong; and consistent failure patterns appear in semantic understanding and anatomical reasoning. Conclusion: Current Med-VLMs have realistic, exploitable robustness gaps in ultrasound analysis that must be addressed before safe and reliable clinical deployment. Abstract: Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via "humanized" rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation.
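Illustrative sketch (not the authors' code): the abstract describes minimal prompt edits and an attack success measure but gives no definitions. The snippet below assumes one simple edit type (swapping two adjacent characters, i.e. a single typo) and one common definition of attack success rate (the share of questions answered correctly on the clean prompt that become wrong after the edit). Both function names and the toy answers are hypothetical; the paper's LLM-driven rewrites are considerably richer than this.

```python
def swap_adjacent(prompt, index):
    """Return the prompt with the characters at index and index + 1 swapped."""
    if not 0 <= index < len(prompt) - 1:
        raise ValueError("index must point at a character that has a right neighbor")
    chars = list(prompt)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def attack_success_rate(gold, clean_preds, attacked_preds):
    """Fraction of initially correct answers that turn wrong under the edited prompt."""
    correct = [i for i, (g, p) in enumerate(zip(gold, clean_preds)) if g == p]
    if not correct:
        return 0.0  # nothing to flip when the clean run has no correct answers
    flipped = sum(1 for i in correct if attacked_preds[i] != gold[i])
    return flipped / len(correct)


typo_prompt = swap_adjacent("Is the liver lesion benign?", 8)
rate = attack_success_rate(
    gold=["A", "B", "C", "D"],
    clean_preds=["A", "B", "C", "A"],     # three of four correct before the attack
    attacked_preds=["A", "C", "D", "D"],  # two of those three flip to wrong answers
)
```

Conditioning on initially correct answers keeps the rate from being inflated by questions the model already failed without any perturbation.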
Code will be released publicly following the review process.
[182] A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors
Gia-Bao Doan,Nam-Khoa Huynh,Minh-Nhat-Huy Ho,Khanh-Thanh-Khoa Nguyen,Thanh-Hai Le
Main category: cs.CV
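TL;DR: This paper proposes a two-stage temporal action localization framework for driver monitoring that combines VideoMAE feature extraction with an Augmented Self-Mask Attention (AMA) detector and adds an SPPF module to strengthen multi-scale temporal modeling; it strikes a good balance between accuracy and efficiency, with ViT-Giant + SPPF reaching 92.67% mAP while the lightweight ViT variant remains robust.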
Details
Motivation: Existing temporal action localization methods struggle to balance accuracy and computational cost, yet recognizing hazardous driving behaviors in in-cabin video is essential for road safety and traffic-violation detection, which calls for an efficient, reliable solution suited to periodic inspection settings (such as safety checkpoints and fleet management). Method: A two-stage pipeline: the first stage extracts video features with a pre-trained VideoMAE model, using either ViT-Giant or a lightweight ViT variant; the second stage uses an Augmented Self-Mask Attention (AMA) detector with an embedded Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Result: The ViT-Giant backbone reaches 88.09% Top-1 accuracy at the feature extraction stage; with SPPF added, ViT-Giant + SPPF achieves a peak of 92.67% mAP on the localization task; the lightweight ViT variant is slightly less accurate (82.55% Top-1) but far cheaper to fine-tune (101.85 vs 1584.06 GFLOPs/segment) and still localizes robustly. Conclusion: The framework balances model performance and deployment efficiency for driver monitoring, the SPPF module yields gains across all configurations, and the approach offers a scalable, high-accuracy temporal localization solution for periodic safety inspection. Abstract: The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers higher representations with 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy with significantly lower computational fine-tuning costs (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations.
Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
[183] SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
Pengchong Hu,Zhizhong Han
Main category: cs.CV
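TL;DR: This paper proposes a 3D Gaussian Splatting (3DGS) SLAM method based on pixel-aligned Gaussians, which adjusts each Gaussian's position along its viewing ray to improve rendering quality and models per-pixel depth distributions to speed up tracking.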
Details
Motivation: The 3D Gaussians or view-tied Gaussians used in existing 3DGS SLAM methods struggle to balance freedom of movement against representational capacity, which leads to slow convergence or limited rendering quality. Method: Use a pixel-aligned Gaussian representation whose Gaussians can adaptively adjust their position along the viewing ray; in addition, model the depth distribution around each pixel as a Gaussian distribution for fast frame-to-scene alignment. Result: Validated on widely used benchmarks, outperforming the latest methods in view rendering, camera tracking accuracy, runtime, and storage complexity. Conclusion: The method improves rendering quality and tracking speed while keeping the system scalable, offering a better 3D Gaussian representation for RGBD SLAM. Abstract: 3D Gaussian Splatting (3DGS) has made remarkable progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified to improve system scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian distribution, and then use these distributions to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. Please see our project page for code and videos at https://machineperceptionlab.github.io/SGAD-SLAM-Project .
[184] Single-Eye View: Monocular Real-time Perception Package for Autonomous Driving
Haixi Zhang,Aiyinsi Zuo,Zirui Li,Chunshu Wu,Tong Geng,Zhiyao Duan
Main category: cs.CV
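TL;DR: This paper proposes LRHPerception, a real-time monocular perception system for autonomous driving that substantially improves computational efficiency while maintaining performance, reaching 29 FPS (a 555% speedup over the fastest mapping-based approach).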
Details
Motivation: Current camera-based autonomous driving work mostly prioritizes effectiveness and pays little attention to computational efficiency, so a real-time perception solution that delivers both is needed. Method: LRHPerception combines the computational efficiency of end-to-end learning with the representational richness of local mapping methods, processing monocular video into a unified five-channel tensor that contains RGB, road segmentation, pixel-level depth, object detection, and trajectory prediction. Result: Achieves real-time processing at 29 FPS on a single GPU, a 555% speedup over the fastest mapping-based approach, while significantly improving object tracking and prediction, road segmentation, and depth estimation. Conclusion: LRHPerception balances perception accuracy and computational efficiency, providing a workable framework for real-time autonomous driving perception in resource-constrained settings. Abstract: Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.
[185] Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting
Hwasik Jeong,Seungryong Lee,Gyeongjin Kang,Seungkwon Yang,Xiangyu Sun,Seungtae Nam,Eunbyung Park
Main category: cs.CV
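TL;DR: This paper proposes 2Xplat, a pose-free feed-forward 3D Gaussian Splatting (3DGS) framework built on a two-expert design that explicitly separates geometry estimation (camera pose prediction) from Gaussian generation (appearance modeling); it substantially improves pose-free 3DGS reconstruction quality, surpassing existing pose-free methods and matching posed SOTA methods with few training iterations.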
Details
Motivation: Existing unified monolithic architectures entangle geometric reasoning and appearance modeling within a shared representation, which may hurt high-fidelity 3DGS generation; a better modular design paradigm is worth exploring. Method: A two-expert architecture in which a geometry expert predicts camera poses and an appearance expert uses the predicted poses to explicitly synthesize 3D Gaussians, decoupling the geometry and appearance tasks. Result: In fewer than 5K training iterations, the method substantially outperforms prior pose-free feed-forward 3DGS methods and reaches a level comparable to current state-of-the-art posed methods. Conclusion: The modular two-expert design outperforms the conventional unified architecture, challenging the prevailing paradigm and indicating that decoupling geometry estimation from appearance synthesis benefits complex 3D tasks. Abstract: Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, being largely underexplored in prior works, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-experts pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
[186] NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection
Yupeng Zhang,Ruize Han,Zhiwei Chen,Wei Feng,Liang Wan
Main category: cs.CV
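TL;DR: This paper proposes the NoOVD framework, which combines a self-distillation mechanism grounded in frozen vision-language model (VLM) knowledge (K-FPN) with an R-RPN module that adjusts proposal confidence at inference time, mitigating the recall drop in open-vocabulary object detection caused by the RPN/RoI heads misclassifying novel-category objects as background, and achieving SOTA performance on multiple benchmarks.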
Details
Motivation: In open-vocabulary object detection (OVD), the RPN and RoI heads tend to misclassify unlabeled novel-category objects as background during training, so proposals are filtered out early or misclassified; at test time these proposals again receive low scores and are removed in post-processing, which markedly lowers novel-category recall. Method: The NoOVD training framework: (1) the K-FPN module uses the pretrained knowledge of a frozen VLM to guide the model in discovering novel-category objects and enables knowledge distillation without additional data; (2) the R-RPN module dynamically adjusts proposal confidence at inference time to raise novel-category recall. Result: In cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365, the method consistently outperforms existing methods on multiple metrics and significantly improves novel-category detection. Conclusion: Combining self-distillation from VLM priors with inference-time confidence calibration mitigates the recall bottleneck caused by background misclassification in OVD and offers a new direction for open-vocabulary detection. Abstract: Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance. To address these issues, we propose a novel training framework, NoOVD, which innovatively integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation without requiring additional data, thus preventing forced alignment of novel objects with background. Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.
[187] CTFS: Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels
Ping Guo,Chengzhou Li,Guanchen Meng,Qi Jia,Jinyuan Liu,Zhu Liu,Yu Liu,Zhongxuan Luo,Xin Fan
Main category: cs.CV
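TL;DR: This paper proposes a collaborative-teacher semantic segmentation framework for forward-looking sonar images that uses a multi-teacher collaboration mechanism and a cross-teacher reliability assessment mechanism to significantly improve segmentation with extremely few labels.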
Details
Motivation: Forward-looking sonar images suffer from severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions, so traditional teacher-student frameworks struggle to reach satisfactory semantic segmentation with extremely limited labeled data. Method: A collaborative-teacher semantic segmentation framework with one general teacher and multiple sonar-specific teachers; it uses a multi-teacher alternating guidance strategy and a cross-teacher reliability assessment mechanism that dynamically quantifies the consistency and stability of pseudo-labels to suppress noise. Result: On the FLSMD dataset with only 2% labeled data, mIoU improves by 5.08% over the best existing methods. Conclusion: The framework combines general semantic representations with sonar-specific characteristics and uses reliability assessment to reduce pseudo-label noise, significantly improving few-label sonar image segmentation. Abstract: As one of the most important underwater sensing technologies, forward-looking sonar exhibits unique imaging characteristics. Sonar images are often affected by severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions. These factors make it difficult for traditional teacher-student frameworks to achieve satisfactory performance in sonar semantic segmentation tasks under extremely limited labeled data conditions. To address this issue, we propose a Collaborative Teacher Semantic Segmentation Framework for forward-looking sonar images. This framework introduces a multi-teacher collaborative mechanism composed of one general teacher and multiple sonar-specific teachers. By adopting a multi-teacher alternating guidance strategy, the student model can learn general semantic representations while simultaneously capturing the unique characteristics of sonar images, thereby achieving more comprehensive and robust feature modeling. Considering the challenges of sonar images, which can lead teachers to generate a large number of noisy pseudo-labels, we further design a cross-teacher reliability assessment mechanism. This mechanism dynamically quantifies the reliability of pseudo-labels by evaluating the consistency and stability of predictions across multiple views and multiple teachers, thereby mitigating the negative impact caused by noisy pseudo-labels. Notably, on the FLSMD dataset, when only 2% of the data is labeled, our method achieves a 5.08% improvement in mIoU compared to other state-of-the-art approaches.
[188] CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
Nan Zhou,Huiqun Wang,Yaoyan Zheng,Di Huang
Main category: cs.CV
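TL;DR: This paper proposes a Context-aware Visual Fine-tuning framework (CoVFT) to address unstable vision-encoder fine-tuning in multimodal large language models (MLLMs); by introducing context vector extraction and a contextual mixture-of-experts module, it achieves more stable and stronger multimodal performance.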
Details
Motivation: Existing visual fine-tuning (VFT) methods for the vision encoder in MLLMs perform inconsistently across tasks, mainly because the vision encoder lacks context awareness, which causes conflicting parameter updates under diverse multimodal contexts. Method: The Context-aware Visual Fine-tuning (CoVFT) framework has two core modules, Context Vector Extraction (CVE) and Contextual Mixture-of-Experts (CoMoE), which explicitly incorporate multimodal context into visual adaptation, decompose conflicting optimization signals, and enable stable, context-sensitive visual updates. Result: Validated on 12 mainstream multimodal benchmarks, CoVFT significantly improves performance and stability; a 7B model fine-tuned with CoVFT exceeds the average performance of a 13B model trained without it. Conclusion: Vision-encoder fine-tuning needs context awareness; CoVFT offers a new paradigm for optimizing the visual module in MLLMs and reveals substantial untapped potential in vision-encoder optimization. Abstract: Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal context. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs.
[189] Hierarchical Text-Guided Brain Tumor Segmentation via Sub-Region-Aware Prompts
Bahram Mohammadi,Ta Duc Huy,Afrouz Sheikholeslami,Qi Chen,Vu Minh Hieu Phan,Sam White,Minh-Son To,Xuyun Zhang,Amin Beheshti,Luping Zhou,Yuankai Qi
Main category: cs.CV
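TL;DR: This paper proposes the TextCSP framework, which uses a text-modulated soft cascade decoder, sub-region-aware prompt tuning, and text-semantic channel modulators to segment the three brain tumor sub-regions (WT, TC, ET) more precisely, improving Dice and HD95 on the TextBraTS dataset by 1.7% and 6%, respectively.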
Details
Motivation: Existing methods compress a radiology report into a single global text embedding, ignoring the distinct clinical characteristics of the WT, TC, and ET sub-regions and their anatomical containment relationship, which blurs segmentation boundaries. Method: TextCSP: (1) a text-modulated soft cascade decoder that predicts WT->TC->ET in coarse-to-fine order; (2) sub-region-aware soft prompt tuning with a LoRA-adapted BioBERT encoder that produces a dedicated text representation for each sub-region; (3) text-semantic channel modulators that convert the text representations into channel-wise feature refinement signals. Result: On the TextBraTS dataset, Dice and HD95 exceed SOTA for all sub-regions, with the main metrics improving by 1.7% (Dice) and 6% (HD95). Conclusion: A multimodal fusion strategy that combines sub-region-specific text semantics with hierarchical anatomical priors significantly improves brain tumor sub-region segmentation accuracy. Abstract: Brain tumor segmentation remains challenging because the three standard sub-regions, i.e., whole tumor (WT), tumor core (TC), and enhancing tumor (ET), often exhibit ambiguous visual boundaries. Integrating radiological description texts with imaging has shown promise. However, most multimodal approaches typically compress a report into a single global text embedding shared across all sub-regions, overlooking their distinct clinical characteristics. We propose TextCSP (text-modulated soft cascade architecture), a hierarchical text-guided framework that builds on the TextBraTS baseline with three novel components: (1) a text-modulated soft cascade decoder that predicts WT->TC->ET in a coarse-to-fine manner consistent with their anatomical containment hierarchy; (2) sub-region-aware prompt tuning, which uses learnable soft prompts with a LoRA-adapted BioBERT encoder to generate specialized text representations tailored for each sub-region; (3) text-semantic channel modulators that convert the aforementioned representations into channel-wise refinement signals, enabling the decoder to emphasize features aligned with clinically described patterns. Experiments on the TextBraTS dataset demonstrate consistent improvements across all sub-regions against state-of-the-art methods by 1.7% and 6% on the main metrics Dice and HD95.
[190] Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models
Qifan Li,Xingyu Zhou,Jinhua Zhang,Weiyi You,Shuhang Gu
Main category: cs.CV
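TL;DR: This paper proposes a new Variance Expansion loss that makes the latent space of latent diffusion models more robust to sampling perturbations while maintaining high reconstruction fidelity.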
Details
Motivation: Existing β-VAE-based latent tokenizers produce overly compact latent manifolds that are sensitive to random perturbations during diffusion sampling, degrading generation quality; the authors argue that sampling robustness is a key but overlooked factor in generation quality. Method: Introduce a Variance Expansion loss that, through the adversarial balance between the reconstruction loss and variance expansion, prevents variance collapse while making the latent space more robust to sampling perturbations. Result: Experiments on multiple latent diffusion architectures show consistent gains in generation quality, supporting latent-space robustness as a key missing ingredient for stable, faithful diffusion sampling. Conclusion: Robustness of the latent space to sampling perturbations is essential; the proposed Variance Expansion loss mitigates variance collapse, balances reconstruction accuracy against sampling stability, and offers a new design principle for latent diffusion models. Abstract: Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $β$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.
[191] DGRNet: Disagreement-Guided Refinement for Uncertainty-Aware Brain Tumor Segmentation
Bahram Mohammadi,Yanqiu Wu,Vu Minh Hieu Phan,Sam White,Minh-Son To,Jian Yang,Michael Sheng,Yang Song,Yuankai Qi
Main category: cs.CV
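TL;DR: This paper proposes DGRNet, which estimates uncertainty from multi-view disagreement and performs text-conditioned refinement using radiology reports, addressing the lack of uncertainty quantification in single-model predictions and the under-use of radiology report information; on the TextBraTS dataset it improves Dice by 2.4% and HD95 by 11%.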
Details
Motivation: Existing deep learning methods for brain tumor segmentation have two key problems: they lack reliable uncertainty quantification for single-model predictions (which affects clinical decisions), and they under-use the rich semantic information in radiology reports that could guide segmentation in ambiguous regions. Method: The Disagreement-Guided Refinement Network (DGRNet): (1) a shared encoder-decoder with four lightweight view-specific adapters generates diverse predictions, giving uncertainty estimates in a single forward pass; (2) disagreement maps locate high-uncertainty regions; (3) those regions are refined under text guidance from the radiology report; (4) a diversity-preserving training strategy (pairwise similarity penalties plus gradient isolation) prevents view collapse. Result: On the TextBraTS dataset, DGRNet improves Dice by 2.4% and HD95 by 11% over the current best methods while providing meaningful uncertainty estimates. Conclusion: DGRNet combines multi-view disagreement modeling with text-conditioned guidance, improving segmentation accuracy while providing clinically interpretable uncertainty quantification, a safer and more reliable paradigm for AI-assisted brain tumor diagnosis. Abstract: Accurate brain tumor segmentation from MRI scans is critical for diagnosis and treatment planning. Despite the strong performance of recent deep learning approaches, two fundamental limitations remain: (1) the lack of reliable uncertainty quantification in single-model predictions, which is essential for clinical deployment because the level of uncertainty may impact treatment decision-making, and (2) the under-utilization of rich information in radiology reports that can guide segmentation in ambiguous regions. In this paper, we propose the Disagreement-Guided Refinement Network (DGRNet), a novel framework that addresses both limitations through multi-view disagreement-based uncertainty estimation and text-conditioned refinement. DGRNet generates diverse predictions via four lightweight view-specific adapters attached to a shared encoder-decoder, enabling efficient uncertainty quantification within a single forward pass. Afterward, we build disagreement maps to identify regions of high segmentation uncertainty, which are then selectively refined according to clinical reports. Moreover, we introduce a diversity-preserving training strategy that combines pairwise similarity penalties and gradient isolation to prevent view collapse.
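Illustrative sketch (not the authors' code): the abstract says disagreement maps are built from the four view-specific predictions but does not give the formula. The snippet below assumes per-pixel variance of the foreground probability across views as the disagreement measure, and a fixed threshold to pick the pixels that would be sent to refinement. The function names, the threshold value, and the toy probabilities are all hypothetical.

```python
def disagreement_map(view_probs):
    """Per-pixel variance of foreground probability across views.

    view_probs is a list of views, each a flat list of per-pixel probabilities.
    """
    n_views = len(view_probs)
    n_pixels = len(view_probs[0])
    variances = []
    for i in range(n_pixels):
        values = [view[i] for view in view_probs]
        mean = sum(values) / n_views
        variances.append(sum((v - mean) ** 2 for v in values) / n_views)
    return variances


def uncertain_pixels(view_probs, threshold=0.02):
    """Indices of pixels whose cross-view variance exceeds the (assumed) threshold."""
    return [i for i, d in enumerate(disagreement_map(view_probs)) if d > threshold]


# Four toy views over three pixels: the views agree on pixels 0 and 1
# and disagree on pixel 2, so only pixel 2 is flagged for refinement.
views = [
    [0.9, 0.1, 0.5],
    [0.9, 0.1, 0.9],
    [0.9, 0.1, 0.1],
    [0.9, 0.1, 0.5],
]
flagged = uncertain_pixels(views)
```

Restricting refinement to flagged pixels is what makes the process selective: confident regions pass through unchanged.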
The experimental results on the TextBraTS dataset show that DGRNet favorably improves state-of-the-art segmentation accuracy by 2.4% and 11% in main metrics Dice and HD95, respectively, while providing meaningful uncertainty estimates.
[192] Representation-Level Adversarial Regularization for Clinically Aligned Multitask Thyroid Ultrasound Assessment
Dina Salama,Mohamed Mahmoud,Nourhan Bayasi,David Liu,Ilker Hacihaliloglu
Main category: cs.CV
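TL;DR: This paper proposes a clinically guided multitask framework that jointly predicts the segmentation mask and TI-RADS risk category of thyroid nodules, and introduces RLAR (representation-level adversarial gradient regularization) to mitigate gradient conflicts caused by inter-annotator variability, improving risk stratification while preserving segmentation quality.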
Details
Motivation: In thyroid ultrasound, radiologists differ markedly in how they contour nodules and assign TI-RADS risk grades, producing inconsistent supervision that degrades standard deep learning training. Method: A clinically guided multitask model jointly predicts the nodule mask and TI-RADS category; a compact TI-RADS-aligned radiomics target guides the classification embedding; the RLAR regularizer penalizes excessive angular alignment between the tasks' normalized adversarial directions in latent space, explicitly modeling and controlling multitask gradient competition. Result: On a public TI-RADS dataset, the method consistently outperforms single-task and conventional multitask baselines on risk stratification while maintaining nodule segmentation quality. Conclusion: RLAR mitigates multitask gradient conflicts caused by annotator variability, and clinically guided joint modeling makes thyroid nodule assessment more robust and consistent, with potential for clinical adoption. Abstract: Thyroid ultrasound is the first-line exam for assessing thyroid nodules and determining whether biopsy is warranted. In routine reporting, radiologists produce two coupled outputs: a nodule contour for measurement and a TI-RADS risk category based on sonographic criteria. Yet both contouring style and risk grading vary across readers, creating inconsistent supervision that can degrade standard learning pipelines. In this paper, we address this workflow with a clinically guided multitask framework that jointly predicts the nodule mask and TI-RADS category within a single model. To ground risk prediction in clinically meaningful evidence, we guide the classification embedding using a compact TI-RADS aligned radiomics target during training, while preserving complementary deep features for discriminative performance. However, under annotator variability, naive multitask optimization often fails not because the tasks are unrelated, but because their gradients compete within the shared representation. To make this competition explicit and controllable, we introduce RLAR, a representation-level adversarial gradient regularizer. Rather than performing parameter-level gradient surgery, RLAR uses each task's normalized adversarial direction in latent space as a geometric probe of task sensitivity and penalizes excessive angular alignment between task-specific adversarial directions. On a public TI-RADS dataset, our clinically guided multitask model with RLAR consistently improves risk stratification while maintaining segmentation quality compared to single-task training and conventional multitask baselines.
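Illustrative sketch (not the authors' code): the abstract states that RLAR penalizes excessive angular alignment between the tasks' normalized adversarial directions in latent space, but gives no formula. The snippet below assumes each direction is the normalized gradient of a task loss with respect to the latent vector, and assumes a hinge on the absolute cosine with a margin as the penalty. The function names, the margin value, and the toy gradients are hypothetical.

```python
import math


def normalize(vec):
    """Scale a gradient vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("a zero gradient has no direction")
    return [v / norm for v in vec]


def alignment_penalty(grad_seg, grad_cls, margin=0.5):
    """Hinge penalty on |cos(angle)| between two task directions (assumed form).

    grad_seg and grad_cls stand for the gradients of the segmentation and
    classification losses with respect to the shared latent vector.
    """
    d_seg = normalize(grad_seg)
    d_cls = normalize(grad_cls)
    cosine = sum(a * b for a, b in zip(d_seg, d_cls))
    return max(0.0, abs(cosine) - margin)


no_conflict = alignment_penalty([1.0, 0.0], [0.0, 1.0])   # orthogonal directions
full_overlap = alignment_penalty([3.0, 4.0], [3.0, 4.0])  # identical directions
```

Because the inputs are normalized, the penalty depends only on the angle between the two directions and not on the gradient magnitudes, which matches the "geometric probe" framing in the abstract.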
Code and pretrained models will be released.
[193] Learning Progressive Adaptation for Multi-Modal Tracking
He Wang,Tianyang Xu,Zhangyong Tang,Xiao-Jun Wu,Josef Kittler
Main category: cs.CV
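TL;DR: This paper proposes a progressive adaptation method (PATrack) that uses modality-dependent, modality-entangled, and task-level adapters to transfer pre-trained RGB models to multi-modal tracking, achieving SOTA performance on RGB+Thermal, RGB+Depth, and RGB+Event tasks.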
Details
Motivation: Existing methods rely on parameter-efficient fine-tuning alone and neglect dedicated designs for single-modality feature enhancement, cross-modal interaction modeling, and prediction-head adaptation, so they fail to fully exploit RGB pre-trained models for multi-modal tracking. Method: The progressive adaptation framework PATrack contains three kinds of adapters: (1) a modality-dependent adapter that decomposes high- and low-frequency components to strengthen single-modality representations; (2) a modality-entangled adapter that uses cross-attention guided by information shared across modalities for reliable cross-modal interaction; (3) a task-level adapter that adapts the prediction head to the fused features. Result: Significantly outperforms existing methods on the three mainstream multi-modal tracking benchmarks (RGB+Thermal, RGB+Depth, RGB+Event), confirming the effectiveness and generality of progressive adaptation. Conclusion: Progressive adaptation is a key route to better multi-modal tracking, and jointly modeling intra-modal, inter-modal, and task-level characteristics makes fuller use of pre-trained RGB models. Abstract: Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adopting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook advanced adaptations for applying RGB pre-trained models and fail to modulate a single specific modality, cross-modal interactions, and the prediction head. To address the issues, we propose to perform Progressive Adaptation for Multi-Modal Tracking (PATrack). This innovative approach incorporates modality-dependent, modality-entangled, and task-level adapters, effectively bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy. Specifically, modality-specific information is enhanced through the modality-dependent adapter, decomposing the high- and low-frequency components, which ensures a more robust feature representation within each modality. The inter-modal interactions are introduced in the modality-entangled adapter, which implements a cross-attention operation guided by inter-modal shared information, ensuring the reliability of features conveyed between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method shows impressive performance against state-of-the-art methods.
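Illustrative sketch (not the authors' code): the abstract says the modality-dependent adapter decomposes features into high- and low-frequency components without saying how. The snippet below assumes the simplest possible split on a 1D feature row: a moving average serves as the low-frequency part and the residual as the high-frequency part. The function name, the window size, and the toy signal are hypothetical; the paper may well use a different filter or a transform-domain split.

```python
def split_frequencies(signal, window=3):
    """Split a 1D signal into a smooth (low) part and a residual (high) part.

    The low part is a moving average with the window clipped at the borders;
    the high part is whatever the average removed, so low + high == signal.
    """
    half = window // 2
    low = []
    for i in range(len(signal)):
        patch = signal[max(0, i - half): i + half + 1]
        low.append(sum(patch) / len(patch))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# A flat signal with one spike: the spike shows up mostly in the high band.
features = [1.0, 1.0, 1.0, 4.0, 1.0]
low_band, high_band = split_frequencies(features)
```

Since the two bands sum back to the input, an adapter can process each band separately and recombine them without losing information.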
Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.
[194] Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning
Shih-Wen Liu,Yen-Chang Chen,Wei-Ta Chu,Fu-En Yang,Yu-Chiang Frank Wang
Main category: cs.CV
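TL;DR: This paper proposes Free Sinewich, a parameter-efficient multi-task learning framework based on frequency switching that uses sinusoidal modulation and a lightweight Clock Net to generate task-specific weights at low overhead with high task separability, reaching a SOTA performance-efficiency trade-off on dense prediction tasks.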
Details
Motivation: Existing parameter-efficient fine-tuning (PEFT) methods mainly target single tasks and offer little support for multi-task learning (MTL), so an efficient multi-task adaptation mechanism that shares parameters while preserving task specificity is needed. Method: The Free Sinewich framework: (1) a Sinewich layer fuses low-rank factors and convolutional priors into a single kernel; (2) an elementwise sinusoidal transformation applies frequency modulation to generate task-specific weights; (3) a lightweight Clock Net produces bounded frequencies to stabilize training; theoretical analysis shows that sine modulation raises the rank of low-rank adapters and that frequency separation reduces the correlation between task weights. Result: On dense prediction benchmarks, the method clearly outperforms single-task fine-tuning (up to +5.39%) with only 6.53M trainable parameters, achieving the current best performance-efficiency trade-off. Conclusion: Free Sinewich offers a compact, scalable paradigm for multi-task learning based on frequency-based parameter sharing and demonstrates the effectiveness and potential of frequency-domain parameter modulation in PEFT. Abstract: Multi-task learning (MTL) aims to enable a single model to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce Free Sinewich, a parameter-efficient multi-task learning framework that enables near-zero-cost weight modulation via frequency switching (Free). Specifically, a Sine-AWB (Sinewich) layer combines low-rank factors and convolutional priors into a single kernel, which is then modulated elementwise by a sinusoidal transformation to produce task-specialized weights. A lightweight Clock Net is introduced to produce bounded frequencies that stabilize this modulation during training. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing. Project page: https://casperliuliuliu.github.io/projects/Free-Sinewich/.
[195] CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
Shanmukha Vellamcheti,Uday Kiran Kothapalli,Disharee Bhowmick,Sathyanarayanan N. Aakur
Main category: cs.CV
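TL;DR: This paper proposes a diagnostic benchmark for evaluating whether multimodal large language models (MLLMs) keep spatial relations consistent under hypothetical viewpoint changes; experiments show that, despite high single-view accuracy, models degrade systematically under counterfactual viewpoint transformations, and that structured representations such as scene graphs improve stability.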
Details
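Motivation: Existing MLLMs perform well on single-view spatial reasoning tasks, but it is unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes; the robustness of their spatial representations needs systematic evaluation.
Method: Builds a controlled diagnostic benchmark that requires no image re-rendering, covering 100 synthetic scenes and 6,000 relational queries, and evaluates viewpoint consistency, 360° cycle consistency, and relational stability under sequential transformations; compares input representations including visual input, textual bounding boxes, and structured scene graphs.
Result: State-of-the-art MLLMs achieve high single-view accuracy but degrade systematically under counterfactual viewpoint changes, with frequent cycle-consistency violations and rapid decay in relational stability; structured representations such as scene graphs markedly improve stability.
Conclusion: Single-view spatial accuracy overestimates the robustness of a model's spatial representations; representation structure is critical for counterfactual spatial reasoning, and structured modeling deserves more emphasis to strengthen spatial generalization.
Abstract: Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations (visual input, textual bounding boxes, and structured scene graphs) and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
[196] LiFR-Seg: Anytime High-Frame-Rate Segmentation via Event-Guided Propagation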
Xiaoshan Wu,Xiaoyang Lyu,Yifei Yu,Bo Wang,Zhongrui Wang,Xiaojuan Qi
Main category: cs.CV
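TL;DR: This paper proposes a new task, Anytime Interframe Semantic Segmentation, which predicts semantic segmentation at any time from a single RGB frame and an asynchronous event stream. The LiFR-Seg framework, built on uncertainty-aware event-driven motion-field warping and temporal memory attention, lets low-frame-rate hardware approach the high-frame-rate upper bound (73.82% mIoU on DSEC, a gap of only 0.09%).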
Details
Motivation: The low frame rate (LFR) of standard cameras leaves perceptual gaps between frames in dynamic scenes, limiting dense semantic segmentation.
Method: Proposes the LiFR-Seg framework: (1) builds an event-driven motion field from event data and learns its explicit confidence; (2) propagates semantic features through time with uncertainty-aware warping; (3) adds a temporal memory attention module to improve coherence in dynamic scenes.
Result: Reaches 73.82% mIoU on DSEC, within 0.09% of the high-frame-rate (HFR) upper bound that has full access to the target frame; effectiveness is also validated on SHF-DSEC, a high-frequency synthetic dataset contributed by the authors.
Conclusion: The work establishes a new paradigm for high-frame-rate perception on low-frame-rate hardware and demonstrates that fusing RGB and event data for anytime semantic segmentation is feasible and efficient.
Abstract: Dense semantic segmentation in dynamic environments is fundamentally limited by the low-frame-rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames. To solve this, we introduce Anytime Interframe Semantic Segmentation: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data. This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes. We propose LiFR-Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an uncertainty-aware warping process, guided by an event-driven motion field and its learned, explicit confidence. A temporal memory attention module further ensures coherence in dynamic scenarios. We validate our method on the DSEC dataset and a new high-frequency synthetic benchmark (SHF-DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82% mIoU on DSEC) that is statistically indistinguishable from an HFR upper-bound (within 0.09%) that has full access to the target frame. This work presents a new, efficient paradigm for achieving robust, high-frame-rate perception with low-frame-rate hardware. Project Page: https://candy-crusher.github.io/LiFR_Seg_Proj/#; Code: https://github.com/Candy-Crusher/LiFR-Seg.git.
[197] ReDiffuse: Rotation Equivariant Diffusion Model for Multi-focus Image Fusion
Bo Li,Tingting Bao,Lingling Zhang,Weiping Fu,Yaxian Wang,Jun Liu
Main category: cs.CV
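TL;DR: This paper proposes ReDiffuse, a rotation-equivariant diffusion model for multi-focus image fusion (MFIF). By building an end-to-end rotation-equivariant diffusion architecture and analyzing it theoretically, it better preserves the orientation and consistency of geometric structures in fused results and improves performance across several datasets and metrics.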
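To make "rotation equivariance" and "equivariance error" concrete, the sketch below measures how far an image operator f is from satisfying f(rot(x)) = rot(f(x)) for 90° rotations. It is an illustration only, not the paper's analysis: the restriction to multiples of 90°, the toy operators `neighbour_mean` and `horizontal_diff`, and the relative-norm definition of the error are all assumptions.

```python
import numpy as np

def equivariance_error(f, x, k=1):
    """Relative error between f(rot(x)) and rot(f(x)) for a rotation of k*90 degrees."""
    lhs = f(np.rot90(x, k))
    rhs = np.rot90(f(x), k)
    return float(np.linalg.norm(lhs - rhs) / (np.linalg.norm(rhs) + 1e-12))

def neighbour_mean(x):
    """Average of the 4 neighbours with wrap-around borders.
    The neighbourhood is symmetric under 90 degree turns, so this is equivariant."""
    return (np.roll(x, 1, 0) + np.roll(x, -1, 0) + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 4.0

def horizontal_diff(x):
    """Difference along columns only; direction-dependent, so not equivariant."""
    return x - np.roll(x, 1, 1)
```

On a random square image, `equivariance_error(neighbour_mean, x)` stays at floating-point noise level, while `equivariance_error(horizontal_diff, x)` is large, which is the kind of gap an equivariant architecture is designed to close.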
Details
Motivation: MFIF is an ill-posed problem in which defocus blur warps and deforms symmetric geometric structures such as textures and edges, introducing artifacts into fused images; embedding rotation equivariance into the diffusion network is therefore essential for preserving original orientation and structural consistency.
Method: Proposes ReDiffuse, carefully constructing the basic diffusion architectures to achieve end-to-end rotation equivariance, and provides a rigorous theoretical analysis of its intrinsic equivariance error.
Result: Evaluated on four datasets (Lytro, MFFW, MFI-WHU, and Road-MF), ReDiffuse improves on other MFIF methods by 0.28-6.64% across six evaluation metrics.
Conclusion: By embedding rotation equivariance, ReDiffuse mitigates the structural distortion caused by defocus blur, supporting the effectiveness and necessity of equivariant structures in diffusion models for MFIF.
Abstract: Diffusion models have achieved impressive performance on multi-focus image fusion (MFIF). However, a key challenge in applying diffusion models to the ill-posed MFIF problem is that defocus blur can make common symmetric geometric structures (e.g., textures and edges) appear warped and deformed, often leading to unexpected artifacts in the fused images. Therefore, embedding rotation equivariance into diffusion networks is essential, as it enables the fusion results to faithfully preserve the original orientation and structural consistency of geometric patterns underlying the input images. Motivated by this, we propose ReDiffuse, a rotation-equivariant diffusion model for MFIF. Specifically, we carefully construct the basic diffusion architectures to achieve end-to-end rotation equivariance. We also provide a rigorous theoretical analysis to evaluate its intrinsic equivariance error, demonstrating the validity of embedding equivariance structures. ReDiffuse is comprehensively evaluated against various MFIF methods across four datasets (Lytro, MFFW, MFI-WHU, and Road-MF). Results demonstrate that ReDiffuse achieves competitive performance, with improvements of 0.28-6.64% across six evaluation metrics. The code is available at https://github.com/MorvanLi/ReDiffuse.
[198] One Pool Is Not Enough: Multi-Cluster Memory for Practical Test-Time Adaptation
Yu-Wen Tseng,Xingyi Zheng,Ya-Chen Wu,I-Bin Liao,Yung-Hui Li,Hong-Han Shuai,Wen-Huang Cheng
Main category: cs.CV
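TL;DR: This paper proposes the Multi-Cluster Memory (MCM) framework, which organizes test-stream samples into clusters by distributional mode to improve practical test-time adaptation (PTTA). It raises accuracy on several benchmark datasets and shows that memory structure is an important design choice for PTTA.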
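The three mechanisms named for MCM (descriptor-based assignment, Adjacent Cluster Consolidation, Uniform Cluster Retrieval) can be sketched roughly as follows. This is a toy under stated assumptions, not the authors' implementation: the scalar descriptor, the distance threshold, the reading of "temporally adjacent" as "adjacent in creation order", and the class name `MultiClusterMemory` are all hypothetical.

```python
import random

class MultiClusterMemory:
    """Toy multi-cluster memory keyed by a scalar descriptor (hypothetical design)."""

    def __init__(self, max_clusters=4, threshold=0.1):
        self.max_clusters = max_clusters
        self.threshold = threshold
        self.clusters = []  # each: {"center": float, "items": list}, kept in creation order

    def add(self, sample, descriptor):
        # Descriptor-based assignment: join the nearest cluster if it is close enough.
        if self.clusters:
            nearest = min(self.clusters, key=lambda c: abs(c["center"] - descriptor))
            if abs(nearest["center"] - descriptor) <= self.threshold:
                nearest["items"].append(sample)
                n = len(nearest["items"])
                nearest["center"] += (descriptor - nearest["center"]) / n  # running mean
                return
        # Otherwise open a new cluster, consolidating if the budget is exceeded.
        self.clusters.append({"center": descriptor, "items": [sample]})
        if len(self.clusters) > self.max_clusters:
            self._consolidate()

    def _consolidate(self):
        # Merge the most similar pair of clusters that are adjacent in creation order.
        i = min(range(len(self.clusters) - 1),
                key=lambda j: abs(self.clusters[j]["center"] - self.clusters[j + 1]["center"]))
        a, b = self.clusters[i], self.clusters[i + 1]
        na, nb = len(a["items"]), len(b["items"])
        merged = {"center": (a["center"] * na + b["center"] * nb) / (na + nb),
                  "items": a["items"] + b["items"]}
        self.clusters[i:i + 2] = [merged]

    def retrieve(self, per_cluster, rng=random):
        # Uniform retrieval: draw up to the same number of samples from every cluster.
        batch = []
        for c in self.clusters:
            k = min(per_cluster, len(c["items"]))
            batch.extend(rng.sample(c["items"], k))
        return batch
```

Drawing the same number of samples per cluster, rather than sampling the pooled memory, is what keeps small modes from being drowned out by large ones in this sketch.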
Details
Motivation: Existing PTTA methods store test samples in a single unstructured memory pool, but real test streams are multi-modal and temporally correlated; a single-cluster structure cannot match this underlying distribution, leading to unstable adaptation and mode loss.
Method: Proposes the Multi-Cluster Memory (MCM) framework with three components: cluster assignment based on pixel-level statistical descriptors, Adjacent Cluster Consolidation (ACC) to bound memory usage, and Uniform Cluster Retrieval (UCR) to ensure balanced supervision across modes; it integrates with existing TTA methods in a plug-and-play manner.
Result: Consistent improvements across all 12 configurations on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet; GMM-based diagnostics show that MCM maintains near-optimal distributional balance, entropy, and mode coverage, whereas single-cluster memory exhibits persistent imbalance and mode degradation.
Conclusion: Memory organization is a key design axis for practical test-time adaptation; a multi-cluster structure better fits the multi-modal nature of real test streams and markedly improves robustness and adaptation.
Abstract: Test-time adaptation (TTA) adapts pre-trained models to distribution shifts at inference using only unlabeled test data. Under the Practical TTA (PTTA) setting, where test streams are temporally correlated and non-i.i.d., memory has become an indispensable component for stable adaptation, yet existing methods universally store samples in a single unstructured pool. We show that this single-cluster design is fundamentally mismatched to PTTA: a stream clusterability analysis reveals that test streams are inherently multi-modal, with the optimal number of mixture components consistently far exceeding one. To close this structural gap, we propose Multi-Cluster Memory (MCM), a plug-and-play framework that organizes stored samples into multiple clusters using lightweight pixel-level statistical descriptors. MCM introduces three complementary mechanisms: descriptor-based cluster assignment to capture distinct distributional modes, Adjacent Cluster Consolidation (ACC) to bound memory usage by merging the most similar temporally adjacent clusters, and Uniform Cluster Retrieval (UCR) to ensure balanced supervision across all modes during adaptation. Integrated with three contemporary TTA methods on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, MCM achieves consistent improvements across all 12 configurations, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet. Notably, these gains scale with distributional complexity: larger label spaces with greater multi-modality benefit most from multi-cluster organization. GMM-based memory diagnostics further confirm that MCM maintains near-optimal distributional balance, entropy, and mode coverage, whereas single-cluster memory exhibits persistent imbalance and progressive mode loss. These results establish memory organization as a key design axis for practical test-time adaptation.
[199] MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics
Pengxiang Cai,Mengyang Li
Main category: cs.CV
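TL;DR: This paper proposes MS-CustomNet, a framework for customized multi-subject text-to-image generation with diffusion models. It supports zero-shot insertion of multiple user-provided objects with explicit control over their hierarchy and spatial placement; trained on the newly built MSI dataset, it performs well on identity preservation (DINO-I: 0.61) and positional control (YOLO-L: 0.94).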
Details
Motivation: Existing diffusion models struggle to give users explicit control over compositional structure and precise spatial relationships between subjects when customizing multi-subject scenes.
Method: Proposes MS-CustomNet, which supports zero-shot integration of multiple subjects together with user-defined hierarchical arrangement and spatial placement; builds the COCO-derived MSI dataset for training.
Result: On multi-subject customization tasks, achieves a DINO-I score of 0.61 (identity preservation) and a YOLO-L score of 0.94 (positional control).
Conclusion: MS-CustomNet markedly improves fine-grained, user-directed compositional and spatial control in multi-subject image generation.
Abstract: Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define the hierarchical arrangements and spatial placements of these objects within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.
[200] Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
Wenjin Hou,Xiaoxiao Sun,Hehe Fan
Main category: cs.CV
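TL;DR: This paper proposes RLVC, a framework that uses reinforcement learning with visual cues to improve generative zero-shot learning; through task-relevant feature synthesis and class-wise visual-cue alignment, it achieves state-of-the-art results.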
Details
Motivation: Features synthesized by existing generative zero-shot learning methods lack task relevance, and semantic prototypes alone cannot distinguish classes that are semantically similar but visually distinct.
Method: Proposes the RLVC framework: an outcome-reward reinforcement learning mechanism drives the generative model to self-evolve; class-wise visual cues align synthesized features with visual prototypes and stabilize training; a novel cold-start training strategy is introduced.
Result: Achieves state-of-the-art performance on three prevalent ZSL benchmarks, with an average improvement of 4.7%.
Conclusion: Combining reinforcement learning with visual cues improves the task relevance and discriminability of synthesized features in generative ZSL and markedly improves zero-shot recognition.
Abstract: Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.
[201] Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images
Jiatong Xia,Lingqiao Liu
Main category: cs.CV
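TL;DR: This paper proposes a training-free, pose-free system for reconstructing and rendering 3D indoor scenes. From sparse RGB images it delivers high-fidelity reconstruction and instance-aware 3D understanding and editing, and it uses a 3D-aware diffusion model to improve rendering quality and editability.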
Details
Motivation: Traditional radiance field methods depend on dense views and per-scene optimization, making it hard to meet practical needs such as sparse inputs, zero training, and easy editing.
Method: Proposes a three-stage training-free pipeline: (1) robust point cloud reconstruction with warping-based anomaly removal; (2) warping-guided lifting of 2D instance masks to 3D instances; (3) point cloud projection followed by rendering and completion with a 3D-aware diffusion model; object-level editing is supported by modifying only the point cloud.
Result: Achieves high-quality, instance-consistent 3D reconstruction and novel view synthesis from sparse image inputs, and supports editing operations such as instance removal without retraining.
Conclusion: The work offers a new paradigm for efficient, editable zero-shot 3D content generation that does not depend on scene-specific optimization.
Abstract: We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhance realism, especially under sparse input conditions. We further demonstrate that object-level scene editing such as instance removal can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization. Project page: https://jiatongxia.github.io/TID3R/
[202] GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing
Zifeng Zhu,Jiaming Han,Jiaxiang Zhao,Minnan Luo,Xiangyu Yue
Main category: cs.CV
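TL;DR: This paper proposes GIDE, a framework that uses a discrete noise inversion mechanism to enable training-free, high-fidelity image editing in DLLMs, and builds the GIDE-Bench benchmark, on which it shows large gains in semantic correctness and perceptual quality.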
Details
Motivation: Discrete tokenization in DLLMs hinders the use of standard noise inversion techniques, causing structural degradation during editing and making precise, training-free image editing difficult.
Method: Proposes the GIDE framework, which includes a novel discrete noise inversion mechanism and decomposes the editing pipeline into grounding, inversion, and refinement stages; also builds GIDE-Bench, an evaluation benchmark with 805 compositional scenarios.
Result: On GIDE-Bench, Semantic Correctness improves by 51.83% and Perceptual Quality by 50.39%; on ImgEdit-Bench, GIDE also surpasses trained baselines and reaches photorealistic consistency on par with leading models.
Conclusion: GIDE provides a unified, training-free, high-fidelity multi-modal image editing solution for DLLMs and advances the state of the art in this direction.
Abstract: While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE to support various editing instructions (text, point, and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.
[203] DSCSNet: A Dynamic Sparse Compression Sensing Network for Closely-Spaced Infrared Small Target Unmixing
Zhiyang Tang,Yiming Zhu,Ruimin Huang,Meng Yang,Yong Ma,Jun Huang,Fan Fan
Main category: cs.CV
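TL;DR: This paper proposes the Dynamic Sparse Compressed Sensing Network (DSCSNet), which deep-unfolds the ADMM algorithm with learnable parameters and combines a strict ℓ1-norm sparsity constraint with a self-attention-based dynamic thresholding mechanism, improving the accuracy and generalization of infrared small target unmixing while retaining the physical modeling logic.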
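For readers unfamiliar with what is being unfolded, the sketch below shows plain three-step ADMM for an ℓ1-regularized least-squares problem, where the ℓ1 term enters through soft thresholding in the auxiliary-variable update. It is a textbook baseline under assumptions, not DSCSNet itself: `lam`, `rho`, and `iters` are fixed by hand here, whereas the paper learns its parameters and adapts the threshold with self-attention.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink every entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """Plain ADMM for min 0.5*||Ax - b||^2 + lam*||z||_1 subject to x = z."""
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)
    lhs = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(lhs, Atb + rho * (z - u))  # step 1: data-fidelity update
        z = soft_threshold(x + u, lam / rho)           # step 2: l1 auxiliary-variable update
        u = u + x - z                                  # step 3: dual (multiplier) update
    return z
```

A deep-unfolded network treats each loop iteration as a layer; the fixed threshold `lam / rho` in step 2 is the quantity a dynamic thresholding mechanism would replace.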
Details
Motivation: Existing methods struggle to combine the strict sparsity guarantees of model-driven approaches with the adaptability of data-driven methods to dynamic scenes, and unmixing distant clustered infrared small targets (CSOU) is itself a highly ill-posed inverse problem.
Method: Proposes the deep-unfolded Dynamic Sparse Compressed Sensing Network (DSCSNet), which couples ADMM with learnable parameters: an ℓ1-norm sparsity constraint is embedded in the auxiliary variable update in place of the traditional ℓ2 term, and a self-attention-driven dynamic thresholding mechanism is added to the reconstruction stage; all modules are jointly optimized end-to-end across the three iterative steps of ADMM.
Result: On the synthetic infrared dataset CSIST-100K, DSCSNet outperforms state-of-the-art methods on key metrics such as CSO-mAP and sub-pixel localization error.
Conclusion: While retaining the physical logic of compressed sensing, DSCSNet unifies robust sparsity induction with scene adaptability, improving unmixing accuracy and generalization for small targets in complex infrared scenes.
Abstract: Due to the limitations of optical lens focal length and detector resolution, distant clustered infrared small targets often appear as mixed spots. The Close Small Object Unmixing (CSOU) task aims to recover the number, sub-pixel positions, and radiant intensities of individual targets from these spots, which is a highly ill-posed inverse problem. Existing methods struggle to balance the rigorous sparsity guarantees of model-driven approaches and the dynamic scene adaptability of data-driven methods. To address this dilemma, this paper proposes a Dynamic Sparse Compressed Sensing Network (DSCSNet), a deep-unfolded network that couples the Alternating Direction Method of Multipliers (ADMM) with learnable parameters. Specifically, we embed a strict $\ell_1$-norm sparsity constraint into the auxiliary variable update step of ADMM to replace the traditional $\ell_2$-norm smoothness-promoting terms, which effectively preserves the discrete energy peaks of small targets. We also integrate a self-attention-based dynamic thresholding mechanism into the reconstruction stage, which adaptively adjusts the sparsification intensity using the sparsity-enhanced information from the iterative process. These modules are jointly optimized end-to-end across the three iterative steps of ADMM. Retaining the physical logic of compressed sensing, DSCSNet achieves robust sparsity induction and scene adaptability, thus enhancing the unmixing accuracy and generalization in complex infrared scenarios. Extensive experiments on the synthetic infrared dataset CSIST-100K demonstrate that DSCSNet outperforms state-of-the-art methods in key metrics such as CSO-mAP and sub-pixel localization error.
[204] Boundary-Aware Instance Segmentation in Microscopy Imaging
Thomas Mendelson,Joshua Francois,Galit Lahav,Tammy Riklin-Raviv
Main category: cs.CV
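TL;DR: This paper proposes BAISeg, a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks and is trained with a Modified Hausdorff Distance loss, yielding more precise cell boundary localization and better separation of adjacent instances in dense microscopy images.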
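Two ingredients named for BAISeg, a sigmoid that turns signed distances into probabilities and a Modified Hausdorff Distance, can be illustrated in a few lines. This is a sketch under assumptions rather than the authors' loss: the sign convention (positive inside the cell), the fixed steepness `k` (learned in the paper), and the use of the classic point-set form of the modified Hausdorff distance are choices made here for illustration.

```python
import math

def sdf_to_prob(sdf, k=4.0):
    """Map a signed distance (positive inside, by assumption) to a foreground probability.
    A larger k gives a sharper transition at the boundary, where sdf == 0."""
    return 1.0 / (1.0 + math.exp(-k * sdf))

def modified_hausdorff(A, B):
    """Classic modified Hausdorff distance between two non-empty 2D point sets:
    the larger of the two mean nearest-neighbour distances."""
    def directed(P, Q):
        return sum(min(math.dist(p, q) for q in Q) for p in P) / len(P)
    return max(directed(A, B), directed(B, A))
```

Because the probability is exactly 0.5 on the zero level set, two touching cells with separate SDFs keep distinct boundaries even where a binary mask would merge them.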
Details
Motivation: Existing segmentation methods based on foundation models such as SAM struggle to separate touching or overlapping cell instances in dense microscopy images and rely on extensive manual prompting.
Method: Proposes an instance segmentation framework based on signed distance functions (SDFs), using a learned sigmoid mapping to convert SDFs into probability maps and a unified Modified Hausdorff Distance (MHD) loss to jointly optimize region and boundary terms.
Result: On several public and private high-throughput microscopy datasets, BAISeg outperforms recent SAM-based and foundation-model methods in boundary accuracy and instance-level metrics.
Conclusion: Combining SDF modeling with a boundary-aware loss offers a new paradigm for prompt-free, high-precision cell instance segmentation, improving robustness and geometric consistency in dense scenes.
Abstract: Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, yet separating touching or overlapping instances remains a persistent challenge. Although foundation models for segmentation such as SAM have broadened the accessibility of image segmentation, they still struggle to separate nearby cell instances in dense microscopy scenes without extensive prompting. We propose a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry-consistent modeling of cell contours. A learned sigmoid mapping converts SDFs into probability maps, yielding sharp boundary localization and robust separation of adjacent instances. Training is guided by a unified Modified Hausdorff Distance (MHD) loss that integrates region- and boundary-based terms. Evaluations on both public and private high-throughput microscopy datasets demonstrate improved boundary accuracy and instance-level performance compared to recent SAM-based and foundation-model approaches. Source code is available at: https://github.com/ThomasMendelson/BAISeg.git
[205] JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
Haolun Zheng,Yu He,Tailun Chen,Shuo Shao,Zhixuan Chu,Hongbin Zhou,Lan Tao,Zhan Qin,Kui Ren
Main category: cs.CV
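TL;DR: This paper proposes JANUS, a framework that optimizes a structured prompt distribution under a black-box, end-to-end reward to jailbreak text-to-image models efficiently, substantially raising attack success rate while preserving semantic fidelity.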
Details
Motivation: Existing jailbreak attacks either rely on proxy losses rather than the true end-to-end objective or require generators trained with large-scale, costly reinforcement learning, leaving them lacking in efficiency and in fidelity to the real objective.
Method: JANUS formulates jailbreaking as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters; it replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions for efficient exploration.
Result: On Stable Diffusion 3.5 Large Turbo, ASR-8 rises from 25.30% to 43.15% with consistently higher CLIP and NSFW scores; the attack works on both open-source and commercial T2I models.
Conclusion: The findings expose structural weaknesses in current T2I safety pipelines and call for stronger, distribution-aware defenses.
Abstract: Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.
[206] Positional Segmentor-Guided Counterfactual Fine-Tuning for Spatially Localized Image Synthesis
Tian Xia,Matthew Sinclair,Andreas Schuh,Fabio De Sousa Ribeiro,Raghav Mehta,Rajat Rasal,Esther Puyol-Antón,Samuel Gerber,Kersten Petersen,Michiel Schaap,Ben Glocker
Main category: cs.CV
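TL;DR: This paper proposes Positional Seg-CFT, which subdivides each anatomical structure into regional segments and models each segment independently, enabling spatially localized, anatomically coherent counterfactual image generation that improves on existing global-intervention methods.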
Details
Motivation: Existing counterfactual image generation methods either handle only subject-level factors or depend on manually defined masks, so they struggle to produce localized structural changes and tend to introduce global artifacts.
Method: Proposes Positional Seg-CFT: building on Seg-CFT, it subdivides each anatomical structure into multiple regional segments, derives measurements independently for each region, and uses them to supervise the corresponding structure variables, supporting spatially localized counterfactual edits.
Result: On coronary CT angiography data, Pos-Seg-CFT generates realistic, region-specific modifications and provides finer spatial control for modeling disease progression.
Conclusion: Positional Seg-CFT moves beyond global interventions to finer, anatomically plausible local counterfactual generation, broadening its potential in medical image analysis.
Abstract: Counterfactual image generation enables controlled data augmentation, bias mitigation, and disease modeling. However, existing methods guided by external classifiers or regressors are limited to subject-level factors (e.g., age) and fail to produce localized structural changes, often resulting in global artifacts. Pixel-level guidance using segmentation masks has been explored, but requires user-defined counterfactual masks, which are tedious and impractical. Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) addressed this by using segmentation-derived measurements to supervise structure-specific variables, yet it remains restricted to global interventions. We propose Positional Seg-CFT, which subdivides each structure into regional segments and derives independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals. Experiments on coronary CT angiography show that Pos-Seg-CFT generates realistic, region-specific modifications, providing finer spatial control for modeling disease progression.
[207] Reframing Long-Tailed Learning via Loss Landscape Geometry
Shenghan Chen,Yiming Liu,Yanzhen Wang,Yujia Wang,Xiankai Lu
Main category: cs.CV
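TL;DR: This paper proposes a continual-learning-inspired framework built on a loss-landscape perspective. Its Grouped Knowledge Preservation and Grouped Sharpness Aware modules mitigate the degradation of tail-class performance under long-tailed data, need no external samples or pre-trained models, and deliver significant gains on several benchmarks.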
Details
Motivation: Under long-tailed data distributions, models overfit head classes and quickly forget tail classes ("tail performance degradation"), and conventional methods struggle to balance the resulting performance trade-off.
Method: Starting from the loss landscape, proposes a continual-learning-inspired framework: (1) a Grouped Knowledge Preservation module memorizes group-specific convergence parameters to promote a shared flat solution; (2) a Grouped Sharpness Aware module explicitly optimizes loss-landscape geometry to seek flatter minima.
Result: Significantly outperforms state-of-the-art methods on four benchmark datasets, without external training samples or pre-trained models.
Conclusion: Tail performance degradation stems from the divergence of per-class convergence points in the loss landscape, aggravated at sharp, non-robust minima; the framework mitigates it by guiding the model toward a shared flat solution.
Abstract: Balancing performance trade-offs on long-tail (LT) data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called "tail performance degradation" (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Moreover, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual-learning-inspired framework to prevent "tail performance degradation". To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at: https://gkp-gsa.github.io/.
[208] A Large-Scale Remote Sensing Dataset and VLM-based Algorithm for Fine-Grained Road Hierarchy Classification
Ting Han,Xiangyi Xie,Yiping Chen,Yumeng Du,Jin Ma,Aiguang Li,Jiaan Liu,Yin Gao
Main category: cs.CV
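TL;DR: This paper presents SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework, for automatic multi-grade road mapping from remote sensing imagery.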
Details
Motivation: Existing road extraction methods struggle to handle road segmentation, topology reconstruction, and grade classification together, and no large-scale hierarchical road dataset supports joint multi-task training and evaluation.
Method: Builds the SYSU-HiRoads dataset with dense masks, vectorized centerlines, and three-level hierarchy labels; proposes RoadReasoner, which combines frequency-sensitive features with multi-scale context modeling and, at the skeleton-segment level, pairs geometric descriptors with geometry-aware textual prompts so that vision-language models can infer road grades.
Result: On SYSU-HiRoads and CHN6-CUG, RoadReasoner reaches 72.6% overall accuracy (OA), 64.2% F1 score, and 60.6% SegAcc, outperforming state-of-the-art methods.
Conclusion: RoadReasoner produces robust road surface masks, topology-preserving road networks, and semantically consistent grade assignments; the dataset and code will be released publicly to support automated transport infrastructure mapping and management.
Abstract: In this work, we present SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. SYSU-HiRoads is built from GF-2 imagery covering 3631 km² in Henan Province, China, and contains 1079 image tiles at 0.8 m spatial resolution. Each tile is annotated with dense road masks, vectorized centerlines, and three-level hierarchy labels, enabling the joint training and evaluation of segmentation, topology reconstruction, and hierarchy classification. Building on this dataset, RoadReasoner is designed to generate robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments. We strengthen road feature representation and network connectivity by explicitly enhancing frequency-sensitive cues and multi-scale context. Moreover, we perform hierarchy inference at the skeleton-segment level with geometric descriptors and geometry-aware textual prompts, queried by vision-language models to obtain linguistically interpretable grade decisions. Experiments on SYSU-HiRoads and the CHN6-CUG dataset show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate and semantically consistent road hierarchy maps with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc. The dataset and code will be publicly released to support automated transport infrastructure mapping, road inventory updating, and broader infrastructure management applications.
[209] Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
Jinyu Xu,Tianqi Hu,Xiaonan Hu,Letian Zhou,Songliang Cao,Meng Zhang,Hao Lu
Main category: cs.CV
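TL;DR: This paper presents TPC-268, the first plant counting benchmark that incorporates plant taxonomy. It contains 10,000 images and 678,050 point annotations covering 268 countable plant categories, supports hierarchical reasoning and species-aware evaluation, and advances research on fine-grained, class-agnostic plant counting.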
Details
Motivation: Visual counting has progressed considerably in crowd and traffic analysis, but fine-grained, taxonomy-aware plant counting remains neglected; plants have nonrigid morphologies and their appearance varies widely across growth stages and environments, so a benchmark with rich biological semantics is needed.
Method: Builds the TPC-268 benchmark with instance-level point annotations, Linnaean labels (kingdom to species), and organ categories; adopts the class-agnostic counting (CAC) setting, designs taxonomy-consistent, scale-aware data splits, and benchmarks mainstream regression- and detection-based CAC methods.
Result: TPC-268 covers 242 species of plants and fungi and 268 countable categories, with image scales ranging from canopy-level remote sensing to tissue-level microscopy; experiments reveal the limitations of existing CAC methods on this multi-scale, highly biodiverse task.
Conclusion: TPC-268 is the first biologically grounded counting benchmark that combines plant taxonomy with multi-scale observation, providing a solid basis for fine-grained, class-agnostic, biologically meaningful visual counting models.
Abstract: Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom -> species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at https://github.com/tiny-smart/TPC-268.
[210] QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression
Zhongyang Li,Yaqian Li,Faming Fang,Rinyoichi Takezoe,Zi-Hao Bo,Cheng Qian,Mo Guang,Guixu Zhang,Kaiwen Long
Main category: cs.CV
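TL;DR: This paper proposes the Query Guided Mixture-of-Projector (QMoP) framework, which compresses visual tokens through query-guided dynamic routing and multi-branch collaboration, substantially reducing computation and memory cost while maintaining performance, and builds the VTCBench benchmark to measure the information loss caused by compression.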
Details
Motivation: 多模态大语言模型面临视觉token远超文本token带来的计算与内存瓶颈;现有投影器方法依赖固定启发式策略,缺乏跨场景适应性。 Method: 提出QMoP框架,包含池化分支(粗粒度全局语义)、重采样分支(高层语义表示)和剪枝分支(细粒度关键token保留);引入Query Guided Router(QGR)动态选择与加权各分支输出;采用MoE风格融合机制聚合结果;并构建VTCBench基准评估视觉token压缩的信息损失。 Result: QMoP在多个任务上超越强基线,显著减少内存占用、计算量与推理时间,同时保持甚至提升模型性能。 Conclusion: 自适应、查询引导的多分支视觉token压缩框架QMoP有效缓解多模态大模型的效率瓶颈,且具备通用性与可扩展性;VTCBench为视觉token压缩研究提供了系统评估标准。 Abstract: Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and weights the outputs from different branches based on both visual input and textual queries. A Mixture-of-Experts-style fusion mechanism is designed to aggregate the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of Visual Token Compression, we also develop VTCBench, a dedicated benchmark for evaluating the information loss induced by visual token compression. 
Extensive experiments demonstrate that despite relying on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.
[211] DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture
Young-Seo Chang,Yatong An,Jae-Sang Hyun
Main category: cs.CV
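TL;DR: DepthTCM is a physics-aware end-to-end depth map compression framework that losslessly converts high-bit depth maps into a 3-channel image representation and then encodes and decodes it with a Transformer-CNN mixed network, substantially lowering bitrate and compute cost while preserving high accuracy.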
Details
Motivation: Existing depth map compression methods struggle to combine high fidelity with efficient compression; inspired by physical fringe-projection 3D measurement systems, the work seeks a depth representation that is physically interpretable, low in entropy, and well suited to learned codecs. Method: Proposes multiwavelength depth (MWD) encoding that losslessly maps a depth map to a smooth 3-channel image; globally quantizes it to 4 bits per channel to reduce entropy; and uses a neural network combining convolutional and Transformer layers for end-to-end compression and reconstruction. Result: On Middlebury 2014 it reaches 0.307 bpp while preserving 99.38% accuracy (comparable to lossless PNG); on the ScanNet++ iPhone subset, average encoding and decoding times are 41.48 ms and 47.45 ms respectively; 4-bit quantization lowers bitrate by 66% relative to 8-bit at a cost of only 0.68 dB PSNR and 0.04% accuracy; Transformer-CNN blocks improve PSNR by up to 0.75 dB over CNN-only architectures. Conclusion: Through a physics-inspired representation and a hybrid network architecture, DepthTCM strikes a good balance among compression ratio, fidelity, and real-time speed in depth map compression, validating the co-design of physical priors and deep learning. Abstract: We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In DepthTCM, the high-bit depth map is first losslessly converted to a conventional 3-channel image representation using a method inspired by a physical profilometry system based on sinusoidal fringe patterns; the 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel image using multiwavelength depth (MWD) encoding, globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it with a learned codec that combines convolutional and Transformer layers. Experimental results demonstrate the advantage of our proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.
[212] BHDD: A Burmese Handwritten Digit Dataset
Swan Htet Aung,Hein Htet,Htoo Say Wah Khaing,Thuya Myo Nyunt
Main category: cs.CV
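TL;DR: This paper introduces the Burmese Handwritten Digit Dataset (BHDD), comprising 87,561 grayscale 28×28 images, to support research on Burmese digit recognition, and reports high accuracy for several baseline models.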
Details
Motivation: Build the first large-scale public dataset of handwritten Burmese digits to fill the data gap in Burmese handwritten digit recognition and to support OCR and multilingual handwriting recognition. Method: Builds and releases BHDD; analyzes its class distribution, pixel statistics, and morphological variation; evaluates three baselines: an MLP, a two-layer CNN, and an improved CNN with batch normalization and data augmentation. Result: The three baselines reach 99.40%, 99.75%, and 99.83% test accuracy respectively; some Burmese digits are easily confused because of their rounded shapes. Conclusion: BHDD is a high-quality, diverse, and challenging new benchmark that provides a solid foundation for Burmese handwriting recognition and is openly released under CC BY-SA 4.0. Abstract: We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset's class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at https://github.com/baseresearch/BHDD
[213] Enhancing Brain Tumor Classification Using Vision Transformers with Colormap-Based Feature Representation on BRISC2025 Dataset
Faisal Ahmed
Main category: cs.CV
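TL;DR: This paper proposes a deep learning framework based on Vision Transformers with colormap-based feature representation for multi-class brain tumor MRI classification, reaching 98.90% accuracy and 99.97% AUC on the BRISC2025 dataset and outperforming CNN baselines such as ResNet and EfficientNet.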
Details
Motivation: Accurate brain tumor classification from MRI is critical for early diagnosis and treatment planning, yet conventional CNNs are limited in modeling long-range dependencies and in highlighting key structural and intensity variations. Method: Proposes an enhanced Vision Transformer framework that uses colormap techniques to map grayscale MRI images to color feature maps, emphasizing important anatomical and intensity differences, and uses ViT to model global dependencies. Result: On the four-class BRISC2025 task (glioma, meningioma, pituitary tumor, non-tumor) the method achieves 98.90% classification accuracy and 99.97% AUC, outperforming ResNet50, ResNet101, and EfficientNetB2; precision, recall, and F1-score are also strong. Conclusion: Combining ViT with colormap-based feature enhancement improves the accuracy and robustness of brain tumor classification and shows good potential for clinical decision support. Abstract: Accurate classification of brain tumors from magnetic resonance imaging (MRI) plays a critical role in early diagnosis and effective treatment planning. In this study, we propose a deep learning framework based on Vision Transformers (ViT) enhanced with colormap-based feature representation to improve multi-class brain tumor classification performance. The proposed approach leverages the ability of transformer architectures to capture long-range dependencies while incorporating color mapping techniques to emphasize important structural and intensity variations within MRI scans. Experiments are conducted on the BRISC2025 dataset, which includes four classes: glioma, meningioma, pituitary tumor, and non-tumor cases. The model is trained and evaluated using standard performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed method achieves a classification accuracy of 98.90%, outperforming baseline convolutional neural network models including ResNet50, ResNet101, and EfficientNetB2. In addition, the model demonstrates strong generalization capability with an AUC of 99.97%, indicating high discriminative performance across all classes. These results highlight the effectiveness of combining Vision Transformers with colormap-based feature enhancement for accurate and robust brain tumor classification and suggest strong potential for clinical decision support applications.
[214] ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Haichao Zhang,Yijiang Li,Shwai He,Tushar Nagarajan,Mingfei Chen,Jianglin Lu,Ang Li,Yun Fu
Main category: cs.CV
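TL;DR: This paper proposes a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance through a dual-temporal pathway (a dense JEPA branch and a uniformly sampled VLM "thinker" branch), improving prediction of future video states and showing better long-horizon rollout on hand-manipulation trajectory prediction.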
Details
Motivation: Existing latent world models (e.g., V-JEPA2) make dense predictions from short observation windows and struggle to capture long-horizon semantics; vision-language models (VLMs) have strong semantic ability but are unsuitable as standalone dense predictors because of sparse sampling, a text-output bottleneck, and difficulty adapting to small datasets. Method: Proposes a dual-temporal pathway framework: (1) a dense JEPA branch models fine-grained motion and interaction; (2) a uniformly sampled VLM "thinker" branch with a large stride provides knowledge-rich long-horizon semantic guidance; a hierarchical pyramid representation extraction module aggregates multi-layer VLM representations into guidance features suited to latent prediction. Result: On hand-manipulation trajectory prediction, the method outperforms both a VLM-only baseline and a JEPA-predictor baseline and shows more robust long-horizon rollout behavior. Conclusion: Combining the semantic guidance of VLMs with the dense dynamics modeling of JEPA mitigates the limitations of each and improves the predictive performance and practicality of latent world models in long-horizon, action-conditioned settings. Abstract: Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction.
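At the level of frame sampling, the dual-temporal pathway described in this abstract can be illustrated as below. This is only a sketch under stated assumptions: the function name, the choice of the most recent `dense_window` frames for the dense branch, and the parameter `vlm_stride` are hypothetical and are not taken from the paper.

```python
def dual_pathway_indices(num_frames, dense_window, vlm_stride):
    """Return (dense_indices, sparse_indices) for a clip of num_frames frames.

    Hypothetical helper: the dense list feeds a JEPA-style branch, the
    sparse list feeds a VLM 'thinker' branch with a larger temporal stride.
    """
    if num_frames <= 0 or dense_window <= 0 or vlm_stride <= 0:
        raise ValueError("all arguments must be positive")
    # Dense branch: every frame inside a short, recent observation window.
    start = max(0, num_frames - dense_window)
    dense = list(range(start, num_frames))
    # Thinker branch: uniformly spaced frames covering the whole clip.
    sparse = list(range(0, num_frames, vlm_stride))
    return dense, sparse


dense, sparse = dual_pathway_indices(num_frames=32, dense_window=8, vlm_stride=8)
```

The point of the sketch is the asymmetry: the dense list is short but contiguous, while the sparse list spans the full clip at a coarse stride, which is where the long-horizon context comes from.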
Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
[215] CornOrb: A Multimodal Dataset of Orbscan Corneal Topography and Clinical Annotations for Keratoconus Detection
Mohammed El Amine Lazouni,Leila Ryma Lazouni,Zineb Aziza Elaouaber,Mohammed Ammar,Sofiane Zehar,Mohammed Youcef Bouayad Agha,Ahmed Lazouni,Amel Feroui,Ali H. Al-Timemy,Siamak Yousefi,Mostafa El Habib Daho
Main category: cs.CV
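TL;DR: CornOrb is a public multimodal corneal topography dataset from Algeria containing Orbscan images and clinical annotations for 1,454 eyes (744 patients), intended to support AI-based keratoconus detection research for African populations.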
Details
Motivation: The lack of large-scale multimodal corneal topography datasets from Africa limits the development of AI diagnostic models for keratoconus in the region's populations. Method: Collects and anonymizes Orbscan examination data from Algerian patients, including four corneal maps (axial curvature, anterior and posterior elevation, pachymetry) and structured clinical tabular data, all pre-processed into standardized PNG and CSV formats. Result: Builds CornOrb, a public multimodal dataset of 1,454 eyes (889 normal, 565 keratoconus), released on Zenodo. Conclusion: CornOrb fills the gap in corneal topography datasets of African origin and provides a high-quality, ready-to-use benchmark resource for multimodal AI-based keratoconus detection and analysis. Abstract: In this paper, we present CornOrb, a publicly accessible multimodal dataset of Orbscan corneal topography images and clinical annotations collected from patients in Algeria. The dataset comprises 1,454 eyes from 744 patients, including 889 normal eyes and 565 keratoconus cases. For each eye, four corneal maps are provided (axial curvature, anterior elevation, posterior elevation, and pachymetry), together with structured tabular data including demographic information and key clinical parameters such as astigmatism, maximum keratometry (Kmax), central and thinnest pachymetry, and anterior/posterior asphericity. All data were retrospectively acquired, fully anonymized, and pre-processed into standardized PNG and CSV formats to ensure direct usability for artificial intelligence research. This dataset represents one of the first large-scale Orbscan-based resources from Africa, specifically built to enable robust AI-driven detection and analysis of keratoconus using multimodal data. The data are openly available at Zenodo.
[216] WorldCache: Content-Aware Caching for Accelerated Video World Models
Umair Nawaz,Ahmed Heakl,Ufaq Khan,Abdelrahman Shaker,Salman Khan,Fahad Shahbaz Khan
Main category: cs.CV
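TL;DR: WorldCache is a training-free, perception-constrained dynamical caching framework for accelerating video generation inference with Diffusion Transformers (DiTs); using motion-adaptive thresholds, saliency-weighted drift estimation, approximation via blending and warping, and phase-aware threshold scheduling, it achieves a 2.3× speedup while preserving 99.4% of baseline quality.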
Details
Motivation: Existing training-free feature caching methods rely on a Zero-Order Hold assumption, which tends to cause ghosting, blur, and motion inconsistency in dynamic scenes; a more robust and perceptually sound dynamic caching mechanism is needed. Method: Proposes the WorldCache framework, comprising motion-adaptive thresholds, saliency-weighted drift estimation, optimal feature approximation via blending and optical-flow warping, and phase-aware threshold scheduling across diffusion steps, all without retraining the model. Result: With the Cosmos-Predict2.5-2B model evaluated on PAI-Bench, WorldCache achieves a 2.3× inference speedup while preserving 99.4% of the original generation quality, substantially outperforming prior training-free caching methods. Conclusion: WorldCache shows that perception-driven, dynamics-aware feature caching can improve the efficiency-quality trade-off of DiT video generation and offers a new paradigm for training-free acceleration. Abstract: Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at https://umair1221.github.io/World-Cache/.
[217] Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
Yuntian Bo,Yazhou Zhu,Piotr Koniusz,Haofeng Zhang
Main category: cs.CV
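TL;DR: This paper proposes FoB (Focus on Background), a background-centric prompt generator that improves SAM-based few-shot medical image segmentation by accurately generating and localizing background prompts to suppress over-segmentation, yielding significant performance gains and cross-domain generalization.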
Details
Motivation: Conventional few-shot medical image segmentation methods face performance bottlenecks; SAM offers category-agnostic segmentation but tends to over-segment medical images because anatomical boundaries are ambiguous. Method: Reformulates SAM-based FSMIS as a prompt localization task and proposes the background-centric FoB model, which (1) generates support background prompts in a category-agnostic way; (2) localizes these prompts directly in the query image; (3) models foreground-background spatial dependencies; and (4) exploits the structural regularity of background prompts in medical images for progressive refinement. Result: Outperforms baseline methods by large margins on three diverse medical image datasets, reaching state-of-the-art FSMIS performance with strong cross-domain generalization. Conclusion: By accurately generating and localizing background prompts, FoB effectively mitigates SAM's over-segmentation on medical images and offers a new paradigm for few-shot medical image segmentation. Abstract: Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM's over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostic generation of support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieving state-of-the-art performance on FSMIS, and exhibiting strong cross-domain generalization. Our code is available at https://github.com/primebo1/FoB_SAM.
[218] When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
Zhengxian Wu,Kai Shi,Chuanrui Zhang,Zirui Liao,Jun Yang,Ni Yang,Qiuying Peng,Luyuan Zhang,Hangrui Xu,Tianhuang Su,Zhenyu Yang,Haonan Lu,Haoqian Wang
Main category: cs.CV
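TL;DR: This paper proposes an unsupervised self-evolution training framework (SelfJudge) that needs no human annotation or external reward model; guided by Group Relative Policy Optimization (GRPO) and self-consistency signals, it improves the performance and generalization of multimodal large language models on mathematical reasoning tasks.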
Details
Motivation: Existing methods depend on costly human-annotated data or teacher-model distillation and are hard to scale; a low-cost, scalable unsupervised training paradigm is needed. Method: Proposes the SelfJudge framework: (1) sample multiple reasoning trajectories per input; (2) use the Actor's self-consistency as a prior; (3) introduce a bounded Judge module that dynamically reweights trajectories by quality; (4) model the scores as a within-group distribution and convert them to relative advantages; (5) train on unlabeled data with Group Relative Policy Optimization (GRPO). Result: Consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, confirming the method's effectiveness and scalability. Conclusion: SelfJudge offers multimodal large language models a scalable path to self-evolution without relying on human annotation or external reward models. Abstract: Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://dingwu1021.github.io/SelfJudge/.
[219] Text-Image Conditioned 3D Generation
Jiazhong Cen,Jiemin Fang,Sikuang Li,Guanjun Wu,Chen Yang,Taoran Yi,Zanwei Zhou,Zhikuan Bao,Lingxi Xie,Wei Shen,Qi Tian
Main category: cs.CV
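TL;DR: This paper proposes TIGON, a model that conditions on both text and image to generate 3D content more flexibly and with higher fidelity, demonstrating that cross-modal complementarity is effective for 3D generation.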
Details
Motivation: Most existing 3D generators condition on a single modality (image only or text only) and are limited by viewpoint bias or lack of detail respectively; combining the strengths of both is needed to better express user intent and improve generation quality. Method: Formalizes the Text-Image Conditioned 3D Generation task and builds TIGON, a lightweight dual-branch model with separate image and text branches and a lightweight cross-modal fusion mechanism. Result: Experiments show that joint text-image conditioning consistently improves over single-modality methods, and that even simple late fusion outperforms single-modality models, confirming cross-modal complementarity. Conclusion: Conditioning on both text and image is a feasible and efficient route to more flexible and faithful 3D generation, providing a new baseline and direction for future multimodal 3D generation research. Abstract: High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page
[220] Identity-Consistent Video Generation under Large Facial-Angle Variations
Bin Hu,Zipeng Qi,Guoxi Huang,Zunnan Xu,Ruicheng Zhang,Chongjie Ye,Jun Zhou,Xiu Li,Jingdong Wang
Main category: cs.CV
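TL;DR: This paper proposes the Mv²ID framework, which uses multi-view conditioning and a region-masking training strategy to improve identity consistency while keeping motion natural under in-paired supervision, and also designs a new positional encoding mechanism and evaluation metrics.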
Details
Motivation: Single-view reference-to-video methods struggle to preserve identity under large facial-angle variations, while naively adding multi-view reference images worsens the view-dependent copy-paste problem and harms the naturalness of facial motion; collecting cross-paired data is costly. Method: Proposes the multi-view conditioned framework Mv²ID, which uses a region-masking training strategy to prevent shortcut learning and extract essential identity features, and designs a reference decoupled-RoPE mechanism that assigns distinct positional encodings to video and conditioning tokens; also builds a large-scale multi-angle facial dataset and defines dedicated evaluation metrics. Result: Experiments show that, without cross-paired data, the method significantly improves identity consistency while maintaining motion naturalness, outperforming existing methods that rely on cross-paired data. Conclusion: Through complementary multi-view modeling and a structured training strategy, Mv²ID balances identity consistency and motion naturalness at low data cost, offering a new paradigm for reference-driven video generation. Abstract: Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance consistency and naturalness, we propose Mv²ID, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
[221] F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting
Injae Kim,Chaehyeon Kim,Minseong Bae,Minseok Joo,Hyunwoo J. Kim
Main category: cs.CV
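TL;DR: This paper proposes F4Splat, which uses a predictive densification strategy to allocate 3D Gaussians adaptively over space, improving reconstruction quality and rendering efficiency while controlling the total number of Gaussians.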
Details
Motivation: Existing feed-forward 3D Gaussian Splatting methods use fixed pixel-to-Gaussian or voxel-to-Gaussian mappings, which produce redundant Gaussians across views, and they lack an effective way to control the total number of Gaussians while preserving reconstruction fidelity. Method: Proposes F4Splat with a densification-score-guided allocation strategy: the model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control of the final Gaussian budget without retraining. Result: Clearly outperforms prior uncalibrated feed-forward methods, achieving better novel-view synthesis with fewer Gaussians. Conclusion: F4Splat yields compact yet high-quality 3D representations, addresses Gaussian redundancy and budget control, and improves the practicality and efficiency of feed-forward 3D Gaussian Splatting. Abstract: Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.
[222] Privacy-Preserving Federated Action Recognition via Differentially Private Selective Tuning and Efficient Communication
Idris Zakariyya,Pai Chet Ng,Kaushik Bhargav Sivangi,S. Mohammad Sheikholeslami,Konstantinos N. Plataniotis,Fani Deligianni
Main category: cs.CV
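TL;DR: This paper proposes the FedDP-STECAR framework, which addresses model exposure and communication overhead in federated video action recognition through selective fine-tuning under differential privacy and efficient communication, improving accuracy and training speed while guaranteeing strong privacy (ε = 0.65).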
Details
Motivation: Federated video action recognition faces two major challenges: model exposure (gradients leak motion patterns) and heavy communication overhead (full-model synchronization of high-dimensional video networks). Method: Proposes FedDP-STECAR, which selectively fine-tunes and perturbs only a small set of task-relevant layers under differential privacy and uploads only those layers for aggregation, greatly reducing both leakage risk and communication volume. Result: On UCF-101 with the MViT-B-16x4 model, accuracy is up to 70.2% higher at ε = 0.65 in the centralized setting; in the federated setting training is 48% faster with 73.1% accuracy, and communication is reduced by over 99%. Conclusion: FedDP-STECAR delivers scalable federated video action recognition with strong privacy, high efficiency, and high accuracy, preserving temporal feature coherence while keeping communication efficient. Abstract: Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: model exposure and communication overhead. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks causes significant bandwidth and communication costs. To address these issues, we propose Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition, namely FedDP-STECAR. Our FedDP-STECAR framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing the surface of information leakage while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99% compared to full-model updates. Experiments on the UCF-101 dataset using the MViT-B-16x4 transformer show that FedDP-STECAR achieves up to 70.2% higher accuracy under strict privacy (ε = 0.65) in centralized settings and 48% faster training with 73.1% accuracy in federated setups, enabling scalable and privacy-preserving video action recognition. Code available at https://github.com/izakariyya/mvit-federated-videodp
[223] Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos
Masoumeh Sharafi,Muhammad Osama Zeeshan,Soufiane Belharbi,Alessandro Lameiras Koerich,Marco Pedersoli,Eric Granger
Main category: cs.CV
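TL;DR: This paper proposes TTA-CaP, a cache-based test-time adaptation method for personalizing vision-language models in video facial expression recognition; it uses three caches (source-domain prototypes, high-confidence target samples, low-confidence negative samples) and a tri-gate update policy to achieve gradient-free, low-overhead personalization.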
Details
Motivation: Most existing test-time adaptation (TTA) methods rely on unsupervised parameter optimization, which is computationally expensive; methods based on dynamic caches are vulnerable to noisy pseudo-labels that cause error accumulation and drift, making them ill-suited to video FER under inter-subject distribution shift. Method: Proposes TTA-CaP, which builds three coordinated caches: a personalized source cache (storing source-domain prototypes), a positive target cache (accumulating high-confidence subject-specific samples), and a negative target cache (storing low-confidence samples as negatives); a tri-gate mechanism based on temporal stability, prediction confidence, and consistency with the source cache controls cache updates and replacement; finally, embedding fusion improves prediction robustness and temporal stability. Result: On three video FER datasets (BioVid, StressID, and BAH), TTA-CaP outperforms state-of-the-art TTA methods under subject-specific and environmental shifts while keeping compute and memory overhead low. Conclusion: TTA-CaP is an efficient, lightweight, and robust cache-based test-time adaptation framework that offers a practical personalization solution for deploying vision-language models in real-world video FER. Abstract: Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions.
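The tri-gate cache update described above can be sketched as follows. This is a hedged illustration, not the authors' algorithm: the thresholds `hi` and `lo`, the use of the previous frame's label as the temporal-stability test, nearest-prototype agreement as the consistency test, and the eviction rule are all assumptions introduced here, and embeddings are assumed to be non-zero vectors.

```python
import math

def cosine(a, b):
    # Cosine similarity; assumes neither vector is all zeros.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def nearest_prototype(embedding, prototypes):
    # prototypes: dict mapping label -> source-domain prototype embedding.
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))

def tri_gate(embedding, label, confidence, prev_label, prototypes,
             hi=0.8, lo=0.4):
    """Decide which cache (if any) a test frame should enter."""
    stable = label == prev_label                       # temporal stability
    consistent = nearest_prototype(embedding, prototypes) == label
    if confidence >= hi and stable and consistent:
        return "positive"
    if confidence <= lo:
        return "negative"
    return "skip"

def update_cache(cache, entry, capacity, keep_high=True):
    """Append entry = (confidence, label, embedding); evict one if over capacity.

    The positive cache drops its least confident entry; the negative cache
    (keep_high=False) drops its most confident one.
    """
    cache.append(entry)
    if len(cache) > capacity:
        pick = min if keep_high else max
        cache.remove(pick(cache, key=lambda e: e[0]))


prototypes = {"pain": [1.0, 0.0], "neutral": [0.0, 1.0]}
positive_cache = []
decision = tri_gate([0.9, 0.1], "pain", 0.95, "pain", prototypes)
if decision == "positive":
    update_cache(positive_cache, (0.95, "pain", [0.9, 0.1]), capacity=16)
```

No gradients are computed anywhere in this loop, which is the property the abstract emphasizes; only the cache contents change at test time.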
Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.
[224] KHMP: Frequency-Domain Kalman Refinement for High-Fidelity Human Motion Prediction
Wenhan Wu,Zhishuai Guo,Chen Chen,Srijan Das,Hongfei Xue,Pu Wang,Aidong Lu
Main category: cs.CV
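TL;DR: This paper proposes the KHMP framework, which combines adaptive Kalman filtering with physical constraints to denoise and smooth human motion predictions in the DCT domain, significantly improving prediction accuracy and physical plausibility.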
Details
Motivation: Existing generative human motion prediction methods often produce high-frequency jitter and temporal discontinuities, which degrade prediction quality and physical plausibility. Method: Proposes the KHMP framework, which applies an adaptive Kalman filter in the DCT domain and adjusts its noise parameters dynamically from the estimated signal-to-noise ratio; physical constraints such as temporal smoothness and joint angle limits are imposed during training. Result: Achieves state-of-the-art accuracy on Human3.6M and HumanEva-I, effectively suppresses jitter, and generates smoother motion sequences that follow biomechanical principles. Conclusion: KHMP integrates adaptive signal processing with physics-informed learning, offering a new paradigm for stochastic human motion prediction. Abstract: Stochastic human motion prediction aims to generate diverse, plausible futures from observed sequences. Despite advances in generative modeling, existing methods often produce predictions corrupted by high-frequency jitter and temporal discontinuities. To address these challenges, we introduce KHMP, a novel framework featuring an adaptive Kalman filter applied in the DCT domain to generate high-fidelity human motion predictions. By treating high-frequency DCT coefficients as a frequency-indexed noisy signal, the Kalman filter recursively suppresses noise while preserving motion details. Notably, its noise parameters are dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR), enabling aggressive denoising for jittery predictions and conservative filtering for clean motions. This refinement is complemented by training-time physical constraints (temporal smoothness and joint angle limits) that encode biomechanical principles into the generative model. Together, these innovations establish a new paradigm integrating adaptive signal processing with physics-informed learning. Experiments on the Human3.6M and HumanEva-I datasets demonstrate that KHMP achieves state-of-the-art accuracy, effectively mitigating jitter artifacts to produce smooth and physically plausible motions.
[225] EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
Haolan Xu,Keli Cheng,Lei Wang,Ning Bi,Xiaoming Liu
Main category: cs.CV
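TL;DR: This paper proposes EmoTaG, a few-shot emotion-aware 3D talking head synthesis method built on the pretrain-and-adapt paradigm; by modeling motion in the FLAME parameter space and designing a Gated Residual Motion Network (GRMN), it improves emotional expressiveness, lip synchronization, visual realism, and motion stability.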
Details
Motivation: Existing few-shot 3D talking head synthesis methods suffer from geometric instability and audio-emotion mismatch under expressive facial motion, calling for more effective emotion-aware motion modeling. Method: Proposes the EmoTaG framework, which (1) reformulates motion prediction in the structured FLAME parameter space rather than directly deforming 3D Gaussians; and (2) designs a Gated Residual Motion Network (GRMN) that extracts emotional prosody from audio and supplements the head pose and upper-face cues that audio lacks. Result: Achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability. Conclusion: Modeling motion in the FLAME parameter space with explicit geometric priors, combined with GRMN, enables audio-driven 3D talking head generation that is emotional, stable, and coherent, confirming the value of emotion-aware motion modeling. Abstract: Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
[226] Efficient Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution
Yu-Shan Tai,An-Yeu Wu
Main category: cs.CV
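TL;DR: This paper proposes Coarse-to-Fine Denoising (C2F) and Time Step Sequence Redistribution (TRD) to substantially cut the computational cost of diffusion models on edge devices while keeping generation quality nearly lossless.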
Details
Motivation: Diffusion models excel at image generation, but their multi-step denoising process is computationally expensive and hard to deploy on resource-constrained edge devices; existing model compression and time-step adjustment methods overlook input redundancy and require long search times. Method: Proposes Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation, and designs Time Step Sequence Redistribution (TRD) to adjust the sampling trajectory efficiently, with a search time under 10 minutes. Result: Achieves an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church with near-lossless generation performance. Conclusion: C2F and TRD together balance the efficiency and quality of diffusion models and suggest a new route to efficient edge deployment. Abstract: Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.
[227] Respiratory Status Detection with Video Transformers
Thomas Savage,Evan Madill
Main category: cs.CV
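TL;DR: This study explores whether a video transformer (ViViT) can recognize respiratory distress; it builds a video dataset from post-exercise recovery and proposes a method combining Lie Relative Encodings with Motion Guided Masking, reaching an F1 score of 0.81 on a temporal ordering task.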
Details
Motivation: Recognizing respiratory distress by visual inspection is a key clinical skill, and early detection can save lives; this work tests whether AI can automatically recognize such subtle changes in respiratory mechanics from video. Method: Records healthy volunteers recovering after exercise, labels short clips according to the natural change in respiratory status, and constructs a temporal ordering task; uses a ViViT model with Lie Relative Encodings (LieRE) and Motion Guided Masking, together with an embedding-based comparison strategy. Result: The proposed method achieves an F1 score of 0.81 on the temporal ordering task for respiratory distress recognition. Conclusion: Modern video transformers can recognize subtle changes in respiratory mechanics, opening new possibilities for contactless respiratory status monitoring. Abstract: Recognition of respiratory distress through visual inspection is a life-saving clinical skill. Clinicians can detect early signs of respiratory deterioration, creating a valuable window for earlier intervention. In this study, we evaluate whether recent advances in video transformers can enable Artificial Intelligence systems to recognize the signs of respiratory distress from video. We collected videos of healthy volunteers recovering after strenuous exercise and used the natural recovery of each participant's respiratory status to create a labeled dataset for respiratory distress. Splitting the video into short clips, with earlier clips corresponding to more shortness of breath, we designed a temporal ordering challenge to assess whether an AI system can detect respiratory distress. We found that a ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with an embedding-based comparison strategy, can achieve an F1 score of 0.81 on this task. Our findings suggest that modern video transformers can recognize subtle changes in respiratory mechanics.
[228] FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
Yuqiu Liu,Jialin Song,Marissa Ramirez de Chanlatte,Rochishnu Chowdhury,Rushil Paresh Desai,Wuyang Chen,Daniel Martin,Michael Mahoney
Main category: cs.CV
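TL;DR: This paper proposes FluidGaussian, which couples geometry reconstruction with fluid-structure interaction, uses fluid simulation to define an uncertainty metric, and combines it with active learning to improve the visual and physical fidelity of 3D reconstruction.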
Details
Motivation: Existing 3D reconstruction methods based on multi-view images focus too heavily on visual fidelity and neglect physical interaction and functional constraints, so the reconstructions violate physical laws and interact unrealistically.
Method: Proposes FluidGaussian, a plug-and-play method that brings high-granularity fluid simulation into the 3D reconstruction pipeline, defines a simulation-based surface uncertainty metric, and uses active learning to select views that jointly improve visual and physical consistency.
Result: On NeRF Synthetic, Mip-NeRF 360, and DrivAerNet++, the method yields up to +8.6% PSNR and -62.3% velocity divergence in fluid simulation relative to baselines.
Conclusion: Explicitly modeling physical interaction, especially fluid-structure coupling, as a guidance signal for reconstruction improves both the physical plausibility and the visual quality of 3D reconstruction, and suggests a direction for reconstruction aimed at the physical world.
Abstract: Real objects that inhabit the physical world follow physical laws and thus behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view 2D images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. All these can lead to unphysical and implausible interactions. To address this, we consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object functionality, beyond visual cues? To answer this question, we propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. We define a simulation-based uncertainty metric induced by fluid simulations and integrate it with active learning to prioritize views that improve both visual and physical fidelity. In an empirical evaluation on NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our FluidGaussian method yields up to +8.6% visual PSNR (Peak Signal-to-Noise Ratio) and -62.3% velocity divergence during fluid simulations. Our code is available at https://github.com/delta-lab-ai/FluidGaussian.
[229] Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation
Zengqun Zhao,Yanzuo Lu,Ziquan Liu,Jifei Song,Jiankang Deng,Ioannis Patras
Main category: cs.CV
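TL;DR: This paper proposes Relax Forcing, a structured temporal memory mechanism for autoregressive video diffusion models. It decomposes historical context into three functional roles (Sink, Tail, and dynamically selected History) to mitigate error accumulation and motion degradation in long-horizon generation.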
Details
Motivation: Existing self-forcing strategies mitigate exposure bias but remain limited by temporal degradation in minute-scale long video generation; the authors find that the root cause is not insufficient memory capacity but how temporal memory is used during inference.
Method: Proposes Relax Forcing, which divides historical frames by function into Sink (global stability), Tail (short-term continuity), and dynamically selected History (structural motion guidance), and selectively models only the most relevant history rather than attending densely to all generated history.
Result: On VBench-Long, Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead.
Conclusion: Structured temporal memory is key to scalable long video generation and complements existing forcing-based training strategies.
Abstract: Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution.
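To make the Sink/Tail/History decomposition described in this abstract more concrete, here is a minimal Python sketch of how a structured memory could pick which past frames to keep. It is an illustration under stated assumptions, not the authors' implementation: the function name `select_memory`, the parameters `n_sink`, `n_tail`, `n_history`, and the per-frame relevance `scores` are all hypothetical, since the abstract does not say how History frames are selected.

```python
from typing import List, Sequence


def select_memory(num_past: int, scores: Sequence[float],
                  n_sink: int = 1, n_tail: int = 2,
                  n_history: int = 2) -> List[int]:
    """Return sorted indices of past frames to keep in the KV memory.

    scores[i] is a hypothetical relevance score of past frame i for the
    frame being generated; the abstract does not define this quantity.
    """
    if len(scores) != num_past:
        raise ValueError("need exactly one score per past frame")
    # Sink: the earliest frames, kept for global stability.
    sink = set(range(min(n_sink, num_past)))
    # Tail: the most recent frames, kept for short-term continuity.
    tail = set(range(max(num_past - n_tail, 0), num_past))
    fixed = sink | tail
    # History: top-scoring frames among those outside Sink and Tail.
    middle = [i for i in range(num_past) if i not in fixed]
    history = sorted(middle, key=lambda i: scores[i], reverse=True)[:n_history]
    return sorted(fixed | set(history))
```

With eight past frames and the defaults, the memory holds at most five frames instead of eight, which is the kind of sparsity that would reduce attention cost; the actual selection rule and budget in the paper may differ.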
Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
[230] HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis
Mohamed A Mabrok
Main category: cs.CV
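TL;DR: HamVision is a medical image analysis framework based on the damped harmonic oscillator. Its phase-space decomposition yields position, momentum, and energy representations that are used for segmentation (HamSeg) and classification (HamCls), reaching or approaching state-of-the-art performance on multiple medical datasets.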
Details
Motivation: Introduce a physics-inspired structured inductive bias (the damped harmonic oscillator) to improve the interpretability and generalization of medical image analysis models without relying on large amounts of annotation or complex architecture design.
Method: Medical image features are modeled as the dynamics of a damped harmonic oscillator, which naturally yields position q (content), momentum p (boundary and texture gradients), and energy H (a parameter-free saliency map); HamSeg uses energy to gate skip connections and injects momentum at every decoder level; HamCls globally pools the three representations and concatenates them into a phase-space feature vector.
Result: HamSeg achieves the highest Dice scores on ISIC 2018, ISIC 2017, TN3K, and ACDC (87.05% to 92.40%) with only 8.57M parameters; HamCls achieves the highest accuracy on BloodMNIST and PathMNIST (98.85% and 96.65%) and is competitive with MedMamba and MedViT on the remaining MedMNIST datasets; diagnostic analysis confirms that momentum and energy carry physically intuitive, task-relevant semantics.
Conclusion: Harmonic oscillator dynamics can serve as a strong inductive bias that produces semantically clear, task-adapted representations without extra supervision, suggesting a route to interpretable, lightweight, general-purpose medical vision models.
Abstract: We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator's phase-space decomposition yields three functionally distinct representations: position $q$ (feature content), momentum $p$ (spatial gradients that encode boundary and texture information), and energy $H = \tfrac{1}{2}|z|^2$ (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC 2018 (89.38%), ISIC 2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT.
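The phase-space quantities named in this abstract can be illustrated with a scalar damped oscillator. The sketch below is only a toy under explicit assumptions: it takes $z = q + ip$ so that $H = \tfrac{1}{2}|z|^2 = \tfrac{1}{2}(q^2 + p^2)$, and it integrates $\ddot{q} + \gamma\dot{q} + \omega^2 q = u$ with a semi-implicit Euler step. The definition of $z$, the integrator, and the names `oscillator_step`, `energy`, `omega`, `gamma`, and `drive` are hypothetical and are not taken from the paper.

```python
def oscillator_step(q, p, drive, omega=1.0, gamma=0.1, dt=0.1):
    """One semi-implicit Euler step of q'' + gamma*q' + omega^2*q = drive.

    q is the position and p the momentum (unit mass assumed).
    """
    # Update momentum from the restoring force, damping, and input drive.
    p_next = p + dt * (drive - gamma * p - omega ** 2 * q)
    # Update position using the new momentum.
    q_next = q + dt * p_next
    return q_next, p_next


def energy(q, p):
    """Energy H = 0.5 * |z|^2, assuming z = q + i*p (an assumption)."""
    z = complex(q, p)
    return 0.5 * abs(z) ** 2


if __name__ == "__main__":
    q, p = 1.0, 0.0
    for _ in range(200):
        q, p = oscillator_step(q, p, drive=0.0, gamma=0.5)
    # Without a drive, damping should shrink the energy over time.
    print(energy(q, p) < energy(1.0, 0.0))
```

In an image model the same update would presumably be applied per feature channel and spatial location, with the feature map acting as the drive; that mapping is a guess and is not specified in the abstract.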
Diagnostic analysis confirms that the oscillator's momentum consistently encodes an interior > boundary > exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at https://github.com/Minds-R-Lab/hamvision.
[231] An InSAR Phase Unwrapping Framework for Large-scale and Complex Events
Yijia Song,Juliet Biggs,Alin Achim,Robert Popescu,Simon Orrego,Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: 本文提出了一种基于扩散模型的相位解缠框架,用于处理大规模InSAR干涉图及地震引起的相位不连续问题。
Details
Motivation: Conventional phase unwrapping algorithms tend to fail on earthquake-induced surface ruptures and abrupt displacements; existing learning-based methods are restricted to fixed, relatively small input sizes and fit poorly to real InSAR images, which are large-scale and spatially heterogeneous.
Method: A phase unwrapping framework is built on a diffusion model architecture to model and recover physically consistent unwrapped phase fields, particularly in the presence of fault-related phase jumps.
Result: Experiments on synthetic and real datasets show that the method handles phase discontinuities caused by near-surface deformation and scales well to large InSAR images.
Conclusion: The diffusion-based method offers a practical alternative for automatic phase unwrapping in complex deformation scenarios and addresses key limitations of conventional and existing learning-based methods.
Abstract: Phase unwrapping remains a critical and challenging problem in InSAR processing, particularly in scenarios involving complex deformation patterns. In earthquake-related deformation, shallow sources can generate surface-breaking faults and abrupt displacement discontinuities, which severely disrupt phase continuity and often cause conventional unwrapping algorithms to fail. Another limitation of existing learning-based unwrapping methods is their reliance on fixed and relatively small input sizes, while real InSAR interferograms are typically large-scale and spatially heterogeneous. This mismatch restricts the applicability of many neural network approaches to real-world data. In this work, we present a phase unwrapping framework based on a diffusion model, developed to process large-scale interferograms and to address phase discontinuities caused by deformation. By leveraging a diffusion model architecture, the proposed method can recover physically consistent unwrapped phase fields even in the presence of fault-related phase jumps. Experimental results on both synthetic and real datasets demonstrate that the method effectively addresses discontinuities associated with near-surface deformation and scales well to large InSAR images, offering a practical alternative to manual unwrapping in challenging scenarios.
[232] Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
Nikolay Kormushev,Josip Šarić,Matej Kristan
Main category: cs.CV
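TL;DR: This paper proposes OVRCOAT, a framework that uses CLIP-conditioned objectness adjustment (COAT) and open-vocabulary mask-to-text refinement (OVR) to address mask selection bias and limited regional understanding in open-vocabulary panoptic segmentation, setting a new state of the art on several datasets.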
Details
Motivation: Open-vocabulary panoptic segmentation is limited by two coupled problems: objectness heads trained on closed vocabularies suppress masks of categories unseen in training (mask selection bias), and vision-language models such as CLIP are optimized for global image classification and lack an understanding of local regions.
Method: Proposes the OVRCOAT framework: (1) CLIP-conditioned objectness adjustment (COAT) corrects background/foreground probabilities and preserves high-quality open-vocabulary masks; (2) open-vocabulary mask-to-text refinement (OVR) strengthens CLIP's region-level alignment, improving classification of seen and unseen classes at lower memory cost.
Result: PQ improves by 5.5% on ADE20K, and by 7.1% and 3% on Mapillary Vistas and Cityscapes, respectively.
Conclusion: With a simple modular design, OVRCOAT jointly improves objectness estimation and mask recognition, easing the core bottlenecks of open-vocabulary panoptic segmentation while remaining efficient and extensible.
Abstract: Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP's region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: https://github.com/nickormushev/OVRCOAT
[233] Knowledge Priors for Identity-Disentangled Open-Set Privacy-Preserving Video FER
Feng Xu,Xun Li,Lars Petersson,Yulei Sui,David Ahmedt Aristizabal,Dadong Wang
Main category: cs.CV
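TL;DR: This paper proposes a two-stage privacy-preserving framework for video facial expression recognition that needs no identity labels. An identity-suppression network and a denoising module balance privacy protection and recognition performance in open-set settings, and a falsification-based validation method evaluates privacy robustness.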
Details
Motivation: Existing privacy-preserving methods fail in realistic open-set video settings, where identities are unknown and identity labels are unavailable, while facial data inherently expose identity and pose serious privacy risks.
Method: Proposes a two-stage framework: stage one builds intra- and inter-video knowledge priors from real videos without identity labels and trains an identity-suppression network that anonymizes identity while preserving expressive cues; stage two uses a denoising module to restore expression-related information; a falsification-based validation method uses recognition priors to evaluate privacy robustness.
Result: Experiments on three video datasets show that the method protects privacy effectively while keeping FER accuracy comparable to identity-supervised baselines.
Conclusion: The framework achieves privacy protection for open-set video FER without any identity labels, balances privacy and utility, and provides an annotation-free way to evaluate identity privacy.
Abstract: Facial expression recognition relies on facial data that inherently expose identity and thus raise significant privacy concerns. Current privacy-preserving methods typically fail in realistic open-set video settings where identities are unknown, and identity labels are unavailable. We propose a two-stage framework for video-based privacy-preserving FER in challenging open-set settings that requires no identity labels at any stage. To decouple privacy and utility, we first train an identity-suppression network using intra- and inter-video knowledge priors derived from real-world videos without identity labels. This network anonymizes identity while preserving expressive cues. A subsequent denoising module restores expression-related information and helps recover FER performance. Furthermore, we introduce a falsification-based validation method that uses recognition priors to rigorously evaluate privacy robustness without requiring annotated identity labels. Experiments on three video datasets demonstrate that our method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.
[234] Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models
Jingchen Sun,Shaobo Han,Deep Patel,Wataru Kohno,Can Jin,Changyou Chen
Main category: cs.CV
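TL;DR: This paper proposes Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework grounded in a Bayesian perspective that adaptively balances data supervision and teacher guidance and outperforms existing methods on multimodal VQA tasks.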
Details
Motivation: Conventional knowledge distillation struggles to balance data noise against teacher uncertainty, so the student's reliance on teacher guidance needs to be adjusted adaptively.
Method: Teacher-student learning is modeled from a unified Bayesian perspective, with teacher supervision treated as a Gibbs prior over student activations; this yields a closed-form, uncertainty-aware weighting mechanism that supports arbitrary distillation objectives and their combinations.
Result: On multimodal VQA benchmarks, distilling student VLMs from a large teacher VLM consistently improves performance, and Beta-KD outperforms existing knowledge distillation methods.
Conclusion: Beta-KD offers a general, interpretable, high-performing uncertainty-aware distillation paradigm that applies to a range of distillation objectives and multimodal settings.
Abstract: Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher-student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at https://github.com/Jingchensun/beta-kd.
[235] Image-Based Structural Analysis Using Computer Vision and LLMs: PhotoBeamSolver
Altamirano-Muñiz Emilio Fernando
Main category: cs.CV
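TL;DR: This paper introduces PhotoBeamSolver, a program that analyzes images of hand-drawn beam models and uses computer vision and statistical learning to identify structural elements automatically and solve the idealized beam model.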
Details
Motivation: Apply computer vision to structural analysis in civil engineering, automating the handling of hand-drawn beam models to make teaching and engineering practice more efficient.
Method: Computer vision and statistical learning techniques detect and visually interpret structural elements; the PhotoBeamSolver program was developed to validate the approach.
Result: The program automatically recognizes and solves idealized beam models from hand drawings, and the paper analyzes the challenges, limitations, and requirements of applying the technique in civil engineering.
Conclusion: Computer vision has potential in structural analysis, infrastructure inspection, and engineering decision-support systems, but key issues such as reliability and engineering applicability must be resolved.
Abstract: This paper presents the development of a documented program capable of solving idealized beam models, such as those commonly used in textbooks and academic exercises, from drawings made by a person. The system is based on computer vision and statistical learning techniques for the detection and visual interpretation of structural elements. Likewise, the main challenges and limitations associated with the integration of computer vision into structural analysis are analyzed, as well as the requirements necessary for its reliable application in the field of civil engineering. In this context, the implementation of the PhotoBeamSolver program is explored, and the current state of computer vision in civil engineering is discussed, particularly in relation to structural analysis, infrastructure inspection, and engineering decision-support systems.
[236] PAS3R: Pose-Adaptive Streaming 3D Reconstruction for Long Video Sequences
Lanbo Xu,Liang Guo,Caigui Jiang,Cheng Wang
Main category: cs.CV
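TL;DR: This paper proposes PAS3R, a framework that addresses the stability-adaptation dilemma in online monocular 3D reconstruction through pose-adaptive updates. Motion-aware updating, trajectory-consistent training, and a lightweight online stabilization module improve trajectory accuracy, depth estimation, and point cloud reconstruction quality on long video sequences.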
Details
Motivation: Online monocular 3D reconstruction faces a stability-adaptation dilemma: the model must incorporate new viewpoints quickly while preserving scene structure it has already built; existing methods cope poorly with abrupt viewpoint changes, which leads to trajectory drift and geometric inconsistency.
Method: Proposes PAS3R, comprising (1) a motion-aware update mechanism that combines inter-frame pose variation with image frequency cues to estimate frame importance; (2) trajectory-consistent training objectives with relative pose constraints and acceleration regularization; (3) a lightweight online stabilization module that suppresses high-frequency trajectory jitter and geometric artifacts.
Result: Experiments on multiple benchmarks show that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality on long video sequences while staying competitive on short sequences.
Conclusion: By dynamically modulating its state-update strategy, PAS3R balances stability and adaptation in online reconstruction and offers a new approach to long-horizon monocular SLAM and 3D reconstruction.
Abstract: Online monocular 3D reconstruction enables dense scene recovery from streaming video but remains fundamentally limited by the stability-adaptation dilemma: the reconstruction model must rapidly incorporate novel viewpoints while preserving previously accumulated scene structure. Existing streaming approaches rely on uniform or attention-based update mechanisms that often fail to account for abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long sequences. We introduce PAS3R, a pose-adaptive streaming reconstruction framework that dynamically modulates state updates according to camera motion and scene structure. Our key insight is that frames contributing significant geometric novelty should exert stronger influence on the reconstruction state, while frames with minor viewpoint variation should prioritize preserving historical context. PAS3R operationalizes this principle through a motion-aware update mechanism that jointly leverages inter-frame pose variation and image frequency cues to estimate frame importance. To further stabilize long-horizon reconstruction, we introduce trajectory-consistent training objectives that incorporate relative pose constraints and acceleration regularization. A lightweight online stabilization module further suppresses high-frequency trajectory jitter and geometric artifacts without increasing memory consumption.
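One common way to realize an acceleration regularizer on a camera trajectory is to penalize second differences of consecutive camera positions. The Python sketch below shows that generic form only; whether PAS3R uses exactly this penalty is not stated in the abstract, and the name `acceleration_penalty` and the use of 3D positions alone (ignoring rotation) are assumptions made for illustration.

```python
def acceleration_penalty(positions):
    """Mean squared second difference of a sequence of 3D camera positions.

    A constant-velocity trajectory gives 0; abrupt changes in velocity
    between consecutive frames increase the penalty.
    """
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # Discrete acceleration per axis: x[t+1] - 2*x[t] + x[t-1].
        total += sum((n - 2.0 * c + p) ** 2
                     for p, c, n in zip(prev, cur, nxt))
    return total / (len(positions) - 2)


if __name__ == "__main__":
    smooth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    jerky = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    print(acceleration_penalty(smooth), acceleration_penalty(jerky))
```

In training, a term like this would typically be added to the pose loss with a small weight, so that it discourages jitter without forbidding genuine changes in camera motion.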
Extensive experiments across multiple benchmarks demonstrate that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences.
[237] EpiMask: Leveraging Epipolar Distance Based Masks in Cross-Attention for Satellite Image Matching
Rahul Deshmukh,Aditya Chauhan,Avinash Kak
Main category: cs.CV
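TL;DR: This paper proposes EpiMask, a semi-dense matching network designed for satellite images. It improves matching accuracy by modeling affine geometry, applying an epipolar-distance-based attention mask, and fine-tuning a pretrained encoder.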
Details
Motivation: Existing deep-learning-based image matching networks are optimized for ground-based images (pinhole camera model) and perform poorly on satellite images, which are acquired one line at a time.
Method: EpiMask has three core designs: (1) patch-wise affine approximations of the satellite camera geometry; (2) an epipolar-distance-based attention mask that restricts cross-attention to geometrically plausible regions; (3) fine-tuning of a foundational pretrained image encoder for more robust features.
Result: On the SatDepth dataset, matching accuracy improves by up to 30% over re-trained ground-based models.
Conclusion: EpiMask addresses the drop in matching performance caused by the different imaging mechanism of satellite images and shows that explicit modeling of satellite geometry and attention constraints are important for matching accuracy.
Abstract: Deep-learning-based image matching networks can now handle significantly larger variations in viewpoint and illumination while providing matched pairs of pixels with sub-pixel precision. These networks have been trained with ground-based image datasets and, implicitly, their performance is optimized for the pinhole camera geometry. Consequently, performance is suboptimal when such networks are used to match satellite images, since those images are synthesized as a moving satellite camera records one line at a time of the points on the ground. In this paper, we present EpiMask, a semi-dense image matching network for satellite images that (1) incorporates patch-wise affine approximations to the camera modeling geometry; (2) uses an epipolar distance-based attention mask to restrict cross-attention to geometrically plausible regions; and (3) fine-tunes a foundational pretrained image encoder for robust feature extraction. Experiments on the SatDepth dataset demonstrate up to 30% improvement in matching accuracy compared to re-trained ground-based models.
[238] ALADIN: Attribute-Language Distillation Network for Person Re-Identification
Wang Zhou,Boran Duan,Haojun Ai,Ruiqi Lan,Ziyue Zhou
Main category: cs.CV
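TL;DR: This paper proposes ALADIN, which uses attribute-language distillation to transfer knowledge from a frozen CLIP teacher to a lightweight ReID student. It introduces fine-grained attribute-local alignment, scene-aware soft prompt generation, and cross-modal contrastive and relation distillation, improving robustness to occlusion and generalization.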
Details
Motivation: Existing CLIP-guided ReID methods rely on global features and fixed text prompts, which makes it hard to capture fine-grained attribute cues and adapt to diverse appearances.
Method: Proposes ALADIN: (1) an attribute-language knowledge distillation framework; (2) fine-grained attribute-local alignment; (3) a Scene-Aware Prompt Generator that produces image-specific soft prompts; (4) attribute-local distillation that enforces consistency between textual attributes and local visual features; (5) cross-modal contrastive and relation distillation that preserve structural relationships among attributes; (6) multimodal LLMs that generate structured attribute descriptions, which CLIP converts into localized attention maps for supervision.
Result: On Market-1501, DukeMTMC-reID, and MSMT17, ALADIN outperforms CNN-, Transformer-, and CLIP-based baselines, with better robustness to occlusion, generalization, and interpretability.
Conclusion: ALADIN combines language priors with fine-grained visual modeling, supports the value of attribute-level cross-modal alignment and distillation in ReID, and points toward lightweight, robust, interpretable ReID.
Abstract: Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the inherent structural relationships among attributes. To provide precise supervision, we leverage Multimodal LLMs to generate structured attribute descriptions, which are then converted into localized attention maps via CLIP. At inference, only the student is used. Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods, with better generalization and interpretability.
[239] Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
Hyundong Jin,Dongyoon Han,Eunwoo Kim
Main category: cs.CV
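TL;DR: This paper proposes a continual unlearning framework for large vision-language models. Fine-grained visual-linguistic concept decomposition and concept-driven routing among refusal experts produce precise, interpretable refusals under sequential deletion requests while preserving general capability.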
Details
Motivation: Existing continual unlearning methods distort shared representations during sequential updates, creating spurious associations between vision-language pairs and refusal behavior; this makes it hard to pinpoint what should be refused and leads to over-refusal.
Method: A concept modulator identifies the visual-linguistic concept combinations that characterize each forget category, and multiple concept-aligned refusal experts ("refusers") are designed; a multimodal, concept-driven routing scheme reuses refusers for tasks with similar concepts and adapts underutilized refusers to new concepts.
Result: Experiments on multiple vision-language benchmarks show that the framework generates concept-grounded refusal responses, outperforms existing methods across unlearning sequences, and better preserves general performance.
Conclusion: Anchoring refusal behavior in fine-grained cross-modal concepts rather than raw inputs, combined with a dynamically routed expert system, is an effective way to improve the precision and robustness of continual unlearning.
Abstract: Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.
[240] Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation
Jingnan Luo,Mingqi Gao,Jun Liu,Bin-Bin Gao,Feng Zheng
Main category: cs.CV
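TL;DR: This paper proposes TrajSeg, a framework that uses bidirectional text-trajectory alignment and a frame-level content integration module to improve the trajectory awareness of multimodal large language models in video reasoning segmentation, with a unified, end-to-end trainable segmentation decoder.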
Details
Motivation: Existing methods rely on unidirectional, implicit text-trajectory alignment and struggle to perceive trajectories accurately under severe video dynamics, so a more robust alignment mechanism is needed.
Method: Proposes bidirectional text-trajectory alignment (text-to-trajectory and trajectory-to-text), introduces a frame-level content integration (FCI) module that adapts the trajectory-level token to frame-specific information, and designs a unified mask decoder that segments all frames jointly.
Result: TrajSeg outperforms existing methods on multiple referring and reasoning video segmentation datasets on all metrics.
Conclusion: TrajSeg is a simple, unified, end-to-end trainable framework for video reasoning segmentation that improves how MLLMs understand trajectories and segment objects in complex dynamic videos.
Abstract: The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.
[241] StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding
Guowei Tang,Tianwen Qian,Huanran Zheng,Yifei Wang,Xiaoling Wang
Main category: cs.CV
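TL;DR: This paper proposes StreamingEval, a unified evaluation framework for assessing the streaming video understanding of video large language models (Video-LLMs) under realistic resource constraints, with emphasis on the trade-off between efficiency, storage, and accuracy.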
Details
Motivation: Existing research on streaming video understanding mostly targets isolated metrics, such as question-answering accuracy or encoding efficiency, and neglects the system-level performance and resource constraints that real deployment requires.
Method: Builds the StreamingEval framework, which uses a fixed-capacity memory bank to normalize accessible historical visual context and jointly evaluates visual encoding efficiency, text decoding latency, and task performance, benchmarking offline and online video models under one standardized protocol.
Result: Experiments on multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications.
Conclusion: StreamingEval provides a systematic benchmark for streaming video understanding and supports the development of Video-LLMs that balance efficiency, storage, and accuracy.
Abstract: Real-time, continuous understanding of visual signals is essential for real-world interactive AI applications, and poses a fundamental system-level challenge. Existing research on streaming video understanding, however, typically focuses on isolated aspects such as question-answering accuracy under limited visual context or improvements in encoding efficiency, while largely overlooking practical deployability under realistic resource constraints. To bridge this gap, we introduce StreamingEval, a unified evaluation framework for assessing the streaming video understanding capabilities of Video-LLMs under realistic constraints. StreamingEval benchmarks both mainstream offline models and recent online video models under a standardized protocol, explicitly characterizing the trade-off between efficiency, storage and accuracy. Specifically, we adopt a fixed-capacity memory bank to normalize accessible historical visual context, and jointly evaluate visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability. Extensive experiments across multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications, providing a systematic basis for future research in this direction. Codes will be released at https://github.com/wwgTang-111/StreamingEval1.
[242] Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification
Jayanie Bogahawatte,Sachith Seneviratne,Saman Halgamuge
Main category: cs.CV
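TL;DR: This paper proposes a parameter-efficient prompt tuning method and a soft hierarchical textual guidance strategy for few-shot weakly supervised whole slide image (WSI) classification, reducing trainable parameters and inference overhead while improving classification and tumor localization.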
Details
Motivation: Existing few-shot weakly supervised WSI classification methods built on pre-trained vision-language models (VLMs) have many trainable parameters, high inference overhead, and lose information by hard-filtering instances with low text alignment; fine-grained instance annotation is costly, so methods that make efficient use of sparse slide-level labels are needed.
Method: (1) A parameter-efficient prompt tuning method that scales and shifts features in the text encoder; (2) a WSI representation learning strategy with soft hierarchical textual guidance that avoids hard filtering of low-alignment instances and combines VLM prior knowledge with the inherent hierarchical structure of WSIs.
Result: On breast, lung, and ovarian cancer pathology datasets, classification performance improves by up to 10.9%, 7.8%, and 13.8%, respectively; trainable parameters fall by 18.1% on the breast and lung datasets and by 5.8% on the ovarian dataset; the method also performs well on weakly supervised tumor localization.
Conclusion: The method retains VLM prior knowledge while modeling the hierarchical structure of WSIs, and improves few-shot weakly supervised WSI classification and localization with fewer parameters and lower overhead, which makes it clinically practical.
Abstract: Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance-level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter-efficient prompt tuning method by scaling and shifting features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs, but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy without utilizing hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements of up to 10.9%, 7.8%, and 13.8%, respectively, over the state-of-the-art methods in FSWC.
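The phrase "scaling and shifting features in the text encoder" can be read as learning a per-channel scale and offset on top of frozen features. The Python sketch below shows that generic operation; it is an assumption about the general idea, not the authors' code, and the names `scale_shift`, `gamma`, and `beta` are hypothetical.

```python
def scale_shift(features, gamma, beta):
    """Apply y = gamma * x + beta per channel to frozen encoder features.

    features: list of token feature vectors (each a list of floats).
    gamma, beta: per-channel scale and shift. In a tuning setup these
    would be the only trainable parameters, 2 * d values per tuned
    layer for feature dimension d.
    """
    return [[g * x + b for x, g, b in zip(row, gamma, beta)]
            for row in features]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    # Identity initialization (gamma = 1, beta = 0) leaves features unchanged.
    print(scale_shift(tokens, [1.0, 1.0], [0.0, 0.0]))
    # After tuning, each channel is rescaled and offset independently.
    print(scale_shift(tokens, [2.0, 0.5], [1.0, -1.0]))
```

Starting from the identity setting means the tuned model initially reproduces the pre-trained encoder, and only two small vectors per layer need to be stored, which is consistent with the parameter savings the abstract reports.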
Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.
[243] Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
Kaiqiang Li,Gang Li,Mingle Zhou,Min Li,Delong Han,Jin Wan
Main category: cs.CV
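TL;DR: This paper proposes BTP, a framework that uses pre-trained Point-Language Models (PLMs) for zero-shot 3D anomaly detection. It aligns multi-granularity point cloud patches with text embeddings and adds geometric descriptors and joint representation learning on auxiliary point cloud data, improving detection of local and structural anomalies.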
Details
Motivation: Existing methods based on 2D rendering and vision-language models lose 3D geometric detail and are insensitive to local anomalies, so methods that model the intrinsic structure of 3D point clouds directly are needed.
Method: Proposes the BTP framework: (1) multi-granularity point cloud patch features are aligned with text embeddings; (2) geometric descriptors increase sensitivity to structural anomalies; (3) a joint representation learning strategy uses auxiliary point cloud data to improve robustness and enrich anomaly semantics.
Result: On Real3D-AD and Anomaly-ShapeNet, BTP achieves state-of-the-art performance in zero-shot 3D anomaly detection.
Conclusion: Using pre-trained point-language models directly, combined with geometric priors and auxiliary data, is an effective route to high-performing zero-shot 3D anomaly detection.
Abstract: Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at https://github.com/wistful-8029/BTP-3DAD.
[244] VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection
Xinghan Li,Junhao Xu,Jingjing Chen
Main category: cs.CV
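TL;DR: This paper proposes the VIGIL framework, which performs structured, facial-part-oriented deepfake detection through a two-stage plan-then-examine pipeline. Combined with stage-gated evidence injection and three-stage progressive training, it significantly improves generalization on the newly constructed multi-level OmniFake benchmark.
Details
Motivation: Existing deepfake detection methods based on multimodal large language models (MLLMs) merge evidence generation and manipulation localization into a single step, which blurs faithful observations with hallucinated explanations and makes conclusions unreliable.
Method: VIGIL: 1) a plan-then-examine pipeline that first selects the facial parts to inspect based on global visual cues, then examines each part using independently sourced forensic evidence; 2) a stage-gated injection mechanism that injects part-level evidence only during the examination stage, preserving perceptual autonomy during planning; 3) three-stage progressive training (including a reinforcement learning stage with part-aware rewards) that strengthens anatomical validity and evidence-conclusion coherence.
Result: On the newly constructed five-level hierarchical OmniFake benchmark (spanning basic generators to real social-media data) and in cross-dataset evaluation, VIGIL outperforms expert detectors and concurrent MLLM methods at every generalization level.
Conclusion: A part-oriented structured forensic pipeline, stage-isolated use of evidence, and part-aware reinforcement learning are key to improving the interpretability and generalization of MLLMs in deepfake detection.
Abstract: Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. To address this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice that follows a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence-conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data.
Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
[245] PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation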
Gensheng Pei,Xiruo Jiang,Xinhao Cai,Tao Chen,Yazhou Yao,Byeungwoo Jeon
Main category: cs.CV
TL;DR: This paper proposes PEARL, which performs training-free open-vocabulary semantic segmentation through two-step inference: Procrustes alignment followed by text-aware Laplacian propagation. It reaches new SOTA performance without extra data or auxiliary backbones.
Details
Motivation: Existing training-free open-vocabulary semantic segmentation methods rely on heavy post-processing, treat the image and text modalities separately, and underuse cross-modal geometric information, or they introduce extra vision backbones or multi-model pipelines that raise complexity and latency.
Method: PEARL uses two-step inference: 1) Procrustes alignment, which performs an orthogonal projection inside the last self-attention block via a stable polar decomposition iteration, rotating keys toward the query subspace; 2) text-aware Laplacian propagation, which refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve, where text provides a data-trust signal and neighbor gating while image gradients preserve boundaries.
Result: PEARL outperforms existing methods across standard benchmarks, setting a new SOTA for training-free open-vocabulary semantic segmentation, with strong results under both with-background and without-background protocols.
Conclusion: PEARL is a lightweight, plug-and-play, fully training-free method that uses only fixed constants and adds only slight latency (a small per-head projection plus a few conjugate-gradient steps), significantly improving performance while keeping the design simple.
Abstract: Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL, Procrustes alignment with text-aware Laplacian propagation, a compact two-step inference that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. Our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency with a small per-head projection and a few conjugate-gradient steps. PEARL sets a new state-of-the-art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.
[246] PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models
Yiwei Xie,Zheng Zhang,Ping Liu
Main category: cs.CV
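TL;DR: This paper proposes PROBE, a diagnostic protocol that quantifies the "reactivation potential" of erased concepts in text-to-video diffusion models, revealing that current erasure methods achieve only output-level suppression rather than representation-level removal.
Details
Motivation: Existing evaluation of concept erasure in text-to-video (T2V) models checks only whether the target concept is absent from generated frames, mistaking output-level suppression for representation-level removal and leaving latent residual capacity unexamined.
Method: The PROBE protocol: with all model parameters frozen, it optimizes a lightweight pseudo-token embedding using a denoising reconstruction objective and a novel latent alignment constraint (anchored to the spatiotemporal structure of the original concept) to quantify the reactivation potential of erased concepts. It also builds a multi-level evaluation framework covering classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation.
Result: Systematic experiments across three T2V architectures, three concept categories, and three erasure strategies show that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth. They also identify a video-specific failure mode, "temporal re-emergence", in which suppressed concepts progressively resurface across video frames and go undetected by frame-level metrics.
Conclusion: Current concept erasure methods mainly achieve output-level suppression rather than true representation-level removal; PROBE provides a reproducible diagnostic tool for safety auditing of T2V models.
Abstract: Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing.
Our code is available at https://github.com/YiweiXie/PRObingBasedEvaluation.
[247] From Part to Whole: 3D Generative World Model with an Adaptive Structural Hierarchy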
Bi'an Du,Daizong Liu,Pufan Li,Wei Hu
Main category: cs.CV
TL;DR: This paper proposes an adaptive part-whole 3D generative world model that learns dynamic part structures in a 3D latent space, enabling cross-category generalization and part-count extrapolation.
Details
Motivation: Existing single-image 3D generation methods struggle to generalize reliably across diverse semantic categories and complex structures under sparse supervision; fixed part counts or monolithic modeling easily lead to overfitting, missing parts, or poor compositional generalization.
Method: A part-to-whole 3D generative model that autonomously discovers soft, compositional part masks from image tokens; an adaptive slot-gating mechanism that dynamically adjusts part activation probabilities and merges redundant parts; alignment of each part to a class-agnostic prototype bank; and a lightweight 3D denoiser that jointly optimizes geometry and appearance.
Result: Experiments show consistent gains in cross-category transfer and part-count extrapolation; ablations confirm that the prototype bank strengthens shape-prior sharing and that slot-gating improves structural adaptability.
Conclusion: Recasting single-image 3D generation as learning an adaptive part-whole hierarchy in a flexible 3D latent space significantly improves generalization, structural quality, and scalability.
Abstract: Single-image 3D generation lies at the core of vision-to-graphics models in the real world. However, it remains a fundamental challenge to achieve reliable generalization across diverse semantic categories and highly variable structural complexity under sparse supervision. Existing approaches typically model objects in a monolithic manner or rely on a fixed number of parts, including recent part-aware models such as PartCrafter, which still require a labor-intensive user-specified part count. Such designs easily lead to overfitting, fragmented or missing structural components, and limited compositional generalization when encountering novel object layouts. To this end, this paper rethinks single-image 3D generation as learning an adaptive part-whole hierarchy in the flexible 3D latent space. We present a novel part-to-whole 3D generative world model that autonomously discovers latent structural slots by inferring soft and compositional masks directly from image tokens. Specifically, an adaptive slot-gating mechanism dynamically determines the slot-wise activation probabilities and smoothly consolidates redundant slots within different objects, ensuring that the emergent structure remains compact yet expressive across categories. Each distilled slot is then aligned to a learnable, class-agnostic prototype bank, enabling powerful cross-category shape sharing and denoising through universal geometric prototypes in the real world. Furthermore, a lightweight 3D denoiser is introduced to reconstruct geometry and appearance via unified diffusion objectives.
Experiments show consistent gains in cross-category transfer and part-count extrapolation, and ablations confirm complementary benefits of the prototype bank for shape-prior sharing as well as slot-gating for structural adaptation.
[248] Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning
Minseok Kang,Minhyeok Lee,Minjung Kim,Jungho Lee,Donghyeong Kim,Sungmin Woo,Inseok Jeon,Sangyoun Lee
Main category: cs.CV
TL;DR: This paper proposes a new method for weakly-supervised video scene graph generation (WS-VSGG) that suppresses non-interactive pairs through a learnable subject-object pair affinity (PALS/PAM) and improves pseudo-label quality with Relation-Aware Matching (RAM), significantly improving performance.
Details
Motivation: Existing weakly-supervised methods use off-the-shelf detectors that produce many non-interactive object pairs, which interfere with relation modeling, whereas fully-supervised detectors naturally filter out non-interactive objects. This is a key discrepancy between the two settings.
Method: Pair Affinity Learning and Scoring (PALS) for inference-time ranking, and Pair Affinity Modulation (PAM), integrated into contextual reasoning, to suppress non-interactive pairs; plus Relation-Aware Matching (RAM), which uses vision-language grounding to improve pseudo-label generation.
Result: On the Action Genome dataset, the method yields significant improvements across multiple baselines and backbones, achieving state-of-the-art WS-VSGG performance.
Conclusion: Explicit modeling of interactivity (pair affinity) and a more robust pseudo-labeling strategy (RAM) effectively mitigate the noise caused by detector generality in the weakly-supervised setting, moving WS-VSGG toward practical use.
Abstract: Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress non-interactive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.
[249] Exploring Multimodal Prompts For Unsupervised Continuous Anomaly Detection
Mingle Zhou,Jiahui Liu,Jin Wan,Gang Li,Min Li
Main category: cs.CV
TL;DR: This paper proposes an unsupervised continual anomaly detection framework based on multimodal prompts. By building a Continual Multimodal Prompt Memory Bank (CMPMB) and a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM), it achieves SOTA image-level AUROC and pixel-level AUPR on the MVTec AD and VisA datasets.
Details
Motivation: Existing Unsupervised Continuous Anomaly Detection (UCAD) methods that rely only on visual information cannot adequately model the manifold of normality in complex scenes, which limits gains in detection accuracy.
Method: A multimodal-prompt-based UCAD framework comprising: 1) a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and stores prototypical normal patterns from the visual and textual domains across tasks; 2) a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that combines an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly improve detection accuracy and adversarial robustness.
Result: On the MVTec AD and VisA datasets, both image-level AUROC and pixel-level AUPR reach state-of-the-art (SOTA) performance.
Conclusion: Introducing multimodal prompts and a continual memory mechanism effectively strengthens normality modeling and significantly improves the accuracy and robustness of unsupervised continual anomaly detection.
Abstract: Unsupervised Continuous Anomaly Detection (UCAD) is gaining attention for effectively addressing the catastrophic forgetting and heavy computational burden issues in traditional Unsupervised Anomaly Detection (UAD). However, existing UCAD approaches that rely solely on visual information are insufficient to capture the manifold of normality in complex scenes, thereby impeding further gains in anomaly detection accuracy. To overcome this limitation, we propose an unsupervised continual anomaly detection framework grounded in multimodal prompting. Specifically, we introduce a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across consecutive tasks, yielding a richer representation of normality. Furthermore, we devise a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that integrates an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly enhance detection accuracy and adversarial robustness. Benchmark experiments on MVTec AD and VisA datasets show that our approach achieves state-of-the-art (SOTA) performance on image-level AUROC and pixel-level AUPR metrics.
[250] Rethinking SAR ATR: A Target-Aware Frequency-Spatial Enhancement Framework with Noise-Resilient Knowledge Guidance
Yansong Lin,Zihan Cheng,Jielei Wang,Guoming Lua,Zongyong Cui
Main category: cs.CV
TL;DR: This paper proposes a target-aware frequency-spatial enhancement framework (FSCE) that combines a shallow feature adaptive enhancement (DSAF) module with noise-resilient knowledge distillation to improve target recognition performance and generalization for SAR images under coherent speckle noise.
Details
Motivation: The coherent speckle noise inherent in SAR imagery obscures salient target features, degrading recognition accuracy and limiting model generalization.
Method: The FSCE framework includes a frequency-spatial shallow feature adaptive enhancement (DSAF) module that combines spatial multi-scale convolution with frequency-domain wavelet convolution. It adopts a teacher-student learning paradigm with online knowledge distillation (KD), jointly optimized through attention transfer and noise-resilient representation learning.
Result: Experiments on the MSTAR, FUSARShip, and OpenSARShip datasets show that DSAFNet-L matches or exceeds existing methods, while DSAFNet-M greatly reduces model complexity at comparable accuracy.
Conclusion: The FSCE framework effectively improves the stability of SAR target recognition under noisy conditions and its cross-model generalization.
Abstract: Synthetic aperture radar automatic target recognition (SAR ATR) is of considerable importance in marine navigation and disaster monitoring. However, the coherent speckle noise inherent in SAR imagery often obscures salient target features, leading to degraded recognition accuracy and limited model generalization. To address this issue, this paper proposes a target-aware frequency-spatial enhancement framework with noise-resilient knowledge guidance (FSCE) for SAR target recognition. The proposed framework incorporates a frequency-spatial shallow feature adaptive enhancement (DSAF) module, which processes shallow features through spatial multi-scale convolution and frequency-domain wavelet convolution. In addition, a teacher-student learning paradigm combined with an online knowledge distillation method (KD) is employed to guide the student network to focus more effectively on target regions, thereby enhancing its robustness to high-noise backgrounds. Through the collaborative optimization of attention transfer and noise-resilient representation learning, the proposed approach significantly improves the stability of target recognition under noisy conditions. Based on the FSCE framework, two network architectures with different performance emphases are developed: lightweight DSAFNet-M and high-precision DSAFNet-L. Extensive experiments are conducted on the MSTAR, FUSARShip and OpenSARShip datasets. The results show that DSAFNet-L achieves competitive or superior performance compared with various methods on three datasets; DSAFNet-M significantly reduces the model complexity while maintaining comparable accuracy.
These results indicate that the proposed FSCE framework exhibits strong cross-model generalization.
[251] CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation
Mohammad Eslami,Dhanvinkumar Ganeshkumar,Saber Kazeminasab,Michael G. Morley,Michael V. Boland,Michael M. Lin,John B. Miller,David S. Friedman,Nazlee Zebardast,Lucia Sobrin,Tobias Elze
Main category: cs.CV
TL;DR: CataractSAM-2 is a medical-domain adaptation of Segment Anything Model 2 designed for real-time semantic segmentation of cataract surgery videos. The paper proposes an interactive video annotation framework to reduce annotation cost; the model generalizes zero-shot across procedures (e.g., glaucoma trabeculectomy); the model and toolkit are open-sourced.
Details
Motivation: To meet the need for high-accuracy, real-time semantic segmentation of cataract surgery videos, ease the high annotation cost and data scarcity of medical imaging, and support precise intraoperative perception for robot-assisted and computer-guided surgical systems.
Method: Domain-adaptive training based on SAM 2; an interactive annotation framework that combines sparse prompts with video mask propagation; training on cataract surgery video data and validating zero-shot transfer across procedures (glaucoma surgery).
Result: High-accuracy real-time segmentation of cataract surgery videos; significantly reduced manual annotation time; strong zero-shot generalization on unseen glaucoma surgery videos; open-sourced model and annotation tool.
Conclusion: CataractSAM-2 provides a scalable foundation for dataset construction and real-time AI perception in anterior segment eye surgery, advancing medical robotics and surgical video understanding.
Abstract: We present CataractSAM-2, a domain-adapted extension of Meta's Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model's strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.
[252] Rethinking Visual Privacy: A Compositional Privacy Risk Framework for Severity Assessment with VLMs
Efthymios Tsaprazlis,Tiantian Feng,Anil Ramakrishna,Sai Praneeth Karimireddy,Rahul Gupta,Shrikanth Narayanan
Main category: cs.CV
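TL;DR: This paper proposes CPRT, a compositional framework for assessing visual privacy risk. It argues that privacy risk is not a binary property but a matter of continuous severity arising from combinations of visual attributes, and it builds a corresponding dataset and a lightweight model to improve recognition of compositional privacy risk.
Details
Motivation: Existing visual privacy benchmarks treat privacy as a binary property, overlooking that several seemingly harmless visual attributes can combine into a serious privacy risk. A finer-grained, regulation-aligned compositional assessment framework is therefore needed.
Method: The regulation-aware Compositional Privacy Risk Taxonomy (CPRT), which defines four severity levels and an interpretable continuous scoring function; a taxonomy-aligned dataset of 6.7K images annotated with ground-truth compositional risk; an evaluation of frontier and open-weight VLMs on compositional risk recognition; and an 8B-parameter supervised fine-tuned model.
Result: Frontier models align well with compositional severity under structured prompting but systematically underestimate compositional risk; smaller models struggle with graded privacy reasoning; the proposed 8B SFT model approaches frontier-level performance on compositional privacy assessment.
Conclusion: Privacy risk is inherently compositional and calls for more than binary judgments. CPRT offers an interpretable, scalable, regulation-aligned paradigm for visual privacy assessment and demonstrates the feasibility of lightweight models for this task.
Abstract: Existing visual privacy benchmarks largely treat privacy as a binary property, labeling images as private or non-private based on visible sensitive content. We argue that privacy is fundamentally compositional. Attributes that are benign in isolation may combine to produce severe privacy violations. We introduce the Compositional Privacy Risk Taxonomy (CPRT), a regulation-aware framework that organizes visual attributes according to standalone identifiability and compositional harm potential. CPRT defines four graded severity levels and is paired with an interpretable scoring function that assigns continuous privacy severity scores. We further construct a taxonomy-aligned dataset of 6.7K images and derive ground-truth compositional risk scores. By evaluating frontier and open-weight VLMs we find that frontier models align well with compositional severity when provided structured guidance, but systematically underestimate composition-driven risks. Smaller models struggle to internalize graded privacy reasoning. To bridge this gap, we introduce a deployable 8B supervised fine-tuned (SFT) model that closely matches frontier-level performance on compositional privacy assessment.
[253] HACMatch Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling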
Mei Li,Huayi Zhou,Suizhi Huang,Yuxiang Lu,Yue Ding,Hongtao Lu
Main category: cs.CV
TL;DR: This paper proposes a semi-supervised learning framework for 3D rotation regression that uses hardness-aware curriculum learning and structured data augmentation to significantly improve performance with only a small number of labeled 2D images.
Details
Motivation: Existing 3D rotation regression models depend on large amounts of labeled data or additional modalities, while the pseudo-label filtering in semi-supervised methods such as FisherMatch is rigid and struggles to separate reliable from unreliable samples.
Method: A hardness-aware curriculum learning framework (with multi-stage and adaptive strategies) that dynamically selects pseudo-labeled samples; and a structured data augmentation tailored to rotation estimation that assembles augmented patches to preserve geometric integrity while increasing feature diversity.
Result: On the PASCAL3D+ and ObjectNet3D datasets, especially in low-label regimes, the method significantly outperforms existing supervised and semi-supervised baselines.
Conclusion: Hardness-aware curriculum learning and structured data augmentation effectively improve the robustness and generalization of semi-supervised 3D rotation regression, offering a new approach to rotation estimation with few labeled samples.
Abstract: Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.
[254] SARe: Structure-Aware Large-Scale 3D Fragment Reassembly
Hanze Jia,Chunshi Wang,Yuxiao Yang,Zhonghua Jiang,Yawei Luo,Shuainan Ye,Tan Tang
Main category: cs.CV
TL;DR: This paper proposes Structure-Aware Reassembly (SARe), a generative framework for large-scale 3D fragment reassembly that explicitly models contact relations and refines results at inference time, significantly improving robustness and success rates at high fragment counts.
Details
Motivation: Existing end-to-end methods are prone to cascading failures because contact reasoning is unreliable (especially inaccurate fragment adjacencies). As the fragment count grows, the unknown target shape and weak semantic cues make the problem even harder.
Method: The SARe framework has two parts: SARe-Gen, which jointly predicts fracture-surface token probabilities and an inter-fragment contact graph, using a query-point-based conditioning scheme and a frozen geometry encoder to extract structured local geometric tokens; and SARe-Refine, which at inference time verifies candidate contact edges with geometric-consistency checks, fixes verified substructures, and resamples uncertain regions.
Result: SOTA performance in three settings (synthetic fractures, simulated fractures of scanned real objects, and scans of real physical fractures), with more graceful degradation and higher success rates as the fragment count increases.
Conclusion: Explicit structure awareness and a staged generate-then-refine strategy effectively mitigate cascading errors in large-scale fragment reassembly, improving overall stability and scalability.
Abstract: 3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans.
The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.
[255] AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing
Guandong Li,Zhaobin Chu
Main category: cs.CV
TL;DR: This paper proposes AdaEdit, a training-free adaptive image editing framework that resolves the injection dilemma of inversion-based editing in flow matching models through a progressive injection schedule and channel-selective latent perturbation, significantly improving editing quality.
Details
Motivation: Existing inversion-based image editing methods use fixed feature injection strategies that ignore how injection demand varies across timesteps and channels, creating a trade-off between background preservation and generation of edited content.
Method: Two core components: 1) a progressive injection schedule that replaces binary schedules with continuous decay functions (e.g., sigmoid, cosine), giving a smooth transition from source-feature preservation to target-feature generation; 2) channel-selective latent perturbation, which estimates per-channel importance from the distributional gap between inverted and random latents and applies differentiated perturbation strengths.
Result: On the PIE-Bench benchmark (700 images, 10 editing types), AdaEdit reduces LPIPS by 8.7%, improves SSIM by 2.6%, and improves PSNR by 2.3% over strong baselines while maintaining competitive CLIP similarity.
Conclusion: AdaEdit is a plug-and-play, training-free adaptive editing framework compatible with multiple ODE solvers; it effectively alleviates the injection dilemma and significantly improves editing fidelity and structural consistency.
Abstract: Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model's ability to synthesize edited content. Existing methods address this with fixed injection strategies -- binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation -- that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly -- strongly perturbing edit-relevant channels while preserving structure-encoding channels.
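The idea of replacing a hard binary cutoff with a continuous decay can be sketched in a few lines of Python. This is only an illustrative sketch, not the AdaEdit implementation: the function names, the normalization of the step index to [0, 1], and the `midpoint` and `steepness` parameters are assumptions introduced here, and features are plain lists of floats rather than model tensors.

```python
import math


def injection_weight(step, num_steps, kind="cosine", midpoint=0.5, steepness=10.0):
    """Weight in [0, 1] given to source features at a denoising step.

    Hypothetical helper: `midpoint` and `steepness` are illustrative
    parameters, not values taken from the paper.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    s = step / (num_steps - 1)  # normalized progress, 0.0 (first step) to 1.0 (last)
    if kind == "binary":
        # Hard on/off cutoff: the fixed strategy the abstract argues against.
        return 1.0 if s < midpoint else 0.0
    if kind == "linear":
        return 1.0 - s
    if kind == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * s))
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(steepness * (s - midpoint)))
    raise ValueError(f"unknown schedule kind: {kind}")


def blend_features(source_feat, target_feat, weight):
    """Convex combination of source and target features (lists of floats)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(source_feat, target_feat)]


if __name__ == "__main__":
    for kind in ("binary", "linear", "cosine", "sigmoid"):
        weights = [round(injection_weight(t, 6, kind), 3) for t in range(6)]
        print(kind, weights)
```

With the continuous variants the weight falls gradually from source preservation toward target generation, whereas the `binary` variant jumps from 1 to 0 at the midpoint, which is the kind of discontinuity the abstract says the progressive schedule removes.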
Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at https://github.com/leeguandong/AdaEdit
[256] 4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video
Jae Won Jang,Yeonjin Chang,Wonsik Shin,Juhwan Cho,Nojun Kwak
Main category: cs.CV
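TL;DR: 4DGS360 is a diffusion-free framework for reconstructing 360° dynamic objects from monocular video. It uses 3D-native initialization and the AnchorTAP3D tracker to improve geometric consistency and reconstruction quality in occluded regions, and its SOTA performance is validated on the newly proposed iPhone360 benchmark.
Details
Motivation: Existing methods rely on 2D priors, so initial points overfit to the surfaces visible in each view. This makes consistent 360° geometry hard to reconstruct, with geometric ambiguity especially in occluded regions.
Method: A 3D-native initialization strategy and a new 3D tracker, AnchorTAP3D, which uses high-confidence 2D track points as anchors to produce robust 3D point trajectories, suppressing drift and improving geometric fidelity in occluded regions; combined with subsequent optimization, this yields coherent 360° 4D reconstruction. The paper also builds the iPhone360 benchmark to support stricter evaluation of 360° generalization.
Result: Qualitative and quantitative SOTA results on the iPhone360, iPhone, and DAVIS datasets, significantly improving the consistency and completeness of 360° dynamic object reconstruction.
Conclusion: By dropping diffusion priors and strengthening 3D-native modeling and trajectory tracking, 4DGS360 effectively eases the geometric ambiguity and view-generalization bottlenecks of monocular dynamic reconstruction, offering a new paradigm for 360° dynamic scene understanding.
Abstract: We introduce 4DGS360, a diffusion-free framework for 360° dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360° geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to the visible surfaces in each training view. 4DGS360 addresses this challenge through an advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360° 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135° apart from training views, enabling 360° evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.
[257] Efficient Zero-Shot AI-Generated Image Detection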
Ryosuke Sonoda,Ramya Srinivasan
Main category: cs.CV
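TL;DR: This paper proposes a training-free method for detecting AI-generated images that identifies subtle forgery traces by measuring how sensitive image representations are to structured frequency perturbations. It is computationally lightweight, fast at inference (one to two orders of magnitude faster than existing methods), and achieves better detection performance (nearly 10% higher AUC on OpenFake).
Details
Motivation: Existing training-based detectors generalize poorly, while training-free methods are more robust but struggle to capture subtle differences between real and synthetic images.
Method: A training-free detection method that uses the Fourier transform to generate structured frequency perturbations and measures the sensitivity of image representations to them in order to distinguish real from generated images.
Result: The method significantly outperforms state-of-the-art methods on several challenging benchmarks (notably OpenFake), improving AUC by nearly 10% with inference one to two orders of magnitude faster.
Conclusion: The method retains the robustness of training-free approaches while improving sensitivity to subtle forgery traces, combining high accuracy with high efficiency.
Abstract: The rapid progress of text-to-image models has made AI-generated images increasingly realistic, posing significant challenges for accurate detection of generated content. While training-based detectors often suffer from limited generalization to unseen images, training-free approaches offer better robustness, yet struggle to capture subtle discrepancies between real and synthetic images. In this work, we propose a training-free AI-generated image detection method that measures representation sensitivity to structured frequency perturbations, enabling detection of minute manipulations. The proposed method is computationally lightweight, as perturbation generation requires only a single Fourier transform for an input image. As a result, it achieves one to two orders of magnitude faster inference than most training-free detectors. Extensive experiments on challenging benchmarks demonstrate the efficacy of our method over state-of-the-art (SoTA) methods. In particular, on the OpenFake benchmark, our method improves AUC by nearly 10% compared to SoTA, while maintaining substantially lower computational cost.
[258] PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation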
Jiacheng Lu,Hui Ding,Shiyu Zhang,Guoping Huo
Main category: cs.CV
TL;DR: This paper proposes PGR-Net, a prior-guided ROI reasoning network that improves brain tumor MRI segmentation accuracy through data-driven spatial priors and a hierarchical Top-K ROI decision mechanism, achieving high Dice scores with a low parameter count.
Details
Motivation: Brain tumors are spatially sparse in MRI, and existing methods overlook clinically observed spatial priors of tumor occurrence, leading to redundant computation over background regions.
Method: The PGR-Net framework: 1) builds a data-driven spatial prior set; 2) introduces a hierarchical Top-K ROI decision mechanism; 3) designs the WinGS-ROI module to generate center-enhanced guidance maps; 4) adopts a windowed RetNet backbone to improve localization reliability.
Result: SOTA performance on BraTS-2019/2023 and MSD Task01, with Whole Tumor Dice scores of 89.02%, 91.82%, and 89.67% using only 8.64M parameters.
Conclusion: PGR-Net effectively combines spatial priors with ROI-aware mechanisms, significantly improving brain tumor segmentation accuracy and localization stability while remaining lightweight.
Abstract: Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network), an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS-2019/2023 and MSD Task01 show that PGR-Net consistently outperforms existing approaches while using only 8.64M parameters, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor region.
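For readers unfamiliar with the metric reported above, the following minimal Python sketch computes the standard Dice coefficient, 2|P ∩ T| / (|P| + |T|), between two binary masks. It is not taken from the PGR-Net code, and the abstract does not say how the authors aggregate Dice over volumes or cases. The function name, the flat-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (sequences of 0/1).

    Assumed convention: two empty masks count as a perfect match (1.0).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Voxels labeled positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total positives in prediction plus total positives in ground truth.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(f"Dice = {dice_score(predicted, reference):.4f}")
```

Because the denominator counts only foreground voxels, the score is unaffected by how much background surrounds a lesion, which is one reason Dice is a common choice when the target occupies a small fraction of the volume.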
Code is available at https://github.com/CNU-MedAI-Lab/PGR-Net.
[259] Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition
Wen Guo,Pengfei Zhao,Zongmeng Wang,Yufan Hu,Junyu Gao
Main category: cs.CV
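TL;DR: This paper proposes the TCEI framework, in which an intuitive system (transient memory) and an experiential system (experience accumulated from earlier test videos) cooperate to perform test-time adaptation for multi-object tracking. It accounts for both inter-frame temporal consistency and identity association, and substantially mitigates the performance degradation caused by distribution shift.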
Details
Motivation: Existing test-time adaptation methods perform poorly in multi-object tracking because they focus only on single-frame adaptation and neglect inter-frame temporal consistency and cross-frame identity association; meanwhile, distribution shifts in appearance, motion pattern, and category between training and test data cause a marked drop in online inference performance.
Method: Proposes the Test-time Calibration from Experience and Intuition (TCEI) framework: the intuitive system uses transient memory to make rapid predictions about recently observed objects; the experiential system reassesses and calibrates those intuitive predictions using experience accumulated from earlier test videos; high-confidence and low-confidence objects seen during online testing serve as historical priors and reflective cases, respectively, supporting continual adaptation.
Result: Achieves consistent and significant improvements on multiple mainstream MOT benchmarks and effectively strengthens adaptability under distribution shift.
Conclusion: By combining the intuitive and experiential systems, TCEI is the first to systematically model test-time temporal consistency and identity association in MOT, offering a new paradigm for test-time adaptation.
Abstract: Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and category between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they focus primarily on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model's adaptability under distribution shifts.
The code will be released at https://github.com/1941Zpf/TCEI.
[260] No Dense Tensors Needed: Fully Sparse Object Detection on Event-Camera Voxel Grids
Mohamad Yazan Sadoun,Sarah Sharif,Yaser Mike Banad
Main category: cs.CV
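TL;DR: This paper proposes SparseVoxelDet, the first event-camera object detector built entirely on sparse computation. It processes occupied voxels with 3D sparse convolutions throughout and never generates a dense tensor, achieving high accuracy on the FRED dataset with very low memory and storage overhead.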
Details
Motivation: Most existing event-camera detectors convert the sparse event stream into dense tensors, wasting the representational efficiency of neuromorphic sensing; an end-to-end sparse detection paradigm that truly exploits event sparsity is needed.
Method: Proposes SparseVoxelDet, a fully sparse architecture comprising a 3D sparse-convolution backbone, feature pyramid fusion, and a detection head; computation happens only at non-empty voxel positions, and no dense feature map is generated.
Result: Reaches 83.38% mAP@50 on the FRED benchmark while processing only 14,900 active voxels per frame (0.23% of the grid), compared with 409,600 pixels for the dense YOLOv11 baseline; relative to the equivalent dense 3D voxel tensor, the sparse representation gives 858x GPU memory compression and 3,670x storage reduction; 71% of failures stem from localization error rather than missed detections.
Conclusion: Native sparse processing is an effective paradigm for event-camera object detection; it needs no dedicated neuromorphic hardware, and its representation cost grows with scene dynamics rather than sensor resolution, so it scales well.
Abstract: Event cameras produce asynchronous, high-dynamic-range streams well suited for detecting small, fast-moving drones, yet most event-based detectors convert the sparse event stream into dense tensors, discarding the representational efficiency of neuromorphic sensing. We propose SparseVoxelDet, to our knowledge the first fully sparse object detector for event cameras, in which backbone feature extraction, feature pyramid fusion, and the detection head all operate exclusively on occupied voxel positions through 3D sparse convolutions; no dense feature tensor is instantiated at any stage of the pipeline. On the FRED benchmark (629,832 annotated frames), SparseVoxelDet achieves 83.38% mAP@50 while processing only 14,900 active voxels per frame (0.23% of the T×H×W grid), compared to 409,600 pixels for the dense YOLOv11 baseline (87.68% mAP@50). Relaxing the IoU threshold from 0.50 to 0.40 recovers mAP to 89.26%, indicating that the remaining accuracy gap is dominated by box regression precision rather than detection capability. The sparse representation yields 858 times GPU memory compression and 3,670 times storage reduction relative to the equivalent dense 3D voxel tensor, with data-structure size that scales with scene dynamics rather than sensor resolution. Error forensics across 119,459 test frames confirms that 71 percent of failures are localization near-misses rather than missed targets.
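The idea of storing only occupied voxels can be sketched as follows. This is a minimal illustration and not the SparseVoxelDet pipeline: the function names, grid shape, sensor size, and toy events are all assumptions, and a plain dictionary stands in for a sparse tensor.

```python
def voxelize_events(events, grid_shape, sensor_shape, window_us):
    """Bin (x, y, t_us) events into a sparse {(t_bin, y_bin, x_bin): count} dict."""
    t_bins, y_bins, x_bins = grid_shape
    height, width = sensor_shape
    t_start, t_end = window_us
    voxels = {}
    for x, y, t in events:
        if not (0 <= x < width and 0 <= y < height and t_start <= t < t_end):
            continue  # ignore events outside the sensor area or time window
        key = (
            (t - t_start) * t_bins // (t_end - t_start),
            y * y_bins // height,
            x * x_bins // width,
        )
        voxels[key] = voxels.get(key, 0) + 1  # only occupied voxels are stored
    return voxels


def occupancy(voxels, grid_shape):
    """Fraction of grid cells that hold at least one event."""
    t_bins, y_bins, x_bins = grid_shape
    return len(voxels) / (t_bins * y_bins * x_bins)


# Toy input: four events on a hypothetical 480 x 640 sensor, 50 ms window.
events = [(0, 0, 0), (1, 0, 10_000), (639, 479, 40_000), (639, 479, 41_000)]
grid = (4, 48, 64)
voxels = voxelize_events(events, grid, (480, 640), (0, 50_000))
print(len(voxels), "occupied voxels,", f"{occupancy(voxels, grid):.6f} of the grid")
```

The dictionary grows with the number of distinct active cells, not with the grid size, which is the property the abstract describes as cost scaling with scene dynamics rather than sensor resolution.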
These results demonstrate that native sparse processing is a viable paradigm for event-camera object detection, exploiting the structural sparsity of neuromorphic sensor data without requiring neuromorphic computing hardware, and providing a framework whose representation cost is governed by scene activity rather than pixel count, a property that becomes increasingly valuable as event cameras scale to higher resolutions.
[261] FedCVU: Federated Learning for Cross-View Video Understanding
Shenghan Zhang,Run Ling,Ke Cao,Ao Ma,Zhanjie Zhang
Main category: cs.CV
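TL;DR: This paper proposes the FedCVU framework, whose three modules (VS-Norm, CV-Align, and SLA) address three challenges in cross-view federated video understanding: non-IID data, representation misalignment, and high communication overhead. It markedly improves unseen-view accuracy on action understanding and person re-identification.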
Details
Motivation: Applying federated learning to multi-camera cross-view video understanding faces three challenges: non-IID data caused by heterogeneous viewpoints and backgrounds, representation misalignment caused by local distribution bias, and high communication overhead from large video models.
Method: Proposes the FedCVU framework, comprising (i) VS-Norm, which preserves normalization parameters to accommodate view-specific statistics; (ii) CV-Align, a lightweight contrastive regularization module that improves cross-view representation alignment; and (iii) SLA, a selective layer aggregation strategy that lowers communication cost.
Result: On action understanding and person re-identification under a cross-view protocol, FedCVU consistently improves unseen-view accuracy while maintaining strong seen-view performance, outperforms existing SOTA federated learning baselines, and is robust to domain heterogeneity and communication constraints.
Conclusion: FedCVU effectively alleviates key bottlenecks in cross-view federated video understanding and offers a practical, scalable solution for privacy-preserving multi-view collaborative modeling.
Abstract: Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
[262] OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging
Meilin Liu,Jiaying Wang,Jing Shan
Main category: cs.CV
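TL;DR: This paper proposes OmniFM, a modality- and task-agnostic federated learning framework that uses frequency-domain analysis to unify training for cross-modality medical image analysis, markedly improving performance in heterogeneous settings.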
Details
Motivation: Existing federated learning frameworks for medical image analysis are tied to task-specific backbones and fragile under heterogeneous imaging modalities, so they struggle in real-world settings where modality distributions differ widely across institutions and downstream tasks are diverse.
Method: Building on the insight that low-frequency spectral components are consistent across modalities, OmniFM combines Global Spectral Knowledge Retrieval, Embedding-wise Cross-Attention Fusion, and Prefix-Suffix Spectral Prompting, regularized by a Spectral-Proximal Alignment objective.
Result: Experiments on real-world datasets show that OmniFM consistently surpasses current state-of-the-art federated learning baselines under both intra- and cross-modality heterogeneity, with better results in both fine-tuning and training-from-scratch settings.
Conclusion: OmniFM provides a general, robust, and scalable solution for federated learning in medical imaging and advances its deployment in real clinical environments.
Abstract: Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.
[263] Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis
Kangbo Zhao,Miaoxin Guan,Xiang Chen,Yukai Shi,Jinshan Pan
Main category: cs.CV
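TL;DR: This paper proposes a cross-scenario deraining adaptation framework that needs only rain-free background images from the target domain. Using superpixel generation, resolution-adaptive fusion, and a pseudo-label re-synthesis mechanism, it produces realistic pseudo-data that improves generalization to out-of-distribution (OOD) scenarios.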
Details
Motivation: Existing deep learning deraining methods perform well on synthetic data, but their performance drops sharply under complex real-world rain because of the large domain gap between source and target.
Method: Proposes a cross-scenario adaptation framework that needs no paired rainy images in the target domain: 1) the Sup-Gen module uses SLIC to extract stable structural priors from the source domain; 2) a resolution-adaptive fusion strategy based on texture similarity aligns source structures with target backgrounds to generate diverse pseudo-data; 3) a pseudo-label re-synthesis mechanism with multi-stage noise generation simulates realistic rain streaks. The framework can be plugged into any deraining architecture.
Result: Validated on several SOTA models, the method raises PSNR in OOD domains by 32% to 59% and markedly speeds up training convergence.
Conclusion: The proposed framework effectively mitigates the source-target mismatch in deraining, generates high-quality pseudo-data from rain-free backgrounds alone, and markedly improves model generalization and practicality.
Abstract: Image deraining plays a pivotal role in low-level computer vision, serving as a prerequisite for robust outdoor surveillance and autonomous driving systems. While deep learning paradigms have achieved remarkable success in firmly aligned settings, they often suffer from severe performance degradation when generalized to unseen Out-of-Distribution (OOD) scenarios. This failure stems primarily from the significant domain discrepancy between synthetic training datasets and the complex physical dynamics of real-world rain. To address these challenges, this paper proposes a pioneering cross-scenario deraining adaptation framework. Diverging from conventional approaches, our method obviates the requirements for paired rainy observations in the target domain, leveraging exclusively rain-free background images. We design a Superpixel Generation (Sup-Gen) module to extract stable structural priors from the source domain using Simple Linear Iterative Clustering. Subsequently, a Resolution-adaptive Fusion strategy is introduced to align these source structures with target backgrounds through texture similarity, ensuring the synthesis of diverse and realistic pseudo-data. Finally, we implement a pseudo-label re-Synthesize mechanism that employs multi-stage noise generation to simulate realistic rain streaks. This framework functions as a versatile plug-and-play module capable of seamless integration into arbitrary deraining architectures.
Extensive experiments on state-of-the-art models demonstrate that our approach yields remarkable PSNR gains of up to 32% to 59% in OOD domains while significantly accelerating training convergence.
[264] HumanOmni-Speaker: Identifying Who said What and When
Detao Bai,Shimin Yao,Weixuan Chen,Xihan Wei,Zhiheng Ma
Main category: cs.CV
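TL;DR: This paper proposes the VR-SDR task and the HumanOmni-Speaker model. With high-frame-rate visual sampling and a Visual Delta Encoder, it achieves truly end-to-end alignment of speaker identity, timing, and semantics in multi-speaker settings, breaking the "illusion of competence" of existing models that rely on visual shortcuts.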
Details
Motivation: Existing omni-modal large models have a fundamental weakness in understanding multi-person conversational dynamics (who said what and when): they rely on visual biases in datasets and on low-frame-rate sampling, cannot capture high-frequency temporal dynamics such as lip movements, and so exhibit an "illusion of competence".
Method: Proposes the VR-SDR task and the HumanOmni-Speaker Benchmark; builds the HumanOmni-Speaker model, which takes raw 25 fps video as input and introduces a Visual Delta Encoder that compresses inter-frame motion residuals into just 6 tokens per frame for efficient fine-grained visual representation.
Result: The model natively supports end-to-end lip-reading and high-precision spatial localization without cropping, and clearly outperforms existing methods on a range of speaker-centric tasks.
Conclusion: By strictly eliminating visual shortcuts and strengthening visual temporal modeling, the paper moves omni-modal models toward human-level conversational understanding and establishes a new paradigm and benchmark for multi-speaker audio-visual understanding.
Abstract: While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence" -- they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
[265] RefracGS: Novel View Synthesis Through Refractive Water Surfaces with 3D Gaussian Ray Tracing
Yiming Shao,Qiyu Dai,Chong Gao,Guanbin Li,Yeqiang Wang,He Sun,Qiong Zeng,Baoquan Chen,Wenzheng Chen
Main category: cs.CV
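TL;DR: This paper proposes RefracGS, a framework that jointly reconstructs a refractive water surface and the underwater scene to tackle novel view synthesis through non-planar refractive surfaces. It models the water surface with a neural height field, represents the underwater scene with a 3D Gaussian field, and designs a refraction-aware Gaussian ray tracing algorithm, achieving high-quality, real-time novel view synthesis and consistent surface recovery.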
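Refraction-aware ray tracing rests on Snell's law, n1 sin θ1 = n2 sin θ2. The sketch below bends a single ray at a locally flat air-water interface. It assumes refractive indices of 1.0 and 1.33 (typical textbook values, not taken from the paper) and a hypothetical helper named `refract`; RefracGS derives the surface from a learned height field, which is not modeled here.

```python
import math


def refract(direction, normal, n1, n2):
    """Refract a unit direction at a surface whose unit normal faces the incoming ray.

    Returns the refracted unit direction, or None on total internal reflection.
    """
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)  # Snell's law, squared form
    if sin2_t > 1.0:
        return None  # total internal reflection: no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * d + (eta * cos_i - cos_t) * n
                 for d, n in zip(direction, normal))


theta_i = math.radians(45.0)
ray = (math.sin(theta_i), 0.0, -math.cos(theta_i))  # travelling downward into water
up = (0.0, 0.0, 1.0)                                 # flat-surface normal (assumed)
bent = refract(ray, up, 1.0, 1.33)
theta_t = math.degrees(math.asin(bent[0]))
print(f"incidence 45.00 deg -> refraction {theta_t:.2f} deg")
```

Entering the denser medium, the ray bends toward the normal, to roughly 32 degrees in this toy case. On a wavy surface the normal changes from point to point, which is why straight-line ray assumptions break down.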
Details
Motivation: Existing NVS methods such as NeRF and 3DGS assume straight-line ray propagation and fail under non-planar refractive surfaces such as wavy water, where severe, spatially varying optical distortion produces visible artifacts.
Method: Proposes RefracGS: a neural height field explicitly models the refractive water surface (capturing wave geometry) and a 3D Gaussian field represents the underwater scene; a refraction-aware Gaussian ray tracer based on Snell's law supports non-linear ray paths, efficient rendering, and gradient backpropagation; the two representations are jointly optimized end to end.
Result: On synthetic and real scenes with complex waves, RefracGS surpasses previous refractive NVS methods in visual quality, trains 15x faster, and renders at 200 FPS (real time).
Conclusion: Explicitly decoupling the refractive interface from the target scene and optimizing them jointly is an effective paradigm for improving the fidelity and efficiency of novel view synthesis under refraction.
Abstract: Novel view synthesis (NVS) through non-planar refractive surfaces presents fundamental challenges due to severe, spatially varying optical distortions. While recent representations like NeRF and 3D Gaussian Splatting (3DGS) excel at NVS, their assumption of straight-line ray propagation fails under these conditions, leading to significant artifacts. To overcome this limitation, we introduce RefracGS, a framework that jointly reconstructs the refractive water surface and the scene beneath the interface. Our key insight is to explicitly decouple the refractive boundary from the target objects: the refractive surface is modeled via a neural height field, capturing wave geometry, while the underlying scene is represented as a 3D Gaussian field. We formulate a refraction-aware Gaussian ray tracing approach that accurately computes non-linear ray trajectories using Snell's law and efficiently renders the underlying Gaussian field while backpropagating the loss gradients to the parameterized refractive surface. Through end-to-end joint optimization of both representations, our method ensures high-fidelity NVS and view-consistent surface recovery. Experiments on both synthetic and real-world scenes with complex waves demonstrate that RefracGS outperforms prior refractive methods in visual quality, while achieving 15x faster training and real-time rendering at 200 FPS. The project page for RefracGS is available at https://yimgshao.github.io/refracgs/.
[266] PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma
Zelin Liu,Xiangfu Yu,Jie Huang,Ge Wang,Yizhe Yuan,Zhenyu Yi,Jing Xie,Haotian Jiang,Lichi Zhang
Main category: cs.CV
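TL;DR: This paper proposes PPGL-Swarm, a multi-agent PPGL diagnostic system that automates GAPP scoring, genetic risk alerts, and multimodal report generation, and improves the transparency and accuracy of clinical decisions through a traceable reasoning chain.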
Details
Motivation: Current clinical diagnosis of PPGL suffers from the heavy workload and subjectivity of GAPP scoring and from missed key genetic risks (such as SDHB mutations), while existing AI systems lack interpretability and integration of domain knowledge.
Method: Builds the PPGL-Swarm multi-agent system: the diagnostic task is decomposed into micro-tasks carried out cooperatively by specialized agents (cytology, Ki-67, gene, table, and others); a knowledge enhancement mechanism interprets genotype and laboratory data; reinforcement learning refines tool selection and task assignment.
Result: The system performs fully automated GAPP scoring (with quantified cellularity and Ki-67), issues risk alerts for SDHB and other gene mutations, generates multimodal diagnostic reports with traceable evidence, and provides an auditable reasoning path.
Conclusion: PPGL-Swarm improves the objectivity, interpretability, and clinical utility of PPGL diagnosis and offers a new paradigm for intelligent, individualized diagnosis and treatment of rare tumors.
Abstract: Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses the GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring imposes a high workload on clinicians because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and a multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent.
The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.
[267] Rethinking Token Reduction for Large Vision-Language Models
Yi Wang,Haofei Zhang,Qihan Huang,Anda Cao,Gongfan Fang,Wei Wang,Xuan Jin,Jie Song,Mingli Song,Xinchao Wang
Main category: cs.CV
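TL;DR: This paper proposes MetaCompress, a learning-based, prompt-agnostic visual token compression method designed for multi-turn visual question answering (MT-VQA). It unifies pruning and merging strategies through a learnable compression mapping, is trained in a data-efficient way, and achieves a better efficiency-accuracy trade-off while generalizing across dialogue turns.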
Details
Motivation: Existing visual token compression methods mainly target single-turn VQA and struggle with multi-turn VQA, where later questions are unknown in advance and may refer to arbitrary image regions; prompt-dependent methods are biased toward the initial prompt and lose information needed later, while prompt-agnostic methods rely on heuristic metrics such as attention scores, which limits performance.
Method: Models token compression as a learnable compression mapping that unifies paradigms such as pruning and merging; proposes a data-efficient training paradigm that learns the optimal compression strategy at low computational cost; the method is prompt-agnostic and learning-driven and does not depend on the text prompt of any particular turn.
Result: Validated on multiple MT-VQA benchmarks and across different LVLM architectures, MetaCompress markedly improves the efficiency-accuracy balance and generalizes strongly across dialogue turns; the code is open source.
Conclusion: MetaCompress overcomes the limits of traditional heuristic compression in multi-turn settings and demonstrates the effectiveness and practicality of learned, prompt-agnostic compression in complex interactive vision-language tasks.
Abstract: Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; and prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns.
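One way to picture a compression mapping that covers both pruning and merging is as a k x n matrix applied to n visual tokens. The sketch below is an assumption made for illustration and is not MetaCompress itself: the matrix form, the helper names, and the toy tokens are hypothetical, and the mappings are written by hand rather than learned.

```python
def compress(tokens, mapping):
    """Apply a k x n mapping to n tokens of dimension d, returning k tokens."""
    n, d = len(tokens), len(tokens[0])
    if any(len(row) != n for row in mapping):
        raise ValueError("each mapping row needs one weight per input token")
    return [
        [sum(row[i] * tokens[i][j] for i in range(n)) for j in range(d)]
        for row in mapping
    ]


def pruning_map(keep, n):
    """One-hot rows: each output token copies exactly one kept input token."""
    return [[1.0 if i == idx else 0.0 for i in range(n)] for idx in keep]


def merging_map(groups, n):
    """Averaging rows: each output token is the mean of one group of inputs."""
    return [[1.0 / len(g) if i in g else 0.0 for i in range(n)] for g in groups]


tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
pruned = compress(tokens, pruning_map([0, 3], 4))            # keep tokens 0 and 3
merged = compress(tokens, merging_map([[0, 1], [2, 3]], 4))  # average adjacent pairs
print(pruned)
print(merged)
```

Both calls reduce four tokens to two through the same `compress` function. Only the mapping differs, which is what makes it natural to treat the mapping as the thing to be learned.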
Our code is available at https://github.com/MArSha1147/MetaCompress.
[268] Getting to the Point: Why Pointing Improves LVLMs
Simone Alghisi,Massimo Rizzoli,Seyed Mahed Mousavi,Giuseppe Riccardi
Main category: cs.CV
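TL;DR: This paper studies the role of pointing in zero-shot counting with large vision-language models (LVLMs). It finds that Point-then-Count improves generalization through explicit coordinate prediction, that the gain comes from the spatial information encoded in the coordinates, and that predictions show region-dependent spatial bias.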
Details
Motivation: Although pointing has been shown to improve the accuracy and explainability of LVLMs, its underlying mechanism, its role in cognitive tasks such as zero-shot counting, and the reliability of the intermediate point predictions remain unclear.
Method: On zero-shot counting, compares two fine-tuning paradigms, Direct Counting (predict only the total) and Point-then-Count (predict target object coordinates first, then count), with F1 evaluation, spatial bias analysis, and mechanistic analysis.
Result: Point-then-Count markedly improves out-of-distribution generalization; over 89% of predicted coordinates are accurate (as measured by F1), but performance differs across image regions; mechanistic analysis shows that counting gains come from the spatial information carried by the coordinates.
Conclusion: Pointing not only improves LVLM counting performance but also promotes modeling of spatial relations and skill learning; however, the reliability of intermediate points is affected by spatial bias and needs further calibration.
Abstract: Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
[269] Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts
Xu Liu,Yongheng Zhang,Qiguang Chen,Yao Li,Sheng Wang,Libo Qin
Main category: cs.CV
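TL;DR: This paper proposes DaP-ICoT, which uses dynamic visual thought integration and precise visual thought guidance to fix the static visual positioning and broken visual representation of existing ICoT methods, markedly improving the efficiency and performance of multimodal reasoning.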
Details
Motivation: Existing Interleaved-modal Chain-of-Thought (ICoT) methods have two key flaws: static visual thought positioning (visual information is inserted at fixed steps, which is inefficient and inflexible) and broken visual thought representation (visual tokens are discontinuous and semantically incoherent).
Method: Proposes the DaP-ICoT framework with two core components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs according to reasoning needs; and (2) Precise Visual Thought Guidance, which keeps visual representations semantically coherent and contextually aligned.
Result: Reaches SOTA performance across multiple benchmarks and models, greatly reduces the number of inserted images, and cuts token consumption by 72.6%.
Conclusion: DaP-ICoT overcomes the rigid and fragmented use of visual information in ICoT and achieves more efficient and more precise multimodal chain-of-thought reasoning.
Abstract: Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency. (2) Precise Visual Thought Guidance ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.
[270] SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
Bingxuan Zhao,Qing Zhou,Chuang Yang,Qi Wang
Main category: cs.CV
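TL;DR: This paper proposes SHARP, a spectrum-aware dynamic scaling strategy for positional embeddings that raises the resolution of remote sensing image generation without additional training. Combined with the domain-specific DiT model RS-FLUX, it clearly outperforms existing training-free methods in multi-scale generation.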
Details
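Motivation: Remote sensing image synthesis faces two challenges: the lack of a domain-specific DiT prior and the prohibitive cost of training at high resolution. Existing training-free resolution promotion methods based on RoPE rescaling use a static scaling strategy that does not suit remote sensing imagery, which is rich in medium- and high-frequency detail (such as vehicles and building contours), and this harms generation quality.
Method: 1) Fine-tunes FLUX on more than 100,000 remote sensing images to build the domain-specific prior RS-FLUX; 2) proposes SHARP, which introduces a rational fractional time schedule k_rs(t) to control RoPE scaling strength dynamically: strong scaling early to promote global layout, then progressively weaker scaling to preserve high-frequency detail, giving an adaptive positional embedding matched to the frequency behavior of diffusion denoising.
Result: Across six square and rectangular resolutions, SHARP clearly surpasses all training-free baselines and performs best on CLIP Score, Aesthetic Score, and HPSv2, with a larger advantage at high extrapolation factors and negligible computational overhead; it supports robust multi-scale generation with a single set of hyperparameters.
Conclusion: SHARP validates aligning positional embedding scaling with the frequency evolution of the diffusion process, provides an efficient, general, training-free high-resolution solution for remote sensing image generation, and advances the joint development of domain-specific generative priors and dynamic architecture design.
Abstract: Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising.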
Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.
[271] Dynamic Exposure Burst Image Restoration
Woohyeok Kim,Jaesung Rim,Daeyeon Kim,Sunghyun Cho
Main category: cs.CV
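TL;DR: This paper proposes Dynamic Exposure Burst Image Restoration (DEBIR), which improves restoration quality by dynamically predicting the optimal exposure times for the shooting environment.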
Details
Motivation: Existing burst image restoration methods rely on manually designed exposure settings; these settings strongly affect restoration performance, yet the question of how to find optimal exposure settings has been overlooked.
Method: Proposes the DEBIR framework, in which the Burst Auto-Exposure Network (BAENet) predicts the optimal exposure time for each burst image from a preview image, motion magnitude, and gain; a burst image restoration network then performs the reconstruction; training uses a differentiable burst simulator and a three-stage training strategy.
Result: Achieves state-of-the-art restoration quality in experiments and is validated on a real camera system.
Conclusion: Dynamically predicting and optimizing exposure times markedly improves burst image restoration, and the proposed DEBIR framework is both high-performing and practical.
Abstract: Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings. Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked. In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment. In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain. Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times. For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality.
[272] Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends
Simone Nascivera,Leonard Bauersfeld,Jeff Delaune,Davide Scaramuzza
Main category: cs.CV
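TL;DR: This paper proposes an image-conditioned reinforcement learning framework for online adaptive tuning of visual odometry (VO) frontend parameters, markedly improving feature track length and computational efficiency.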
Details
Motivation: The performance of existing sparse direct and semi-direct VO systems depends heavily on hand-tuned parameters, and fixed parameters cannot adapt to varying scenes (texture density, illumination, motion blur, and so on), which makes them brittle in real environments.
Method: Formulates VO frontend configuration as a sequential decision problem and trains a policy network using a lightweight texture-aware CNN encoder together with a privileged critic, so that the policy outputs detection and tracking parameters in real time based on the input image content.
Result: On TartanAirV2 and TUM RGB-D, achieves 3x longer feature tracks and 3x lower computational cost, with training done entirely in simulation.
Conclusion: The method is the first to enable image-driven online adaptive tuning of the VO frontend, requires no real-world labels or manual intervention, and markedly improves the robustness and efficiency of resource-constrained robots in dynamic environments.
Abstract: Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.
[273] The Universal Normal Embedding
Chen Tasker,Roy Betser,Eyal Gofer,Meir Yossef Levi,Guy Gilboa
Main category: cs.CV
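TL;DR: This paper proposes the Universal Normal Embedding (UNE) hypothesis, which holds that the diffusion noise of generative models and the semantic embeddings of vision encoders share an approximately Gaussian latent space. It builds the NoiseZoo dataset and tests the hypothesis with linear probes and controllable editing, revealing a unified latent geometry linking encoding and generation.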
Details
Motivation: Generative models and vision encoders have each advanced rapidly but differ in goals and mathematical foundations; observing that the latent spaces of both exhibit Gaussian behavior, the authors propose that they may arise from a single shared Gaussian latent source (UNE).
Method: Proposes the UNE hypothesis and builds the NoiseZoo dataset (DDIM-inverted noise plus CLIP/DINO encoder embeddings); on CelebA, uses linear probes to assess semantic alignment and uses learned linear directions for controllable image editing (e.g., smile, gender, age) without architectural changes, with orthogonalization to reduce entanglement.
Result: Linear probes yield strong, aligned attribute predictions in both the noise and encoder embedding spaces; noise-based linear directions enable high-quality, controllable image editing; the evidence supports the UNE hypothesis and reveals a shared Gaussian-like latent geometry between encoding and generation.
Conclusion: The diffusion noise of generative models is not a meaningless random quantity but an approximately Gaussian latent representation that carries semantic information; UNE offers a new paradigm for a unified understanding of visual encoding and generation and opens a new style of editing based on semantic operations in noise space.
Abstract: Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
[274] Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTEvent
Lokeshwaran Manohar,Moritz Roidl
Main category: cs.CV
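TL;DR: This paper benchmarks recurrent ReYOLOv8s on MTEvent for industrial multi-class recognition against a non-recurrent YOLOv8s baseline, analyzing the effects of temporal memory and event-domain pretraining. Fine-tuning from GEN1 initialization performs best, whereas PEDRo initialization performs worse.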
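The mAP50 figures reported for this entry can be checked with a few lines of arithmetic. The sketch below recomputes the 9.6% relative improvement stated in the abstract; the GEN1 and PEDRo percentages are derived here from the reported mAP50 values and are not stated in the source:

```python
def relative_change(value, baseline):
    """Relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline

BASELINE_MAP50 = 0.260  # non-recurrent YOLOv8s baseline, from the abstract

reported = {
    "scratch recurrent (C21)": 0.285,
    "GEN1-initialized, clip length 21": 0.329,
    "PEDRo-initialized": 0.251,
}

for name, map50 in reported.items():
    print(f"{name}: {relative_change(map50, BASELINE_MAP50):+.1%}")
# scratch recurrent (C21): +9.6%
# GEN1-initialized, clip length 21: +26.5%
# PEDRo-initialized: -3.5%
```

The negative value for PEDRo matches the abstract's observation that mismatched source-domain pretraining can fall below the from-scratch result.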
Details
Motivation: Most event-based object detection studies focus on outdoor driving or limited-class settings, and a systematic evaluation of multi-class recognition in industrial environments is lacking.
Method: Benchmark recurrent ReYOLOv8s on the MTEvent dataset with a non-recurrent YOLOv8s baseline; compare three training strategies (training from scratch, fine-tuning from GEN1 event-domain pretraining, and PEDRo initialization) across clip lengths.
Result: The best scratch-trained recurrent model (C21) reaches 0.285 mAP50 on the validation split, a 9.6% relative improvement over the baseline; GEN1-pretrained fine-tuning achieves the best result of 0.329 mAP50 at clip length 21; PEDRo initialization reaches only 0.251 mAP50.
Conclusion: GEN1 event-domain pretraining substantially improves recurrent event-based detection in industrial scenes and outperforms random initialization; pretraining on a mismatched source domain can be less effective than training from scratch; the main failure modes are class imbalance and human-object interaction.
Abstract: Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTEvent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTEvent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the non-recurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
[275] Timing In stand-up Comedy: Text, Audio, Laughter, Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing
Yaelle Zribi,Florian Cafiero,Vincent Lépinay,Chahan Vidal-Gorène
Main category: cs.CV
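TL;DR: This paper presents TIC-TALK, a multimodal dataset that integrates language, gesture, and audience feedback (such as laughter) to analyze the relationship between body dynamics and audience response in stand-up comedy. Lower kinetic energy (i.e., more stillness) is associated with more laughter, the proportion of close-up shots correlates positively with laughter, and personal/bodily topics draw more laughter than geopolitical ones.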
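The entry describes kinetic energy as a continuous kinematic signal computed from raw 17-joint keypoints sampled at 1 fps. The source does not give the formula, so the following is only a sketch under the assumption of a unit-mass proxy, 0.5 * |v|^2 summed over joints, with velocity estimated from consecutive frames; the function name and the toy two-joint data are hypothetical:

```python
def kinetic_energy_proxy(frames, fps=1.0):
    """Per-transition kinetic energy proxy from 2-D keypoints.

    `frames` is a list of frames; each frame is a list of (x, y) joint
    coordinates. Every joint is treated as a unit mass (an assumption).
    """
    dt = 1.0 / fps
    energies = []
    for prev, curr in zip(frames, frames[1:]):
        total = 0.0
        for (x0, y0), (x1, y1) in zip(prev, curr):
            vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
            total += 0.5 * (vx * vx + vy * vy)
        energies.append(total)
    return energies

# Toy example with two joints over three frames (not real pose data).
frames = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(3.0, 4.0), (1.0, 1.0)],  # first joint moves 5 units, second is still
    [(3.0, 4.0), (1.0, 1.0)],  # performer is completely still
]
print(kinetic_energy_proxy(frames))  # [12.5, 0.0]
```

Averaging such values within each topic segment would give one kinetic-energy number per topic, which is the kind of quantity that could then be correlated with laughter rate.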
Details
Motivation: Existing work mostly focuses on the verbal content of stand-up comedy and neglects embodied performance and real-time audience feedback; an aligned multimodal resource that jointly models language, movement, and audience response is needed.
Method: Build the TIC-TALK dataset: BERTopic for 60 s thematic segmentation; Whisper-AT for laughter detection at 0.8 s resolution; YOLOv8-cls for shot-type classification; YOLOv8s-pose for extracting 17-joint skeletal keypoints at 1 fps. All modality streams are aligned by hierarchical temporal containment without resampling, and raw coordinates are retained to compute continuous kinematic measures (e.g., arm spread, kinetic energy, trunk lean).
Result: Across 24 topics, kinetic energy is strongly negatively correlated with laughter rate (r = -0.75); personal/bodily topics elicit more laughter than geopolitical ones; the proportion of close-up shots correlates positively with laughter (r = +0.28).
Conclusion: Bodily stillness, shot framing, and topic type are key factors associated with audience laughter. TIC-TALK provides an extensible foundational resource for multimodal modeling of comedic performance and for human-computer interaction research.
Abstract: Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015-2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
[276] Anatomical Token Uncertainty for Transformer-Guided Active MRI Acquisition
Lev Ayzenberg,Shady Abu-Hussein,Raja Giryes,Hayit Greenspan
Main category: cs.CV
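TL;DR: This paper proposes TRUST-MRI, an active sampling framework built on a pretrained medical image tokenizer and a latent transformer. It defines uncertainty as token entropy in the latent space and uses it to guide adaptive k-space sampling, improving CS-MRI reconstruction quality.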
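The uncertainty measure named in this entry is token entropy: the Shannon entropy of the model's predicted distribution over the quantized token dictionary at each latent position. A minimal sketch of that quantity follows; the three-position, four-entry codebook example is hypothetical, and the projection of patch-wise entropy into k-space (the LES step) is not shown because the source does not specify it:

```python
from math import log

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's categorical distribution."""
    return -sum(p * log(p) for p in probs if p > 0.0)

def most_uncertain(token_probs, k):
    """Indices of the k token positions with the highest entropy."""
    entropies = [token_entropy(p) for p in token_probs]
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return order[:k]

# Hypothetical predictions over a 4-entry codebook for three patch positions.
token_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform prediction, maximal entropy
    [0.70, 0.10, 0.10, 0.10],
]
print(most_uncertain(token_probs, 2))  # [1, 2]
```

In an active acquisition loop, positions flagged this way would indicate where additional measurements are expected to be most informative.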
Details
Motivation: Full MRI data acquisition is slow, which limits clinical throughput; conventional CS-MRI relies on fixed sampling trajectories and reconstruction models and does not adapt to the content being scanned.
Method: Build an active sampling framework on a pretrained medical image tokenizer and a latent transformer; define token entropy as the latent uncertainty measure; propose two strategies: Latent Entropy Selection (LES), which projects patch-wise entropy into k-space to select sampling lines, and Gradient-based Entropy Optimization (GEO), which uses the k-space gradient of a total latent entropy loss to find regions of maximum uncertainty reduction.
Result: On the fastMRI Knee and Brain datasets at ×8 and ×16 acceleration, the LES and GEO policies outperform state-of-the-art baselines in perceptual metrics and feature-based distances.
Conclusion: Active sampling driven by latent token entropy improves under-sampled MRI reconstruction and offers a new approach to data-driven, adaptive acquisition.
Abstract: Full data acquisition in MRI is inherently slow, which limits clinical throughput and increases patient discomfort. Compressed Sensing MRI (CS-MRI) seeks to accelerate acquisition by reconstructing images from under-sampled k-space data, requiring both an optimal sampling trajectory and a high-fidelity reconstruction model. In this work, we propose a novel active sampling framework that leverages the inherent discrete structure of a pretrained medical image tokenizer and a latent transformer. By representing anatomy through a dictionary of quantized visual tokens, the model provides a well-defined probability distribution over the latent space. We utilize this distribution to derive a principled uncertainty measure via token entropy, which guides the active sampling process. We introduce two strategies to exploit this latent uncertainty: (1) Latent Entropy Selection (LES), projecting patch-wise token entropy into the $k$-space domain to identify informative sampling lines, and (2) Gradient-based Entropy Optimization (GEO), which identifies regions of maximum uncertainty reduction via the $k$-space gradient of a total latent entropy loss. We evaluate our framework on the fastMRI single-coil Knee and Brain datasets at $\times 8$ and $\times 16$ acceleration. Our results demonstrate that our active policies outperform state-of-the-art baselines in perceptual metrics and feature-based distances. Our code is available at https://github.com/levayz/TRUST-MRI.
[277] Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment
Lei Yang,Yi He,Fei Wu,Shilin Wang
Main category: cs.CV
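TL;DR: This paper proposes a cascade-free multitask learning architecture that jointly models intermediate representations such as phonemes and visemes and introduces a semantic-guided local contrastive loss, improving Mandarin Chinese visual speech recognition (VSR) while mitigating error accumulation and inference latency.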
Details
Motivation: Mandarin is a tonal language, which limits the effectiveness of conventional sequence-to-sequence modeling; existing cascaded systems (e.g., those using pinyin) improve accuracy but suffer from error accumulation and inference latency.
Method: Propose a cascade-free multitask learning architecture that jointly models intermediate representations such as phonemes and visemes; design a semantic-guided local contrastive loss that temporally aligns features and enables on-demand activation.
Result: Experiments on public datasets show that the method outperforms existing approaches while balancing inference efficiency and recognition accuracy.
Conclusion: The method mitigates error propagation and latency in VSR for tonal languages, supporting the effectiveness of jointly modeling multiple intermediate representations with contrastive alignment.
Abstract: Chinese Mandarin visual speech recognition (VSR) is a task that has advanced in recent years, yet still lags behind the performance on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to enhance recognition accuracy. While beneficial, in these cascaded designs, the subsequent stage during inference depends on the output of the preceding stage, leading to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phoneme and viseme, to better exploit contextual information. The proposed semantic-guided local contrastive loss temporally aligns the features, enabling on-demand activation during inference, thereby providing a trade-off between inference efficiency and performance while mitigating error accumulation caused by projection and re-embedding. Experiments conducted on publicly available datasets demonstrate that our method achieves superior recognition performance.
[278] Clinical Graph-Mediated Distillation for Unpaired MRI-to-CFI Hypertension Prediction
Dillan Imans,Phuoc-Nguyen Bui,Duc-Tai Le,Hyunseung Choo
Main category: cs.CV
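TL;DR: This paper proposes Clinical Graph-Mediated Distillation (CGMD), a framework that, without paired MRI-fundus data, builds a cross-modal kNN graph from clinical biomarkers and distills hypertension knowledge from MRI into a fundus model, improving unpaired cross-modal knowledge transfer.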
Details
Motivation: Fundus images are cheap and easy to acquire, but their hypertension (HTN) cues are subtle; brain MRI is more informative but expensive and rarely acquired alongside fundus images, which leaves the modalities siloed and prevents joint modeling.
Method: CGMD proceeds in four steps: (1) build a clinical-similarity kNN graph spanning the MRI and fundus cohorts from shared structured biomarkers; (2) train an MRI teacher and propagate its representations over the graph; (3) impute brain-informed representation targets for fundus samples; (4) train the fundus student with a joint objective combining HTN supervision, target distillation, and relational distillation.
Result: On a newly collected unpaired MRI-fundus-biomarker dataset, CGMD consistently outperforms standard distillation and non-graph imputation baselines; ablations confirm that clinically grounded graph connectivity is essential.
Conclusion: CGMD bridges the modality gap between MRI and fundus imaging, demonstrating the feasibility and effectiveness of unpaired cross-modal knowledge transfer via a graph built from clinical priors.
Abstract: Retinal fundus imaging enables low-cost and scalable hypertension (HTN) screening, but HTN-related retinal cues are subtle, yielding high-variance predictions. Brain MRI provides stronger vascular and small-vessel-disease markers of HTN, yet it is expensive and rarely acquired alongside fundus images, resulting in modality-siloed datasets with disjoint MRI and fundus cohorts. We study this unpaired MRI-fundus regime and introduce Clinical Graph-Mediated Distillation (CGMD), a framework that transfers MRI-derived HTN knowledge to a fundus model without paired multimodal data. CGMD leverages shared structured biomarkers as a bridge by constructing a clinical similarity kNN graph spanning both cohorts. We train an MRI teacher, propagate its representations over the graph, and impute brain-informed representation targets for fundus patients. A fundus student is then trained with a joint objective combining HTN supervision, target distillation, and relational distillation. Experiments on our newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based HTN prediction over standard distillation and non-graph imputation baselines, with ablations confirming the importance of clinically grounded graph connectivity. Code is available at https://github.com/DillanImans/CGMD-unpaired-distillation.
[279] Ctrl-A: Control-Driven Online Data Augmentation
Jesper B. Christensen,Ciaran Bench,Spencer A. Thomas,Hüsnü Aslan,David Balslev-Harder,Nadia A. S. Smith,Alessandra Manzin
Main category: cs.CV
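TL;DR: This paper proposes ControlAugment (Ctrl-A), an automated data augmentation algorithm grounded in control theory that adjusts augmentation strength distributions online during training. It requires no manually designed augmentation policy and is competitive on several benchmark datasets.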
Details
Motivation: Existing data augmentation methods require manually designed policies and initialized augmentation strengths, which makes them hard to adapt to new vision tasks.
Method: Use a control-loop architecture from control theory together with relative operation response curves to adapt each augmentation operation's strength distribution dynamically and independently during training, with an operation-dependent update procedure.
Result: On CIFAR-10, CIFAR-100, and SVHN-core with WideResNet-28-10, Ctrl-A performs on par with current state-of-the-art augmentation methods.
Conclusion: Ctrl-A is a general data augmentation framework that adapts augmentation strength without manual intervention, improving the automation and generalization of augmentation policies.
Abstract: We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.
[280] Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion
Yanglin Deng,Tianyang Xu,Chunyang Cheng,Hui Li,Xiao-jun Wu,Josef Kittler
Main category: cs.CV
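TL;DR: This paper challenges the necessity of the Strictly Paired Training Paradigm (SPTP) in infrared and visible image fusion (IVIF), and proposes and validates UnPaired (UPTP) and Arbitrarily Paired (APTP) training paradigms that still achieve high-performance fusion when data is scarce and unaligned.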
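To make the contrast between strict and arbitrary pairing concrete, the sketch below assumes that arbitrary pairing means drawing the infrared and visible image of each training pair independently from their respective pools. This is an illustrative reading only; the paper's actual sampling procedure and losses are not described in this entry, and the file names are hypothetical:

```python
import random

def strict_pairs(ir_images, vis_images):
    """SPTP-style pairing: the i-th infrared image goes with the i-th visible image."""
    return list(zip(ir_images, vis_images))

def arbitrary_pairs(ir_images, vis_images, n_pairs, seed=0):
    """Assumed APTP-style pairing: each modality is sampled independently."""
    rng = random.Random(seed)
    return [(rng.choice(ir_images), rng.choice(vis_images)) for _ in range(n_pairs)]

# Hypothetical tiny dataset of 8 infrared and 8 visible images.
ir = [f"ir_{i}.png" for i in range(8)]
vis = [f"vis_{i}.png" for i in range(8)]

print(len(strict_pairs(ir, vis)))  # 8 aligned pairs
print(len(ir) * len(vis))          # 64 possible cross-modal combinations
batch = arbitrary_pairs(ir, vis, n_pairs=20)
```

The count of possible combinations grows with the product of the two pool sizes rather than with the number of aligned pairs, which is one way to read the abstract's remark that rigid pairing restricts the volume of cross-modal relationships.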
Details
Motivation: Existing IVIF methods rely on large sets of rigidly aligned infrared-visible image pairs for training, but such data is costly and time-consuming to obtain; fixed pairing also limits cross-modal relationship modeling and hurts generalization.
Method: Establish a theoretical objective characterizing APTP, and design a practical framework that substantially enriches cross-modal relationships even with very little, unaligned data; build three lightweight end-to-end baselines (CNN-, Transformer-, and GAN-based) with an accompanying set of new loss functions.
Result: On a severely limited, content-inconsistent dataset, models trained with UPTP and APTP perform comparably to models trained under SPTP on a dataset 100 times larger.
Conclusion: UPTP and APTP are feasible, efficient training paradigms for IVIF that greatly reduce the cost and difficulty of data collection and improve model robustness, offering a practical path for IVIF research.
Abstract: Infrared and visible image fusion (IVIF) combines complementary modalities while preserving natural textures and salient thermal signatures. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting generalisation performance. To this end, this work challenges the necessity of Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100$\times$ larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies.
The code is available at https://github.com/yanglinDeng/IVIF_unpair.
[281] SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection
Shuxian Zhao,Jie Gui,Baosheng Yu,Lu Dong,Zhipeng Gui
Main category: cs.CV
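TL;DR: This paper presents SteelDefectX, a vision-language dataset of 7,778 images across 25 defect categories with coarse-to-fine textual descriptions, designed to improve the interpretability, generalization, and transferability of steel surface defect detection, and establishes a benchmark covering multiple tasks.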
Details
Motivation: Existing steel surface defect detection methods mostly rely on image classification models trained on label-only data, which limits interpretability and generalization.
Method: Build the SteelDefectX vision-language dataset with coarse-grained annotations (defect category, visual attributes, industrial causes) and fine-grained annotations (shape, size, depth, position, contrast), and design a benchmark of four evaluation tasks.
Result: Experiments show that combining coarse- and fine-grained textual annotations significantly improves interpretability, generalization, and cross-domain zero-shot transfer.
Conclusion: SteelDefectX provides a new benchmark and public resource for explainable, generalizable steel defect detection research.
Abstract: Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks (vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer) to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available at https://github.com/Zhaosxian/SteelDefectX.
[282] Multi-View Deformable Convolution Meets Visual Mamba for Coronary Artery Segmentation
Xiaochan Yuan,Pai Zeng
Main category: cs.CV
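TL;DR: This paper proposes MDSVM-UNet, a two-stage coronary artery segmentation framework that combines multidirectional snake convolution (MDSConv) with residual visual Mamba (RVM) to address the difficulties posed by slender, multi-branching vessels and foreground-background class imbalance in CTA images.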
Details
Motivation: Coronary arteries in CTA images are slender and multi-branching, with severe class imbalance against the background; CNNs struggle to model long-range dependencies, and ViTs are too computationally expensive for resource-constrained clinical settings.
Method: Propose MDSVM-UNet. The encoder uses MDSConv, which learns adaptive offsets along the sagittal, coronal, and axial planes to fuse multi-view features; the decoder uses an RVM-based upsampling block that models inter-slice long-range dependencies with a selective state space mechanism; a coarse-to-fine two-stage strategy is adopted (coarse whole-image segmentation, then intelligent block extraction, then fine block-level segmentation).
Result: No quantitative results are given in the abstract; the design aims to recover fine, tortuous vessel structures and suppress false positives while preserving linear computational complexity.
Conclusion: MDSVM-UNet is designed to balance modeling capacity and computational efficiency for accurate, efficient automatic coronary artery segmentation in clinical CTA images.
Abstract: Accurate segmentation of coronary arteries from computed tomography angiography (CTA) images is of paramount clinical importance for the diagnosis and treatment planning of cardiovascular diseases. However, coronary artery segmentation remains challenging due to the inherent multi-branching and slender tubular morphology of the vasculature, compounded by severe class imbalance between foreground vessels and background tissue. Conventional convolutional neural network (CNN)-based approaches struggle to capture long-range dependencies among spatially distant vascular structures, while Vision Transformer (ViT)-based methods incur prohibitive computational overhead that hinders deployment in resource-constrained clinical settings. Motivated by the recent success of state space models (SSMs) in efficiently modeling long-range sequential dependencies with linear complexity, we propose MDSVM-UNet, a novel two-stage coronary artery segmentation framework that synergistically integrates multidirectional snake convolution (MDSConv) with residual visual Mamba (RVM). In the encoding stage, we introduce MDSConv, a deformable convolution module that learns adaptive offsets along three orthogonal anatomical planes -- sagittal, coronal, and axial -- thereby enabling comprehensive multi-view feature fusion that faithfully captures the elongated and tortuous geometry of coronary vessels. In the decoding stage, we design an RVM-based upsampling decoder block that leverages selective state space mechanisms to model inter-slice long-range dependencies while preserving linear computational complexity.
Furthermore, we propose a progressive two-stage segmentation strategy: the first stage performs coarse whole-image segmentation to guide intelligent block extraction, while the second stage conducts fine-grained block-level segmentation to recover vascular details and suppress false positives.
[283] Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning
Sulian Thual,Feiyang Cai,Jingjing Wang,Feng Luo
Main category: cs.CV
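TL;DR: This paper proposes a video diffusion model, trained on atmospheric reanalysis data, that generates MJO sequences conditioned on low-dimensional metrics, linking traditional theory with deep learning and helping to identify physical drivers.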
Details
Motivation: Clarify the relationship between generative deep learning and traditional MJO theoretical frameworks, and bridge the gap between low-dimensional theory and high-resolution atmospheric complexity.
Method: Train a video diffusion model on atmospheric reanalysis data to generate MJO sequences conditioned on key low-dimensional metrics (e.g., season, ENSO).
Result: The generated MJO sequences reproduce key features such as composites, power spectra, and multiscale structures including convectively coupled waves, with some bias; idealized conditioning allows the MJO to be deconstructed and its physical drivers identified.
Conclusion: The approach offers a practical framework for connecting low-dimensional MJO theory with high-resolution atmospheric modeling and may help improve tropical atmosphere prediction.
Abstract: Generative deep learning is a powerful tool for modeling the Madden-Julian oscillation (MJO) in the tropics, yet its relationship to traditional theoretical frameworks remains poorly understood. Here we propose a video diffusion model, trained on atmospheric reanalysis, to synthesize long MJO sequences conditioned on key low-dimensional metrics. The generated MJOs capture key features including composites, power spectra and multiscale structures including convectively coupled waves, despite some bias. We then prompt the model to generate more tractable MJOs based on intentionally idealized low-dimensional conditionings, for example a perpetual MJO, an isolated modulation by seasons and/or the El Niño-Southern Oscillation, and so on. This enables deconstructing the underlying processes and identifying physical drivers. The present approach provides a practical framework for bridging the gap between low-dimensional MJO theory and high-resolution atmospheric complexity and will help tropical atmosphere prediction.
[284] Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
Yuyang You,Yongzhi Li,Jiahui Li,Yadong Mu,Quan Chen,Peng Jiang
Main category: cs.CV
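TL;DR: This paper proposes a distillation framework designed specifically for video diffusion models. Using an adaptive regression loss, a temporal regularization loss, and an inference-time frame interpolation strategy, it mitigates oversaturation, temporal inconsistency, and mode collapse in video generation, and improves perceptual fidelity and motion realism on the VBench and VBench2 benchmarks.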
Details
Motivation: Existing distillation methods for video diffusion models mostly apply image distillation techniques directly, which tends to cause oversaturation, temporal inconsistency, and mode collapse; distillation schemes designed for video are scarce.
Method: Three contributions: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to suppress artifacts caused by distribution shift; (2) a temporal regularization loss that prevents temporal collapse and promotes smooth, physically plausible sampling trajectories; (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality.
Result: Extensive experiments and ablations on VBench and VBench2 show stable few-step video synthesis, markedly improved perceptual fidelity and motion realism, and consistent gains over existing distillation baselines across multiple metrics.
Conclusion: The framework is tailored to video diffusion models and addresses key problems caused by transferring generic image distillation, offering a new approach to efficient, high-quality video generation.
Abstract: Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.
[285] Adversarial Camouflage
Paweł Borsukiewicz,Daniele Lunghi,Melissa Tessa,Jacques Klein,Tegawendé F. Bissyandé
Main category: cs.CV
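TL;DR: This paper proposes Adversarial Camouflage, a privacy-protection method that projects optimized low-dimensional patterns (color, shape, angle) onto specific facial regions. It substantially reduces the recognition accuracy of multiple state-of-the-art face recognition models in simulation and shows promising results in real-world human experiments.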
Details
Motivation: The widespread deployment of face recognition raises risks of mass surveillance and loss of individual privacy, creating a need for a simple, efficient privacy-protection scheme that users can reproduce in the physical world.
Method: Define a low-dimensional pattern space parameterized by color, shape, and angle; project optimized patterns onto semantically valid facial regions for evaluation; maximize recognition error across multiple face recognition architectures to improve cross-model transferability to black-box systems.
Result: In simulation, the method significantly degrades all tested state-of-the-art models; it shows promising results in real-world human experiments and reveals differences in model robustness and evidence of attack transferability across architectures.
Conclusion: Adversarial Camouflage is an efficient, easily reproducible physical-world privacy-protection method with strong cross-model transferability, offering a practical response to privacy threats from face recognition.
Abstract: While the rapid development of facial recognition algorithms has enabled numerous beneficial applications, their widespread deployment has raised significant concerns about the risks of mass surveillance and threats to individual privacy. In this paper, we introduce Adversarial Camouflage as a novel solution for protecting users' privacy. This approach is designed to be efficient and simple to reproduce for users in the physical world. The algorithm starts by defining a low-dimensional pattern space parameterized by color, shape, and angle. Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation. Our method maximizes recognition error across multiple architectures, ensuring high cross-model transferability even against black-box systems. It significantly degrades the performance of all tested state-of-the-art face recognition models during simulations and demonstrates promising results in real-world human experiments, while revealing differences in model robustness and evidence of attack transferability across architectures.
[286] Manifold-Aware Exploration for Reinforcement Learning in Video Generation
Mingzhe Zheng,Weijie Kong,Yue Wu,Dengyang Jiang,Yue Ma,Xuanhua He,Bin Lin,Kaixiong Gong,Zhao Zhong,Liefeng Bo,Qifeng Chen,Harry Yang
Main category: cs.CV
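TL;DR: This paper proposes SAGE-GRPO, which applies constraints at both the micro and macro levels to stabilize exploration during GRPO training for video generation, improving the reliability of reward estimates and video quality.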
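For context on why reward reliability matters here: in standard GRPO, each rollout's advantage is its reward normalized against the other rollouts sampled for the same prompt, so noisy rollouts distort the whole group's learning signal. The sketch below shows that generic group-relative normalization only; it is not the SAGE-GRPO method, and the reward values are hypothetical:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for four rollouts of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# The best rollout gets the largest positive advantage, and the advantages
# sum to (approximately) zero within the group.
best = max(range(len(rewards)), key=lambda i: advantages[i])
```

Because every advantage depends on the group mean and spread, a few low-quality rollouts caused by excess exploration noise shift the signal for all samples in the group.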
Details
Motivation: Existing GRPO methods for video generation (e.g., FlowGRPO) are far less reliable than their counterparts for language and image models, mainly because the ODE-to-SDE conversion injects excess noise, which lowers rollout quality, distorts reward estimates, and destabilizes alignment.
Method: Treat the pre-trained model as defining a valid video manifold and constrain exploration to its vicinity. At the micro level, derive a manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer; at the macro level, use a dual trust region with a periodic moving anchor and stepwise constraints to limit long-horizon drift.
Result: Evaluated on HunyuanVideo1.5 with VideoAlign as the reward model, SAGE-GRPO shows consistent gains over previous methods in VQ, MQ, TA, and visual metrics such as CLIPScore and PickScore.
Conclusion: Through manifold constraints and a hierarchical stability design, SAGE-GRPO mitigates noise sensitivity and drift in video GRPO training, yielding more reliable reward alignment and higher video quality.
Abstract: Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality.
The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
[287] Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems
Chengyin Hu,Yikun Guo,Yuxian Dong,Qike Zhang,Kalibinuer Tiliwalidi,Yiwei Wei,Haitao Shi,Jiujiang Guo,Jiahuan Long,Xiang Chen
Main category: cs.CV
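TL;DR: This paper proposes the Universal Physical Patch Attack (UPPA) against infrared pedestrian detectors. It models perturbations with parameterized Bezier blocks and optimizes them globally with Particle Swarm Optimization (PSO), producing physically realizable low-temperature cold patches. While maintaining topological stability and natural thermal radiation characteristics, it achieves a high attack success rate, cross-domain generalization, and black-box transferability.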
Details
Motivation: Existing infrared physical attack methods rely on instance-specific online optimization and rigid pattern designs, leading to high deployment costs and insufficient physical robustness.
Method: Propose UPPA, which models perturbations with geometrically constrained, parameterized Bezier blocks and uses PSO to optimize them jointly over the global data distribution; for physical deployment, the digital perturbations are realized as physical cold patches with a continuous, smooth low-temperature distribution.
Result: UPPA achieves a high physical attack success rate with no online computational overhead, along with strong cross-domain generalization and reliable black-box transferability.
Conclusion: UPPA is presented as the first universal physical attack method for the infrared domain, combining topological stability, thermal-physical plausibility, and practical deployability, and provides a new way to assess the security of infrared detectors.
Abstract: Although infrared pedestrian detectors have been widely deployed in visual perception tasks, their vulnerability to physical adversarial attacks is becoming increasingly apparent. Existing physical attack methods predominantly rely on instance-specific online optimization and rigid pattern design, leading to high deployment costs and insufficient physical robustness. To address these limitations, this work proposes the Universal Physical Patch Attack (UPPA), the first universal physical attack method in the infrared domain. This method employs geometrically constrained parameterized Bezier blocks to model perturbations and utilizes the Particle Swarm Optimization (PSO) algorithm to perform unified optimization across the global data distribution, thus maintaining topological stability under dynamic deformations. In the physical deployment phase, we materialize the optimized digital perturbations into physical cold patches, achieving a continuous and smooth low-temperature distribution that naturally aligns with the thermal radiation characteristics of infrared imaging. Extensive experiments demonstrate that UPPA achieves an outstanding physical attack success rate without any online computational overhead, while also exhibiting strong cross-domain generalization and reliable black-box transferability.
[288] Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline
Elías Masquil,Thibaud Ehret,Pablo Musé,Gabriele Facciolo
Main category: cs.CV
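TL;DR: This paper integrates several modern learning-based stereo matchers (e.g., StereoAnywhere, MonSter) into the Satellite Stereo Pipeline (S2P) and adapts its rectification stage to match the disparity polarity and range of satellite imagery. This improves Digital Surface Model (DSM) accuracy and geometric detail, although challenging surfaces such as vegetation remain difficult.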
Details
Motivation: Learning-based stereo matchers perform well on standard benchmarks, but differences in viewing geometry and disparity assumptions make them hard to integrate directly into operational satellite pipelines such as S2P; adaptation and validation are needed.
Method: Integrate StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned MonSter into S2P, modifying the rectification stage to enforce compatible disparity polarity and range; release the adaptation code.
Result: On satellite imagery, all learning-based methods consistently outperform classical cost-volume-based approaches, with higher DSM accuracy, finer geometric detail, and sharper structures; however, common metrics such as MAE saturate, and performance over vegetation remains limited.
Conclusion: With adaptation, learning-based stereo matchers improve satellite DSM generation, but evaluation metrics that better reflect perceptual and structural fidelity are needed, and generalization to complex natural surfaces remains an open problem.
Abstract: Digital Surface Model generation from satellite imagery is a core task in Earth observation and is commonly addressed using classical stereoscopic matching algorithms in satellite pipelines as in the Satellite Stereo Pipeline (S2P). While recent learning-based stereo matchers achieve state-of-the-art performance on standard benchmarks, their integration into operational satellite pipelines remains challenging due to differences in viewing geometry and disparity assumptions. In this work, we integrate several modern learning-based stereo matchers, including StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned variant of MonSter, into the Satellite Stereo Pipeline, adapting the rectification stage to enforce compatible disparity polarity and range. We release the corresponding code to enable reproducible use of these methods in large-scale Earth observation workflows. Experiments on satellite imagery show consistent improvements over classical cost-volume-based approaches in terms of Digital Surface Model accuracy, although commonly used metrics such as mean absolute error exhibit saturation effects. Qualitative results reveal substantially improved geometric detail and sharper structures, highlighting the need for evaluation strategies that better reflect perceptual and structural fidelity. At the same time, performance over challenging surface types such as vegetation remains limited across all evaluated models, indicating open challenges for learning-based stereo in natural environments.
[289] Not All Layers Are Created Equal: Adaptive LoRA Ranks for Personalized Image Generation
Donald Shenaj,Federico Errica,Antonio Carta
Main category: cs.CV
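TL;DR: This paper proposes LoRA², which adaptively adjusts the rank of each layer's LoRA during fine-tuning to achieve a better trade-off between performance and memory consumption than fixed-rank LoRA.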
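The memory side of this trade-off follows from how LoRA is parameterized: an adapter of rank r on a d_out x d_in weight adds two factors, B (d_out x r) and A (r x d_in), for r * (d_in + d_out) trainable parameters. The sketch below compares a fixed rank with per-layer ranks; the layer names, shapes, and rank values are hypothetical and only illustrate the arithmetic, not the ranks LoRA² actually learns:

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_in + d_out)

# Hypothetical layer shapes (d_in, d_out), for illustration only.
layers = {
    "attn.to_q": (1024, 1024),
    "attn.to_v": (1024, 1024),
    "ff.proj": (1024, 4096),
}

fixed_ranks = {name: 64 for name in layers}                       # same rank everywhere
adaptive_ranks = {"attn.to_q": 8, "attn.to_v": 64, "ff.proj": 16}  # made-up per-layer ranks

def total_params(ranks):
    return sum(lora_params(*layers[name], rank) for name, rank in ranks.items())

print(total_params(fixed_ranks))     # 589824
print(total_params(adaptive_ranks))  # 229376
```

Since the parameter count is linear in each layer's rank, lowering the rank only where a layer does not need the capacity reduces memory without touching the layers that do.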
Details
Motivation: In current LoRA fine-tuning, the rank is chosen by community consensus without adapting to the complexity of the personalized subject, and exhaustively searching for the best rank combination is computationally prohibitive.
Method: Inspired by variational methods for learning network width, impose an ordering of importance on the rank positions of each layer's LoRA and let the rank of each layer adapt during fine-tuning.
Result: Across 29 subjects, LoRA² achieves a competitive trade-off among DINO, CLIP-I, and CLIP-T while requiring much less memory and a lower rank than high-rank LoRA.
Conclusion: Adaptive rank allocation is an effective way to improve LoRA's efficiency and quality, and LoRA² offers a better lightweight fine-tuning approach for personalized image generation.
Abstract: Low Rank Adaptation (LoRA) is the de facto fine-tuning strategy to generate personalized images from pre-trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community's consensus, regardless of the personalized subject's complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine-tuning on a subject. We achieve it by imposing an ordering of importance on the rank's positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA$^2$, achieves a competitive trade-off between DINO, CLIP-I, and CLIP-T across 29 subjects while requiring much less memory and lower rank than high rank LoRA versions. Code: https://github.com/donaldssh/NotAllLayersAreCreatedEqual.
[290] CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
Qingdong He,Chaoyi Wang,Peng Tang,Yifan Yang,Xiaobin Hu
Main category: cs.CV
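TL;DR: This paper proposes CLEAR, a mask-free, end-to-end video subtitle removal framework that combines two-stage self-supervised learning with LoRA fine-tuning for efficient, well-generalizing subtitle removal, significantly outperforming existing methods on multilingual zero-shot tasks.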
Details
Motivation: Existing diffusion-based methods require explicit mask sequences during both training and inference, which limits practical deployment; a mask-free, end-to-end approach with strong cross-lingual generalization is needed. Method: Proposes the CLEAR framework: Stage I learns disentangled subtitle representations with dual encoders under self-supervised orthogonality constraints; Stage II performs LoRA-based dynamic context adaptation driven by generation feedback; no mask input is required throughout. Result: On Chinese subtitle benchmarks, PSNR improves by 6.77 dB and VFID drops by 74.7%; the method generalizes zero-shot to six languages (English, Korean, French, Japanese, Russian, German) and trains only 0.77% of the base model's parameters. Conclusion: CLEAR achieves truly end-to-end, mask-free, lightweight, and cross-lingually robust video subtitle removal, with its generation feedback mechanism as the key innovation. Abstract: Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.
[291] SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
Linkuan Zhou,Yinghao Xia,Yufei Shen,Xiangyu Li,Wenjie Du,Cong Cong,Leyi Wei,Ran Su,Qiangguo Jin
Main category: cs.CV
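TL;DR: This paper proposes SHAPE, a framework that improves cross-modality medical image segmentation through structure-aware hierarchical unsupervised domain adaptation with plausibility evaluation.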
Details
Motivation: Existing unsupervised domain adaptation methods suffer from semantically unaware feature alignment and from pseudo-label validation that ignores global anatomical constraints, leading to poor distributional fidelity and anatomically implausible structures. Method: Builds the SHAPE framework on DINOv3, with a Hierarchical Feature Modulation (HFM) module that produces high-fidelity, class-aware features; Hypergraph Plausibility Estimation (HPE) that assesses global anatomical plausibility; and Structural Anomaly Pruning (SAP) that removes residual artifacts via cross-view stability. Result: Significantly outperforms prior methods on cardiac and abdominal cross-modality datasets, with Dice scores of 90.08% (MRI->CT) and 78.51% (CT->MRI) on cardiac data and 87.48% (MRI->CT) and 86.89% (CT->MRI) on abdominal data. Conclusion: By emphasizing global anatomical plausibility rather than distribution alignment alone, SHAPE offers a more reliable and clinically viable paradigm for medical image UDA. Abstract: Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI->CT) and 78.51% (CT->MRI) on cardiac data, and 87.48% (MRI->CT) and 86.89% (CT->MRI) on abdominal data. The code is available at https://github.com/BioMedIA-repo/SHAPE.
[292] A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing
Chedly Ben Azizi,Claire Guilloteau,Gilles Roussel,Matthieu Puigt
Main category: cs.CV
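TL;DR: This paper proposes a latent-representation-based framework for hyperspectral image (HSI) generation that learns a latent generative representation of HSI data, supports both spectrum-level and spatial-spectral emulation, and outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and spatial robustness.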
Details
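Motivation: Traditional radiative transfer models are computationally expensive and usually output spectra only, which falls short of the efficient, joint spatial-spectral emulation needed for large-scale simulation, algorithm development, and mission design. Method: Proposes a latent-representation-based hyperspectral emulation framework that can be trained in a direct one-step formulation or in a two-step strategy (VAE pretraining plus parameter-to-latent interpolation), enabling both spectrum-level and joint spatial-spectral emulation. Result: Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery show that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability; the emulated HSIs also preserve performance in downstream biophysical parameter retrieval. Conclusion: The proposed latent-representation framework generates spatially and spectrally consistent synthetic HSIs efficiently and with high fidelity, is of practical value for remote sensing, and provides reliable data for algorithm validation and system design. Abstract: Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.
[293] The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation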
Guannan Lai,Da-Wei Zhou,Zhenguo Li,Han-Jia Ye
Main category: cs.CV
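TL;DR: This paper proposes GOLD, which identifies and dynamically maintains a "golden subspace", the row space of the pretrained classifier, to achieve efficient and stable continual test-time adaptation (CTTA), substantially improving online inference efficiency without sacrificing performance.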
Details
Motivation: Existing continual test-time adaptation (CTTA) methods trade adaptation ability against inference efficiency: updating more parameters improves adaptation but lowers efficiency; the ideal is to update only the minimal necessary feature subspace (the "golden subspace"). Method: Proves theoretically that the golden subspace coincides with the row space of the pretrained classifier and proposes the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining; on this basis, designs the lightweight adapter GOLD, which projects features onto this subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Result: On classification and segmentation benchmarks, including autonomous-driving scenarios, GOLD outperforms existing CTTA methods in efficiency, stability, and overall performance. Conclusion: By precisely locating and efficiently maintaining the golden subspace, GOLD eases the inherent tension between efficiency and generalization in CTTA and offers a better solution for practical deployment. Abstract: Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency-generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance. Our code is available at https://github.com/AIGNLAI/GOLD.
[294] SatGeo-NeRF: Geometrically Regularized NeRF for Satellite Imagery
Valentin Wagner,Sebastian Bullinger,Michael Arens,Rainer Stiefelhagen
Main category: cs.CV
TL;DR: SatGeo-NeRF is a geometrically regularized NeRF for satellite imagery that introduces three model-agnostic regularizers—Gravity-Aligned Planarity, Granularity, and Depth-Supervised—to reduce geometric artifacts and improve altitude estimation accuracy.
Details
Motivation: Current state-of-the-art NeRF models for satellite imagery suffer from overfitting-induced geometric artifacts, leading to inaccurate 3D reconstructions. Method: Introduces three model-agnostic regularizers: (1) Gravity-Aligned Planarity Regularization aligns inferred surface normals with gravity to enforce local planarity and enable cross-ray gradient flow; (2) Granularity Regularization enforces coarse-to-fine geometry learning; (3) Depth-Supervised Regularization stabilizes early training using depth supervision. Result: On the DFC2019 benchmark, SatGeo-NeRF reduces Mean Altitude Error by 13.9% and 11.7% relative to EO-NeRF and EO-GS, respectively. Conclusion: Geometric regularization significantly improves NeRF’s geometric fidelity for satellite imagery reconstruction, offering a robust, model-agnostic framework for accurate altitude estimation. Abstract: We present SatGeo-NeRF, a geometrically regularized NeRF for satellite imagery that mitigates overfitting-induced geometric artifacts observed in current state-of-the-art models using three model-agnostic regularizers. Gravity-Aligned Planarity Regularization aligns depth-inferred, approximated surface normals with the gravity axis to promote local planarity, coupling adjacent rays via a corresponding surface approximation to facilitate cross-ray gradient flow. Granularity Regularization enforces a coarse-to-fine geometry-learning scheme, and Depth-Supervised Regularization stabilizes early training for improved geometric accuracy. On the DFC2019 satellite reconstruction benchmark, SatGeo-NeRF improves the Mean Altitude Error by 13.9% and 11.7% relative to state-of-the-art baselines such as EO-NeRF and EO-GS.
[295] Camera-Agnostic Pruning of 3D Gaussian Splats via Descriptor-Based Beta Evidence
Peter Fasogbon,Ugurcan Budak,Patrice Rondao Alface,Hamed Rezazadegan Tavakoli
Main category: cs.CV
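TL;DR: This paper proposes a camera-agnostic, one-shot, post-training pruning method for 3D Gaussian splats that assesses the reliability of each splat using attribute-derived neighbourhood descriptors and a Beta evidence model.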
Details
Motivation: Most existing pruning strategies depend on camera parameters, rendered images, or view-dependent measures, which limits them in emerging camera-agnostic exchange settings (for example, sharing splats directly as .ply point-based representations). Method: Proposes a hybrid descriptor framework that extracts structural and appearance consistency directly from the Gaussian splat representation; formulates pruning as a statistical evidence estimation problem and introduces a Beta evidence model that produces a probabilistic confidence score for each splat. Result: Experiments on ISO/IEC MPEG standard test sequences show that the method achieves substantial pruning while preserving reconstruction quality. Conclusion: The proposed method offers a practical, generalizable, camera-agnostic pruning alternative for 3D Gaussian splats. Abstract: The pruning of 3D Gaussian splats is essential for reducing their complexity to enable efficient storage, transmission, and downstream processing. However, most of the existing pruning strategies depend on camera parameters, rendered images, or view-dependent measures. This dependency becomes a hindrance in emerging camera-agnostic exchange settings, where splats are shared directly as point-based representations (e.g., .ply). In this paper, we propose a camera-agnostic, one-shot, post-training pruning method for 3D Gaussian splats that relies solely on attribute-derived neighbourhood descriptors. As our primary contribution, we introduce a hybrid descriptor framework that captures structural and appearance consistency directly from the splat representation. Building on these descriptors, we formulate pruning as a statistical evidence estimation problem and introduce a Beta evidence model that quantifies per-splat reliability through a probabilistic confidence score. Experiments conducted on standardized test sequences defined by the ISO/IEC MPEG Common Test Conditions (CTC) demonstrate that our approach achieves substantial pruning while preserving reconstruction quality, establishing a practical and generalizable alternative to existing camera-dependent pruning strategies.
[296] Chronological Contrastive Learning: Few-Shot Progression Assessment in Irreversible Diseases
Clemens Watzenböck,Daniel Aletaha,Michaël Deman,Thomas Deimel,Jana Eder,Ivana Janickova,Robert Janiczek,Peter Mandl,Philipp Seeböck,Gabriela Supp,Paul Weiser,Georg Langs
Main category: cs.CV
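TL;DR: This paper proposes ChronoCon, a self-supervised method that performs contrastive learning on the chronological order of a patient's longitudinal imaging exams, learning disease-severity-related representations without expert annotations and significantly improving label efficiency in rheumatoid arthritis radiograph assessment.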
Details
Motivation: Quantitative disease severity scoring is costly, time-consuming, and subject to inter-reader variability, while clinical archives hold far more longitudinal imaging data than expert-annotated scores; existing self-supervised methods do not fully exploit this temporal structure. Method: ChronoCon is a contrastive learning method that replaces label-based ranking losses with the visitation order of a patient's longitudinal scans; under the clinical assumption of monotonic progression in irreversible diseases, it learns disease-relevant representations from temporal order alone. Result: On rheumatoid arthritis radiograph assessment, ChronoCon significantly outperforms a fully supervised baseline initialized from ImageNet weights in low-label settings; fine-tuning on expert scores from only five patients reaches an intraclass correlation coefficient (ICC) of 86%. Conclusion: ChronoCon shows that temporal contrastive learning on routinely available imaging metadata, such as chronological order, can effectively reduce the reliance on manual annotation in irreversible diseases. Abstract: Quantitative disease severity scoring in medical imaging is costly, time-consuming, and subject to inter-reader variability. At the same time, clinical archives contain far more longitudinal imaging data than expert-annotated severity scores. Existing self-supervised methods typically ignore this chronological structure. We introduce ChronoCon, a contrastive learning approach that replaces label-based ranking losses with rankings derived solely from the visitation order of a patient's longitudinal scans. Under the clinically plausible assumption of monotonic progression in irreversible diseases, the method learns disease-relevant representations without using any expert labels. This generalizes the idea of Rank-N-Contrast from label distances to temporal ordering. Evaluated on rheumatoid arthritis radiographs for severity assessment, the learned representations substantially improve label efficiency. In low-label settings, ChronoCon significantly outperforms a fully supervised baseline initialized from ImageNet weights. In a few-shot learning experiment, fine-tuning ChronoCon on expert scores from only five patients yields an intraclass correlation coefficient of 86% for severity score prediction. These results demonstrate the potential of chronological contrastive learning to exploit routinely available imaging metadata to reduce annotation requirements in the irreversible disease domain. Code is available at https://github.com/cirmuw/ChronoCon.
[297] Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
Roy Amoyal,Oren Freifeld,Chaim Baskin
Main category: cs.CV
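TL;DR: This paper proposes Gaussian Splatting Alignment (GSA), a new method that aligns two independent 3D Gaussian Splatting models via a similarity transformation (rotation, translation, scale) without a true-scale prior, and in particular supports category-level alignment between different objects of the same category (e.g., different cars).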
Details
Motivation: Existing methods can only align 3DGS models of the same object and usually rely on a known true scale, whereas in practice different instances often need to be aligned and no robust, automatic category-level 3DGS alignment method exists. Method: GSA uses viewpoint-guided spherical map features to obtain robust correspondences and designs a two-step optimization framework: the first step is an iterative feature-guided absolute orientation solver (coarse registration) that is robust to large initialization errors; the second is a fine registration step enforcing multi-view feature consistency, inspired by inverse radiance-field formulations. The 3DGS models are kept fixed throughout. Result: On same-object alignment, GSA clearly outperforms prior methods even when they are given the true scale; on the harder task of aligning different objects of the same category, GSA achieves effective alignment for the first time and far surpasses the baselines. Conclusion: GSA is the first method to support category-level 3D Gaussian Splatting alignment, addressing key challenges such as unknown scale and poor initialization and broadening the potential of 3DGS for cross-instance modeling and applications. Abstract: We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. GSA leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns 3DGS models while keeping them fixed. First, we apply an iterative feature-guided absolute orientation solver as our coarse registration, which is robust to poor initialization (e.g., 180 degrees misalignment or a 10x scale gap). Next, we use a fine registration step that enforces multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Project webpage: https://bgu-cs-vil.github.io/GSA-project/
[298] MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation
Wenqing Tian,Hanyi Mao,Zhaocheng Liu,Lihua Zhang,Qiang Liu,Jian Wu,Liang Wang
Main category: cs.CV
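TL;DR: This paper proposes MultiBind, a benchmark for evaluating cross-subject attribute misbinding in multi-subject image generation, together with a dimension-wise confusion evaluation protocol for identifying and interpreting such failure modes.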
Details
Motivation: Existing benchmarks and metrics struggle to diagnose cross-subject attribute misbinding, a key failure mode in multi-subject image generation. Method: Builds MultiBind from real multi-person photographs, providing slot-ordered subject crops, masks, bounding boxes, canonicalized subject references, an inpainted background reference, and dense entity-indexed prompts; also proposes a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and uses specialist models to measure slot-to-slot similarity separately for identity, appearance, pose, and expression. Result: Experiments show that MultiBind reveals binding failures in modern multi-reference generators that conventional reconstruction metrics miss. Conclusion: MultiBind provides a finer-grained, interpretable evaluation framework for multi-subject image generation that effectively exposes and categorizes cross-subject interference. Abstract: Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.
[299] FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection
Zhilin Tu,Kemou Li,Fengpeng Li,Jianwei Fei,Jiamin Zhang,Haiwei Wu
Main category: cs.CV
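TL;DR: This paper proposes FeatDistill, a framework that combines feature distillation with a multi-expert ensemble to improve the robustness and generalization of AI-generated image detection in real-world settings.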
Details
Motivation: The rapid iteration and wide dissemination of deepfake technology seriously threaten information security, calling for AI-generated image detection that is robust and generalizes well, and in particular addresses three real-world bottlenecks: degradation interference, insufficient feature representation, and limited generalization. Method: Builds a four-backbone Vision Transformer multi-expert ensemble from CLIP and SigLIP variants; introduces comprehensive degradation modeling to broaden data coverage; adopts two-stage training, first optimizing a binary classification objective and then applying dense feature-level self-distillation to align representations; at inference, fuses the predicted probabilities of the four experts. Result: In the NTIRE challenge setting, FeatDistill shows strong robustness and generalization under diverse "in-the-wild" conditions while requiring only about 10 GB of peak GPU memory, balancing efficiency and practicality. Conclusion: FeatDistill offers an effective and practical solution for real-world deepfake image detection, markedly mitigating overfitting and improving semantic consistency and stability across generators and degradation types. Abstract: The rapid iteration and widespread dissemination of deepfake technology have posed severe challenges to information security, making robust and generalizable detection of AI-generated forged images increasingly important. In this paper, we propose FeatDistill, an AI-generated image detection framework that integrates feature distillation with a multi-expert ensemble, developed for the NTIRE Challenge on Robust AI-Generated Image Detection in the Wild. The framework explicitly targets three practical bottlenecks in real-world forensics: degradation interference, insufficient feature representation, and limited generalization. Concretely, we build a four-backbone Vision Transformer (ViT) ensemble composed of CLIP and SigLIP variants to capture complementary forensic cues. To improve data coverage, we expand the training set and introduce comprehensive degradation modeling, which exposes the detector to diverse quality variations and synthesis artifacts commonly encountered in unconstrained scenarios. We further adopt a two-stage training paradigm: the model is first optimized with a standard binary classification objective, then refined by dense feature-level self-distillation for representation alignment. This design effectively mitigates overfitting and enhances semantic consistency of learned features. At inference time, the final prediction is obtained by averaging the probabilities from four independently trained experts, yielding stable and reliable decisions across unseen generators and complex degradations. 
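The inference-time fusion described in the abstract (averaging the probabilities of four independently trained experts) can be illustrated with a minimal sketch. The function names, the example probabilities, and the 0.5 decision threshold are assumptions made for illustration; the abstract does not specify them, and this is not the authors' code.

```python
from statistics import fmean


def ensemble_probability(expert_probs):
    """Average per-expert probabilities that an image is AI-generated."""
    if not expert_probs:
        raise ValueError("need at least one expert probability")
    for p in expert_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return fmean(expert_probs)


def classify(expert_probs, threshold=0.5):
    """Return True when the averaged probability reaches the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return ensemble_probability(expert_probs) >= threshold


# Hypothetical outputs of four experts for one image.
probs = [0.9, 0.7, 0.4, 0.8]
score = ensemble_probability(probs)
label = classify(probs)
```

Averaging lets a single dissenting expert (0.4 here) be outvoted without any extra trainable fusion layer, which is one plausible reason for choosing it over learned fusion.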
Despite the ensemble design, the framework remains efficient, requiring only about 10 GB peak GPU memory. Extensive evaluations in the NTIRE challenge setting demonstrate that FeatDistill achieves strong robustness and generalization under diverse "in-the-wild" conditions, offering an effective and practical solution for real-world deepfake image detection.
[300] GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
Ayesh Abu Lehyeh,Xiaohan Zhang,Ahmad Arrabi,Waqas Sultani,Chen Chen,Safwan Wshah
Main category: cs.CV
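TL;DR: This paper proposes GeoFlow, a lightweight and efficient method for fine-grained cross-view geolocalization that uses a direct probabilistic mapping and an Iterative Refinement Sampling (IRS) algorithm to reach real-time performance of 29 FPS without sacrificing accuracy.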
Details
Motivation: Existing methods face a hard trade-off between high accuracy and real-time speed, while autonomous navigation in GPS-denied areas needs localization that is both accurate and fast. Method: Proposes the GeoFlow framework: (1) learns a direct probabilistic displacement mapping (distance and direction) from ground images to satellite images; (2) designs the Iterative Refinement Sampling (IRS) algorithm, which iteratively refines multiple initial hypotheses until they reach consensus. Result: Achieves state-of-the-art efficiency (29 FPS) on the KITTI and VIGOR datasets while maintaining competitive localization accuracy. Conclusion: GeoFlow breaks the accuracy-speed trade-off and opens a new path toward practical real-time geolocalization systems. Abstract: Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being too slow for real-time use. In this paper, we introduce GeoFlow, a new approach that offers a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively 'flow' from random starting points to a robust, converged consensus. Despite its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.
[301] Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
Youbin Kim,Jinho Park,Hogun Park,Eunbyung Park
Main category: cs.CV
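TL;DR: Group3D is a multi-view open-vocabulary 3D object detection framework that integrates semantic constraints directly into instance construction. Using a scene-adaptive vocabulary and semantic compatibility groups generated by a multimodal large language model, it merges 3D fragments based on both geometric consistency and semantic compatibility, mitigating the over-merging and fragmentation caused by geometry-dominated merging; it achieves state-of-the-art results on ScanNet and ARKitScenes with strong zero-shot generalization.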
Details
Motivation: Existing methods decouple geometric instance construction from semantic labeling and merge fragments on geometric consistency alone, which leads to irreversible association errors (such as over-merging or fragmentation) when views are incomplete or geometric evidence is insufficient. Method: Group3D builds a scene-adaptive open vocabulary with a multimodal large language model and organizes it into semantic compatibility groups that model cross-view category equivalence; when merging 3D fragments, it requires both geometric consistency and semantic compatibility, yielding a semantically gated merging mechanism. Result: Achieves state-of-the-art multi-view open-vocabulary 3D detection on ScanNet and ARKitScenes and shows strong generalization in zero-shot scenarios. Conclusion: Moving semantic constraints forward into instance construction improves the robustness and accuracy of open-vocabulary 3D detection, and Group3D demonstrates the key role of joint semantic-geometric modeling in multi-view 3D understanding. Abstract: Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. 
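The merge rule stated in the abstract (fragments are associated only when they satisfy both semantic compatibility and geometric consistency) can be sketched as follows. Everything concrete here is an assumption for illustration: the axis-aligned box representation, the IoU test with threshold 0.25 as the geometric criterion, the union-find clustering, and all function names. None of it is taken from the Group3D implementation.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def compatible(label_a, label_b, groups):
    """Labels are compatible if identical or listed in the same group."""
    return label_a == label_b or any(label_a in g and label_b in g for g in groups)


def merge_fragments(fragments, groups, iou_thresh=0.25):
    """Cluster (label, box) fragments; a pair merges only if both tests pass."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            (label_i, box_i), (label_j, box_j) = fragments[i], fragments[j]
            if compatible(label_i, label_j, groups) and box_iou_3d(box_i, box_j) >= iou_thresh:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(fragments)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Three heavily overlapping fragments seen from different views (hypothetical data).
fragments = [
    ("sofa", (0.0, 0.0, 0.0, 2.0, 1.0, 1.0)),
    ("couch", (0.5, 0.0, 0.0, 2.5, 1.0, 1.0)),
    ("table", (0.4, 0.0, 0.0, 2.4, 1.0, 1.0)),
]
gated = merge_fragments(fragments, [{"sofa", "couch"}, {"table", "desk"}])
ungated = merge_fragments(fragments, [{"sofa", "couch", "table"}])
```

In this toy scene the semantic gate keeps the table separate from the sofa/couch pair even though all three boxes overlap strongly, whereas treating every label as compatible collapses them into one instance, which is the over-merging failure the abstract describes.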
The project page is available at https://ubin108.github.io/Group3D/.
[302] Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Junhao Du,Jialong Xue,Anqi Li,Jincheng Dai,Guo Lu
Main category: cs.CV
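TL;DR: This paper proposes a unified spatiotemporal token compression strategy that models visual token compression as a spatiotemporal allocation task within a global retention pool. With a unified selection mechanism that combines attention weights and semantic similarity, it preserves 90.1% of Video-LLM performance at an extremely low retention ratio (about 2%) while cutting FLOPs to about 2.6%, and it is plug-and-play with no retraining.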
Details
Motivation: Existing token compression methods for Video-LLMs use a two-stage strategy that treats space and time separately; at extremely low retention ratios this tends to produce unbalanced spatiotemporal allocation and loss of key visual evidence, hurting question-answering performance. Method: Reformulates token compression as a global spatiotemporal allocation task; designs a unified token selection mechanism that combines attention weights and semantic similarity; merges unselected tokens by clustering and refills them to preserve information; and introduces query-aware, text-aware secondary compression inside the LLM. Result: Retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks while reducing FLOPs to about 2.6%, substantially lowering end-to-end inference latency and memory consumption and generalizing across diverse backbones. Conclusion: The proposed unified spatiotemporal token compression strategy achieves state-of-the-art video understanding at ultra-low token retention and serves as an efficient, general, training-free plug-and-play module. Abstract: Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. 
Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
[303] GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design
Xiaolei Zhou,Chuangjie Fang,Jie Wu,Jingyi Yang,Boyi Lin,Jianwei Zheng
Main category: cs.CV
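TL;DR: This paper proposes GeoFusion-CAD, an end-to-end diffusion-based method that uses hierarchical tree encoding and a C-Mamba block to model long-range structural dependencies, substantially improving the scalability, geometric fidelity, and topological consistency of long-sequence parametric CAD command generation, and releases a new benchmark, DeepCAD-240.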
Details
Motivation: Existing Transformer-based methods are limited by quadratic attention cost and context-window constraints when generating long CAD command sequences, and struggle with complex geometric and topological dependencies. Method: Proposes GeoFusion-CAD, which encodes CAD programs as hierarchical trees and jointly models geometry and topology within a state-space diffusion process; introduces a lightweight C-Mamba block whose selective state transitions capture long-range structural dependencies; and builds a new benchmark, DeepCAD-240 (sequence lengths 40 to 240), for long-sequence evaluation. Result: Outperforms Transformer baselines on both short and long command sequences with higher geometric fidelity and topological consistency; sets a new state of the art for long-sequence CAD generation; code and datasets are released. Conclusion: GeoFusion-CAD offers a scalable, structure-aware paradigm for parametric CAD generation and an important foundation for next-generation CAD modeling systems. Abstract: Parametric Computer-Aided Design (CAD) is fundamental to modern 3D modeling, yet existing methods struggle to generate long command sequences, especially under complex geometric and topological dependencies. Transformer-based architectures dominate CAD sequence generation due to their strong dependency modeling, but their quadratic attention cost and limited context windowing hinder scalability to long programs. We propose GeoFusion-CAD, an end-to-end diffusion framework for scalable and structure-aware generation. Our proposal encodes CAD programs as hierarchical trees, jointly capturing geometry and topology within a state-space diffusion process. Specifically, a lightweight C-Mamba block models long-range structural dependencies through selective state transitions, enabling coherent generation across extended command sequences. To support long-sequence evaluation, we introduce DeepCAD-240, an extended benchmark with sequence lengths ranging from 40 to 240 that preserves sketch-extrusion semantics from the ABC dataset. Extensive experiments demonstrate that GeoFusion-CAD achieves superior performance on both short and long command ranges, maintaining high geometric fidelity and topological consistency where Transformer-based models degrade. Our approach sets new state-of-the-art scores for long-sequence parametric CAD generation, establishing a scalable foundation for next-generation CAD modeling systems. Code and datasets are available at GitHub.
[304] Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
SII-GAIR,Sand.ai,Ethan Chern,Hansi Teng,Hanwen Sun,Hao Wang,Hong Pan,Hongyu Jia,Jiadi Su,Jin Li,Junjie Yu,Lijie Liu,Lingzhi Li,Lyumanshan Ye,Min Hu,Qiangang Wang,Quanwei Qi,Steffi Chern,Tao Bu,Taoran Wang,Teren Xu,Tianning Zhang,Tiantian Mi,Weixian Xu,Wenqiang Zhang,Wentai Zhang,Xianping Yi,Xiaojie Cai,Xiaoyang Kang,Yan Ma,Yixiu Liu,Yunbo Zhang,Yunpeng Huang,Yutong Lin,Zewei Tao,Zhaoliang Liu,Zheng Zhang,Zhiyao Cen,Zhixuan Yu,Zhongshu Wang,Zhulin Hu,Zijin Zhou,Zinan Guo,Yue Cao,Pengfei Liu
Main category: cs.CV
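TL;DR: daVinci-MagiHuman is an open-source audio-video joint generative foundation model that uses a single-stream Transformer to process text, video, and audio in a unified way. It specializes in high-quality, multilingual, synchronized human-centric generation and reaches leading levels in both efficiency and performance.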
Details
Motivation: Addresses the complex architectures, difficult optimization, unnatural human behavior modeling, limited multilingual support, and low inference efficiency of existing audio-video generation models, advancing open, efficient, high-fidelity human-centric generation. Method: Proposes a single-stream Transformer architecture that tokenizes text, video, and audio into one unified input sequence; combines model distillation, latent-space super-resolution, and a Turbo VAE decoder to accelerate inference; and supports multilingual speech generation. Result: In automatic evaluation, achieves the highest visual quality and text alignment and the lowest speech word error rate (14.60%); in pairwise human evaluation, achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3; a 5-second 256p video can be generated in 2 seconds on a single H100 GPU. Conclusion: daVinci-MagiHuman validates the effectiveness and scalability of single-stream unified modeling for audio-video generation and gives the open-source community a high-performance, easy-to-deploy, fully open human-centric generative foundation model. Abstract: We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. 
We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
[305] LRC-WeatherNet: LiDAR, RADAR, and Camera Fusion Network for Real-time Weather-type Classification in Autonomous Driving
Nour Alhuda Albashir,Lars Pernickel,Danial Hamoud,Idriss Gouigah,Eren Erdal Aksoy
Main category: cs.CV
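TL;DR: This paper proposes LRC-WeatherNet, a multi-sensor fusion framework that combines LiDAR, RADAR, and camera data for real-time weather classification, improving the perception and navigation of autonomous vehicles in adverse weather.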
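To make the idea of reliability-weighted (gated) fusion in this entry concrete, here is a minimal pure-Python sketch. It is an illustration only, not the LRC-WeatherNet implementation: the softmax gate, the function names, and the hand-picked gate scores are all assumptions. In a real network the gate scores would be predicted by learned layers from the modality features.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(features, gate_logits):
    """Fuse per-modality feature vectors with softmax gate weights.

    features:    dict modality -> list[float], all of the same length
    gate_logits: dict modality -> float (hypothetical reliability scores)
    Returns (weights per modality, fused feature vector).
    """
    names = sorted(features)
    weights = softmax([gate_logits[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for w, n in zip(weights, names):
        for i, v in enumerate(features[n]):
            fused[i] += w * v
    return dict(zip(names, weights)), fused

# Toy example: in fog the (assumed) gate trusts RADAR most.
feats = {"lidar": [1.0, 0.0], "radar": [0.0, 1.0], "camera": [1.0, 1.0]}
gates = {"lidar": 0.0, "radar": 2.0, "camera": -1.0}
weights, fused = gated_fusion(feats, gates)
```

Because the weights come from a softmax, they are positive and sum to one, so the fused vector is a convex combination of the modality features.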
Details
Motivation: Perception and navigation of autonomous vehicles degrade markedly in adverse weather such as rain, fog, and snow; each single sensor (LiDAR, RADAR, RGB camera) has different strengths and weaknesses across weather conditions and is easily disturbed, so a robust multimodal weather recognition method is needed.
Method: LRC-WeatherNet combines early fusion in a unified bird's-eye-view (BEV) representation with gated mid-level fusion of feature maps, dynamically weighting each modality by its reliability under the current weather.
Result: On the MSU-4S dataset covering nine weather types, LRC-WeatherNet significantly outperforms unimodal baselines in both classification accuracy and computational efficiency, and is the first to perform joint real-time weather classification with all three modalities.
Conclusion: LRC-WeatherNet provides a reliable, efficient, and deployable multi-sensor weather recognition solution for environment perception in adverse weather; the models and code are released.
Abstract: Autonomous vehicles face major perception and navigation challenges in adverse weather such as rain, fog, and snow, which degrade the performance of LiDAR, RADAR, and RGB camera sensors. While each sensor type offers unique strengths, such as RADAR robustness in poor visibility and LiDAR precision in clear conditions, they also suffer distinct limitations when exposed to environmental obstructions. This study proposes LRC-WeatherNet, a novel multi-sensor fusion framework that integrates LiDAR, RADAR, and camera data for real-time classification of weather conditions. By employing both early fusion using a unified Bird's Eye View representation and mid-level gated fusion of modality-specific feature maps, our approach adapts to the varying reliability of each sensor under changing weather. Evaluated on the extensive MSU-4S dataset covering nine weather types, LRC-WeatherNet achieves superior classification performance and computational efficiency, significantly outperforming unimodal baselines in adverse conditions. This work is the first to combine all three modalities for robust, real-time weather classification in autonomous driving. We release our trained models and source code at https://github.com/nouralhudaalbashir/LRC-WeatherNet.
[306] STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection
Jianlin Chen,Gongyang Li,Zhijiang Zhang,Liang Chang,Dan Zeng
Main category: cs.CV
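TL;DR: This paper proposes STENet, a superpixel token enhancing network for RGB-D salient object detection; its superpixel-driven cross-modal interaction modules lower computational complexity while strengthening both global and local feature representation.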
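The complexity argument in this entry (attending from N pixels to M superpixels instead of to all N pixels) can be sketched as follows. This is a toy illustration under stated assumptions, not the STENet modules: superpixel tokens are taken to be simple feature averages, the superpixel labels are given, and all function names are hypothetical.

```python
import math

def superpixel_tokens(pixel_feats, labels, num_superpixels):
    """Average the features of the pixels assigned to each superpixel."""
    dim = len(pixel_feats[0])
    sums = [[0.0] * dim for _ in range(num_superpixels)]
    counts = [0] * num_superpixels
    for feat, lab in zip(pixel_feats, labels):
        counts[lab] += 1
        for i, v in enumerate(feat):
            sums[lab][i] += v
    # Empty superpixels keep a zero token (division by 1 instead of 0).
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]

def pixel_to_superpixel_attention(pixel_feats, sp_tokens):
    """Each pixel (query) attends over M superpixel tokens (keys and values)."""
    dim = len(pixel_feats[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in pixel_feats:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in sp_tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        out.append([sum(w * k[i] for w, k in zip(attn, sp_tokens)) for i in range(dim)])
    return out

def score_matrix_sizes(num_pixels, num_superpixels):
    """Number of attention scores: pixel-to-pixel vs pixel-to-superpixel."""
    return num_pixels * num_pixels, num_pixels * num_superpixels

pixels = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
tokens = superpixel_tokens(pixels, labels=[0, 0, 1, 1], num_superpixels=2)
enhanced = pixel_to_superpixel_attention(pixels, tokens)
```

For a 64x64 feature map (N = 4096) and M = 64 superpixels, `score_matrix_sizes(4096, 64)` gives 16,777,216 pixel-to-pixel scores versus 262,144 pixel-to-superpixel scores, a factor of N/M = 64 fewer.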
Details
Motivation: Existing Transformer-based RGB-D salient object detection methods suffer from the quadratic complexity of attention and from limited extraction of local detail.
Method: The Superpixel Token Enhancing Network (STENet) contains two tailored superpixel-driven cross-modal interaction modules: a Superpixel Attention Global Enhancing Module, which models the global pixel-to-superpixel relationship, and a Superpixel Attention Local Refining Module, which selects and enhances local pixels within each superpixel. The superpixel generation method is updated to allow flexible conversion between pixels and superpixels, and global, local, and cross-scale features are fused.
Result: On seven RGB-D salient object detection datasets, STENet is competitive with state-of-the-art methods.
Conclusion: Using superpixels as an intermediate semantic unit effectively eases the computational bottleneck and weak local modeling of Transformers in RGB-D SOD, and offers a new direction for multimodal saliency detection.
Abstract: Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer's exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and the limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. Its cores are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing concerned local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at https://github.com/Mark9010/STENet.
[307] SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation
Duy D. Nguyen,Phat T. Tran-Truong
Main category: cs.CV
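TL;DR: This paper proposes SegMaFormer, a lightweight hybrid architecture that combines Mamba and Transformer modules for efficient long-range dependency modeling in 3D medical image segmentation, sharply reducing parameters and computation while keeping accuracy high.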
Details
Motivation: Existing Transformer-based models for 3D medical image segmentation have high computational complexity and large parameter counts, which makes them a poor fit for volumetric data and for medical settings where annotated data is scarce.
Method: SegMaFormer uses Mamba layers in the early, high-resolution encoder stages to cut computational cost while capturing spatial context, and self-attention in the later, low-resolution stages to refine features; generalized rotary position embeddings are added to improve spatial awareness.
Result: On the public Synapse, BraTS, and ACDC datasets, the model matches the Dice coefficient of much larger models while reducing parameters by up to 75x and substantially lowering FLOPs.
Conclusion: SegMaFormer is an efficient, high-performing solution for 3D medical image segmentation that balances accuracy and efficiency.
Abstract: The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which is particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that synergizes Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.
[308] 6D Robotic OCT Scanning of Curved Tissue Surfaces
Suresh Guttikonda,Maximilian Neidhardt,Vidas Raudonis,Alexander Schlaefer
Main category: cs.CV
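TL;DR: This paper proposes a marker for six-degree-of-freedom hand-eye calibration of a robot-mounted OCT probe, overcoming the limits of translational scanning and image registration when scanning large curved tissue surfaces, and shows that the calibration is highly repeatable and free of accumulated error.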
Details
Motivation: Handheld OCT scanning relies on registering overlapping images, while robotic scanning is usually restricted to translations to avoid a complicated hand-eye calibration; neither approach copes well with large scans of curved tissue surfaces.
Method: A dedicated marker is designed and used for full six-dimensional (6-DOF) hand-eye calibration between the robot and the OCT probe; robotic scanning then builds on this calibration without relying on image registration.
Result: The calibration is highly repeatable; robotic scans of two curved phantom surfaces show that the method enables consistent, large-area scanning of curved surfaces without the path errors that registration accumulates.
Conclusion: The proposed six-dimensional calibration overcomes the limits of conventional methods on curved surfaces and improves the imaging capability and reliability of robotic OCT systems for complex anatomy.
Abstract: Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, both stitching by registration and translational scanning are limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot-mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach does not rely on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.
[309] Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models
Purui Bai,Junxian Duan,Pin Wang,Jinhua Hao,Ming Sun,Chao Zhou,Huaibo Huang
Main category: cs.CV
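TL;DR: This paper proposes ResFlow-Tuner, built on the FLUX.1-dev flow matching model, which combines unified multi-modal fusion (UMMF) with training-free test-time scaling (TTS) to significantly improve real-world image restoration.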
Details
Motivation: Efficiently leveraging ultra-large-scale pre-trained text-to-image models such as FLUX.1-dev, and fully exploiting their potential for real-world image restoration (Real-IR), remains challenging.
Method: The ResFlow-Tuner framework (1) uses unified multi-modal fusion (UMMF) to encode multi-modal conditions into a single sequence that guides high-quality image synthesis in the MM-DiT architecture, and (2) introduces training-free test-time scaling (TTS), which dynamically steers the denoising direction using feedback from a reward model (RM).
Result: The method reaches state-of-the-art performance on multiple standard benchmarks, validates the strength of flow matching models on low-level vision tasks, and offers a new, efficient inference-time scaling paradigm for large models.
Conclusion: ResFlow-Tuner bridges the gap between large pre-trained models and real-world image restoration; its UMMF and TTS designs give low-level vision a scalable, controllable inference paradigm that needs no additional training.
Abstract: Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
[310] GTSR: Subsurface Scattering Awared 3D Gaussians for Translucent Surface Reconstruction
Youwen Yuan,Xi Zhao
Main category: cs.CV
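TL;DR: This paper proposes GTSR, a new 3D Gaussian Splatting (3DGS) pipeline for efficiently reconstructing the surface geometry of translucent objects; it models the object with surface and interior Gaussians and adds Fresnel blending and Disney BSDF constraints, significantly improving reconstruction quality and real-time rendering performance.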
Details
Motivation: Existing 3DGS-based methods reconstruct opaque objects well but ignore the optical properties of translucent objects and so cannot model them accurately, while methods based on differentiable path tracing or neural implicit fields are computationally expensive.
Method: The GTSR pipeline introduces surface Gaussians and interior Gaussians to model the geometric surface and the scattering color respectively, blends the two with a Fresnel term to render translucent appearance, and combines the Disney BSDF model with deferred rendering to strengthen normal and depth constraints.
Result: GTSR outperforms baselines on the NeuralTO Syn dataset with strong real-time rendering performance; the dataset is extended with new translucent objects of varied materials, on which the method is shown to generalize.
Conclusion: GTSR is an efficient, robust, and generalizable method for 3D reconstruction of translucent objects that balances geometric accuracy with physically plausible rendering.
Abstract: Reconstructing translucent objects from multi-view images is a difficult problem. Previously, researchers have used differentiable path tracing and neural implicit fields, which require relatively large computational costs. Recently, many works have achieved good reconstruction results for opaque objects based on a 3DGS pipeline with much higher efficiency. However, such methods have difficulty dealing with translucent objects, because they do not consider the optical properties of translucent objects. In this paper, we propose a novel 3DGS-based pipeline (GTSR) to reconstruct the surface geometry of translucent objects. GTSR combines two sets of Gaussians, surface and interior Gaussians, which are used to model the surface and scattering color when light passes through translucent objects. To render the appearance of translucent objects, we introduce a method that uses the Fresnel term to blend two sets of Gaussians. Furthermore, to improve the reconstructed details of non-contour areas, we introduce the Disney BSDF model with deferred rendering to enhance constraints of the normal and depth. Experimental results demonstrate that our method outperforms baseline reconstruction methods on the NeuralTO Syn dataset while showing great real-time rendering performance. We also extend the dataset with new translucent objects of varying material properties and demonstrate our method can adapt to different translucent materials.
[311] DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation
Binhong Tan,Zhaoxin Wang,Handing Wang
Main category: cs.CV
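TL;DR: This paper proposes DTVI, a dual-stage inference-time defense framework that improves the safety of text-to-image diffusion models through category-aware sequence-level intervention followed by further attenuation during visual generation.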
Details
Motivation: Existing inference-time defenses apply category-agnostic, token-level intervention in the text embedding space; they miss malicious semantics spread across the full token sequence and remain vulnerable to adversarial prompts.
Method: DTVI is a dual-stage inference-time defense. The first stage applies category-aware sequence-level intervention to the full prompt embedding to better capture distributed malicious semantics; the second stage further attenuates the remaining unsafe influence during visual generation.
Result: Experiments on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show effective and robust defense, with an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories, while keeping reasonable generation quality on benign prompts.
Conclusion: By combining sequence-level intervention with visual-stage attenuation, DTVI markedly improves the safety and robustness of T2I diffusion models while preserving generation quality.
Abstract: Text-to-Image (T2I) diffusion models have demonstrated strong generation ability, but their potential to generate unsafe content raises significant safety concerns. Existing inference-time defense methods typically perform category-agnostic token-level intervention in the text embedding space, which fails to capture malicious semantics distributed across the full token sequence and remains vulnerable to adversarial prompts. In this paper, we propose DTVI, a dual-stage inference-time defense framework for safe T2I generation. Unlike existing methods that intervene on specific token embeddings, our method introduces category-aware sequence-level intervention on the full prompt embedding to better capture distributed malicious semantics, and further attenuates the remaining unsafe influences during the visual generation stage. Experimental results on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show that our method achieves effective and robust defense while preserving reasonable generation quality on benign prompts, obtaining an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories.
[312] Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Hayeon Kim,Ji Ha Jang,Junghun James Kim,Se Young Chun
Main category: cs.CV
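TL;DR: This paper proposes UNCHA, which models part-to-whole semantic representativeness with hyperbolic uncertainty and incorporates it into the contrastive learning objective, improving how hyperbolic vision-language models understand complex multi-object scenes.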
Details
Motivation: Vision-language models with Euclidean embeddings struggle to express hierarchical relations (such as part-whole and parent-child structures) and multi-object compositional scenes; hyperbolic VLMs improve on this but do not account for the fact that parts differ in how representative they are of the whole.
Method: UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) uses hyperbolic uncertainty to quantify how representative each part is of the whole, and jointly optimizes an uncertainty-weighted contrastive loss with an entropy-regularized entailment loss.
Result: UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks.
Conclusion: By modeling part-to-whole semantic representativeness with hyperbolic uncertainty, UNCHA learns more accurate part-whole ordering, better captures the compositional structure of images, and improves understanding of complex multi-object scenes.
Abstract: While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.
[313] FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
Wuyang Luo,Chengkai Tan,Chang Ge,Binye Hong,Su Yang,Yongjiu Ma
Main category: cs.CV
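TL;DR: This paper proposes FontCrafter, an element-driven framework for high-fidelity artistic font generation with fine-grained control; it combines in-context generation, a Context-aware Mask Adapter (CMA), attention redirection, and edge repainting, and shows strong structure and texture preservation and flexible style mixing in zero-shot generation.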
Details
Motivation: Existing artistic font generation methods offer limited style diversity and coarse control, and struggle to reconstruct both the structure and the texture of reference elements with high fidelity.
Method: The paper proposes the element-driven FontCrafter framework and builds the large-scale ElementFont dataset. An inpainting-based in-context generation strategy treats element images as visual context for pixel-level style transfer; a lightweight Context-aware Mask Adapter (CMA) injects glyph shape information; a training-free attention redirection mechanism provides region-aware style control and suppresses stroke hallucination; edge repainting makes boundaries look more natural.
Result: FontCrafter clearly outperforms existing methods in zero-shot generation, especially in structural and textural fidelity, and supports fine-grained controls such as flexible style mixing.
Conclusion: An element-driven paradigm with cooperating modules improves the quality and controllability of artistic font generation and opens a new direction for fine-grained font style editing.
Abstract: Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.
[314] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
Byungwoo Jeon,Dongyoung Kim,Huiwon Jang,Insoo Kim,Jinwoo Shin
Main category: cs.CV
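TL;DR: This paper proposes SpatialBoost, a framework that converts dense 3D spatial information from 2D images into linguistic descriptions and uses a large language model (LLM) to inject it into pre-trained vision encoders, improving their 3D spatial awareness; it yields clear gains on benchmarks such as ADE20K (for example, +3.8 mIoU points).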
Details
Motivation: Large-scale pre-trained vision encoders are trained mainly on 2D images and lack the ability to model 3D spatial relationships between objects and backgrounds in the real world, which limits their performance on many downstream tasks.
Method: SpatialBoost first converts the dense 3D spatial information implicit in 2D images into linguistic descriptions; then, with the help of a large language model (LLM) and a multi-turn Chain-of-Thought (CoT) reasoning process, it injects this spatial knowledge into pre-trained vision encoders such as DINOv3, building hierarchical spatial understanding.
Result: On ADE20K, SpatialBoost raises DINOv3 from 55.9 to 59.7 mIoU (a gain of 3.8 points), reaching state-of-the-art performance; its effectiveness is also validated on several benchmarks that require 3D perception and general vision abilities.
Conclusion: SpatialBoost is a scalable, lightweight method that does not require retraining the vision encoder from scratch; it effectively strengthens 3D spatial awareness and offers a new way to bridge 2D visual representation and 3D semantic understanding.
Abstract: Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with a gain of 3.8 mIoU points over the pre-trained DINOv3.
[315] Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
Xingyu Zhu,Liang Yi,Shuo Wang,Wenbo Zhu,Yonglinag Wu,Beier Zhu,Hanwang Zhang
Main category: cs.CV
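TL;DR: This paper proposes BayesMM, a multimodal Bayesian distribution learning framework for test-time point cloud analysis; it models textual priors and streaming visual features as Gaussian distributions and fuses the two modalities with Bayesian model averaging, enabling training-free continual adaptation and markedly better robustness under domain shift on multiple point cloud benchmarks.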
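The two ingredients named in this entry, per-class Gaussians updated online and Bayesian model averaging across modalities, can be illustrated with a one-dimensional toy. This is a sketch under loud assumptions and not the BayesMM implementation: features are scalars instead of embedding vectors, the class and modality names are invented, and the model weights use the textbook rule p(m | x) proportional to p(x | m) p(m).

```python
import math

class RunningGaussian:
    """Welford-style online estimate of a 1-D mean and sample variance."""
    def __init__(self):
        self.mean, self.var, self.count, self._m2 = 0.0, 1.0, 0, 0.0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        if self.count > 1:  # variance stays at its default until 2 samples
            self.var = self._m2 / (self.count - 1)

def gaussian_pdf(x, mean, var):
    var = max(var, 1e-6)  # guard against degenerate variances
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayesian_model_average(x, models, class_prior, model_prior):
    """models: dict modality -> dict class -> (mean, var)."""
    weighted_evidence, posteriors = {}, {}
    for name, per_class in models.items():
        joint = {c: gaussian_pdf(x, m, v) * class_prior[c]
                 for c, (m, v) in per_class.items()}
        evidence = max(sum(joint.values()), 1e-300)   # p(x | modality)
        weighted_evidence[name] = evidence * model_prior[name]
        posteriors[name] = {c: j / evidence for c, j in joint.items()}
    total = sum(weighted_evidence.values())
    weights = {n: e / total for n, e in weighted_evidence.items()}
    fused = {c: sum(weights[n] * posteriors[n][c] for n in models)
             for c in class_prior}
    return weights, fused

# Visual statistics are updated online as (toy) test samples arrive.
visual = {"chair": RunningGaussian(), "table": RunningGaussian()}
for x in (0.9, 1.1, 1.0):
    visual["chair"].update(x)
for x in (2.9, 3.1, 3.0):
    visual["table"].update(x)

models = {
    "text": {"chair": (0.0, 1.0), "table": (2.0, 1.0)},   # fixed text priors
    "visual": {c: (g.mean, g.var) for c, g in visual.items()},
}
weights, fused = bayesian_model_average(
    1.0, models,
    class_prior={"chair": 0.5, "table": 0.5},
    model_prior={"text": 0.5, "visual": 0.5},
)
```

In this toy the text prior is ambiguous at x = 1.0 (both classes are equally likely under it), while the visual Gaussians have adapted to the stream, so the visual model receives the larger weight and the fused prediction favors "chair".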
Details
Motivation: Existing cache-based test-time adaptation (TTA) methods, when handling a continually evolving test stream, store only limited history, lose information progressively, and fuse predictions heuristically, which makes adaptation unstable.
Method: BayesMM models each class's textual prior (from semantic prompts) and streaming point cloud visual features as Gaussian distributions; visual parameters are updated online as new samples arrive; the two modalities are fused by Bayesian model averaging, which automatically adjusts their contributions according to posterior evidence.
Result: Experiments on multiple point cloud benchmarks show that BayesMM stays robust under distribution shift, with an average improvement of over 4%.
Conclusion: BayesMM provides a training-free, continually adapting multimodal test-time learning paradigm that mitigates the information decay and unstable fusion of conventional cache-based TTA.
Abstract: Multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.
[316] P-Flow: Prompting Visual Effects Generation
Rui Zhao,Mike Zheng Shou
Main category: cs.CV
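TL;DR: This paper proposes P-Flow, a training-free framework that uses test-time prompt optimization, driven by the semantic and temporal reasoning of vision-language models, to customize dynamic visual effects (such as explosions and shattering) in video generation through text.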
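The test-time prompt optimization loop described in this entry can be sketched as a generic generate, compare, revise cycle. Everything below is hypothetical scaffolding rather than the P-Flow code: `generate` stands in for a video generator, `critique` stands in for a vision-language model that scores the gap to the reference and proposes a revised prompt, and the toy stand-ins operate on keyword sets instead of videos.

```python
from typing import Any, Callable, List, Tuple

def optimize_prompt(
    initial_prompt: str,
    reference: Any,
    generate: Callable[[str], Any],
    critique: Callable[[Any, Any, str], Tuple[float, str]],
    max_iters: int = 5,
    tol: float = 0.05,
) -> Tuple[str, List[float]]:
    """Iteratively refine a prompt until the output matches the reference.

    critique(reference, output, prompt) returns (gap, revised_prompt),
    where gap is a discrepancy score in [0, 1] (an assumed convention).
    """
    prompt, history = initial_prompt, []
    for _ in range(max_iters):
        output = generate(prompt)
        gap, revised = critique(reference, output, prompt)
        history.append(gap)
        if gap <= tol:          # effect reproduced closely enough
            break
        prompt = revised        # the underlying model is never modified
    return prompt, history

# Toy stand-ins: a "video" is just the set of effect keywords it shows.
REFERENCE_EFFECTS = {"shatters", "slow motion", "debris"}

def toy_generate(prompt: str) -> set:
    return {e for e in REFERENCE_EFFECTS if e in prompt}

def toy_critique(reference: set, output: set, prompt: str) -> Tuple[float, str]:
    missing = sorted(reference - output)
    gap = len(missing) / len(reference)
    revised = prompt + ", " + missing[0] if missing else prompt
    return gap, revised

final_prompt, gaps = optimize_prompt(
    "a glass vase", REFERENCE_EFFECTS, toy_generate, toy_critique
)
```

The point of the structure is that only the prompt changes between iterations; the generator and the critic are treated as black boxes, which is what makes the scheme training-free.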
Details
Motivation: Video generation models follow text prompts fairly well, but there is still no effective way to customize dynamic visual effects such as objects shattering or exploding; hand-writing a prompt that precisely describes such complex temporal phenomena is slow and difficult.
Method: The training-free P-Flow framework uses a vision-language model to optimize prompts at test time: based on the difference in dynamic effects between the reference video and the generated video, it iteratively adjusts the text prompt so that the target effect is reproduced better in new scenes.
Result: Experiments show that P-Flow achieves high-fidelity, diverse customization of dynamic visual effects on both text-to-video and image-to-video tasks and outperforms the baselines.
Conclusion: P-Flow offers an efficient, flexible paradigm for text-driven customization of dynamic visual effects without fine-tuning the model, and highlights the potential of combining test-time prompt optimization with multimodal reasoning.
Abstract: Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at https://github.com/showlab/P-Flow.
[317] Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models
Xingyu Zhu,Beier Zhu,Shuo Wang,Junfeng Fang,Kesen Zhao,Hanwang Zhang,Xiangnan He
Main category: cs.CV
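TL;DR: This paper proposes NullSteer, an activation defense framework based on null-space projection; it constructs refusal directions through a linear transformation and strengthens refusal of visual jailbreak attacks without degrading performance on benign inputs.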
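The key property claimed in this entry is that the steering perturbation is exactly zero for activations inside the benign subspace. A minimal pure-Python sketch of that property follows. It is an illustration under assumptions, not the NullSteer method: the benign subspace is assumed to be given by a few example vectors, the steering map `W` is a hand-written matrix rather than a learned one, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def null_space_component(h, benign_basis):
    """Remove from h everything that lies in the benign subspace."""
    out = list(h)
    for b in benign_basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

def steering_delta(h, benign_basis, W):
    """Linear perturbation W @ P @ h, where P projects away the benign subspace."""
    residual = null_space_component(h, benign_basis)
    return [dot(row, residual) for row in W]

# Toy 3-D activations: the benign subspace is the plane spanned by e1 and e2.
benign_examples = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(benign_examples)
W = [[0.0, 0.0, 0.5], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical steering map

benign_delta = steering_delta([2.0, 3.0, 0.0], basis, W)    # inside the subspace
harmful_delta = steering_delta([1.0, 0.0, 2.0], basis, W)   # has an outside component
```

Because the projection annihilates every vector in the benign span, `benign_delta` is the zero vector regardless of `W`, while an activation with a component outside that span is perturbed. How the benign subspace and the steering map are actually estimated is not specified by this sketch.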
Details
Motivation: Existing activation steering methods strengthen refusal behavior but tend to cause over-refusal and lack theoretical interpretability, so they struggle to balance safety and utility.
Method: NullSteer constructs refusal directions in the model's activation space; a linear transformation confines the perturbation to harmful directions while keeping zero perturbation in the benign subspace, giving a theoretically grounded balance between safety enhancement and capability preservation.
Result: On MiniGPT-4 the average attack success rate (ASR) drops by more than 15 percent, while performance on general benchmarks stays comparable to the original model.
Conclusion: NullSteer provides a theoretical guarantee that improved safety does not harm the model's base capabilities, and significantly improves the robustness and trustworthiness of VLMs in open-world settings.
Abstract: As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model's general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
[318] FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario
Hang Dai,Hongwei Fan,Han Zhang,Duojin Wu,Jiyao Zhang,Hao Dong
Main category: cs.CV
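TL;DR: This paper proposes FreeArtGS, a new method for reconstructing articulated objects in free-moving scenarios; taking only a monocular RGB-D video as input, it combines free-moving part segmentation, joint estimation, and end-to-end optimization based on 3D Gaussian Splatting (3DGS) to deliver scalable and practical articulated object reconstruction.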
Details
Motivation: Existing articulated object reconstruction settings, based on discrete articulation states or casual monocular videos, require non-trivial axis alignment or suffer from insufficient coverage, and cannot meet the demand from augmented reality and robotics for highly scalable reconstruction.
Method: FreeArtGS combines free-moving part segmentation (using priors from off-the-shelf point-tracking and feature models), joint estimation (unified object-to-camera pose calibration and recovery of joint type and axis), and end-to-end optimization based on 3D Gaussian Splatting, with a monocular RGB-D video as the only input.
Result: Experiments on two benchmarks and on real-world free-moving articulated objects show that FreeArtGS performs best in the free-moving scenario and remains highly competitive in previous reconstruction settings.
Conclusion: FreeArtGS is a practical, efficient, and highly scalable method for 3D reconstruction of articulated objects, suitable for generating assets from real scenes.
Abstract: The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under a free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: https://freeartgs.github.io/
[319] StreamingClaw Technical Report
Jiawei Chen,Zhe Chen,Chaoqun Du,Maokui He,Wei He,Hengtao Li,Qizhen Li,Zide Liu,Hao Ma,Xuhao Pan,Chang Ren,Xudong Rao,Xintian Shen,Chenfeng Wang,Tao Wei,Chengjun Yu,Pengfei Yu,Shengyu Yao,Chunpeng Zhou,Kun Zhan,Lihao Zheng,Pan Zhou,Xuhan Zhu,Yufei Zheng
Main category: cs.CV
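TL;DR: This paper proposes StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence that supports real-time reasoning, proactive interaction, multimodal long-term memory, a perception-decision-action closed loop, and OpenClaw compatibility.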
Details
Motivation: Current agents have fragmented capabilities for streaming video understanding: they may support only offline processing, lack long-term multimodal memory, or struggle with real-time reasoning and proactive interaction, which limits sustained perception, real-time decision-making, and action execution in real environments.
Method: The StreamingClaw framework integrates five core capabilities: real-time streaming reasoning; reasoning about future events and proactive interaction as interaction objectives evolve; multimodal long-term storage, hierarchical evolution, and efficient retrieval of memory shared across agents; a perception-decision-action closed loop (with streaming tools and action-centric skills); and OpenClaw compatibility.
Result: StreamingClaw unifies online real-time reasoning, multimodal long-term memory, and proactive interaction, and turns decisions into executable actions that directly control the physical world, supporting practical deployment of embodied interaction.
Conclusion: StreamingClaw alleviates key bottlenecks of current streaming video understanding and embodied agents, and provides a unified framework for building sustained, real-time, proactive, and deployable embodied systems.
Abstract: Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck preventing them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a closed-loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
[320] Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding
Yunzhuo Sun,Xinyue Liu,Yanyang Li,Nanding Wu,Yifang Xu,Linlin Zong,Xianchao Zhang,Wenxin Liang
Main category: cs.CV
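TL;DR: This paper proposes a two-stage framework that uses LLM-guided subtitle matching and text-generated videos as temporal priors, and a multimodal controlled Mamba network for efficient temporal grounding, significantly improving the accuracy and efficiency of video moment retrieval in long sequences.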
Details
Motivation: Existing text-driven video moment retrieval methods fail to capture the hidden temporal dynamics of untrimmed videos; they rely on natural language queries or static image augmentation, overlook motion sequences, and incur high computational cost from Transformer architectures; they also fail to fuse subtitle context with temporal priors effectively.
Method: In the first stage, LLM-guided subtitle matching is combined with the query to generate auxiliary short videos that model motion information as temporal priors. In the second stage, a multimodal controlled Mamba network with a video-guided gating mechanism efficiently fuses the generated priors with long video sequences and filters noise.
Result: The framework significantly outperforms state-of-the-art methods on the TVR benchmark, raising recall for long-sequence grounding while reducing computational overhead.
Conclusion: The two-stage framework is model-agnostic and broadly applicable to multimodal video moment retrieval, and effectively addresses weak temporal-dynamics modeling and low computational efficiency.
Abstract: Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches also fail to integrate subtitle contexts and generated temporal priors effectively. We therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
[321] Biophysics-Enhanced Neural Representations for Patient-Specific Respiratory Motion Modeling
Jan Boysen,Hristina Uzunova,Heinz Handels,Jan Ehrhardt
Main category: cs.CV
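TL;DR: This paper proposes PRISM-RM, a respiratory motion model based on physics-regularized implicit neural representations (INRs) for respiratory motion compensation in radiotherapy. The model is trajectory-aware, spatio-temporally continuous, diffeomorphic, and physiologically plausible. It improves extrapolation over the authors' initial INR-based approach, although it still underperforms sequential registration-based methods in extrapolation.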
Details
Motivation: Respiratory motion introduces dose delivery uncertainty in radiotherapy of the lung and upper abdomen and must be modeled to target the dose accurately; existing methods depend on a fixed reference state or extrapolate poorly, so more robust and generalizable motion models are needed. Method: PRISM-RM is a physics-regularized INR model that dispenses with a fixed reference breathing state, provides a trajectory-aware, spatio-temporally continuous, and diffeomorphic motion representation, and adds biophysical constraints to keep the estimated motion physiologically plausible. Result: In interpolation, performance is on par with the initial INR-based approach and with sequential registration; in extrapolation, PRISM-RM improves on the initial INR-based approach but still trails registration-based methods. Conclusion: The authors argue that the methodological properties of INRs, together with their steadily improving performance, give them strong potential for respiratory motion modeling. Abstract: A precise spatial delivery of the radiation dose is crucial for the treatment success in radiotherapy. In the lung and upper abdominal region, respiratory motion introduces significant treatment uncertainties, requiring special motion management techniques. To address this, respiratory motion models are commonly used to infer the patient-specific respiratory motion and target the dose more efficiently. In this work, we investigate the possibility of using implicit neural representations (INR) for surrogate-based motion modeling. Therefore, we propose physics-regularized implicit surrogate-based modeling for respiratory motion (PRISM-RM). Our new integrated respiratory motion model is free of a fixed reference breathing state. Unlike conventional pairwise registration techniques, our approach provides a trajectory-aware spatio-temporally continuous and diffeomorphic motion representation, improving generalization to extrapolation scenarios. We introduce biophysical constraints, ensuring physiologically plausible motion estimation across time beyond the training data. Our results show that our trajectory-aware approach performs on par in interpolation and improves the extrapolation ability compared to our initially proposed INR-based approach. Compared to sequential registration-based approaches, both our approaches perform equally well in interpolation, but underperform in extrapolation scenarios. However, the methodical features of INRs make them particularly effective for respiratory motion modeling, and with their performance steadily improving, they demonstrate strong potential for advancing this field.
[322] DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
Xin Cai,Zhiyuan You,Zhoutong Zhang,Tianfan Xue
Main category: cs.CV
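TL;DR: This paper proposes Detail-Aligned VAE (DA-VAE), which raises the compression ratio of a pretrained VAE with only lightweight adaptation of the pretrained diffusion backbone while preserving the original latent structure. With Stable Diffusion 3.5 it generates 1024×1024 images from only 32×32 tokens (4× fewer than the original model) and reaches a 6× speedup at 2048×2048 while preserving image quality.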
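The token arithmetic behind these figures can be checked with a short sketch. It assumes square token grids; the helper names are hypothetical, and the 64×64 baseline grid is inferred from the abstract's statement that 32×32 tokens is 4× fewer than the original model rather than quoted from the paper.

```python
import math


def token_count(grid_side: int) -> int:
    """Tokens in a square grid_side x grid_side latent grid."""
    return grid_side * grid_side


def pixels_per_token_side(image_side: int, grid_side: int) -> int:
    """Side length, in pixels, of the image region covered by one token."""
    return image_side // grid_side


# Figures stated in the abstract: 1024x1024 images, 32x32 tokens, 4x fewer tokens.
da_vae_tokens = token_count(32)              # 32 * 32 = 1024 tokens
baseline_tokens = 4 * da_vae_tokens          # inferred baseline: 4096 tokens
baseline_side = math.isqrt(baseline_tokens)  # inferred baseline grid side: 64

# Self-attention compares every token with every other token, so the number
# of token pairs shrinks with the square of the token reduction.
pair_reduction = (baseline_tokens ** 2) // (da_vae_tokens ** 2)

print("DA-VAE tokens:", da_vae_tokens)
print("Baseline tokens:", baseline_tokens, "grid side:", baseline_side)
print("Pixels per token side:", pixels_per_token_side(1024, 32),
      "vs", pixels_per_token_side(1024, baseline_side))
print("Attention pair reduction:", pair_reduction)
```

The pair count is only a rough proxy for attention cost; the end-to-end speedup reported in the abstract also depends on layers and overheads that this sketch does not model.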
Details
Motivation: At high resolution, diffusion models process many tokens and are computationally expensive. Simply raising the VAE compression ratio degrades the structure of the latent space and makes diffusion training harder, whereas pretrained diffusion models already have a well-structured latent space that can be exploited. Method: DA-VAE uses an explicit latent layout in which the first C channels come directly from the pretrained VAE at a base resolution and D additional channels encode higher-resolution detail. A detail-alignment mechanism preserves the structure of the original latent space, and a warm-start fine-tuning strategy is used. Result: With Stable Diffusion 3.5, 1024×1024 images are generated from only 32×32 tokens (4× fewer than the original model) after 5 H100-days of adaptation; the method also enables 2048×2048 generation with a 6× speedup while preserving image quality, and it is validated quantitatively on ImageNet. Conclusion: Without costly diffusion retraining, lightweight adaptation of a pretrained VAE and diffusion backbone can expand the latent dimensionality while preserving its structure, offering a practical and scalable route to high-resolution diffusion generation. Abstract: Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffusion retraining. Pretrained diffusion models, however, already exhibit a structured, lower-dimensional latent space; thus, a simpler idea is to expand the latent dimensionality while preserving this structure. We therefore propose Detail-Aligned VAE (DA-VAE), which increases the compression ratio of a pretrained VAE with only lightweight adaptation of the pretrained diffusion backbone. DA-VAE uses an explicit latent layout: the first $C$ channels come directly from the pretrained VAE at a base resolution, while an additional $D$ channels encode higher-resolution details. A simple detail-alignment mechanism encourages the expanded latent space to retain the structure of the original one. With a warm-start fine-tuning strategy, our method enables $1024 \times 1024$ image generation with Stable Diffusion 3.5 using only $32 \times 32$ tokens, $4\times$ fewer than the original model, within 5 H100-days. It further unlocks $2048 \times 2048$ generation with SD3.5, achieving a $6\times$ speedup while preserving image quality. We also validate the method and its design choices quantitatively on ImageNet.
[323] OpenEarth-Agent: From Tool Calling to Tool Creation for Open-Environment Earth Observation
Sijie Zhao,Feng Liu,Xueliang Zhang,Hao Chen,Xinyu Gu,Zhe Jiang,Fenghua Ling,Ben Fei,Wenlong Zhang,Junjue Wang,Weihao Xuan,Pengfeng Xiao,Naoto Yokoya,Lei Bai
Main category: cs.CV
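TL;DR: This paper presents OpenEarth-Agent, the first tool-creation agent framework for open-environment Earth observation (EO). Through adaptive workflow planning and dynamic tool generation, it moves beyond the traditional paradigm of calling predefined tools. The authors also build OpenEarth-Bench, a benchmark of 596 real-world full-pipeline cases, to test generalization and robustness across multi-domain EO tasks.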
Details
Motivation: Existing remote sensing agents are confined to closed environments and predefined tools, and they struggle with the multi-source heterogeneous data and diverse tasks of open environments. Method: OpenEarth-Agent combines adaptive workflow planning with dynamic tool creation and integrates multi-stage tools with cross-domain knowledge bases. The companion benchmark, OpenEarth-Bench, provides only essential pre-trained model tools so that it measures how well agents adapt in open environments. Result: OpenEarth-Agent performs well on both OpenEarth-Bench and Earth-Bench. With only 6 essential pre-trained models it matches tool-calling agents that rely on 104 specialized tools, and it significantly outperforms them when given the complete toolset; in several cases the created tools are more robust to data anomalies than human-engineered ones. Conclusion: OpenEarth-Agent demonstrates the effectiveness of the tool-creation paradigm for open-environment EO tasks and offers a path toward general-purpose remote sensing agents. Abstract: Earth Observation (EO) is essential for perceiving dynamic land surface changes, yet deploying autonomous EO in open environments is hindered by the immense diversity of multi-source data and heterogeneous tasks. While remote sensing agents have emerged to streamline EO workflows, existing tool-calling agents are confined to closed environments. They rely on pre-defined tools and are restricted to a narrow scope, limiting their generalization to the diverse data and tasks. To overcome these limitations, we introduce OpenEarth-Agent, the first tool-creation agent framework tailored for open-environment EO. Rather than calling predefined tools, OpenEarth-Agent employs adaptive workflow planning and tool creation to generalize to unseen data and tasks. This adaptability is bolstered by an open-ended integration of multi-stage tools and cross-domain knowledge bases, enabling robust execution in the entire EO pipeline across multiple application domains. To comprehensively evaluate EO agents in open environments, we propose OpenEarth-Bench, a novel benchmark comprising 596 real-world, full-pipeline cases across seven application domains, explicitly designed to assess agents' adaptive planning and tool creation capabilities. Only essential pre-trained model tools are provided in this benchmark, devoid of any other predefined task-specific tools. Extensive experiments demonstrate that OpenEarth-Agent successfully masters full-pipeline EO across multiple domains in the open environment. Notably, on the cross-benchmark Earth-Bench, our tool-creating agent equipped with 6 essential pre-trained models achieves performance comparable to tool-calling agents relying on 104 specialized tools, and significantly outperforms them when provided with the complete toolset. In several cases, the created tools exhibit superior robustness to data anomalies compared to human-engineered counterparts.
[324] Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
Kejia Liu,Haoyang Zhou,Ruoyu Xu,Peicheng Wang,Mingli Song,Haofei Zhang
Main category: cs.CV
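TL;DR: This paper proposes Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts a UAV's absolute location and heading while remaining accurate, lightweight, and robust, and it introduces the multi-city benchmark Bearing-UAV-90k.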
Details
Motivation: Existing cross-view geo-localization methods mainly match UAV images to map tiles, which forces a trade-off between accuracy and storage overhead and ignores heading; they also model cross-view discrepancies and varying overlap insufficiently, which limits generalization. Method: Bearing-UAV jointly predicts the UAV's absolute location and heading, uses global and local structural features, and explicitly encodes relative spatial relationships, making it more robust to viewpoint differences, misalignment, and sparse features. The authors also build the multi-city benchmark Bearing-UAV-90k. Result: Across diverse terrains, Bearing-UAV yields lower localization error than previous matching and retrieval methods; code and dataset will be released. Conclusion: Bearing-UAV delivers accurate, lightweight, and robust cross-view navigation in the wild and offers a new approach to vision-based UAV navigation in GNSS-denied environments. Abstract: Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV's heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show encouraging results: Bearing-UAV yields lower localization error than the previous matching/retrieval paradigm across diverse terrains. Our code and dataset will be made publicly available.
[325] ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints
Kaili Huang,Hongming Zhang,Rui Shen,Linjun Dai,Jiahao Wang,Hanming Deng,Lewei Lu
Main category: cs.CV
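TL;DR: This paper proposes ACPO to address the Visual Anchor Collapse that DPO causes in multimodal alignment. By asymmetrically scaling the reward of the rejected response, it keeps the gradient on the chosen response stable, which reduces hallucinations and improves performance on multimodal benchmarks.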
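A minimal sketch of the asymmetry, assuming the standard DPO objective -log σ(r_chosen - r_rejected) with implicit rewards r = β(log π - log π_ref). The constant `alpha` is a stand-in for ACPO's complexity-aware coefficient, whose actual form is not given in the abstract, and all function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * (log pi - log pi_ref)."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard DPO loss on one preference pair."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def asymmetric_loss(r_chosen: float, r_rejected: float, alpha: float) -> float:
    """Illustrative variant: only the rejected reward is scaled by alpha."""
    return -math.log(sigmoid(r_chosen - alpha * r_rejected))


def reward_grads(r_chosen: float, r_rejected: float, alpha: float = 1.0):
    """Gradients of the loss w.r.t. (r_chosen, r_rejected).

    With alpha = 1 (plain DPO) the two gradients have equal magnitude;
    with alpha < 1 the rejected-side gradient is damped by a factor of alpha.
    """
    weight = 1.0 - sigmoid(r_chosen - alpha * r_rejected)
    return -weight, alpha * weight


# Hypothetical log-probabilities for one preference pair.
r_w = implicit_reward(logp_policy=-12.0, logp_ref=-13.0)
r_l = implicit_reward(logp_policy=-15.0, logp_ref=-14.0)

print("DPO grads:       ", reward_grads(r_w, r_l, alpha=1.0))
print("Asymmetric grads:", reward_grads(r_w, r_l, alpha=0.5))
```

Setting `alpha` to 1 recovers plain DPO, which makes it easy to see that the only change in this sketch is the damping of the rejected term.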
Details
Motivation: When aligning large vision-language models, DPO suffers from likelihood displacement, which in particular causes "Visual Anchor Collapse": the model abandons visual evidence in favor of strong language priors, producing severe hallucinations. Method: Asymmetric Constrained Preference Optimization (ACPO) applies dynamic, target-oriented scaling. A complexity-aware scaling coefficient is applied only to the rejected reward, asymmetrically suppressing gradient flow on the rejected term while keeping the chosen distribution gradient-stable. Result: Experiments on InternVL models show that ACPO reverses the chosen-reward degradation seen with standard DPO and generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general multimodal leaderboards (MMBench, MMStar, OCRBenchV2), while also improving general capabilities. Conclusion: ACPO is a modality-agnostic alignment mechanism; breaking gradient symmetry matters especially for multimodal tasks because it mitigates the suppression of visual tokens by language priors. Abstract: While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
[326] Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Junrong Guo,Shancheng Fang,Yadong Qu,Hongtao Xie
Main category: cs.CV
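TL;DR: This paper proposes VFLM, a framework that iteratively refines layout generation using visual feedback. A visual reward model driven by OCR accuracy improves readability and aesthetics, and the method outperforms existing approaches.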
Details
Motivation: Existing code-based layout generation with multimodal LLMs cannot perceive the rendered visual result, so readability and aesthetics are hard to guarantee; a visual feedback mechanism is needed. Method: The Visual Feedback Layout Model (VFLM) is a self-improving framework that combines reinforcement learning with a visually grounded reward model incorporating OCR accuracy, enabling adaptive, reflective, iterative generation based on visual information. Result: Across multiple benchmarks, VFLM consistently outperforms advanced multimodal LLMs, existing layout models, and code-only baselines. Conclusion: Visual feedback is a key ingredient for design-oriented multimodal LLMs, and VFLM demonstrates its effectiveness. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
[327] A Backbone Benchmarking Study on Self-supervised Learning as a Auxiliary Task with Texture-based Local Descriptors for Face Analysis
Shukesh Reddy,Abhijit Das
Main category: cs.CV
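TL;DR: This paper studies how different backbones affect local texture feature modeling and efficient face analysis when self-supervised learning (SSL) is used as an auxiliary task, combining a Masked Auto-Encoder (MAE) with the local pattern SSAT (L-SSAT) framework. Experiments show that backbone performance depends strongly on the downstream task and that no single backbone suits all face analysis tasks.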
Details
Motivation: To make local texture feature modeling for face analysis more robust and discriminative, and to understand how backbone choice matters when SSL serves as an auxiliary task. Method: Backbones of different depths are embedded in the L-SSAT framework, with MAE as the self-supervised auxiliary task that reconstructs texture features alongside the primary task; their effect on face attribute prediction, emotion classification, and deepfake detection is evaluated systematically on multiple datasets. Result: Backbone performance depends strongly on the downstream task, with average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet; no single backbone performs consistently well on all tasks. Conclusion: There is no universal backbone for face analysis; the backbone should be chosen for the specific downstream task. Abstract: In this work, we benchmark with different backbones and study their impact for self-supervised learning (SSL) as an auxiliary task to blend texture-based local descriptors into feature modelling for efficient face analysis. It is established in previous work that combining a primary task and a self-supervised auxiliary task enables more robust and discriminative representation learning. We employed different shallow to deep backbones for the SSL task of Masked Auto-Encoder (MAE) as an auxiliary objective to reconstruct texture features such as local patterns alongside the primary task in local pattern SSAT (L-SSAT), ensuring robust and unbiased face analysis. To expand the benchmark, we conducted a comprehensive comparative analysis across multiple model configurations within the proposed framework. To this end, we address three research questions: "What is the role of the backbone in performance L-SSAT?", "What type of backbone is effective for different face analysis tasks?", and "Is there any generalized backbone for effective face analysis with L-SSAT?". Towards answering these questions, we provide a detailed study and experiments. The performance evaluation demonstrates that the backbone for the proposed method is highly dependent on the downstream task, achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet. For consistency of feature representation quality and generalisation capability across various face analysis paradigms, including face attribute prediction, emotion classification, and deepfake detection, there is no unified backbone.
[328] PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation
Mingju Gao,Kaisen Yang,Huan-ang Gao,Bohan Li,Ao Ding,Wenyi Li,Yangcheng Yu,Jinkun Liu,Shaocong Xu,Yike Niu,Haohan Chi,Hao Chen,Hao Tang,Li Yi,Hao Zhao
Main category: cs.CV
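TL;DR: This paper proposes PAM (Pose-Appearance-Motion), an engine that jointly models the pose, appearance, and motion of hand-object interaction (HOI) for controllable HOI video generation. It outperforms existing methods on DexYCB and OAKINK2, and its synthetic data proves effective as augmentation for a downstream hand pose estimation task.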
Details
Motivation: Existing HOI generation research is split across three incompatible tracks (pose prediction, single-image generation, and video generation) and lacks a framework that models pose, appearance, and motion together, which hinders real-world deployment. Method: PAM combines multiple conditioning inputs (depth, segmentation, and keypoints) to generate high-resolution HOI videos with a controllable generation architecture that supports sim-to-real use. Result: On DexYCB, PAM reaches an FVD of 29.13 (vs. 38.83 for InterDyn) and an MPJPE of 19.37 mm (vs. 30.05 mm for CosHand); on OAKINK2, FVD improves from 68.76 to 46.31; combining all input conditions gives the best results; with the synthetic data added, a hand pose estimation model trained on 50% of the real data matches the baseline trained on 100%. Conclusion: PAM provides a unified, controllable framework for HOI video generation, supporting practical use of HOI in embodied AI and AR/VR. Abstract: Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
[329] Mixture of Mini Experts: Overcoming the Linear Layer Bottleneck in Multiple Instance Learning
Daniel Shao,Joel Runevic,Richard J. Chen,Drew F. K. Williamson,Ahrong Kim,Andrew H. Song,Faisal Mahmood
Main category: cs.CV
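TL;DR: This paper proposes MAMMOTH, a module that uses a low-rank, multi-head mixture of experts to turn patch features in MIL frameworks into task-specific features. It improves a wide range of MIL models on pathology image classification while remaining parameter-efficient.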
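To see why a low-rank mixture can stay parameter-efficient relative to a single dense layer, the following sketch counts parameters for both. The layout (a per-expert factorization without bias plus a linear router) and all dimensions are illustrative assumptions, not the MAMMOTH architecture, which the abstract does not specify.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameters of one dense linear layer: weights plus bias."""
    return d_in * d_out + d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of a rank-`rank` factorization W ~ A @ B (no bias)."""
    return d_in * rank + rank * d_out


def low_rank_moe_params(d_in: int, d_out: int, rank: int, num_experts: int) -> int:
    """Hypothetical mixture: `num_experts` low-rank experts plus a linear router."""
    experts = num_experts * low_rank_params(d_in, d_out, rank)
    router = d_in * num_experts
    return experts + router


# Hypothetical sizes, chosen only for illustration.
d_in, d_out, rank, num_experts = 1024, 512, 16, 8

dense = dense_params(d_in, d_out)
moe = low_rank_moe_params(d_in, d_out, rank, num_experts)

print("dense layer params:  ", dense)
print("low-rank MoE params: ", moe)
print("ratio (MoE / dense): ", round(moe / dense, 3))
```

Raising `rank` or `num_experts` moves the count toward or past the dense layer, so the budget is a design choice rather than a guaranteed saving.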
Details
Motivation: Existing MIL work has overlooked the linear layer that maps general-purpose patch features to task-specific features; the authors hypothesize that this layer is an overlooked performance bottleneck. Method: MAMMOTH is a parameter-efficient, multi-head mixture-of-experts module that applies a low-rank transformation tailored to each patch's phenotype and can be plugged into any MIL model. Result: Across 8 MIL methods and 19 classification tasks, MAMMOTH improves 130 of the 152 configurations, with an average performance change of +3.8%; even simple aggregation such as max or mean pooling, when equipped with MAMMOTH, attains higher average performance than any method using the standard linear layer. Conclusion: Task-specific transformation of patch features matters more than the choice of aggregation method; MAMMOTH is a general, lightweight, and effective module for improving MIL performance. Abstract: Multiple Instance Learning (MIL) is the predominant framework for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch's phenotype, yielding synergistic effects with any of the existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multi-head mixture of experts module designed to improve the performance of any MIL model with minimal alterations to the total number of parameters. Across eight MIL methods and 19 different classification tasks, we find that such task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Overall, MAMMOTH improves performance in 130 of the 152 examined configurations, with an average $+3.8\%$ change in performance. Code is available at https://github.com/mahmoodlab/mammoth.
[330] Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Meiqi Wu,Zhixin Cai,Fufangchen Zhao,Xiaokun Feng,Rujing Dang,Bingze Song,Ruitian Tian,Jiashu Zhu,Jiachen Lei,Hao Dou,Jing Tang,Lei Sun,Jiahong Wu,Xiangxiang Chu,Zeming Liu,Kaiqi Huang
Main category: cs.CV
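TL;DR: This paper presents Omni-WorldBench, a comprehensive benchmark for the interactive response capability of 4D world models, a dimension the authors argue no existing benchmark evaluates systematically. It comprises Omni-WorldSuite, a prompt suite spanning multiple interaction levels, and Omni-Metrics, an agent-based causal evaluation framework, and it exposes key weaknesses of current models in spatio-temporal interaction modeling.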
Details
Motivation: Existing benchmarks for the two families of world models, video generation and 3D reconstruction, are limited either to visual fidelity and text-video alignment or to static geometric metrics, and they neglect temporal dynamics and interaction causality. The authors argue that world modeling should move toward 4D generation that jointly models spatial structure and temporal evolution, whose core capability is interactive response: how actions drive state transitions across space and time. Method: Omni-WorldBench has two components: (1) Omni-WorldSuite, a systematic prompt suite covering diverse interaction levels and scene types; and (2) Omni-Metrics, an agent-based evaluation framework that quantifies the causal impact of interaction actions on final outcomes and on intermediate state evolution trajectories. Result: A large-scale evaluation of 18 representative world models reveals critical limitations in interactive response. Conclusion: Omni-WorldBench fills a gap in interaction-centric evaluation of 4D world models and offers a standardized tool and empirical evidence for future work; the benchmark will be publicly released. Abstract: Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
[331] SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Sashuai Zhou,Qiang Zhou,Junpeng Ma,Yue Cao,Ruofan Hu,Ziang Zhang,Xiaoda Yang,Zhibin Wang,Jun Song,Cheng Yu,Bo Zheng,Zhou Zhao
Main category: cs.CV
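TL;DR: This paper proposes SpatialReward, a verifiable reward model that evaluates spatial layout in order to improve spatial consistency in text-to-image generation. It uses a multi-stage pipeline of a Prompt Decomposer, expert detectors, and a vision-language model, and is evaluated with the new benchmark SpatRelBench. It improves the spatial consistency of models such as Stable Diffusion and FLUX and brings results closer to human judgments.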
Details
Motivation: Existing reward models for reinforcement learning in text-to-image generation pay little attention to fine-grained spatial relationships, so generated images can look plausible overall while placing objects inaccurately. Method: SpatialReward is a multi-stage reward model: a Prompt Decomposer extracts entities, attributes, and spatial metadata from the prompt; expert detectors ground object positions and attributes precisely; and a vision-language model applies chain-of-thought reasoning over the grounded observations to assess complex spatial relations. The authors also build SpatRelBench, a benchmark for spatial relationships. Result: Integrating SpatialReward into RL training on Stable Diffusion and FLUX consistently improves spatial consistency and overall generation quality, with results that align more closely with human judgments. Conclusion: Verifiable reward models can support more accurate and controllable optimization of text-to-image generation. Abstract: Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
[332] Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre
Alex Salvatierra,José Antonio Sanz,Christian Gutiérrez,Mikel Galar
Main category: cs.CV
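TL;DR: On a large-scale aerial LiDAR dataset acquired under real flight conditions in Navarre, Spain, this paper systematically benchmarks four mainstream point cloud semantic segmentation models (KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3), evaluating segmentation of five classes, including ground, vegetation, buildings, and vehicles, across heterogeneous urban, rural, and industrial scenes.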
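The benchmark reports overall accuracy (OA) and mean IoU (mIoU). As a reminder of how these standard metrics are derived from a confusion matrix, here is a small sketch; the 3-class matrix is made-up illustrative data rather than results from the paper, and the function names are hypothetical.

```python
def overall_accuracy(cm):
    """Fraction of correctly labeled points; cm[true][pred] holds counts."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_iou(cm):
    """IoU per class: TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(cm):
    """Unweighted mean of per-class IoU, skipping classes with no points."""
    valid = [v for v in per_class_iou(cm) if v == v]  # drop NaN entries
    return sum(valid) / len(valid)


# Made-up counts: two frequent classes and one rare class (rows are ground truth).
cm = [
    [90, 5, 5],
    [10, 80, 10],
    [0, 2, 8],
]

print("OA:   ", round(overall_accuracy(cm), 4))
print("IoU:  ", [round(v, 4) for v in per_class_iou(cm)])
print("mIoU: ", round(mean_iou(cm), 4))
```

Because mIoU weights every class equally, a rare class that is segmented poorly pulls it well below OA, which is why the two numbers can diverge under class imbalance.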
Details
Motivation: Most 3D semantic segmentation models target indoor or terrestrial data, and their behavior under real aerial acquisition conditions is insufficiently studied. Prior work differs widely in dataset design, acquisition conditions, and model selection, which makes conclusions hard to compare and calls for a unified benchmark. Method: On a large-scale aerial LiDAR dataset of Navarre acquired under operational flight conditions, the authors compare four state-of-the-art models (KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3) on five semantic classes common in airborne surveys, examining segmentation accuracy (mIoU, overall accuracy) and robustness to class imbalance and geometric variability. Result: All models exceed 93% overall accuracy. KPConv achieves the highest mean IoU (78.51%) with balanced performance across classes; Point Transformer V3 performs best on the underrepresented vehicle class (75.11% IoU); Superpoint Transformer and RandLA-Net trade segmentation robustness for computational efficiency. Conclusion: KPConv offers the best overall performance for aerial LiDAR semantic segmentation and suits applications that need both accuracy and class balance; Point Transformer V3 has the edge on small objects such as vehicles; model choice should weigh the target classes and the available compute. Abstract: Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.
[333] Riverine Land Cover Mapping through Semantic Segmentation of Multispectral Point Clouds
Sopitta Thurachen,Josef Taher,Matti Lehtomäki,Leena Matikainen,Linnea Blåfield,Mikel Calle Navarro,Antero Kukko,Tomi Westerlund,Harri Kaartinen
Main category: cs.CV
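TL;DR: This paper applies Point Transformer v2 (PTv2) to semantic segmentation of multispectral LiDAR point clouds for accurate land cover classification in riverine environments. On data from the Oulanka river in Finland it reaches an mIoU of 0.950, and multi-dataset training is shown to improve generalization.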
Details
Motivation: Accurate land cover mapping in riverine environments is essential for river management, ecological understanding, and monitoring of geomorphic change, but existing methods are limited in accuracy and generalization in complex natural scenes. Method: Point Transformer v2 is trained on the geometric and spectral features of multispectral LiDAR point clouds to segment six land cover classes (sand, gravel, low vegetation, high vegetation, forest floor, and water). Beyond the main Oulanka river dataset, sparsely annotated data from a second river is added for multi-dataset training to improve generalization. Result: The full-feature configuration reaches an mIoU of 0.950, significantly outperforming the geometry-only baseline; ablations show that intensity and reflectance features are the most important; multi-dataset training improves generalization to new environments. Conclusion: Transformer architectures are well suited to riverine land cover classification from multispectral point clouds. The approach offers new capabilities for river management applications such as sediment transport monitoring and shows promise even when high-quality annotated data is limited. Abstract: Accurate land cover mapping in riverine environments is essential for effective river management, ecological understanding, and geomorphic change monitoring. This study explores the use of Point Transformer v2 (PTv2), an advanced deep neural network architecture designed for point cloud data, for land cover mapping through semantic segmentation of multispectral LiDAR data in real-world riverine environments. We utilize the geometric and spectral information from the 3-channel LiDAR point cloud to map land cover classes, including sand, gravel, low vegetation, high vegetation, forest floor, and water. The PTv2 model was trained and evaluated on point cloud data from the Oulanka river in northern Finland using both geometry and spectral features. To improve the model's generalization in new riverine environments, we additionally investigate multi-dataset training that adds sparsely annotated data from an additional river dataset. Results demonstrated that using the full-feature configuration resulted in performance with a mean Intersection over Union (mIoU) of 0.950, significantly outperforming the geometry baseline. Other ablation studies revealed that intensity and reflectance features were the key for accurate land cover mapping. The multi-dataset training experiment showed improved generalization performance, suggesting potential for developing more robust models despite limited high-quality annotated data. Our work demonstrates the potential of applying transformer-based architectures to multispectral point clouds in riverine environments. The approach offers new capabilities for monitoring sediment transport and other river management applications.
[334] EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild
Jeffri Murrugarra-Llerena,Pranav Chitale,Zicheng Liu,Kai Ao,Yujin Ham,Guha Balakrishnan,Paola Cascante-Bonilla
Main category: cs.CV
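TL;DR: This paper introduces EgoGroups, a first-person-view dataset for social group detection that covers 65 countries and a wide range of environments and cultural contexts. It is used to evaluate the group detection ability of VLMs/LLMs, including in a zero-shot setting, and of supervised models, and it reveals how crowd density and cultural region affect model performance.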
Details
Motivation: Existing social group detection benchmarks have low scene diversity and rely on third-person cameras such as surveillance footage, so they cannot assess how groups form and evolve in different cultural contexts and unconstrained real-world settings. Method: The authors build EgoGroups, a large-scale first-person-view dataset spanning 65 countries, four weather/time-of-day conditions, and low, medium, and high crowd densities, with dense annotations of persons and social groups plus rich geographic and scene metadata. They then evaluate state-of-the-art VLMs/LLMs and supervised models on group detection. Result: VLMs and LLMs can outperform supervised baselines in a zero-shot setting, and crowd density and cultural region clearly influence model performance. Conclusion: EgoGroups fills a data gap for social group detection in real, culturally diverse, first-person scenes, provides a new benchmark for evaluating and improving the social intelligence of AI systems, and exposes the limits of current models across cultures and complex environments. Abstract: Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for persons and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. We report several interesting findings: VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.
[335] GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning
Yixuan Luo,Feng Qiao,Zhexiao Xiong,Yanjing Li,Nathan Jacobs
Main category: cs.CV
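TL;DR: This paper proposes GenOpticalFlow, a framework that synthesizes large-scale optical flow training data without human annotation. A pre-trained depth estimation network produces pseudo optical flow, which conditions a next-frame generation model to synthesize high-quality, pixel-aligned frame-flow pairs. An inconsistent pixel filtering strategy improves fine-tuning, and results on benchmarks such as KITTI and Sintel match or exceed existing unsupervised and semi-supervised methods.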
Details
Motivation: Supervised optical flow estimation is limited by expensive ground-truth annotations; unsupervised and semi-supervised methods rely on brightness constancy and smoothness assumptions, which leads to inaccurate motion estimation in complex scenes. Method: Uses a pre-trained depth estimation network to generate pseudo optical flow as a conditioning input that drives a next-frame generation model to synthesize high-fidelity, pixel-aligned subsequent frames, thereby building large-scale synthetic frame-flow pairs; also proposes an inconsistent pixel filtering strategy that removes unreliable pixels from generated frames. Result: On KITTI2012, KITTI2015, and Sintel, GenOpticalFlow matches or outperforms existing methods in the unsupervised and semi-supervised settings. Conclusion: The framework offers a scalable, annotation-free alternative for optical flow learning, using high-quality synthetic data to compensate for the lack of real annotations and significantly improving unsupervised training. Abstract: Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.[336] DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
Zhengyao Lv,Menghan Xia,Xintao Wang,Kwan-Yee K. Wong
Main category: cs.CV
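TL;DR: This paper proposes DUO-VSR, a three-stage framework built on dual-stream distillation that combines distribution matching with adversarial supervision to achieve stable and efficient one-step video super-resolution.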
Details
Motivation: Diffusion-based video super-resolution methods achieve high fidelity but have prohibitive sampling costs, and directly applying distribution matching distillation (DMD) leads to unstable training and insufficient supervision. Method: Proposes the three-stage DUO-VSR framework: 1) Progressive Guided Distillation Initialization to stabilize training; 2) Dual-Stream Distillation that jointly optimizes DMD and RFS-GAN (which provides adversarial supervision from discriminative features of the real and fake score models); 3) a Preference-Guided Refinement stage that aligns the student with perceptual quality preferences. Result: Significantly outperforms previous one-step VSR methods on multiple benchmarks, improving both visual quality and inference efficiency. Conclusion: By unifying distribution matching and adversarial supervision and introducing a multi-stage cooperative distillation strategy, DUO-VSR effectively addresses unstable training and insufficient supervision in diffusion-based VSR, achieving high-quality, efficient one-step generation. Abstract: Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.[337] Repurposing Geometric Foundation Models for Multi-view Diffusion
Wooseok Jang,Seonghu Jeon,Jisang Han,Jinhyeok Choi,Minkyung Kwon,Seungryong Kim,Saining Xie,Sainan Liu
Main category: cs.CV
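TL;DR: This paper proposes Geometric Latent Diffusion (GLD), a framework that uses the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion, substantially improving the quality and 3D consistency of novel view synthesis (NVS) while accelerating training.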
Details
Motivation: Existing NVS methods mostly operate in a view-independent VAE latent space, which makes cross-view geometric consistency hard to guarantee; the optimal latent space for NVS remains underexplored. Method: Proposes the GLD framework, which takes the geometrically consistent features extracted by geometric foundation models as the latent representation and builds a diffusion model for multi-view generation on top of it. Result: GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, speeds up training by more than 4.4x relative to the VAE latent space, and remains competitive with SOTA methods without large-scale text-to-image pretraining. Conclusion: A geometrically consistent feature space is better suited than the conventional VAE latent space as the diffusion latent space for NVS, offering a new paradigm for combining generative modeling with 3D understanding. Abstract: While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.[338] The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
Kelly Cui,Nikhil Prakash,Ayush Raina,David Bau,Antonio Torralba,Tamar Rott Shaham
Main category: cs.CV
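TL;DR: This paper reveals dual mechanisms for representing spatial associations in vision-language models (VLMs): intermediate layers of the language model encode content-independent spatial relations, but the dominant role belongs to the vision encoder, whose globally distributed visual representations (including background) directly support spatial reasoning; enhancing the spatial representations across all image tokens improves spatial reasoning on naturalistic images.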
Details
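Motivation: It remains unclear where and how vision-language models (VLMs) compute the associations between objects and their properties and spatial relations. Method: Analyzes the internal representations of VLMs to separate the contributions of the language model backbone and the vision encoder to spatial relations, and tests how the spatial signal is distributed across visual tokens (whether it is confined to object regions); further runs intervention experiments that enhance the spatial representations of visual tokens globally. Result: Spatial association relies on two concurrent mechanisms: intermediate layers of the language model represent content-independent spatial relations (a secondary role), while the vision encoder's globally distributed representation of spatial layout (including background) is the dominant source; enhancing this global visual spatial representation improves spatial reasoning on naturalistic images. Conclusion: Spatial association in VLMs is driven mainly by the vision encoder, whose spatial signal is globally distributed; the language model mostly exploits rather than generates the key spatial information; the finding underscores the central role of the vision encoder in spatial reasoning. Abstract: Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.[339] 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing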
Haoyu Zhen,Xiaolong Li,Yilin Zhao,Han Zhang,Sifei Liu,Kaichun Mo,Chuang Gan,Subhashree Radhakrishnan
Main category: cs.CV
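TL;DR: This paper proposes a structured reasoning framework that performs text-guided spatial layout editing via scene-graph reasoning, markedly improving large models' spatial understanding and layout consistency in fine-grained visual editing.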
Details
Motivation: Existing large language models (LLMs) and vision-language models (VLMs) fall short in spatial understanding and layout consistency, which makes them ill-suited to fine-grained visual editing. Method: Proposes a structured reasoning framework that, given an input scene graph and a natural-language instruction, reasons over the graph to generate an updated scene graph that satisfies the text condition while remaining spatially coherent; explicit structured relational representations guide the reasoning process. Result: On a newly built text-guided layout editing benchmark, achieves an average 15% improvement in IoU and a 25% reduction in center-distance error over the CoT-SFT and GRPO baselines; achieves up to 20% higher mIoU than SOTA zero-shot LLMs. Conclusion: Structured relational representations effectively strengthen the model's ability to model spatial relations and improve interpretability and control, providing a more precise spatial reasoning approach for visual editing tasks. Abstract: Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.[340] DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models
Zhide Zhong,Junfeng Li,Junjie He,Haodong Yan,Xin Gong,Guanyi Zhao,Yingjie Cai,Jiantao Gao,Xu Yan,Bingbing Liu,Yingcong Chen,Liuqing Yang,Haoang Li
Main category: cs.CV
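TL;DR: This paper proposes DualCoT-VLA, a visual-linguistic chain-of-thought (CoT) method with a parallel reasoning mechanism; by combining a visual CoT (low-level spatial understanding) with a linguistic CoT (high-level task planning), it addresses the bottlenecks of existing VLA models in multi-step logical planning, fine-grained manipulation, and inference latency.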
Details
Motivation: Standard VLA models struggle with complex multi-step tasks that require logical planning and with precise manipulation that requires fine-grained spatial perception; existing CoT-based VLA methods rely on isolated single-modal CoT, which cannot capture both low-level and high-level information, and on autoregressive decoding, which causes high latency and compounding errors. Method: Proposes DualCoT-VLA: 1) a dual-path CoT, in which a visual CoT models low-level spatial detail and a linguistic CoT models high-level task logic; 2) a parallel CoT mechanism that introduces two sets of learnable query tokens, turning autoregressive reasoning into single-step forward reasoning. Result: Achieves SOTA performance on the LIBERO and RoboCasa GR1 benchmarks and on real-world robot platforms. Conclusion: Through multi-modal parallel chain-of-thought reasoning, DualCoT-VLA effectively improves the planning ability, manipulation precision, and inference efficiency of VLA models on complex tasks. Abstract: Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.[341] UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation
Ziyi Wang,Xinshun Wang,Shuang Chen,Yang Cong,Mengyuan Liu
Main category: cs.CV
TL;DR: UniMotion is the first unified framework that simultaneously handles human motion, language, and RGB images using continuous representations, overcoming limitations of prior discrete-tokenized models through novel architectures (CMA-VAE, DPA, LRA) and achieves SOTA across seven cross-modal tasks.
Details
Motivation: Existing unified models handle only subsets of modalities and rely on discrete tokenization, causing quantization errors and breaking temporal continuity; there's a need for a truly unified, continuous multimodal framework. Method: Introduces UniMotion with Cross-Modal Aligned Motion VAE (CMA-VAE), symmetric dual-path embedders for continuous Motion and RGB pathways, Dual-Posterior KL Alignment (DPA) to inject visual-semantic priors into motion, and Latent Reconstruction Alignment (LRA) for self-supervised pre-training to solve cold-start in motion pathway calibration. Result: Achieves state-of-the-art performance across seven any-to-any understanding, generation, and editing tasks among motion, language, and RGB images, especially excelling in cross-modal compositional tasks. Conclusion: Treating motion as a first-class continuous modality—aligned with RGB and language in a shared LLM backbone—enables robust, flexible, and high-fidelity cross-modal reasoning and generation, setting a new foundation for unified embodied AI. Abstract: We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. 
To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.[342] End-to-End Training for Unified Tokenization and Latent Denoising
Shivam Duggal,Xingjian Bai,Zongze Wu,Richard Zhang,Eli Shechtman,Antonio Torralba,Phillip Isola,William T. Freeman
Main category: cs.CV
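TL;DR: This paper proposes UNITE, a unified autoencoder architecture that jointly optimizes image tokenization and latent diffusion, achieving high-quality image generation with single-stage training and without pretrained encoders or adversarial losses.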
Details
Motivation: Existing latent diffusion models (LDMs) require staged training: a tokenizer is trained first, then a diffusion model is trained in the frozen latent space, which complicates the pipeline; the authors argue that tokenization and generation are the same latent inference problem under different conditioning, and therefore seek to unify their training. Method: Proposes the UNITE architecture, built around a weight-shared Generative Encoder that serves as both image tokenizer and latent generator; uses single-stage training that jointly optimizes both tasks via two forward passes (image to latents, and noise plus conditioning to latents); the shared parameters let gradients jointly shape the latent space, forming a "common latent language". Result: On ImageNet 256×256, the UNITE Base and Large models reach FID 2.12 and 1.73 respectively, close to SOTA; performs well on both image and molecule modalities; requires no adversarial losses or pretrained encoders (e.g., DINO); analyses of representation alignment and compression support the effectiveness of the Generative Encoder. Conclusion: Single-stage joint training of tokenization and generation is feasible; UNITE demonstrates that a unified latent space can be learned from scratch, offering a new paradigm for simplifying LDM training. Abstract: Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near state of the art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. 
These results show that single stage joint training of tokenization & generation from scratch is feasible.[343] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Ruoliu Yang,Chu Wu,Caifeng Shan,Ran He,Chaoyou Fu
Main category: cs.CV
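TL;DR: VideoDetective is a new framework for long-video question answering that builds a visual-temporal affinity graph and combines it with a hypothesis-verification-refinement loop to sparsely and efficiently localize key video segments, significantly improving multimodal large models on long-video understanding tasks.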