cs.CL [Back]

[1] Enhancing Urban Visual Place Recognition for Crowdsourced Flood Imagery via LLM-Guided Attention

Fengyi Xu,Jun Ma,Waishan Qiu,Cui Guo

Main category: cs.CL

TL;DR: 本文提出VPR-AttLLM，一种将大语言模型（LLM）的语义推理与地理空间知识融入现有视觉定位识别（VPR）流程的框架，通过注意力机制增强描述符，提升社交媒体中缺乏地理元数据的洪水图像定位精度，无需重新训练模型。

Details

Motivation: 社交媒体中的众包街景图像常因缺少准确地理位置信息且存在视觉失真，导致现有VPR模型在跨源场景下性能下降，难以用于应急响应。 Method: 提出VPR-AttLLM框架，利用LLM识别城市上下文中具有位置信息的区域，抑制瞬态视觉噪声，并通过注意力机制优化VPR描述符，实现模型无关、无需再训练的性能提升。 Result: 在多个扩展基准（包括含真实洪水图像的SF-XL、合成洪水场景和Mapillary数据，以及新构建的HK-URBAN数据集）上验证，结合三种先进VPR模型（CosPlace、EigenPlaces、SALAD），召回率相对提升1-3%，在最具挑战性的实际洪水图像上最高达8%。 Conclusion: VPR-AttLLM建立了一种可推广的LLM引导多模态融合范式，通过将城市感知理论融入注意力机制，桥接人类空间推理与现代VPR架构，具备即插即用、跨源鲁棒性强和可解释性优势，适用于大规模城市监测与灾时图像快速地理定位。 Abstract: Crowdsourced street-view imagery from social media provides valuable real-time visual evidence of urban flooding and other crisis events, yet it often lacks reliable geographic metadata for emergency response. Existing Visual Place Recognition (VPR) models exhibit substantial performance degradation when applied to such imagery due to visual distortions and domain shifts inherent in cross-source scenarios. This paper presents VPR-AttLLM, a model-agnostic framework that integrates the semantic reasoning and geospatial knowledge of Large Language Models (LLMs) into established VPR pipelines through attention-guided descriptor enhancement. By leveraging LLMs to identify location-informative regions within the city context and suppress transient visual noise, VPR-AttLLM improves retrieval performance without requiring model retraining or additional data. Comprehensive evaluations are conducted on extended benchmarks including SF-XL enriched with real social-media flood images, synthetic flooding scenarios over established query sets and Mapillary photos, and a new HK-URBAN dataset capturing morphologically distinct cityscapes. Integrating VPR-AttLLM with three state-of-the-art VPR models-CosPlace, EigenPlaces, and SALAD-consistently improves recall performance, yielding relative gains typically between 1-3% and reaching up to 8% on the most challenging real flood imagery. Beyond measurable gains in retrieval accuracy, this study establishes a generalizable paradigm for LLM-guided multimodal fusion in visual retrieval systems. By embedding principles from urban perception theory into attention mechanisms, VPR-AttLLM bridges human-like spatial reasoning with modern VPR architectures. Its plug-and-play design, strong cross-source robustness, and interpretability highlight its potential for scalable urban monitoring and rapid geo-localization of crowdsourced crisis imagery.

[2] Reinforcement Learning for Latent-Space Thinking in LLMs

Enes Özeren,Matthias Aßenmacher

Main category: cs.CL

TL;DR: 本文研究了在潜空间中进行推理的Coconut方法，并探索了使用强化学习（如GRPO和新设计的Latent RL）来优化潜思维步骤的方法，但在数学推理任务上仍落后于传统的语言空间CoT模型。

Details

Motivation: 绕过离散语言空间推理的低效性，探索更高效的连续嵌入空间（潜空间）推理方法，并解决现有方法在复杂数学推理任务中性能不足的问题。 Method: 采用监督微调（Coconut）和强化学习技术（包括GRPO和新提出的Latent RL）对潜空间推理模型进行训练，直接优化潜层思维过程。 Result: 实验表明，Coconut方法对设计选择敏感且存在固有局限；尽管引入强化学习，潜空间推理模型在数学推理任务上的表现仍落后于传统语言空间的CoT模型。 Conclusion: 当前的潜空间推理方法在复杂数学任务上尚未达到传统CoT的性能水平，未来需进一步探索更有效的优化策略。 Abstract: Chain-of-Thought (CoT) reasoning typically utilizes the discrete language space for thinking, which is inherently inefficient, as many generated tokens only enforce linguistic rules that are not required for reasoning. To bypass this, latent-space thinking allows models to think using the continuous embedding space. While existing methods for training those models show domain-specific gains, they fail to maintain performance in complex tasks, such as mathematical reasoning. We experimentally demonstrate that the Coconut approach, a form of supervised fine-tuning for latent-space thinking, is highly sensitive to design choices and exhibits several inherent limitations. To address these issues, we investigate reinforcement learning (RL) techniques -- an underexplored direction in latent-space thinking -- including GRPO and design a novel Latent RL method for directly optimizing the latent thinking steps. Our experimental results reveal that these RL-trained models still lag behind traditional language-space CoT models in the mathematical reasoning domain. We make our codebase publicly available.

[3] KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document

Nimol Thuon,Jun Du

Main category: cs.CL

TL;DR: 本文提出了KH-FUNSD，首个用于高棉语商业文档理解的公开分层标注数据集，支持对收据、发票和报价单等文档进行区域检测、实体识别与关系抽取及细粒度语义分类，并为低资源非拉丁语脚本文档分析提供了基准结果。

Details

Motivation: 高棉语作为使用超1700万人的语言，在文档AI工具开发中长期被忽视，尤其缺乏针对商业文档的资源，限制了自动化布局分析的发展。 Method: 设计了一个三层标注框架：(1) 区域检测将文档划分为页眉、表单区、页脚等核心区域；(2) 采用类似FUNSD的方式标注问题、答案、标题等实体及其关系；(3) 细粒度分类赋予字段标签、值、符号等具体语义角色。同时对多个主流模型进行了基准测试。 Result: 发布了首个高棉语表单文档理解数据集KH-FUNSD，提供了多层级标注支持，并建立了首个基准性能评估，揭示了非拉丁、低资源脚本在文档分析中的独特挑战。 Conclusion: KH-FUNSD填补了高棉语商业文档处理的空白，其多级标注结构有助于推动低资源语言在文档AI领域的发展，促进公共管理和私营企业的数字化进程。 Abstract: Automated document layout analysis remains a major challenge for low-resource, non-Latin scripts. Khmer is a language spoken daily by over 17 million people in Cambodia, receiving little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present \textbf{KH-FUNSD}, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3) fine-grained classification that assigns specific semantic roles, such as field labels, values, headers, footers, and symbols. This multi-level approach supports both comprehensive layout analysis and precise information extraction. We benchmark several leading models, providing the first set of baseline results for Khmer business documents, and discuss the distinct challenges posed by non-Latin, low-resource scripts. The KH-FUNSD dataset and documentation will be available at URL.

[4] Direct Confidence Alignment: Aligning Verbalized Confidence with Internal Confidence In Large Language Models

Glenn Zhang,Treasure Mayowa,Jason Fan,Yicheng Fu,Aaron Sandoval,Sean O'Brien,Kevin Zhu

Main category: cs.CL

TL;DR: 本文提出了一种名为直接置信度对齐（DCA）的方法，利用直接偏好优化来对齐大语言模型的口头置信度与内部置信度，从而提升模型的透明性和可靠性，并在多个开放权重模型和数据集上验证了其有效性。

Details

Motivation: 由于大语言模型的内部置信度与口头表达的置信度不一致，导致校准方法产生误导性结果，因此需要一种方法来改善两者之间的对齐以增强模型的可信度。 Method: 采用直接偏好优化（DPO）实现直接置信度对齐（DCA），使模型的口头置信度与其基于token概率的内部置信度保持一致，而非依赖于真实准确性。 Result: 在多个开源大模型和多种数据集上评估显示，DCA能改善部分模型架构的置信度对齐指标，减少置信表达的不一致性；但对其他模型效果有限。此外提出了三个基于校准误差的新评估指标。 Conclusion: DCA有助于提升大语言模型中不同置信度形式之间的一致性，增强模型可解释性与可靠性，但其效果因模型结构而异，表明未来需发展更模型感知的校准方法。 Abstract: Producing trustworthy and reliable Large Language Models (LLMs) has become increasingly important as their usage becomes more widespread. Calibration seeks to achieve this by improving the alignment between the model's confidence and the actual likelihood of its responses being correct or desirable. However, it has been observed that the internal confidence of a model, derived from token probabilities, is not well aligned with its verbalized confidence, leading to misleading results with different calibration methods. In this paper, we propose Direct Confidence Alignment (DCA), a method using Direct Preference Optimization to align an LLM's verbalized confidence with its internal confidence rather than ground-truth accuracy, enhancing model transparency and reliability by ensuring closer alignment between the two confidence measures. We evaluate DCA across multiple open-weight LLMs on a wide range of datasets. To further assess this alignment, we also introduce three new calibration error-based metrics. Our results show that DCA improves alignment metrics on certain model architectures, reducing inconsistencies in a model's confidence expression. However, we also show that it can be ineffective on others, highlighting the need for more model-aware approaches in the pursuit of more interpretable and trustworthy LLMs.

[5] Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

Minghui Liu,Aadi Palnitkar,Tahseen Rabbani,Hyunwoo Jae,Kyle Rui Sang,Dixi Yao,Shayan Shabihi,Fuheng Zhao,Tian Li,Ce Zhang,Furong Huang,Kunpeng Zhang

Main category: cs.CL

TL;DR: 本文研究了在长推理任务中，针对大语言模型KV缓存增长问题的多种压缩策略的性能，发现H2O和改进版SnapKV在推理模型中表现最佳，且缓存大小与推理成本存在权衡。

Details

Motivation: 大语言模型在长上下文任务中受限于KV缓存的线性增长，现有压缩策略多关注预填充阶段，缺乏在长解码推理任务中的评估。 Method: 在GSM8K和MATH500等长推理基准上，系统评测了多种KV缓存压缩策略，并提出了适配解码阶段的SnapKV变体。 Result: 发现H2O和解码增强版SnapKV在推理任务中表现最优，且低缓存预算下的逐出策略可生成更长的推理链，揭示了缓存大小与推理成本之间的权衡。 Conclusion: 针对推理任务的KV缓存压缩应结合重访问项追踪机制，且需根据任务类型选择合适策略，未来设计需平衡缓存效率与推理质量。 Abstract: Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.

[6] Benchmarking Contextual Understanding for In-Car Conversational Systems

Philipp Habicht,Lev Sorokin,Abdullah Saydemir,Ken E. Friedl,Andrea Stocco

Main category: cs.CL

TL;DR: 本文研究了利用大语言模型（LLM）和先进提示技术评估车载对话式问答（ConvQA）系统响应与用户话语一致性的方法，重点在于上下文理解和基于用户约束的场所推荐。通过合成生成包含正确和错误响应的数据，结合多种提示技术对13种LLM进行评估，结果表明基于LLM的评估在准确性和成本效率上优于传统人工评估。

Details

Motivation: 评估ConvQA系统在真实车内场景中的准确性和可靠性具有挑战性，尤其需要衡量其对用户话语的上下文理解能力和满足实际约束的推荐能力，因此需要一种自动化、可扩展且精确的评估方法。 Method: 采用大语言模型（LLM）结合输入-输出、思维链、自洽性及多智能体提示技术，通过合成生成用户话语及对应的正确与含错系统响应，在餐厅推荐案例中评估不同规模和类型LLM的表现。 Result: 小规模非推理模型在多智能体提示下提升最显著；推理模型整体表现更优，其中单智能体+自洽性提示效果最佳；DeepSeek-R1达到0.99 F1分数（每次请求0.002美元）；DeepSeek-V3在效果与成本效率间取得最佳平衡。 Conclusion: 基于LLM的评估方法能有效替代传统人工评估，为ConvQA系统的上下文理解能力提供可扩展且高精度的自动化评测方案。 Abstract: In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances accompanied by correct and modified failure-containing system responses. We use input-output, chain-of-thought, self-consistency prompting, and multi-agent prompting techniques with 13 reasoning and non-reasoning LLMs of varying sizes and providers, including OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study involving restaurant recommendations. The most substantial improvements occur for small non-reasoning models when applying advanced prompting techniques, particularly multi-agent prompting. However, reasoning models consistently outperform non-reasoning models, with the best performance achieved using single-agent prompting with self-consistency. Notably, DeepSeek-R1 reaches an F1-score of 0.99 at a cost of 0.002 USD per request. Overall, the best balance between effectiveness and cost-time efficiency is reached with the non-reasoning model DeepSeek-V3. Our findings show that LLM-based evaluation offers a scalable and accurate alternative to traditional human evaluation for benchmarking contextual understanding in ConvQA systems.

[7] VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

Avinash Amballa,Yashas Malur Saidutta,Chi-Heng Lin,Vivek Kulkarni,Srinivas Chappidi

Main category: cs.CL

TL;DR: 本文提出了一种名为Voyager的新方法，用于生成多样化的合成数据集，利用行列式点过程直接优化数据多样性，在无需训练、适用于闭源模型且可扩展的情况下显著优于现有基线方法。

Details

Motivation: 现有的大语言模型生成的合成数据缺乏多样性，限制了其在下游任务中的应用效果。 Method: 提出Voyager方法，采用行列式点过程（determinantal point processes）迭代地优化数据集的多样性，并保持无需训练和可扩展的特性。 Result: 实验证明，Voyager在多样性指标上比主流基线方法提升1.5-3倍，并具备良好的理论支持。 Conclusion: Voyager是一种高效、可扩展且无需训练的方法，能显著提升合成数据集的多样性，适用于广泛的大模型应用场景。 Abstract: Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel principled approach to generate diverse datasets. Our approach is iterative and directly optimizes a mathematical quantity that optimizes the diversity of the dataset using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for the working of our method, we also demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches by providing a 1.5-3x improvement in diversity.

[8] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

Jiayi Yuan,Cameron Shinn,Kai Xu,Jingze Cui,George Klimiashvili,Guangxuan Xiao,Perkz Zheng,Bo Li,Yuxin Zhou,Zhouhai Ye,Weijie You,Tian Zheng,Dominic Brown,Pengbo Wang,Richard Cai,Julien Demouth,John D. Owens,Xia Hu,Song Han,Timmy Liu,Huizi Mao

Main category: cs.CL

TL;DR: 本文提出了一种名为BLASST的即插即用稀疏注意力方法，通过动态剪枝注意力矩阵来加速大语言模型的长上下文推理，无需预计算或代理评分，兼容现有FlashAttention设计，并在保持高精度的同时显著提升推理速度。

Details

Motivation: 由于大语言模型对长上下文推理需求的增长，标准注意力机制面临计算和内存瓶颈，亟需高效且通用的稀疏化方案。 Method: BLASST利用在线softmax中的已有信息和固定阈值识别可忽略的注意力得分，跳过后续计算与访存操作；并提出自动化校准方法确定最优阈值，适用于多种注意力变体及预填充与解码阶段。 Result: 在现代GPU上实现了预填充阶段1.62倍加速（74.7%稀疏度）和解码阶段1.48倍加速（73.2%稀疏度），同时保持高准确性；结合稀疏感知训练可进一步提升性能。 Conclusion: BLASST为长上下文推理提供了一个高效、统一且易于集成的稀疏注意力解决方案，兼具高性能与实用性，并为未来稀疏训练提供了方向。 Abstract: The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62x speedup for prefill at 74.7% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs. Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.

[9] Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

Yoav Gelberg,Koshi Eguchi,Takuya Akiba,Edoardo Cetin

Main category: cs.CL

TL;DR: 本文提出了一种名为DroPE的新方法，通过在训练后移除位置嵌入（Positional Embeddings）实现语言模型上下文长度的零样本扩展，无需昂贵的长序列微调。

Details

Motivation: 现有方法过度依赖显式的位置信息，导致模型难以泛化到未见过的序列长度，即使使用位置嵌入缩放技术也效果有限。而位置嵌入并非语言建模的本质需求，因此可在预训练后安全移除。 Method: 在预训练完成后移除位置嵌入，并引入一个短暂的重校准阶段，使模型适应无位置嵌入的推理环境，从而实现对更长上下文的零样本泛化。 Result: DroPE在不进行长上下文微调的情况下，实现了无缝的零样本上下文扩展，在多种模型和数据集上均显著优于现有的专用架构和RoPE缩放方法，同时保持原有上下文性能不受影响。 Conclusion: 位置嵌入虽有助于训练收敛，但限制了测试时的长度泛化；DroPE通过训练后移除并重校准，打破了需昂贵微调来扩增上下文的固有瓶颈，为上下文扩展提供了高效新范式。 Abstract: So far, expensive finetuning beyond the pretraining sequence length has been a requirement for effectively extending the context of language models (LM). In this work, we break this key bottleneck by Dropping the Positional Embeddings of LMs after training (DroPE). Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings (PEs) serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length, even when using popular PE-scaling methods. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely removed after pretraining, following a short recalibration phase. Empirically, DroPE yields seamless zero-shot context extension without any long-context finetuning, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary positional embedding scaling methods.

[10] Diffusion Language Model Inference with Monte Carlo Tree Search

Zheng Huang,Kiran Ramnath,Yueyan Chen,Aosong Feng,Sangmin Woo,Balasubramaniam Srinivasan,Zhichao Xu,Kang Zhou,Shuai Wang,Haibo Ding,Lin Lee Cheong

Main category: cs.CL

TL;DR: 本文提出了MEDAL框架，将蒙特卡洛树搜索（MCTS）引入扩散语言模型（DLMs）的推理初始化阶段，以解决并行去噪生成中的组合搜索问题，显著提升了生成质量。

Details

Motivation: 现有的DLM推理方法依赖启发式策略或额外训练，难以有效探索最优解码路径，缺乏原则性的搜索机制。 Method: 提出MEDAL框架，在初始化阶段使用蒙特卡洛树搜索探索有希望的去遮罩轨迹，并通过限制搜索空间到高置信度动作、优先选择提升模型置信度的token来实现高效搜索。 Result: 在多个基准测试中，MEDAL相比现有推理策略最高提升了22.0%，显著改善了生成效果。 Conclusion: MEDAL为扩散语言模型的推理引入了有效的基于搜索的范式，证明了结合规划与优化的搜索策略在DLM中的潜力。 Abstract: Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To introduce a principled search mechanism for DLMs inference, we introduce MEDAL, a framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This integration is enabled by restricting the search space to high-confidence actions and prioritizing token choices that improve model confidence over remaining masked positions. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in diffusion language models.

[11] Semantic Distance Measurement based on Multi-Kernel Gaussian Processes

Yinzhu Cheng,Haihua Xie,Yaqing Wang,Miao He,Mingming Sun

Main category: cs.CL

TL;DR: 提出了一种基于多核高斯过程（MK-GP）的语义距离度量方法，通过数据驱动方式自动学习核参数，在细粒度情感分类的上下文学习任务中验证了其有效性。

Details

Motivation: 传统语义距离方法通常是固定的，难以适应特定数据分布和任务需求，缺乏灵活性。 Method: 将文本相关的潜在语义函数建模为高斯过程，协方差函数采用结合Matérn和多项式成分的复合核函数，并在监督下从数据中自动学习核参数。 Result: 所提出的语义距离在基于大语言模型的上下文学习框架下的细粒度情感分类任务中表现出良好性能。 Conclusion: 基于多核高斯过程的语义距离测量方法能够有效适应任务需求，具有比传统固定方法更强的灵活性和实用性。 Abstract: Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) was proposed. The latent semantic function associated with texts was modeled as a Gaussian process, with its covariance function given by a combined kernel combining Matérn and polynomial components. The kernel parameters were learned automatically from data under supervision, rather than being hand-crafted. This semantic distance was instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrated the effectiveness of the proposed measure.

[12] Adversarially Probing Cross-Family Sound Symbolism in 27 Languages

Anika Sharma,Tianyi Niu,Emma Wrenn,Shashank Srivastava

Main category: cs.CL

TL;DR: 本文首次通过大规模计算方法跨语言分析了声音象征现象在大小语义领域中的表现，发现即使在无亲缘关系的语言中，音位形式也能显著预测大小意义，表明存在跨家族的声音象征偏见。

Details

Motivation: 声音象征（即词汇声音与意义之间的非任意映射）长期以来通过如Bouba Kiki等轶事实验得到证明，但很少进行大规模测试。本文旨在通过跨语言的大规模计算分析，验证声音象征在大小语义领域的普遍性。 Method: 构建了一个包含810个形容词（27种语言，每种30个词）的类型广泛的数据集，每个词均有音位转录并用母语者音频验证；使用基于音段包特征的可解释分类器，并训练对抗性擦除器以抑制语言身份同时保留大小语义信号。 Result: 发现音系形式能显著预测大小语义，且元音和辅音均有贡献；对抗擦除后语言识别准确率低于随机水平，而大小预测仍显著高于随机，显示跨语言家族的声音象征偏差。 Conclusion: 研究结果支持跨语言家族存在普遍的声音象征偏见，为声音象征的普遍性提供了大规模实证证据，并公开数据、代码和诊断工具以促进未来对表意性的研究。 Abstract: The phenomenon of sound symbolism, the non-arbitrary mapping between word sounds and meanings, has long been demonstrated through anecdotal experiments like Bouba Kiki, but rarely tested at scale. We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We compile a typologically broad dataset of 810 adjectives (27 languages, 30 words each), each phonemically transcribed and validated with native-speaker audio. Using interpretable classifiers over bag-of-segment features, we find that phonological form predicts size semantics above chance even across unrelated languages, with both vowels and consonants contributing. To probe universality beyond genealogy, we train an adversarial scrubber that suppresses language identity while preserving size signal (also at family granularity). Language prediction averaged across languages and settings falls below chance while size prediction remains significantly above chance, indicating cross-family sound-symbolic bias. We release data, code, and diagnostic tools for future large-scale studies of iconicity.

[13] Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

Abhay Srivastava,Sam Jung,Spencer Mateega

Main category: cs.CL

TL;DR: 本文提出了MARKET-BENCH，一个用于评估大语言模型在基础量化交易任务中表现的基准，通过将自然语言策略转化为可执行回测代码来测试模型的结构可靠性和数值准确性。

Details

Motivation: 现有大语言模型在金融领域的实际应用能力尚不明确，尤其是在需要精确代码生成和数值推理的量化交易任务中缺乏系统性评估标准。 Method: 设计包含三种经典交易策略（定时交易、配对交易、Delta对冲）的基准测试，要求模型根据自然语言描述生成可运行且数值准确的回测代码，并采用多轮pass@k指标评估结构性可靠性和数值误差。 Result: 12个主流大模型在简单策略上表现较好（平均pass@3达0.80），但在复杂任务中错误差异巨大；Gemini 3 Pro与Claude 4.5 Sonnet在可靠性与误差控制上表现均衡，GPT-5.1 Codex-Max在前两项策略实现完美通过率，Qwen3 Max虽能稳定生成可运行代码但存在P&L路径不准确问题。 Conclusion: 当前大语言模型能够支持基础交易系统的搭建，但在价格、持仓和风险等关键金融变量的稳健推理方面仍存在显著不足。 Abstract: We introduce MARKET-BENCH, a benchmark that evaluates large language models (LLMs) on introductory quantitative trading tasks by asking them to construct executable backtesters from natural-language strategy descriptions and market assumptions. Each instance specifies one of three canonical strategies -- scheduled trading on Microsoft (NASDAQ: MSFT), pairs trading on Coca-Cola (NASDAQ: KO) and Pepsi (NASDAQ: PEP), or delta hedging on MSFT -- and models must produce code whose P\&L, drawdown, and position paths match a verifiable reference implementation. We assess twelve state-of-the-art models using a multi-round pass@k metric that separates structural reliability (whether the backtest runs) from numerical accuracy (mean absolute error of the backtest metrics). While most models reliably execute the simplest strategy (average pass@3 of 0.80), errors vary by orders of magnitude across models and tasks: Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies, GPT-5.1 Codex-Max achieves perfect pass@1 on the first two strategies and the lowest best-run error on the easiest task, and Qwen3 Max attains perfect pass@3 yet sometimes produces inaccurate P\&L paths. These results show that current LLMs can scaffold basic trading infrastructure but still struggle to reason robustly about prices, inventory, and risk; we release MARKET-BENCH and a public leaderboard at https://marketbench.ai.

[14] F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation

Radu-Gabriel Chivereanu,Tiberiu Boros

Main category: cs.CL

TL;DR: 本文提出了一种轻量级输入级适配器，使F5-TTS模型能够支持罗马尼亚语，同时保持其原有的语音克隆及中英文支持能力。

Details

Motivation: 扩展F5-TTS模型以支持罗马尼亚语，同时不损害其已有语言能力和语音特性。 Method: 冻结原模型权重，附加一个子网络作为文本编码器中字符级嵌入矩阵的扩展，并利用F5-TTS中的ConvNeXt模块建模新嵌入间的依赖关系，实现软性的字母到音素转换。 Result: 人类听测结果显示模型保留了语音克隆能力，并在一定程度上实现了英罗语码混用，但生成语音中仍残留英语口音特征。 Conclusion: 所提方法有效增强了多语言支持能力，具备实用性与可扩展性，代码和音频样例已开源。 Abstract: This work introduces a lightweight input-level adapter for the F5-TTS model that enables Romanian Language support. To preserve the existing capabilities of the model (voice cloning, English and Chinese support), we keep the original weights frozen, append a sub-network to the model and train it as an extension for the textual embedding matrix of the text encoder. For simplicity, we rely on ConvNeXt module implemented in F5-TTS to also model the co-dependencies between the new character-level embeddings. The module serves as a ``soft`` letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to produce naturally sounding Romanian utterances. We evaluate the model with a pool of 20 human listeners across three tasks: (a) audio similarity between reference and generated speech, (b) pronunciation and naturalness and (c) Romanian-English code-switching. The results indicate that our approach maintains voice cloning capabilities and enables, to a certain extent, code-switching within the same utterance; however, residual English accent characteristics remain. We open-source our code and provide example audio samples at https://github.com/racai-ro/Ro-F5TTS.

Yushen Fang,Jianjun Li,Mingqian Ding,Chang Liu,Xinchi Zou,Wenqi Yang

Main category: cs.CL

TL;DR: 提出了一种新的通用信息抽取范式SCIR框架和一个多任务双语自纠正数据集MBSC，通过双路径自纠正模块和反馈优化机制，实现了与现有大模型的即插即用兼容，显著降低训练成本并提升性能。

Details

Motivation: 现有的基于大语言模型的信息抽取系统在微调时面临训练成本高和难以对齐大模型偏好的问题，因此需要一种更高效、低成本且能有效利用大模型能力的新范式。 Method: 提出了Self-Correcting Iterative Refinement（SCIR）框架，包含双路径自纠正模块和反馈驱动优化机制，并构建了包含十万余条数据的多任务双语自纠正（MBSC）数据集，用于蒸馏GPT-4的能力以实现偏好对齐。 Result: 在命名实体识别、关系抽取和事件抽取三个任务上，SCIR相比现有最先进方法平均提升了5.27%的基于span的Micro-F1分数，同时训练成本降低了87%。 Conclusion: SCIR框架结合MBSC数据集有效解决了信息抽取中训练成本高和偏好对齐难的问题，推动了轻量高效信息抽取范式的發展。 Abstract: Although Large language Model (LLM)-powered information extraction (IE) systems have shown impressive capabilities, current fine-tuning paradigms face two major limitations: high training costs and difficulties in aligning with LLM preferences. To address these issues, we propose a novel universal IE paradigm, the Self-Correcting Iterative Refinement (SCIR) framework, along with a Multi-task Bilingual (Chinese-English) Self-Correcting (MBSC) dataset containing over 100,000 entries. The SCIR framework achieves plug-and-play compatibility with existing LLMs and IE systems through its Dual-Path Self-Correcting module and feedback-driven optimization, thereby significantly reducing training costs. Concurrently, the MBSC dataset tackles the challenge of preference alignment by indirectly distilling GPT-4's capabilities into IE result detection models. Experimental results demonstrate that SCIR outperforms state-of-the-art IE methods across three key tasks: named entity recognition, relation extraction, and event extraction, achieving a 5.27 percent average improvement in span-based Micro-F1 while reducing training costs by 87 percent compared to baseline approaches. These advancements not only enhance the flexibility and accuracy of IE systems but also pave the way for lightweight and efficient IE paradigms.

[16] Can GPT replace human raters? Validity and reliability of machine-generated norms for metaphors

Veronica Mangiaterra,Hamad Al-Azary,Chiara Barattieri di San Pietro,Paolo Canal,Valentina Bambini

Main category: cs.CL

TL;DR: 本研究首次评估了三种GPT模型生成的隐喻在熟悉度、可理解性和意象性方面的评分的有效性与可靠性，结果表明机器生成的评分与人类评分具有中等到强的相关性，尤其在较大模型上表现更优，且能有效预测行为和脑电反应，稳定性高，但在处理隐喻的常规性和多模态特征时与人类对齐较差。

Details

Motivation: 随着大语言模型在科学研究中的广泛应用，其可信度问题日益重要；在心理语言学中，尽管LLM已用于自动生成词汇的人类评分数据，但其对复杂语言单位如隐喻的评分能力尚不清楚，因此需系统评估其有效性与可靠性。 Method: 使用三种GPT模型为来自意大利和英语研究的687个隐喻项目生成熟悉度、可理解性和意象性评分，并通过与人类评分的相关性、对反应时和脑电（EEG）信号的预测能力以及跨会话稳定性进行验证。 Result: 机器生成的评分与人类评分呈正相关：熟悉度在英意两种语言中均达中等到强相关（尤其在低感知负荷条件下），意象性在英语中为中等相关，在意大利语中为中等到强相关，英语隐喻的可理解性相关性最强；大模型优于小模型；机器评分能显著预测反应时和EEG波幅，效果接近人类评分；且跨会话评分高度稳定。 Conclusion: GPT模型（尤其是较大的模型）可以有效且可靠地替代或补充人类被试对隐喻属性进行评分，但在涉及隐喻的常规性和多模态意义方面与人类判断存在偏差，提示在使用时需谨慎考虑刺激材料的性质。 Abstract: As Large Language Models (LLMs) are increasingly being used in scientific research, the issue of their trustworthiness becomes crucial. In psycholinguistics, LLMs have been recently employed in automatically augmenting human-rated datasets, with promising results obtained by generating ratings for single words. Yet, performance for ratings of complex items, i.e., metaphors, is still unexplored. Here, we present the first assessment of the validity and reliability of ratings of metaphors on familiarity, comprehensibility, and imageability, generated by three GPT models for a total of 687 items gathered from the Italian Figurative Archive and three English studies. We performed a thorough validation in terms of both alignment with human data and ability to predict behavioral and electrophysiological responses. We found that machine-generated ratings positively correlated with human-generated ones. Familiarity ratings reached moderate-to-strong correlations for both English and Italian metaphors, although correlations weakened for metaphors with high sensorimotor load. Imageability showed moderate correlations in English and moderate-to-strong in Italian. Comprehensibility for English metaphors exhibited the strongest correlations. Overall, larger models outperformed smaller ones and greater human-model misalignment emerged with familiarity and imageability. Machine-generated ratings significantly predicted response times and the EEG amplitude, with a strength comparable to human ratings. Moreover, GPT ratings obtained across independent sessions were highly stable. We conclude that GPT, especially larger models, can validly and reliably replace - or augment - human subjects in rating metaphor properties. Yet, LLMs align worse with humans when dealing with conventionality and multimodal aspects of metaphorical meaning, calling for careful consideration of the nature of stimuli.

[17] Large language models have learned to use language

Gary Lupyan

Main category: cs.CL

TL;DR: 本文探讨了大型语言模型在语言使用上的能力，认为这可能为语言科学带来突破，并指出我们已进入后图灵测试时代。

Details

Motivation: 认识到大语言模型已经学会使用语言，可能推动语言科学的发展。 Method: 通过理论分析和对现有观念的反思，提出需要放弃一些长期持有的语言知识评估观念。 Result: 指出当前已进入后图灵测试时代，需重新思考语言能力的评估方式。 Conclusion: 为了实现语言科学的突破，必须接受大语言模型的语言能力并更新评估范式。 Abstract: Acknowledging that large language models have learned to use language can open doors to breakthrough language science. Achieving these breakthroughs may require abandoning some long-held ideas about how language knowledge is evaluated and reckoning with the difficult fact that we have entered a post-Turing test era.

[18] The American Ghost in the Machine: How language models align culturally and the effects of cultural prompting

James Luther,Donald Brown

Main category: cs.CL

TL;DR: 该研究使用VSM13国际调查和霍夫斯泰德文化维度，评估了多个主流大语言模型（如GPT系列、Claude、Llama等）的文化对齐性，并通过文化提示（cultural prompting）测试其适应不同文化（中国、法国、印度、伊朗、日本、美国）的能力。结果显示，大多数模型默认偏向美国文化，且在尝试对齐日本和中国文化时表现不佳，即使部分模型源自中国企业。

Details

Motivation: 随着生成式大语言模型在人机交互中的广泛应用，确保其文化对齐性变得至关重要。研究旨在揭示主流LLMs的文化倾向，并探索通过系统提示调整其文化适配性的可行性。 Method: 采用VSM13国际调查和霍夫斯泰德文化维度理论，对八种流行LLM进行文化对齐分析，并通过设计针对特定国家的文化系统提示，评估模型输出的文化适应性变化。 Result: 多数模型在无明确文化设定时倾向于美国文化；通过文化提示，七种模型能部分转向目标文化，但在对中国和日本文化的对齐上普遍存在困难，包括来自中国的DeepSeek模型。 Conclusion: 当前主流大语言模型存在显著的西方尤其是美国文化偏见，文化提示可在一定程度上调节模型的文化倾向，但对高语境或复杂文化（如中日）的对齐仍具挑战。 Abstract: Culture is the bedrock of human interaction; it dictates how we perceive and respond to everyday interactions. As the field of human-computer interaction grows via the rise of generative Large Language Models (LLMs), the cultural alignment of these models become an important field of study. This work, using the VSM13 International Survey and Hofstede's cultural dimensions, identifies the cultural alignment of popular LLMs (DeepSeek-V3, V3.1, GPT-5, GPT-4.1, GPT-4, Claude Opus 4, Llama 3.1, and Mistral Large). We then use cultural prompting, or using system prompts to shift the cultural alignment of a model to a desired country, to test the adaptability of these models to other cultures, namely China, France, India, Iran, Japan, and the United States. We find that the majority of the eight LLMs tested favor the United States when the culture is not specified, with varying results when prompted for other cultures. When using cultural prompting, seven of the eight models shifted closer to the expected culture. We find that models had trouble aligning with Japan and China, despite two of the models tested originating with the Chinese company DeepSeek.

[19] NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

Agniva Maiti,Manya Pandey,Murari Mandal

Main category: cs.CL

TL;DR: 本文提出了NagaNLP，一个用于 Nagamese 语言的开源工具包，采用大语言模型生成并由人工验证的合成数据方法，构建了高质量的语料库，并训练出在POS和NER任务上表现优异的模型，以及一个高性能的对话模型NagaLLaMA，为低资源语言提供了可复用的框架。

Details

Motivation: Nagamese等大多数世界语言在自然语言处理中严重缺乏资源，导致其难以融入数字技术，亟需有效方法缓解数据稀缺问题。 Method: 提出一种多阶段混合方法：首先使用专家指导的大语言模型（Gemini）生成候选语料，再由母语者进行修正和标注，构建出1万对话语数据集和高质量标注语料库，并基于此训练XLM-RoBERTa-base和Llama-3.2-3B Instruct模型。 Result: 微调后的XLM-RoBERTa-base在词性标注任务上达到93.81%准确率（0.90 F1-Macro），命名实体识别达0.75 F1-Macro；Llama-3.2-3B模型NagaLLaMA在对话任务中困惑度低至3.85，相比少样本版本提升一个数量级。 Conclusion: 该研究成功为Nagamese建立了首个高效NLP工具链，证明了合成数据结合人工验证的有效性，所发布工具包为低资源语言处理提供了可复现、可扩展的范式。 Abstract: The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving a 93.81\% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.

[20] HyperEdit: Unlocking Instruction-based Text Editing in LLMs via Hypernetworks

Yiming Zeng,Jinghan Cao,Zexin Li,Wanhao Yu,Zhankai Ye,Dawei Xiang,Ting Hua,Xin Liu,Shangqian Gao,Tingting Yu

Main category: cs.CL

TL;DR: HyperEdit提出了一种基于超网络的动态适应和差异感知正则化方法，以提升大模型在指令驱动文本编辑中的准确性和保真度。

Details

Motivation: 现有大语言模型在指令驱动的文本编辑任务中难以准确对齐用户意图，且容易对未修改区域进行过度编辑，影响功能完整性。 Method: 引入超网络生成请求特定参数，实现动态适应；采用差异感知正则化，聚焦于修改片段进行监督，避免过度编辑。 Result: 在仅使用3B参数的情况下，HyperEdit在修改区域的BLEU得分上比现有最先进基线相对提升9%至30%。 Conclusion: HyperEdit有效提升了指令驱动文本编辑的精确性与一致性，支持最小化、忠实的文本修改，适用于代码编辑等高可靠性需求场景。 Abstract: Instruction-based text editing is increasingly critical for real-world applications such as code editors (e.g., Cursor), but Large Language Models (LLMs) continue to struggle with this task. Unlike free-form generation, editing requires faithfully implementing user instructions while preserving unchanged content, as even minor unintended modifications can break functionality. Existing approaches treat editing as generic text generation, leading to two key failures: they struggle to faithfully align edits with diverse user intents, and they often over-edit unchanged regions. We propose HyperEdit to address both issues. First, we introduce hypernetwork-based dynamic adaptation that generates request-specific parameters, enabling the model to tailor its editing strategy to each instruction. Second, we develop difference-aware regularization that focuses supervision on modified spans, preventing over-editing while ensuring precise, minimal changes. HyperEdit achieves a 9%--30% relative improvement in BLEU on modified regions over state-of-the-art baselines, despite utilizing only 3B parameters.

[21] Coupled Variational Reinforcement Learning for Language Model General Reasoning

Xueru Wen,Jie Lou,Yanjiang Liu,Hongyu Lin,Ben He,Xianpei Han,Le Sun,Yaojie Lu,Debing Zhang

Main category: cs.CL

TL;DR: 本文提出了一种名为CoVRL的新型强化学习方法，通过耦合先验与后验分布，在无需验证器的情况下提升语言模型的推理能力，显著提高了数学和通用推理任务的性能。

Details

Motivation: 现有强化学习在语言模型推理中受限于可验证奖励的需求，而无验证器方法因推理路径与答案脱节导致探索效率低、连贯性差。 Method: 提出CoVRL方法，结合变分推断与强化学习，采用混合采样策略耦合先验与后验分布，构建并优化复合分布以实现高效探索并保持思维与答案的一致性。 Result: 在数学和通用推理基准上，相比基础模型提升12.4%，相比强无验证器基线再提升2.3%。 Conclusion: CoVRL为增强语言模型的通用推理能力提供了一个有原则的框架，有效解决了推理路径与答案脱节的问题。 Abstract: While reinforcement learning have achieved impressive progress in language model reasoning, they are constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the intrinsic probabilities of LLMs generating reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose \textit{\b{Co}upled \b{V}ariational \b{R}einforcement \b{L}earning} (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4\% over the base model and achieves an additional 2.3\% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

[22] Human-Inspired Learning for Large Language Models via Obvious Record and Maximum-Entropy Method Discovery

Hong Su

Main category: cs.CL

TL;DR: 提出一种受人类学习启发的框架，通过显式记忆和最大熵方法发现机制，提升大语言模型对罕见、低资源或未见场景的适应能力。

Details

Motivation: 大语言模型在处理稀有、低资源或未见场景时表现不佳，且依赖隐式参数记忆，缺乏对方法的显式获取与优化能力。 Method: 引入两种机制：一是‘显式记录’（Obvious Record），将因果或问答关系以符号形式存储；二是‘最大熵方法发现’（Maximum-Entropy Method Discovery），优先保留语义差异大的方法，增强策略多样性。 Result: 在60个语义多样的问答对基准上验证，该方法在未见问题覆盖度和内部多样性上均显著优于随机基线。 Conclusion: 该框架能有效提升模型对罕见情况的学习与泛化能力，推动模型从直觉预测向类人方法学习转变。 Abstract: Large Language Models (LLMs) excel at extracting common patterns from large-scale corpora, yet they struggle with rare, low-resource, or previously unseen scenarios-such as niche hardware deployment issues or irregular IoT device behaviors-because such cases are sparsely represented in training data. Moreover, LLMs rely primarily on implicit parametric memory, which limits their ability to explicitly acquire, recall, and refine methods, causing them to behave predominantly as intuition-driven predictors rather than deliberate, method-oriented learners. Inspired by how humans learn from rare experiences, this paper proposes a human-inspired learning framework that integrates two complementary mechanisms. The first, Obvious Record, explicitly stores cause--result (or question--solution) relationships as symbolic memory, enabling persistent learning even from single or infrequent encounters. The second, Maximum-Entropy Method Discovery, prioritizes and preserves methods with high semantic dissimilarity, allowing the system to capture diverse and underrepresented strategies that are typically overlooked by next-token prediction. Verification on a benchmark of 60 semantically diverse question--solution pairs demonstrates that the proposed entropy-guided approach achieves stronger coverage of unseen questions and significantly greater internal diversity than a random baseline, confirming its effectiveness in discovering more generalizable and human-inspired methods.

[23] StruProKGR: A Structural and Probabilistic Framework for Sparse Knowledge Graph Reasoning

Yucan Guo,Saiping Guan,Miao Su,Zeya Zhao,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng

Main category: cs.CL

TL;DR: 提出了一种名为StruProKGR的结构化与概率化框架，用于在稀疏知识图谱上进行高效且可解释的推理，通过距离引导的路径收集和基于结构信息的概率路径聚合，在多个基准上优于现有方法。

Details

Motivation: 稀疏知识图谱中知识不完整，传统基于路径的方法依赖计算成本高的随机游走且忽略路径间的结构关联，导致推理效果和效率受限。 Method: 提出StruProKGR框架：1）采用距离引导的路径收集机制减少计算开销并探索更相关路径；2）通过结合结构信息的概率路径聚合，优先选择相互增强的路径。 Result: 在五个稀疏KG推理基准上的实验表明，StruProKGR在有效性和效率方面均优于现有的基于路径的方法。 Conclusion: StruProKGR为稀疏知识图谱推理提供了一个有效、高效且可解释的解决方案，显著提升了路径-based方法的性能。 Abstract: Sparse Knowledge Graphs (KGs) are commonly encountered in real-world applications, where knowledge is often incomplete or limited. Sparse KG reasoning, the task of inferring missing knowledge over sparse KGs, is inherently challenging due to the scarcity of knowledge and the difficulty of capturing relational patterns in sparse scenarios. Among all sparse KG reasoning methods, path-based ones have attracted plenty of attention due to their interpretability. Existing path-based methods typically rely on computationally intensive random walks to collect paths, producing paths of variable quality. Additionally, these methods fail to leverage the structured nature of graphs by treating paths independently. To address these shortcomings, we propose a Structural and Probabilistic framework named StruProKGR, tailored for efficient and interpretable reasoning on sparse KGs. StruProKGR utilizes a distance-guided path collection mechanism to significantly reduce computational costs while exploring more relevant paths. It further enhances the reasoning process by incorporating structural information through probabilistic path aggregation, which prioritizes paths that reinforce each other. Extensive experiments on five sparse KG reasoning benchmarks reveal that StruProKGR surpasses existing path-based methods in both effectiveness and efficiency, providing an effective, efficient, and interpretable solution for sparse KG reasoning.

[24] Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives

Aheli Poddar,Saptarshi Sahoo,Sujata Ghosh

Main category: cs.CL

TL;DR: 研究了14个大语言模型在三段论推理中的表现，探讨其逻辑和自然语言理解能力，发现某些模型在符号推理上表现完美，引发对LLM是否正成为形式化推理机制的思考。

Details

Motivation: 探索大语言模型在三段论推理中的基本推理能力及其发展方向，从逻辑和自然语言两个角度分析其表现。 Method: 使用14个大语言模型，评估其在符号推断和自然语言理解方面的三段论推理能力。 Result: 三段论推理能力并非在所有大语言模型中均匀出现，但某些模型在符号推理任务中表现出色。 Conclusion: 大语言模型可能正在演变为更形式化的推理机制，而非体现人类推理的细微之处。 Abstract: We study syllogistic reasoning in LLMs from the logical and natural language perspectives. In process, we explore fundamental reasoning capabilities of the LLMs and the direction this research is moving forward. To aid in our studies, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of symbolic inferences as well as natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performances in certain models make us wonder whether LLMs are becoming more and more formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.

[25] Which Pieces Does Unigram Tokenization Really Need?

Sander Land,Yuval Pinter

Main category: cs.CL

TL;DR: 本文提出了一种清晰的Unigram分词算法实现指南，并识别出一种在训练损失略高但压缩性能更好的简化算法。

Details

Motivation: 尽管Unigram分词算法在理论上具有优雅性，但其实现复杂，限制了其广泛应用。 Method: 提供了一个清晰的实现指南和参数选择建议，并提出了一种简化的算法。 Result: 新算法在略微增加训练损失的情况下，实现了更好的压缩效果。 Conclusion: 该研究弥合了Unigram算法理论与实践之间的差距，促进了其更广泛的应用。 Abstract: The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.

[26] LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases

Yida Cai,Ranjuexiao Hu,Huiyuan Xie,Chenyang Li,Yun Liu,Yuxiao Ye,Zhenghao Liu,Weixing Shen,Zhiyuan Liu

Main category: cs.CL

TL;DR: 本文提出了一个用于中文民事案件法律关系抽取的综合模式，并构建了专家标注的基准数据集LexRel，评估了大语言模型在该任务上的表现，发现现有模型存在显著局限性，同时证明引入法律关系信息有助于提升其他下游法律AI任务的性能。

Details

Motivation: 由于缺乏全面的模式，中文民事案件中的法律关系在法律人工智能领域尚未得到充分探索，因此需要构建相应的框架和数据集以推动研究发展。 Method: 提出包含层次分类体系和论据定义的综合模式，构建LexRel基准数据集，并基于该数据集对最先进的大语言模型进行法律关系抽取任务的评估，同时探究其在下游任务中的应用效果。 Result: 当前的大语言模型在准确识别民事法律关系方面表现出明显不足，而引入法律关系信息能够持续提升其他法律AI下游任务的性能。 Conclusion: 建立系统的法律关系模式和高质量数据集对推动中文法律人工智能的发展至关重要，未来需针对性优化模型以更好处理法律关系抽取任务。 Abstract: Legal relations form a highly consequential analytical framework of civil law system, serving as a crucial foundation for resolving disputes and realizing values of the rule of law in judicial practice. However, legal relations in Chinese civil cases remain underexplored in the field of legal artificial intelligence (legal AI), largely due to the absence of comprehensive schemas. In this work, we firstly introduce a comprehensive schema, which contains a hierarchical taxonomy and definitions of arguments, for AI systems to capture legal relations in Chinese civil cases. Based on this schema, we then formulate legal relation extraction task and present LexRel, an expert-annotated benchmark for legal relation extraction in Chinese civil law. We use LexRel to evaluate state-of-the-art large language models (LLMs) on legal relation extractions, showing that current LLMs exhibit significant limitations in accurately identifying civil legal relations. Furthermore, we demonstrate that incorporating legal relations information leads to consistent performance gains on other downstream legal AI tasks.

[27] Modeling Authorial Style in Urdu Novels Using Character Interaction Graphs and Graph Neural Networks

Hassan Mujtaba,Hamza Naveed,Hanzlah Munir

Main category: cs.CL

TL;DR: 提出一种基于图的框架，通过将乌尔都语小说建模为角色互动网络，利用图表示学习来识别作者风格，在52部乌尔都语小说上的实验表明，学习到的图表示显著优于手工设计和无监督基线方法。

Details

Motivation: 传统作者分析主要依赖文本的词汇和风格特征，而对低资源语言（如乌尔都语）的高层叙事结构探索不足，因此需要探索仅从叙事结构推断作者风格的可能性。 Method: 将每部小说表示为图，节点为角色，边为角色在叙事中的共现关系，并比较多种图表示方法，包括全局结构特征、节点级语义摘要、无监督图嵌入和有监督图神经网络。 Result: 在包含52部乌尔都语小说的数据集上，学习到的图表示在严格的作者感知评估协议下最高达到0.857的准确率，显著优于基线方法。 Conclusion: 叙事结构本身可以有效反映作者风格，基于图神经网络的表示学习是分析低资源语言文学作品作者风格的有效途径。 Abstract: Authorship analysis has traditionally focused on lexical and stylistic cues within text, while higher-level narrative structure remains underexplored, particularly for low-resource languages such as Urdu. This work proposes a graph-based framework that models Urdu novels as character interaction networks to examine whether authorial style can be inferred from narrative structure alone. Each novel is represented as a graph where nodes correspond to characters and edges denote their co-occurrence within narrative proximity. We systematically compare multiple graph representations, including global structural features, node-level semantic summaries, unsupervised graph embeddings, and supervised graph neural networks. Experiments on a dataset of 52 Urdu novels written by seven authors show that learned graph representations substantially outperform hand-crafted and unsupervised baselines, achieving up to 0.857 accuracy under a strict author-aware evaluation protocol.

[28] Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

Amirhossein Yousefiramandi,Ciaran Cooney

Main category: cs.CL

TL;DR: 本文研究了在资源受限情况下，如何高效微调仅解码器的大语言模型（LLM）用于下游文本分类任务，比较了基于嵌入表示和指令微调两种方法，发现前者在F1分数上显著优于后者，并可媲美甚至超越领域特定模型（如BERT）。

Details

Motivation: 在计算资源有限的情况下，如何有效利用大规模语言模型进行文本分类是一个重要问题。现有的微调方法可能计算开销大或性能不足，因此需要探索高效的微调策略。 Method: 采用两种方法：一是在因果LLM上附加分类头并使用最终token的嵌入作为序列表示进行微调；二是以prompt->response形式对LLM进行指令微调。结合4比特量化与低秩适应（LoRA）实现单GPU上最多8B参数模型的微调。 Result: 在专有单标签数据集和公开WIPO-Alpha专利数据集（极端多标签分类）上的实验表明，基于嵌入的方法在F1分数上显著优于指令微调方法，并且性能与微调后的领域专用模型（如BERT）相当甚至更优。 Conclusion: 直接利用因果LLM的内部表示，结合高效的微调技术，可以在资源受限条件下实现优异的文本分类性能。研究还讨论了两种方法的优势，并提出了分类任务中优化LLM微调的实用指南和未来方向。 Abstract: We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM's final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt->response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets - a proprietary single-label dataset and the public WIPO-Alpha patent dataset (extreme multi-label classification) - show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score, and is very competitive with - even surpassing - fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields impressive classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.

[29] CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning

Xuanzhang Liu,Jianglun Feng,Zhuoran Zhuang,Junzhe Zhao,Maofei Que,Jieting Li,Dianlei Wang,Hao Tong,Ye Chen,Pan Li

Main category: cs.CL

TL;DR: 本文提出了一种名为CoDA的上下文解耦分层代理框架，通过将高层规划与低层执行分离，有效缓解了大语言模型在复杂多步任务中因上下文爆炸导致的性能下降问题。

Details

Motivation: 由于大语言模型代理在处理复杂多步任务时，长文本输出积累导致上下文窗口过载，引发推理失败，即“上下文爆炸”问题，限制了其性能发挥。 Method: 提出CoDA框架，利用单一共享的大语言模型主干，在两个隔离的上下文中分别担任高层Planner（负责任务分解）和低层Executor（负责工具交互）；并通过PEC0（Planner-Executor Co-Optimization）强化学习方法对整个系统进行端到端联合优化。 Result: 实验表明，CoDA在复杂的多跳问答基准上显著优于现有最先进基线方法，并在长上下文场景中表现出强鲁棒性，而其他基线方法性能严重下降。 Conclusion: CoDA的分层设计能有效缓解上下文过载问题，通过高、低层次角色的协同优化，实现了更稳定和高效的复杂任务求解能力。 Abstract: Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi-step tasks. However, their performance is often crippled by "Context Explosion", where the accumulation of long text outputs overwhelms the model's context window and leads to reasoning failures. To address this, we introduce CoDA, a Context-Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high-level planning from low-level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high-level Planner that decomposes tasks within a concise strategic context, and a low-level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end-to-end using PECO (Planner-Executor Co-Optimization), a reinforcement learning methodology that applies a trajectory-level reward to jointly optimize both roles, fostering seamless collaboration through context-dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state-of-the-art baselines on complex multi-hop question-answering benchmarks, and it exhibits strong robustness in long-context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.

[30] NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Jingzhe Ding,Shengda Long,Changxin Pu,Huan Zhou,Hongwan Gao,Xiang Gao,Chao He,Yue Hou,Fei Hu,Zhaojian Li,Weiran Shi,Zaiyuan Wang,Daoguang Zan,Chenchen Zhang,Xiaoxu Zhang,Qizhi Chen,Xianfu Cheng,Bo Deng,Qingshui Gu,Kai Hua,Juntao Lin,Pai Liu,Mingchen Li,Xuanguang Pan,Zifan Peng,Yujia Qin,Yong Shan,Zhewen Tan,Weihao Xie,Zihan Wang,Yishuo Yuan,Jiayu Zhang,Enduo Zhao,Yunfei Zhao,He Zhu,Chenyang Zou,Ming Ding,Jianpeng Jiao,Jiaheng Liu,Minghao Liu,Qian Liu,Chongyao Tao,Jian Yang,Tong Yang,Zhaoxiang Zhang,Xinjie Chen,Wenhao Huang,Ge Zhang

Main category: cs.CL

TL;DR: NL2Repo Bench 是一个用于评估编码智能体在长周期仓库生成能力的新基准，揭示了当前模型在构建完整软件系统时面临的挑战。

Details

Motivation: 现有基准无法充分评估编码智能体在构建完整软件系统所需的长周期推理与执行能力。 Method: 提出 NL2Repo Bench 基准，仅提供自然语言需求文档和空工作区，要求智能体自主设计架构、管理依赖、实现多模块逻辑并生成可安装的 Python 库。 Result: 实验显示最先进的智能体平均测试通过率低于 40%，极少能正确完成整个仓库，暴露出过早终止、全局连贯性丢失、跨文件依赖脆弱等长周期失败模式。 Conclusion: NL2Repo Bench 为衡量持续代理能力提供了严格、可验证的测试平台，并表明长周期推理是下一代自主编码智能体的核心瓶颈。 Abstract: Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.

[31] Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining

Thales Sales Almeida,Rodrigo Nogueira,Hélio Pedrini

Main category: cs.CL

TL;DR: 本文研究了在特定语言（葡萄牙语）背景下，通过继续预训练来扩展语言模型能力的方法，提出并比较了两个基于LLaMA-2的70亿参数模型：使用1000亿token的Curió 7B和仅使用100亿高质量教育与STEM领域token的Curió-Edu 7B。结果显示，尽管数据量更小，Curió-Edu 7B在评估中表现更优，表明数据质量在语言适应中起关键作用。

Details

Motivation: 探索在有限计算资源下，继续预训练中数据数量与质量对语言模型适应特定语言或领域的影响，特别是针对葡萄牙语这一低资源语言的高效适配方法。 Method: 基于LLaMA-2构建两个70亿参数模型：Curió 7B在1000亿葡萄牙语token上继续预训练；Curió-Edu 7B仅在100亿经过教育与STEM过滤的高质量token上训练。通过对比两者在下游任务上的表现评估数据质量的影响。 Result: Curió-Edu 7B虽然仅使用10%的数据和20%的计算资源，但在各项评估中均优于使用全量数据训练的Curió 7B，证明高质量数据在语言适应中的决定性作用。 Conclusion: 在继续预训练中，数据质量比数据数量更为重要，尤其是在目标语言先前暴露有限的情况下，精心筛选的高质量子集可显著提升模型性能，同时大幅降低训练成本。 Abstract: Continued pretraining extends a language model's capabilities by further exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curió 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus - the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curió-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens. Despite using only 10% of the data and 20% of the computation, Curió-Edu 7B surpasses the full-corpus model in our evaluations, demonstrating that data selection can be fundamental even when adapting models with limited prior exposure to the target language. The developed models are available at https://huggingface.co/collections/ClassiCC-Corpus/curio-edu

[32] Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions

Pedro Henrique Luz de Araujo,Michael A. Hedderich,Ali Modarressi,Hinrich Schuetze,Benjamin Roth

Main category: cs.CL

TL;DR: 本文提出了一种新的评估协议，通过超过100轮的长程对话和评估数据集，系统衡量大语言模型在长时间交互中的角色保真度、指令遵循和安全性表现，发现角色保真度随对话延长而下降，尤其在目标导向对话中与指令遵循存在权衡。

Details

Motivation: 现有对赋予角色的大语言模型（LLMs）的评估多局限于短时单轮对话，无法反映真实应用场景下的长期交互表现，因此需要一种能有效评估长上下文影响的协议。 Method: 设计了一种结合长程角色对话（超过100轮）与标准化评估数据集的评估协议，构建基于对话条件的基准测试，并对七种先进的开源与闭源大语言模型在角色保真度、指令遵循和安全性方面进行测评。 Result: 发现随着对话进行，角色保真度显著下降，尤其在需同时维持角色一致性和完成任务的场景中；存在角色保真与指令遵循之间的权衡：非角色基线模型初始表现更优，而随着对话推进，角色模型输出逐渐趋近于基线。 Conclusion: 角色分配在长期对话中具有脆弱性，当前模型难以持续保持角色一致性与功能性的平衡，本文提出的评估协议有助于系统识别和测量此类问题。 Abstract: Persona-assigned large language models (LLMs) are used in domains such as education, healthcare, and sociodemographic simulation. Yet, they are typically evaluated only in short, single-round settings that do not reflect real-world usage. We introduce an evaluation protocol that combines long persona dialogues (over 100 rounds) and evaluation datasets to create dialogue-conditioned benchmarks that can robustly measure long-context effects. We then investigate the effects of dialogue length on persona fidelity, instruction-following, and safety of seven state-of-the-art open- and closed-weight LLMs. We find that persona fidelity degrades over the course of dialogues, especially in goal-oriented conversations, where models must sustain both persona fidelity and instruction following. We identify a trade-off between fidelity and instruction following, with non-persona baselines initially outperforming persona-assigned models; as dialogues progress and fidelity fades, persona responses become increasingly similar to baseline responses. Our findings highlight the fragility of persona applications in extended interactions and our work provides a protocol to systematically measure such failures.

[33] State over Tokens: Characterizing the Role of Reasoning Tokens

Mosh Levy,Zohar Elyoseph,Shauli Ravfogel,Yoav Goldberg

Main category: cs.CL

TL;DR: 本文提出了“State over Tokens (SoT)”框架，将大语言模型中的推理token视为外部化的计算状态而非人类可读的思维过程，揭示了其作为无状态生成过程中唯一持久信息载体的作用。

Details

Motivation: 尽管大语言模型生成的推理token看似人类思维过程，但实证表明它们并不能真实反映模型的实际推理机制，因此需要一个新的框架来准确理解这些token的功能。 Method: 提出SoT概念框架，将推理token重新定义为跨生成周期的外部化计算状态，并从信息持续性的角度分析其在模型推理中的作用。 Result: SoT框架解释了为何这些token能推动正确推理却无法作为忠实的文字解释，并揭示了以往被忽视的研究问题。 Conclusion: 要真正理解大语言模型的推理过程，研究应超越将推理token当作文本阅读的传统方式，转而专注于将其解码为计算状态进行分析。 Abstract: Large Language Models (LLMs) can generate reasoning tokens before their final answer to boost performance on complex tasks. While these sequences seem like human thought processes, empirical evidence reveals that they are not a faithful explanation of the model's actual reasoning process. To address this gap between appearance and function, we introduce the State over Tokens (SoT) conceptual framework. SoT reframes reasoning tokens not as a linguistic narrative, but as an externalized computational state -- the sole persistent information carrier across the model's stateless generation cycles. This explains how the tokens can drive correct reasoning without being a faithful explanation when read as text and surfaces previously overlooked research questions on these tokens. We argue that to truly understand the process that LLMs do, research must move beyond reading the reasoning tokens as text and focus on decoding them as state.

[34] Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA

Hanyu Cai,Binqi Shen,Lier Jin,Lan Hu,Xiaojing Fan

Main category: cs.CL

TL;DR: 本研究提出一个系统性评估框架，探讨语言语调（友好、中性、粗鲁）对三种主流大语言模型（GPT-4o mini、Gemini 2.0 Flash、Llama 4 Scout）性能的影响。基于MMMLU基准，在STEM和人文学科六大任务中进行测试，发现语调敏感性具有模型依赖性和领域特异性：粗鲁语调会降低GPT和Llama在部分人文学科任务中的准确率，而Gemini表现更稳健；整体跨任务分析则显示语调影响减弱。结果表明现代LLM对语调变化总体鲁棒，提示实际应用中可适度简化语调设计。

Details

Motivation: 尽管提示工程对大语言模型性能至关重要，但语用因素（如语气和礼貌）的影响尚不明确，尤其在不同模型家族间的差异缺乏系统研究。现有研究结果不一致，可能与数据集规模和覆盖范围有关，因此需要更系统的评估框架来厘清语调效应的真实影响。 Method: 构建包含友好、中性和粗鲁三类语调的提示变体，在MMMLU基准的六个任务（涵盖STEM与人文学科）上测试GPT-4o mini、Gemini 2.0 Flash和Llama 4 Scout三种模型的表现，并通过统计显著性检验分析不同条件下的准确率差异。 Result: 1. 语调敏感性具有模型依赖性和领域特异性：GPT和Llama在部分人文学科任务中对粗鲁语调表现出显著性能下降，而Gemini相对不敏感；2. 友好或中性语调通常优于粗鲁语调；3. 跨任务聚合后，语调效应大多不再显著；4. 数据集规模与覆盖范围影响语调效应的可检测性。 Conclusion: 现代大语言模型在多数混合领域应用场景下对提示语调变化总体保持鲁棒性，语调仅在特定解释性任务中产生显著影响。该结果为实际部署中的提示设计和模型选择提供了实用指导，表明无需过度优化语调即可获得稳定性能。 Abstract: Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier researches, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.

[35] Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects

Chris Latimer,Nicoló Boschi,Andrew Neeser,Chris Bartholomew,Gaurav Srivastava,Xuan Wang,Naren Ramakrishnan

Main category: cs.CL

TL;DR: Hindsight是一种新型的代理记忆架构，通过将记忆组织为四个逻辑网络（世界事实、代理经验、实体摘要和动态信念），支持保留、回忆和反思三项核心操作，显著提升了长时程对话记忆任务的性能。

Details

Motivation: 现有的代理记忆系统将记忆作为外部层处理，难以区分证据与推理，信息组织能力有限，且缺乏对推理过程的可追溯支持。Hindsight旨在构建一个结构化的记忆框架，使记忆成为推理的一等公民，从而提升代理在长周期、多会话场景下的表现。 Method: Hindsight设计了四个逻辑网络来结构化存储不同类型的信息，并引入时间性、实体感知的记忆层将对话流转化为可查询的记忆库；同时，通过反思层对记忆内容进行推理，实现答案生成与信息更新。系统支持retain（保留）、recall（回忆）和reflect（反思）三种核心操作。 Result: 在LongMemEval和LoCoMo等长时程对话记忆基准上，Hindsight使用20B开源模型将整体准确率从39%提升至83.6%，超过同骨干的全上下文基线及GPT-4o。进一步扩大骨干模型后，在LongMemEval上达到91.4%，LoCoMo上达89.61%，优于此前最强的开源系统（75.78%）。 Conclusion: Hindsight通过结构化记忆设计，有效分离证据与推理，增强长期信息组织与可解释性，显著提升代理在复杂、跨会话任务中的性能，代表了代理记忆向可追踪、可推理范式的演进。 Abstract: Agent memory has been touted as a dimension of growth for LLM-based applications, enabling agents that can accumulate experience, adapt across sessions, and move beyond single-shot question answering. The current generation of agent memory systems treats memory as an external layer that extracts salient snippets from conversations, stores them in vector or graph-based stores, and retrieves top-k items into the prompt of an otherwise stateless model. While these systems improve personalization and context carry-over, they still blur the line between evidence and inference, struggle to organize information over long horizons, and offer limited support for agents that must explain their reasoning. We present Hindsight, a memory architecture that treats agent memory as a structured, first-class substrate for reasoning by organizing it into four logical networks that distinguish world facts, agent experiences, synthesized entity summaries, and evolving beliefs. This framework supports three core operations -- retain, recall, and reflect -- that govern how information is added, accessed, and updated. Under this abstraction, a temporal, entity aware memory layer incrementally turns conversational streams into a structured, queryable memory bank, while a reflection layer reasons over this bank to produce answers and to update information in a traceable way. On key long-horizon conversational memory benchmarks like LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% to 83.6% over a full-context baseline with the same backbone and outperforms full context GPT-4o. Scaling the backbone further pushes Hindsight to 91.4% on LongMemEval and up to 89.61% on LoCoMo (vs. 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.

[36] What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation

Dingyi Yang,Qin Jin

Main category: cs.CL

TL;DR: 本文提出了首个大规模长篇故事自动评估基准LongStoryEval，包含600本新出版书籍，并引入基于摘要的高效评估模型NovelCritique，在对齐人类评价方面优于GPT-4o等商业模型。

Details

Motivation: 长篇故事（>10万token）的自动评估具有挑战性，现有方法缺乏系统性评估标准和真实数据支持，难以反映读者真正关注的评价维度。 Method: 构建了包含600本书的LongStoryEval基准，每本书附带评分和按评价维度组织的读者评论；提出8个顶层评价标准，并比较三种评估方法：基于聚合、增量更新和基于摘要的方法；基于有效方法设计NovelCritique模型进行自动评分。 Result: 发现基于聚合和基于摘要的方法表现更优，前者在细节评估上更强，后者效率更高；NovelCritique模型在多个指定维度上的评分比GPT-4o更贴近人类评价。 Conclusion: 建立了首个面向长篇故事的系统化评估基准与框架，验证了高效且准确的自动评估可行性，为未来长文本生成评估提供了重要参考。 Abstract: In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). Our study focuses on two key questions: (1) understanding which evaluation aspects matter most to readers, and (2) exploring effective methods for evaluating lengthy stories. We introduce the first large-scale benchmark, LongStoryEval, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an evaluation criteria structure and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: aggregation-based, incremental-updated, and summary-based evaluations. Our findings reveal that aggregation- and summary-based evaluations perform better, with the former excelling in detail assessment and the latter offering greater efficiency. Building on these insights, we further propose NovelCritique, an 8B model that leverages the efficient summary-based framework to review and score stories across specified aspects. NovelCritique outperforms commercial models like GPT-4o in aligning with human evaluations. Our datasets and codes are available at https://github.com/DingyiYang/LongStoryEval.

[37] Counting Clues: A Lightweight Probabilistic Baseline Can Match an LLM

Furong Jia,Yuan Pu,Finn Guo,Monica Agrawal

Main category: cs.CL

TL;DR: 本文研究了大语言模型在临床诊断选择题上的表现，并提出了一个基于频率的朴素贝叶斯方法（FBPR），发现该轻量级方法在使用相同预训练语料库时可达到与LLM相当的性能，表明传统概率方法仍具有重要价值。

Details

Motivation: 探究大语言模型在临床诊断任务中的优异表现是否源于其概率推理能力，还是其他机制。 Method: 提出频率基础的概率排序器（FBPR），利用来自OLMo和Llama预训练语料库的概念-诊断共现统计，采用平滑的朴素贝叶斯对选项进行评分。 Result: FBPR在MedQA任务上与相应LLM性能相当，且两者答对的问题差异较大，重叠仅略高于随机水平，显示二者优势互补。 Conclusion: LLM的表现可能并非主要来自简单的频率聚合，但基于频率的传统专家系统方法仍能解释基准测试中大量性能，凸显显式概率基线的重要性及混合方法的潜力。 Abstract: Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this through questions from MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores options with a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics from a large corpus. When co-occurrence statistics were sourced from the pretraining corpora for OLMo and Llama, FBPR achieves comparable performance to the corresponding LLMs pretrained on that same corpus. Direct LLM inference and FBPR largely get different questions correct, with an overlap only slightly above random chance, indicating complementary strengths of each method. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While the performance of LLMs seems to be driven by a mechanism other than simple frequency aggregation, we show that an approach similar to the historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.

[38] Building from Scratch: A Multi-Agent Framework with Human-in-the-Loop for Multilingual Legal Terminology Mapping

Lingyi Meng,Maolin Liu,Hao Wang,Yilan Cheng,Qi Yang,Idlkaid Mohanmmed

Main category: cs.CL

TL;DR: 提出一种人机协同的多智能体框架，用于构建中日英三语法律术语数据库，通过人类专家与AI智能体协作提升术语映射的准确性与一致性。

Details

Motivation: 中日语言间存在大量同形异义词，现有资源和工具难以准确进行法律术语跨语言映射，且缺乏标准化支持。 Method: 基于多智能体框架，结合大语言模型与法律领域专家，分阶段完成文档预处理、篇章对齐、术语抽取、映射及质量保证；AI负责OCR、文本分割、语义对齐等重复性任务，人类专家负责审核与监督。 Result: 在包含35部中国重要法律及其英日译本的平行语料库上验证，该方法显著提高了术语映射的精度和一致性，并具备优于传统人工方法的可扩展性。 Conclusion: 人机协同的多智能体模式能有效提升多语言法律术语构建的质量与效率，为跨语言法律信息处理提供了可行的新路径。 Abstract: Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human-AI collaborative approach for building a multilingual legal terminology database, based on a multi-agent framework. This approach integrates advanced large language models and legal domain experts throughout the entire process-from raw document preprocessing, article-level alignment, to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi-agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, review, and supervise the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human-in-the-loop, multi-agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods.

[39] QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Weizhou Shen,Ziyi Yang,Chenliang Li,Zhiyuan Lu,Miao Peng,Huashan Sun,Yingcheng Shi,Shengyi Liao,Shaopeng Lai,Bo Zhang,Dayiheng Liu,Fei Huang,Jingren Zhou,Ming Yan

Main category: cs.CL

TL;DR: QwenLong-L1.5通过系统性后训练创新实现了卓越的长上下文推理能力，提出数据合成、稳定强化学习和记忆增强架构三大关键技术，在长上下文任务中表现接近GPT-5和Gemini-2.5-Pro，并在超长序列任务中显著超越基线模型。

Details

Motivation: 现有模型在长上下文推理方面存在局限，难以处理需要多跳推理和全局证据分布的任务，且长上下文强化学习不稳定，传统上下文窗口无法应对超长序列。 Method: 1) 构建长上下文数据合成流水线，将文档分解为原子事实并生成可验证的复杂推理问题；2) 提出任务平衡采样与自适应熵控制策略优化（AEPO）以稳定长上下文强化学习；3) 设计记忆增强架构，结合单次推理与迭代记忆处理，支持超过4M token的超长任务。 Result: 基于Qwen3-30B-A3B-Thinking，QwenLong-L1.5在长上下文推理基准上平均提升9.90分，接近GPT-5和Gemini-2.5-Pro水平；在1M~4M token任务中，其记忆代理框架比基线高9.48分，并在科学推理、工具使用和长对话等泛化任务中表现更优。 Conclusion: QwenLong-L1.5通过创新的数据合成、稳定训练方法和记忆扩展架构，显著提升了长上下文推理能力，能够在超长序列任务中有效运行，并具备良好的迁移性能。 Abstract: We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool using, and extended dialogue.

[40] Authors Should Annotate

Marcus Ma,Cole Johnson,Nolan Bridges,Jackson Trager,Georgios Chochlakis,Shrikanth Narayanan

Main category: cs.CL

TL;DR: 本文提出了一种名为“作者标注”的新方法，即文档作者在创作时实时标注数据，特别适用于主观特征（如情感和信念）的标注。通过与一个拥有超过10,000名用户的商业聊天机器人合作，研究团队部署了该系统，并结合在线学习模型用于产品推荐，实现了点击率比行业广告基线高出534%的效果。与传统的三种第三方标注方式相比，作者标注在质量、速度和成本上均表现更优。研究结果支持了由作者而非第三方进行主观标注能显著提高标注质量的观点，并公开发布了学术版服务以促进科研应用。

Details

Motivation: 传统文本标注依赖第三方标注者，但在处理主观性较强的特征（如情感、信念）时，作者自身才是最权威的信息来源。因此，需要一种能够获取第一人称视角标注信息的方法，以提升标注质量和实际应用效果。 Method: 提出了“作者标注”技术，即在文档创建的同时由作者进行实时标注。研究团队与商用聊天机器人合作，构建了一个自动识别任务相关查询、动态生成标注问题并实时记录作者反馈的系统。基于此系统收集的数据，采用在线学习模型进行产品推荐，并持续优化模型性能。同时，将作者标注与三种传统标注方式在情感分析任务中进行对比，评估其质量、获取速度和成本。 Result: 作者标注系统成功部署并应用于真实场景，所训练的在线学习模型在产品推荐中实现了点击率较行业基线提升534%。在情感分析任务中，作者标注相比传统方法具有更高的标注质量、更快的获取速度和更低的成本。 Conclusion: 作者标注是一种高质量、高效且低成本的标注范式，尤其适用于主观和自我中心特征的标注任务。研究验证了其在实际应用中的优越性，并推动了该方法在科研社区的可及性，未来有望广泛应用于自然语言处理中的各类主观性建模任务。 Abstract: The status quo for labeling text is third-party annotation, but there are many cases where information directly from the document's source would be preferable over a third-person proxy, especially for egocentric features like sentiment and belief. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 10,000 users to deploy an author labeling annotation system for subjective features related to product recommendation. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors' answers in real time. We train and deploy an online-learning model architecture for product recommendation that continuously improves from author labeling and find it achieved a 534% increase in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at academic.echollm.io.

[41] An Open and Reproducible Deep Research Agent for Long-Form Question Answering

Ikuya Yamada,Wataru Ikeda,Ko Yoshida,Mengyu Ye,Hinata Sugimoto,Masatoshi Suzuki,Hisanori Ozaki,Jun Suzuki

Main category: cs.CL

TL;DR: 本文提出了一种用于长篇问答的开源深度研究系统，结合开源大语言模型与开放网络搜索API，通过迭代检索、推理和合成来提升开放域问答性能。采用基于LLM评判反馈的偏好调优以提高回答的清晰性、洞察力和事实准确性，实验结果表明该方法在各方面均有效提升答案质量。

Details

Motivation: 为了在真实开放域环境中提升长篇问答系统的性能，需要结合外部知识检索与高质量推理能力。现有方法在推理质量和事实一致性方面仍有不足，因此需要一种能够持续优化回答质量的框架。 Method: 系统结合开源大语言模型与开放网页搜索API，实现多轮检索-推理-合成流程；引入基于LLM-as-a-judge的偏好调优机制，从清晰性、洞察力和事实性等多个维度对生成结果进行评估并优化模型输出。 Result: 实验结果显示，所提方法在清晰性、洞察力和事实性三个维度上均显著且一致地提升了答案质量，验证了其在开放域长篇问答中的有效性。 Conclusion: 本文提出的开源深度研究系统通过融合检索与基于反馈的推理优化，在长篇问答任务中实现了高质量的回答生成，为开放域复杂问答提供了可复现、可扩展的解决方案。 Abstract: We present an open deep research system for long-form question answering, selected as a winning system in the text-to-text track of the MMU-RAG competition at NeurIPS 2025. The system combines an open-source large language model (LLM) with an open web search API to perform iterative retrieval, reasoning, and synthesis in real-world open-domain settings. To enhance reasoning quality, we apply preference tuning based on LLM-as-a-judge feedback that evaluates multiple aspects, including clarity, insightfulness, and factuality. Our experimental results show that the proposed method consistently improves answer quality across all three aspects. Our source code is publicly available at https://github.com/efficient-deep-research/efficient-deep-research.

[42] LLM Rationalis? Measuring Bargaining Capabilities of AI Negotiators

Cheril Shah,Akshit Agarwal,Kanak Garg,Mourad Heddaya

Main category: cs.CL

TL;DR: 本文提出了一种基于双曲正切曲线的让步动态统一数学框架，并引入了burstiness tau和让步刚性指数（CRI）两个指标来量化谈判中的时机与刚性。通过大规模实证比较人类与四种先进大语言模型在多种谈判场景下的表现，发现大模型在谈判中存在极端锚定、缺乏情境适应性和策略多样性不足等问题，且性能不随模型升级而提升，揭示了当前大模型在对手推理和情境策略建模上的根本缺陷。

Details

Motivation: 为了理解大语言模型在双边谈判中与人类行为的差异，尤其是在复杂情境、权力不对称和上下文敏感性方面的表现，研究需要一个可量化的分析框架来揭示模型的局限性。 Method: 提出了基于双曲正切函数的让步动态建模方法，定义了burstiness tau和Concession-Rigidity Index（CRI）两个新指标；在自然语言与数值报价设置下，对人类与四个最先进的大语言模型进行了大规模实证比较，涵盖有无市场背景信息及六种控制的权力不对称情境。 Result: 研究发现大语言模型倾向于在谈判区域极端位置锚定，无法像人类一样根据对手立场或情境调整策略；表现出较低的策略多样性和偶发的欺骗行为；且模型性能并未随模型规模或能力提升而改善。人类则展现出平滑的情境适应与对手推断能力。 Conclusion: 当前的大语言模型在谈判任务中存在根本性局限，特别是在动态调整、对手建模和情境依赖策略方面；未来需开发能更好内化对手意图与上下文变化的新型模型。 Abstract: Bilateral negotiation is a complex, context-sensitive task in which human negotiators dynamically adjust anchors, pacing, and flexibility to exploit power asymmetries and informal cues. We introduce a unified mathematical framework for modeling concession dynamics based on a hyperbolic tangent curve, and propose two metrics burstiness tau and the Concession-Rigidity Index (CRI) to quantify the timing and rigidity of offer trajectories. We conduct a large-scale empirical comparison between human negotiators and four state-of-the-art large language models (LLMs) across natural-language and numeric-offers settings, with and without rich market context, as well as six controlled power-asymmetry scenarios. Our results reveal that, unlike humans who smoothly adapt to situations and infer the opponents position and strategies, LLMs systematically anchor at extremes of the possible agreement zone for negotiations and optimize for fixed points irrespective of leverage or context. Qualitative analysis further shows limited strategy diversity and occasional deceptive tactics used by LLMs. Moreover the ability of LLMs to negotiate does not improve with better models. These findings highlight fundamental limitations in current LLM negotiation capabilities and point to the need for models that better internalize opponent reasoning and context-dependent strategy.

[43] Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing

Zewen Qiang,Sendong Zhao,Haochun Wang,Bing Qin,Ting Liu

Main category: cs.CL

TL;DR: 本文研究了大语言模型在处理长文本时存在的“中间丢失”问题，发现除了位置编码外，初始显著性也是导致该问题的重要因素，并提出通过调整初始token的注意力权重来提升模型对长上下文的处理能力。

Details

Motivation: 大语言模型在处理长文本序列时表现不佳，出现“lost in the middle”现象，现有研究主要归因于位置编码偏差，但本文发现初始显著性也起关键作用。 Method: 分析注意力机制中的初始显著性影响，提出对初始token与其他token之间的注意力权重进行缩放的方法，并结合已有去偏方法进行优化。 Result: 在MDQA数据集上最大提升了3.6%，在KV-Retrieval任务上结合其他方法最大提升了3.4%。 Conclusion: 初始显著性是导致大模型“中间丢失”现象的重要因素，通过对注意力权重的调整可有效提升模型对长文本的建模能力。 Abstract: Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the ``lost in the middle'' phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research first identifies an additional factor: initial saliency. It means that in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention in the prediction of the next token. We further find that utilizing this property by scaling attention weight between the initial token and others improves the model's ability to process long contexts, achieving a maximum improvement of 3.6\% in MDQA dataset. Moreover, combining this approach with existing methods to reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4\% in KV-Retrieval tasks.

[44] Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

Chendong Sun

Main category: cs.CL

TL;DR: 本文提出了EARS方法，通过结合目标模型的预测不确定性动态调整接受阈值，有效缓解了推测解码中的随机拒绝问题，显著提升了大语言模型推理效率。

Details

Motivation: 推测解码中使用的固定随机阈值在高不确定性生成场景下会导致合理候选token被频繁错误拒绝，降低推理效率。 Method: 提出高效自适应拒绝采样（EARS），利用目标模型的预测不确定性（1 - max(P_target)）动态调整接受阈值，并引入与不确定性成比例的容忍项，在模型不确定时放宽标准，在确定时保持严格。 Result: 在创意写作和开放域问答任务上实验表明，EARS最多可将吞吐量提高18.12%，在GSM8K基准上准确率仅有0.84%的轻微下降。 Conclusion: EARS能有效减少推测解码中的随机拒绝，提升推理效率，且无需修改模型结构，可无缝集成到现有框架中。 Abstract: Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as $1 - \max(P_{\mathrm{target}})$. By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.

[45] AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

Jiaru Zou,Ling Yang,Yunzhe Qi,Sirui Chen,Mengting Ai,Ke Shen,Jingrui He,Mengdi Wang

Main category: cs.CL

TL;DR: AutoTool是一个新框架，通过动态工具选择增强大语言模型的推理能力，利用包含上千工具和百项任务的数据集，结合监督学习与强化学习优化推理路径，并采用KL正则化的Plackett-Luce排序提升多步工具选择的一致性，在多个领域显著优于现有方法。

Details

Motivation: 现有LLM代理在使用工具时受限于固定的工具集，缺乏对新或动态变化工具集的适应能力，限制了其灵活性和泛化性。 Method: 构建包含20万样本、1000多个工具和100多项任务的数据集，提出双阶段优化流程：第一阶段通过监督学习和强化学习稳定推理轨迹，第二阶段采用KL正则化的Plackett-Luce模型优化多步工具选择排序。 Result: 在十个基准测试中，基于Qwen3-8B和Qwen2.5-VL-7B的AutoTool在数学与科学推理上平均提升6.4%，搜索问答提升4.5%，代码生成提升7.7%，多模态理解提升6.9%，并展现出对未见工具的良好泛化能力。 Conclusion: AutoTool通过动态工具选择机制显著提升了LLM代理在复杂任务中的性能与适应性，为构建更灵活、可扩展的智能体系统提供了有效方案。 Abstract: Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.

[46] AIR: Post-training Data Selection for Reasoning via Attention Head Influence

Jinrui Liu,Jeff Wu,Xuanguang Pan,Gavin Cheung,Shuai Ma,Chongyang Tao

Main category: cs.CL

TL;DR: 提出了一种名为Attention Influence for Reasoning (AIR)的无监督、无需训练的数据选择框架，通过利用注意力机制的因果影响来提升大模型推理能力的知识蒸馏效率。

Details

Motivation: 现有数据选择方法无法捕捉推理过程中单个步骤的因果重要性，导致在后训练蒸馏中难以有效传递多步推理能力。 Method: AIR首先识别对推理关键的注意力头，构建一个禁用这些头影响的弱化参考模型，并通过比较其与原模型的损失差异计算Attention Influence Score，用于衡量每个推理步骤和样本的重要性，从而实现细粒度的加权微调与全局样本选择。 Result: 在多个推理基准上的实验表明，AIR能持续提升推理准确率，优于基于长度、熵或损失等启发式方法，并能有效识别最关键的推理步骤和样本。 Conclusion: AIR提供了一种基于机制分析、数据高效的大语言模型推理能力蒸馏方法，为知识蒸馏中的数据选择提供了新的细粒度评估视角。 Abstract: LLMs achieve remarkable multi-step reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training-free framework that leverages mechanistic insights of the retrieval head to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.

[47] Integrating Causal Reasoning into Automated Fact-Checking

Youssra Rebboud,Pasquale Lisena,Raphael Troncy

Main category: cs.CL

TL;DR: 提出一种结合事件关系抽取、语义相似度计算和基于规则推理的方法，用于检测声明与证据之间因果事件链的逻辑不一致，提升事实核查中的可解释性。

Details

Motivation: 现有自动事实核查方法缺乏专门的因果推理能力，难以识别声明中事件间的错误因果关系，限制了语义丰富性和可解释性。 Method: 结合事件关系抽取、语义相似度计算和基于规则的推理，分析声明和证据中的因果事件链，检测其逻辑不一致性。 Result: 在两个事实核查数据集上验证了该方法的有效性，建立了将细粒度因果事件关系融入事实核查的第一个基线。 Conclusion: 该方法填补了因果推理在自动化事实核查中的空白，提升了结果的可解释性，为未来研究提供了基础。 Abstract: In fact-checking applications, a common reason to reject a claim is to detect the presence of erroneous cause-effect relationships between the events at play. However, current automated fact-checking methods lack dedicated causal-based reasoning, potentially missing a valuable opportunity for semantically rich explainability. To address this gap, we propose a methodology that combines event relation extraction, semantic similarity computation, and rule-based reasoning to detect logical inconsistencies between chains of events mentioned in a claim and in an evidence. Evaluated on two fact-checking datasets, this method establishes the first baseline for integrating fine-grained causal event relationships into fact-checking and enhance explainability of verdict prediction.

[48] MiniLingua: A Small Open-Source LLM for European Languages

Anna Aksenova,Boris Zverkov,Nicola Dainese,Alexander Nikitin,Pekka Marttinen

Main category: cs.CL

TL;DR: 本文介绍了MiniLingua，一个专为13种欧洲语言训练的十亿参数开源多语言大模型，其在指令遵循任务上表现优于更大规模的EuroLLM，并具备良好的开放生成能力。

Details

Motivation: 解决大型语言模型计算成本高、隐私问题突出及以英语为中心的问题，推动小型高效模型的多语言支持与设备端应用。 Method: 从零开始训练一个十亿参数的多语言模型MiniLingua，覆盖13种欧洲语言，并进行指令微调以提升任务执行能力。 Result: 指令微调后的MiniLingua在摘要生成、分类、开闭卷问答任务上超越EuroLLM，在开放式生成任务中与最先进的模型表现相当。 Conclusion: 小型多语言模型在合理设计和训练下可实现高效、隐私友好的多语言支持，具备实际部署潜力，且作者已公开模型权重、分词器及训练代码。 Abstract: Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.

[49] FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Joona Kytöniemi,Jousia Piha,Akseli Reunamo,Fedor Vitiugin,Farrokh Mehryary,Sampo Pyysalo

Main category: cs.CL

TL;DR: 本文介绍了FIN-bench-v2，一个用于评估芬兰语大语言模型的统一基准套件，整合了多个现有基准的芬兰语版本，并进行了标准化和扩展，所有数据集均转换为HuggingFace格式并提供多种提示形式，通过预训练模型筛选出鲁棒任务，并对更大规模的指令调优模型进行评估，所有资源均已公开。

Details

Motivation: 为了更全面、一致地评估芬兰语大语言模型在多种任务上的性能，解决现有基准分散、格式不一的问题。 Method: 整合并更新现有的芬兰语基准测试，将所有数据集统一转换为HuggingFace Datasets格式，包含填空和多项选择题型，每项任务提供五种提示变体；使用2.15B参数的解码器-only模型的训练曲线来评估任务质量，筛选满足单调性、信噪比、非随机性能和模型排序一致性标准的任务。 Result: 构建了一个涵盖阅读理解、常识推理、情感分析、世界知识和对齐等多类任务的统一基准FIN-bench-v2，筛选出具有高鲁棒性的任务，并评估了多个大型指令调优模型在不同任务和提示形式下的表现。 Conclusion: FIN-bench-v2为芬兰语大语言模型提供了一个高质量、标准化的评估平台，有助于推动该语言在自然语言处理领域的研究与发展。 Abstract: We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.

[50] Detecting Emotion Drift in Mental Health Text Using Pre-Trained Transformers

Shibani Sankpal

Main category: cs.CL

TL;DR: 该研究探讨了心理健康相关文本中的情绪漂移现象，利用预训练的Transformer模型检测句子级情绪并量化情绪变化。

Details

Motivation: 情绪在单条文本中的动态变化常被传统情感分析忽略，尤其是在心理健康领域，理解情绪的演变对洞察用户心理状态具有重要意义。 Method: 使用DistilBERT和RoBERTa等预训练Transformer模型进行句子级情绪识别，并计算情绪漂移得分以衡量情绪变化。 Result: 能够有效捕捉心理健康对话中情绪的上升或缓解模式，揭示情感动态变化的特征。 Conclusion: 所提出的方法有助于深入理解心理健康相关内容中的情绪动态，可为心理状态监测和干预提供技术支持。 Abstract: This study investigates emotion drift: the change in emotional state across a single text, within mental health-related messages. While sentiment analysis typically classifies an entire message as positive, negative, or neutral, the nuanced shift of emotions over the course of a message is often overlooked. This study detects sentence-level emotions and measures emotion drift scores using pre-trained transformer models such as DistilBERT and RoBERTa. The results provide insights into patterns of emotional escalation or relief in mental health conversations. This methodology can be applied to better understand emotional dynamics in content.

[51] Large language models are not about language

Johan J. Bolhuis,Andrea Moro,Stephen Crain,Sandiway Fong

Main category: cs.CL

TL;DR: 本文认为大语言模型对语言学无用，因为它们是依赖大量数据的概率模型，而人类语言基于心智内部的计算系统，能够递归生成层次化的思维结构。

Details

Motivation: 探讨大语言模型在语言学研究中的局限性，并强调人类语言能力的内在计算本质。 Method: 通过对比大语言模型与人类语言能力的本质差异进行理论分析。 Result: 指出大语言模型无法真正理解语言的结构性和合法性，不能区分真实语言与不可能语言。 Conclusion: 大语言模型不适合作为研究人类语言的工具，因其缺乏人类语言所具有的先天计算机制。 Abstract: Large Language Models are useless for linguistics, as they are probabilistic models that require a vast amount of data to analyse externalized strings of words. In contrast, human language is underpinned by a mind-internal computational system that recursively generates hierarchical thought structures. The language system grows with minimal external input and can readily distinguish between real language and impossible languages.

[52] Scaling Laws for Code: Every Programming Language Matters

Jian Yang,Shawn Guo,Lin Jing,Wei Zhang,Aishan Liu,Chuan Hao,Zhoujun Li,Wayne Xin Zhao,Xianglong Liu,Weifeng Lv,Bryan Dai

Main category: cs.CL

TL;DR: 本文首次系统研究了多语言代码大模型预训练的扩展规律，通过超过1000次实验建立了跨多种编程语言、模型和数据规模的综合扩展定律，发现解释型语言（如Python）比编译型语言（如Rust）更能从更大模型和数据中受益，且多语言训练具有协同效应，尤其在语法相似的语言之间。提出一种比例依赖的多语言扩展定律，可优化训练资源分配，在相同计算预算下优于均匀分配策略。

Details

Motivation: 现有扩展定律忽视不同编程语言对模型性能的差异化影响，且忽略现代软件开发的多语言特性，导致性能预测不准和资源分配低效。需要系统研究多语言场景下的扩展规律以提升代码大模型训练效率与跨语言能力。 Method: 进行了1000多次实验（相当于33.6万H800小时），覆盖多种编程语言、模型规模（0.2B到14B参数）和数据规模（1T tokens），分析不同语言的扩展行为；研究多语言训练中的协同效应；评估并行配对预训练策略的效果；基于实验结果提出比例依赖的多语言扩展定律。 Result: 发现解释型语言（如Python）相比编译型语言（如Rust）在增大模型和数据时性能提升更显著；多语言训练存在协同增益，尤其在语法相似语言间；并行配对策略能有效增强跨语言能力；提出的比例依赖扩展定律可通过优先分配高价值语言、平衡高协同语言对、减少饱和语言投入，在相同计算预算下实现更优的多语言平均性能。 Conclusion: 多语言代码预训练具有显著的协同效应和差异化的扩展行为，应根据语言特性、协同关系和收益饱和度动态分配训练资源，所提出的比例依赖扩展定律为高效训练多语言代码大模型提供了理论依据和实践指导。 Abstract: Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Besides, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1000+ experiments (Equivalent to 336,000+ H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the pre-training strategy of the parallel pairing (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (Rust), achieving superior average performance across all PLs compared to uniform distribution under the same compute budget.

[53] Non-Resolution Reasoning: A Framework for Preserving Semantic Ambiguity in Language Models

Kei Saito

Main category: cs.CL

TL;DR: 本文提出了非解析推理（Non-Resolution Reasoning, NRR）框架，以解决当前语言模型中语义过早崩溃的问题，通过多向量嵌入、非坍缩注意力和上下文身份追踪技术，在推理过程中保留语义歧义，并仅在必要时进行语义解析。

Details

Motivation: 当前语言模型由于softmax竞争和贪婪解码机制，会在上下文不足时过早地锁定单一语义，导致推理脆弱和上下文错误，因此需要一种能够延迟语义决策、保持歧义的新型计算框架。 Method: 提出NRR框架，包含三个组件：(1) 多向量嵌入，为每个token保留多个可能解释；(2) 非坍缩注意力，防止层间赢家通吃现象；(3) 上下文身份追踪（CIT），为重复实体分配上下文特定的身份；并通过外部解析算子ρ显式控制语义解析时机与方式。 Result: 在合成任务上评估显示，配备CIT的模型在分布外身份切换任务中达到90.9%准确率，远高于Transformer基线的9.1%。 Conclusion: NRR提供了一种应对语义过早崩溃的原则性方案，将歧义视为可管理的表示状态而非缺陷，支持同一模型在创造性、事实性和歧义保留推理之间灵活切换，强调语义解析的时机、方式和控制权问题。 Abstract: Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $ρ$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.

[54] Advancing Bangla Machine Translation Through Informal Datasets

Ayon Roy,Risat Rahaman,Sadat Shibly,Udoy Saha Joy,Abdulla Al Kafi,Farig Yousuf Sadeque

Main category: cs.CL

TL;DR: 本研究致力于改进孟加拉语机器翻译，特别是针对非正式语言的翻译，通过构建来自社交媒体和日常对话的双语数据集，提升孟加拉语使用者的在线信息获取能力。

Details

Motivation: 由于缺乏成对的孟加拉语-英语数据和先进翻译模型，现有的孟加拉语翻译研究主要集中在正式语言上，忽视了更常用的非正式语言，导致数百万使用者难以获取重要信息。 Method: 探索当前最先进的翻译模型，并从社交媒体和日常对话等非正式来源构建新的孟加拉语-英语平行数据集，以改进非正式孟加拉语的翻译效果。 Result: 开发了一个专注于非正式语言的新型双语数据集，并对现有翻译模型进行了改进，提升了孟加拉语翻译的质量和自然度。 Conclusion: 通过关注非正式语言并增强数据集与模型，本研究推动了开放源码孟加拉语机器翻译的发展，有助于提高孟加拉语使用者在数字世界中的信息可及性。 Abstract: Bangla is the sixth most widely spoken language globally, with approximately 234 million native speakers. However, progress in open-source Bangla machine translation remains limited. Most online resources are in English and often remain untranslated into Bangla, excluding millions from accessing essential information. Existing research in Bangla translation primarily focuses on formal language, neglecting the more commonly used informal language. This is largely due to the lack of pairwise Bangla-English data and advanced translation models. If datasets and models can be enhanced to better handle natural, informal Bangla, millions of people will benefit from improved online information access. In this research, we explore current state-of-the-art models and propose improvements to Bangla translation by developing a dataset from informal sources like social media and conversational texts. This work aims to advance Bangla machine translation by focusing on informal language translation and improving accessibility for Bangla speakers in the digital world.

[55] SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping

Yu-Chen Lu,Sheng-Feng Yu,Hui-Hsien Weng,Pei-Shuo Wang,Yu-Fang Hu,Liang Hung-Chun,Hung-Yueh Chiang,Kai-Chiang Wu

Main category: cs.CL

TL;DR: 本文提出了一种名为SkipCat的新型低秩压缩框架，通过共享投影和块跳过技术，在相同压缩率下保留更多有效秩，显著提升了大语言模型在资源受限设备上的部署效率。

Details

Motivation: 大语言模型参数量大，难以在边缘设备上部署；现有低秩压缩方法需大幅降低秩才能获得效率提升，但会导致性能严重下降。 Method: 提出SkipCat框架，包括层内共享低秩投影和块跳过技术，减少冗余并提高压缩效率，从而在相同压缩预算下保留更多有效秩。 Result: 实验表明，在无需额外微调的情况下，相同压缩率下零样本任务准确率比以往方法提升7%。 Conclusion: SkipCat通过最大化保留有效秩，有效平衡了压缩率与模型性能之间的权衡，适用于资源受限环境下的大模型压缩。 Abstract: Large language models (LLM) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLM more suitable for resource-constrained environments. Nonetheless, naïve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.

[56] PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation

Hour Kaing,Raj Dabre,Haiyue Song,Van-Hien Tran,Hideki Tanaka,Masao Utiyama

Main category: cs.CL

TL;DR: PrahokBART是一个从零训练的紧凑型序列到序列模型，专为高棉语设计，通过提升语料质量和融入语言学模块，在机器翻译、文本摘要和标题生成任务上优于mBART50。

Details

Motivation: 现有多语言模型忽视了高棉语的语言特性与语料质量问题，导致其在低资源语言上的表现受限，因此需要专门针对高棉语设计更合适的预训练模型。 Method: 从头训练一个轻量级的序列到序列模型PrahokBART，使用精心整理的高棉语和英语语料库，并引入词分割和归一化等语言学组件以改善预训练质量。 Result: 在机器翻译、文本摘要和标题生成三个生成任务上，PrahokBART均优于mBART50；分析还表明各语言学模块有效提升了模型性能，且模型能更好处理高棉语文本生成中的空格问题。 Conclusion: 通过关注低资源语言的语料质量与语言特性并引入针对性的语言学模块，可以显著提升其在生成任务上的表现，PrahokBART为高棉语NLP任务提供了有力工具。 Abstract: This work introduces {\it PrahokBART}, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.

[57] Verifying Rumors via Stance-Aware Structural Modeling

Gibson Nkhata,Uttamasha Anjally Oyshi,Quan Mai,Susan Gauch

Main category: cs.CL

TL;DR: 提出一种基于立场感知的结构建模方法，用于社交媒体谣言真实性验证，通过整合语义、立场和对话结构信息，在多个基准数据集上显著优于现有方法，并支持早期检测与跨平台泛化。

Details

Motivation: 现有模型难以同时捕捉语义内容、立场信息和对话结构，尤其受限于Transformer编码器的序列长度限制，导致谣言验证效果不佳。 Method: 设计立场感知的结构建模方法，将每个帖子与其立场信号联合编码，并按立场类别聚合回复嵌入；引入立场分布和层级深度作为协变量，增强对结构特征的建模。 Result: 在多个基准数据集上实验表明，该方法在谣言真实性预测方面显著优于先前方法，同时具备良好的早期检测能力和跨平台泛化性能。 Conclusion: 所提方法有效融合了立场与结构信息，提升了谣言验证的准确性和模型可扩展性，具有实际应用价值。 Abstract: Verifying rumors on social media is critical for mitigating the spread of false information. The stances of conversation replies often provide important cues to determine a rumor's veracity. However, existing models struggle to jointly capture semantic content, stance information, and conversation strructure, especially under the sequence length constraints of transformer-based encoders. In this work, we propose a stance-aware structural modeling that encodes each post in a discourse with its stance signal and aggregates reply embedddings by stance category enabling a scalable and semantically enriched representation of the entire thread. To enhance structural awareness, we introduce stance distribution and hierarchical depth as covariates, capturing stance imbalance and the influence of reply depth. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms prior methods in the ability to predict truthfulness of a rumor. We also demonstrate that our model is versatile for early detection and cross-platfrom generalization.

[58] Memory in the Age of AI Agents

Yuyang Hu,Shichun Liu,Yanwei Yue,Guibin Zhang,Boyang Liu,Fangyi Zhu,Jiahang Lin,Honglin Guo,Shihan Dou,Zhiheng Xi,Senjie Jin,Jiejun Tan,Yanbin Yin,Jiongnan Liu,Zeyu Zhang,Zhongxiang Sun,Yutao Zhu,Hao Sun,Boci Peng,Zhenrong Cheng,Xuanbo Fan,Jiaxin Guo,Xinlei Yu,Zhenhong Zhou,Zewen Hu,Jiahao Huo,Junhao Wang,Yuwei Niu,Yu Wang,Zhenfei Yin,Xiaobin Hu,Yue Liao,Qiankun Li,Kun Wang,Wangchunshu Zhou,Yixin Liu,Dawei Cheng,Qi Zhang,Tao Gui,Shirui Pan,Yan Zhang,Philip Torr,Zhicheng Dou,Ji-Rong Wen,Xuanjing Huang,Yu-Gang Jiang,Shuicheng Yan

Main category: cs.CL

TL;DR: 本文综述了基于基础模型的智能体记忆研究现状，提出了形式、功能和动态三个统一视角来系统分析当前的记忆系统，并建立了更精细的分类体系，涵盖记忆的形式（token-level、parametric、latent）、功能（factual、experiential、working）及演化过程。同时总结了基准测试与开源框架，展望了记忆自动化、强化学习融合、多模态与多智能体记忆等未来方向。

Details

Motivation: 现有智能体记忆研究分散，术语模糊，传统分类不足以描述多样性，亟需统一框架以提升概念清晰度和推动领域发展。 Method: 通过区分智能体记忆与其他相关概念（如LLM记忆、RAG、上下文工程），从形式、功能和动态三个维度构建统一分析框架，并对现有工作进行系统梳理与分类。 Result: 提出了三类记忆形式（token-level, parametric, latent）和三类功能（factual, experiential, working），并分析了记忆的形成、演化与检索动态；汇总了记忆评测基准与开源工具；明确了未来研究方向。 Conclusion: 智能体记忆应被视为未来自主智能设计中的一等公民，本文为该领域提供了概念基础与系统性参考。 Abstract: Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.

[59] ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Jia-Nan Li,Jian Guan,Wei Wu,Chongxuan Li

Main category: cs.CL

TL;DR: ReFusion是一种新型掩码扩散模型，通过将并行解码从token级提升到slot级（固定长度的连续子序列），结合“规划-填充”迭代过程，在保持生成质量的同时显著提升效率。

Details

Motivation: 现有的自回归模型推理速度慢，而掩码扩散模型虽支持并行但存在计算开销大、无法有效利用KV缓存及生成不连贯的问题。 Method: 提出ReFusion，采用slot级并行解码，通过扩散模型进行槽位规划，再以自回归方式并行填充各槽位；利用slot结构实现KV缓存复用，并将学习复杂度从token组合空间降至slot排列空间。 Result: 在七个基准上实验表明，ReFusion相比现有MDMs平均性能提升34%，速度加快18倍以上，同时比强自回归模型平均快2.33倍，且性能差距显著缩小。 Conclusion: ReFusion通过slot级扩散建模有效解决了MDMs的效率与生成质量难题，实现了高性能与高效率的统一，为生成模型提供了新的设计思路。 Abstract: Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.

[60] Textual Gradients are a Flawed Metaphor for Automatic Prompt Optimization

Daniel Melcer,Qi Chen,Wen-Hao Chiang,Shweta Garg,Pranav Garg,Christian Bock

Main category: cs.CL

TL;DR: 本文研究了基于文本梯度类比的自动提示优化方法，发现尽管这些方法能提升模型性能，但其背后的梯度类比并不能准确解释其行为。

Details

Motivation: 探索自动提示优化技术中“文本梯度”类比的有效性，理解其实际工作机制。 Method: 通过一系列实验和案例研究分析文本梯度方法的行为。 Result: 发现虽然这些方法常能提升性能，但梯度类比并不准确反映其真实行为。 Conclusion: 梯度类比可能不是解释此类方法的最佳框架，研究结果可指导提示优化策略的选择与新方法的开发。 Abstract: A well-engineered prompt can increase the performance of large language models; automatic prompt optimization techniques aim to increase performance without requiring human effort to tune the prompts. One leading class of prompt optimization techniques introduces the analogy of textual gradients. We investigate the behavior of these textual gradient methods through a series of experiments and case studies. While such methods often result in a performance improvement, our experiments suggest that the gradient analogy does not accurately explain their behavior. Our insights may inform the selection of prompt optimization strategies, and development of new approaches.

[61] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Boxin Wang,Chankyu Lee,Nayeon Lee,Sheng-Chieh Lin,Wenliang Dai,Yang Chen,Yangyi Chen,Zhuolin Yang,Zihan Liu,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping

Main category: cs.CL

TL;DR: 提出级联域内强化学习（Cascade RL）方法，用于构建能在指令和深度思考模式下运行的通用推理模型Nemotron-Cascade，在减少工程复杂性的同时实现多种基准上的最先进性能。

Details

Motivation: 解决使用强化学习构建通用推理模型时因跨域异构性（如推理响应长度和验证延迟差异大）带来的训练复杂性、速度慢及超参数选择困难等问题。 Method: 采用级联式的域内强化学习（Cascade RL），按顺序在不同领域上进行强化学习训练，而非混合多领域提示；先用RLHF对齐提升推理能力，再逐步进行各领域的RLVR训练。 Result: 14B模型在LiveCodeBench v5/v6/Pro上超越其SFT教师模型DeepSeek-R1-0528，并在2025年国际信息学奥林匹克竞赛（IOI）中获得银牌成绩；各阶段训练几乎不降低先前性能，甚至有所提升。 Conclusion: Cascade RL通过有序的域内训练降低了强化学习基础设施的复杂性，支持训练课程设计与超参数优化，有效提升了通用推理模型的性能，展示了RLHF作为预训练步骤的强大增益作用。 Abstract: Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.

[62] Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

Zefang Liu,Nam Nguyen,Yinzhu Quan,Austin Zhang

Main category: cs.CL

TL;DR: 本文首次对事件序列的时间标记化进行了实证研究，比较了多种编码策略在不同数据分布下的表现，发现没有一种策略普遍最优，性能取决于标记器与数据统计特性的匹配程度。

Details

Motivation: 连续时间表示是使用大语言模型建模时间事件序列中的一个关键且未充分探索的挑战，现有方法在多样化的现实世界事件数据分布下效果不明确。 Method: 通过细调大语言模型，在具有不同统计分布的真实世界数据集上评估五种时间编码策略：朴素数字字符串、高精度字节级表示、人类语义日历标记、经典均匀分箱和自适应残差标量量化。 Result: 不同编码策略的表现依赖于数据的统计特性；基于对数的策略在偏态分布上表现优异，而面向人类的格式在混合模式下更为鲁棒。 Conclusion: 最优时间标记化策略应根据数据分布特性进行选择，而非采用通用方法，强调了为特定数据类型定制时间表示的重要性。 Abstract: Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.

[63] Large-Language Memorization During the Classification of United States Supreme Court Cases

John E. Ortega,Dhruv D. Joshi,Matt P. Borkowski

Main category: cs.CL

TL;DR: 本文研究了大语言模型（LLM）在基于美国最高法院（SCOTUS）判决文本的分类任务中的表现，探讨了提示机制和记忆增强模型对分类准确率的影响。

Details

Motivation: 由于SCOTUS语料库具有句子长、法律术语复杂、结构非标准和领域专有词汇多等特点，是检验LLM记忆准确性的理想测试任务。 Method: 采用最新的LLM微调和基于检索的方法，包括参数高效微调和自动建模，在15类和279类两种SCOTUS分类任务上进行实验，并比较提示式模型（如DeepSeek）与传统BERT模型的表现。 Result: 提示式且具备记忆能力的模型（如DeepSeek）在两个分类任务上均优于之前的BERT模型，性能高出约2个百分点。 Conclusion: 提示机制结合记忆能力可提升LLM在复杂法律文本分类中的鲁棒性和准确性，显示出其在专业领域任务中的潜力。 Abstract: Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is ex pected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks scoring about 2 points better than previous models not based on prompting.

[64] Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation

Richard J. Young

Main category: cs.CL

TL;DR: 本研究评估了四种去除大语言模型拒绝行为的工具（Heretic、DECCP、ErisForge、FailSpy），在16个指令微调模型上测试其兼容性与性能影响，发现单次通过方法在保持模型能力方面表现更优，而贝叶斯优化方法效果不稳定；数学推理能力对去拒绝干预最敏感，性能变化显著依赖于工具选择和模型结构。

Details

Motivation: 大语言模型的安全对齐机制虽能防止有害回应，但也阻碍了认知建模、对抗测试等正当研究应用，因此需要有效且可控的方法去除拒绝行为以支持合法研究需求。 Method: 系统评估四种去拒绝工具（Heretic、DECCP、ErisForge、FailSpy）在16个7B-14B参数指令调优模型上的兼容性和效果，使用GSM8K等指标衡量能力保留情况，并分析KL散度以评估分布偏移。 Result: 所有工具在16个模型上均具备兼容性；在子集上，单次通过方法能力保留更好（ErisForge平均-0.28pp，DECCP-0.13pp）；贝叶斯优化方法导致不稳定的分布偏移（KL散度0.043–1.646）；数学推理能力对干预最敏感，GSM8K变化范围为+1.51至-18.81pp（相对下降26.5%）。 Conclusion: 不同去拒绝工具在模型兼容性与能力保留方面表现差异显著，工具选择应基于具体模型架构和任务需求，尤其需关注对数学推理等敏感能力的影响，本研究为研究人员提供了实证依据的工具选择标准。 Abstract: Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.

[65] A stylometric analysis of speaker attribution from speech transcripts

Cristina Aggazzotti,Elizabeth Allyn Smith

Main category: cs.CL

TL;DR: 本文提出了一种基于文体特征的说话人归属方法StyloSpeaker，通过分析转录文本的内容来识别说话人，适用于语音伪装或使用合成语音的情况。

Details

Motivation: 当语音特征不可靠时（如声音伪装或使用文本到语音技术），需要依赖语言内容进行说话人识别。 Method: 采用文体计量学方法，结合字符、词汇、标记、句子和风格特征，应用于两种格式的转录文本，并与黑箱神经网络模型对比。 Result: 在标准化转录文本上表现更好，但在强话题控制条件下整体性能最高；发现了最有效的区分说话人的文体特征。 Conclusion: StyloSpeaker是一种可解释性强且有效的说话人归属方法，尤其适用于语音特征不可靠的法医场景。 Abstract: Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify writers of written text (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation and another normalized style that removes these conventions. The transcripts' conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.

[66] Towards Effective Model Editing for LLM Personalization

Baixiang Huang,Limeng Cui,Jiapeng Liu,Haoran Wang,Jiawei Xu,Zhuiyue Tan,Yutong Chen,Chen Luo,Yi Liu,Kai Shu

Main category: cs.CL

TL;DR: 本文提出了一种名为Personalization Editing的框架，将个性化视为模型编辑任务，通过基于聚类的偏好表示进行局部修改，实现了高效、精确的个性化，同时避免了灾难性遗忘。此外，作者构建了新的评测数据集UPQA，更真实地评估模型对用户偏好的记忆与应用能力。实验表明该方法在准确率、计算效率和多轮对话等场景中优于微调和基于提示的方法。

Details

Motivation: 现有LLM个性化方法存在计算成本高、数据需求大、易遗忘以及在多轮对话或隐式查询中表现差的问题；同时，现有基准测试多依赖于角色对话，忽视了信息寻求任务中对用户偏好准确回忆的评估。 Method: 将个性化建模为模型编辑任务，提出Personalization Editing框架：利用聚类的偏好表征指导对模型的局部编辑，并引入UPQA数据集——一个基于真实用户查询的短答案问答数据集，用于评估模型对用户偏好的记忆与应用能力。 Result: Personalization Editing在编辑准确性与计算效率上优于微调方法，在多轮对话和隐式偏好问题上优于基于提示的基线方法；UPQA提供了更贴近真实用户交互的评测环境。 Conclusion: 通过将个性化转化为模型编辑问题并引入更具现实意义的评测基准，该研究为高效、精准且稳定的LLM个性化提供了一个可行的新方向。 Abstract: Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversations and implicit preference questions settings.

[67] Beyond surface form: A pipeline for semantic analysis in Alzheimer's Disease detection from spontaneous speech

Dylan Phelps,Rodrigo Wilkens,Edward Gow-Smith,Lilian Hubner,Bárbara Malcorra,César Rennó-Costa,Marco Idiart,Maria-Cruz Villa-Uriol,Aline Villavicencio

Main category: cs.CL

TL;DR: 本研究提出一种新方法，通过改变语法和词汇但保留语义来变换语言表面形式，评估语言模型在阿尔茨海默病（AD）筛查中对深层语义特征的利用能力。结果表明，仅依赖语义信息的模型仍能有效检测AD，说明语义损伤可被识别，为早期诊断提供了新途径。

Details

Motivation: 语言模型在AD筛查中表现出潜力，但其预测是否依赖真正的语义衰退标志而非表层文本模式尚不明确。因此，需要方法来区分模型是捕捉到认知衰退的本质语言特征还是仅仅依赖表面统计规律。 Method: 引入语义保持的文本变换方法，通过改写句法和词汇降低BLEU和chrF分数以改变表面形式，同时保持高语义相似性；使用变换后的文本评估分类模型性能，并测试基于描述重建图像的能力以分析信息保留程度。 Result: 变换后文本虽表面差异大，但模型分类性能保持稳定（macro-F1变化小），表明模型能利用深层语义信息进行AD检测；而图像重建任务因引入噪声导致分类准确率下降，说明其增加了干扰。 Conclusion: 语言模型能够超越表层语言模式，捕捉与认知衰退相关的深层语义损伤，验证了语义信息在AD语言标志识别中的有效性，支持了构建更可靠、去偏倚的自动筛查工具的可能性。 Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach where texts surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify the structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores, isolating the effect of semantic information, and finding models perform similarly to if they were using the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We found that image-based transformations add substantial noise reducing classification accuracy. Our methodology provides a novel way of looking at what features influence model predictions, and allows the removal of possible spurious correlations. We find that just using semantic information, language model based classifiers can still detect AD. This work shows that difficult to detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration, and opening new pathways for early detection systems.

cs.CV [Back]

[68] Explainable Adversarial-Robust Vision-Language-Action Model for Robotic Manipulation

Ju-Young Kim,Ji-Hong Park,Myeongjun Kim,Gun-Woo Kim

Main category: cs.CV

TL;DR: 提出了一种可解释的对抗鲁棒性视觉-语言-动作模型（基于OpenVLA-OFT框架），通过引入Evidence-3模块检测光度扰动并生成自然语言解释，显著提升了智能农业系统在对抗环境下动作预测的准确性和可解释性。

Details

Motivation: 现有基于RGB相机和机械臂的智能农业系统易受色调、光照和噪声等光度扰动影响，在对抗攻击下容易失效，缺乏鲁棒性和可解释性。 Method: 基于OpenVLA-OFT框架，构建融合Evidence-3模块的视觉-语言-动作模型，该模块可检测光度扰动并生成关于其成因与影响的自然语言解释，提升模型可解释性与鲁棒性。 Result: 实验表明，所提模型相比基线减少了21.7%的当前动作L1损失和18.4%的下一动作L1损失，在对抗条件下显著提高了动作预测精度。 Conclusion: 该方法有效增强了智能农业系统在复杂光照和对抗扰动下的感知与控制能力，为可解释、鲁棒的农业自动化提供了新思路。 Abstract: Smart farming has emerged as a key technology for advancing modern agriculture through automation and intelligent control. However, systems relying on RGB cameras for perception and robotic manipulators for control, common in smart farming, are vulnerable to photometric perturbations such as hue, illumination, and noise changes, which can cause malfunction under adversarial attacks. To address this issue, we propose an explainable adversarial-robust Vision-Language-Action model based on the OpenVLA-OFT framework. The model integrates an Evidence-3 module that detects photometric perturbations and generates natural language explanations of their causes and effects. Experiments show that the proposed model reduces Current Action L1 loss by 21.7% and Next Actions L1 loss by 18.4% compared to the baseline, demonstrating improved action prediction accuracy and explainability under adversarial conditions.

[69] Temporal-Anchor3DLane: Enhanced 3D Lane Detection with Multi-Task Losses and LSTM Fusion

D. Shainu Suhas,G. Rahul,K. Muni

Main category: cs.CV

TL;DR: 本文提出了Temporal-Anchor3DLane，一种增强的单目3D车道检测框架，通过改进损失函数、引入轻量级时序融合模块和训练策略，在OpenLane数据集上显著提升了性能和时序稳定性。

Details

Motivation: 由于深度模糊、遮挡以及时序不稳定等问题，单目3D车道检测仍然具有挑战性。现有的Anchor-based方法存在对回归异常值敏感、全局曲线几何监督不足、多损失项平衡困难以及时序连续性利用有限等问题。 Method: 在Anchor3DLane基础上提出三项改进：（1）多任务损失优化，包括Balanced L1回归、Chamfer点集距离、基于不确定性的损失加权，以及分类与可见性的focal和Dice损失；（2）轻量级Temporal LSTM Fusion模块，跨帧聚合锚点特征；（3）结合曲线级监督与时序一致性的ESCOP式训练优化。 Result: 在OpenLane数据集上，F1分数提升+6.2，生成更平滑的时序轨迹。 Conclusion: 小规模的架构和损失函数改进能显著增强3D车道检测的鲁棒性和时序一致性，无需额外传感器或模型扩展。 Abstract: Monocular 3D lane detection remains challenging due to depth ambiguity, occlusion, and temporal instability across frames. Anchor-based approaches such as Anchor3DLane have demonstrated strong performance by regressing continuous 3D lane curves from multi-camera surround views. However, the baseline model still exhibits (i) sensitivity to regression outliers, (ii) weak supervision of global curve geometry, (iii) difficulty in balancing multiple loss terms, and (iv) limited exploitation of temporal continuity. We propose Temporal-Anchor3DLane, an enhanced 3D lane detection framework that extends Anchor3DLane with three key contributions: (1) a set of multi-task loss improvements, including Balanced L1 regression, Chamfer point-set distance, and uncertainty-based loss weighting, together with focal and Dice components for classification and visibility; (2) a lightweight Temporal LSTM Fusion module that aggregates per-anchor features across frames, replacing a heavier Transformer-style temporal fusion; and (3) ESCOP-style training refinements that couple curve-level supervision with temporal consistency. On OpenLane, Temporal-Anchor3DLane improves F1 by +6.2 and yields smoother temporal trajectories, showing that small architectural and loss refinements significantly enhance 3D lane robustness without extra sensors or scaling.

[70] Automated Plant Disease and Pest Detection System Using Hybrid Lightweight CNN-MobileViT Models for Diagnosis of Indigenous Crops

Tekleab G. Gebremedhin,Hailom S. Asegede,Bruh W. Tesheme,Tadesse B. Gebremichael,Kalayu G. Redae

Main category: cs.CV

TL;DR: 本文提出了一种面向埃塞俄比亚提格雷地区仙人掌果作物病害的离线优先检测系统，基于3,587张田间图像构建了专用数据集，并评估了三种轻量级模型在边缘设备上的性能权衡。

Details

Motivation: 提格雷地区农业人口占比高，战后基础设施薄弱，缺乏专业的作物病害诊断支持，急需可在离线环境下运行的本地化智能诊断工具。 Method: 构建了一个针对本土仙人掌果（Opuntia ficus-indica）的病害图像数据集，包含3,587张田间图像，涵盖三类主要症状；对比评估了三种移动端高效架构：自定义轻量CNN、EfficientNet-Lite1和MobileViT-XS，在交叉验证和实际部署指标上分析其性能。 Result: EfficientNet-Lite1达到90.7%测试准确率，轻量CNN以89.5%准确率实现最优部署性能（42ms延迟，4.8MB模型大小），MobileViT-XS凭借MHSA全局推理能力取得97.3%的平均交叉验证准确率，能更可靠区分虫害聚集与二维真菌病变。 Conclusion: MobileViT-XS在准确率上表现最佳，证明全局注意力机制对复杂病征具有更强分辨力；而轻量CNN在响应速度与模型体积方面优势明显，更适合资源极度受限的边缘环境；最终模型集成于支持提格里尼亚语和阿姆哈拉语的Flutter应用中，可在Cortex-A53类设备上离线运行，提升粮食安全关键诊断的可及性与包容性。 Abstract: Agriculture supports over 80% of the population in the Tigray region of Ethiopia, where infrastructural disruptions limit access to expert crop disease diagnosis. We present an offline-first detection system centered on a newly curated indigenous cactus-fig (Opuntia ficus-indica) dataset consisting of 3,587 field images across three core symptom classes. Given deployment constraints in post-conflict edge environments, we benchmark three mobile-efficient architectures: a custom lightweight CNN, EfficientNet-Lite1, and the CNN-Transformer hybrid MobileViT-XS. While the broader system contains independent modules for potato, apple, and corn, this study isolates cactus-fig model performance to evaluate attention sensitivity and inductive bias transfer on indigenous morphology alone. Results establish a clear Pareto trade-off: EfficientNet-Lite1 achieves 90.7% test accuracy, the lightweight CNN reaches 89.5% with the most favorable deployment profile (42 ms inference latency, 4.8 MB model size), and MobileViT-XS delivers 97.3% mean cross-validation accuracy, demonstrating that MHSA-based global reasoning disambiguates pest clusters from two dimensional fungal lesions more reliably than local texture CNN kernels. The ARM compatible models are deployed in a Tigrigna and Amharic localized Flutter application supporting fully offline inference on Cortex-A53 class devices, strengthening inclusivity for food security critical diagnostics.

Jiahao Jiang,Zhangrui Yang,Xuanhan Wang,Jingkuan Song

Main category: cs.CV

TL;DR: 本文提出了一种用于全球小麦全语义分割竞赛的自训练框架，结合两阶段混合训练策略与数据增强，采用SegFormer（MiT-B4）为主干网络，并通过迭代师生模型提升性能。

Details

Motivation: 旨在提高小麦图像语义分割的精度与数据利用效率，应对竞赛中的挑战。 Method: 采用SegFormer模型配合Mix Transformer（MiT-B4）主干，结合两阶段混合训练与大量数据增强，并在迭代的教师-学生框架中进行自训练。 Result: 该方法在开发集和测试集上均取得了具有竞争力的结果，显著提升了分割精度和数据利用率。 Conclusion: 所提出的自训练框架有效提升了小麦语义分割性能，具备良好的应用潜力。 Abstract: This extended abstract details our solution for the Global Wheat Full Semantic Segmentation Competition. We developed a systematic self-training framework. This framework combines a two-stage hybrid training strategy with extensive data augmentation. Our core model is SegFormer with a Mix Transformer (MiT-B4) backbone. We employ an iterative teacher-student loop. This loop progressively refines model accuracy. It also maximizes data utilization. Our method achieved competitive performance. This was evident on both the Development and Testing Phase datasets.

[72] Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors

Ranjan Sapkota,Konstantinos I. Roumeliotis,Manoj Karkee,Nikolaos D. Tselikas

Main category: cs.CV

TL;DR: 本文比较了在零样本模式下运行的SAM3与三种经过微调的Ultralytics YOLO11变体在MinneApple数据集上的实例分割性能，发现YOLO在高IoU阈值下表现更好，而SAM3在边界稳定性方面显著优于YOLO。

Details

Motivation: 旨在对比专用微调模型与通用基础模型在密集实例分割任务中的优劣，以指导实际应用中模型选择。 Method: 在MinneApple数据集上评估SAM3（零样本）和三种YOLO11变体（nano、medium、large）的实例分割性能，使用不同IoU阈值分析F1分数和边界稳定性。 Result: 在IoU=0.15时，YOLO模型F1分数为68.9%-72.2%，SAM3为59.8%；但YOLO在不同IoU下性能下降48-50点，SAM3仅下降4点，边界稳定性是YOLO的12倍。 Conclusion: SAMv3在掩码精度和边界稳定性上优于YOLO11，而YOLO11在检测完整性上更具优势，研究为密集场景下选择合适模型提供了依据。 Abstract: Deep learning has advanced two fundamentally different paradigms for instance segmentation: specialized models optimized through task-specific fine-tuning and generalist foundation models capable of zero-shot segmentation. This work presents a comprehensive comparison between SAM3 (Segment Anything Model, also called SAMv3) operating in zero-shot mode and three variants of Ultralytics YOLO11 (nano, medium, and large) fine-tuned for instance segmentation. The evaluation is conducted on the MinneApple dataset, a dense benchmark comprising 670 orchard images with 28,179 annotated apple instances, enabling rigorous validation of model behavior under high object density and occlusion. Our analysis shows IoU choices can inflate performance gaps by up to 30%. At the appropriate IoU = 0.15 threshold, YOLO models achieve 68.9%, 72.2%, and 71.9% F1, while SAM3 reaches 59.8% in pure zero-shot mode. However, YOLO exhibits steep degradation 48-50 points across IoU ranges whereas SAM3 drops only 4 points, revealing 12 times superior boundary stability of SAM3. This highlights the strength of SAMv3 in mask precision versus specialization in detection completeness of YOLO11. We provide open-source code, evaluation pipelines, and methodological recommendations, contributing to a deeper understanding of when specialized fine-tuned models or generalist foundation models are preferable for dense instance segmentation tasks. This project repository is available on GitHub as https://github.com/Applied-AI-Research-Lab/Segment-Anything-Model-SAM3-Zero-Shot-Segmentation-Against-Fine-Tuned-YOLO-Detectors

[73] mmWEAVER: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description

Mahathir Monjur,Shahriar Nirjon

Main category: cs.CV

TL;DR: 本文提出了mmWeaver，一种基于隐式神经表示（INR）的毫米波信号生成框架，通过结合环境上下文和人体动作特征实现高效、逼真的信号合成，在压缩、速度和下游任务性能上均优于现有方法。

Details

Motivation: 毫米波雷达应用依赖多样化且环境特定的信号数据集，但真实信号复杂、稀疏且高维，物理仿真成本高昂，难以满足数据需求。 Method: 提出mmWeaver框架，使用隐式神经表示建模毫米波信号为连续函数，并引入超网络根据RGB-D图像提取的环境上下文和MotionGPT生成的姿态特征动态生成INR参数，实现语义与几何先验引导的信号合成。 Result: 在多个指标上表现优异：复数SSIM达0.88，PSNR为35dB，活动识别准确率提升最高7%，姿态估计误差降低最高15%，运行速度比仿真方法快6-35倍，同时实现最高49倍压缩。 Conclusion: mmWeaver能够高效生成高质量、环境自适应的毫米波I/Q信号，显著提升数据增强效果与下游任务性能，为毫米波雷达应用提供了可扩展且实用的解决方案。 Abstract: Realistic signal generation and dataset augmentation are essential for advancing mmWave radar applications such as activity recognition and pose estimation, which rely heavily on diverse, and environment-specific signal datasets. However, mmWave signals are inherently complex, sparse, and high-dimensional, making physical simulation computationally expensive. This paper presents mmWeaver, a novel framework that synthesizes realistic, environment-specific complex mmWave signals by modeling them as continuous functions using Implicit Neural Representations (INRs), achieving up to 49-fold compression. mmWeaver incorporates hypernetworks that dynamically generate INR parameters based on environmental context (extracted from RGB-D images) and human motion features (derived from text-to-pose generation via MotionGPT), enabling efficient and adaptive signal synthesis. By conditioning on these semantic and geometric priors, mmWeaver generates diverse I/Q signals at multiple resolutions, preserving phase information critical for downstream tasks such as point cloud estimation and activity classification. Extensive experiments show that mmWeaver achieves a complex SSIM of 0.88 and a PSNR of 35 dB, outperforming existing methods in signal realism while improving activity recognition accuracy by up to 7% and reducing human pose estimation error by up to 15%, all while operating 6-35 times faster than simulation-based approaches.

[74] Hot Hém: Sài Gòn Giũa Cái Nóng Hông Còng Bàng -- Saigon in Unequal Heat

Tessa Vu

Main category: cs.CV

TL;DR: Hot Hém是一个结合街景图像、语义分割和遥感数据的GeoAI工作流，用于估计和优化胡志明市行人的热暴露情况。

Details

Motivation: 在热带高密度城市中，行人热暴露是重要的健康风险，但传统路径算法常忽略微尺度热环境差异。 Method: 利用Google街景图像构建训练数据集，通过XGBoost模型预测地表温度，并结合OSMnx提取的步行网络实现热感知路径规划。 Result: 模型能有效预测不同城市走廊的地表温度，在街区尺度上揭示热暴露的空间差异。 Conclusion: 该方法为城市热环境精细化管理及健康导向的步行路径设计提供了可行技术框架。 Abstract: Pedestrian heat exposure is a critical health risk in dense tropical cities, yet standard routing algorithms often ignore micro-scale thermal variation. Hot Hém is a GeoAI workflow that estimates and operationalizes pedestrian heat exposure in Hô Chí Minh City (HCMC), Vi\d{e}t Nam, colloquially known as Sài Gòn. This spatial data science pipeline combines Google Street View (GSV) imagery, semantic image segmentation, and remote sensing. Two XGBoost models are trained to predict land surface temperature (LST) using a GSV training dataset in selected administrative wards, known as phŏng, and are deployed in a patchwork manner across all OSMnx-derived pedestrian network nodes to enable heat-aware routing. This is a model that, when deployed, can provide a foundation for pinpointing where and further understanding why certain city corridors may experience disproportionately higher temperatures at an infrastructural scale.

[75] Microscopic Vehicle Trajectory Datasets from UAV-collected Video for Heterogeneous, Area-Based Urban Traffic

Yawar Ali,K. Ramachandra Rao,Ashish Bhaskar,Niladri Chatterjee

Main category: cs.CV

TL;DR: 本文介绍了使用无人机（UAV）在印度国家首都地区的城市异质交通环境中采集的公开微观车辆轨迹（MVT）数据集，克服了传统路边视频因遮挡和视角受限的问题。

Details

Motivation: 传统基于路边摄像头的车辆轨迹数据采集方法在密集混合交通中常因遮挡、视角有限和车辆运动不规则而失效，亟需更可靠的数据采集方式。 Method: 利用无人机从顶部视角采集数据，并通过Data from Sky（DFS）平台提取车辆轨迹，数据包括时间戳、位置、速度、加速度和车辆类型，采样频率为每秒30帧，并在多个路段收集以覆盖不同交通密度与组成。 Result: 成功构建并公开发布了六个地点的高质量MVT数据集，经验证具有高精度；探索性分析揭示了车道保持行为、速度分布和横向移动等典型驾驶行为特征。 Conclusion: 该数据集为研究异质城市交通中的仿真建模、安全评估和驾驶行为提供了宝贵资源，有助于提升复杂交通环境下的模型准确性。 Abstract: This paper offers openly available microscopic vehicle trajectory (MVT) datasets collected using unmanned aerial vehicles (UAVs) in heterogeneous, area-based urban traffic conditions. Traditional roadside video collection often fails in dense mixed traffic due to occlusion, limited viewing angles, and irregular vehicle movements. UAV-based recording provides a top-down perspective that reduces these issues and captures rich spatial and temporal dynamics. The datasets described here were extracted using the Data from Sky (DFS) platform and validated against manual counts, space mean speeds, and probe trajectories in earlier work. Each dataset contains time-stamped vehicle positions, speeds, longitudinal and lateral accelerations, and vehicle classifications at a resolution of 30 frames per second. Data were collected at six mid-block locations in the national capital region of India, covering diverse traffic compositions and density levels. Exploratory analyses highlight key behavioural patterns, including lane-keeping preferences, speed distributions, and lateral manoeuvres typical of heterogeneous and area-based traffic settings. These datasets are intended as a resource for the global research community to support simulation modelling, safety assessment, and behavioural studies under area-based traffic conditions. By making these empirical datasets openly available, this work offers researchers a unique opportunity to develop, test, and validate models that more accurately represent complex urban traffic environments.

[76] Read or Ignore? A Unified Benchmark for Typographic-Attack Robustness and Text Recognition in Vision-Language Models

Futa Waseda,Shojiro Yamabe,Daiki Shiono,Kento Sasaki,Tsubasa Takahashi

Main category: cs.CV

TL;DR: 本文提出了一个名为Read-or-Ignore VQA（RIO-VQA）的新任务和RIO-Bench评测基准，旨在解决大视觉语言模型在面对图像中误导性文本时的脆弱性问题，同时保留对必要文本的理解能力。

Details

Motivation: 现有的视觉-语言模型评估方法和防御机制过于强调忽略文本以提升鲁棒性，但在现实场景中，模型需要同时理解物体和文本（如交通标志）。这种矛盾促使作者提出能根据上下文选择性读取或忽略文本的新范式。 Method: 作者提出了RIO-VQA任务和RIO-Bench数据集与协议，通过对同一场景构造仅改变文本内容和问题类型的反事实样本，系统评估模型在何时应读取或忽略文本的能力，并基于此开发了一种新的数据驱动型自适应防御方法。 Result: 实验表明，当前强大的LVLM及现有防御方法难以在抗干扰性和文本理解之间取得平衡；RIO-Bench能够有效揭示这一缺陷，并支持实现更具适应性的选择性文本使用机制。 Conclusion: 该工作揭示了现有评估体系与真实需求之间的根本性错位，提出了一个原则性的解决方案路径，推动构建更可靠、具备上下文感知文本处理能力的视觉语言模型。 Abstract: Large vision-language models (LVLMs) are vulnerable to typographic attacks, where misleading text within an image overrides visual understanding. Existing evaluation protocols and defenses, largely focused on object recognition, implicitly encourage ignoring text to achieve robustness; however, real-world scenarios often require joint reasoning over both objects and text (e.g., recognizing pedestrians while reading traffic signs). To address this, we introduce a novel task, Read-or-Ignore VQA (RIO-VQA), which formalizes selective text use in visual question answering (VQA): models must decide, from context, when to read text and when to ignore it. For evaluation, we present the Read-or-Ignore Benchmark (RIO-Bench), a standardized dataset and protocol that, for each real image, provides same-scene counterfactuals (read / ignore) by varying only the textual content and question type. Using RIO-Bench, we show that strong LVLMs and existing defenses fail to balance typographic robustness and text-reading capability, highlighting the need for improved approaches. Finally, RIO-Bench enables a novel data-driven defense that learns adaptive selective text use, moving beyond prior non-adaptive, text-ignoring defenses. Overall, this work reveals a fundamental misalignment between the existing evaluation scope and real-world requirements, providing a principled path toward reliable LVLMs. Our Project Page is at https://turingmotors.github.io/rio-vqa/.

[77] CLARGA: Multimodal Graph Representation Learning over Arbitrary Sets of Modalities

Santosh Patapati

Main category: cs.CV

TL;DR: CLARGA是一种通用的多模态融合架构，能够自适应地融合任意数量和类型的模态，具有高效、鲁棒和可扩展的优点，适用于多种机器学习任务。

Details

Motivation: 现有的多模态融合方法通常受限于模态数量和类型，难以处理缺失模态和复杂跨模态交互，因此需要一种更通用、灵活且高效的融合框架。 Method: CLARGA通过在样本级别构建基于注意力的加权图，并使用多头图注意力网络进行消息传递来实现模态间交互；采用可学习掩码处理缺失模态，并结合监督损失与对比InfoNCE损失进行训练，提升跨模态一致性和噪声鲁棒性。 Result: 在7个涵盖金融、人机交互、多媒体分类和情感计算等领域的数据集上，CLARGA consistently超越了基线模型、SOTA模型及消融变体，展现出卓越性能，并表现出对缺失输入的鲁棒性和在小众任务上的优越表现。 Conclusion: CLARGA是一种即插即用的多模态融合架构，具备高度适应性、高效性和实用性，可广泛应用于各种多模态表示学习任务中。 Abstract: We introduce CLARGA, a general-purpose multimodal fusion architecture for multimodal representation learning that works with any number and type of modalities without changing the underlying framework. Given a supervised dataset, CLARGA can be applied to virtually any machine learning task to fuse different multimodal representations for processing by downstream layers. On a sample-by-sample basis, CLARGA learns how modalities should inform one another by building an attention weighted graph over their features and passing messages along this graph with a multi-head Graph Attention Network. Not only does this make CLARGA highly adaptive, as it constructs unique graphs for different samples, it makes for efficient fusion with sub-quadratic complexity as the number of modalities grows. Through a learnable mask, it can also adapt to missing modality inputs. The model is trained with a hybrid objective that combines a supervised task loss with contrastive InfoNCE loss, improving cross-modal consistency and robustness to noisy inputs. We demonstrate CLARGA's effectiveness in diverse multimodal representation learning tasks across 7 datasets spanning finance, human-computer interaction, general multimedia classification, and affective computing. It consistently outperforms baselines, state-of-the-art models, and ablations. Additional experiments also demonstrate its robustness to missing inputs and ability to excel on niche tasks. Overall, CLARGA can be easily plugged into machine learning models for effective and efficient learning of representations across a wide variety of tasks.

[78] Smartphone monitoring of smiling as a behavioral proxy of well-being in everyday life

Ming-Zher Poh,Shun Liao,Marco Andreetto,Daniel McDuff,Jonathan Wang,Paolo Di Achille,Jiang Wu,Yun Liu,Lawrence Cai,Eric Teasley,Mark Malhotra,Anupam Pathak,Shwetak Patel

Main category: cs.CV

TL;DR: 该研究利用智能手机被动记录的40万多个视频片段，通过深度学习模型量化自然互动中的微笑强度，发现微笑强度的日常和昼夜模式与幸福感调查数据高度相关，且与身体活动和光照暴露显著正相关，提出了一种客观、可扩展的主观幸福感测量新方法。

Details

Motivation: 传统的主观幸福感测量依赖自我报告，易受回忆偏差和高参与负担影响，缺乏对日常生活中幸福感表达的理解，因此需要一种更客观、生态效度高的测量方式。 Method: 通过智能手机被动收集233名参与者一周内的405,448段视频，使用深度学习模型分析自然互动中的微笑强度，并考察其与国家幸福感调查、身体活动、光照及手机使用等变量的关系。 Result: 发现每日微笑强度与国家幸福感调查数据高度相关（r=0.92），昼夜节律与日重建法结果一致（r=0.80）；更高的平均微笑强度与更多身体活动（β=0.043）和更大光照暴露（β=0.038）显著相关，但与手机使用无显著关系。 Conclusion: 被动式智能手机感知技术可作为研究情感行为动态的有力工具，具有良好的生态效度，有望实现大规模人群层面的幸福感监测。 Abstract: Subjective well-being is a cornerstone of individual and societal health, yet its scientific measurement has traditionally relied on self-report methods prone to recall bias and high participant burden. This has left a gap in our understanding of well-being as it is expressed in everyday life. We hypothesized that candid smiles captured during natural smartphone interactions could serve as a scalable, objective behavioral correlate of positive affect. To test this, we analyzed 405,448 video clips passively recorded from 233 consented participants over one week. Using a deep learning model to quantify smile intensity, we identified distinct diurnal and daily patterns. Daily patterns of smile intensity across the week showed strong correlation with national survey data on happiness (r=0.92), and diurnal rhythms documented close correspondence with established results from the day reconstruction method (r=0.80). Higher daily mean smile intensity was significantly associated with more physical activity (Beta coefficient = 0.043, 95% CI [0.001, 0.085]) and greater light exposure (Beta coefficient = 0.038, [0.013, 0.063]), whereas no significant effects were found for smartphone use. These findings suggest that passive smartphone sensing could serve as a powerful, ecologically valid methodology for studying the dynamics of affective behavior and open the door to understanding this behavior at a population scale.

[79] MPath: Multimodal Pathology Report Generation from Whole Slide Images

Noorul Wahab,Nasir Rajpoot

Main category: cs.CV

TL;DR: MPath是一种轻量级多模态框架，通过学习的视觉前缀提示机制，将全切片图像（WSI）的视觉嵌入注入预训练生物医学语言模型（BioBART），实现病理报告的自动生成。

Details

Motivation: 由于组织形态变异大且病理叙述结构复杂，从全切片图像直接生成临床连贯的诊断报告具有挑战性。 Method: MPath利用基础模型提取的WSI特征（CONCH + Titan），通过紧凑的投影模块将其注入冻结的BioBART语言模型，采用视觉前缀提示机制进行多模态融合，避免端到端的联合预训练。 Result: 在RED 2025 Grand Challenge数据集上，MPath在测试阶段2中排名第四，尽管提交次数有限，仍表现出良好性能。 Conclusion: 基于提示的多模态条件生成是一种可扩展、可解释的病理报告生成策略，MPath为无需大规模训练即可实现图文生成提供了有效方案。 Abstract: Automated generation of diagnostic pathology reports directly from whole slide images (WSIs) is an emerging direction in computational pathology. Translating high-resolution tissue patterns into clinically coherent text remains difficult due to large morphological variability and the complex structure of pathology narratives. We introduce MPath, a lightweight multimodal framework that conditions a pretrained biomedical language model (BioBART) on WSI-derived visual embeddings through a learned visual-prefix prompting mechanism. Instead of end-to-end vision-language pretraining, MPath leverages foundation-model WSI features (CONCH + Titan) and injects them into BioBART via a compact projection module, keeping the language backbone frozen for stability and data efficiency. MPath was developed and evaluated on the RED 2025 Grand Challenge dataset and ranked 4th in Test Phase 2, despite limited submission opportunities. The results highlight the potential of prompt-based multimodal conditioning as a scalable and interpretable strategy for pathology report generation.

[80] FloraForge: LLM-Assisted Procedural Generation of Editable and Analysis-Ready 3D Plant Geometric Models For Agricultural Applications

Mozhgan Hadadi,Talukder Z. Jubery,Patrick S. Schnable,Arti Singh,Bedrich Benes,Adarsh Krishnamurthy,Baskar Ganapathysubramanian

Main category: cs.CV

TL;DR: 本文提出FloraForge，一个基于大语言模型（LLM）辅助的框架，使领域专家能通过自然语言指令生成生物准确且完全参数化的3D植物模型，无需深厚编程背景。

Details

Motivation: 现有3D植物建模方法在数据需求、可编辑性或用户友好性方面存在局限，难以被领域科学家广泛使用。 Method: 利用LLM协同设计机制，将自然语言转化为可迭代优化的Python脚本，生成具有植物学约束的层次化B样条曲面表示，并通过人工可读的Plant Descriptor（PD）进行参数控制。 Result: 成功为玉米、大豆和绿豆构建高精度参数化模型，支持从点云数据拟合，并输出可用于可视化的三角网格及带参数元数据的网格用于定量分析。 Conclusion: 该框架在保持数学严谨性的同时，降低了复杂几何建模的门槛，推动了植物科学中功能结构分析的普及化。 Abstract: Accurate 3D plant models are crucial for computational phenotyping and physics-based simulation; however, current approaches face significant limitations. Learning-based reconstruction methods require extensive species-specific training data and lack editability. Procedural modeling offers parametric control but demands specialized expertise in geometric modeling and an in-depth understanding of complex procedural rules, making it inaccessible to domain scientists. We present FloraForge, an LLM-assisted framework that enables domain experts to generate biologically accurate, fully parametric 3D plant models through iterative natural language Plant Refinements (PR), minimizing programming expertise. Our framework leverages LLM-enabled co-design to refine Python scripts that generate parameterized plant geometries as hierarchical B-spline surface representations with botanical constraints with explicit control points and parametric deformation functions. This representation can be easily tessellated into polygonal meshes with arbitrary precision, ensuring compatibility with functional structural plant analysis workflows such as light simulation, computational fluid dynamics, and finite element analysis. We demonstrate the framework on maize, soybean, and mung bean, fitting procedural models to empirical point cloud data through manual refinement of the Plant Descriptor (PD), human-readable files. The pipeline generates dual outputs: triangular meshes for visualization and triangular meshes with additional parametric metadata for quantitative analysis. This approach uniquely combines LLM-assisted template creation, mathematically continuous representations enabling both phenotyping and rendering, and direct parametric control through PD. The framework democratizes sophisticated geometric modeling for plant science while maintaining mathematical rigor.

[81] TransBridge: Boost 3D Object Detection by Scene-Level Completion with Transformer Decoder

Qinghao Meng,Chenming Wu,Liangjun Zhang,Jianbing Shen

Main category: cs.CV

TL;DR: 本文提出了一种联合补全与检测的框架TransBridge，用于改善自动驾驶中稀疏点云区域的3D目标检测性能，通过引入Transformer-based上采样模块和动态-静态重建模块，在不增加成本的情况下显著提升检测精度。

Details

Motivation: 在远距离区域，LiDAR点稀疏导致3D目标检测困难，现有方法难以有效处理点云稀疏问题，因此需要一种能同时提升点云密度和检测性能的方法。 Method: 提出TransBridge，一种基于Transformer的上采样模块，融合检测与补全网络的特征；设计DSRecon模块生成密集LiDAR数据作为补全网络的监督信号，并利用Transformer建模通道与空间关系，生成高分辨率特征图用于点云补全。 Result: 在nuScenes和Waymo数据集上实验表明，该框架在多种检测方法上mAP提升0.7至1.5，在两阶段检测框架中mAP提升达5.78，显示出良好的泛化能力。 Conclusion: 所提出的联合补全与检测框架能有效利用隐式补全特征增强稀疏区域的检测性能，且无需额外成本，具有较强的实用性和通用性。 Abstract: 3D object detection is essential in autonomous driving, providing vital information about moving objects and obstacles. Detecting objects in distant regions with only a few LiDAR points is still a challenge, and numerous strategies have been developed to address point cloud sparsity through densification.This paper presents a joint completion and detection framework that improves the detection feature in sparse areas while maintaining costs unchanged. Specifically, we propose TransBridge, a novel transformer-based up-sampling block that fuses the features from the detection and completion networks.The detection network can benefit from acquiring implicit completion features derived from the completion network. Additionally, we design the Dynamic-Static Reconstruction (DSRecon) module to produce dense LiDAR data for the completion network, meeting the requirement for dense point cloud ground truth.Furthermore, we employ the transformer mechanism to establish connections between channels and spatial relations, resulting in a high-resolution feature map used for completion purposes.Extensive experiments on the nuScenes and Waymo datasets demonstrate the effectiveness of the proposed framework.The results show that our framework consistently improves end-to-end 3D object detection, with the mean average precision (mAP) ranging from 0.7 to 1.5 across multiple methods, indicating its generalization ability. For the two-stage detection framework, it also boosts the mAP up to 5.78 points.

[82] MONET -- Virtual Cell Painting of Brightfield Images and Time Lapses Using Reference Consistent Diffusion

Alexander Peysakhovich,William Berman,Joseph Rufo,Felix Wong,Maxwell Z. Wilson

Main category: cs.CV

TL;DR: 提出了一种基于扩散模型的虚拟细胞染色技术（MONET），可从明场图像预测细胞形态，支持时序观察并具备跨条件迁移能力。

Details

Motivation: 传统细胞染色方法劳动密集且需化学固定，无法研究细胞动态过程，因此需要一种非侵入性、高效的替代方案。 Method: 训练一个基于大规模数据集的扩散模型（MONET），采用一致性架构从明场图像生成细胞染色图像，并实现无真实视频标签的时序预测和上下文学习以提升泛化性。 Result: 模型性能随规模提升而提高，能够生成高质量的虚拟细胞染色图像及时序视频，并在分布外的细胞系和成像协议上表现出部分迁移能力。 Conclusion: 虚拟细胞染色虽不取代传统方法，但可作为互补工具，推动生物学研究中的新工作流程。 Abstract: Cell painting is a popular technique for creating human-interpretable, high-contrast images of cell morphology. There are two major issues with cell paint: (1) it is labor-intensive and (2) it requires chemical fixation, making the study of cell dynamics impossible. We train a diffusion model (Morphological Observation Neural Enhancement Tool, or MONET) on a large dataset to predict cell paint channels from brightfield images. We show that model quality improves with scale. The model uses a consistency architecture to generate time-lapse videos, despite the impossibility of obtaining cell paint video training data. In addition, we show that this architecture enables a form of in-context learning, allowing the model to partially transfer to out-of-distribution cell lines and imaging protocols. Virtual cell painting is not intended to replace physical cell painting completely, but to act as a complementary tool enabling novel workflows in biological research.

[83] Contextual Peano Scan and Fast Image Segmentation Using Hidden and Evidential Markov Chains

Clément Fernandes,Wojciech Pieczynski

Main category: cs.CV

TL;DR: 提出了一种新的HEMC-CPS模型，结合上下文Peano扫描与隐含证据马尔可夫链，用于贝叶斯图像分割，具有高效性和扩展性。

Details

Motivation: 为了提升图像分割的准确性与建模能力，尤其是在处理复杂图像（如三维或多源多分辨率图像）时，需要结合上下文信息与更强大的概率模型。 Method: 将上下文Peano扫描（CPS）与隐藏证据马尔可夫链（HEMC）结合，构建HEMC-CPS模型，并采用随机EM算法进行无监督参数估计，使用最大后验模式（MPM）进行分割。 Result: 在合成与真实图像上验证了HEMC-CPS模型的有效性，表现出优于现有HMC和HMF方法的分割效果，且计算效率高。 Conclusion: HEMC-CPS模型在图像分割中表现优异，具备处理复杂图像的潜力，且可推广至其他空间相关数据的建模与分析。 Abstract: Transforming bi-dimensional sets of image pixels into mono-dimensional sequences with a Peano scan (PS) is an established technique enabling the use of hidden Markov chains (HMCs) for unsupervised image segmentation. Related Bayesian segmentation methods can compete with hidden Markov fields (HMFs)-based ones and are much faster. PS has recently been extended to the contextual PS, and some initial experiments have shown the value of the associated HMC model, denoted as HMC-CPS, in image segmentation. Moreover, HMCs have been extended to hidden evidential Markov chains (HEMCs), which are capable of improving HMC-based Bayesian segmentation. In this study, we introduce a new HEMC-CPS model by simultaneously considering contextual PS and evidential HMC. We show its effectiveness for Bayesian maximum posterior mode (MPM) segmentation using synthetic and real images. Segmentation is performed in an unsupervised manner, with parameters being estimated using the stochastic expectation--maximization (SEM) method. The new HEMC-CPS model presents potential for the modeling and segmentation of more complex images, such as three-dimensional or multi-sensor multi-resolution images. Finally, the HMC-CPS and HEMC-CPS models are not limited to image segmentation and could be used for any kind of spatially correlated data.

Jingmin Zhu,Anqi Zhu,James Bailey,Jun Liu,Hossein Rahmani,Mohammed Bennamoun,Farid Boussaid,Qiuhong Ke

Main category: cs.CV

TL;DR: 本文提出了DynaPURLS框架，用于零样本骨架动作识别，通过动态细粒度视觉-语义对齐提升对未见类别的泛化能力。

Details

Motivation: 现有方法依赖骨架特征与静态类别语义的粗粒度对齐，难以克服可见与未见类别间的域偏移，限制了细粒度知识迁移。 Method: 利用大语言模型生成分层文本描述，结合自适应骨架分区模块提取细粒度视觉表征，并在推理时通过轻量级可学习投影动态优化文本特征，辅以置信度感知的记忆库稳定训练。 Result: 在NTU RGB+D 60/120和PKU-MMD等三大基准上显著超越先前方法，达到最先进性能。 Conclusion: DynaPURLS通过多尺度动态对齐机制有效缓解域偏移问题，提升了零样本骨架动作识别的泛化性和鲁棒性。 Abstract: Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce \textbf{DynaPURLS}, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is made publicly available at https://github.com/Alchemist0754/DynaPURLS

[85] A Comparative Analysis of Semiconductor Wafer Map Defect Detection with Image Transformer

Sushmita Nath

Main category: cs.CV

TL;DR: 本研究探讨了在数据受限条件下使用Data-Efficient Image Transformer（DeiT）对晶圆图缺陷进行分类，结果表明DeiT在准确率、F1分数和训练速度上均优于传统CNN模型。

Details

Motivation: 由于传统卷积神经网络（CNN）在数据有限且不平衡的情况下性能下降，因此需要一种更高效的数据利用方法来提升半导体晶圆缺陷检测的准确性。 Method: 采用Data-Efficient Image Transformer（DeiT）模型对晶圆图缺陷进行分类，并与VGG-19、SqueezeNet、Xception等CNN模型进行比较。 Result: DeiT模型达到了90.83%的最高分类准确率，优于VGG-19（65%）、SqueezeNet（82%）、Xception（66%）和Hybrid（67%），同时F1-score为90.78%，训练收敛更快，并在少数缺陷类别检测中表现出更强的鲁棒性。 Conclusion: 基于Transformer的DeiT模型在半导体晶圆缺陷检测中表现优越，具有应用于预测性维护策略的巨大潜力。 Abstract: Predictive maintenance is an important sector in modern industries which improves fault detection and cost reduction processes. By using machine learning algorithms in the whole process, the defects detection process can be implemented smoothly. Semiconductor is a sensitive maintenance field that requires predictability in work. While convolutional neural networks (CNNs) such as VGG-19, Xception and Squeeze-Net have demonstrated solid performance in image classification for semiconductor wafer industry, their effectiveness often declines in scenarios with limited and imbalanced data. This study investigates the use of the Data-Efficient Image Transformer (DeiT) for classifying wafer map defects under data-constrained conditions. Experimental results reveal that the DeiT model achieves highest classification accuracy of 90.83%, outperforming CNN models such as VGG-19(65%), SqueezeNet(82%), Xception(66%) and Hybrid(67%). DeiT also demonstrated superior F1-score (90.78%) and faster training convergence, with enhanced robustness in detecting minority defect classes. These findings highlight the potential of transformer-based models like DeiT in semiconductor wafer defect detection and support predictive maintenance strategies within semiconductor fabrication processes.

[86] CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

Xianghui Xie,Bowen Wen,Yan Chang,Hesam Rabeti,Jiefeng Li,Ye Yuan,Gerard Pons-Moll,Stan Birchfield

Main category: cs.CV

TL;DR: 本文提出CARI4D，首个无需物体类别假设的单目RGB视频中4D人-物交互重建方法，通过基础模型集成与渲染比对范式实现空间、时间与像素级一致性，在已见和未见物体上均显著优于先前方法。

Details

Motivation: 从单目RGB视频中准确恢复4D人-物交互面临深度模糊、遮挡和复杂运动等挑战，现有方法依赖物体模板或限于特定类别，难以泛化。因此需要一种类别无关、可在真实场景中零样本应用的重建方法。 Method: 提出CARI4D，采用姿态假设选择算法融合基础模型的预测，通过可学习的渲染-比对范式联合优化以保证空间、时间和像素对齐，并引入基于物理接触约束的精细化推理，实现度量尺度下的一致性4D重建。 Result: 在已见数据集上比先前方法降低38%重建误差，未见数据集上降低36%，且能零样本应用于野外互联网视频，展现出强泛化能力。 Conclusion: CARI4D是首个类别无关的单目4D人-物交互重建方法，通过融合基础模型与联合优化策略，在精度和泛化性方面均取得显著提升，推动了从单目视频中理解人-物交互的发展。 Abstract: Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming ground truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporarily consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasoning about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 38% on in-distribution dataset and 36% on unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.

[87] V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Chenrui Fan,Yijun Liang,Shweta Bhardwaj,Kwesi Cobbina,Ming Li,Tianyi Zhou

Main category: cs.CV

TL;DR: 本文提出了一个名为V-REX的评估套件，用于评估视觉语言模型在复杂开放任务中的多步探索与推理能力，通过链式提问（CoQ）分解任务，分别评估模型的规划和执行能力。

Details

Motivation: 现有的视觉语言模型在处理定义明确的问题时表现良好，但在复杂的开放式任务中往往表现不佳，尤其是在需要多轮探索和推理的情况下。因此，需要一个新的评估框架来更好地衡量这些模型的能力。 Method: 提出了一种新的评估协议V-REX，将多步探索性推理转化为链式提问（Chain-of-Questions, CoQ），并分解为两个方面：规划（选择探索性问题链）和执行（按顺序回答问题以收集信息）。通过为每一步设定有限的问题和答案选项，实现对中间步骤的可靠定量分析。 Result: 通过对最先进的专有和开源视觉语言模型进行评估，揭示了模型在规划与执行能力上的显著差异，并发现了多步探索性推理仍有很大的提升空间。同时观察到一致的扩展趋势。 Conclusion: V-REX为评估视觉语言模型在复杂视觉推理任务中的表现提供了有效工具，有助于推动具备更强大探索与推理能力的AI系统的发展。 Abstract: While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, ``Visual Reasoning with multi-step EXploration (V-REX)'', which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability to (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-sourced VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.

[88] Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus

Antonio Guillen-Perez

Main category: cs.CV

TL;DR: Semantic-Drive 是一种本地优先、神经符号结合的语义数据挖掘框架，用于在自动驾驶视频日志中高效发现罕见的安全关键事件，其通过解耦感知与推理过程，在保持隐私的同时显著提升召回率和风险评估精度。

Details

Motivation: 自动驾驶车辆开发受限于稀有“长尾”安全事件数据的缺乏，现有方法在精确性、隐私或成本方面存在不足。 Method: 提出 Semantic-Drive 框架：首先使用实时开放词汇检测器（YOLOE）进行符号锚定，然后由推理型视觉语言模型（VLM）进行认知分析，并采用多模型‘Judge-Scout’共识机制减少幻觉。 Result: 在 nuScenes 数据集上测试，相比 CLIP 召回率达到 0.966（对比 0.475），风险评估误差降低 40%，且可在消费级硬件（如 RTX 3090）运行。 Conclusion: Semantic-Drive 提供了一种高效、隐私保护、低成本的长尾场景挖掘方案，优于依赖云服务或粗粒度元数据的传统方法。 Abstract: The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40\% compared to single models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.

[89] Exploring Spatial-Temporal Representation via Star Graph for mmWave Radar-based Human Activity Recognition

Senhao Gao,Junqing Zhang,Luoyu Mei,Shuai Wang,Xuyu Wang

Main category: cs.CV

TL;DR: 提出了一种基于星型图表示和离散动态图神经网络（DDGNN）的毫米波雷达点云人体活动识别方法，有效应对点云稀疏性和可变尺寸问题，实现了接近视觉系统的高性能识别。

Details

Motivation: 毫米波雷达点云具有稀疏和尺寸可变的特点，直接借用基于视觉的密集点云预处理方法并非最优，因此需要专门针对雷达特性设计更有效的时空特征提取方法。 Method: 设计了一种星型图结构来建模静态中心点与动态雷达点之间的高维相对关系，并采用离散动态图神经网络（DDGNN）学习该图上可变大小的人体运动特征。 Result: 在真实世界数据集上，该方法整体分类准确率达到94.27%，接近基于视觉骨架数据的97.25%；在树莓派4上验证了其在资源受限平台的有效性，并优于三种最新的雷达专用方法。 Conclusion: 所提出的星型图与DDGNN框架能有效挖掘毫米波雷达点云中的时空特征，无需重采样或帧聚合器即可实现高性能人体活动识别，适用于边缘设备部署。 Abstract: Human activity recognition (HAR) requires extracting accurate spatial-temporal features with human movements. A mmWave radar point cloud-based HAR system suffers from sparsity and variable-size problems due to the physical features of the mmWave signal. Existing works usually borrow the preprocessing algorithms for the vision-based systems with dense point clouds, which may not be optimal for mmWave radar systems. In this work, we proposed a graph representation with a discrete dynamic graph neural network (DDGNN) to explore the spatial-temporal representation of human movement-related features. Specifically, we designed a star graph to describe the high-dimensional relative relationship between a manually added static center point and the dynamic mmWave radar points in the same and consecutive frames. We then adopted DDGNN to learn the features residing in the star graph with variable sizes. Experimental results demonstrated that our approach outperformed other baseline methods using real-world HAR datasets. Our system achieved an overall classification accuracy of 94.27\%, which gets the near-optimal performance with a vision-based skeleton data accuracy of 97.25\%. We also conducted an inference test on Raspberry Pi~4 to demonstrate its effectiveness on resource-constraint platforms. \sh{ We provided a comprehensive ablation study for variable DDGNN structures to validate our model design. Our system also outperformed three recent radar-specific methods without requiring resampling or frame aggregators.

[90] Adaptive federated learning for ship detection across diverse satellite imagery sources

Tran-Vu La,Minh-Tan Pham,Yu Li,Patrick Matgen,Marco Chini

Main category: cs.CV

TL;DR: 本研究探讨了联邦学习（FL）在多源卫星图像船舶检测中的应用，评估了FedAvg、FedProx、FedOpt和FedMedian四种FL模型，结果表明其显著优于本地训练，且性能接近全局训练。

Details

Motivation: 由于商业卫星图像和敏感船舶标注数据的隐私限制，传统集中式训练难以实施，因此需要一种无需共享数据的隐私保护学习方法。 Method: 采用联邦学习框架，在不共享原始数据的前提下，利用多个独立卫星数据集训练YOLOv8船舶检测模型，并对比四种主流FL算法与本地训练基线的性能。 Result: 所有FL模型均显著提升检测精度，接近全局训练效果，其中模型性能受通信轮数和本地训练周期等配置影响明显。 Conclusion: 联邦学习可有效支持跨域卫星图像中的船舶检测，在保护数据隐私的同时实现高性能检测，合理配置FL参数对优化精度和效率至关重要。 Abstract: We investigate the application of Federated Learning (FL) for ship detection across diverse satellite datasets, offering a privacy-preserving solution that eliminates the need for data sharing or centralized collection. This approach is particularly advantageous for handling commercial satellite imagery or sensitive ship annotations. Four FL models including FedAvg, FedProx, FedOpt, and FedMedian, are evaluated and compared to a local training baseline, where the YOLOv8 ship detection model is independently trained on each dataset without sharing learned parameters. The results reveal that FL models substantially improve detection accuracy over training on smaller local datasets and achieve performance levels close to global training that uses all datasets during the training. Furthermore, the study underscores the importance of selecting appropriate FL configurations, such as the number of communication rounds and local training epochs, to optimize detection precision while maintaining computational efficiency.

[91] Enhancing deep learning performance on burned area delineation from SPOT-6/7 imagery for emergency management

Maria Rodriguez,Minh-Tan Pham,Martin Sudmanns,Quentin Poterek,Oscar Narvaez

Main category: cs.CV

TL;DR: 本研究提出了一种用于野火后烧毁区域快速制图的监督语义分割工作流，利用SPOT-6/7高分辨率影像，比较U-Net与SegFormer模型性能，并探讨数据增强、多任务学习与推理效率的平衡。

Details

Motivation: 现有烧毁区制图方法在应急响应时效性方面存在不足，难以满足紧急管理场景的时间约束需求。 Method: 采用U-Net和SegFormer模型进行语义分割，结合土地覆盖数据作为辅助任务，并应用测试时增强（Test-Time Augmentation）与混合精度优化等技术提升性能与效率。 Result: U-Net与SegFormer在少量训练数据下表现相近，但SegFormer资源消耗更高；引入土地覆盖辅助任务可提升模型鲁棒性而不增加推理时间；测试时增强可提升精度但增加耗时，可通过混合精度等方法缓解。 Conclusion: 所提工作流在保证精度的同时提升了烧毁区制图的推理效率，有助于支持基于高分辨率影像的快速应急响应。 Abstract: After a wildfire, delineating burned areas (BAs) is crucial for quantifying damages and supporting ecosystem recovery. Current BA mapping approaches rely on computer vision models trained on post-event remote sensing imagery, but often overlook their applicability to time-constrained emergency management scenarios. This study introduces a supervised semantic segmentation workflow aimed at boosting both the performance and efficiency of BA delineation. It targets SPOT-6/7 imagery due to its very high resolution and on-demand availability. Experiments are evaluated based on Dice score, Intersection over Union, and inference time. The results show that U-Net and SegFormer models perform similarly with limited training data. However, SegFormer requires more resources, challenging its practical use in emergencies. Incorporating land cover data as an auxiliary task enhances model robustness without increasing inference time. Lastly, Test-Time Augmentation improves BA delineation performance but raises inference time, which can be mitigated with optimization methods like Mixed Precision.

[92] CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos

Tejas Panambur,Ishan Rajendrakumar Dave,Chongjian Ge,Ersin Yumer,Xue Bai

Main category: cs.CV

TL;DR: 本文提出了CreativeVR，一种基于扩散先验的视频恢复框架，专门用于修复AI生成和真实世界视频中的严重结构和时序伪影。该方法通过一个精度调节旋钮灵活平衡保真度与感知质量，并在新提出的AIGC54基准上实现最先进的性能。

Details

Motivation: 现有的视频恢复方法主要针对模糊、下采样等合成退化，难以有效修复AI生成视频中常见的面部、手部扭曲以及时序不一致等问题；同时缺乏对感知质量与保真度之间权衡的控制能力。 Method: 提出CreativeVR，采用深度适配器架构，引入一个可调节的精度控制机制，使模型能根据输入强度在精确恢复与强结构/运动修正间平滑切换；训练中使用时间一致的退化模块，模拟真实的结构性故障。 Result: 在新构建的AIGC54基准上达到最先进水平，同时在标准视频恢复任务中表现良好，处理720p视频可达约13 FPS（单块80-GB A100），具备实用吞吐能力。 Conclusion: CreativeVR能有效修复AI生成和低质量真实视频中的严重结构与时序伪影，在多种退化类型上实现了良好的平衡，兼具高性能与实际部署可行性。 Abstract: Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity. We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures. To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (about 13 FPS at 720p on a single 80-GB A100). Project page: https://daveishan.github.io/creativevr-webpage/.

[93] BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models

Ryan Po,Eric Ryan Chan,Changan Chen,Gordon Wetzstein

Main category: cs.CV

TL;DR: 本文提出了Backwards Aggregation (BAgger)，一种自监督方法，用于改善自回归视频模型在长期预测中的错误累积和质量漂移问题，通过利用模型自身的 rollout 构建纠正轨迹，实现更稳定的长时序生成。

Details

Motivation: 自回归视频模型在训练和推理之间存在暴露偏差，导致生成帧的质量随时间下降，影响长期世界建模的稳定性与一致性。 Method: 提出Backwards Aggregation (BAgger) 方法，利用模型自身生成的 rollout 构造纠正性反馈路径，在标准得分匹配或流匹配目标下进行训练，无需依赖大型教师模型或长时间链式反向传播。 Result: 在文本到视频、视频扩展和多提示生成任务上验证了BAgger的有效性，表现出更稳定的长时序运动、更好的视觉一致性和更少的生成漂移。 Conclusion: BAgger通过自监督的反向聚合机制有效缓解了暴露偏差问题，在不牺牲生成质量和多样性的情况下显著提升了自回归视频模型的长期生成稳定性。 Abstract: Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.

[94] RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer

Guanfang Dong,Luke Schultz,Negar Hassanpour,Chao Gao

Main category: cs.CV

TL;DR: 本文提出了一种名为RePack的简单而有效的框架，用于改进扩散Transformer（DiTs），通过将高维视觉基础模型（VFM）表示投影到低维流形上来解决信息过载问题。实验结果表明，RePack显著加速了DiT的收敛速度，并在图像重建任务中优于直接注入原始VFM特征的最新方法。

Details

Motivation: 高维VFM表示虽然富含语义信息，但可能导致信息过载，尤其是在其尺寸超过原始图像时，影响解码效率和生成性能。 Method: RePack通过将VFM表示投影到低维流形上，生成更紧凑、更适合解码器使用的表示，从而过滤非语义噪声并保留高保真重建所需的核心结构信息。 Result: 在DiT-XL/2上，RePack仅用64个epoch就达到了3.66的FID分数，比最先进的方法快35%。 Conclusion: RePack能有效提取VFM表示的核心语义，同时规避其高维带来的副作用，显著提升DiT的训练效率和生成质量。 Abstract: The superior representation capability of pre-trained vision foundation models (VFMs) has been harnessed for enhancing latent diffusion models (LDMs). These approaches inject the rich semantics from high-dimensional VFM representations (e.g., DINOv3) into LDMs at different phases, resulting in accelerated learning and better generation performance. However, the high-dimensionality of VFM representations may also lead to Information Overload, particularly when the VFM features exceed the size of the original image for decoding. To address this issue while preserving the utility of VFM features, we propose RePack (Representation Packing), a simple yet effective framework for improving Diffusion Transformers (DiTs). RePack transforms the VFM representation into a more compact, decoder-friendly representation by projecting onto low-dimensional manifolds. We find that RePack can effectively filter out non-semantic noise while preserving the core structural information needed for high-fidelity reconstruction. Experimental results show that RePack significantly accelerates DiT convergence and outperforms recent methods that directly inject raw VFM features into the decoder for image reconstruction. On DiT-XL/2, RePack achieves an FID of 3.66 in only 64 epochs, which is 35% faster than the state-of-the-art method. This demonstrates that RePack successfully extracts the core semantics of VFM representations while bypassing their high-dimensionality side effects.

[95] VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering

Zihu Wang,Boxun Xu,Yuxuan Xia,Peng Li

Main category: cs.CV

TL;DR: 本文提出VEGAS方法，利用视觉编码器的注意力图在解码中期注入语言模型，以减少大视觉-语言模型中的幻觉问题。

Details

Motivation: 大视觉-语言模型常在生成文本时产生与视觉证据不符的幻觉内容，亟需有效机制抑制此类错误。 Method: 分析发现幻觉与视觉注意力分散有关，且视觉-文本冲突在语言模型中层最显著，因此将视觉编码器的注意力图注入这些中层以引导生成过程。 Result: VEGAS在多个基准上显著降低幻觉率，取得当前最优的抑制效果。 Conclusion: 视觉编码器的注意力图是抑制LVLM幻觉的有效信号，尤其在语言模型中层引入可自适应纠正注意力偏差。 Abstract: Large vision-language models (LVLMs) exhibit impressive ability to jointly reason over visual and textual inputs. However, they often produce outputs that are linguistically fluent but factually inconsistent with the visual evidence, i.e., they hallucinate. Despite growing efforts to mitigate such hallucinations, a key question remains: what form of visual attention can effectively suppress hallucinations during decoding? In this work, we provide a simple answer: the vision encoder's own attention map. We show that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder's more concentrated attention maps substantially reduce hallucinations. To further investigate the cause, we analyze vision-text conflicts during decoding and find that these conflicts peak in the language model's middle layers. Injecting the vision encoder's attention maps into these layers effectively suppresses hallucinations. Building on these insights, we introduce VEGAS, a simple yet effective inference-time method that integrates the vision encoder's attention maps into the language model's mid-layers and adaptively steers tokens which fail to concentrate on key image objects. Extensive experiments across multiple benchmarks demonstrate that VEGAS consistently achieves state-of-the-art performance in reducing hallucinations.

[96] SPDMark: Selective Parameter Displacement for Robust Video Watermarking

Samar Fares,Nurbek Tastan,Karthik Nandakumar

Main category: cs.CV

TL;DR: 本文提出了一种基于选择性参数位移的视频扩散模型内生水印框架SPDMark，实现了生成视频中不可感知、鲁棒且高效的水印嵌入与提取。

Details

Motivation: 现有视频水印方法在不可感知性、鲁棒性和计算效率之间难以兼顾，尤其是在高质量生成视频背景下对可追溯性提出了更高要求。 Method: 通过在视频扩散模型中引入低秩适应（LoRA）实现层间基位移的加性组合，利用水印密钥索引最终组合，在训练阶段联合优化信息恢复、感知相似性和时间一致性损失；并使用加密哈希函数生成帧级水印以检测和定位时序篡改，结合最大二分匹配恢复被篡改的帧序。 Result: 在文本到视频和图像到视频生成模型上验证了SPDMark能生成高度不可感知的水印，水印恢复准确率高，并对多种常见视频修改具有强鲁棒性。 Conclusion: SPDMark为生成式视频提供了高效、安全且实用的内生水印解决方案，平衡了不可感知性、鲁棒性和效率三方面需求。 Abstract: The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced `SpeedMark') based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.

[97] AI-Augmented Pollen Recognition in Optical and Holographic Microscopy for Veterinary Imaging

Swarn S. Warshaneyan,Maksims Ivanovs,Blaž Cugmas,Inese Bērziņa,Laura Goldberga,Mindaugas Tamosiunas,Roberts Kadiķis

Main category: cs.CV

TL;DR: 本研究针对传统光学显微与数字全息显微（DIHM）图像中的全自动花粉识别展开，提出基于YOLOv8s和MobileNetV3L的检测与分类方法，并利用WGAN-SN生成合成DIHM图像以提升检测性能，推动兽医成像中DIHM自动化流程的发展。

Details

Motivation: DIHM图像中存在散斑噪声、孪生像伪影且与明场图像差异大，导致花粉视觉识别困难，限制了DIHM在全自动分析中的应用。 Method: 构建包含光学与DIHM图像的双模态数据集，使用YOLOv8s进行目标检测，MobileNetV3L进行分类，并采用WGAN-SN生成合成DIHM图像用于数据增强，混合真实与合成数据训练模型。 Result: 在光学图像上检测mAP50达91.3%，分类准确率97%；DIHM图像上初始检测mAP50为8.15%，分类准确率为50%；通过扩大边界框提升至13.3%和54%；引入WGAN-SN生成数据后检测性能进一步提升至15.4% mAP50。 Conclusion: 基于GAN的数据增强能有效缩小DIHM图像上的识别性能差距，为实现兽医领域DIHM全自动工作流提供了可行路径。 Abstract: We present a comprehensive study on fully automated pollen recognition across both conventional optical and digital in-line holographic microscopy (DIHM) images of sample slides. Visually recognizing pollen in unreconstructed holographic images remains challenging due to speckle noise, twin-image artifacts and substantial divergence from bright-field appearances. We establish the performance baseline by training YOLOv8s for object detection and MobileNetV3L for classification on a dual-modality dataset of automatically annotated optical and affinely aligned DIHM images. On optical data, detection mAP50 reaches 91.3% and classification accuracy reaches 97%, whereas on DIHM data, we achieve only 8.15% for detection mAP50 and 50% for classification accuracy. Expanding the bounding boxes of pollens in DIHM images over those acquired in aligned optical images achieves 13.3% for detection mAP50 and 54% for classification accuracy. To improve object detection in DIHM images, we employ a Wasserstein GAN with spectral normalization (WGAN-SN) to create synthetic DIHM images, yielding an FID score of 58.246. Mixing real-world and synthetic data at the 1.0 : 1.5 ratio for DIHM images improves object detection up to 15.4%. These results demonstrate that GAN-based augmentation can reduce the performance divide, bringing fully automated DIHM workflows for veterinary imaging a small but important step closer to practice.

[98] EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography

Yuheng Li,Yue Zhang,Abdoul Aziz Amadou,Yuxiang Lai,Jike Zhong,Tiziano Passerini,Dorin Comaniciu,Puneet Sharma

Main category: cs.CV

TL;DR: 本文提出了EchoGround-MIMIC，首个基于测量的多模态超声心动图数据集，以及基于该数据集构建的视觉-语言模型EchoVLM，其在多种临床任务中实现了最先进的性能，推动了端到端超声心动图解读的发展。

Details

Motivation: 现有的视觉-语言模型在自然图像和其他医学领域取得成功，但在超声心动图领域的应用受限于缺乏大规模、临床相关的图文数据集以及基于测量的推理能力。因此，需要一个专门针对超声心动图特点构建的数据集和模型。 Method: 构建了包含19,065个图文对的EchoGround-MIMIC数据集，涵盖标准化视图、结构化测量、基于测量的描述和指南推导的疾病标签；提出EchoVLM模型，引入两种新的预训练目标：视图感知对比损失和否定感知对比损失，以更好地捕捉超声图像的结构和临床关键信息。 Result: EchoVLM在五类共36项临床任务中表现优异，零样本疾病分类AUC达86.5%，视图分类准确率达95.1%。结果表明，基于临床的多模态预训练能生成可迁移的视觉表征。 Conclusion: EchoVLM是一个适用于端到端超声心动图解读的基础模型，EchoGround-MIMIC数据集和代码的公开将促进该领域的进一步研究。 Abstract: Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.

[99] A Benchmark Dataset for Spatially Aligned Road Damage Assessment in Small Uncrewed Aerial Systems Disaster Imagery

Thomas Manzini,Priyankari Perali,Raisa Karnik,Robin R. Murphy

Main category: cs.CV

TL;DR: 本文提出了一个用于道路损伤评估和道路对齐的最大基准数据集，并基于10次联邦认定灾害的sUAS影像训练了18个基线模型，解决了以往灾后道路损伤数据集规模小、分辨率低及缺乏操作验证的问题。研究发现道路线空间错位会显著降低模型性能，若不进行对齐，将导致约8%的路况误判和9%的道路线偏离实际道路。

Details

Motivation: 现有灾后道路损伤评估数据集存在规模小、影像分辨率低、缺乏实际操作验证等问题，且实践中发现道路线与实际位置存在错位，影响模型性能，亟需解决。 Method: 基于CRASAR-U-DRIODs数据集中的灾后sUAS影像，标注了657.25公里道路，采用10类标签体系，并提供9,184条道路线调整以实现空间对齐；训练并部署18个基线ML模型，并在2024年黛比和海伦飓风响应中进行实操验证。 Result: 当使用未对齐的道路线时，模型平均Macro IoU下降5.596%；若不进行空间对齐，约8%（11公里）的不良路况会被错误标注，约9%（59公里）的道路线会偏离实际道路。 Conclusion: 道路线的空间对齐对灾后道路损伤评估模型性能至关重要，忽略该问题将导致显著误判。本研究填补了ML、CV与机器人领域在实际灾害响应中道路对齐与评估方面的空白，推动更有效的决策支持。 Abstract: This paper presents the largest known benchmark dataset for road damage assessment and road alignment, and provides 18 baseline models trained on the CRASAR-U-DRIODs dataset's post-disaster small uncrewed aerial systems (sUAS) imagery from 10 federally declared disasters, addressing three challenges within prior post-disaster road damage assessment datasets. While prior disaster road damage assessment datasets exist, there is no current state of practice, as prior public datasets have either been small-scale or reliant on low-resolution imagery insufficient for detecting phenomena of interest to emergency managers. Further, while machine learning (ML) systems have been developed for this task previously, none are known to have been operationally validated. These limitations are overcome in this work through the labeling of 657.25km of roads according to a 10-class labeling schema, followed by training and deploying ML models during the operational response to Hurricanes Debby and Helene in 2024. Motivated by observed road line misalignment in practice, 9,184 road line adjustments were provided for spatial alignment of a priori road lines, as it was found that when the 18 baseline models are deployed against real-world misaligned road lines, model performance degraded on average by 5.596\% Macro IoU. If spatial alignment is not considered, approximately 8\% (11km) of adverse conditions on road lines will be labeled incorrectly, with approximately 9\% (59km) of road lines misaligned off the actual road. These dynamics are gaps that should be addressed by the ML, CV, and robotics communities to enable more effective and informed decision-making during disasters.

[100] MeltwaterBench: Deep learning for spatiotemporal downscaling of surface meltwater

Björn Lütjens,Patrick Alexander,Raf Antwerpen,Til Widmann,Guido Cervone,Marco Tedesco

Main category: cs.CV

TL;DR: 提出了一种基于深度学习的高时空分辨率地表融水制图方法，融合多源遥感与物理模型数据，显著提升了格陵兰冰盖融水监测精度。

Details

Motivation: 现有融水地图在时间和空间分辨率上难以兼顾，且传统方法依赖单一数据源，无法准确捕捉极端融化事件，亟需更精确、高分辨率的监测手段。 Method: 结合合成孔径雷达（SAR）、被动微波（PMW）和数字高程模型（DEM）数据，对区域气候模型（RCM）输出进行时空降尺度，采用UNet和DeepLabv3+等深度学习模型融合多源数据，并构建名为MeltwaterBench的基准数据集。 Result: 所提深度学习方法在100米每日分辨率下准确率达95%，较仅用RCM（83%）或PMW（72%）的传统方法提升超10个百分点；仅用SAR的滑动窗口法达90%准确率但低估极端事件。 Conclusion: 融合多源数据的深度学习模型可显著提高冰盖表面融水制图精度，发布的MeltwaterBench数据集为未来数据驱动的降尺度方法提供了可比较的基准。 Abstract: The Greenland ice sheet is melting at an accelerated rate due to processes that are not fully understood and hard to measure. The distribution of surface meltwater can help understand these processes and is observable through remote sensing, but current maps of meltwater face a trade-off: They are either high-resolution in time or space, but not both. We develop a deep learning model that creates gridded surface meltwater maps at daily 100m resolution by fusing data streams from remote sensing observations and physics-based models. In particular, we spatiotemporally downscale regional climate model (RCM) outputs using synthetic aperture radar (SAR), passive microwave (PMW), and a digital elevation model (DEM) over the Helheim Glacier in Eastern Greenland from 2017-2023. Using SAR-derived meltwater as "ground truth", we show that a deep learning-based method that fuses all data streams is over 10 percentage points more accurate over our study area than existing non deep learning-based approaches that only rely on a regional climate model (83% vs. 95% Acc.) or passive microwave observations (72% vs. 95% Acc.). Alternatively, creating a gridded product through a running window calculation with SAR data underestimates extreme melt events, but also achieves notable accuracy (90%) and does not rely on deep learning. We evaluate standard deep learning methods (UNet and DeepLabv3+), and publish our spatiotemporally aligned dataset as a benchmark, MeltwaterBench, for intercomparisons with more complex data-driven downscaling methods. The code and data are available at $\href{https://github.com/blutjens/hrmelt}{github.com/blutjens/hrmelt}$.

[101] Audio-Visual Camera Pose Estimationn with Passive Scene Sounds and In-the-Wild Video

Daniel Adebi,Sagnik Majumder,Kristen Grauman

Main category: cs.CV

TL;DR: 本文提出了一种利用被动场景声音辅助视觉进行相对相机位姿估计的音频-视觉框架，首次在真实世界视频中成功利用音频信号提升相机运动估计性能。

Details

Motivation: 视觉方法在图像模糊或遮挡等退化条件下表现不佳，而声音可能提供互补的空间线索，因此探索音频在相机位姿估计中的作用。 Method: 将声源方向（DOA）谱和双耳化音频嵌入整合到先进的纯视觉位姿估计模型中，构建一个简单但有效的多模态框架。 Result: 在两个大型数据集上实验表明，所提方法持续优于强视觉基线，并在视觉信息受损时表现出更强的鲁棒性。 Conclusion: 日常环境中偶然产生的音频可作为解决经典空间问题（如相机位姿估计）的有效补充信号，为具身感知提供了新思路。 Abstract: Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-ofarrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.

[102] SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation

Xuancheng Xu,Yaning Li,Sisi You,Bing-Kun Bao

Main category: cs.CV

TL;DR: 本文提出SMRABooth，通过自监督编码器和光流编码器提供对象级的主体与运动表征，并在LoRA微调中对齐，以实现外观与运动的一致性视频生成。

Details

Motivation: 现有方法难以同时保证主体外观相似性和运动模式一致性，因缺乏对象级的主体与运动引导。 Method: 采用自监督编码器提取主体表征，光流编码器提取运动表征，并设计稀疏LoRA注入的主体育动解耦策略，在位置和时间上减少干扰。 Result: 实验表明SMRABooth在主体外观保持和运动一致性方面表现优异，提升了定制化视频生成的质量。 Conclusion: SMRABooth有效实现了主体与运动的解耦建模，增强了文本到视频生成中的可控性与一致性。 Abstract: Customized video generation aims to produce videos that faithfully preserve the subject's appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages the self-supervised encoder and optical flow encoder to provide object-level subject and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture overall structure of subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent and object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRAs injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.

[103] Thermal RGB Fusion for Micro-UAV Wildfire Perimeter Tracking with Minimal Comms

Ercan Erkalkan,Vedat Topuz,Ayça Ak

Main category: cs.CV

TL;DR: 提出了一种轻量级的微型无人机编队在野火环境中进行周界跟踪的方法，适用于带宽受限情况。

Details

Motivation: 在GPS信号弱和通信带宽有限的情况下，实现微型无人机对野火边界的高效、稳定跟踪。 Method: 结合热成像图像的自适应阈值与形态学处理生成热点区域掩码，利用RGB图像的梯度滤波提取边缘线索并抑制纹理干扰；通过规则融合策略选取边界候选，并用Ramer-Douglas-Peucker算法简化路径；引入周期性信标和惯性反馈环以增强轨迹稳定性；优化计算流程以在嵌入式SoC上实现低于50ms的延迟。 Result: 相比纯边缘跟踪基线，该方法减少了平均路径长度和边界抖动，保持了良好的环境覆盖能力；仿真验证了其在标准微型平台上支持10-15m/s前向飞行的可行性和低功耗特性。 Conclusion: 该轻量级周界跟踪方法适合快速部署于应急侦察任务，具备强鲁棒性和低通信需求，适用于资源受限的微无人机系统。 Abstract: This study introduces a lightweight perimeter tracking method designed for micro UAV teams operating over wildfire environments under limited bandwidth conditions. Thermal image frames generate coarse hot region masks through adaptive thresholding and morphological refinement, while RGB frames contribute edge cues and suppress texture related false detections using gradient based filtering. A rule level merging strategy selects boundary candidates and simplifies them via the Ramer Douglas Peucker algorithm. The system incorporates periodic beacons and an inertial feedback loop that maintains trajectory stability in the presence of GPS degradation. The guidance loop targets sub 50 ms latency on embedded System on Chip (SoC) platforms by constraining per frame pixel operations and precomputing gradient tables. Small scale simulations demonstrate reductions in average path length and boundary jitter compared to a pure edge tracking baseline, while maintaining environmental coverage measured through intersection merge analysis. Battery consumption and computational utilization confirm the feasibility of achieving 10, 15 m/s forward motion on standard micro platforms. This approach enables rapid deployment in the field, requiring robust sensing and minimal communications for emergency reconnaissance applications.

[104] A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection

Peizheng Li,Ioannis Mavromatis,Ajith Sahadevan,Tim Farnham,Adnan Aijaz,Aftab Khan

Main category: cs.CV

TL;DR: 本文介绍了一个大规模、纵向的布里斯托尔市街道路灯视觉数据集，包含2021至2025年间22个固定摄像头拍摄的超过52.6万张图像及丰富元数据，用于研究视觉漂移、异常检测和MLOps策略，并提供了基于CNN-VAE的自监督框架与两种漂移度量指标，数据集已公开发布以支持可重复性和下游应用。

Details

Motivation: 为了应对智能城市中长期视觉系统部署面临的视觉漂移和模型稳定性挑战，需要一个真实世界的大规模纵向数据集来评估模型性能退化与环境变化的影响。 Method: 部署22个固定角度摄像头在布里斯托尔市，持续四年每小时采集图像；构建基于卷积变分自编码器（CNN-VAE）的自监督框架，按摄像头节点和昼夜图像分别训练模型；提出两种逐样本漂移度量：相对中心点漂移和相对重构误差。 Result: 收集了超过526,000张图像及其包含时间戳、GPS坐标和设备标识符的元数据；成功实现了针对不同摄像头和昼夜条件的CNN-VAE建模；定义并计算了两种漂移指标，可用于监测长期视觉变化和模型退化。 Conclusion: 该数据集为评估长期模型稳定性、视觉漂移和部署就绪的视觉系统提供了一个现实且细粒度的基准，推动了智能城市环境中MLOps、异常检测与漂移感知学习的研究，并通过公开数据促进可重复研究与多种下游应用。 Abstract: We present a large-scale, longitudinal visual dataset of urban streetlights captured by 22 fixed-angle cameras deployed across Bristol, U.K., from 2021 to 2025. The dataset contains over 526,000 images, collected hourly under diverse lighting, weather, and seasonal conditions. Each image is accompanied by rich metadata, including timestamps, GPS coordinates, and device identifiers. This unique real-world dataset enables detailed investigation of visual drift, anomaly detection, and MLOps strategies in smart city deployments. To promtoe seconardary analysis, we additionally provide a self-supervised framework based on convolutional variational autoencoders (CNN-VAEs). Models are trained separately for each camera node and for day/night image sets. We define two per-sample drift metrics: relative centroid drift, capturing latent space deviation from a baseline quarter, and relative reconstruction error, measuring normalized image-domain degradation. This dataset provides a realistic, fine-grained benchmark for evaluating long-term model stability, drift-aware learning, and deployment-ready vision systems. The images and structured metadata are publicly released in JPEG and CSV formats, supporting reproducibility and downstream applications such as streetlight monitoring, weather inference, and urban scene understanding. The dataset can be found at https://doi.org/10.5281/zenodo.17781192 and https://doi.org/10.5281/zenodo.17859120.

[105] ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB

Jeongjun Park,Sunwook Hwang,Hyeonho Noh,Jin Mo Yang,Hyun Jong Yang,Saewoong Bahk

Main category: cs.CV

TL;DR: 本文提出了ALERT数据集和输入尺寸无关的Vision Transformer（ISA-ViT）框架，用于基于超宽带雷达的驾驶员分心行为识别，显著提升了分类准确率。

Details

Motivation: 现有研究缺乏覆盖多种分心驾驶行为的大规模真实世界UWB数据集，且标准Vision Transformer难以适应非标准维度的UWB雷达数据。 Method: 提出ISA-ViT框架，通过调整补丁配置和利用预训练位置嵌入向量，将UWB数据调整至ViT所需输入尺寸同时保留多普勒和相位特征，并采用范围-频率域融合策略提升分类性能。 Result: ISA-ViT相比现有ViT方法在UWB-based DAR任务上准确率提升了22.68%。 Conclusion: ALERT数据集和ISA-ViT框架有助于推动更鲁棒、可扩展的分心驾驶检测系统在实际场景中的应用。 Abstract: Distracted driving contributes to fatal crashes worldwide. To address this, researchers are using driver activity recognition (DAR) with impulse radio ultra-wideband (IR-UWB) radar, which offers advantages such as interference resistance, low power consumption, and privacy preservation. However, two challenges limit its adoption: the lack of large-scale real-world UWB datasets covering diverse distracted driving behaviors, and the difficulty of adapting fixed-input Vision Transformers (ViTs) to UWB radar data with non-standard dimensions. This work addresses both challenges. We present the ALERT dataset, which contains 10,220 radar samples of seven distracted driving activities collected in real driving conditions. We also propose the input-size-agnostic Vision Transformer (ISA-ViT), a framework designed for radar-based DAR. The proposed method resizes UWB data to meet ViT input requirements while preserving radar-specific information such as Doppler shifts and phase characteristics. By adjusting patch configurations and leveraging pre-trained positional embedding vectors (PEVs), ISA-ViT overcomes the limitations of naive resizing approaches. In addition, a domain fusion strategy combines range- and frequency-domain features to further improve classification performance. Comprehensive experiments demonstrate that ISA-ViT achieves a 22.68% accuracy improvement over an existing ViT-based approach for UWB-based DAR. By publicly releasing the ALERT dataset and detailing our input-size-agnostic strategy, this work facilitates the development of more robust and scalable distracted driving detection systems for real-world deployment.

[106] A Hybrid Deep Learning Framework for Emotion Recognition in Children with Autism During NAO Robot-Mediated Interaction

Indranil Bhattacharjee,Vartika Narayani Srinet,Anirudha Bhattacharjee,Braj Bhushan,Bishakh Bhattacharya

Main category: cs.CV

TL;DR: 本研究提出了一种结合ResNet-50和图卷积网络（GCN）的深度学习管道，用于识别自闭症儿童在与人形机器人互动时的细微情绪反应，基于印度首个大规模真实世界数据集，通过MediaPipe FaceMesh提取视觉与几何特征，并采用加权集成模型生成软标签，有效提升了神经多样性儿童情绪识别的准确性。

Details

Motivation: 理解自闭症谱系障碍（ASD）儿童在社交互动中的情绪反应是发展心理学和人机交互领域的关键挑战，现有方法难以捕捉其细微且非典型的情绪表达，亟需针对该群体的专用情绪识别技术。 Method: 提出一种融合ResNet-50 CNN与三层GCN的混合模型，利用从NAO机器人实验中采集的约5万张面部帧构建数据集；使用MediaPipe FaceMesh提取面部几何与视觉特征，通过DeepFace与FER模型的加权集成生成七类情绪的软标签，并采用KL散度优化融合嵌入进行最终分类。 Result: 该方法在识别ASD儿童对机器人唤名事件的情绪反应中表现出鲁棒性能，能有效捕捉微表情变化，显著优于传统模型，在情感建模和临床应用中展现出高潜力。 Conclusion: 本研究首次在印度建立了面向自闭症儿童情绪分析的大规模真实数据集与高效深度学习管道，填补了以社会机器人为媒介的ASD情绪识别研究空白，为未来个性化辅助技术的发展奠定了基础。 Abstract: Understanding emotional responses in children with Autism Spectrum Disorder (ASD) during social interaction remains a critical challenge in both developmental psychology and human-robot interaction. This study presents a novel deep learning pipeline for emotion recognition in autistic children in response to a name-calling event by a humanoid robot (NAO), under controlled experimental settings. The dataset comprises of around 50,000 facial frames extracted from video recordings of 15 children with ASD. A hybrid model combining a fine-tuned ResNet-50-based Convolutional Neural Network (CNN) and a three-layer Graph Convolutional Network (GCN) trained on both visual and geometric features extracted from MediaPipe FaceMesh landmarks. Emotions were probabilistically labeled using a weighted ensemble of two models: DeepFace's and FER, each contributing to soft-label generation across seven emotion classes. Final classification leveraged a fused embedding optimized via Kullback-Leibler divergence. The proposed method demonstrates robust performance in modeling subtle affective responses and offers significant promise for affective profiling of ASD children in clinical and therapeutic human-robot interaction contexts, as the pipeline effectively captures micro emotional cues in neurodivergent children, addressing a major gap in autism-specific HRI research. This work represents the first such large-scale, real-world dataset and pipeline from India on autism-focused emotion analysis using social robotics, contributing an essential foundation for future personalized assistive technologies.

[107] CineLOG: A Training Free Approach for Cinematic Long Video Generation

Zahra Dehghanian,Morteza Abolghasemi,Hamid Beigy,Hamid R. Rabiee

Main category: cs.CV

TL;DR: 本文提出了CineLOG，一个包含5000个高质量、平衡且未剪辑视频片段的新数据集，用于提升可控视频生成中对摄像机运动和电影类型的精细控制。同时提出了一种分阶段的生成管道和新的轨迹引导过渡模块，显著优于现有的端到端文本到视频模型。

Details

Motivation: 现有可控视频生成模型在细粒度控制（如摄像机轨迹和电影类型）方面能力有限，且常用数据集存在标签噪声、数据不平衡或仿真与现实差距大的问题。 Method: 构建了一个包含详细场景描述、标准电影术语定义的摄像指令和类型标签的高质量数据集CineLOG，并设计了一个四阶段解耦的文本到视频生成管道，引入轨迹引导过渡模块实现多镜头间的平滑时空过渡。 Result: 实验表明，该方法在遵循特定摄像机和剧本指令方面显著优于当前最先进的端到端T2V模型，同时保持专业级视觉质量，人类评估结果验证了其优越性。 Conclusion: CineLOG数据集和提出的分阶段生成框架有效提升了视频生成中的细粒度可控性，特别是在摄像机运动和电影类型控制方面，为未来研究提供了高质量资源和新思路。 Abstract: Controllable video synthesis is a central challenge in computer vision, yet current models struggle with fine grained control beyond textual prompts, particularly for cinematic attributes like camera trajectory and genre. Existing datasets often suffer from severe data imbalance, noisy labels, or a significant simulation to real gap. To address this, we introduce CineLOG, a new dataset of 5,000 high quality, balanced, and uncut video clips. Each entry is annotated with a detailed scene description, explicit camera instructions based on a standard cinematic taxonomy, and genre label, ensuring balanced coverage across 17 diverse camera movements and 15 film genres. We also present our novel pipeline designed to create this dataset, which decouples the complex text to video (T2V) generation task into four easier stages with more mature technology. To enable coherent, multi shot sequences, we introduce a novel Trajectory Guided Transition Module that generates smooth spatio-temporal interpolation. Extensive human evaluations show that our pipeline significantly outperforms SOTA end to end T2V models in adhering to specific camera and screenplay instructions, while maintaining professional visual quality. All codes and data are available at https://cine-log.pages.dev.

[108] Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

Rheeya Uppaal,Phu Mon Htut,Min Bai,Nikolaos Pappas,Zheng Qi

Main category: cs.CV

TL;DR: 本文提出了一种新的评估维度——视觉推理链的忠实性，用于衡量视觉语言模型（VLMs）在多模态推理过程中感知步骤是否基于图像内容，并设计了一个无需训练和参考答案的框架，结合VLM裁判和人类元评估来评估每一步的忠实性。在此基础上，引入轻量级自省机制，可检测并局部重生成不忠实地感知步骤，在降低不忠实感知率的同时保持最终答案准确率，提升多模态推理的可靠性。

Details

Motivation: 现有的推理增强型视觉语言模型虽然能生成显式推理链以提高透明度和能力，但存在新类型的失败模式：中间步骤可能与图像内容不符（视觉不忠实），或虽推理忠实但在最终预测上出错。而传统评估仅关注最终答案准确性，无法区分这些情况，因此需要新的评估维度和改进方法。 Method: 提出一种无需训练和参考答案的评估框架，将推理链分解为感知步骤和推理步骤，利用现成的VLM作为裁判对每一步进行忠实性判断，并通过人类元评估验证该方法的有效性；进一步设计轻量级自省机制，自动检测不忠实地感知步骤并进行局部重生成，从而提升整体推理链的视觉忠实性。 Result: 在多个经过推理训练的视觉语言模型和注重感知的基准测试上，所提方法显著降低了不忠实感知率（Unfaithful Perception Rate），同时保持了最终答案的准确性，验证了其在提升多模态推理可靠性方面的有效性。 Conclusion: 视觉推理链的忠实性是一个重要且可衡量的新评估维度，所提出的评估框架和自省机制无需训练即可有效识别和修复感知错误，提升了多模态模型推理过程的可靠性和可信度，为未来模型设计和评估提供了新方向。 Abstract: Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.

[109] Fine-Grained Zero-Shot Learning with Attribute-Centric Representations

Zhi Chen,Jingcai Guo,Taotao Cai,Yuxiang Cai

Main category: cs.CV

TL;DR: 提出了一种基于属性中心表示（ACR）的零样本学习框架，通过混合专家机制实现属性解耦，提升了对未见细粒度类别的识别能力。

Details

Motivation: 传统模型将不同视觉属性（如颜色、形状、纹理）纠缠在单一嵌入中，导致难以区分细微视觉差异，阻碍了对未见细粒度类别的识别。 Method: 设计了两个混合专家组件：Mixture of Patch Experts (MoPE) 和 Mixture of Attribute Experts (MoAE)。MoPE通过双层路由机制将图像块分派给专门处理特定属性族的专家；MoAE则将专家优化后的特征投影为稀疏的、部分感知的属性图，以实现鲁棒的零样本分类。 Result: 在CUB、AwA2和SUN三个零样本学习基准数据集上均取得了最先进的性能。 Conclusion: ACR框架通过在表示学习过程中强制属性解耦，有效缓解了属性纠缠问题，显著提升了零样本细粒度分类的性能。 Abstract: Recognizing unseen fine-grained categories demands a model that can distinguish subtle visual differences. This is typically achieved by transferring visual-attribute relationships from seen classes to unseen classes. The core challenge is attribute entanglement, where conventional models collapse distinct attributes like color, shape, and texture into a single visual embedding. This causes interference that masks these critical distinctions. The post-hoc solutions of previous work are insufficient, as they operate on representations that are already mixed. We propose a zero-shot learning framework that learns AttributeCentric Representations (ACR) to tackle this problem by imposing attribute disentanglement during representation learning. ACR is achieved with two mixture-of-experts components, including Mixture of Patch Experts (MoPE) and Mixture of Attribute Experts (MoAE). First, MoPE is inserted into the transformer using a dual-level routing mechanism to conditionally dispatch image patches to specialized experts. This ensures coherent attribute families are processed by dedicated experts. Finally, the MoAE head projects these expert-refined features into sparse, partaware attribute maps for robust zero-shot classification. On zero-shot learning benchmark datasets CUB, AwA2, and SUN, our ACR achieves consistent state-of-the-art results.

[110] ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation

Minheng Ni,Zhengyuan Yang,Yaowen Zhang,Linjie Li,Chung-Ching Lin,Kevin Lin,Zhendong Wang,Xiaofei Wang,Shujie Liu,Lei Zhang,Wangmeng Zuo,Lijuan Wang

Main category: cs.CV

TL;DR: 本文提出了ProImage-Bench，一个基于评分标准的专业图像生成基准，用于评估技术密集型科学插图的生成准确性，揭示现有模型在科学保真度上的不足，并展示如何利用该基准进行迭代优化以提升生成质量。

Details

Motivation: 现有文本到图像模型在开放领域表现良好，但在生成信息密集、科学精确的专业图像（如生物学示意图、工程图纸）方面缺乏评估手段和改进路径，亟需一个细粒度、可量化的基准来衡量和提升其科学保真能力。 Method: 构建包含654张来自真实教材和技术报告的图像数据集，通过大多少模态模型生成详细图像指令和分层评分标准（含6,076项准则和44,131项二元检查），并设计基于LMM的自动评审机制结合惩罚机制聚合得分；进一步将失败项反馈给编辑模型进行迭代优化。 Result: 代表性文本到图像模型在ProImage-Bench上表现有限，最佳基础模型仅达到0.791的评分准确率和0.553的准则得分；通过基于失败检查的迭代编辑，强生成器的评分准确率从0.653提升至0.865，准则得分从0.388提升至0.697。 Conclusion: ProImage-Bench为专业图像生成提供了严格的诊断工具和可扩展的优化信号，不仅能有效暴露模型在科学细节上的缺陷，还能指导生成结果向更高规格保真度进化。 Abstract: We study professional image generation, where a model must synthesize information-dense, scientifically precise illustrations from technical descriptions rather than merely produce visually plausible pictures. To quantify the progress, we introduce ProImage-Bench, a rubric-based benchmark that targets biology schematics, engineering/patent drawings, and general scientific diagrams. For 654 figures collected from real textbooks and technical reports, we construct detailed image instructions and a hierarchy of rubrics that decompose correctness into 6,076 criteria and 44,131 binary checks. Rubrics are derived from surrounding text and reference figures using large multimodal models, and are evaluated by an automated LMM-based judge with a principled penalty scheme that aggregates sub-question outcomes into interpretable criterion scores. We benchmark several representative text-to-image models on ProImage-Bench and find that, despite strong open-domain performance, the best base model reaches only 0.791 rubric accuracy and 0.553 criterion score overall, revealing substantial gaps in fine-grained scientific fidelity. Finally, we show that the same rubrics provide actionable supervision: feeding failed checks back into an editing model for iterative refinement boosts a strong generator from 0.653 to 0.865 in rubric accuracy and from 0.388 to 0.697 in criterion score. ProImage-Bench thus offers both a rigorous diagnostic for professional image generation and a scalable signal for improving specification-faithful scientific illustrations.

[111] Comparison of different segmentation algorithms on brain volume and fractal dimension in infant brain MRIs

Nathalie Alexander,Arnaud Gucciardi,Umberto Michelucci

Main category: cs.CV

TL;DR: 本研究基于BOB数据集系统评估了SynthSeg和SamSeg在婴儿脑MRI分割中的表现，结果显示SynthSeg在各项指标上均优于SamSeg，能更准确地估计体积和分形维度（FD），但分割误差仍可能影响FD的可靠性。

Details

Motivation: 由于婴儿脑组织对比度低和持续髓鞘化，自动化分割具有挑战性，因此需要评估现有方法在婴儿脑MRI中的准确性及其对发育指标（如体积和分形维度）的影响。 Method: 使用Baby Open Brains（BOB）数据集（71次扫描，1-9个月）比较SynthSeg与SamSeg两种分割方法，采用Dice、IoU、95% Hausdorff距离和归一化互信息等指标评估其与专家标注的一致性，并分析其对体积和分形维度（FD）估计的影响。 Result: SynthSeg在所有指标上均优于SamSeg（主要区域平均Dice > 0.8），体积估计接近人工标准（平均+4%），而SamSeg显著高估脑室和全脑体积（平均+76%）。分割准确性随年龄增长而提高；FD分析显示SynthSeg与专家分割存在区域差异，且分割相关的FD变异超过多数发育队列中的组间差异，体积与FD偏差呈正相关。 Conclusion: SynthSeg在婴幼儿MRI中提供了最可靠的体积和FD估计结果，但由于分割不确定性可能导致微小形态差异（如体积和FD变化）被误读，因此解释结果时应谨慎。 Abstract: Accurate segmentation of infant brain MRI is essential for quantifying developmental changes in structure and complexity. However, ongoing myelination and reduced tissue contrast make automated segmentation particularly challenging. This study systematically compared segmentation accuracy and its impact on volumetric and fractal dimension (FD) estimates in infant brain MRI using the Baby Open Brains (BOB) dataset (71 scans, 1-9 months). Two methods, SynthSeg and SamSeg, were evaluated against expert annotations using Dice, Intersection over Union, 95th-percentile Hausdorff distance, and Normalised Mutual Information. SynthSeg outperformed SamSeg across all quality metrics (mean Dice > 0.8 for major regions) and provided volumetric estimates closely matching the manual reference (mean +4% [-28% - 71%]). SamSeg systematically overestimated ventricular and whole-brain volumes (mean +76% [-12% - 190%]). Segmentation accuracy improved with age, consistent with increasing tissue contrast during myelination. Fractal dimension a(FD) nalyses revealed significant regional differences between SynthSeg and expert segmentations, and Bland-Altman limits of agreement indicated that segmentation-related FD variability exceeded most group differences reported in developmental cohorts. Volume and FD deviations were positively correlated across structures, indicating that segmentation bias directly affects FD estimation. Overall, SynthSeg provided the most reliable volumetric and FD results for paediatric MRI, yet small morphological differences in volume and FD should be interpreted with caution due to segmentation-related uncertainty.

[112] Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder

Tianyu Zhang,Dong Liu,Chang Wen Chen

Main category: cs.CV

TL;DR: 提出了一种用于超低比特率图像压缩的非对称极端图像压缩（AEIC）框架，使用浅层编码器和单步扩散解码器，在保证高压缩效率的同时实现高质量重建。

Details

Motivation: 现有超低比特率图像压缩方法依赖大型预训练编码器，难以部署在计算能力弱的边缘设备上，因此需要一种编码简单、解码高质量的新框架。 Method: 采用中等或浅层编码器网络，结合单步扩散解码器，并设计双侧特征蒸馏方案，将中等编码器的知识迁移到浅层变体中，以提升其性能。 Result: AEIC在超低比特率下优于现有方法，在率-失真-感知性能上表现更优，编码效率高达1080P图像35.8 FPS，同时保持有竞争力的解码速度。 Conclusion: AEIC实现了编码轻量化与解码高质量的平衡，适用于带宽和计算资源受限的场景，为边缘设备上的图像压缩提供了可行方案。 Abstract: Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that pursues simultaneously encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging an one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods on rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency for 35.8 FPS on 1080P input images, while maintaining competitive decoding speed compared to existing methods.

[113] Moment and Highlight Detection via MLLM Frame Segmentation

I Putu Andika Bagas Jiwanta,Ayu Purwarianti

Main category: cs.CV

TL;DR: 提出一种基于大语言模型（LLM）输出token进行视频时刻和高亮检测的新型方法，通过将帧序列映射为“0”和“1”字符序列并应用分割损失，实现直接的帧级监督，在低采样帧数下仍取得优异性能。

Details

Motivation: 现有基于文本生成的方法无法提供帧级预测的直接梯度，限制了高亮和时刻检测的优化；而强化学习等替代方案不够稳定或高效。 Method: 将固定数量的视频帧输入多模态大语言模型，并设计提示词使其输出与帧一一对应的“0”（背景）和“1”（前景）字符序列；在训练中结合分割损失（如Dice Loss）和因果语言建模损失，使模型学习帧级分类；推理时通过束搜索生成序列及其logits，分别作为检测结果和显著性得分。 Result: 在QVHighlights数据集上达到56.74 HIT@1的高亮检测性能，优于多数现有方法；在时刻检索任务中取得35.28 MAP，超过基线；仅使用25帧（不足其他方法一半），且分割损失能持续提供稳定的训练信号，即使语言模型损失已饱和。 Conclusion: 通过将分割目标直接应用于大语言模型的输出token，实现了有效的帧级监督学习，在保持语言生成能力的同时提升了视频理解任务的性能与训练稳定性，尤其在低帧采样率下表现突出。 Abstract: Detecting video moments and highlights from natural-language queries have been unified by transformer-based methods. Other works use generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed with a fixed number of frames alongside a prompt that enforces it to output a sequence of continuous "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates sequence and logits, acting as moments and saliency scores, respectively. Despite sampling only 25 frames -- less than half of comparable methods -- our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.

[114] MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models

Yuqing Lei,Yingjun Du,Yawen Huang,Xiantong Zhen,Ling Shao

Main category: cs.CV

TL;DR: 本文提出了一种名为Meta Test-Time Prompt Tuning (MetaTPT)的元学习框架，用于在测试时自适应地调整视觉-语言模型中的提示，通过学习参数化的数据增强来提升跨域泛化性能。

Details

Motivation: 现有的测试时提示调优方法使用固定的数据增强，在面对显著域偏移时表现受限，因此需要更灵活的增强策略来提升模型的零样本泛化能力。 Method: MetaTPT采用双层优化框架：内层通过自监督任务学习为每个样本生成有信息量的增强视图；外层则基于这些视图进行一致性约束下的提示调优，实现动态且更具表达力的域适应。 Result: 实验表明，MetaTPT在多个域泛化和跨数据集基准上达到了最先进的性能，显著优于现有测试时适应方法。 Conclusion: 通过联合学习参数化增强与提示调优，MetaTPT有效提升了视觉-语言模型在测试时面对域偏移的适应能力，验证了动态增强在零样本迁移中的重要性。 Abstract: Vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization but remain sensitive to domain shifts at test time. Test-time prompt tuning (TPT) mitigates this issue by adapting prompts with fixed augmentations, which may falter in more challenging settings. In this work, we propose Meta Test-Time Prompt Tuning (MetaTPT), a meta-learning framework that learns a self-supervised auxiliary task to guide test-time prompt tuning. The auxiliary task dynamically learns parameterized augmentations for each sample, enabling more expressive transformations that capture essential features in target domains. MetaTPT adopts a dual-loop optimization paradigm: an inner loop learns a self-supervised task that generates informative views, while the outer loop performs prompt tuning by enforcing consistency across these views. By coupling augmentation learning with prompt tuning, MetaTPT improves test-time adaptation under domain shifts. Extensive experiments demonstrate that MetaTPT achieves state-of-the-art performance on domain generalization and cross-dataset benchmarks.

[115] Feature Aggregation for Efficient Continual Learning of Complex Facial Expressions

Thibault Geoffroy,Myriam Maumy,Lionel Prevost

Main category: cs.CV

TL;DR: 提出一种用于连续学习的面部表情识别混合框架，结合深度卷积特征和面部动作单元，采用贝叶斯高斯混合模型减少灾难性遗忘，在CFEE数据集上表现出良好的准确性和知识保持能力。

Details

Motivation: 为了实现有效的人机交互，AI系统需要能够识别并适应人类情绪，而现有模型在持续学习中容易遗忘先前知识，因此需要一种能缓解灾难性遗忘的表情识别方法。 Method: 结合深度卷积特征和基于FACS的面部动作单元（AUs），构建一个双模态表示，并使用贝叶斯高斯混合模型（BGMM）进行建模，避免重新训练，实现轻量化的持续学习。 Result: 在CFEE数据集上，模型能够先学习基本表情，再逐步识别复合表情，实验显示准确性提高，知识保留增强，遗忘显著减少。 Conclusion: 该框架为开发具有情感智能的AI系统提供了有效方案，适用于教育、医疗和自适应用户界面等领域。 Abstract: As artificial intelligence (AI) systems become increasingly embedded in our daily life, the ability to recognize and adapt to human emotions is essential for effective human-computer interaction. Facial expression recognition (FER) provides a primary channel for inferring affective states, but the dynamic and culturally nuanced nature of emotions requires models that can learn continuously without forgetting prior knowledge. In this work, we propose a hybrid framework for FER in a continual learning setting that mitigates catastrophic forgetting. Our approach integrates two complementary modalities: deep convolutional features and facial Action Units (AUs) derived from the Facial Action Coding System (FACS). The combined representation is modelled through Bayesian Gaussian Mixture Models (BGMMs), which provide a lightweight, probabilistic solution that avoids retraining while offering strong discriminative power. Using the Compound Facial Expression of Emotion (CFEE) dataset, we show that our model can first learn basic expressions and then progressively recognize compound expressions. Experiments demonstrate improved accuracy, stronger knowledge retention, and reduced forgetting. This framework contributes to the development of emotionally intelligent AI systems with applications in education, healthcare, and adaptive user interfaces.

[116] Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection

Jiahao Zhao

Main category: cs.CV

TL;DR: 提出Cognitive-YOLO，一种基于大语言模型的数据感知目标检测架构生成框架，通过提取数据集元特征并结合检索增强生成神经架构描述语言，直接合成高性能网络结构。

Details

Motivation: 传统手动设计和NAS方法在目标检测架构设计中耗时或计算成本高，现有LLM方法多作为迭代优化器，缺乏从数据本质特征出发的全局理解能力。 Method: 三阶段框架：1）分析模块提取数据集关键元特征（如物体尺度分布、场景密度）；2）大语言模型结合检索增强生成（RAG）获取的先进组件，基于元特征推理生成结构化神经架构描述语言（NADL）；3）编译器将NADL实例化为可部署模型。 Result: 在五个不同目标检测数据集上实验表明，Cognitive-YOLO生成的架构性能优于强基线模型，具有更优的参数效率和性能权衡；消融实验证明LLM对数据‘第一性原理’的理解是性能提升的关键。 Conclusion: 数据驱动的LLM推理比单纯检索SOTA组件更能有效指导高性能检测架构的设计，验证了从数据本质特征出发进行架构合成的重要性。 Abstract: Designing high-performance object detection architectures is a complex task, where traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop, rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons upon these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture into a structured Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that our proposed Cognitive-YOLO consistently generates superior architectures, achieving highly competitive performance and demonstrating a superior performance-per-parameter trade-off compared to strong baseline models across multiple benchmarks. Crucially, our ablation studies prove that the LLM's data-driven reasoning is the primary driver of performance, demonstrating that a deep understanding of data "first principles" is more critical for achieving a superior architecture than simply retrieving SOTA components.

[117] RealDrag: The First Dragging Benchmark with Real Target Image

Ahmad Zafarani,Zahra Dehghanian,Mohammadreza Davoodi,Mohsen Shadroo,MohammadAmin Fazli,Hamid R. Rabiee

Main category: cs.CV

TL;DR: 本文提出了RealDrag，首个包含真实目标图像的基于点的图像编辑综合基准，解决了现有评估方法因缺乏标准化数据集和指标而导致的不可靠问题。

Details

Motivation: 由于缺乏标准化的基准和度量标准，当前基于拖拽的图像编辑模型评估不可靠，且缺少包含真实目标图像的数据集，导致难以进行客观比较。 Method: 构建了一个包含400多个人工标注样本的数据集，涵盖源/目标图像、控制点、可编辑区域掩码和描述文本，并提出了四个新指标（SeD、OMPS、IPPS、DiS）来评估编辑的保真度、非编辑区域保持性和语义一致性。 Result: 对17种SOTA模型进行了首次大规模系统性评估，揭示了现有方法之间的权衡，并建立了可复现的基线。 Conclusion: RealDrag为基于点的图像编辑提供了可靠、公开的评估基准和工具，推动该领域的标准化和进一步发展。 Abstract: The evaluation of drag based image editing models is unreliable due to a lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, the absence of datasets containing ground truth target images, making objective comparisons between competing methods difficult. To address this, we introduce \textbf{RealDrag}, the first comprehensive benchmark for point based image editing that includes paired ground truth target images. Our dataset contains over 400 human annotated samples from diverse video sources, providing source/target images, handle/target points, editable region masks, and descriptive captions for both the image and the editing action. We also propose four novel, task specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics are designed to quantify pixel level matching fidelity, check preservation of non edited (out of mask) regions, and measure semantic alignment with the desired task. Using this benchmark, we conduct the first large scale systematic analysis of the field, evaluating 17 SOTA models. Our results reveal clear trade offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available.

[118] GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search

Hyunju Lee,Youngmin Oh,Jeimin Jeon,Donghyeon Baek,Bumsub Ham

Main category: cs.CV

TL;DR: 提出了一种渐进式训练框架GrowTAS，通过从小网络开始逐步引入大网络，减少权重共享带来的干扰，提升视觉Transformer架构搜索的效果。

Details

Motivation: 现有Transformer架构搜索方法中，所有子网络共享权重导致小网络性能严重下降，且训练过程不稳定。 Method: 提出GrowTAS框架，先训练小规模子网络，再逐步引入更大子网络；进一步提出GrowTAS+，通过仅微调部分权重来增强大子网络性能。 Result: 在ImageNet和多个迁移学习基准（如CIFAR-10/100、Flowers、CARS、INAT-19）上验证了方法的有效性，优于当前TAS方法。 Conclusion: 渐进式训练策略能有效缓解权重共享带来的干扰，提升超网训练稳定性和子网性能。 Abstract: Transformer architecture search (TAS) aims to automatically discover efficient vision transformers (ViTs), reducing the need for manual design. Existing TAS methods typically train an over-parameterized network (i.e., a supernet) that encompasses all candidate architectures (i.e., subnets). However, all subnets share the same set of weights, which leads to interference that degrades the smaller subnets severely. We have found that well-trained small subnets can serve as a good foundation for training larger ones. Motivated by this, we propose a progressive training framework, dubbed GrowTAS, that begins with training small subnets and incorporate larger ones gradually. This enables reducing the interference and stabilizing a training process. We also introduce GrowTAS+ that fine-tunes a subset of weights only to further enhance the performance of large subnets. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate the effectiveness of our approach over current TAS methods

[119] From Human Intention to Action Prediction: A Comprehensive Benchmark for Intention-driven End-to-End Autonomous Driving

Huan Zheng,Yucheng Zhou,Tianyi Yan,Jiayi Su,Hongjun Chen,Dubing Chen,Wencheng Han,Runzhou Tao,Zhongying Qiu,Jianfei Yang,Jianbing Shen

Main category: cs.CV

TL;DR: 本文提出了Intention-Drive，首个用于评估自动驾驶系统将高层人类意图转化为安全精确驾驶行为能力的基准，包含复杂场景与自然语言意图配对的数据集及以意图成功率（ISR）为核心的新型评估协议。

Details

Motivation: 当前端到端自动驾驶系统仅能执行低级指令，缺乏理解和实现高级抽象人类意图的能力，亟需一个标准化基准来推动这一关键方向的发展。 Method: 构建了一个包含复杂驾驶场景与对应自然语言意图的新数据集，并设计了基于意图成功率（ISR）的评估协议，强调语义层面的目标达成度而非简单的几何准确性。 Result: 通过对多种基线模型在Intention-Drive上的广泛评估，发现现有模型在场景和意图的综合理解方面存在显著性能缺陷，难以胜任高级意图驱动的驾驶任务。 Conclusion: Intention-Drive填补了自动驾驶从命令跟随向意图实现跃迁中的关键空白，为发展真正智能的自动驾驶系统提供了衡量标准和前进方向。 Abstract: Current end-to-end autonomous driving systems operate at a level of intelligence akin to following simple steering commands. However, achieving genuinely intelligent autonomy requires a paradigm shift: moving from merely executing low-level instructions to understanding and fulfilling high-level, abstract human intentions. This leap from a command-follower to an intention-fulfiller, as illustrated in our conceptual framework, is hindered by a fundamental challenge: the absence of a standardized benchmark to measure and drive progress on this complex task. To address this critical gap, we introduce Intention-Drive, the first comprehensive benchmark designed to evaluate the ability to translate high-level human intent into safe and precise driving actions. Intention-Drive features two core contributions: (1) a new dataset of complex scenarios paired with corresponding natural language intentions, and (2) a novel evaluation protocol centered on the Intent Success Rate (ISR), which assesses the semantic fulfillment of the human's goal beyond simple geometric accuracy. Through an extensive evaluation of a spectrum of baseline models on Intention-Drive, we reveal a significant performance deficit, showing that the baseline model struggle to achieve the comprehensive scene and intention understanding required for this advanced task.

[120] OMUDA: Omni-level Masking for Unsupervised Domain Adaptation in Semantic Segmentation

Yang Ou,Xiongwei Zhao,Xinye Yang,Yihan Wang,Yicheng Di,Rong Yuan,Xieyuanli Chen,Xu Zhu

Main category: cs.CV

TL;DR: 提出了一种名为OMUDA的统一框架，通过跨表示层次的分层掩码策略解决无监督域适应中的上下文模糊、特征不一致和伪标签噪声问题，在多个基准上实现了最先进的性能。

Details

Motivation: 现有无监督域适应方法在应对跨域上下文模糊、特征表示不一致和类别的伪标签噪声方面仍存在困难，导致域间差距难以有效缩小。 Method: 提出了OMUDA框架，包含三种掩码策略：1）上下文感知掩码（CAM），自适应区分前景与背景；2）特征蒸馏掩码（FDM），利用预训练模型进行知识迁移以增强特征学习；3）类别解耦掩码（CDM），建模类别级不确定性以减轻伪标签噪声影响。 Result: 在SYNTHIA->Cityscapes和GTA5->Cityscapes等任务上，OMUDA相比现有方法平均提升7%，并达到最先进的性能。 Conclusion: OMUDA通过多层次掩码机制有效减少了语义分割中跨域的上下文、表示和类别级差异，为无监督域适应提供了统一且高效的解决方案。 Abstract: Unsupervised domain adaptation (UDA) enables semantic segmentation models to generalize from a labeled source domain to an unlabeled target domain. However, existing UDA methods still struggle to bridge the domain gap due to cross-domain contextual ambiguity, inconsistent feature representations, and class-wise pseudo-label noise. To address these challenges, we propose Omni-level Masking for Unsupervised Domain Adaptation (OMUDA), a unified framework that introduces hierarchical masking strategies across distinct representation levels. Specifically, OMUDA comprises: 1) a Context-Aware Masking (CAM) strategy that adaptively distinguishes foreground from background to balance global context and local details; 2) a Feature Distillation Masking (FDM) strategy that enhances robust and consistent feature learning through knowledge transfer from pre-trained models; and 3) a Class Decoupling Masking (CDM) strategy that mitigates the impact of noisy pseudo-labels by explicitly modeling class-wise uncertainty. This hierarchical masking paradigm effectively reduces the domain shift at the contextual, representational, and categorical levels, providing a unified solution beyond existing approaches. Extensive experiments on multiple challenging cross-domain semantic segmentation benchmarks validate the effectiveness of OMUDA. Notably, on the SYNTHIA->Cityscapes and GTA5->Cityscapes tasks, OMUDA can be seamlessly integrated into existing UDA methods and consistently achieving state-of-the-art results with an average improvement of 7%.

[121] MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

Benjamin Beilharz,Thomas S. A. Wallis

Main category: cs.CV

TL;DR: 本文提出了一种名为MRD的新方法，利用可微分渲染探测视觉模型对3D场景属性的隐式理解，通过寻找产生相同模型激活但物理上不同的3D场景参数来分析模型表示。

Details

Motivation: 深度学习在视觉任务中表现优异，但其表示和决策难以解释；尽管模型通常假设具有对3D场景的隐式理解，尚缺乏有效工具验证这一点。 Method: 使用基于物理的可微分渲染技术，寻找使模型激活相同的不同3D场景参数（即模型同感刺激），从而分析模型对形状、材质、光照等物理属性的敏感性。 Result: 在多个模型上验证了该方法的有效性，结果显示尽管重建场景视觉差异明显，模型激活却高度相似，揭示了模型对某些物理属性的不变性或敏感性。 Conclusion: MRD为理解视觉模型如何响应物理场景参数提供了新途径，有助于深入分析计算机与人类视觉系统的表征机制。 Abstract: While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.

[122] WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

Shenghao Fu,Yukun Su,Fengyun Rao,Jing Lyu,Xiaohua Xie,Wei-Shi Zheng

Main category: cs.CV

TL;DR: 本文提出WeDetect系列模型，基于检索思想实现开放词汇目标检测，兼具高效性和多功能性，在15个基准上达到最优性能。

Details

Motivation: 探索无需跨模态融合层的开放词汇检测方法，提升推理效率与应用 versatility。 Method: 采用双塔架构的非融合模型WeDetect，通过共享嵌入空间中的区域-文本匹配进行检测；WeDetect-Uni冻结检测器微调目标性提示生成通用提议；WeDetect-Ref结合LMM进行指代表达理解，单次前向分类目标。 Result: WeDetect实现实时检测且性能超越融合模型；WeDetect-Uni支持历史数据中的对象检索；WeDetect-Ref在REC任务中高效准确；整体在15个基准上表现SOTA。 Conclusion: WeDetect家族统一了检测、提议生成、对象检索和指代表达理解于一个高效的检索框架中，展现出卓越的性能与广泛的应用潜力。 Abstract: Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, \ie, matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting retrieval objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.

[123] Unified Control for Inference-Time Guidance of Denoising Diffusion Models

Maurya Goyal,Anuj Singh,Hadi Jamali-Rad

Main category: cs.CV

TL;DR: 本文提出了一种名为UniCoDe的通用算法，统一了采样与基于梯度的引导方法，以更高效地对齐扩散模型输出与下游目标。

Details

Motivation: 为了提升扩散模型在特定任务中的性能，需要在推理时有效对齐模型输出与下游目标，现有方法存在采样效率低或偏离先验分布的问题。 Method: UniCoDe在采样过程中融合局部梯度信号，结合了基于采样的奖励选择和基于梯度的生成引导两种策略，形成统一框架。 Result: 实验表明，UniCoDe在多种任务上与当前最先进方法具有竞争力，同时提高了采样效率，并更好地平衡了奖励对齐与对无条件先验的偏离。 Conclusion: UniCoDe通过统一采样与梯度引导策略，实现了更高效的推理时对齐，在多个任务中表现出优越的性能与平衡性。 Abstract: Aligning diffusion model outputs with downstream objectives is essential for improving task-specific performance. Broadly, inference-time training-free approaches for aligning diffusion models can be categorized into two main strategies: sampling-based methods, which explore multiple candidate outputs and select those with higher reward signals, and gradient-guided methods, which use differentiable reward approximations to directly steer the generation process. In this work, we propose a universal algorithm, UniCoDe, which brings together the strengths of sampling and gradient-based guidance into a unified framework. UniCoDe integrates local gradient signals during sampling, thereby addressing the sampling inefficiency inherent in complex reward-based sampling approaches. By cohesively combining these two paradigms, UniCoDe enables more efficient sampling while offering better trade-offs between reward alignment and divergence from the diffusion unconditional prior. Empirical results demonstrate that UniCoDe remains competitive with state-of-the-art baselines across a range of tasks. The code is available at https://github.com/maurya-goyal10/UniCoDe

[124] TCLeaf-Net: a transformer-convolution framework with global-local attention for robust in-field lesion-level plant leaf disease detection

Zishen Song,Yongjian Zhu,Dong Wang,Hongzhan Liu,Lingyu Jiang,Yongxing Duan,Zehua Zhang,Sihan Li,Jiarui Li

Main category: cs.CV

TL;DR: 本文提出了一种用于田间叶片病害检测的混合Transformer-卷积检测器TCLeaf-Net，并发布了包含1,746张图像和7,839个病斑的配对数据集Daylily-Leaf。该方法通过TCM模块抑制复杂背景干扰，RSFRS模块保留精细空间特征，DFPN模块增强多尺度融合，显著提升了检测精度与效率。

Details

Motivation: 田间环境下背景复杂、域偏移严重且缺乏高质量的病斑级标注数据，限制了现有模型在实际农业场景中的鲁棒性和泛化能力，亟需更高效、适应性强的检测方法。 Method: 提出TCLeaf-Net，包含三个核心模块：1）Transformer-卷积模块（TCM）结合全局上下文与局部卷积以抑制非叶片区域；2）原始尺度特征重采样块（RSFRS）通过双线性重采样与卷积减少下采样信息损失；3）基于可变形对齐与FPN的DFPN模块增强多尺度特征融合。同时发布真实田间条件下的病斑级数据集Daylily-Leaf。 Result: 在Daylily-Leaf田间子集上，TCLeaf-Net将mAP@50提升5.4个百分点至78.2%，计算量减少7.5 GFLOPs，GPU内存占用降低8.7%。在PlantDoc、Tomato-Leaf和Rice-Leaf数据集上也优于YOLO和RT-DETR系列模型，展现出优异的精度、召回率与跨域泛化能力。 Conclusion: TCLeaf-Net通过混合架构有效应对田间病害检测中的背景干扰、信息丢失和尺度变化问题，结合新发布的Daylily-Leaf数据集，为实际农业场景下的植物病害检测提供了高效且可推广的解决方案。 Abstract: Timely and accurate detection of foliar diseases is vital for safeguarding crop growth and reducing yield losses. Yet, in real-field conditions, cluttered backgrounds, domain shifts, and limited lesion-level datasets hinder robust modeling. To address these challenges, we release Daylily-Leaf, a paired lesion-level dataset comprising 1,746 RGB images and 7,839 lesions captured under both ideal and in-field conditions, and propose TCLeaf-Net, a transformer-convolution hybrid detector optimized for real-field use. TCLeaf-Net is designed to tackle three major challenges. To mitigate interference from complex backgrounds, the transformer-convolution module (TCM) couples global context with locality-preserving convolution to suppress non-leaf regions. To reduce information loss during downsampling, the raw-scale feature recalling and sampling (RSFRS) block combines bilinear resampling and convolution to preserve fine spatial detail. To handle variations in lesion scale and feature shifts, the deformable alignment block with FPN (DFPN) employs offset-based alignment and multi-receptive-field perception to strengthen multi-scale fusion. Experimental results show that on the in-field split of the Daylily-Leaf dataset, TCLeaf-Net improves mAP@50 by 5.4 percentage points over the baseline model, reaching 78.2\%, while reducing computation by 7.5 GFLOPs and GPU memory usage by 8.7\%. Moreover, the model outperforms recent YOLO and RT-DETR series in both precision and recall, and demonstrates strong performance on the PlantDoc, Tomato-Leaf, and Rice-Leaf datasets, validating its robustness and generalizability to other plant disease detection scenarios.

[125] VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding

Yufei Yin,Qianke Meng,Minghao Chen,Jiajun Ding,Zhenwei Shao,Zhou Yu

Main category: cs.CV

TL;DR: 提出VideoARM，一种基于分层记忆的智能体推理范式，用于高效、自适应的长视频理解，显著降低token消耗并提升性能。

Details

Motivation: 现有长视频理解方法依赖手工设计的推理流程或高token消耗的预处理，难以实现高效且自主的推理。 Method: 引入VideoARM，通过智能体在观察、思考、行动和记忆间的自适应循环进行端到端推理；采用控制器自主调用工具以粗到细方式解析视频，并构建分层多模态记忆以支持决策。 Result: 在主流基准上超越了当前最优方法DVD，同时显著降低了处理长视频时的token消耗。 Conclusion: VideoARM通过代理式、层次化记忆机制实现了高效、灵活的长视频理解，为减少token开销和提升推理自主性提供了有效方案。 Abstract: Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.

[126] STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

Peixuan Zhang,Zijian Jia,Kaiqi Liu,Shuchen Weng,Si Li,Boxin Shi

Main category: cs.CV

TL;DR: 本文提出了STAGE，一种基于故事板引导的多镜头视频生成工作流，通过结构化关键帧预测和记忆机制提升跨镜头一致性与电影语言表达能力。

Details

Motivation: 现有关键帧方法在多镜头视频生成中难以保持跨镜头一致性并捕捉电影语言，缺乏对叙事结构的有效建模。 Method: 提出STEP2模型预测每镜头的起止帧作为结构化故事板，引入多镜头记忆包保证长程实体一致性，采用双编码策略增强镜头内连贯性，并设计两阶段训练学习电影级镜头间转换；同时构建大规模ConStoryBoard数据集。 Result: 实验表明，STAGE在结构化叙事控制和跨镜头连贯性方面显著优于现有方法，生成视频在视觉质量和叙事逻辑上更优。 Conclusion: STAGE通过结构化故事板建模有效提升了多镜头视频生成的质量与可控性，为生成具有电影语言的连贯叙事视频提供了新思路。 Abstract: While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.

[127] V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping

Hyunkoo Lee,Wooseok Jang,Jini Yang,Taehwan Kim,Sangoh Kim,Sangwon Jung,Seungryong Kim

Main category: cs.CV

TL;DR: V-Warper是一种无需训练的粗到精视频个性化框架，通过利用参考图像和推理时的细粒度外观注入，提升主体身份一致性，同时无需大规模视频微调。

Details

Motivation: 现有视频个性化方法依赖重训练或大规模视频数据，计算成本高且难以保持帧间细粒度外观一致性。 Method: 提出V-Warper，包含两个阶段：1）轻量级粗粒度外观适配，使用图像级LoRA和主体嵌入适配编码全局身份；2）推理时细粒度外观注入，通过RoPE-free中层查询-键特征计算语义对应关系，引导值表示的形变对齐，并用掩码保证空间可靠性。 Result: V-Warper在无需视频微调的情况下显著提升了外观保真度，同时保持文本提示对齐和运动动态。 Conclusion: V-Warper为Transformer-based视频扩散模型提供了一种高效、无需训练的个性化方案，在身份一致性和视觉质量方面表现优越。 Abstract: Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based finetuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This step encodes global subject identity through image-only LoRA and subject-embedding adaptation. (2) A inference-time fine appearance injection stage refines visual fidelity by computing semantic correspondences from RoPE-free mid-layer query--key features. These correspondences guide the warping of appearance-rich value representations into semantically aligned regions of the generation process, with masking ensuring spatial reliability. V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and it achieves these gains efficiently without large-scale video finetuning.

[128] M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction

Junqiao Fan,Yunjiao Zhou,Yizhuo Yang,Xinyuan Cui,Jiarui Zhang,Lihua Xie,Jianfei Yang,Chris Xiaoxuan Lu,Fangqiang Ding

Main category: cs.CV

TL;DR: M4Human是目前最大规模的多模态人体网格重建基准数据集，包含66.1万帧高分辨率毫米波雷达、RGB和深度数据，支持多种动作和高质量动捕标注，推动非可见光下的人体建模研究。

Details

Motivation: 现有基于视觉的人体网格重建受限于遮挡、光照变化和隐私问题，而当前雷达数据集存在标签稀疏、规模小和动作单一的问题，亟需一个大规模、多模态、多样化的数据集来推动该领域发展。 Method: 提出M4Human数据集，采集高分辨率毫米波雷达（提供原始雷达张量和点云）、RGB和深度数据，结合高质量动捕系统生成3D网格和全局轨迹，覆盖20名受试者和50种多样化动作，并建立基于雷达和多模态融合的基准模型。 Result: 构建了包含661K帧的大型数据集（为此前最大的9倍），提供了不同粒度的雷达信号表示，实现了对复杂动作（如自由空间运动）的支持，实验验证了雷达在人体建模中的潜力，同时揭示了快速无约束运动下的建模挑战。 Conclusion: M4Human显著推进了基于雷达的人体网格重建研究，为多模态感知提供了重要资源，尽管在快速运动下仍存在挑战，但其大规模与多样性为未来研究奠定了基础。 Abstract: Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale (661K-frame) ($9\times$ prior largest) multimodal benchmark, featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.

[129] Speedrunning ImageNet Diffusion

Swayam Bhanded

Main category: cs.CV

TL;DR: SR-DiT 是一个高效集成 token 路由、架构改进和训练优化的扩散 Transformer 框架，在仅 140M 参数和 400K 迭代下达到 ImageNet-256 上的 SOTA 性能（FID 3.49，KDD 0.319），媲美更大模型的结果。

Details

Motivation: 现有加速扩散 Transformer 的技术多被孤立研究，缺乏对多种方法组合协同效应的系统探索。 Method: 在表征对齐基础上，系统性整合 token 路由、架构改进和训练策略调整，构建 SR-DiT 框架。 Result: 在 ImageNet-256 上以 140M 参数、400K 迭代实现 FID 3.49 和 KDD 0.319，性能匹敌 685M 参数模型；通过消融实验揭示关键技术组合的协同与冲突。 Conclusion: SR-DiT 实现了小模型下的高性能生成，验证了多技术融合的有效性，为后续研究提供了计算友好的基线框架。 Abstract: Recent advances have significantly improved the training efficiency of diffusion transformers. However, these techniques have largely been studied in isolation, leaving unexplored the potential synergies from combining multiple approaches. We present SR-DiT (Speedrun Diffusion Transformer), a framework that systematically integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.49 and KDD 0.319 on ImageNet-256 using only a 140M parameter model at 400K iterations without classifier-free guidance - comparable to results from 685M parameter models trained significantly longer. To our knowledge, this is a state-of the-art result at this model size. Through extensive ablation studies, we identify which technique combinations are most effective and document both synergies and incompatibilities. We release our framework as a computationally accessible baseline for future research.

[130] ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States

Haowen Wang,Xiaoping Yuan,Fugang Zhang,Rui Jian,Yuanwei Zhu,Xiuquan Qiao,Yakun Huang

Main category: cs.CV

TL;DR: 本文提出了ArtGen，一种基于条件扩散的生成框架，能够从单视角图像或文本描述中生成具有精确几何形状和一致运动学结构的可动3D物体。

Details

Motivation: 现有生成模型依赖于单视角输入且通常表示闭合状态，导致几何形状与关节动态纠缠，产生模糊或不真实的运动学结构。 Method: ArtGen采用跨状态蒙特卡洛采样以增强全局运动学一致性，引入思维链推理模块推断结构先验（如部件语义、关节类型和连接性），并使用稀疏专家Diffusion Transformer处理多样化的运动交互；同时结合具有局部-全局注意力的复合3D-VAE潜在先验来捕捉细粒度几何与部件间关系。 Result: 在PartNet-Mobility基准上的大量实验表明，ArtGen显著优于当前最先进的方法。 Conclusion: ArtGen有效解耦了几何形状与运动学动态，实现了高质量、运动一致的可动3D对象生成。 Abstract: Generating articulated assets is crucial for robotics, digital twins, and embodied intelligence. Existing generative models often rely on single-view inputs representing closed states, resulting in ambiguous or unrealistic kinematic structures due to the entanglement between geometric shape and joint dynamics. To address these challenges, we introduce ArtGen, a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics from single-view images or text descriptions at arbitrary part-level states. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency, reducing structural-motion entanglement. Additionally, we integrate a Chain-of-Thought reasoning module to infer robust structural priors, such as part semantics, joint types, and connectivity, guiding a sparse-expert Diffusion Transformer to specialize in diverse kinematic interactions. Furthermore, a compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships. Extensive experiments on the PartNet-Mobility benchmark demonstrate that ArtGen significantly outperforms state-of-the-art methods.

[131] A Graph Attention Network-Based Framework for Reconstructing Missing LiDAR Beams

Khalfalla Awedat,Mohamed Abidalrekab,Mohammad El-Yabroudi

Main category: cs.CV

TL;DR: 提出了一种基于图注意力网络（GAT）的框架，仅使用当前LiDAR帧即可重建因传感器退化导致的垂直通道缺失，无需相机图像或时间信息。

Details

Motivation: 垂直波束丢失会严重影响自动驾驶中的3D感知，现有方法依赖额外模态或时序信息，限制了实用性。 Method: 将LiDAR扫描建模为非结构化空间图，以点为节点、邻近点连接为边，并保留原始波束索引顺序；通过多层GAT学习局部几何邻域的自适应注意力权重，直接回归缺失位置的高程值。 Result: 在1,065个KITTI序列上测试，平均高度RMSE为11.67厘米，87.98%的重建点误差小于10厘米；单GPU推理耗时14.65秒/帧，且对不同邻域大小k具有稳定性。 Conclusion: 纯基于图注意力的模型仅利用原始点云几何即可有效恢复真实传感器退化下的垂直波束缺失，具备鲁棒性和应用潜力。 Abstract: Vertical beam dropout in spinning LiDAR sensors triggered by hardware aging, dust, snow, fog, or bright reflections removes entire vertical slices from the point cloud and severely degrades 3D perception in autonomous vehicles. This paper proposes a Graph Attention Network (GAT)-based framework that reconstructs these missing vertical channels using only the current LiDAR frame, with no camera images or temporal information required. Each LiDAR sweep is represented as an unstructured spatial graph: points are nodes and edges connect nearby points while preserving the original beam-index ordering. A multi-layer GAT learns adaptive attention weights over local geometric neighborhoods and directly regresses the missing elevation (z) values at dropout locations. Trained and evaluated on 1,065 raw KITTI sequences with simulated channel dropout, the method achieves an average height RMSE of 11.67 cm, with 87.98% of reconstructed points falling within a 10 cm error threshold. Inference takes 14.65 seconds per frame on a single GPU, and reconstruction quality remains stable for different neighborhood sizes k. These results show that a pure graph attention model operating solely on raw point-cloud geometry can effectively recover dropped vertical beams under realistic sensor degradation.

[132] ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics

Tue-Thu Van-Dinh,Hoang-Duy Tran,Truong-Binh Duong,Mai-Hanh Pham,Binh-Nam Le-Nguyen,Quoc-Thai Nguyen

Main category: cs.CV

TL;DR: ViInfographicVQA是首个针对越南语信息图视觉问答的基准，包含6747张真实信息图和20409个人工验证的问答对，提出单图和多图推理任务，揭示了现有多模态模型在跨图整合和非跨度推理上的局限性。

Details

Motivation: 现有的视觉问答系统在处理富含文本、图表和复杂布局的信息图时表现不足，尤其是在越南语等低资源语言中缺乏专门的基准来评估模型在OCR、布局理解及跨图像推理方面的能力。 Method: 构建了一个包含超过6747张真实世界信息图和20409个经人工验证的问答对的大规模数据集ViInfographicVQA，并设计了单图像（Single-image）和多图像（Multi-image）两种评估设置，用于测试模型在单一信息图理解和跨多个相关图像进行证据整合方面的能力。 Result: 评估了多种最新的视觉-语言模型，发现它们在多图像任务上表现显著较差，特别是在需要跨图像整合和非跨度推理的问题上错误最多，显示出当前模型在处理复杂布局和多图推理方面的不足。 Conclusion: ViInfographicVQA为越南语信息图VQA提供了重要基准，揭示了当前多模态模型在低资源语言和复杂视觉推理任务中的局限，推动未来研究关注布局感知和跨图像推理方法的发展。 Abstract: Infographic Visual Question Answering (InfographicVQA) evaluates a model's ability to read and reason over data-rich, layout-heavy visuals that combine text, charts, icons, and design elements. Compared with scene-text or natural-image VQA, infographics require stronger integration of OCR, layout understanding, and numerical and semantic reasoning. We introduce ViInfographicVQA, the first benchmark for Vietnamese InfographicVQA, comprising over 6747 real-world infographics and 20409 human-verified question-answer pairs across economics, healthcare, education, and more. The benchmark includes two evaluation settings. The Single-image task follows the traditional setup in which each question is answered using a single infographic. The Multi-image task requires synthesizing evidence across multiple semantically related infographics and is, to our knowledge, the first Vietnamese evaluation of cross-image reasoning in VQA. We evaluate a range of recent vision-language models on this benchmark, revealing substantial performance disparities, with the most significant errors occurring on Multi-image questions that involve cross-image integration and non-span reasoning. ViInfographicVQA contributes benchmark results for Vietnamese InfographicVQA and sheds light on the limitations of current multimodal models in low-resource contexts, encouraging future exploration of layout-aware and cross-image reasoning methods.

[133] BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation

Hangwei Zhang,Armando Teles Fortes,Tianyi Wei,Xingang Pan

Main category: cs.CV

TL;DR: 本文提出了BokehDepth，一个利用散焦作为几何线索的双阶段框架，用于改进单目深度估计和散景合成。

Details

Motivation: 现有的高质量散景渲染依赖于噪声较多的深度图，而单目度量深度模型在纹理较弱、距离远或几何模糊区域表现不佳，这些地方正是散焦线索最有用的地方。 Method: 第一阶段使用物理引导的可控散景生成器从单个清晰输入生成无深度的散景堆栈；第二阶段通过轻量级散焦感知聚合模块将特征沿散焦维度融合，并插入现有单目深度编码器中。 Result: 在多个具有挑战性的基准测试中，BokehDepth 提升了基于深度图的散景基线的视觉保真度，并持续提高了强单目深度基础模型的度量精度和鲁棒性。 Conclusion: BokehDepth 有效解耦了散景合成与深度预测，利用散焦作为无需额外监督的几何线索，显著提升了深度估计和散景渲染的质量。 Abstract: Bokeh and monocular depth estimation are tightly coupled through the same lens imaging geometry, yet current methods exploit this connection in incomplete ways. High-quality bokeh rendering pipelines typically depend on noisy depth maps, which amplify estimation errors into visible artifacts, while modern monocular metric depth models still struggle on weakly textured, distant and geometrically ambiguous regions where defocus cues are most informative. We introduce BokehDepth, a two-stage framework that decouples bokeh synthesis from depth prediction and treats defocus as an auxiliary supervision-free geometric cue. In Stage-1, a physically guided controllable bokeh generator, built on a powerful pretrained image editing backbone, produces depth-free bokeh stacks with calibrated bokeh strength from a single sharp input. In Stage-2, a lightweight defocus-aware aggregation module plugs into existing monocular depth encoders, fuses features along the defocus dimension, and exposes stable depth-sensitive variations while leaving downstream decoder unchanged. Across challenging benchmarks, BokehDepth improves visual fidelity over depth-map-based bokeh baselines and consistently boosts the metric accuracy and robustness of strong monocular depth foundation models.

[134] Endless World: Real-Time 3D-Aware Long Video Generation

Ke Zhang,Yiqun Mei,Jiacong Xu,Vishal M. Patel

Main category: cs.CV

TL;DR: Endless World是一个实时生成无限、3D一致视频的框架，通过条件自回归训练策略和全局3D感知注意力机制实现长时序下的视觉连贯性和几何稳定性。

Details

Motivation: 现有方法在生成长时序、稳定3D结构的视频方面存在挑战，尤其是在流式场景中难以保持一致性与效率。 Method: 提出条件自回归训练策略以对齐新生成内容与已有帧，并引入全局3D感知注意力机制提供跨时间的连续几何引导，确保物理合理性和几何一致性。 Result: 实验表明Endless World能生成长时间稳定且视觉连贯的视频，在视觉保真度和空间一致性上优于或媲美现有方法，支持单GPU实时推理。 Conclusion: Endless World为无限视频生成提供了高效、稳定的解决方案，推动了动态场景下长时序3D视频生成的发展。 Abstract: Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation.To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training overhead.Moreover, our Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene synthesis.Extensive experiments demonstrate that Endless World produces long, stable, and visually coherent videos, achieving competitive or superior performance to existing methods in both visual fidelity and spatial consistency. Our project has been available on https://bwgzk-keke.github.io/EndlessWorld/.

[135] Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings

Shengkai Xu,Hsiang Lun Kao,Tianxiang Xu,Honghui Zhang,Junqiao Wang,Runmeng Ding,Guanyu Liu,Tianyu Shi,Zhenyu Yu,Guofeng Pan,Ziqian Bi,Yuqi Ouyang

Main category: cs.CV

TL;DR: 本文提出AdaptiveDetector，一种结合YOLOv11检测器与视觉-语言模型（VLM）验证器的两阶段息肉检测框架，通过自适应阈值调整和代价敏感强化学习减少临床内镜中漏检风险。

Details

Motivation: 现有息肉检测模型在真实临床环境中因光照变化、运动模糊和遮挡等影响表现不佳，难以弥合实验室与实际应用之间的域差距。 Method: 提出一种检测器-验证器双阶段框架：YOLOv11作为检测器，在VLM指导下自适应调整每帧置信度阈值；VLM验证器采用组相对策略优化（GRPO）并设计非对称代价敏感奖励函数，以降低漏检率。同时构建包含常见临床退化情形的合成测试集用于零样本评估。 Result: 在合成退化的CVC-ClinicDB和Kvasir-SEG数据上进行零样本评估，召回率比单独YOLO提升14至22个百分点，精确率基本持平（±0.7~1.7点）。 Conclusion: 该方法通过自适应阈值与代价敏感强化学习实现了更符合临床需求的开放世界息肉检测，显著减少假阴性，有助于降低漏诊癌前息肉的风险，改善患者预后。 Abstract: Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-verifier framework comprising a YOLOv11 detector with a vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance, while the verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function specifically designed to discourage missed detections -- a critical clinical requirement. To enable realistic assessment under challenging conditions, we construct a comprehensive synthetic testbed by systematically degrading clean datasets with adverse conditions commonly encountered in clinical practice, providing a rigorous benchmark for zero-shot evaluation. Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images demonstrates that our approach improves recall by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline. This combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, thereby reducing the risk of missed precancerous polyps and improving patient outcomes.

[136] From Particles to Fields: Reframing Photon Mapping with Continuous Gaussian Photon Fields

Jiachen Tao,Benjamin Planche,Van Nguyen Nguyen,Junyi Wu,Yuchun Liu,Haoxuan Wang,Zhongpai Gao,Gengyu Zhang,Meng Zheng,Feiran Wang,Anwesa Choudhuri,Zhenghao Zhao,Weitai Kang,Terrence Chen,Yan Yan,Ziyan Wu

Main category: cs.CV

TL;DR: 提出高斯光子场（GPF），一种可学习的连续光子表示方法，通过将物理追踪的光子建模为各向异性的3D高斯基元，实现多视角下高效且准确的全局光照渲染，兼具光子映射的物理精度与神经场景表示的计算效率。

Details

Motivation: 传统光子映射在多视角渲染时因每个视点独立进行光子追踪和核估计而产生大量冗余计算，导致效率低下，难以满足高效高质量渲染需求。 Method: 提出高斯光子场（GPF），将首次SPPM迭代中物理追踪的光子初始化为位置、旋转、尺度和光谱参数化的各向异性3D高斯基元，并通过多视角最终辐射率监督进行优化，构建连续、可微的辐射率函数，实现无需重复光子追踪的相机光线渲染。 Result: 在包含焦散和镜面-漫反射交互等复杂光照的场景中，GPF实现了与光子级精度相当的结果，同时计算量降低数个数量级。 Conclusion: GPF成功融合了基于光子的物理渲染精度与神经场景表示的高效性，为多视角全局光照提供了一种高效、可微、可重用的解决方案。 Abstract: Accurately modeling light transport is essential for realistic image synthesis. Photon mapping provides physically grounded estimates of complex global illumination effects such as caustics and specular-diffuse interactions, yet its per-view radiance estimation remains computationally inefficient when rendering multiple views of the same scene. The inefficiency arises from independent photon tracing and stochastic kernel estimation at each viewpoint, leading to inevitable redundant computation. To accelerate multi-view rendering, we reformulate photon mapping as a continuous and reusable radiance function. Specifically, we introduce the Gaussian Photon Field (GPF), a learnable representation that encodes photon distributions as anisotropic 3D Gaussian primitives parameterized by position, rotation, scale, and spectrum. GPF is initialized from physically traced photons in the first SPPM iteration and optimized using multi-view supervision of final radiance, distilling photon-based light transport into a continuous field. Once trained, the field enables differentiable radiance evaluation along camera rays without repeated photon tracing or iterative refinement. Extensive experiments on scenes with complex light transport, such as caustics and specular-diffuse interactions, demonstrate that GPF attains photon-level accuracy while reducing computation by orders of magnitude, unifying the physical rigor of photon-based rendering with the efficiency of neural scene representations.

[137] Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Chengzhi Liu,Yuzhe Yang,Yue Fan,Qingyue Wei,Sheng Liu,Xin Eric Wang

Main category: cs.CV

TL;DR: 本文提出了一种新的动态多模态潜在推理框架DMLR，通过置信度引导的潜在策略梯度优化和动态视觉注入策略，实现推理与感知的动态交织，提升了多模态大模型的推理性能与效率。

Details

Motivation: 现有基于思维链（CoT）的多模态推理方法依赖显式逐步推理，存在感知-推理交互不稳定和计算开销大的问题。受人类非线性、动态交织的认知过程启发，本文旨在构建更高效、稳定的多模态推理机制。 Method: 提出DMLR框架：1）在测试时使用置信度引导的潜在策略梯度优化，优化潜在思维token以深化推理；2）引入动态视觉注入策略，在每个潜在思维token处检索最相关的视觉特征并更新最佳视觉块集合，实现动态的图文交织。 Result: 在七个多模态推理基准和多种模型架构上实验表明，DMLR显著提升了推理与感知性能，同时保持高推理效率。 Conclusion: DMLR通过模拟人类动态交织的认知过程，实现了更高效、稳定的多模态推理，为未来多模态大模型的设计提供了新思路。 Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.

[138] More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

Hoang Anh Just,Yifei Fan,Handong Zhao,Jiuxiang Gu,Ruiyi Zhang,Simon Jenni,Kushal Kafle,Ruoxi Jia,Jing Shi

Main category: cs.CV

TL;DR: PeRL-VL 是一种解耦框架，通过分别增强视觉感知和文本推理来改进基于可验证奖励的视觉-语言模型训练，显著提升多模态推理性能。

Details

Motivation: 现有RLVR训练的VLM在视觉提取和推理链一致性上存在缺陷，因奖励信号仅监督最终答案。 Method: 引入基于VLM的描述奖励以评估图像描述的忠实性和充分性，并增加纯文本推理SFT阶段以提升逻辑连贯性。 Result: 在多个多模态基准上，PeRL-VL将平均Pass@1准确率从63.3%提升至68.8%，优于基线与多种替代方法。 Conclusion: 解耦感知与推理训练能有效提升VLM的多模态推理能力，为RLVR框架提供了更优扩展路径。 Abstract: Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.

[139] Efficient Vision-Language Reasoning via Adaptive Token Pruning

Xue Li,Xiaonan Song,Henry Hu

Main category: cs.CV

TL;DR: 提出了一种名为自适应令牌剪枝（ATP）的动态推理机制，可在保持几乎无损精度的同时显著提升视觉-语言模型的推理效率，并增强模型鲁棒性。

Details

Motivation: 现有视觉-语言模型因均匀处理所有令牌而导致计算开销高，难以在实际中部署，尤其在资源受限的边缘设备上。 Method: ATP在视觉-语言接口处运行，结合ViT CLS注意力（模内显著性）和CLIP图文相似性（模间相关性）生成混合重要性得分，动态保留前K个最相关的视觉令牌输入到大语言模型，无需修改主干网络。 Result: 在VQAv2、GQA和COCO等数据集上，ATP减少了约40%的推理FLOPs，实现了约1.5倍的端到端延迟加速，精度损失小于1%；同时显示出对输入干扰更强的鲁棒性，抑制了虚假相关性。 Conclusion: ATP作为一种轻量级门控模块，兼容BLIP-2、LLaVA和Flamingo等主流架构，表明高效推理与模型可靠性可协同优化，有助于推动多模态模型在边缘计算场景中的应用。 Abstract: Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.

[140] SignRAG: A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition

Minghao Zhu,Zhihao Zhang,Anmol Sidhu,Keith Redmill

Main category: cs.CV

TL;DR: 提出了一种基于检索增强生成（RAG）的零样本道路标志识别框架，利用视觉语言模型和大语言模型实现高精度识别，无需任务特定训练。

Details

Motivation: 传统深度学习方法在面对大量道路标志类别和难以构建完整标注数据集时存在局限，亟需一种可扩展且无需大量标注的识别方法。 Method: 采用视觉语言模型（VLM）从输入图像生成文本描述，结合向量数据库检索最相关的参考标志候选，再利用大语言模型（LLM）进行推理以实现细粒度识别。 Result: 在303个Ohio MUTCD法规标志数据集上验证，理想参考图像下准确率达95.58%，真实道路数据下达82.45%。 Conclusion: 基于RAG的架构可用于构建可扩展、高精度的零样本道路标志识别系统，无需专门训练，具有实际应用潜力。 Abstract: Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.

[141] Advancing Cache-Based Few-Shot Classification via Patch-Driven Relational Gated Graph Attention

Tasweer Ahmad,Arindam Sikdar,Sandip Pradhan,Ardhendu Behera

Main category: cs.CV

TL;DR: 提出一种基于补丁关系的缓存适配方法，通过图注意力网络增强少样本图像分类性能，在11个基准上优于现有方法，并在战场伤员识别任务中验证了实用性。

Details

Motivation: 现有CLIP适配方法依赖全局特征，缺乏对局部判别信息的利用，难以应对少样本场景下的域偏移问题。 Method: 引入基于补丁的图注意力网络，构建补丁图并进行边感知注意力以强化关键补丁交互，通过多聚合池化生成任务判别性表示，并在训练中将关系结构蒸馏到缓存中。 Result: 在11个基准数据集上显著优于现有缓存和适配方法，推理无额外开销，并在新提出的伤员识别数据集上验证了实战价值。 Conclusion: 补丁驱动的关系精炼能有效提升少样本分类性能，兼顾零样本能力与实际部署效率。 Abstract: Few-shot image classification remains difficult under limited supervision and visual domain shift. Recent cache-based adaptation approaches (e.g., Tip-Adapter) address this challenge to some extent by learning lightweight residual adapters over frozen features, yet they still inherit CLIP's tendency to encode global, general-purpose representations that are not optimally discriminative to adapt the generalist to the specialist's domain in low-data regimes. We address this limitation with a novel patch-driven relational refinement that learns cache adapter weights from intra-image patch dependencies rather than treating an image embedding as a monolithic vector. Specifically, we introduce a relational gated graph attention network that constructs a patch graph and performs edge-aware attention to emphasize informative inter-patch interactions, producing context-enriched patch embeddings. A learnable multi-aggregation pooling then composes these into compact, task-discriminative representations that better align cache keys with the target few-shot classes. Crucially, the proposed graph refinement is used only during training to distil relational structure into the cache, incurring no additional inference cost beyond standard cache lookup. Final predictions are obtained by a residual fusion of cache similarity scores with CLIP zero-shot logits. Extensive evaluations on 11 benchmarks show consistent gains over state-of-the-art CLIP adapter and cache-based baselines while preserving zero-shot efficiency. We further validate battlefield relevance by introducing an Injured vs. Uninjured Soldier dataset for casualty recognition. It is motivated by the operational need to support triage decisions within the "platinum minutes" and the broader "golden hour" window in time-critical UAV-driven search-and-rescue and combat casualty care.

[142] Heart Disease Prediction using Case Based Reasoning (CBR)

Mohaiminul Islam Bhuiyan,Chan Hue Wah,Nur Shazwani Kamarudin,Nur Hafieza Ismail,Ahmad Fakhri Ab Nasir

Main category: cs.CV

TL;DR: 本研究比较了模糊逻辑、神经网络和基于案例的推理（CBR）在心脏病预测中的应用，最终选用CBR方法，在数据预处理和划分后，CBR达到97.95%的准确率，并分析性别、吸烟和饮酒等因素对心脏病的影响。

Details

Motivation: 传统医疗诊断依赖医生经验，准确性有限，因此需要引入智能系统提高心脏病预测的精确度。 Method: 采用三种智能系统方法——模糊逻辑、神经网络和基于案例的推理（CBR），对心脏病数据集进行预处理和数据分割，并比较三者在预测准确性上的表现，最终选择CBR模型进行预测。 Result: CBR模型在心脏病预测中达到97.95%的准确率；男性患病概率为57.76%，女性为42.24%；吸烟和饮酒被识别为男性患心脏病的重要因素。 Conclusion: 基于案例的推理（CBR）在心脏病预测中表现出高准确性，优于其他智能方法，且性别及相关生活习惯是影响心脏病风险的关键因素，具有临床辅助诊断价值。 Abstract: This study provides an overview of heart disease prediction using an intelligent system. Predicting disease accurately is crucial in the medical field, but traditional methods relying solely on a doctor's experience often lack precision. To address this limitation, intelligent systems are applied as an alternative to traditional approaches. While various intelligent system methods exist, this study focuses on three: Fuzzy Logic, Neural Networks, and Case-Based Reasoning (CBR). A comparison of these techniques in terms of accuracy was conducted, and ultimately, Case-Based Reasoning (CBR) was selected for heart disease prediction. In the prediction phase, the heart disease dataset underwent data pre-processing to clean the data and data splitting to separate it into training and testing sets. The chosen intelligent system was then employed to predict heart disease outcomes based on the processed data. The experiment concluded with Case-Based Reasoning (CBR) achieving a notable accuracy rate of 97.95% in predicting heart disease. The findings also revealed that the probability of heart disease was 57.76% for males and 42.24% for females. Further analysis from related studies suggests that factors such as smoking and alcohol consumption are significant contributors to heart disease, particularly among males.

[143] Generative Spatiotemporal Data Augmentation

Jinfan Zhou,Lixin Luo,Sungmin Eum,Heesung Kwon,Jeong Joon Park

Main category: cs.CV

TL;DR: 提出一种基于视频扩散模型的时空数据增强方法，通过生成多样化的3D空间和时间变化来扩充图像数据集，尤其在低数据场景（如无人机图像）中显著提升模型性能。

Details

Motivation: 现有数据增强方法多依赖简单的几何变换或外观扰动，难以生成真实且多样化的时空变化，限制了在数据稀缺场景下的模型训练效果。 Method: 利用现成的视频扩散模型从静态图像数据集中生成具有真实感的时空变化视频片段，并将这些合成视频作为额外训练数据；同时提供关于生成设置选择、标注迁移和遮挡区域处理的实用指南。 Result: 在COCO子集和无人机图像数据集上的实验表明，该方法能有效拓展数据分布，相比传统及以往生成方法，在低数据环境下 consistently 提升模型性能。 Conclusion: 所提出的时空数据增强方法通过引入视频生成模型，为数据稀缺场景提供了一种有效的性能提升手段，具备实际应用价值和推广潜力。 Abstract: We explore spatiotemporal data augmentation using video foundation models to diversify both camera viewpoints and scene dynamics. Unlike existing approaches based on simple geometric transforms or appearance perturbations, our method leverages off-the-shelf video diffusion models to generate realistic 3D spatial and temporal variations from a given image dataset. Incorporating these synthesized video clips as supplemental training data yields consistent performance gains in low-data settings, such as UAV-captured imagery where annotations are scarce. Beyond empirical improvements, we provide practical guidelines for (i) choosing an appropriate spatiotemporal generative setup, (ii) transferring annotations to synthetic frames, and (iii) addressing disocclusion - regions newly revealed and unlabeled in generated views. Experiments on COCO subsets and UAV-captured datasets show that, when applied judiciously, spatiotemporal augmentation broadens the data distribution along axes underrepresented by traditional and prior generative methods, offering an effective lever for improving model performance in data-scarce regimes.

[144] Towards Interactive Intelligence for Digital Humans

Yiyi Cai,Xuangeng Chu,Xiwei Gao,Sitong Gong,Yifei Huang,Caixin Kang,Kunhang Li,Haiyang Liu,Ruicong Liu,Yun Liu,Dianwen Ng,Zixiong Su,Erwin Wu,Yuhan Wu,Dingkun Yan,Tianyu Yan,Chang Zeng,Bo Zheng,You Zhou

Main category: cs.CV

TL;DR: 本文提出了“交互智能”这一新范式，并介绍了Mio框架，一个集认知推理与多模态实时具身化于一体的端到端数字人系统，显著提升了数字人在个性表达、自适应交互和自我进化方面的能力。

Details

Motivation: 现有的数字人技术多停留在表面模仿，缺乏一致性人格、自适应交互和持续进化的智能能力，因此需要一种新的范式来实现真正智能的交互。 Method: 提出Mio（Multimodal Interactive Omni-Avatar）框架，包含Thinker、Talker、Face Animator、Body Animator和Renderer五个模块，实现从认知到多模态表达的端到端集成，并构建新基准用于评估交互智能能力。 Result: 实验表明，该框架在所有评估维度上均优于现有最先进方法，显著提升了数字人的交互流畅性与一致性。 Conclusion: Mio推动了数字人从表层模仿向具备个性一致、自适应和自我进化能力的交互智能体转变，为未来智能交互系统提供了新方向。 Abstract: We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.

[145] Animus3D: Text-driven 3D Animation via Motion Score Distillation

Qi Sun,Can Wang,Jiaxiang Shang,Wensen Feng,Jing Liao

Main category: cs.CV

TL;DR: Animus3D是一种基于文本驱动的3D动画框架，通过提出Motion Score Distillation（MSD）方法，显著提升了3D资产动画化的运动幅度与细节质量。

Details

Motivation: 现有方法使用标准SDS目标从预训练视频扩散模型中蒸馏动作，但常导致运动不足或抖动问题，难以生成高质量3D动画。 Method: 提出Motion Score Distillation（MSD），采用LoRA增强的视频扩散模型，以静态源分布替代纯噪声，并结合反转式噪声估计技术保持外观一致性；引入时空正则化项减少几何畸变，并设计运动细化模块提升时间分辨率和细节表现。 Result: 实验表明，Animus3D能根据多样化文本提示生成比现有最先进方法更丰富、更精细的运动，同时保持高视觉保真度。 Conclusion: Animus3D通过新颖的MSD策略和精细化模块，在文本驱动3D动画生成任务中实现了更优的运动质量和视觉一致性。 Abstract: We present Animus3D, a text-driven 3D animation framework that generates motion field given a static 3D asset and text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution rather than pure noise as in SDS, while another inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at https://qiisun.github.io/animus3d_page.

[146] Anatomy Guided Coronary Artery Segmentation from CCTA Using Spatial Frequency Joint Modeling

Huan Huang,Michele Esposito,Chen Zhao

Main category: cs.CV

TL;DR: 提出了一种结合心肌解剖先验、结构感知特征编码和三维小波逆小波变换的冠状动脉分割框架，显著提升了在复杂几何条件下的分割稳定性与一致性。

Details

Motivation: 由于血管细小、分支复杂、边界模糊及心肌干扰，冠状动脉CT图像中的准确分割仍具挑战性，需提高分割精度以支持临床决策。 Method: 引入心肌解剖先验和残差注意力机制增强特征表达；采用小波-逆小波变换进行下采样和上采样，实现空间频率联合建模；通过多尺度特征融合模块整合语义与几何信息。 Result: 在ImageCAS数据集上取得Dice系数0.8082、敏感度0.7946、精确度0.8471、HD95为9.77mm，优于多个主流模型，消融实验验证各模块互补性。 Conclusion: 所提方法在复杂解剖条件下实现了更稳定一致的冠状动脉分割，为后续冠状动脉结构分析提供了可靠的分割结果。 Abstract: Accurate coronary artery segmentation from coronary computed tomography angiography is essential for quantitative coronary analysis and clinical decision support. Nevertheless, reliable segmentation remains challenging because of small vessel calibers, complex branching, blurred boundaries, and myocardial interference. We propose a coronary artery segmentation framework that integrates myocardial anatomical priors, structure aware feature encoding, and three dimensional wavelet inverse wavelet transformations. Myocardial priors and residual attention based feature enhancement are incorporated during encoding to strengthen coronary structure representation. Wavelet inverse wavelet based downsampling and upsampling enable joint spatial frequency modeling and preserve multi scale structural consistency, while a multi scale feature fusion module integrates semantic and geometric information in the decoding stage. The model is trained and evaluated on the public ImageCAS dataset using a 3D overlapping patch based strategy with a 7:1:2 split for training, validation, and testing. Experimental results demonstrate that the proposed method achieves a Dice coefficient of 0.8082, Sensitivity of 0.7946, Precision of 0.8471, and an HD95 of 9.77 mm, outperforming several mainstream segmentation models. Ablation studies further confirm the complementary contributions of individual components. The proposed method enables more stable and consistent coronary artery segmentation under complex geometric conditions, providing reliable segmentation results for subsequent coronary structure analysis tasks.

[147] Supervised Contrastive Frame Aggregation for Video Representation Learning

Shaif Chowdhury,Mushfika Rahman,Greg Hamerly

Main category: cs.CV

TL;DR: 提出一种利用全局时序上下文的监督对比学习框架，通过将多帧视频聚合为单张图像进行高效视频表示学习，在多个数据集上超越现有方法。

Details

Motivation: 现有视频表示学习方法计算开销大，且依赖复杂模型如视频Transformer，难以有效利用时序信息并易过拟合。 Method: 设计一种视频到图像的聚合策略，将同一视频的多帧空间排列成单张输入图像，使用预训练CNN（如ResNet50）作为骨干；构建监督对比学习目标，以相同标签的视频投影为正样本对，不同视频为负样本，并通过不同时间采样生成同一视频的多种自然视角以增强多样性。 Result: 在Penn Action和HMDB51数据集上优于现有方法，分别达到76%和48%的分类准确率（ViVIT为43%和37%），且计算资源需求更低。 Conclusion: 所提监督对比帧聚合方法能有效学习视频表示，适用于监督与自监督场景下的视频分类与字幕生成任务，兼具高性能与高效率。 Abstract: We propose a supervised contrastive learning framework for video representation learning that leverages temporally global context. We introduce a video to image aggregation strategy that spatially arranges multiple frames from each video into a single input image. This design enables the use of pre trained convolutional neural network backbones such as ResNet50 and avoids the computational overhead of complex video transformer models. We then design a contrastive learning objective that directly compares pairwise projections generated by the model. Positive pairs are defined as projections from videos sharing the same label while all other projections are treated as negatives. Multiple natural views of the same video are created using different temporal frame samplings from the same underlying video. Rather than relying on data augmentation these frame level variations produce diverse positive samples with global context and reduce overfitting. Experiments on the Penn Action and HMDB51 datasets demonstrate that the proposed method outperforms existing approaches in classification accuracy while requiring fewer computational resources. The proposed Supervised Contrastive Frame Aggregation method learns effective video representations in both supervised and self supervised settings and supports video based tasks such as classification and captioning. The method achieves seventy six percent classification accuracy on Penn Action compared to forty three percent achieved by ViVIT and forty eight percent accuracy on HMDB51 compared to thirty seven percent achieved by ViVIT.

[148] StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding

Xinqi Jin,Hanxun Yu,Bohan Yu,Kebin Liu,Jian Liu,Keda Tao,Yixuan Pei,Huan Wang,Fan Dang,Jiangchuan Liu,Weiqiang Wang

Main category: cs.CV

TL;DR: 本文提出了一种基于最大空间相邻视频令牌相似性（MSSAVT）的令牌剪枝方法，用于在线视频理解中的多模态大语言模型，有效降低计算开销并提升准确率。

Details

Motivation: 由于视频帧数量庞大，直接应用多模态大语言模型（MLLMs）会导致高GPU内存消耗和计算延迟，因此需要一种能减少上下文长度同时保留关键信息的方法。 Method: 提出MSSAVT冗余度量标准，结合令牌相似性和空间位置；设计掩码剪枝策略以解决剪枝与冗余之间的双向依赖；结合现有时间冗余剪枝方法，共同消除时空冗余。 Result: 在多个在线和离线视频理解基准上实验表明，该方法最多可提升4%的准确率，且剪枝延迟极低（小于1ms）。 Conclusion: 所提出的令牌剪枝方法在保持关键信息的同时显著降低了计算负担，有效提升了视频理解任务的效率与性能。 Abstract: Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in high GPU memory usage and computational latency. To address these challenges, we propose token pruning as a means to reduce context length while retaining critical information. Specifically, we introduce a novel redundancy metric, Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT), which accounts for both token similarity and spatial position. To mitigate the bidirectional dependency between pruning and redundancy, we further design a masked pruning strategy that ensures only mutually unadjacent tokens are pruned. We also integrate an existing temporal redundancy-based pruning method to eliminate temporal redundancy of the video modality. Experimental results on multiple online and offline video understanding benchmarks demonstrate that our method significantly improves the accuracy (i.e., by 4\% at most) while incurring a negligible pruning latency (i.e., less than 1ms). Our full implementation will be made publicly available.

[149] From Tokens to Photons: Test-Time Physical Prompting for Vison-Language Models

Boyeong Im,Wooseok Lee,Yoojin Kwon,Hyung-Sin Kim

Main category: cs.CV

TL;DR: 提出MVP框架，通过在推理时利用相机曝光三角（ISO、快门速度、光圈）作为物理提示，实现视觉-语言模型的测试时自适应，显著提升在传感器捕获环境下的鲁棒性。

Details

Motivation: 将视觉-语言模型从网络图像应用扩展到传感器感知的物理环境，解决传统数字增强在真实场景中鲁棒性不足的问题。 Method: MVP在推理时采集每个场景的多物理视角，使用源亲和性分数选择最优k组传感器设置，结合轻量级数字增强、低熵筛选和零温度softmax进行预测聚合，无需梯度或模型修改。 Result: 在ImageNet-ES及其多样化版本上，MVP比纯数字TTA最高提升25.6个百分点，并比结合传统传感器控制的TTA方法额外提升3.4个百分点，且在减少参数候选集时仍有效。 Conclusion: 测量时刻的物理视图选择与组合（而非仅后处理提示）能显著增强视觉-语言模型在真实环境中的适应性与鲁棒性，验证了物理提示的有效性。 Abstract: To extend the application of vision-language models (VLMs) from web images to sensor-mediated physical environments, we propose Multi-View Physical-prompt for Test-Time Adaptation (MVP), a forward-only framework that moves test-time adaptation (TTA) from tokens to photons by treating the camera exposure triangle--ISO, shutter speed, and aperture--as physical prompts. At inference, MVP acquires a library of physical views per scene, selects the top-k sensor settings using a source-affinity score, evaluates each retained view under lightweight digital augmentations, filters the lowest-entropy subset of augmented views, and aggregates predictions with Zero-temperature softmax (i.e., hard voting). This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP consistently outperforms digital-only TTA on single Auto-Exposure captures, by up to 25.6 percentage points (pp), and delivers up to 3.4 pp additional gains over pipelines that combine conventional sensor control with TTA. MVP remains effective under reduced parameter candidate sets that lower capture latency, demonstrating practicality. These results support the main claim that, beyond post-capture prompting, measurement-time control--selecting and combining real physical views--substantially improves robustness for VLMs.

[150] StegaVAR: Privacy-Preserving Video Action Recognition via Steganographic Domain Analysis

Lixin Chen,Chaomeng Chen,Jiale Zhou,Zhijian Wu,Xun Lin

Main category: cs.CV

TL;DR: 本文提出了一种名为StegaVAR的新框架，首次在隐写域中直接进行视频动作识别（VAR），通过将动作视频嵌入普通载体视频中，在保证传输隐蔽性的同时完整保留了秘密视频的时空信息。

Details

Motivation: 现有隐私保护方法在视频动作识别中存在隐蔽性差和时空特征破坏的问题，难以兼顾隐私与识别性能，因此需要一种既能有效隐藏信息又不损害关键动作特征的新方法。 Method: 提出了StegaVAR框架，结合Secret Spatio-Temporal Promotion（STeP）和Cross-Band Difference Attention（CroDA）模块，实现隐写域内的动作识别；STeP利用秘密视频指导特征提取，CroDA通过捕捉跨频带语义差异抑制载体干扰。 Result: 实验表明，StegaVAR在多个常用数据集上均取得了优异的动作识别和隐私保护性能，并且兼容多种隐写模型。 Conclusion: StegaVAR成功实现了高隐蔽性的视频传输与精准的动作识别，解决了传统匿名化方法在隐蔽性和特征完整性方面的缺陷，为隐私保护下的视频分析提供了新思路。 Abstract: Despite the rapid progress of deep learning in video action recognition (VAR) in recent years, privacy leakage in videos remains a critical concern. Current state-of-the-art privacy-preserving methods often rely on anonymization. These methods suffer from (1) low concealment, where producing visually distorted videos that attract attackers' attention during transmission, and (2) spatiotemporal disruption, where degrading essential spatiotemporal features for accurate VAR. To address these issues, we propose StegaVAR, a novel framework that embeds action videos into ordinary cover videos and directly performs VAR in the steganographic domain for the first time. Throughout both data transmission and action analysis, the spatiotemporal information of hidden secret video remains complete, while the natural appearance of cover videos ensures the concealment of transmission. Considering the difficulty of steganographic domain analysis, we propose Secret Spatio-Temporal Promotion (STeP) and Cross-Band Difference Attention (CroDA) for analysis within the steganographic domain. STeP uses the secret video to guide spatiotemporal feature extraction in the steganographic domain during training. CroDA suppresses cover interference by capturing cross-band semantic differences. Experiments demonstrate that StegaVAR achieves superior VAR and privacy-preserving performance on widely used datasets. Moreover, our framework is effective for multiple steganographic models.

[151] Automatic Wire-Harness Color Sequence Detector

Indiwara Nanayakkara,Dehan Jayawickrama,Mervyn Parakrama B. Ekanayake

Main category: cs.CV

TL;DR: 本文提出了一种半自动化的机器视觉系统，用于检测线束的正确性，包括导线位置、连接器极性和颜色序列，并在实际工业环境中实现了100%的检测准确率和44%的检测时间减少。

Details

Motivation: 线束检测目前仍依赖人工，容易出错且效率低，亟需一种自动化解决方案来提高电子制造服务行业的检测可靠性与效率。 Method: 采用五个工业级CMOS相机构建模块化机械结构，结合HSV和RGB颜色空间的颜色序列分类器，通过至少五个参考样本训练系统，并可保存模型用于同类线束检测。 Result: 系统在GPV Lanka Pvt. Ltd. 实际部署中实现了100%的检测准确率，检测时间比人工方法减少44%，并具备用户管理、可调照明、会话数据存储和安全登录等附加功能。 Conclusion: 该半自动机器视觉系统能有效提升线束检测的准确性与效率，适用于多种线束配置，具有良好的工业应用前景。 Abstract: Wire harness inspection process remains a labor-intensive process prone to errors in the modern Electronics Manufacturing Services (EMS) industry. This paper introduces a semiautomated machine vision system capable of verifying correct wire positioning, correctness of the connector polarity and correctness of color sequences for both linear and circular wire harness configurations. Five industrial standard CMOS cameras are integrated into a modularized mechanical framework in the physical structure of the solution and a HSV and RGB color domain value comparison based color sequence classifier is used in the operation. For each harness batch, a user can train the system using at least five reference samples; the trained file is stored and reused for similar harness types. The Solution is deployed at GPV Lanka Pvt. Ltd. (Fig. 2) and the system achieved 100% detection accuracy and reduced inspection time by 44% compared to manual methods. Additional features include user management, adjustable lighting, session data storage, and secure login. Results of this product usage in the real world situation demonstrate that this approach delivers reliable and efficient inspection capabilities.

[152] Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation

Karthikeya KV

Main category: cs.CV

TL;DR: 提出了一种结合视觉增强大语言模型与先进Transformer架构的新框架，通过校正流机制和双向分词策略，实现高质量、高分辨率的图像生成与多模态数据统一理解，相比扩散模型在清晰度上提升25%，计算成本降低20%。

Details

Motivation: 解决高分辨率图像合成与多模态数据解释中的效率与质量瓶颈，尤其是在噪声数据下的生成性能问题。 Method: 采用基于Transformer的架构，引入校正流机制以线性路径连接噪声与数据，结合双向分词策略融合文本、图像与视频输入，并利用时空特征嵌入与噪声感知学习算法优化生成过程。 Result: 在基准数据集上实现了图像分辨率清晰度提升25%，计算需求减少20%，且在生成质量与多模态一致性方面优于现有扩散模型。 Conclusion: 该框架显著提升了多模态生成模型的效率与适应性，展示了视觉增强大语言模型在计算机视觉与人工智能应用中的巨大潜力。 Abstract: This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data interpretation. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. A bidirectional tokenization strategy is employed to seamlessly merge inputs from text, image, and video modalities, fostering a unified understanding across diverse data types. By embedding spatial-temporal features and leveraging a hybrid text-image sequence modeling approach, the framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations. The architecture is optimized with a noise-aware learning algorithm, addressing discrepancies in noisy data distributions and improving generative performance under varying input conditions. Rigorous evaluations on benchmark datasets demonstrate a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods. Furthermore, the model exhibits robust scalability and adaptability, showcasing its potential in applications like autonomous systems, creative content generation, and advanced video analysis. This work underscores the role of vision-centric LLMs in redefining capabilities in computer vision and multimodal artificial intelligence.

[153] Content-Aware Ad Banner Layout Generation with Two-Stage Chain-of-Thought in Vision Language Models

Kei Yoshitake,Kento Hosono,Ken Kobayashi,Kazuhide Nakata

Main category: cs.CV

TL;DR: 本文提出了一种基于视觉-语言模型（VLM）的图像广告布局生成方法，通过理解图像语义内容生成更高质量的布局。

Details

Motivation: 传统广告布局依赖显著性检测，难以捕捉图像的语义和组成细节，导致文本和标志放置不合理。 Method: 利用VLM分析图像中的对象类型及其空间关系，生成文本形式的“放置计划”，再将其转化为HTML格式的布局代码。 Result: 实验表明，该方法在定量和定性评估中均优于现有方法，能生成更符合图像内容的高质量广告布局。 Conclusion: 结合VLM理解图像语义可有效提升广告布局的质量，为自动化广告设计提供了新思路。 Abstract: In this paper, we propose a method for generating layouts for image-based advertisements by leveraging a Vision-Language Model (VLM). Conventional advertisement layout techniques have predominantly relied on saliency mapping to detect salient regions within a background image, but such approaches often fail to fully account for the image's detailed composition and semantic content. To overcome this limitation, our method harnesses a VLM to recognize the products and other elements depicted in the background and to inform the placement of text and logos. The proposed layout-generation pipeline consists of two steps. In the first step, the VLM analyzes the image to identify object types and their spatial relationships, then produces a text-based "placement plan" based on this analysis. In the second step, that plan is rendered into the final layout by generating HTML-format code. We validated the effectiveness of our approach through evaluation experiments, conducting both quantitative and qualitative comparisons against existing methods. The results demonstrate that by explicitly considering the background image's content, our method produces noticeably higher-quality advertisement layouts.

[154] Geometry-Aware Scene-Consistent Image Generation

Cong Xie,Che Wang,Yan Zhang,Zheng Pan,Han Zou,Zhenpeng Zhan

Main category: cs.CV

TL;DR: 本文研究几何感知的场景一致性图像生成，提出了一种新的方法来平衡场景保持和文本提示遵循之间的权衡。

Details

Motivation: 现有方法在保持场景一致性和遵循文本提示之间难以平衡，要么高保真复制场景但对提示响应差，要么优先考虑提示一致性而牺牲了场景结构。因此需要一种能够同时保持场景结构并准确根据文本描述生成新实体的方法。 Method: 提出了两个关键贡献：一是构建了一个场景一致的数据构造管道，用于生成多样且具有几何基础的训练样本对；二是设计了一种新的几何引导注意力损失函数，利用多视角线索来约束模型的空间推理能力。 Result: 在自建的场景一致性基准上实验表明，该方法在自动指标和人类偏好测试中均优于现有的最先进基线方法，能生成几何上连贯、构图多样且忠实于文本指令和场景结构的图像。 Conclusion: 所提出的方法有效解决了场景保持与文本遵循之间的冲突，实现了更高质量的场景一致性图像生成。 Abstract: We study geometry-aware scene-consistent image generation: given a reference scene image and a text condition specifying an entity to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same physical environment as the reference scene while correctly generating the entity according to the spatial relation described in the text. Existing methods struggle to balance scene preservation with prompt adherence: they either replicate the scene with high fidelity but poor responsiveness to the prompt, or prioritize prompt compliance at the expense of scene consistency. To resolve this trade-off, we introduce two key contributions: (i) a scene-consistent data construction pipeline that generates diverse, geometrically-grounded training pairs, and (ii) a novel geometry-guided attention loss that leverages cross-view cues to regularize the model's spatial reasoning. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image consistency than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method produces geometrically coherent images with diverse compositions that remain faithful to the textual instructions and the underlying scene structure.

[155] No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching

Tingyan Wen,Haoyu Li,Yihuang Chen,Xing Zhou,Lifei Zhu,Xueqian Wang

Main category: cs.CV

TL;DR: X-Slim是一种无需训练的缓存加速框架，通过跨时间步、结构块和空间令牌的多级缓存机制，在保证生成质量的同时大幅提升扩散模型的推理速度。

Details

Motivation: 扩散模型虽然生成质量高，但计算开销大；现有缓存方法在速度与保真度之间存在权衡，难以兼顾效率与性能。 Method: 提出X-Slim框架，采用双阈值控制器实现‘先推后修’的缓存策略：先在时间步级别激进复用，再在块和令牌级别轻量刷新，并在越过临界线时触发完整推理以重置误差；上下文感知指标动态决定缓存时机与位置。 Result: 在FLUX.1-dev和HunyuanVideo上延迟降低达4.97倍和3.52倍，在DiT-XL/2上加速3.13倍且FID提升2.42，显著优于先前方法。 Conclusion: X-Slim首次统一利用跨时间步、结构和空间的冗余，有效突破了缓存复用中的速度-质量权衡，推动了扩散模型高效推理的前沿。 Abstract: Diffusion models achieve remarkable generative quality, but computational overhead scales with step count, model depth, and sequence length. Feature caching is effective since adjacent timesteps yield highly similar features. However, an inherent trade-off remains: aggressive timestep reuse offers large speedups but can easily cross the critical line, hurting fidelity, while block- or token-level reuse is safer but yields limited computational savings. We present X-Slim (eXtreme-Slimming Caching), a training-free, cache-based accelerator that, to our knowledge, is the first unified framework to exploit cacheable redundancy across timesteps, structure (blocks), and space (tokens). Rather than simply mixing levels, X-Slim introduces a dual-threshold controller that turns caching into a push-then-polish process: it first pushes reuse at the timestep level up to an early-warning line, then switches to lightweight block- and token-level refresh to polish the remaining redundancy, and triggers full inference once the critical line is crossed to reset accumulated error. At each level, context-aware indicators decide when and where to cache. Across diverse tasks, X-Slim advances the speed-quality frontier. On FLUX.1-dev and HunyuanVideo, it reduces latency by up to 4.97x and 3.52x with minimal perceptual loss. On DiT-XL/2, it reaches 3.13x acceleration and improves FID by 2.42 over prior methods.

[156] Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching

Wonseok Choi,Sohwi Lim,Nam Hyeon-Woo,Moon Ye-Bin,Dong-Ju Jeong,Jinyoung Hwang,Tae-Hyun Oh

Main category: cs.CV

TL;DR: 提出Patchify，一种无需微调的块级图像检索框架，通过局部特征与全局描述符匹配实现高效、可扩展且可解释的实例级图像检索，并引入LocScore度量评估空间定位准确性。

Details

Motivation: 实例级图像检索需应对对象在尺寸、位置和外观上的变化，现有方法在性能、可扩展性和可解释性方面存在不足。 Method: 将数据库图像划分为少量结构化块，利用全局查询描述符比较局部特征进行检索；引入LocScore指标评估检索区域的空间对齐情况；结合乘积量化实现大规模高效检索。 Result: 在多个基准、主干网络和区域选择策略上实验表明，Patchify优于全局方法，能有效补充最先进的重排序流程，且使用信息丰富的特征进行压缩显著提升性能。 Conclusion: Patchify提供了一种高性能、可扩展且可解释的图像检索方案，LocScore为分析和改进检索行为提供了有价值的诊断工具。 Abstract: Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance. Project website: https://wons20k.github.io/PatchwiseRetrieval/

Zihan Wang,Seungjun Lee,Guangzhao Dai,Gim Hee Lee

Main category: cs.CV

TL;DR: 提出动态3D视觉-语言-规划模型（D3D-VLP），通过动态3D思维链和协同学习策略，统一多任务并利用混合数据实现SOTA性能。

Details

Motivation: 解决具身智能体中端到端模型缺乏可解释性和3D推理、模块化系统忽略组件间依赖的问题。 Method: 引入动态3D思维链（3D CoT）统一规划、指代、导航和问答，并采用协同学习从碎片监督（SLFS）策略通过掩码自回归损失利用部分标注的混合数据进行训练。 Result: 在多个基准上达到SOTA，包括R2R-CE、REVERIE-CE、NavRAG-CE、HM3D-OVON和SG3D，并通过真实世界移动操作实验验证有效性。 Conclusion: D3D-VLP有效融合了端到端与模块化系统的优势，实现了可解释、强推理且协同训练的具身智能。 Abstract: Embodied agents face a critical dilemma that end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning methods such as RL and DAgger. Our D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate the effectiveness.

[158] DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Zhou Tao,Shida Wang,Yongxiang Hua,Haoyu Cao,Linli Xu

Main category: cs.CV

TL;DR: 本文提出了DiG（差异定位）框架，通过让多模态大语言模型识别并定位相似图像对之间的所有差异，提升其细粒度视觉感知和空间推理能力。

Details

Motivation: 现有的多模态大语言模型在细粒度视觉感知和精确空间推理方面存在局限。 Method: 提出DiG框架，结合基于3D渲染的数据生成 pipeline 和课程学习策略，训练模型识别未知数量的图像差异。 Result: 实验表明，DiG在RefCOCO系列等视觉感知基准上显著提升性能，且技能可迁移至下游任务。 Conclusion: 差异定位是一种可扩展且鲁棒的方法，有助于推进MLLM中的细粒度视觉推理。 Abstract: Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.

Hongyang Li,Junyi Tao,Qijie Wei,Ningzhi Yang,Meng Wang,Weihong Yu,Xirong Li

Main category: cs.CV

TL;DR: 本文提出了一种简单而有效的方法CARe，用于解决具有大视场差异的跨模态眼底图像配准问题，通过裁剪和双拟合对齐模块显著提升了配准性能。

Details

Motivation: 现有跨模态眼底图像配准方法假设视场差异较小，难以应对大视场差异的挑战性场景。 Method: 提出CARe方法，首先利用视网膜生理结构对宽视野彩色眼底照片进行裁剪，使其视场与OCTA源图像大致对齐；然后采用基于RANSAC和多项式坐标拟合的双拟合对齐模块优化空间变换。 Result: 在包含60对OCTA-wfCFP图像的新测试集上进行了大量实验，验证了CARe在大视场差异下跨模态配准的有效性。 Conclusion: CARe能有效处理大视场差异下的跨模态眼底图像配准问题，且可复用现有小视场差异方法，具有实用价值。 Abstract: Previous work on cross-modal fundus image registration (CMFIR) assumes small cross-modal Field-of-View (FoV) disparity. By contrast, this paper is targeted at a more challenging scenario with large FoV disparity, to which directly applying current methods fails. We propose Crop and Alignment for cross-modal fundus image Registration(CARe), a very simple yet effective method. Specifically, given an OCTA with smaller FoV as a source image and a wide-field color fundus photograph (wfCFP) as a target image, our Crop operation exploits the physiological structure of the retina to crop from the target image a sub-image with its FoV roughly aligned with that of the source. This operation allows us to re-purpose the previous small-FoV-disparity oriented methods for subsequent image registration. Moreover, we improve spatial transformation by a double-fitting based Alignment module that utilizes the classical RANSAC algorithm and polynomial-based coordinate fitting in a sequential manner. Extensive experiments on a newly developed test set of 60 OCTA-wfCFP pairs verify the viability of CARe for CMFIR.

[160] CogDoc: Towards Unified thinking in Documents

Qixin Xu,Haozhe Wang,Che Liu,Fangzhen Lin,Wenhu Chen

Main category: cs.CV

TL;DR: 提出CogDoc，一种从粗到细的统一推理框架，通过直接强化学习在7B模型上实现优于更大模型（如GPT-4o）的文档理解性能。

Details

Motivation: 现有文档推理方法在可扩展性和保真度之间存在权衡，难以同时处理长文本和捕捉细粒度多模态细节。 Method: 设计两阶段框架：低分辨率的“快速阅读”用于信息定位，高分辨率的“聚焦思考”用于深度推理，并采用直接强化学习进行后训练。 Result: 直接RL优于带SFT初始化的RL，避免了策略冲突；7B模型在视觉丰富的文档基准上达到同类最佳，超越GPT-4o等大模型。 Conclusion: CogDoc通过模拟人类认知过程，有效平衡了可扩展性与推理保真度，验证了直接RL在统一思维框架中的优越性。 Abstract: Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization,followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the "policy conflict" observed in SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.

[161] Anatomy-Guided Representation Learning Using a Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images

Muhammad Umar Farooq,Abd Ur Rehman,Azka Rehman,Muhammad Usman,Dong-Kyu Chae,Junaid Qadir

Main category: cs.CV

TL;DR: 提出了一种基于半监督多任务Transformer的网络SSMT-Net，用于超声图像中甲状腺结节的分割，结合未标注数据和多任务学习提升性能。

Details

Motivation: 解决现有深度学习模型在甲状腺结节分割中难以利用上下文信息、泛化能力差以及标注数据稀缺的问题。 Method: 设计SSMT-Net，包含无监督预训练阶段以增强特征提取能力，并在有监督阶段联合优化结节分割、腺体分割和结节大小估计三个任务。 Result: 在TN3K和DDTI数据集上表现优于现有最先进方法，具有更高的准确性和鲁棒性。 Conclusion: SSMT-Net能有效利用未标注数据并融合多任务学习，提升了甲状腺结节分割的性能，具备临床应用潜力。 Abstract: Accurate thyroid nodule segmentation in ultrasound images is critical for diagnosis and treatment planning. However, ambiguous boundaries between nodules and surrounding tissues, size variations, and the scarcity of annotated ultrasound data pose significant challenges for automated segmentation. Existing deep learning models struggle to incorporate contextual information from the thyroid gland and generalize effectively across diverse cases. To address these challenges, we propose SSMT-Net, a Semi-Supervised Multi-Task Transformer-based Network that leverages unlabeled data to enhance Transformer-centric encoder feature extraction capability in an initial unsupervised phase. In the supervised phase, the model jointly optimizes nodule segmentation, gland segmentation, and nodule size estimation, integrating both local and global contextual features. Extensive evaluations on the TN3K and DDTI datasets demonstrate that SSMT-Net outperforms state-of-the-art methods, with higher accuracy and robustness, indicating its potential for real-world clinical applications.

[162] InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

Sreehari Rajan,Kunal Bhosikar,Charu Sharma

Main category: cs.CV

TL;DR: 本文提出了一种名为InteracTalker的新框架，能够统一生成与语音和物体交互相协调的人体动作，通过多阶段训练和自适应融合策略，在生成逼真且具备对象感知的全身运动方面优于现有方法。

Details

Motivation: 现有的方法通常独立处理语音驱动手势或物体交互，缺乏综合数据集支持，限制了在真实场景中的应用，因此需要一个能同时考虑语言和物体交互的统一框架。 Method: 提出InteracTalker框架，采用多阶段训练学习统一的运动、语音和提示嵌入空间；构建包含详细物体交互标注的增强数据集；设计广义运动适配模块，并在推理时动态组合条件信号；引入自适应融合策略，在扩散采样过程中动态调整异构条件信号的权重。 Result: 该方法在语音伴随手势生成和物体交互合成任务上均优于先前方法，尤其超越了专注于手势的扩散模型，生成的动作更具现实感、灵活性和可控性。 Conclusion: InteracTalker成功整合了语音驱动与物体感知的运动生成，实现了更自然、真实的人类动作合成，为交互式数字体验提供了更强的技术支持。 Abstract: Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures or object interactions independently, limiting real-world applicability due to a lack of integrated, comprehensive datasets. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset, formed by augmenting an existing text-to-motion dataset with detailed object interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition, which is then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy, which dynamically reweights the conditioning signals during diffusion sampling. InteracTalker successfully unifies these previously separate tasks, outperforming prior methods in both co-speech gesture generation and object-interaction synthesis, outperforming gesture-focused diffusion methods, yielding highly realistic, object-aware full-body motions with enhanced realism, flexibility, and control.

[163] Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning

Haiyang Zheng,Nan Pu,Wenjing Li,Teng Long,Nicu Sebe,Zhun Zhong

Main category: cs.CV

TL;DR: 本文提出了一种用于开放世界DeepFake归因（OW-DFA）的置信度感知非对称学习框架（CAL），通过CCR和ACR组件缓解伪标签偏差，并引入动态原型剪枝（DPP）策略自动估计未知伪造类型数量，显著提升了在已知和未知伪造类型上的归因性能。

Details

Motivation: 现有OW-DFA方法存在置信度偏差和需预先知道未知伪造类型数量的不现实假设，限制了其在真实场景中的应用。 Method: 提出CAL框架，包含置信度感知一致性正则化（CCR）和非对称置信度强化（ACR），并结合动态原型剪枝（DPP）策略自动估计未知类别数。 Result: 在标准和扩展的OW-DFA基准上实验表明，所提方法在已知和未知伪造归因任务上均优于先前方法，达到新的SOTA性能。 Conclusion: CAL框架有效解决了伪标签偏差和未知类别数假设问题，提升了OW-DFA模型的鲁棒性和可扩展性，适用于更真实的开放世界场景。 Abstract: The proliferation of synthetic facial imagery has intensified the need for robust Open-World DeepFake Attribution (OW-DFA), which aims to attribute both known and unknown forgeries using labeled data for known types and unlabeled data containing a mixture of known and novel types. However, existing OW-DFA methods face two critical limitations: 1) A confidence skew that leads to unreliable pseudo-labels for novel forgeries, resulting in biased training. 2) An unrealistic assumption that the number of unknown forgery types is known *a priori*. To address these challenges, we propose a Confidence-Aware Asymmetric Learning (CAL) framework, which adaptively balances model confidence across known and novel forgery types. CAL mainly consists of two components: Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR). CCR mitigates pseudo-label bias by dynamically scaling sample losses based on normalized confidence, gradually shifting the training focus from high- to low-confidence samples. ACR complements this by separately calibrating confidence for known and novel classes through selective learning on high-confidence samples, guided by their confidence gap. Together, CCR and ACR form a mutually reinforcing loop that significantly improves the model's OW-DFA performance. Moreover, we introduce a Dynamic Prototype Pruning (DPP) strategy that automatically estimates the number of novel forgery types in a coarse-to-fine manner, removing the need for unrealistic prior assumptions and enhancing the scalability of our methods to real-world OW-DFA scenarios. Extensive experiments on the standard OW-DFA benchmark and a newly extended benchmark incorporating advanced manipulations demonstrate that CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.

[164] Progressive Conditioned Scale-Shift Recalibration of Self-Attention for Online Test-time Adaptation

Yushun Tang,Ziqiong Liu,Jiyuan Jia,Yi Zhang,Zhihai He

Main category: cs.CV

TL;DR: 提出一种渐进式条件缩放-偏移重校准（PCSR）方法，用于在线测试时自适应Transformer模型的自注意力机制，显著提升跨域图像分类性能。

Details

Motivation: 发现Transformer在跨域迁移时自注意力模块的Query、Key、Value特征变化显著，导致性能下降。 Method: 通过轻量级的域分离网络和因子生成网络，在每个Transformer层中在线学习域偏移特征，并动态预测自注意力重校准所需的缩放和平移参数。 Result: 在ImageNet-C等基准数据集上，分类准确率最高提升3.9%，显著优于现有在线测试时适应方法。 Conclusion: PCSR能有效缓解Transformer在在线测试时的域偏移问题，实现更鲁棒的跨域推理。 Abstract: Online test-time adaptation aims to dynamically adjust a network model in real-time based on sequential input samples during the inference stage. In this work, we find that, when applying a transformer network model to a new target domain, the Query, Key, and Value features of its self-attention module often change significantly from those in the source domain, leading to substantial performance degradation of the transformer model. To address this important issue, we propose to develop a new approach to progressively recalibrate the self-attention at each layer using a local linear transform parameterized by conditioned scale and shift factors. We consider the online model adaptation from the source domain to the target domain as a progressive domain shift separation process. At each transformer network layer, we learn a Domain Separation Network to extract the domain shift feature, which is used to predict the scale and shift parameters for self-attention recalibration using a Factor Generator Network. These two lightweight networks are adapted online during inference. Experimental results on benchmark datasets demonstrate that the proposed progressive conditioned scale-shift recalibration (PCSR) method is able to significantly improve the online test-time domain adaptation performance by a large margin of up to 3.9\% in classification accuracy on the ImageNet-C dataset.

[165] Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Yuran Wang,Bohan Zeng,Chengzhuo Tong,Wenxuan Liu,Yang Shi,Xiaochen Ma,Hao Liang,Yuanxing Zhang,Wentao Zhang

Main category: cs.CV

TL;DR: 本文提出了Scone，一种统一的理解-生成方法，用于提升多主体图像生成中的组合性与区分性，并引入了SconeEval基准来评估这两方面性能。

Details

Motivation: 现有方法在处理多主体图像生成时忽视了对不同主体的区分能力，限制了其在复杂真实场景中的应用。 Method: 提出Scone方法，通过理解专家作为语义桥梁，结合两阶段训练策略（先学习组合，再通过语义对齐和基于注意力的掩码增强区分），实现主体身份保持与干扰最小化。 Result: 实验表明，Scone在两个基准上均优于现有的开源模型，在组合与区分任务中表现更优。 Conclusion: Scone有效提升了多主体图像生成中的区分能力，推动了面向现实复杂场景的主体驱动图像生成发展。 Abstract: Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

[166] $β$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

Fatimah Zohra,Chen Zhao,Hani Itani,Bernard Ghanem

Main category: cs.CV

TL;DR: 本文提出了β-CLIP，一种多粒度文本条件对比学习框架，通过层次化对齐不同粒度的文本（从整段描述到句子和短语）与对应视觉区域，提升细粒度图文匹配性能。

Details

Motivation: CLIP在零样本图文检索中表现优异，但在细粒度任务上即使使用详细标注微调仍表现不足，本文旨在解决这一问题。 Method: β-CLIP采用交叉注意力动态聚合图像块以生成上下文化的视觉嵌入，并提出β-CAL损失函数，在查询特定匹配与图像内上下文松弛对齐之间进行权衡。 Result: 在Urban1K上达到91.8% T2I和92.3% I2T（R@1），在FG-OVD（Hard）上达到30.9%，优于无硬负样本训练的现有方法。 Conclusion: β-CLIP实现了更优的密集图文对齐，为密集视觉-语言对应关系建立了强大且自适应的基线。 Abstract: CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose $β$-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities-from full captions to sentences and phrases-and their corresponding visual regions. For each level of granularity, $β$-CLIP utilizes cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the $β$-Contextualized Contrastive Alignment Loss ($β$-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both soft Cross-Entropy and hard Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that $β$-CLIP significantly improves dense alignment: achieving 91.8% T2I 92.3% I2T at R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting state-of-the-art among methods trained without hard negatives. $β$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at https://github.com/fzohra/B-CLIP.

[167] Robust Motion Generation using Part-level Reliable Data from Videos

Boyuan Li,Sipeng Zheng,Bin Cao,Ruihua Song,Zongqing Lu

Main category: cs.CV

TL;DR: 提出一种基于可信部件的掩码自回归模型，用于从大规模网络视频中生成高质量的人体动作，解决因遮挡或出框导致的数据缺失问题。

Details

Motivation: 由于视频中人体部分常被遮挡或超出画面，导致动作数据不完整，影响动画生成质量与规模，需在保留数据规模的同时提升数据可用性。 Method: 将人体分为五个部分，识别视频帧中的“可信”可见部件；使用部件感知的变分自编码器将其编码为潜在标记；通过鲁棒的部件级掩码生成模型预测被掩码的可信部件，忽略噪声部件。 Result: 在干净和含噪数据集上均优于基线方法，提升了动作质量、语义一致性和多样性；并贡献了包含约20万真实动作序列的新基准K700-M。 Conclusion: 该方法有效利用不完整视频数据生成高质量人体动作，兼顾数据规模与生成性能，具备实际应用潜力。 Abstract: Extracting human motion from large-scale web videos offers a scalable solution to the data scarcity issue in character animation. However, some human parts in many video frames cannot be seen due to off-screen captures or occlusions. It brings a dilemma: discarding the data missing any part limits scale and diversity, while retaining it compromises data quality and model performance. To address this problem, we propose leveraging credible part-level data extracted from videos to enhance motion generation via a robust part-aware masked autoregression model. First, we decompose a human body into five parts and detect the parts clearly seen in a video frame as "credible". Second, the credible parts are encoded into latent tokens by our proposed part-aware variational autoencoder. Third, we propose a robust part-level masked generation model to predict masked credible parts, while ignoring those noisy parts. In addition, we contribute K700-M, a challenging new benchmark comprising approximately 200k real-world motion sequences, for evaluation. Experimental results indicate that our method successfully outperforms baselines on both clean and noisy datasets in terms of motion quality, semantic consistency and diversity. Project page: https://boyuaner.github.io/ropar-main/

[168] Spinal Line Detection for Posture Evaluation through Train-ing-free 3D Human Body Reconstruction with 2D Depth Images

Sehyun Kim,Hye Jun Lee,Jiwoo Lee,Changgyun Kim,Taemin Lee

Main category: cs.CV

TL;DR: 提出一种基于四方向深度图像的3D人体姿态分析系统，可高精度估计脊柱中心线和脊柱角，无需训练数据或复杂神经网络模型。

Details

Motivation: 现有基于多图像的方法设备昂贵、流程复杂，而单图像方法因遮挡和视角限制难以准确估计脊柱等内部结构。 Method: 融合四个方向的深度图像重建3D人体模型，采用全局与精细配准的分层匹配处理噪声和遮挡，应用自适应顶点缩减和细节层次集成（Level of Detail ensemble）提高网格分辨率与形状可靠性，实现脊柱中心线自动估计。 Result: 该方法在无训练数据和复杂神经网络的情况下实现了高精度的3D脊柱配准估计，验证结果表明配准质量显著提升。 Conclusion: 所提方法有效克服了多图像和单图像方法的局限性，实现了稳定、准确的脊柱角估计，适用于临床体态评估等实际应用。 Abstract: The spinal angle is an important indicator of body balance. It is important to restore the 3D shape of the human body and estimate the spine center line. Existing mul-ti-image-based body restoration methods require expensive equipment and complex pro-cedures, and single image-based body restoration methods have limitations in that it is difficult to accurately estimate the internal structure such as the spine center line due to occlusion and viewpoint limitation. This study proposes a method to compensate for the shortcomings of the multi-image-based method and to solve the limitations of the sin-gle-image method. We propose a 3D body posture analysis system that integrates depth images from four directions to restore a 3D human model and automatically estimate the spine center line. Through hierarchical matching of global and fine registration, restora-tion to noise and occlusion is performed. Also, the Adaptive Vertex Reduction is applied to maintain the resolution and shape reliability of the mesh, and the accuracy and stabil-ity of spinal angle estimation are simultaneously secured by using the Level of Detail en-semble. The proposed method achieves high-precision 3D spine registration estimation without relying on training data or complex neural network models, and the verification confirms the improvement of matching quality.

[169] GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Zhenya Yang,Zhe Liu,Yuxiang Lu,Liping Hou,Chenxuan Miao,Siyi Peng,Bailan Feng,Xiang Bai,Hengshuang Zhao

Main category: cs.CV

TL;DR: 本文提出了一种名为GenieDrive的新框架，用于物理感知的驾驶视频生成。该方法通过先生成包含丰富物理信息的4D占用场，并结合高效的VAE压缩和控制建模机制，实现了高质量、多视角一致且物理合理的驾驶视频生成。

Details

Motivation: 现有方法通常依赖单一扩散模型直接从驾驶动作生成视频，导致学习困难和物理不一致性。因此需要一种更具物理感知能力的生成框架。 Method: 首先生成4D占用场作为物理基础，使用基于潜在三平面表示的VAE进行高效压缩；引入互控注意力（MCA）建模控制对占用演化的影響，并端到端联合训练VAE与预测模块；在视频生成中采用归一化多视图注意力，利用4D占用指导多视图视频生成。 Result: 相比现有方法，潜变量大小减少至58%，预测mIoU提升7.2%，推理速度达41 FPS，参数量仅3.47M；视频生成FVD降低20.7%，显著提升视频质量与多视角一致性。 Conclusion: GenieDrive能够实现高可控性、多视角一致且物理感知的驾驶视频生成，在自动驾驶规划、异常数据合成和闭环评估中具有广泛应用潜力。 Abstract: Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.

[170] FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning

Yue Jiang,Dingkang Yang,Minghao Han,Jinghang Han,Zizhi Chen,Yizhou Liu,Mingcheng Li,Peng Zhai,Lihua Zhang

Main category: cs.CV

TL;DR: FysicsWorld是首个统一的全模态基准，支持图像、视频、音频和文本之间的双向输入输出，涵盖16项主要任务和3,268个样本，通过跨模态互补筛选策略构建数据，全面评估30多种先进模型在理解、生成和推理方面的性能差距。

Details

Motivation: 现有多模态基准在模态覆盖、交互方式和模态间互补性方面存在不足，缺乏支持任意到任意模态转换的统一评估框架。 Method: 提出FysicsWorld，支持多模态双向输入输出；设计跨模态互补筛选（CMCS）策略，在系统化数据构建框架中生成用于语音交互和融合依赖推理的全模态数据；整合40多个高质量来源的数据。 Result: 构建了包含16项任务和3,268个样本的全模态基准；在超过30种前沿模型上进行评估，揭示了各类模型在理解、生成与推理能力上的差距与局限。 Conclusion: FysicsWorld为评估和推动下一代全模态架构提供了统一基础和强基线，促进了多模态AI系统的全面发展。 Abstract: Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, restricted interaction to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy integrated in a systematic data construction framework that produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes the performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.

[171] CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence

Tianjiao Yu,Xinzhuo Li,Yifan Shen,Yuanzhe Liu,Ismini Lourentzou

Main category: cs.CV

TL;DR: CoRe3D提出了一种统一的3D理解与生成推理框架，通过语义与空间抽象的联合推理，实现语言意图到3D内容生成的精准映射。

Details

Motivation: 现有的多模态模型在语言和视觉任务中已展现出推理机制的有效性，但在3D领域的应用仍不充分，缺乏将高层语义与低层几何结构紧密结合的推理框架。 Method: CoRe3D引入一种空间锚定的推理表示方法，将3D潜在空间分解为局部区域，并结合语义链式推理与结构化空间推理，实现对几何结构的组合式、过程式推理。 Result: 该方法在3D内容生成任务中实现了更强的局部一致性，并显著提升了生成结果与语言描述之间的对齐精度。 Conclusion: CoRe3D验证了显式推理机制在3D理解与生成中的有效性，为空间智能与语言引导的3D建模提供了新思路。 Abstract: Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.

[172] Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior

Hao Wang,Ashish Bastola,Chaoyi Zhou,Wenhui Zhu,Xiwen Chen,Xuanzhao Dong,Siyu Huang,Abolfazl Razi

Main category: cs.CV

TL;DR: 本文提出了Fast-2DGS，一种轻量级的二维高斯点阵图像表示框架，通过引入深度高斯先验和属性回归网络，实现高质量、快速收敛的图像重建，显著降低了计算成本。

Details

Motivation: 现有的2D高斯点阵方法在初始化高斯分布时依赖随机或启发式策略，导致收敛慢，而基于学习的方法又增加了计算和架构复杂性，因此需要一种高效且低复杂度的解决方案。 Method: 提出了一种解耦架构，包括一个条件网络（Deep Gaussian Prior）用于捕捉不同复杂度下高斯基元的空间分布，以及一个属性回归网络用于预测密集的高斯属性。 Result: 该方法能在单次前向传播后仅需极少微调即可实现高质量重建，并显著降低计算开销，同时保持实时渲染能力。 Conclusion: Fast-2DGS在效率与质量之间取得了良好平衡，推动了2D高斯点阵技术向工业级部署迈进。 Abstract: As generative models become increasingly capable of producing high-fidelity visual content, the demand for efficient, interpretable, and editable image representations has grown substantially. Recent advances in 2D Gaussian Splatting (2DGS) have emerged as a promising solution, offering explicit control, high interpretability, and real-time rendering capabilities (>1000 FPS). However, high-quality 2DGS typically requires post-optimization. Existing methods adopt random or heuristics (e.g., gradient maps), which are often insensitive to image complexity and lead to slow convergence (>10s). More recent approaches introduce learnable networks to predict initial Gaussian configurations, but at the cost of increased computational and architectural complexity. To bridge this gap, we present Fast-2DGS, a lightweight framework for efficient Gaussian image representation. Specifically, we introduce Deep Gaussian Prior, implemented as a conditional network to capture the spatial distribution of Gaussian primitives under different complexities. In addition, we propose an attribute regression network to predict dense Gaussian properties. Experiments demonstrate that this disentangled architecture achieves high-quality reconstruction in a single forward pass, followed by minimal fine-tuning. More importantly, our approach significantly reduces computational cost without compromising visual quality, bringing 2DGS closer to industry-ready deployment.

[173] L-STEC: Learned Video Compression with Long-term Spatio-Temporal Enhanced Context

Tiange Zhang,Zhimeng Huang,Xiandong Meng,Kai Zhang,Zhipin Deng,Siwei Ma

Main category: cs.CV

TL;DR: 提出了一种基于长时空间-时间增强上下文（L-STEC）的神经视频压缩方法，通过引入LSTM和像素域的空间上下文融合，显著提升了压缩性能。

Details

Motivation: 现有神经视频压缩方法依赖前一帧特征进行时序预测，参考窗口短且仅传递特征信息，导致长期依赖缺失、细节丢失和误差累积。 Method: 提出L-STEC方法：1）使用LSTM扩展参考链以捕获长期依赖；2）引入 warped 空间上下文，结合多感受野网络融合时空信息，增强上下文表达。 Result: 在PSNR上相比DCVC-TCM节省37.01%码率，在MS-SSIM上节省31.65%，优于VTM-17.0和DCVC-FM，达到SOTA性能。 Conclusion: L-STEC通过融合长时特征与像素级空间上下文，有效提升预测精度与纹理保持能力，显著改善神经视频压缩效率。 Abstract: Neural Video Compression has emerged in recent years, with condition-based frameworks outperforming traditional codecs. However, most existing methods rely solely on the previous frame's features to predict temporal context, leading to two critical issues. First, the short reference window misses long-term dependencies and fine texture details. Second, propagating only feature-level information accumulates errors over frames, causing prediction inaccuracies and loss of subtle textures. To address these, we propose the Long-term Spatio-Temporal Enhanced Context (L-STEC) method. We first extend the reference chain with LSTM to capture long-term dependencies. We then incorporate warped spatial context from the pixel domain, fusing spatio-temporal information through a multi-receptive field network to better preserve reference details. Experimental results show that L-STEC significantly improves compression by enriching contextual information, achieving 37.01% bitrate savings in PSNR and 31.65% in MS-SSIM compared to DCVC-TCM, outperforming both VTM-17.0 and DCVC-FM and establishing new state-of-the-art performance.

[174] DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

Zhe Liu,Runhui Huang,Rui Yang,Siming Yan,Zining Wang,Lu Hou,Di Lin,Xiang Bai,Hengshuang Zhao

Main category: cs.CV

TL;DR: 本文提出了DrivePI，一种空间感知的4D多模态大语言模型，作为统一的视觉-语言-动作（VLA）框架，用于自动驾驶中的细粒度3D感知、预测和规划。

Details

Motivation: 尽管多模态大语言模型在多个领域表现出色，但在自动驾驶中生成细粒度3D感知和预测输出的应用仍待探索。 Method: 结合点云、多视角图像和语言指令，在统一的MLLM架构中进行端到端优化，同时实现空间理解、3D占据、占据流预测和动作规划，并构建数据引擎生成文本-占据和文本-流问答对。 Result: 在仅使用0.5B参数的Qwen2.5作为骨干的情况下，DrivePI在nuScenes-QA上比OpenDriveVLA-7B高2.5%准确率，碰撞率从0.37%降至0.11%；在3D占据任务上比FB-OCC提升10.3 RayIoU，在占据流上将mAVE从0.591降至0.509，在规划任务上L2误差比VAD降低32%（从0.72m降至0.49m）。 Conclusion: DrivePI作为一个统一的单模型，在性能上达到或超过了现有的VLA和专用VA模型，展示了其在自动驾驶中多模态融合与4D空间理解的潜力。 Abstract: Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI

[175] Learning Common and Salient Generative Factors Between Two Image Datasets

Yunlong He,Gwilherm Lesné,Ziqian Liu,Michaël Soumm,Pietro Gori

Main category: cs.CV

TL;DR: 本文提出了一种新的对比分析（Contrastive Analysis, CA）框架，用于分离两个图像数据集中共有的生成因子和特定于某一数据集的显著因子，适用于GAN和扩散模型，并在多种数据集上表现出优越的分离能力和生成质量。

Details

Motivation: 现有图像编辑方法多依赖属性标签作为监督信号，而本文关注一个较少研究的问题——仅利用数据集信号来分离共有和显著生成因子，提升无监督解耦表示的能力。 Method: 提出一种新颖的CA框架，通过设计适配的学习策略和损失函数，在GAN和扩散模型中同时学习共性和显著因子，实现高质量图像生成与有效因子分离。 Result: 在人脸、动物和医学图像等多种数据集上验证了方法的有效性，相比先前方法在因子分离能力和生成图像质量方面表现更优。 Conclusion: 该框架成功实现了无需细粒度属性标注的对比分析，为无监督解耦表示学习提供了新思路，并可广泛应用于不同生成模型。 Abstract: Recent advancements in image synthesis have enabled high-quality image generation and manipulation. Most works focus on: 1) conditional manipulation, where an image is modified conditioned on a given attribute, or 2) disentangled representation learning, where each latent direction should represent a distinct semantic attribute. In this paper, we focus on a different and less studied research problem, called Contrastive Analysis (CA). Given two image datasets, we want to separate the common generative factors, shared across the two datasets, from the salient ones, specific to only one dataset. Compared to existing methods, which use attributes as supervised signals for editing (e.g., glasses, gender), the proposed method is weaker, since it only uses the dataset signal. We propose a novel framework for CA, that can be adapted to both GAN and Diffusion models, to learn both common and salient factors. By defining new and well-adapted learning strategies and losses, we ensure a relevant separation between common and salient factors, preserving a high-quality generation. We evaluate our approach on diverse datasets, covering human faces, animal images and medical scans. Our framework demonstrates superior separation ability and image quality synthesis compared to prior methods.

[176] Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding

Yongyuan Liang,Xiyao Wang,Yuanchen Ju,Jianwei Yang,Furong Huang

Main category: cs.CV

TL;DR: Lemon是一种统一的Transformer架构，用于联合处理3D点云块和语言标记，实现早期空间-语言融合，提升参数效率和模型扩展性，在多项3D理解与推理任务中达到最先进性能。

Details

Motivation: 现有3D多模态模型依赖碎片化架构和模态特定编码器，存在训练不稳定、可扩展性差以及点云数据稀疏不规则带来的挑战。 Method: 提出Lemon，采用统一Transformer架构，将3D点云块和语言标记作为单一序列进行联合处理；设计结构化分块与分词方案以保留空间上下文，并采用三阶段训练策略逐步提升从物体识别到场景级空间推理的能力。 Result: Lemon在多个3D理解与推理任务上实现了最先进的性能，包括物体识别、描述生成和3D场景中的空间推理，且在模型规模和训练数据增加时表现出良好的可扩展性。 Conclusion: Lemon为3D空间智能提供了一个统一的基础框架，推动了真实场景中大规模多模态模型的发展。 Abstract: Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.

[177] Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners

N. K. B. M. P. K. B. Narasinghe,Uthayasanker Thayasivam

Main category: cs.CV

TL;DR: 本文研究了在极端数据稀缺情况下如何有效微调对比字幕模型（CoCa）的视觉主干用于少样本图像分类，系统评估了从无训练到深度参数调整的各种策略，并揭示了数据增强、损失函数和正则化对性能的影响。

Details

Motivation: 尽管CoCa等生成-对比混合大模型在零样本迁移中表现优异，但其在极低数据场景下的适应性尚未被充分探索，尤其是与CLIP不同的潜在空间对参数高效微调的响应机制仍不清楚。 Method: 系统评估了一系列适应策略，包括无需训练的混合原型方法和基于LoRA的深度参数微调，并结合SupCon损失与Cross-Entropy进行对比分析，同时研究数据增强、正则化强度、秩选择和采样策略的影响。 Result: 发现强数据增强会降低线性探测性能但有助于稳定LoRA训练（即'增强发散'现象），引入SupCon损失能持续提升各类少样本设置下的分类准确率，并明确了不同训练配置对数据稀缺的敏感性。 Conclusion: 为生成-对比型基础模型在极端数据稀缺场景下的高效适应提供了实证指导，建议根据样本数量动态调整正则化、秩和数据增强策略以实现最优性能。 Abstract: Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.

[178] Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal

Weihan Xu,Kan Jen Cheng,Koichi Saito,Muhammad Jehanzeb Mirza,Tingle Li,Yisi Liu,Alexander H. Liu,Liming Wang,Masato Ishii,Takashi Shibuya,Yuki Mitsufuji,Gopala Anumanchipalli,Paul Pu Liang

Main category: cs.CV

TL;DR: 提出了一种新的端到端音视频联合编辑模型SAVE，基于新构建的配对数据集SAVEBench，利用薛定谔桥实现音频与视觉内容的同步编辑，显著提升了编辑后内容的时序对齐与语义一致性。

Details

Motivation: 现有音视频编辑方法难以保持跨模态的一致性和时序同步，且缺乏合适的配对训练数据，因此需要一个能够联合编辑并保持模态对齐的框架。 Method: 构建了带有文本和掩码条件的配对音视频数据集SAVEBench，并提出了基于流匹配的端到端模型SAVE，引入薛定谔桥机制实现源到目标音视频混合的直接传输。 Result: 实验表明，SAVE在去除目标对象的同时更好保留了背景内容，并在时序同步和音视频语义对应性上优于分离式音视频编辑组合方法。 Conclusion: SAVE通过统一建模实现了高质量的联合音视频编辑，为多模态内容编辑提供了有效解决方案。 Abstract: Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence compared with pairwise combinations of an audio editor and a video editor.

[179] Cross-Level Sensor Fusion with Object Lists via Transformer for 3D Object Detection

Xiangzhong Liu,Jiajie Zhang,Hao Shen

Main category: cs.CV

TL;DR: 本文提出了一种基于Transformer的端到端跨层级融合方法，将高抽象级别的目标列表与原始图像数据结合用于3D目标检测，通过引入可学习查询和可变形高斯掩码提升性能，并设计伪目标列表生成方法以支持训练，在nuScenes数据集上显著优于纯视觉基线。

Details

Motivation: 现有传感器融合系统中，智能传感器和V2X模块通常只提供处理后的目标列表而非原始数据，导致难以有效融合原始感知信息；缺乏统一的跨层级（原始数据与抽象对象）融合框架，限制了检测性能的进一步提升。 Method: 提出一种基于Transformer的端到端跨层级融合方法：将目标列表作为去噪查询输入Transformer，并与可学习查询共同参与特征聚合；设计可变形高斯掩码，利用目标列表中的位置和尺寸先验信息引导解码器注意力聚焦于感兴趣区域；为解决缺乏含目标列表模态的公开数据集问题，提出通过在真实标注框上模拟状态噪声及误检漏检来生成伪目标列表用于训练。 Result: 在nuScenes数据集上，所提方法显著优于基于视觉的基线模型；验证了模型对不同噪声水平的伪目标列表以及真实检测器输出的良好泛化能力；引入的可变形高斯掩码有效加速了模型训练收敛。 Conclusion: 本文是首个实现原始图像与抽象目标列表跨层级融合的工作，所提出的Transformer架构结合去噪查询与可变形高斯掩码能有效整合多级感知信息，提升了3D目标检测性能，且具备良好的鲁棒性与泛化性，为未来车载多源感知融合提供了新思路。 Abstract: In automotive sensor fusion systems, smart sensors and Vehicle-to-Everything (V2X) modules are commonly utilized. Sensor data from these systems are typically available only as processed object lists rather than raw sensor data from traditional sensors. Instead of processing other raw data separately and then fusing them at the object level, we propose an end-to-end cross-level fusion concept with Transformer, which integrates highly abstract object list information with raw camera images for 3D object detection. Object lists are fed into a Transformer as denoising queries and propagated together with learnable queries through the latter feature aggregation process. Additionally, a deformable Gaussian mask, derived from the positional and size dimensional priors from the object lists, is explicitly integrated into the Transformer decoder. This directs attention toward the target area of interest and accelerates model training convergence. Furthermore, as there is no public dataset containing object lists as a standalone modality, we propose an approach to generate pseudo object lists from ground-truth bounding boxes by simulating state noise and false positives and negatives. As the first work to conduct cross-level fusion, our approach shows substantial performance improvements over the vision-based baseline on the nuScenes dataset. It demonstrates its generalization capability over diverse noise levels of simulated object lists and real detectors.

[180] Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

Han Liu,Bogdan Georgescu,Yanbo Zhang,Youngjin Yoo,Michael Baumgartner,Riqiang Gao,Jianing Wang,Gengyan Zhao,Eli Gibson,Dorin Comaniciu,Sasa Grbic

Main category: cs.CV

TL;DR: 本文提出了AnyMC3D，一种从2D基础模型迁移而来的可扩展3D医学图像分类框架，通过仅添加轻量级插件即可高效适应新任务，解决了现有研究中的数据偏倚、适应不佳和任务覆盖不足问题，并在12项多样化任务上验证了其优越性能。

Details

Motivation: 现有3D医学图像分类方法存在数据体制偏差、适应性差和任务覆盖不全三大问题，限制了医学基础模型的广泛应用。 Method: 提出AnyMC3D框架，基于单一冻结的2D基础模型，引入轻量级插件（每任务约1M参数），支持多视图输入、像素级辅助监督和可解释热图生成，实现对新任务的高效扩展。 Result: 构建涵盖12项任务的综合基准，实验表明：有效适配能释放基础模型潜力；通用基础模型经良好适配可媲美专用医学模型；2D方法优于3D架构；AnyMC3D在多项任务上达到最先进水平，首次实现单一框架在多样应用中取得领先性能，包括VLM3D挑战赛第一名。 Conclusion: AnyMC3D证明了基于2D基础模型的轻量适配框架在3D医学分类中的可行性与优越性，无需为每个任务单独建模即可实现高性能，推动了可扩展、通用型医学AI系统的发展。 Abstract: 3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.

[181] Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution

Abhinav Kumar,Tristan Aumentado-Armstrong,Lazar Valkov,Gopal Sharma,Alex Levinshtein,Radek Grzeszczuk,Suren Kumar

Main category: cs.CV

TL;DR: 本文提出了Qonvolution（查询卷积），通过利用卷积的邻域特性，将低频信号与坐标等查询信息进行卷积，以增强高频信号的学习能力，在多种计算机视觉与图形学任务中表现出优越性能，尤其在新视角合成中结合高斯点阵达到了最先进水平。

Details

Motivation: 神经网络在学习高频信号时常因频谱偏差或优化困难而表现不佳，现有方法如傅里叶编码虽有改进但仍存在局限，因此需要更有效的方法来提升高频信号建模能力。 Method: 提出Qonvolution，利用卷积操作对低频信号与输入查询（如空间坐标）进行局部邻域卷积，从而增强高频细节的表达能力，并可灵活集成到不同网络结构中。 Result: 在1D回归、2D超分辨率、2D图像回归和新视角合成等任务上均取得性能提升，尤其在结合高斯点阵的新视角合成中，图像质量优于当前强大的辐射场模型。 Conclusion: Qonvolution是一种简单而有效的方法，能够显著提升神经网络对高频信号的学习能力，在多个视觉与图形任务中展现出广泛适用性和先进性能。 Abstract: Accurately learning high-frequency signals is a challenge in computer vision and graphics, as neural networks often struggle with these signals due to spectral bias or optimization difficulties. While current techniques like Fourier encodings have made great strides in improving performance, there remains scope for improvement when presented with high-frequency information. This paper introduces Queried-Convolutions (Qonvolutions), a simple yet powerful modification using the neighborhood properties of convolution. Qonvolution convolves a low-frequency signal with queries (such as coordinates) to enhance the learning of intricate high-frequency signals. We empirically demonstrate that Qonvolutions enhance performance across a variety of high-frequency learning tasks crucial to both the computer vision and graphics communities, including 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). In particular, by combining Gaussian splatting with Qonvolutions for NVS, we showcase state-of-the-art performance on real-world complex scenes, even outperforming powerful radiance field models on image quality.

[182] Predictive Sample Assignment for Semantically Coherent Out-of-Distribution Detection

Zhimao Peng,Enguang Wang,Xialei Liu,Ming-Ming Cheng

Main category: cs.CV

TL;DR: 本文提出了一种基于预测样本分配（PSA）的简洁SCOOD框架，通过双阈值三元样本分配策略和对比表示学习损失，在保留高纯度ID/OOD样本的同时减少噪声干扰，显著提升了语义一致的分布外检测性能。

Details

Motivation: 现有SCOOD方法在利用无标签数据时采用聚类过滤策略，易引入大量噪声样本，影响模型性能。为此，本文旨在提高所选ID和OOD样本集的纯净度，增强模型判别能力。 Method: 提出PSA框架，包含基于预测能量得分的双阈值三元样本分配策略（将低置信样本归入丢弃集），以及概念对比表示学习损失来拉大ID与OOD样本在表示空间中的距离，并结合重训练策略提升模型拟合效果。 Result: 在两个标准SCOOD基准上实验表明，该方法显著优于当前最先进的方法。 Conclusion: 所提出的PSA框架有效提升了SCOOD任务中ID/OOD样本划分的纯净度和模型检测性能，为现实场景下的OOD检测提供了更鲁棒的解决方案。 Abstract: Semantically coherent out-of-distribution detection (SCOOD) is a recently proposed realistic OOD detection setting: given labeled in-distribution (ID) data and mixed in-distribution and out-of-distribution unlabeled data as the training data, SCOOD aims to enable the trained model to accurately identify OOD samples in the testing data. Current SCOOD methods mainly adopt various clustering-based in-distribution sample filtering (IDF) strategies to select clean ID samples from unlabeled data, and take the remaining samples as auxiliary OOD data, which inevitably introduces a large number of noisy samples in training. To address the above issue, we propose a concise SCOOD framework based on predictive sample assignment (PSA). PSA includes a dual-threshold ternary sample assignment strategy based on the predictive energy score that can significantly improve the purity of the selected ID and OOD sample sets by assigning unconfident unlabeled data to an additional discard sample set, and a concept contrastive representation learning loss to further expand the distance between ID and OOD samples in the representation space to assist ID/OOD discrimination. In addition, we also introduce a retraining strategy to help the model fully fit the selected auxiliary ID/OOD samples. Experiments on two standard SCOOD benchmarks demonstrate that our approach outperforms the state-of-the-art methods by a significant margin.

[183] Sharpness-aware Dynamic Anchor Selection for Generalized Category Discovery

Zhimao Peng,Enguang Wang,Fei Yang,Xialei Liu,Ming-Ming Cheng

Main category: cs.CV

TL;DR: 本文提出了一种用于广义类别发现（GCD）的新方法，包含损失锐度惩罚（LSP）和动态锚点选择（DAS）两个模块，以提升伪标签质量和未知类别的聚类准确性。

Details

Motivation: 现有基于参数化分类的GCD方法采用DINO-like伪标签策略，但由于大模型对特定视觉模式的偏好，容易引入虚假相关性并生成噪声伪标签，影响聚类性能。 Method: 提出LSP模块通过最小化模型最坏情况下的损失锐度来增强参数对扰动的鲁棒性，抑制无关特征编码；同时设计DAS模块基于KNN密度和类别概率选择未知类别的代表性样本，并为其分配硬伪标签，缓解已知与未知类间置信度差异。 Result: 实验表明该方法能有效减少伪标签噪声，在多个GCD基准上达到最先进的聚类性能。 Conclusion: LSP和DAS协同提升了模型在广义类别发现中的鲁棒性和聚类精度，显著优于现有方法。 Abstract: Generalized category discovery (GCD) is an important and challenging task in open-world learning. Specifically, given some labeled data of known classes, GCD aims to cluster unlabeled data that contain both known and unknown classes. Current GCD methods based on parametric classification adopt the DINO-like pseudo-labeling strategy, where the sharpened probability output of one view is used as supervision information for the other view. However, large pre-trained models have a preference for some specific visual patterns, resulting in encoding spurious correlation for unlabeled data and generating noisy pseudo-labels. To address this issue, we propose a novel method, which contains two modules: Loss Sharpness Penalty (LSP) and Dynamic Anchor Selection (DAS). LSP enhances the robustness of model parameters to small perturbations by minimizing the worst-case loss sharpness of the model, which suppressing the encoding of trivial features, thereby reducing overfitting of noise samples and improving the quality of pseudo-labels. Meanwhile, DAS selects representative samples for the unknown classes based on KNN density and class probability during the model training and assigns hard pseudo-labels to them, which not only alleviates the confidence difference between known and unknown classes but also enables the model to quickly learn more accurate feature distribution for the unknown classes, thus further improving the clustering accuracy. Extensive experiments demonstrate that the proposed method can effectively mitigate the noise of pseudo-labels, and achieve state-of-the-art results on multiple GCD benchmarks.

[184] MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation

Huu-An Vu,Van-Khanh Mai,Trong-Tam Nguyen,Quang-Duc Dam,Tien-Huy Nguyen,Thanh-Huong Le

Main category: cs.CV

TL;DR: MADTempo是一种结合时间搜索与网络规模视觉定位的视频检索框架，通过事件级连续性建模和基于网络图像的查询扩展，提升对多事件和罕见概念查询的理解与鲁棒性。

Details

Motivation: 现有视频检索方法在建模多事件间的时间依赖性和处理未见或稀有视觉概念方面存在不足，需要更强大的时间推理与泛化能力。 Method: 提出MADTempo框架：1）通过聚合连续视频片段的相似性得分来捕捉事件级时间连续性；2）引入基于Google图片搜索的回退模块，利用外部网络图像扩展查询表示，弥补预训练视觉嵌入的不足。 Result: 该方法增强了对多事件复杂查询的检索能力，并显著提升了对分布外（OOD）查询的鲁棒性，在时间推理和泛化性能上优于现有方法。 Conclusion: MADTempo通过融合时间结构建模与网络增强的视觉 grounding，推动了大规模视频检索系统向更语义化、自适应方向的发展。 Abstract: The rapid expansion of video content across online platforms has accelerated the need for retrieval systems capable of understanding not only isolated visual moments but also the temporal structure of complex events. Existing approaches often fall short in modeling temporal dependencies across multiple events and in handling queries that reference unseen or rare visual concepts. To address these challenges, we introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh, that unifies temporal search with web-scale visual grounding. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments, enabling coherent retrieval of multi-event queries. Complementarily, a Google Image Search-based fallback module expands query representations with external web imagery, effectively bridging gaps in pretrained visual embeddings and improving robustness against out-of-distribution (OOD) queries. Together, these components advance the temporal reasoning and generalization capabilities of modern video retrieval systems, paving the way for more semantically aware and adaptive retrieval across large-scale video corpora.

[185] Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion

Toan Le Ngo Thanh,Phat Ha Huu,Tan Nguyen Dang Duy,Thong Nguyen Le Minh,Anh Nguyen Nhu Tinh

Main category: cs.CV

TL;DR: 本文提出了一种统一的多模态时刻检索系统，通过级联双嵌入、时序感知评分和基于Agent的查询分解，有效应对跨模态噪声、时序连贯性建模和手动模态选择等问题。

Details

Motivation: 现有方法在处理跨模态噪声、模糊查询以及时序建模方面存在不足，且依赖人工模态选择，限制了系统的实用性与鲁棒性。 Method: 1) 级联双嵌入流程：结合BEIT-3和SigLIP进行初检，BLIP-2重排序提升精度；2) 时序感知评分机制：通过带指数衰减惩罚的束搜索构建连贯事件序列；3) 基于GPT-4o的Agent引导查询分解：自动拆分模糊查询为特定模态子查询，并自适应融合得分。 Result: 系统在定性分析中展现出对模糊查询的良好处理能力，能检索出时序连贯的视频片段，并实现融合策略的动态调整。 Conclusion: 所提方法提升了多模态时刻检索的精度、鲁棒性和可用性，推动了交互式视频检索的发展。 Abstract: The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail across cross modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2 based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, Agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.

[186] Content Adaptive based Motion Alignment Framework for Learned Video Compression

Tiange Zhang,Xiandong Meng,Siwei Ma

Main category: cs.CV

TL;DR: 本文提出了一种内容自适应的端到端视频压缩框架CAMA，通过运动对齐优化提升压缩性能。

Details

Motivation: 现有端到端视频压缩模型缺乏对不同内容特征的自适应能力，导致压缩效率受限。 Method: 引入两阶段光流引导的可变形扭曲机制、多参考质量感知策略以及无需训练的帧下采样模块，实现精细化特征对齐与误差抑制。 Result: 在标准测试集上，CAMA相比DCVC-TCM基线实现了24.95%的BD-rate（PSNR）节省，并优于DCVC-DC和HM-16.25等先进编解码器。 Conclusion: 所提出的CAMA框架通过内容自适应机制显著提升了神经视频压缩性能，具有较强的实用性和泛化能力。 Abstract: Recent advances in end-to-end video compression have shown promising results owing to their unified end-to-end learning optimization. However, such generalized frameworks often lack content-specific adaptation, leading to suboptimal compression performance. To address this, this paper proposes a content adaptive based motion alignment framework that improves performance by adapting encoding strategies to diverse content characteristics. Specifically, we first introduce a two-stage flow-guided deformable warping mechanism that refines motion compensation with coarse-to-fine offset prediction and mask modulation, enabling precise feature alignment. Second, we propose a multi-reference quality aware strategy that adjusts distortion weights based on reference quality, and applies it to hierarchical training to reduce error propagation. Third, we integrate a training-free module that downsamples frames by motion magnitude and resolution to obtain smooth motion estimation. Experimental results on standard test datasets demonstrate that our framework CAMA achieves significant improvements over state-of-the-art Neural Video Compression models, achieving a 24.95% BD-rate (PSNR) savings over our baseline model DCVC-TCM, while also outperforming reproduced DCVC-DC and traditional codec HM-16.25.

[187] UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction

Siyuan Yao,Dongxiu Liu,Taotao Li,Shengjie Li,Wenqi Ren,Xiaochun Cao

Main category: cs.CV

TL;DR: 提出了一种用于从遥感图像中提取建筑物的不确定性聚合全局-局部融合网络（UAGLNet），通过结合CNN和Transformer结构，并引入不确定性建模来提升分割精度。

Details

Motivation: 现有方法在特征金字塔间存在固有差距，且全局与局部特征融合不足，导致建筑物提取结果不准确和模糊。 Method: 设计了混合CNN与Transformer的协同编码器，在不同阶段捕获局部和全局语义；提出中间协同交互模块（CIB）缩小深层网络中的特征差距；采用全局-局部融合模块（GLF）进行互补融合；并提出不确定性聚合解码器（UAD）以像素级不确定性估计提升分割准确性。 Result: 实验表明，该方法在多个数据集上优于现有的最先进方法，显著提高了建筑提取的准确性和鲁棒性。 Conclusion: UAGLNet通过有效的全局-局部特征融合与不确定性建模，显著提升了遥感图像中建筑物提取的性能。 Abstract: Building extraction from remote sensing images is a challenging task due to the complex structure variations of the buildings. Existing methods employ convolutional or self-attention blocks to capture the multi-scale features in the segmentation models, while the inherent gap of the feature pyramids and insufficient global-local feature integration leads to inaccurate, ambiguous extraction results. To address this issue, in this paper, we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which is capable to exploit high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder, which adopts hybrid CNN and transformer layers at different stages to capture the local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features when the network becomes deeper. Afterwards, we propose a Global-Local Fusion (GLF) module to complementarily fuse the global and local representations. Moreover, to mitigate the segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) to explicitly estimate the pixel-wise uncertainty to enhance the segmentation accuracy. Extensive experiments demonstrate that our method achieves superior performance to other state-of-the-art methods. Our code is available at https://github.com/Dstate/UAGLNet

[188] SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer

Luan Thanh Trinh,Kenji Doi,Atsuki Osanai

Main category: cs.CV

TL;DR: 本文提出了一种名为SCAdapter的新方法，利用CLIP图像空间有效分离和融合内容与风格特征，解决了扩散模型在真实感风格迁移中的不足。该方法通过CSAdaIN、KVS注入和一致性目标三个组件实现精确的多风格混合和过程连贯性，在效果和效率上均优于现有方法。

Details

Motivation: 现有的风格迁移方法在处理照片级真实感转换时表现不佳，容易产生类似绘画的结果或丢失细节风格元素，并且难以消除原始内容风格和参考风格特征的干扰。因此需要一种更有效的机制来分离和整合内容与风格。 Method: 提出SCAdapter，利用CLIP图像空间提取纯净的内容和风格特征；引入可控风格自适应实例归一化（CSAdaIN）实现多风格精准融合，使用KVS注入进行目标风格集成，并设计风格迁移一致性目标以保持过程连贯性；无需DDIM反演和推理阶段优化。 Result: 实验表明，SCAdapter在传统和基于扩散模型的基准上均显著优于当前最先进的方法，同时推理速度至少提升2倍以上。 Conclusion: SCAdapter通过在CLIP空间中解耦内容与风格，结合高效的设计，在真实感风格迁移任务中实现了更优的效果与更高的效率，具有良好的实用价值。 Abstract: Diffusion models have emerged as the leading approach for style transfer, yet they struggle with photo-realistic transfers, often producing painting-like results or missing detailed stylistic elements. Current methods inadequately address unwanted influence from original content styles and style reference content features. We introduce SCAdapter, a novel technique leveraging CLIP image space to effectively separate and integrate content and style features. Our key innovation systematically extracts pure content from content images and style elements from style references, ensuring authentic transfers. This approach is enhanced through three components: Controllable Style Adaptive Instance Normalization (CSAdaIN) for precise multi-style blending, KVS Injection for targeted style integration, and a style transfer consistency objective maintaining process coherence. Comprehensive experiments demonstrate SCAdapter significantly outperforms state-of-the-art methods in both conventional and diffusion-based baselines. By eliminating DDIM inversion and inference-stage optimization, our method achieves at least $2\times$ faster inference than other diffusion-based approaches, making it both more effective and efficient for practical applications.

[189] VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference

Shengling Qin,Hao Yu,Chenxin Wu,Zheng Li,Yizhong Cao,Zhengyang Zhuge,Yuxin Zhou,Wentao Yao,Yi Zhang,Zhengheng Wang,Shuai Bai,Jianwei Zhang,Junyang Lin

Main category: cs.CV

TL;DR: VLCache是一个利用键值缓存和编码器缓存的重用框架，通过消除重复多模态输入的重新计算来提升推理效率，结合层感知动态重计算策略，在几乎无精度损失下实现显著的速度提升。

Details

Motivation: 在多模态模型推理中，重复输入导致大量冗余计算，影响推理效率，现有启发式缓存方法存在累积重用误差问题。 Method: 提出VLCache框架，形式化分析缓存重用中的累积误差效应，设计动态的、分层感知的重计算策略，以平衡准确性和计算效率。 Result: 实验表明，VLCache在保持与完全重计算相当的精度的同时，仅需计算2-5%的token，实现了1.2倍到16倍的首次生成时间（TTFT）加速，并已集成至SGLang系统。 Conclusion: VLCache有效解决了多模态推理中的缓存重用误差问题，通过智能缓存管理和动态重计算策略，大幅提升了实际部署中的推理速度与资源利用率。 Abstract: This paper presents VLCache, a cache reuse framework that exploits both Key-Value (KV) cache and encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to minimize the non-prefix cache reuse error effectively. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves an accuracy on par with full recomputation, while requiring only 2-5% of the tokens to compute, yielding 1.2x-16x TTFT speedups. The proposed VLCache pipeline has been integrated into SGLang, enabling significantly faster inference in practical deployments.

[190] Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes

Ziheng Qin,Yuheng Ji,Renshuai Tao,Yuxuan Tian,Yuyang Liu,Yipu Wang,Xiaolong Zheng

Main category: cs.CV

TL;DR: 本文提出了一个针对AI生成图像检测的“先受益后冲突”困境，并提出了一种名为生成器感知原型学习（GAPL）的新框架，以应对数据异质性和模型瓶颈问题，显著提升了跨多种生成模型的检测性能。

Details

Motivation: 随着生成模型种类增多，现有的通用AI生成图像检测器因数据来源多样性增加而出现性能停滞甚至下降，本文旨在揭示这一现象并提出解决方案。 Method: 提出Generator-Aware Prototype Learning (GAPL) 框架，通过构建紧凑的伪造原型集来统一特征空间，并采用两阶段低秩自适应训练策略，缓解数据异质性并增强模型适应能力。 Result: 实验表明，GAPL在多种GAN和扩散模型生成的图像上均实现了最先进的检测精度，表现出更强的泛化性和鲁棒性。 Conclusion: GAPL有效解决了多源生成图像检测中的“先受益后冲突”难题，为构建更通用、可扩展的AIGI检测器提供了新思路。 Abstract: The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis, diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that constrain representation with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity.To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at https://github.com/UltraCapture/GAPL

[191] Calibrating Uncertainty for Zero-Shot Adversarial CLIP

Wenjing lu,Zerui Tao,Dongping Zhang,Yuning Qiu,Yang Yang,Qibin Zhao

Main category: cs.CV

TL;DR: 本文提出了一种新的对抗微调目标，用于CLIP模型，在保持零样本分类性能的同时恢复校准的不确定性并提升对抗鲁棒性。

Details

Motivation: CLIP在零样本分类中表现优异但易受对抗攻击，现有方法忽视了不确定性校准，导致对抗样本下模型过度自信且不可靠。 Method: 通过将CLIP输出重新参数化为Dirichlet分布的浓度参数，提出一种统一表示，整体对齐干净和对抗样本下的预测分布，实现准确性和不确定性的联合优化。 Result: 在多个零样本分类基准上验证了该方法能有效恢复校准的不确定性，具备竞争力的对抗鲁棒性，并保持原始准确率。 Conclusion: 所提方法弥补了模型可靠性与鲁棒性之间的差距，提升了CLIP在对抗扰动下的可信度与泛化能力。 Abstract: CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work of adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.

[192] Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Yifan Pu,Yizeng Han,Zhiwei Tang,Jiasheng Tang,Fan Wang,Bohan Zhuang,Gao Huang

Main category: cs.CV

TL;DR: 本文首次系统研究了将最先进的蒸馏技术应用于强文本到图像教师模型FLUX.1-lite的可行性，分析了从离散类别标签转向自由形式语言提示时的关键挑战，并提供了关于输入缩放、网络架构和超参数的实用指南，为快速、高保真且资源高效的扩散生成器在实际T2I应用中的部署奠定了基础。

Details

Motivation: 尽管扩散蒸馏已显著加速类条件图像合成，但其在开放文本到图像生成中的适用性尚不明确，因此需要系统研究其在自由文本提示下的表现与挑战。 Method: 将现有蒸馏方法统一到一个框架下，在FLUX.1-lite模型上进行适配与比较，分析不同技术在文本到图像生成中的表现。 Result: 识别出从离散标签到自由文本提示迁移过程中的关键障碍，提供了有效的实践指南，并开源了实现代码与预训练学生模型。 Conclusion: 研究为在真实场景中部署高效、高质量的文本到图像扩散蒸馏模型建立了坚实基础。 Abstract: Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available on github.com/alibaba-damo-academy/T2I-Distill.

[193] Light Field Based 6DoF Tracking of Previously Unobserved Objects

Nikolai Goncharov,James L. Gray,Donald G. Dansereau

Main category: cs.CV

TL;DR: 本文提出了一种基于光场图像的无需预训练模型的对象跟踪方法，利用视觉基础模型提取语义和几何特征，并转换为视图相关的高斯点阵表示，实现对复杂外观（如反射）对象的鲁棒跟踪。同时构建了包含挑战性反射对象和精确位姿标注的光场跟踪数据集。实验表明该方法在困难场景下与现有最优模型相当，推动了通用对象跟踪在机器人系统中的应用。

Details

Motivation: 现有高性能对象跟踪方法依赖于预先捕获的对象视图构建显式参考模型，限制了对未知或视觉复杂对象（如反射）的泛化能力。本文旨在提出一种不依赖预建模、能应对复杂视觉行为的通用跟踪方法。 Method: 从光场输入中利用视觉基础模型提取语义和几何特征，将其转化为视图相关的高斯点阵（view-dependent Gaussian splats），作为统一的对象表示；结合可微渲染与位姿优化实现对象跟踪。 Result: 提出了新的光场对象跟踪数据集，包含具有精确真值位姿的挑战性反射物体；实验显示所提方法在处理复杂视觉行为时性能与最先进的基于模型的方法相当。 Conclusion: 本文方法实现了不依赖预训练模型的鲁棒对象跟踪，尤其适用于具有复杂外观（如反射）的未知物体，结合新数据集为机器人系统中的通用对象跟踪提供了新方向。 Abstract: Object tracking is an important step in robotics and reautonomous driving pipelines, which has to generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. However, such reference models can struggle with visually complex appearance, reducing the quality of tracking. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model, while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at https://github.com/nagonch/LiFT-6DoF.

[194] TWLR: Text-Guided Weakly-Supervised Lesion Localization and Severity Regression for Explainable Diabetic Retinopathy Grading

Xi Luo,Shixin Xu,Ying Xie,JianZhong Hu,Yuwei He,Yuhui Deng,Huaxiong Huang

Main category: cs.CV

TL;DR: 提出了一种名为TWLR的两阶段框架，用于可解释的糖尿病视网膜病变评估，结合视觉-语言模型与弱监督语义分割，实现在无需像素级标注的情况下进行病灶定位和疾病严重程度退化可视化。

Details

Motivation: 医学图像分析依赖高质量标注，但获取像素级标签成本高且耗时；同时深度学习模型缺乏可解释性，限制了其在临床的应用。 Method: 第一阶段使用视觉-语言模型融合眼科领域知识进行DR分级和病灶分类；第二阶段提出基于弱监督语义分割的迭代严重度回归框架，通过病灶显著图指导渐进式修复机制，逐步消除病理特征。 Result: 在FGADR、DDR和私有数据集上实验表明，TWLR在DR分类和病灶分割方面均达到竞争性性能，无需像素级监督即可实现准确的病灶定位，并提供疾病向健康转变的可解释可视化。 Conclusion: TWLR是一种可解释且标注高效的方法，为自动视网膜图像分析提供了临床可接受的解决方案。 Abstract: Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and providing an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.

[195] JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

Haoyu Wang,Lei Zhang,Wenrui Liu,Dengyang Jiang,Wei Wei,Chen Ding

Main category: cs.CV

TL;DR: 提出了一种名为JoDiffusion的新型扩散模型框架，能够基于文本提示同时生成语义一致的图像和像素级标注，解决了现有合成数据方法中的语义不一致和可扩展性问题。

Details

Motivation: 现有合成数据方法在生成图像和标注时存在语义不一致或依赖人工标注掩码导致可扩展性差的问题，亟需一种能同步生成高质量、多样化且标注一致图像的方法。 Method: 在标准潜在扩散模型基础上引入独立的标注变分自编码器（VAE），将标注掩码映射到与图像共享的潜在空间，并调整扩散模型以建模图像-标注联合分布；同时提出掩码优化策略减少生成噪声。 Result: 在Pascal VOC、COCO和ADE20K上实验表明，JoDiffusion生成的数据显著提升了语义分割模型性能，优于现有方法。 Conclusion: JoDiffusion实现了仅通过文本提示即可同步生成图像及其语义一致的像素级标注，兼具高可扩展性和准确性，为语义分割提供了高效的数据生成方案。 Abstract: Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods necessitate to either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problem. To migrate both problems with one stone, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing these, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.

[196] What Happens Next? Next Scene Prediction with a Unified Video Model

Xinjie Li,Zhimin Chen,Rui Zhao,Florian Schiffers,Zhenyu Liao,Vimal Bhat

Main category: cs.CV

TL;DR: 本文提出了“下一场景预测”（NSP）这一新任务，旨在推动统一视频模型在时间与因果推理方面的能力，并通过结合Qwen-VL和LTX的框架，在新构建的大规模数据集上实现了最先进的性能。

Details

Motivation: 现有统一模型多集中于文本到视频生成等常规任务，忽略了其在时间推理方面的潜力，本文旨在探索并增强模型对视频序列中未来事件的预测与因果理解能力。 Method: 提出一个融合Qwen-VL（理解）与LTX（生成）的统一框架，通过潜在查询嵌入和连接模块实现；在自建的大规模NSP数据集上分三阶段训练：文本到视频预训练、监督微调和基于因果一致性奖励的强化学习（GRPO）。 Result: 在NSP任务上取得了当前最优的性能表现，显著提升了多模态模型对未来场景的预测能力。 Conclusion: 所提方法有效推动了统一视频模型在时间推理与因果理解方面的能力，为通用多模态系统赋予了更强的“预测未来”能力。 Abstract: Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.

[197] Comprehensive Deployment-Oriented Assessment for Cross-Environment Generalization in Deep Learning-Based mmWave Radar Sensing

Tomoya Tanaka,Tomonori Ikeda,Ryo Yonemoto

Main category: cs.CV

TL;DR: 本研究首次全面评估了深度学习雷达感知中的空间泛化技术，针对FMCW MIMO雷达室内人数统计任务，系统比较了多种方法，发现基于Sigmoid的幅度加权预处理显著提升跨环境性能，结合少量样本的迁移学习可实现超过80%的误差降低，为实际部署提供了高效解决方案。

Details

Motivation: 深度学习在RF传感中面临空间泛化难题，现有方法在不同环境中性能下降严重，缺乏系统性评估和有效解决方案，限制了实际部署能力。 Method: 研究聚焦于FMCW MIMO雷达的人数统计任务，系统评估了包括幅度统计预处理（Sigmoid加权、阈值归零）、频域滤波、自编码器背景抑制、数据增强和迁移学习等多种空间泛化方法，并在两个不同布局的环境中进行实验验证。 Result: Sigmoid幅度加权表现最优，相比基线方法在跨环境测试中降低了50.1%的RMSE和55.2%的MAE；数据增强带来最多8.8%的MAE改善；而迁移学习在大空间位移下至关重要，仅用540个目标域样本即实现82.1%的RMSE和91.3%的MAE降幅。 Conclusion: 结合幅度基预处理（特别是Sigmoid加权）与轻量级迁移学习，是实现深度学习雷达感知系统在空间变化下保持高鲁棒性的实用且高效路径，为未来系统设计提供了明确方向。 Abstract: This study presents the first comprehensive evaluation of spatial generalization techniques, which are essential for the practical deployment of deep learning-based radio-frequency (RF) sensing. Focusing on people counting in indoor environments using frequency-modulated continuous-wave (FMCW) multiple-input multiple-output (MIMO) radar, we systematically investigate a broad set of approaches, including amplitude-based statistical preprocessing (sigmoid weighting and threshold zeroing), frequency-domain filtering, autoencoder-based background suppression, data augmentation strategies, and transfer learning. Experimental results collected across two environments with different layouts demonstrate that sigmoid-based amplitude weighting consistently achieves superior cross-environment performance, yielding 50.1% and 55.2% reductions in root-mean-square error (RMSE) and mean absolute error (MAE), respectively, compared with baseline methods. Data augmentation provides additional though modest benefits, with improvements up to 8.8% in MAE. By contrast, transfer learning proves indispensable for large spatial shifts, achieving 82.1% and 91.3% reductions in RMSE and MAE, respectively, with 540 target-domain samples. Taken together, these findings establish a highly practical direction for developing radar sensing systems capable of maintaining robust accuracy under spatial variations by integrating deep learning models with amplitude-based preprocessing and efficient transfer learning.

[198] SneakPeek: Future-Guided Instructional Streaming Video Generation

Cheeun Hong,German Barquero,Fadime Sener,Markos Georgopoulos,Edgar Schönfeld,Stefan Popov,Yuming Du,Oscar Mañas,Albert Pumarola

Main category: cs.CV

TL;DR: 本文提出了一种名为SneakPeek的未来驱动流式教学视频生成框架，通过预测性因果适应、未来引导的自强制机制和多提示条件控制，提升了视频扩散模型在长时间多步骤任务中的时间一致性和可控性。

Details

Motivation: 现有视频扩散模型在生成长序列、多步骤的教学视频时难以保持时间一致性和可控性，限制了其在教育和内容创作中的应用。 Method: 提出SneakPeek框架，包含三个关键技术：(1) 预测性因果适应，用于下一帧和关键帧预测；(2) 带双区域KV缓存的未来引导自强制机制，缓解推理时的暴露偏差；(3) 多提示条件控制，实现对多步指令的细粒度控制。 Result: 实验结果表明，该方法能生成时间上连贯且语义忠实的教学视频，准确遵循复杂的多步骤任务描述，在视频质量和指令跟随方面优于现有方法。 Conclusion: SneakPeek通过引入未来预测与自回归机制，有效解决了长序列教学视频生成中的时间漂移问题，实现了高一致性与可交互的流式视频生成。 Abstract: Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.

[199] Motus: A Unified Latent Action World Model

Hongzhe Bi,Hengkai Tan,Shenghao Xie,Zeyuan Wang,Shuhe Huang,Haitian Liu,Ruowen Zhao,Yao Feng,Chendong Xiang,Yinze Rong,Hongyan Zhao,Hanyu Liu,Zhizhong Su,Lei Ma,Hang Su,Jun Zhu

Main category: cs.CV

TL;DR: 本文提出了Motus，一种统一的潜在动作世界模型，通过混合Transformer架构整合理解、视频生成和动作三种专家，并利用光流学习像素级“delta action”，实现大规模动作预训练，在仿真和真实场景中均显著优于现有方法。

Details

Motivation: 当前具身智能体的方法基于孤立的模型（如理解、世界建模和控制），导致无法统一多模态生成能力，且难以利用大规模异构数据。因此需要一个统一框架来整合不同功能与先验知识。 Method: 提出Motus模型，采用Mixture-of-Transformer（MoT）架构集成三个专家模块，并结合UniDiffuser风格的调度器支持多种建模范式切换；利用光流提取潜在动作，设计三阶段训练流程和六层数据金字塔，实现像素级‘delta action’建模与大规模预训练。 Result: 实验表明，Motus在仿真环境中相比X-VLA提升15%，相比Pi0.5提升45%；在真实世界任务中性能提升11%~48%，实现了对多种建模模式的统一支持。 Conclusion: 统一建模理解、生成与控制功能并融合多源先验可显著提升机器人任务表现，Motus为通用具身智能体提供了一个高效、可扩展的解决方案。 Abstract: While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

[200] Comprehensive Evaluation of Rule-Based, Machine Learning, and Deep Learning in Human Estimation Using Radio Wave Sensing: Accuracy, Spatial Generalization, and Output Granularity Trade-offs

Tomoya Tanaka,Tomonori Ikeda,Ryo Yonemoto

Main category: cs.CV

TL;DR: 首次全面比较了基于规则、传统机器学习和深度学习方法在调频连续波MIMO雷达无线电波感知中的性能，发现高容量模型在相同环境表现优异但对域偏移敏感，而基于规则的方法虽输出粗略却更鲁棒，揭示了空间泛化能力与输出精细度之间的权衡。

Details

Motivation: 为了系统评估不同方法在不同室内环境下的无线电波感知性能，并探讨模型容量、泛化能力与输出粒度之间的关系。 Method: 比较了五种方法：基于连接组件的规则方法；三种传统机器学习模型（k近邻、随机森林、支持向量机）；以及结合CNN和LSTM的深度学习模型，在两个不同布局的室内环境中进行系统评估。 Result: 在训练环境中，CNN-LSTM模型准确率最高，传统模型表现中等；在新布局中，所有基于学习的方法性能显著下降，而规则方法保持稳定；在人员存在性二分类任务中，所有模型均表现出跨环境的高准确性。 Conclusion: 高容量模型在原环境中可实现高精度细粒度输出，但对域偏移敏感；规则方法虽无法提供细粒度结果，但具有更强的鲁棒性；模型的空间泛化能力与输出粒度之间存在明显权衡。 Abstract: This study presents the first comprehensive comparison of rule-based methods, traditional machine learning models, and deep learning models in radio wave sensing with frequency modulated continuous wave multiple input multiple output radar. We systematically evaluated five approaches in two indoor environments with distinct layouts: a rule-based connected component method; three traditional machine learning models, namely k-nearest neighbors, random forest, and support vector machine; and a deep learning model combining a convolutional neural network and long short term memory. In the training environment, the convolutional neural network long short term memory model achieved the highest accuracy, while traditional machine learning models provided moderate performance. In a new layout, however, all learning based methods showed significant degradation, whereas the rule-based method remained stable. Notably, for binary detection of presence versus absence of people, all models consistently achieved high accuracy across layouts. These results demonstrate that high capacity models can produce fine grained outputs with high accuracy in the same environment, but they are vulnerable to domain shift. In contrast, rule-based methods cannot provide fine grained outputs but exhibit robustness against domain shift. Moreover, regardless of the model type, a clear trade off was revealed between spatial generalization performance and output granularity.

[201] Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models

Hao Chen,Yiwei Wang,Songze Li

Main category: cs.CV

TL;DR: 提出了一种双向图像引导的概念擦除框架（Bi-Erasing），通过同时抑制有害概念和增强安全生成，有效平衡了概念移除效果与生成质量。

Details

Motivation: 现有概念擦除方法多采用单向策略，难以兼顾有害概念的彻底移除与生成图像的质量，缺乏对安全语义的显式引导。 Method: 基于文本-图像联合表示，设计两个解耦的图像分支：负分支抑制有害语义，正分支提供安全视觉引导，并引入基于掩码的过滤机制以减少无关内容干扰。 Result: 在多项实验中，Bi-Erasing在概念移除效果和视觉保真度方面均优于基线方法。 Conclusion: 双向协同优化策略能更有效地实现安全与高质量生成之间的平衡，为扩散模型中的概念擦除提供了新思路。 Abstract: Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image models.However, existing removal methods typically adopt a unidirectional erasure strategy by either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. Across extensive experiment evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.

[202] GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

Tong Wei,Yijun Yang,Changhao Zhang,Junliang Xing,Yuanchun Shi,Zongqing Lu,Deheng Ye

Main category: cs.CV

TL;DR: GTR-Turbo是一种高效的多模态智能体强化学习方法，无需依赖昂贵的教师模型，通过合并训练过程中的检查点权重作为免费教师来指导后续训练，在减少成本的同时显著提升性能。

Details

Motivation: 现有方法依赖高成本或特权教师模型提供细粒度奖励，限制了实用性与可复现性，且存在熵崩溃和训练不稳定问题。 Method: 提出GTR-Turbo，通过在强化学习过程中合并不同阶段的模型检查点权重，构建一个“免费”教师模型，并利用该教师通过监督微调或软logit蒸馏指导后续学习。 Result: 在多种视觉代理任务中，相比基线模型准确率提升10-30%，相较于GTR训练时间减少50%，计算成本降低60%，同时缓解熵崩溃并保持训练稳定。 Conclusion: GTR-Turbo在不使用外部特权模型的情况下实现了高效、稳定的多步强化学习，显著降低了训练开销并提升了多模态智能体的性能。 Abstract: Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.

[203] Towards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing

Jaeyoon Kim,Yoonki Cho,Sung-Eui Yoon

Main category: cs.CV

TL;DR: 提出了一种高效的非对称视觉位置识别（VPR）框架，利用地理信息构建记忆库并引入隐式嵌入增强技术，显著降低计算成本的同时优于现有方法。

Details

Motivation: 高容量基础模型（如DINOv2）在VPR中性能优异但计算开销大，难以部署于资源受限设备，需设计高效且兼容的非对称框架。 Method: 采用高容量离线提取的图库模型与轻量级在线查询网络相结合的非对称架构；利用地理元数据构建地理记忆库存储图库特征，并通过隐式嵌入增强提升轻量查询网络的泛化能力。 Result: 实验表明该方法大幅降低计算成本，在多个基准上超越现有非对称VPR方法，实现更优的精度与效率权衡。 Conclusion: 所提方法为资源受限环境下的VPR提供了新的解决方案，通过地理结构化记忆库和嵌入增强实现了高效、高性能的非对称检索。 Abstract: Visual Place Recognition (VPR) has advanced significantly with high-capacity foundation models like DINOv2, achieving remarkable performance. Nonetheless, their substantial computational cost makes deployment on resource-constrained devices impractical. In this paper, we introduce an efficient asymmetric VPR framework that incorporates a high-capacity gallery model for offline feature extraction with a lightweight query network for online processing. A key challenge in this setting is ensuring compatibility between these heterogeneous networks, which conventional approaches address through computationally expensive k-NN-based compatible training. To overcome this, we propose a geographical memory bank that structures gallery features using geolocation metadata inherent in VPR databases, eliminating the need for exhaustive k-NN computations. Additionally, we introduce an implicit embedding augmentation technique that enhances the query network to model feature variations despite its limited capacity. Extensive experiments demonstrate that our method not only significantly reduces computational costs but also outperforms existing asymmetric retrieval techniques, establishing a new aspect for VPR in resource-limited environments. The code is available at https://github.com/jaeyoon1603/AsymVPR

[204] Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models

Zizhi Chen,Yizhen Gao,Minghao Han,Yizhou Liu,Zhaoyu Chen,Dingkang Yang,Lihua Zhang

Main category: cs.CV

TL;DR: 提出一种结合检索增强生成（RAG）与动态知识蒸馏的多模态医学视觉-语言模型持续学习框架，在新构建的MGTIL基准上实现了SOTA性能。

Details

Motivation: 解决多模态生物医学VLM在持续学习中难以同时保留细粒度模态内特征和弥合跨模态域差距的核心矛盾。 Method: 基于1800万医学多模态数据构建多模态、多层RAG系统，实现动态知识检索以指导模型微调；引入动态知识蒸馏框架，根据需求调节参数空间重要性、知识粒度和参考数据分布。 Result: 在新提出的医学通才任务增量学习（MGTIL）基准上进行全面评估，验证了模型在应对域偏移、保留细粒度特征和实时学习新任务方面的能力。 Conclusion: 所提方法有效解决了多模态持续学习中的核心挑战，在所有指标上均达到最先进水平。 Abstract: Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous \textbf{M}edical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.

[205] DiRe: Diversity-promoting Regularization for Dataset Condensation

Saumyaranjan Mohanty,Aravind Reddy,Konda Reddy Mopuri

Main category: cs.CV

TL;DR: 提出一种名为DiRe的多样性正则化方法，通过余弦相似性和欧氏距离减少数据合成中的冗余，提升数据集压缩的多样性和泛化性能。

Details

Motivation: 现有数据集压缩方法生成的合成数据存在显著冗余，缺乏多样性，限制了其训练效用，因此需要有效机制来提升多样性。 Method: 设计了一个即插即用的多样性正则化器（DiRe），结合余弦相似性和欧氏距离，可应用于多种先进的数据集压缩方法中以增强多样性。 Result: 在CIFAR-10到ImageNet-1K等多个基准数据集上实验表明，加入DiRe后，各类先进压缩方法在泛化性能和多样性指标上均有提升。 Conclusion: DiRe是一种有效且通用的多样性增强模块，能够显著改善数据集压缩方法的性能，推动高效、低冗余数据合成的发展。 Abstract: In Dataset Condensation, the goal is to synthesize a small dataset that replicates the training utility of a large original dataset. Existing condensation methods synthesize datasets with significant redundancy, so there is a dire need to reduce redundancy and improve the diversity of the synthesized datasets. To tackle this, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance, which can be applied off-the-shelf to various state-of-the-art condensation methods. Through extensive experiments, we demonstrate that the addition of our regularizer improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K with respect to generalization and diversity metrics.

[206] UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era

Ziqiang Zhu,Bowei Yang

Main category: cs.CV

TL;DR: 提出了一种基于冻结视觉基础模型（SAM2和CLIP）的无监督、开放词汇变化检测方法UniVCD，通过轻量级特征对齐模块实现高分辨率、语义感知的变化检测，在无需标注数据的情况下在多个基准上表现出色。

Details

Motivation: 现有变化检测方法依赖监督学习，标注成本高、泛化能力差，且局限于预定义类别，难以适应多样化场景。随着SAM2和CLIP等视觉基础模型的发展，亟需一种不依赖标注、可扩展至开放词汇的通用变化检测框架。 Method: 提出UniVCD，利用冻结的SAM2提取空间细节，CLIP提供语义先验，通过轻量级特征对齐模块融合二者；设计无监督训练策略，无需配对图像或标签；引入简化后处理流程抑制噪声和伪变化。 Result: 在多个公开的二值和语义变化检测基准上，UniVCD在F1、IoU等指标上达到或优于现有开放词汇方法，验证了其在不同场景和成像几何下的强泛化能力。 Conclusion: 基于冻结视觉基础模型与轻量多模态对齐的无监督变化检测是一种实用且高效的方法，为开放词汇变化检测提供了新范式。 Abstract: Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.

[207] ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning

Feng Zhang,Zezhong Tan,Xinhong Ma,Ziqiang Dong,Xi Leng,Jianfei Zhao,Xin Sun,Yang Yang

Main category: cs.CV

TL;DR: 本文提出ADHint，一种考虑样本难度的提示增强方法，用于优化监督微调与强化学习中的推理能力，提升模型在不同分布下的泛化表现。

Details

Motivation: 现有基于提示的强化学习方法在调度提示比例和估计优势时忽略样本难度，导致学习不稳定和过度模仿离策略提示。 Method: 提出ADHint，包含自适应提示调度、基于一致性的梯度调制与选择性掩码、以及基于 rollout 难度后验的优势估计，综合考虑样本和生成轨迹的难度进行优化。 Result: 在多种模态、模型规模和领域上实验表明，ADHint在pass@1和avg@8指标上均优于现有方法，显著提升推理能力和分布外泛化性能。 Conclusion: 将样本和生成难度纳入提示调度与优势估计是实现探索与模仿平衡的关键，ADHint为知识扩展和推理泛化提供了有效框架。 Abstract: To combine the advantages of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), recent methods have integrated ''hints'' into post-training, which are prefix segments of complete reasoning trajectories, aiming for powerful knowledge expansion and reasoning generalization. However, existing hint-based RL methods typically ignore difficulty when scheduling hint ratios and estimating relative advantages, leading to unstable learning and excessive imitation of off-policy hints. In this work, we propose ADHint, which treats difficulty as a key factor in both hint-ratio schedule and relative-advantage estimation to achieve a better trade-off between exploration and imitation. Specifically, we propose Adaptive Hint with Sample Difficulty Prior, which evaluates each sample's difficulty under the policy model and accordingly schedules an appropriate hint ratio to guide its rollouts. We also introduce Consistency-based Gradient Modulation and Selective Masking for Hint Preservation to modulate token-level gradients within hints, preventing biased and destructive updates. Additionally, we propose Advantage Estimation with Rollout Difficulty Posterior, which leverages the relative difficulty of rollouts with and without hints to estimate their respective advantages, thereby achieving more balanced updates. Extensive experiments across diverse modalities, model scales, and domains demonstrate that ADHint delivers superior reasoning ability and out-of-distribution generalization, consistently surpassing existing methods in both pass@1 and avg@8. Our code and dataset will be made publicly available upon paper acceptance.

[208] Harmonizing Generalization and Specialization: Uncertainty-Informed Collaborative Learning for Semi-supervised Medical Image Segmentation

Wenjing Lu,Yi Hong,Yang Yang

Main category: cs.CV

TL;DR: 提出了一种名为UnCoL的双教师框架，通过不确定性引导的协同学习，在半监督医学图像分割中平衡通用性与特异性，显著降低标注需求的同时达到接近全监督的性能。

Details

Motivation: 现有视觉基础模型在医学图像分割中因通用先验与特定任务需求不匹配，难以适应标注有限或病理变异罕见的临床场景，因此需要一种能兼顾泛化与特化的学习方法。 Method: 设计了一个双教师框架UnCoL：一个冻结的基础模型教师传递视觉和语义通用知识，另一个渐进式自适应教师捕捉任务特异性表示；通过预测不确定性动态调节伪标签学习，抑制不可靠监督，稳定模糊区域的学习过程。 Result: 在多种2D和3D医学图像分割基准上，UnCoL持续优于当前最先进的半监督方法和基础模型基线，仅用少量标注即可达到接近全监督模型的性能。 Conclusion: UnCoL有效融合了基础模型的泛化能力与任务特定的细粒度学习，通过不确定性引导的协同学习机制，为低标注条件下的医学图像分割提供了高效且稳定的解决方案。 Abstract: Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.

[209] FID-Net: A Feature-Enhanced Deep Learning Network for Forest Infestation Detection

Yan Zhang,Baoxin Li,Han Sun,Yuhang Gao,Mingtai Zhang,Pei Wang

Main category: cs.CV

TL;DR: 本研究提出FID-Net，一种基于YOLOv8n的深度学习模型，用于从无人机可见光影像中检测受虫害影响的树木，并结合三种空间分析指标评估虫情态势。FID-Net引入轻量级特征增强模块（FEM）、自适应多尺度特征融合模块（AMFM）和高效通道注意力（ECA）机制，提升了小目标和病害敏感特征的检测能力。实验在天山东部32个林分上进行，结果表明FID-Net在精度、召回率和mAP等指标上优于主流YOLO模型。通过核密度估计、邻域风险评估和DBSCAN聚类分析，实现了虫害热点定位、健康树木感染风险评估和优先保护区域识别。研究表明虫害树木呈明显聚集分布，支持精准防控。

Details

Motivation: 传统森林虫害监测方法难以满足大范围、细粒度的检测需求，亟需一种高效、自动化的监测手段来准确识别受害树木并分析虫害扩散模式，以保障生态系统稳定。 Method: 提出FID-Net模型：基于YOLOv8n，引入轻量级特征增强模块（FEM）提取病害敏感特征，设计自适应多尺度特征融合模块（AMFM）融合RGB与增强分支特征，并结合高效通道注意力（ECA）机制强化关键信息。基于检测结果构建虫情分析框架：采用核密度估计定位感染热点，通过邻域评估分析健康树感染风险，利用DBSCAN聚类识别高密度健康树区域作为优先保护对象。 Result: 在天山东部32个林分的无人机影像数据上，FID-Net达到86.10%的精确率、75.44%的召回率、82.29%的mAP@0.5和64.30%的mAP@0.5:0.95，性能优于主流YOLO模型。空间分析表明，受害树木具有显著的空间聚集性，验证了虫害传播的局部扩散特征。 Conclusion: FID-Net能够实现对森林虫害的高精度、细粒度检测，结合空间分析方法可有效识别虫害热点与优先保护区域，为智能虫害监测、早期预警和精准管理提供了可靠的技术支持。 Abstract: Forest pests threaten ecosystem stability, requiring efficient monitoring. To overcome the limitations of traditional methods in large-scale, fine-grained detection, this study focuses on accurately identifying infected trees and analyzing infestation patterns. We propose FID-Net, a deep learning model that detects pest-affected trees from UAV visible-light imagery and enables infestation analysis via three spatial metrics. Based on YOLOv8n, FID-Net introduces a lightweight Feature Enhancement Module (FEM) to extract disease-sensitive cues, an Adaptive Multi-scale Feature Fusion Module (AMFM) to align and fuse dual-branch features (RGB and FEM-enhanced), and an Efficient Channel Attention (ECA) mechanism to enhance discriminative information efficiently. From detection results, we construct a pest situation analysis framework using: (1) Kernel Density Estimation to locate infection hotspots; (2) neighborhood evaluation to assess healthy trees' infection risk; (3) DBSCAN clustering to identify high-density healthy clusters as priority protection zones. Experiments on UAV imagery from 32 forest plots in eastern Tianshan, China, show that FID-Net achieves 86.10% precision, 75.44% recall, 82.29% mAP@0.5, and 64.30% mAP@0.5:0.95, outperforming mainstream YOLO models. Analysis confirms infected trees exhibit clear clustering, supporting targeted forest protection. FID-Net enables accurate tree health discrimination and, combined with spatial metrics, provides reliable data for intelligent pest monitoring, early warning, and precise management.

Zhijian He,Feifei Liu,Yuwei Li,Zhanpeng Liu,Jintao Cheng,Xieyuanli Chen,Xiaoyu Tang

Main category: cs.CV

TL;DR: 本文提出了一种名为DiffFusion的新型多模态3D目标检测框架，通过基于扩散模型的图像与点云恢复以及自适应跨模态融合机制，显著提升了恶劣天气条件下的感知鲁棒性。

Details

Motivation: 在恶劣天气条件下，现有多模态3D目标检测因传感器数据退化和模态间错位而性能受限，亟需提升鲁棒性和模态对齐能力。 Method: 提出DiffFusion框架，包含Diffusion-IR（用于恢复受天气影响的图像）、点云恢复（PCR）模块（利用图像线索修复LiDAR数据），以及双向自适应融合与对齐模块（BAFAM），实现动态多模态融合与BEV空间中的双向对齐。 Result: 在三个公开数据集上实验表明，DiffFusion在恶劣天气下达到最先进的鲁棒性，同时在干净数据上保持良好性能；在真实世界DENSE数据集上的零样本结果验证了其优异的泛化能力。 Conclusion: DiffFusion通过扩散模型驱动的恢复与自适应融合策略，有效解决了恶劣天气下多模态3D检测中的数据退化与模态错位问题，显著提升了感知系统的可靠性与泛化性。 Abstract: Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR restoring images degraded by weather effects and Point Cloud Restoration (PCR) compensating for corrupted LiDAR data using image object cues. To tackle misalignments between two modalities, we develop Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.

[211] DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

Vivek Alumootil,Tuan-Anh Vu,M. Khalid Jawed

Main category: cs.CV

TL;DR: 本文提出了一种名为DePT3R的新框架，能够在无需相机姿态的情况下，通过单次前向传播同时实现动态场景的密集点跟踪与三维重建。

Details

Motivation: 现有方法通常依赖成对处理、需要已知相机位姿或假设输入帧的时间顺序，限制了其灵活性和适用性；而大规模无姿态图像集合的高效3D重建进展启发了统一动态场景理解方法的发展。 Method: 通过强大的骨干网络提取深度时空特征，并使用密集预测头回归像素级映射，实现多任务学习，同时完成密集点跟踪和3D重建。 Result: 在多个具有挑战性的动态场景基准上验证了DePT3R，表现出优异性能，并在内存效率方面显著优于现有最先进方法。 Conclusion: DePT3R是一种灵活且高效的框架，能够在不依赖相机姿态的前提下统一处理动态场景中的密集点跟踪与三维重建问题。 Abstract: Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering to input frames, constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency-especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and codes are available via the open repository: https://github.com/StructuresComp/DePT3R

[212] LeafTrackNet: A Deep Learning Framework for Robust Leaf Tracking in Top-Down Plant Phenotyping

Shanghua Liu,Majharulislam Babor,Christoph Verduyn,Breght Vandenberghe,Bruno Betoni Parodi,Cornelia Weltzien,Marina M. -C. Höhne

Main category: cs.CV

TL;DR: 本文提出了CanolaTrack数据集和LeafTrackNet框架，用于在真实条件下实现油菜叶片的高精度跟踪。

Details

Motivation: 现有的植物特异性跟踪方法通常局限于小规模物种或受限成像条件，而通用多目标跟踪方法不适用于动态生物场景，且缺乏大规模真实条件下的数据集阻碍了研究进展。 Method: 提出CanolaTrack数据集（包含5,704张RGB图像和31,840个标注叶片实例），并设计LeafTrackNet框架，结合基于YOLOv10的叶片检测器和MobileNetV3的嵌入网络，通过嵌入式记忆关联策略维持叶片身份。 Result: LeafTrackNet在CanolaTrack上相比现有方法提升了9%的HOTA指标，优于植物特异性跟踪器和最先进的MOT基线方法。 Conclusion: 本工作为农业作物叶片级跟踪建立了新标准，提供了目前最大的农业作物叶片跟踪数据集CanolaTrack，并推动植物表型分析的未来研究。 Abstract: High resolution phenotyping at the level of individual leaves offers fine-grained insights into plant development and stress responses. However, the full potential of accurate leaf tracking over time remains largely unexplored due to the absence of robust tracking methods-particularly for structurally complex crops such as canola. Existing plant-specific tracking methods are typically limited to small-scale species or rely on constrained imaging conditions. In contrast, generic multi-object tracking (MOT) methods are not designed for dynamic biological scenes. Progress in the development of accurate leaf tracking models has also been hindered by a lack of large-scale datasets captured under realistic conditions. In this work, we introduce CanolaTrack, a new benchmark dataset comprising 5,704 RGB images with 31,840 annotated leaf instances spanning the early growth stages of 184 canola plants. To enable accurate leaf tracking over time, we introduce LeafTrackNet, an efficient framework that combines a YOLOv10-based leaf detector with a MobileNetV3-based embedding network. During inference, leaf identities are maintained over time through an embedding-based memory association strategy. LeafTrackNet outperforms both plant-specific trackers and state-of-the-art MOT baselines, achieving a 9% HOTA improvement on CanolaTrack. With our work we provide a new standard for leaf-level tracking under realistic conditions and we provide CanolaTrack - the largest dataset for leaf tracking in agriculture crops, which will contribute to future research in plant phenotyping. Our code and dataset are publicly available at https://github.com/shl-shawn/LeafTrackNet.

[213] Weight Space Correlation Analysis: Quantifying Feature Utilization in Deep Learning Models

Chun Kit Wong,Paraskevas Pegios,Nina Weng,Emilie Pi Fogtmann Sejer,Martin Grønnebæk Tolsgaard,Anders Nymark Christensen,Aasa Feragen

Main category: cs.CV

TL;DR: 提出权重空间相关性分析方法，用于量化深度学习模型在医学图像中对临床任务和元数据特征的利用情况，验证了模型在无诱导偏见下选择性使用与真实临床信号相关的特征。

Details

Motivation: 检测医学图像深度学习模型是否利用嵌入在图像中的混淆元数据（如扫描仪型号）进行预测，以判断其是否存在捷径学习问题。 Method: 提出权重空间相关性分析（Weight Space Correlation Analysis），通过测量主临床任务与辅助元数据任务分类头之间的对齐程度来量化特征利用。 Result: 该方法成功检测到人工诱导的捷径学习；在sPTB预测模型中发现，尽管嵌入包含大量元数据信息，但分类器权重与临床相关因素（如出生体重）高度相关，而与采集设备等无关因素解耦。 Conclusion: 所提方法可有效验证模型可信度，表明在无偏置训练下，临床模型能选择性利用真实临床信号相关的特征，避免依赖无关元数据。 Abstract: Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier's weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g. scanner). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.

[214] StarryGazer: Leveraging Monocular Depth Estimation Models for Domain-Agnostic Single Depth Image Completion

Sangmin Hong,Suyoung Lee,Kyoung Mu Lee

Main category: cs.CV

TL;DR: 本文提出了一种名为StarryGazer的无监督深度补全框架，利用大规模单目深度估计（MDE）模型生成相对深度图，并通过合成伪真值训练细化网络，实现从稀疏深度图和RGB图像中恢复密集深度图，无需真实深度标签。

Details

Motivation: 现有无监督深度补全方法依赖辅助数据，不符合实际场景；虽然MDE模型能从单图估计相对深度，但尚未有效结合稀疏深度进行补全，且简单仿射变换误差大。 Method: 使用预训练MDE模型生成相对深度图，对其进行分割和随机缩放构建合成的稠密伪标签与稀疏深度对；利用这些合成数据训练一个融合相对深度图和RGB图像的细化网络。 Result: StarryGazer在多个数据集上优于现有的无监督方法和直接转换MDE的结果，验证了其利用MDE能力并用稀疏深度修正误差的有效性。 Conclusion: StarryGazer是一种领域无关、不依赖真实深度标签的深度补全框架，成功融合了MDE模型的先验与稀疏深度信息，显著提升了无监督深度补全的性能。 Abstract: The problem of depth completion involves predicting a dense depth image from a single sparse depth map and an RGB image. Unsupervised depth completion methods have been proposed for various datasets where ground truth depth data is unavailable and supervised methods cannot be applied. However, these models require auxiliary data to estimate depth values, which is far from real scenarios. Monocular depth estimation (MDE) models can produce a plausible relative depth map from a single image, but there is no work to properly combine the sparse depth map with MDE for depth completion; a simple affine transformation to the depth map will yield a high error since MDE are inaccurate at estimating depth difference between objects. We introduce StarryGazer, a domain-agnostic framework that predicts dense depth images from a single sparse depth image and an RGB image without relying on ground-truth depth by leveraging the power of large MDE models. First, we employ a pre-trained MDE model to produce relative depth images. These images are segmented and randomly rescaled to form synthetic pairs for dense pseudo-ground truth and corresponding sparse depths. A refinement network is trained with the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model's accuracy and robustness. StarryGazer shows superior results over existing unsupervised methods and transformed MDE results on various datasets, demonstrating that our framework exploits the power of MDE models while appropriately fixing errors using sparse depth information.

[215] Intrinsic Image Fusion for Multi-View 3D Material Reconstruction

Peter Kocsis,Lukas Höllein,Matthias Nießner

Main category: cs.CV

TL;DR: 提出了一种名为Intrinsic Image Fusion的方法，通过融合多视角图像中的单视角先验信息，实现高质量的基于物理材质重建。

Details

Motivation: 材质重建问题高度欠约束，传统方法依赖于计算昂贵且噪声较大的分析-合成方法。因此需要引入更有效的先验信息来更好约束优化过程。 Method: 利用基于扩散的材质估计器生成每视图的多个候选分解，并拟合一个显式的低维参数化函数；通过软性单视图预测选择和基于置信度的多视图内点集融合策略，将最一致、最可信的预测融合到一致的参数化材质空间中，最后使用逆路径追踪优化低维参数。 Result: 在合成和真实场景上均优于现有最先进方法，实现了更优的材质解耦，重建结果清晰、锐利，适用于高质量重光照。 Conclusion: 所提出的方法有效缓解了材质重建中的欠约束问题，通过融合多视角一致性预测显著提升了重建质量与稳定性。 Abstract: We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.

[216] A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis

Xianchao Guan,Zhiyuan Fan,Yifeng Wang,Fuqiang Chen,Yanjiang Zhou,Zengyang Che,Hongxue Meng,Xin Li,Yaowei Wang,Hongpeng Wang,Min Zhang,Heng Tao Shen,Zheng Zhang,Yongbing Zhang

Main category: cs.CV

TL;DR: CRAFTS是一种病理学专用的文本到图像生成基础模型，通过双阶段训练和相关性调节对齐机制，生成高质量、生物准确的多样化病理图像，缓解数据稀缺与隐私问题，提升多种临床任务性能。

Details

Motivation: 由于缺乏多样且高质量标注的病理数据集，临床级人工智能在病理学中的发展受到限制。现有生成模型存在语义不稳定和形态幻觉问题，影响诊断可靠性。 Method: 提出CRAFTS框架，采用双阶段训练策略，基于约280万图文对进行训练；引入新的对齐机制抑制语义漂移，确保生物学准确性；结合ControlNet实现对组织结构的精确控制。 Result: 模型成功生成涵盖30种癌症类型的病理图像，质量经客观指标和病理医生评估验证；CRAFTS增强的数据集提升了分类、跨模态检索、自监督学习和视觉问答等任务的表现。 Conclusion: CRAFTS克服了数据稀缺和隐私保护的关键障碍，提供了无限且多样的标注组织学数据来源，推动了针对罕见和复杂癌症表型的稳健诊断工具的发展。 Abstract: The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.

[217] Seeing the Whole Picture: Distribution-Guided Data-Free Distillation for Semantic Segmentation

Hongxuan Sun,Tao Wu

Main category: cs.CV

TL;DR: 本文提出了一种面向语义分割的新型无数据知识蒸馏框架DFSS，通过利用批归一化统计信息引导近似分布采样，并引入加权分布渐进蒸馏策略，有效保持场景的结构与上下文连续性，在标准基准上实现了最先进的性能。

Details

Motivation: 现有无数据知识蒸馏方法主要针对分类任务设计，忽视了语义分割中对象的空间连续性和结构一致性，导致直接应用于分割任务时性能显著下降。 Method: 提出DFSS框架，利用教师模型的Batch Normalization统计信息指导近似分布采样（ADS），避免依赖可能误导的教师预测；并设计加权分布渐进蒸馏（WDPD）策略，动态优先选择更接近原始数据分布的可靠样本，逐步引入更具挑战性的样本。 Result: 在多个标准语义分割基准上的实验表明，DFSS显著优于现有的无数据蒸馏方法，取得了当前最优的结果，且对辅助数据的依赖大大减少。 Conclusion: DFSS通过尊重真实场景的结构与上下文连续性，为语义分割任务提供了一个高效、可靠的无数据知识蒸馏解决方案，推动了该领域的发展。 Abstract: Semantic segmentation requires a holistic understanding of the physical world, as it assigns semantic labels to spatially continuous and structurally coherent objects rather than to isolated pixels. However, existing data-free knowledge distillation (DFKD) methods-primarily designed for classification-often disregard this continuity, resulting in significant performance degradation when applied directly to segmentation tasks. In this paper, we introduce DFSS, a novel data-free distillation framework tailored for semantic segmentation. Unlike prior approaches that treat pixels independently, DFSS respects the structural and contextual continuity of real-world scenes. Our key insight is to leverage Batch Normalization (BN) statistics from a teacher model to guide Approximate Distribution Sampling (ADS), enabling the selection of data that better reflects the original training distribution-without relying on potentially misleading teacher predictions. Additionally, we propose Weighted Distribution Progressive Distillation (WDPD), which dynamically prioritizes reliable samples that are more closely aligned with the original data distribution early in training and gradually incorporates more challenging cases, mirroring the natural progression of learning in human perception. Extensive experiments on standard benchmarks demonstrate that DFSS consistently outperforms existing data-free distillation methods for semantic segmentation, achieving state-of-the-art results with significantly reduced reliance on auxiliary data.

[218] MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion

Minghui Hou,Wei-Hsing Huang,Shaofeng Liang,Daizong Liu,Tai-Hao Wen,Gang Wang,Runwei Guan,Weiping Ding

Main category: cs.CV

TL;DR: 本文提出MMDrive，一种用于自动驾驶的多模态视觉-语言模型框架，通过融合占据地图、LiDAR点云和文本描述三种模态，突破传统基于二维图像的理解局限，实现三维场景的深度语义理解。

Details

Motivation: 现有视觉-语言模型受限于二维图像理解范式，难以有效感知3D空间信息并进行深层语义融合，导致在复杂驾驶环境中表现不佳。 Method: MMDrive引入两种新组件：面向文本的多模态调制器（动态加权各模态贡献）和跨模态抽象器（通过可学习的抽象令牌生成紧凑的跨模态摘要），实现自适应跨模态融合与关键信息提取。 Result: 在DriveLM和NuScenes-QA基准上显著优于现有方法，DriveLM上BLEU-4为54.56，METEOR为41.78；NuScenes-QA上准确率达62.7%。 Conclusion: MMDrive打破了传统仅依赖图像的理解限制，实现了复杂驾驶环境下的鲁棒多模态推理，为可解释的自动驾驶场景理解提供了新基础。 Abstract: Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing it as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, an multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.

[219] CoRA: A Collaborative Robust Architecture with Hybrid Fusion for Efficient Perception

Gong Chen,Chaokun Zhang,Pengcheng Lv,Xiaohui Xie

Main category: cs.CV

TL;DR: 本文提出了一种名为CoRA的新型协作感知架构，通过特征级融合和对象级校正的混合方法，在低通信开销下实现了高性能与强鲁棒性。

Details

Motivation: 现有基于中间融合的协作感知方法在通信条件不佳时因数据传输错位而导致性能下降，限制了实际部署。 Method: 提出CoRA架构，包含两个分支：一是特征级融合分支，选择关键特征以高效融合；二是对象级校正分支，利用语义相关性校正空间位姿误差。 Result: 实验表明，在极端场景下，CoRA相比基线方法在AP@0.7上提升约19%，且通信量减少5倍以上。 Conclusion: CoRA通过解耦性能与鲁棒性，有效结合中间融合与后期融合的优势，成为实现鲁棒协作感知的有前景方案。 Abstract: Collaborative perception has garnered significant attention as a crucial technology to overcome the perceptual limitations of single-agent systems. Many state-of-the-art (SOTA) methods have achieved communication efficiency and high performance via intermediate fusion. However, they share a critical vulnerability: their performance degrades under adverse communication conditions due to the misalignment induced by data transmission, which severely hampers their practical deployment. To bridge this gap, we re-examine different fusion paradigms, and recover that the strengths of intermediate and late fusion are not a trade-off, but a complementary pairing. Based on this key insight, we propose CoRA, a novel collaborative robust architecture with a hybrid approach to decouple performance from robustness with low communication. It is composed of two components: a feature-level fusion branch and an object-level correction branch. Its first branch selects critical features and fuses them efficiently to ensure both performance and scalability. The second branch leverages semantic relevance to correct spatial displacements, guaranteeing resilience against pose errors. Experiments demonstrate the superiority of CoRA. Under extreme scenarios, CoRA improves upon its baseline performance by approximately 19% in AP@0.7 with more than 5x less communication volume, which makes it a promising solution for robust collaborative perception.

[220] POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

Zhuo Chen,Chengqun Yang,Zhuo Su,Zheng Lv,Jingnan Gao,Xiaoyuan Zhang,Xiaokang Yang,Yichao Yan

Main category: cs.CV

TL;DR: 本文提出了POLAR，一个大规模、物理校准的单光源人脸重光照数据集，以及基于该数据集的生成模型POLARNet，实现了身份保持且物理可解释的人脸重光照，构建了一个自持续的照明学习框架。

Details

Motivation: 现有的面部重光照研究受限于大规模、物理一致照明数据的缺乏，难以实现真实感和可控性兼备的光照合成。 Method: 构建了包含200多个受试者、156个光照方向的大规模OLAT数据集POLAR，并提出基于流的生成模型POLARNet，从单张肖像预测每个光源下的响应，将光照建模为连续的、物理可解释的状态变换。 Result: POLARNet能够捕捉细粒度、方向感知的光照效果，同时保持面部身份；与依赖统计或背景线索的方法不同，其物理可解释性支持更可控和可扩展的重光照。 Conclusion: POLAR和POLARNet共同构成了一个统一的照明学习框架，连接真实数据、生成合成与物理基础的重光照，形成可扩展、可重复的人像照明研究的自持续循环。 Abstract: Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining "chicken-and-egg" cycle for scalable and reproducible portrait illumination.

[221] Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance

Francesco Ragusa,Michele Mazzamuto,Rosario Forte,Irene D'Ambra,James Fort,Jakob Engel,Antonino Furnari,Giovanni Maria Farinella

Main category: cs.CV

TL;DR: 本文提出了Ego-EXTRA，一个用于专家-学员辅助的视频-语言数据集，包含50小时的以第一人称视角记录的无脚本程序性活动视频，配合真实专家提供的自然语言指导和问答。该数据集通过“Wizard of OZ”范式收集，生成超过15,000个高质量视觉问答对，用于评估多模态大语言模型在提供专家级辅助方面的性能，结果表明现有模型仍面临挑战。

Details

Motivation: 为了推动面向第一人称视角（egocentric）场景下的智能助手发展，需要高质量的专家与学员互动数据来训练和评估模型，而现有数据集缺乏真实、细粒度的专家指导对话。 Method: 采用“Wizard of OZ”实验范式，让专家通过第一人称视角观察学员操作，并实时提供自然语言反馈或主动建议；收集双人对话并转录，构建包含15k+视觉问答对的基准数据集，用于评估多模态大语言模型。 Result: 构建了Ego-EXTRA数据集，包含50小时视频和超过15,000个高质量视觉问答对；基于该数据集的实验表明当前多模态大语言模型在提供专家级辅助方面表现有限，任务具有挑战性。 Conclusion: Ego-EXTRA为评估和推动面向第一人称视角的智能助手提供了高质量的数据和新基准，揭示了现有模型在真实专家指导场景中的不足，有助于未来研究更实用的多模态助手系统。 Abstract: We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a ``Wizard of OZ'' data collection paradigm, the expert enacts a wearable intelligent assistant, looking at the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively interacting with suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture a high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset is publicly available to support the benchmark of egocentric video-language assistants: https://fpv-iplab.github.io/Ego-EXTRA/.

[222] STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

Foivos Paraperas Papantoniou,Stathis Galanakis,Rolandos Alexandros Potamias,Bernhard Kainz,Stefanos Zafeiriou

Main category: cs.CV

TL;DR: 本文提出了STARCaster，一种统一的身份感知时空视频扩散模型，用于语音驱动的肖像动画和自由视角说话头像合成，通过软身份约束和隐式3D感知提升运动多样性与身份一致性。

Details

Motivation: 现有2D语音到视频模型依赖强参考引导导致运动单一，3D感知方法存在重建缺陷与身份漂移问题，需更鲁棒的统一框架。 Method: 采用身份感知的时空扩散架构，引入软身份约束；利用视频数据的多视角特性实现隐式3D感知；通过唇读监督实现音画同步，并设计时序到空间的自适应机制生成新视角；采用解耦学习与自强制训练策略。 Result: 在多个基准上超越先前方法，具备良好的跨任务与跨身份泛化能力，生成动画更具动态性且身份保持更好。 Conclusion: STARCaster通过解耦身份、运动与视角建模，在无需显式3D表示的情况下实现了高质量、多视角的说话头像视频生成，为语音驱动动画提供了高效统一的解决方案。 Abstract: This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.

[223] Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection

Juil Koo,Daehyeon Choi,Sangwoo Youn,Phillip Y. Lee,Minhyuk Sung

Main category: cs.CV

TL;DR: 本文提出了一个名为VG-AVS的新任务，旨在通过当前图像中的视觉信息选择最具信息量的下一个视点，而无需依赖场景记忆或外部知识。为此，作者构建了一个合成数据集，并提出了一种基于预训练视觉语言模型（VLM）的框架，结合监督微调和基于强化学习的策略优化。实验表明，该方法在问题回答性能上表现出色，并能很好地泛化到未见过的合成和真实场景中，同时还能提升现有场景探索型问答系统的准确性。

Details

Motivation: 现有的视觉语言模型主要基于静态图像进行推理，缺乏在动态环境中主动选择最优视角的能力，而这种能力对于具身智能体实现有效环境交互至关重要。因此，研究者希望探索一种仅依赖当前视觉信息就能主动选择最佳观察位置的方法，以增强模型的视觉理解与决策能力。 Method: 提出Visually Grounded Active View Selection (VG-AVS)任务，构建包含配对查询-目标视图及问答提示的合成数据集；采用预训练VLM，先进行监督微调（SFT），再通过基于强化学习的策略优化来训练模型以选择最优下一视点。 Result: 所提方法在基于视角选择的问题回答任务中表现优异，能够有效泛化至未见的合成与真实场景，并且集成到现有场景探索系统后显著提升了下游问答准确率。 Conclusion: VG-AVS为从静态视觉向主动视觉过渡提供了可行路径，验证了仅依靠当前视觉线索进行主动视点选择的有效性，增强了VLM在具身环境中的适用性。 Abstract: Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.

[224] CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing

Yan Li,Lin Liu,Xiaopeng Zhang,Wei Xue,Wenhan Luo,Yike Guo,Qi Tian

Main category: cs.CV

TL;DR: 提出CogniEdit框架，结合多模态推理与密集奖励优化，实现细粒度指令引导的图像编辑，在轨迹级控制和视觉质量上达到SOTA。

Details

Motivation: 现有基于扩散模型的指令编辑方法在处理颜色、位置、数量等细粒度属性时表现不佳，且仅在单个采样步优化，缺乏轨迹级控制。 Method: 构建CogniEdit框架：1）使用多模态大语言模型解析复杂指令；2）动态令牌聚焦重定位以强调细粒度属性；3）基于密集GRPO的跨步梯度传播实现轨迹级监督。 Result: 在多个基准数据集上实验表明，CogniEdit在遵循细粒度指令的同时保持了优异的视觉质量和可编辑性。 Conclusion: CogniEdit通过多模态推理与密集梯度优化的结合，显著提升了指令驱动图像编辑的精度与连贯性。 Abstract: Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation

[225] Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Jiaqi Wang,Weijia Wu,Yi Zhan,Rui Zhao,Ming Hu,James Cheng,Wei Liu,Philip Torr,Kevin Qinghong Lin

Main category: cs.CV

TL;DR: 本文提出了一个名为Video Reality Test的新型基准测试，用于评估AI生成的音视频内容在沉浸式ASMR场景下的真实感，采用对抗性的创建者-评审者协议，揭示了当前视频生成模型（如Veo3.1-Fast）已能显著欺骗视觉语言模型（VLMs），而人类专家仍表现更优。

Details

Motivation: 随着AI生成视频质量的提升，尤其是带有音频的沉浸式内容，传统仅基于视觉或无音频的检测方法已不足以应对现实挑战，亟需新的评测基准来衡量生成视频的真实性和感知一致性。 Method: 构建了一个基于真实ASMR视频的数据集，强调精细的动作-物体交互和多样的音视频耦合；设计了对抗性协议：生成模型为‘创建者’，VLM为‘评审者’；通过人类专家与VLM在识别真伪视频上的准确率进行评估，并分析音频、水印等因素的影响。 Result: 最先进的生成模型Veo3.1-Fast能够欺骗大多数VLM，最强的评审模型Gemini 2.5-Pro仅达到56%的准确率（接近随机水平50%），远低于人类专家的81.25%；加入音频有助于区分真实与虚假视频，但水印等表面线索仍会显著误导模型判断。 Conclusion: 当前AI生成的音视频已达到较高感知真实感，能在紧密音视频耦合场景下欺骗先进VLM，暴露了现有模型在感知保真度和跨模态一致性上的局限，强调未来需更鲁棒的检测机制和更贴近人类感知的评估标准。 Abstract: Recent advances in video generation have produced vivid content that are often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus on classification solely. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: \textbf{(i) Immersive ASMR video-audio sources.} Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. \textbf{(ii) Peer-Review evaluation.} An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show: The best creator Veo3.1-Fast even fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56\% accuracy (random 50\%), far below that of human experts (81.25\%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.

[226] CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

Bo Liu,Qiao Qin,Qinghui He

Main category: cs.CV

TL;DR: 本文提出了CausalCLIP框架，通过因果推断方法解耦生成图像检测中的因果特征与非因果特征，提升了模型在未见生成模型上的泛化能力。

Details

Motivation: 现有生成图像检测方法提取的特征高度纠缠，混杂了任务相关的因果特征和无关的非因果特征，导致跨模型泛化能力差。 Method: 构建结构因果模型，使用Gumbel-Softmax特征掩码和HSIC约束实现因果与非因果特征的解耦与统计独立，保留可迁移且判别性强的法医线索。 Result: 在未见生成模型上测试，CausalCLIP比现有最先进方法准确率提升6.83%，平均精度提升4.06%。 Conclusion: CausalCLIP通过因果特征解耦显著增强了生成图像检测器的跨模型泛化能力，为通用检测提供了有效解决方案。 Abstract: The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.

[227] LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models

Shu Yu,Chaochao Lu

Main category: cs.CV

TL;DR: 本文提出了一种名为LINA的新框架，通过因果感知的干预和去噪调度，提升扩散模型在物理对齐和分布外指令遵循方面的能力。

Details

Motivation: 现有的扩散模型在物理对齐和分布外指令遵循方面表现不佳，主要由于未能学习因果方向和解耦因果因素。本文旨在诊断并解决这些问题。 Method: 引入因果场景图（CSG）和物理对齐探针（PAP）数据集进行诊断分析，并提出LINA框架，包含在提示和视觉潜在空间中的定向引导以及重新分配的因果感知去噪调度。 Result: LINA在具有挑战性的因果生成任务和Winoground数据集上实现了最先进的性能，显著提升了图像和视频扩散模型的物理对齐与OOD指令跟随能力。 Conclusion: 通过引入因果结构感知机制，扩散模型可以更有效地处理复杂因果推理任务，LINA为实现更鲁棒、可控的生成模型提供了新方向。 Abstract: Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at https://opencausalab.github.io/LINA.

Zhihang Liu,Xiaoyi Bao,Pandeng Li,Junjie Zhou,Zhaohe Liao,Yefei He,Kaixun Jiang,Chen-Wei Xie,Yun Zheng,Hongtao Xie

Main category: cs.CV

TL;DR: 本文提出了一个新任务——创意表格可视化，并设计了ShowTable流水线，结合MLLM与扩散模型，通过渐进式自修正过程生成高保真信息图。

Details

Motivation: 现有生成模型在需要深度推理、规划和精确数据到视觉映射的任务上表现不佳，难以满足复杂表格数据的美观且准确的可视化需求。 Method: 提出ShowTable框架，利用MLLM进行视觉规划和错误判断，指导扩散模型执行生成；并通过三个自动化数据构建流程训练各模块。 Result: 实验表明，该方法在新提出的TableVisBench基准（包含800个实例、5个评估维度）上显著优于基线模型。 Conclusion: ShowTable通过多模态推理、生成与纠错机制，有效提升了复杂表格数据的可视化质量，推动了需精细推理的视觉生成任务发展。 Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator for reasoning the visual plan and judging visual errors to provide refined instructions, the diffusion execute the commands from MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.

[229] KlingAvatar 2.0 Technical Report

Kling Team,Jialu Chen,Yikang Ding,Zhixue Fang,Kun Gai,Yuan Gao,Kang He,Jingyun Hua,Boyuan Jiang,Mingming Lao,Xiaohan Li,Hui Liu,Jiwen Liu,Xiaoqiang Liu,Yuan Liu,Shun Lu,Yongsen Mao,Yingchao Shao,Huafeng Shi,Xiaoyu Shi,Peiqin Sun,Songlin Tang,Pengfei Wan,Chao Wang,Xuebo Wang,Haoxian Zhang,Yuanxing Zhang,Yan Zhou

Main category: cs.CV

TL;DR: 提出KlingAvatar 2.0，一种时空级联框架，用于高效生成长时、高分辨率、多模态对齐的视频，通过蓝图关键帧与精细化子片段生成，结合LLM专家系统提升指令跟随能力。

Details

Motivation: 现有Avatar视频生成模型在生成长时高清视频时存在时间漂移、质量退化和提示跟随弱等问题，亟需提升生成效率与多模态对齐能力。 Method: 采用时空级联框架：先生成低分辨率关键帧作为全局语义与运动蓝图，再通过首尾帧策略精细化为高分辨率子片段；引入由三个模态专用LLM组成的Co-Reasoning Director进行多轮对话推理用户意图，并设计Negative Director优化负向提示；支持ID特定的多角色控制。 Result: 实验表明，该方法在长时高清视频生成中实现了更优的视觉清晰度、真实唇齿渲染与口型同步、强身份保持性，以及连贯的多模态指令跟随能力。 Conclusion: KlingAvatar 2.0有效解决了长时高分辨率Avatar视频生成中的效率、一致性与指令对齐难题，推动了多模态可控视频生成的发展。 Abstract: Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.

[230] Face Identity Unlearning for Retrieval via Embedding Dispersion

Mikhail Zakharov

Main category: cs.CV

TL;DR: 本文研究了人脸识别检索系统中的身份遗忘问题，提出了一种基于分散嵌入的遗忘方法，在实现特定身份不可检索的同时，保持模型对其他身份的识别性能。

Details

Motivation: 由于人脸识别技术可能被用于未经授权的身份追踪，引发隐私问题，因此需要研究如何在不损害整体性能的前提下，使特定身份从系统中被有效遗忘。 Method: 通过在超球面上分散待遗忘身份的嵌入表示，破坏其紧凑性以防止重识别；评估了多种近似类别遗忘方法，并提出一种简单有效的基于分散的遗忘策略。 Result: 在VGGFace2和CelebA等标准数据集上的实验表明，所提方法在遗忘效果上优于现有方法，同时较好地保留了模型的检索性能。 Conclusion: 该研究为现代基于嵌入的人脸检索系统提供了可行的身份遗忘方案，平衡了隐私保护与模型效用之间的矛盾。 Abstract: Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.

[231] Automated User Identification from Facial Thermograms with Siamese Networks

Elizaveta Prozorova,Anton Konev,Vladimir Faerman

Main category: cs.CV

TL;DR: 本文研究了基于面部热成像图的生物识别身份验证技术，比较了不同红外波段，并提出使用Siamese神经网络实现自动化识别，在自建数据集上达到约80%的准确率，表明热成像在安全系统中具有潜力。

Details

Motivation: 克服传统可见光生物识别在光照变化、伪装等情况下的局限性，探索热成像在身份识别中的应用潜力。 Method: 比较NIR、SWIR、MWIR和LWIR红外波段，分析不同热像仪参数（分辨率、热灵敏度、帧率）对识别性能的影响，并采用Siamese神经网络进行特征匹配与身份验证。 Result: 在私有数据集上的实验显示，所提方法实现了约80%的识别准确率，且LWIR波段表现较优；同时验证了可见光与红外融合的混合系统可提升鲁棒性。 Conclusion: 热成像技术，尤其是结合深度学习方法和多模态融合策略，具备发展为可靠生物识别安全系统的潜力。 Abstract: The article analyzes the use of thermal imaging technologies for biometric identification based on facial thermograms. It presents a comparative analysis of infrared spectral ranges (NIR, SWIR, MWIR, and LWIR). The paper also defines key requirements for thermal cameras used in biometric systems, including sensor resolution, thermal sensitivity, and a frame rate of at least 30 Hz. Siamese neural networks are proposed as an effective approach for automating the identification process. In experiments conducted on a proprietary dataset, the proposed method achieved an accuracy of approximately 80%. The study also examines the potential of hybrid systems that combine visible and infrared spectra to overcome the limitations of individual modalities. The results indicate that thermal imaging is a promising technology for developing reliable security systems.

[232] Unlocking Generalization in Polyp Segmentation with DINO Self-Attention "keys"

Carla Monteiro,Valentina Corbetta,Regina Beets-Tan,Luís F. Teixeira,Wilson Silva

Main category: cs.CV

TL;DR: 提出一种基于DINO自注意力“key”特征的简单而鲁棒的结肠息肉分割框架，在多中心数据集上实现了领域泛化和极端单域泛化的最先进性能。

Details

Motivation: 现有息肉分割方法在数据受限或挑战性场景下泛化能力差，且常依赖复杂的任务特定架构。 Method: 利用Vision Transformer中自注意力模块的key特征，结合简单的卷积解码器进行息肉掩码预测，避免使用深层特征和复杂结构。 Result: 在多中心数据集的领域泛化（DG）和极端单域泛化（ESDG）协议下均取得SOTA性能，显著提升在数据稀缺和挑战场景下的表现，并超越nnU-Net和UM-Net等成熟模型。 Conclusion: 该方法通过利用DINO的内在鲁棒性特征和简化架构设计，有效提升了息肉分割的泛化能力和实用性，同时为DINO框架的演进提供了系统性基准评估。 Abstract: Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention "key" features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.

[233] Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs

Anran Qi,Changjian Li,Adrien Bousseau,Niloy J. Mitra

Main category: cs.CV

TL;DR: 本文提出一种无需训练的图像到视频生成方法，通过引入可编辑的代理动态图（PDG）分离运动控制与外观合成，实现对最终帧中遮挡区域内容的显式用户控制。

Details

Motivation: 现有图像到视频方法在生成合理运动的同时，难以实现可预测的关节运动并保证用户指定内容在新暴露区域的准确呈现。 Method: 提出代理动态图（PDG）来近似驱动部件运动，并利用冻结的扩散先验进行外观合成；用户可编辑PDG和遮挡区域外观，结合可见性信息在潜在空间中融合运动与用户意图。 Result: 在无需微调的情况下实现了对关节约束物体、家具、车辆和可变形物体的可控视频生成，显著优于现有最先进方法。 Conclusion: 该方法结合生成控制与精确外观控制，提供了一种新的图像到视频工作流，支持用户对运动和遮挡区域内容的灵活操控。 Abstract: We address image-to-video generation with explicit user control over the final frame's disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages against state-of-the-art alternatives towards images turned into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification in the final frame in the disoccluded regions, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyondvisible.github.io/

[234] rNCA: Self-Repairing Segmentation Masks

Malte Silbernagel,Albert Alonso,Jens Petersen,Bulat Ibragimov,Marleen de Bruijne,Madeleine K. Wyburd

Main category: cs.CV

TL;DR: 本文提出了一种基于神经细胞自动机（NCA）的分割掩码拓扑修复方法，称为rNCA，能有效修复多种分割模型产生的碎片化或断连问题。

Details

Motivation: 通用分割模型常产生拓扑结构错误的掩码（如断裂、碎片），现有修复方法依赖手工规则或特定架构，缺乏通用性。 Method: 将NCA用作迭代式局部 refinement 机制，在图像上下文指导下，通过训练学习目标形状的结构特性，对粗略预测的掩码进行逐步修复。 Result: 在视网膜血管和心肌分割任务中，rNCA显著改善拓扑指标：血管分割中β₀误差减少60%，β₁减少20%；心肌分割中61.5%断裂案例被零样本修复，ASSD和HD分别降低19%和16%。 Conclusion: NCA可作为通用、有效的拓扑修复模块，适用于不同基础模型和任务，提升分割结果的结构一致性。 Abstract: Accurately predicting topologically correct masks remains a difficult task for general segmentation models, which often produce fragmented or disconnected outputs. Fixing these artifacts typically requires hand-crafted refinement rules or architectures specialized to a particular task. Here, we show that Neural Cellular Automata (NCA) can be directly re-purposed as an effective refinement mechanism, using local, iterative updates guided by image context to repair segmentation masks. By training on imperfect masks and ground truths, the automaton learns the structural properties of the target shape while relying solely on local information. When applied to coarse, globally predicted masks, the learned dynamics progressively reconnect broken regions, prune loose fragments and converge towards stable, topologically consistent results. We show how refinement NCA (rNCA) can be easily applied to repair common topological errors produced by different base segmentation models and tasks: for fragmented retinal vessels, it yields 2-3% gains in Dice/clDice and improves Betti errors, reducing $β_0$ errors by 60% and $β_1$ by 20%; for myocardium, it repairs 61.5% of broken cases in a zero-shot setting while lowering ASSD and HD by 19% and 16%, respectively. This showcases NCA as effective and broadly applicable refiners.

[235] End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

Lorenzo Pettinari,Sidaty El Hadramy,Michael Wehrli,Philippe C. Cattin,Daniel Studer,Carol C. Hasler,Maria Licci

Main category: cs.CV

TL;DR: 本文提出了一种名为End2Reg的端到端深度学习框架，用于脊柱手术中的无标记RGB-D图像配准，通过联合优化分割与配准，显著提高了配准精度，并实现了全自动化的术中导航。

Details

Motivation: 现有的术中导航系统依赖于术中放射成像和骨锚定标记物，具有侵入性、辐射强且干扰工作流程；而当前的无标记RGB-D配准方法依赖弱分割标签，容易传播误差。因此需要一种更精确、自动化的解决方案。 Method: 提出End2Reg，一个端到端的深度学习框架，联合优化解剖结构的分割与配准过程。该网络在没有直接分割监督的情况下，仅通过配准目标引导学习专为配准优化的分割掩码。 Result: 在体外和体内基准测试中，该方法达到最先进的性能，中位目标配准误差降低32%至1.83mm，均方根误差平均降低45%至3.95mm。消融实验表明端到端优化显著提升配准精度。 Conclusion: 所提出的端到端RGB-D配准流程摆脱了对弱标签和人工步骤的依赖，推动了完全自动化、无标记术中导航的发展。代码和交互式可视化可在https://lorenzopettinari.github.io/end-2-reg/获取。 Abstract: Purpose: Intraoperative navigation in spine surgery demands millimeter-level accuracy. Current systems based on intraoperative radiographic imaging and bone-anchored markers are invasive, radiation-intensive and workflow disruptive. Recent markerless RGB-D registration methods offer a promising alternative, but existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, which can propagate errors throughout registration. Methods: We present End2Reg an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for weak segmentation labels and manual steps. The network learns segmentation masks specifically optimized for registration, guided solely by the registration objective without direct segmentation supervision. Results: The proposed framework achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% to 1.83mm and mean Root Mean Square Error by 45% to 3.95mm, respectively. An ablation study confirms that end-to-end optimization significantly improves registration accuracy. Conclusion: The presented end-to-end RGB-D registration pipeline removes dependency on weak labels and manual steps, advancing towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.

[236] Computer vision training dataset generation for robotic environments using Gaussian splatting

Patryk Niżeniec,Marcin Iwanowski

Main category: cs.CV

TL;DR: 提出了一种结合3D高斯点阵与游戏引擎生成大规模、高真实感、自动标注视觉数据集的新流程，通过两遍渲染技术提升图像真实感，并实现像素级精确分割，实验表明混合使用少量真实数据与大量合成数据可最优提升检测与分割性能。

Details

Motivation: 解决合成图像与真实世界之间的域差距问题以及手动标注耗时的瓶颈。 Method: 利用3D高斯点阵（3DGS）构建环境和物体的逼真表示，在游戏引擎中进行物理仿真以生成自然布局，采用创新的两遍渲染技术结合点阵与代理网格生成的阴影图，算法化合成图像以增强真实感，并自动生成像素级分割掩码用于YOLO等模型。 Result: 实现了高质量、自动标注的数据集生成；实验显示，结合少量真实图像与大量合成数据的混合训练策略在目标检测和分割任务中表现最佳。 Conclusion: 该方法能高效生成高度逼真的带标注数据，结合真实与合成数据的训练策略是实现鲁棒、准确视觉模型的最优途径。 Abstract: This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses the critical challenges of the domain gap between synthetic and real-world imagery and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then used in a game engine where physics simulations create natural arrangements. A novel, two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes. This map is then algorithmically composited with the image to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models like YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming this as an optimal strategy for efficiently achieving robust and accurate models.

[237] USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

Ahmed Abul Hasanaath,Hamzah Luqman

Main category: cs.CV

TL;DR: 提出了一种统一的时空建模框架（USTM），结合Swin Transformer和轻量级时间适配器（TAPE），用于连续手语识别，仅使用RGB视频即达到最先进性能。

Details

Motivation: 现有方法在捕捉细粒度手势和面部线索以及建模长时序依赖方面存在不足，且多依赖多流输入或多模态数据。 Method: 采用Swin Transformer作为空间主干，引入带有位置编码的轻量级时间适配器（TAPE）增强时序建模能力，构建端到端的时空编码器。 Result: 在PHOENIX14、PHOENIX14T和CSL-Daily数据集上取得领先性能，优于基于RGB和多模态的方法，并与多流方法具有竞争力。 Conclusion: USTM能有效联合建模细粒度空间特征与时序上下文，为纯RGB连续手语识别提供了高效解决方案。 Abstract: Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail in capturing fine-grained hand and facial cues and modeling long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns using a combination of a Swin Transformer backbone enhanced with lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmarked datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM

[238] Learning to Generate Cross-Task Unexploitable Examples

Haoxuan Qu,Qiuchi Xiang,Yujun Cai,Yirui Wu,Majid Mirmehdi,Hossein Rahmani,Jun Liu

Main category: cs.CV

TL;DR: 提出了一种新的元跨任务不可利用示例生成框架（MCT-UEG），通过扁平最小值导向的元训练和测试方案，有效生成在多种现实世界计算机视觉任务中均不可利用的示例，以增强个人图像隐私保护。

Details

Motivation: 现有不可利用示例生成方法在实际应用中受限，难以保证生成的示例在多种真实世界视觉任务中均无法被滥用，因此需要更广泛有效的防护方法。 Method: 提出了MCT-UEG框架，采用面向扁平最小值的元训练与测试机制，优化生成器以产生跨任务鲁棒的不可利用示例。 Result: 大量实验表明，所提框架在生成广泛不可利用示例方面优于现有方法，具备更强的跨任务防御能力。 Conclusion: MCT-UEG能有效提升生成示例的不可利用性，适用于多种视觉任务场景，增强了对在线个人图像的隐私保护。 Abstract: Unexploitable example generation aims to transform personal images into their unexploitable (unlearnable) versions before they are uploaded online, thereby preventing unauthorized exploitation of online personal images. Recently, this task has garnered significant research attention due to its critical relevance to personal data privacy. Yet, despite recent progress, existing methods for this task can still suffer from limited practical applicability, as they can fail to generate examples that are broadly unexploitable across different real-world computer vision tasks. To deal with this problem, in this work, we propose a novel Meta Cross-Task Unexploitable Example Generation (MCT-UEG) framework. At the core of our framework, to optimize the unexploitable example generator for effectively producing broadly unexploitable examples, we design a flat-minima-oriented meta training and testing scheme. Extensive experiments show the efficacy of our framework.

[239] RecTok: Reconstruction Distillation along Rectified Flow

Qingyu Shi,Size Wu,Jinbin Bai,Kaidong Yu,Yujing Wang,Yunhai Tong,Xiangtai Li,Xuelong Li

Main category: cs.CV

TL;DR: 本文提出了RecTok，一种通过流语义蒸馏和重建对齐蒸馏克服高维视觉tokenizers局限性的新方法，在保持丰富语义表征的同时显著提升了图像生成质量和重建性能。

Details

Motivation: 现有高维视觉tokenizers在生成质量上表现不佳，存在维度与生成质量之间的权衡问题，限制了模型的表达能力和重建保真度。 Method: 提出RecTok，引入流语义蒸馏将视觉基础模型（VFM）的语义信息注入流匹配的前向流轨迹中，并结合掩码特征重建损失增强语义表达，使扩散Transformer的训练空间更具语义丰富性。 Result: RecTok在gFID-50K指标上取得了SOTA结果（无论是否使用无分类器引导），且随着潜在空间维度增加表现出持续提升的性能，同时实现了更优的图像重建和判别性能。 Conclusion: 通过将语义蒸馏从潜在空间转移到前向流空间，RecTok成功解决了高维视觉tokenizers的性能瓶颈，实现了高维表示与高质量生成的统一。 Abstract: Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction--alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. And we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K under both with and without classifier-free guidance settings, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at https://shi-qingyu.github.io/rectok.github.io.

[240] MineTheGap: Automatic Mining of Biases in Text-to-Image Models

Noa Cohen,Nurit Spingarn-Eliezer,Inbar Huberman-Spiegelglas,Tomer Michaeli

Main category: cs.CV

TL;DR: 本文提出了一种名为MineTheGap的方法，用于自动挖掘导致文本到图像模型生成偏见输出的提示。该方法利用遗传算法优化提示池，并通过新的偏见评分来衡量偏见严重程度。

Details

Motivation: 文本到图像模型在面对模糊提示时表现出偏见，可能带来社会影响并降低用户体验，因此需要自动发现这些偏见。 Method: 提出MineTheGap方法，结合遗传算法迭代优化提示池，使用基于LLM生成文本变体与图像分布对比的新型偏见评分指导搜索。 Result: 该方法能有效识别出引发显著偏见的提示，并在已知偏见数据集上验证了偏见评分的有效性。 Conclusion: MineTheGap能够主动发现文本到图像模型中的生成偏见，有助于后续的偏见缓解和模型改进。 Abstract: Text-to-Image (TTI) models generate images based on text prompts, which often leave certain aspects of the desired image ambiguous. When faced with these ambiguities, TTI models have been shown to exhibit biases in their interpretations. These biases can have societal impacts, e.g., when showing only a certain race for a stated occupation. They can also affect user experience when creating redundancy within a set of generated images instead of spanning diverse possibilities. Here, we introduce MineTheGap - a method for automatically mining prompts that cause a TTI model to generate biased outputs. Our method goes beyond merely detecting bias for a given prompt. Rather, it leverages a genetic algorithm to iteratively refine a pool of prompts, seeking for those that expose biases. This optimization process is driven by a novel bias score, which ranks biases according to their severity, as we validate on a dataset with known biases. For a given prompt, this score is obtained by comparing the distribution of generated images to the distribution of LLM-generated texts that constitute variations on the prompt. Code and examples are available on the project's webpage.

[241] A Domain-Adapted Lightweight Ensemble for Resource-Efficient Few-Shot Plant Disease Classification

Anika Islam,Tasfia Tahsin,Zaarin Anjum,Md. Bakhtiar Hasan,Md. Hasanul Kabir

Main category: cs.CV

TL;DR: 提出一种轻量级的少样本学习框架，结合MobileNetV2/V3与注意力增强的Bi-LSTM，实现高效植物叶片病害识别，在数据稀缺和复杂背景下仍表现优异。

Details

Motivation: 现有深度学习方法依赖大量标注数据和高计算资源，难以适用于数据稀缺和资源受限的农业环境，需开发更轻量、高效的少样本诊断模型。 Method: 采用领域适配的MobileNetV2和MobileNetV3作为特征提取器，融合其特征后输入带注意力机制的Bi-LSTM进行分类，提升少样本下的特征表示与判别能力。 Result: 在PlantVillage数据集上15-shot达到98.23%准确率，接近SOTA；在Dhan Shomadhan田间数据上达69.28%；仅用15-shot即超越先前SOTA（96.0%），达99.72%准确率，模型大小仅40MB，推理复杂度1.12 GFLOPs。 Conclusion: 该框架在保持轻量化和移动端适用性的同时，显著提升了少样本条件下的植物病害识别性能，具备在资源受限地区推广的潜力。 Abstract: Accurate and timely identification of plant leaf diseases is essential for resilient and sustainable agriculture, yet most deep learning approaches rely on large annotated datasets and computationally intensive models that are unsuitable for data-scarce and resource-constrained environments. To address these challenges we present a few-shot learning approach within a lightweight yet efficient framework that combines domain-adapted MobileNetV2 and MobileNetV3 models as feature extractors, along with a feature fusion technique to generate robust feature representation. For the classification task, the fused features are passed through a Bi-LSTM classifier enhanced with attention mechanisms to capture sequential dependencies and focus on the most relevant features, thereby achieving optimal classification performance even in complex, real-world environments with noisy or cluttered backgrounds. The proposed framework was evaluated across multiple experimental setups, including both laboratory-controlled and field-captured datasets. On tomato leaf diseases from the PlantVillage dataset, it consistently improved performance across 1 to 15 shot scenarios, reaching 98.23+-0.33% at 15 shot, closely approaching the 99.98% SOTA benchmark achieved by a Transductive LSTM with attention, while remaining lightweight and mobile-friendly. Under real-world conditions using field images from the Dhan Shomadhan dataset, it maintained robust performance, reaching 69.28+-1.49% at 15-shot and demonstrating strong resilience to complex backgrounds. Notably, it also outperformed the previous SOTA accuracy of 96.0% on six diseases from PlantVillage, achieving 99.72% with only 15-shot learning. With a compact model size of approximately 40 MB and inference complexity of approximately 1.12 GFLOPs, this work establishes a scalable, mobile-ready foundation for precise plant disease diagnostics in data-scarce regions.

[242] IMILIA: interpretable multiple instance learning for inflammation prediction in IBD from H&E whole slide images

Thalyssa Baiocco-Rodrigues,Antoine Olivier,Reda Belbahri,Thomas Duboudin,Pierre-Antoine Bannier,Benjamin Adjadj,Katharina Von Loga,Nathan Noiry,Maxime Touzot,Hector Roux de Bezieux

Main category: cs.CV

TL;DR: 本文提出IMILIA，一种用于分析炎症性肠病（IBD）组织切片中微观炎症的可解释多实例学习框架，能够预测炎症并自动生成驱动预测的组织特征标记。

Details

Motivation: 随着IBD治疗目标转向组织学缓解，准确评估微观炎症对疾病活动和治疗反应至关重要。传统方法依赖人工阅片，耗时且主观，亟需自动化、可解释的计算工具。 Method: IMILIA结合多实例学习（MIL）模型进行炎症预测，并集成HistoPLUS（细胞检测、分割与分类）和EpiSeg（上皮分割）两个模块实现可解释性分析，基于H&E染色全切片图像进行端到端学习。 Result: 在发现队列中交叉验证ROC-AUC为0.83，在两个外部验证队列中分别达到0.99和0.84；可解释模块显示高分区域富含多种免疫细胞，低分区域以正常上皮细胞为主，且模式在各数据集中一致。 Conclusion: IMILIA能有效预测IBD组织切片中的炎症，并提供生物学合理的可解释结果，有助于推动组织学评估的自动化和标准化。 Abstract: As the therapeutic target for Inflammatory Bowel Disease (IBD) shifts toward histologic remission, the accurate assessment of microscopic inflammation has become increasingly central for evaluating disease activity and response to treatment. In this work, we introduce IMILIA (Interpretable Multiple Instance Learning for Inflammation Analysis), an end-to-end framework designed for the prediction of inflammation presence in IBD digitized slides stained with hematoxylin and eosin (H&E), followed by the automated computation of markers characterizing tissue regions driving the predictions. IMILIA is composed of an inflammation prediction module, consisting of a Multiple Instance Learning (MIL) model, and an interpretability module, divided in two blocks: HistoPLUS, for cell instance detection, segmentation and classification; and EpiSeg, for epithelium segmentation. IMILIA achieves a cross-validation ROC-AUC of 0.83 on the discovery cohort, and a ROC-AUC of 0.99 and 0.84 on two external validation cohorts. The interpretability module yields biologically consistent insights: tiles with higher predicted scores show increased densities of immune cells (lymphocytes, plasmocytes, neutrophils and eosinophils), whereas lower-scored tiles predominantly contain normal epithelial cells. Notably, these patterns were consistent across all datasets. Code and models to partially replicate the results on the public IBDColEpi dataset can be found at https://github.com/owkin/imilia.

[243] Test-Time Modification: Inverse Domain Transformation for Robust Perception

Arpit Jadon,Joshua Niemeijer,Yuki M. Asano

Main category: cs.CV

TL;DR: 提出一种利用扩散模型在测试时将目标域图像映射回源域分布的新方法，以提升下游模型在未知域偏移下的泛化能力，无需额外训练或大规模数据生成，在多个任务和数据集上取得显著增益。

Details

Motivation: 现有基于生成模型的数据增强方法在域泛化中生成全面目标域变化速度慢、成本高且不完整，亟需一种更高效、低成本的方法来应对真实场景中的未知域偏移。 Method: 利用预训练的扩散模型在测试阶段将目标域图像逆向映射到源域分布，使已训练的下游模型能在熟悉分布上进行预测，仅需源域描述，无需微调模型或生成大量合成数据。 Result: 在语义分割、目标检测和图像分类任务中，面对真实到真实的域偏移（如夜间、天气变化），该方法在BDD100K-Night、ImageNet-R和DarkZurich等挑战性数据集上分别实现了137%、68%和62%的相对性能提升。 Conclusion: 该方法提供了一种高效、通用且无需训练的域泛化新范式，通过测试时的分布对齐充分利用生成模型的知识，显著提升了模型在未知环境下的鲁棒性。 Abstract: Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.

[244] PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence

Ruiyan Wang,Teng Hu,Kaihui Huang,Zihan Su,Ran Yi,Lizhuang Ma

Main category: cs.CV

TL;DR: 本文提出了PoseAnything，首个能够处理人类与非人类角色的通用姿态引导视频生成框架，并引入了部件感知的时间一致性模块和主体与相机运动解耦的CFG策略，实现了更精细的一致性保持和独立的相机运动控制。

Details

Motivation: 当前的姿态引导视频生成方法仅限于接受人体姿态作为输入，难以泛化到其他主体的姿态。为了突破这一限制，需要一个能支持任意骨骼结构输入的通用框架。 Method: 提出PoseAnything框架，包含部件感知的时间一致性模块（Part-aware Temporal Coherence Module），通过划分主体部件并在帧间建立对应关系来增强运动一致性；设计主体与相机运动解耦的CFG（Subject and Camera Motion Decoupled CFG），实现独立的相机控制；并构建XPose数据集用于训练和评估。 Result: 实验表明，PoseAnything在效果和泛化能力上显著优于现有最先进方法，尤其在非人类角色的姿态引导视频生成中表现突出。 Conclusion: PoseAnything是首个支持任意骨骼输入的通用姿态引导视频生成框架，结合新模块、控制策略和高质量数据集，有效提升了生成视频的运动一致性和控制灵活性。 Abstract: Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to pose of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that Pose-Anything significantly outperforms state-of-the-art methods in both effectiveness and generalization.

[245] Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10$\times$

Jiangning Zhang,Junwei Zhu,Teng Hu,Yabiao Wang,Donghao Luo,Weijian Cao,Zhenye Gan,Xiaobin Hu,Zhucun Xue,Chengjie Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为T3的新型Transformer优化策略，用于实现高效的原生4K视频生成，通过多尺度权重共享窗口注意力机制，在不改变预训练模型结构的前提下显著提升了生成效率和质量。

Details

Motivation: 由于全注意力机制在时空分辨率提升时计算量呈二次增长，原生4K视频生成面临巨大计算挑战，现有方法难以兼顾效率与质量。 Method: 提出T3-Video，采用多尺度权重共享窗口注意力、分层分块策略以及保持轴向的全注意力设计，对预训练模型的前向逻辑进行优化，实现注意力模式的转换。 Result: 在4K-VBench上实验表明，T3-Video相比现有方法性能显著提升（VQA +4.29，VTC +0.08），同时将原生4K视频生成速度提高10倍以上。 Conclusion: T3-Video在不修改预训练模型核心架构的基础上，通过优化前向计算逻辑，有效解决了高分辨率视频生成中的计算瓶颈，实现了效率与质量的双重提升。 Abstract: Native 4K (2160$\times$3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: while delivering performance improvements (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at https://zhangzjn.github.io/projects/T3-Video

[246] Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation

Jiangning Zhang,Junwei Zhu,Zhenye Gan,Donghao Luo,Chuming Lin,Feifan Xu,Xu Peng,Jianlong Hu,Yuansen Liu,Yijia Hong,Weijian Cao,Han Feng,Xu Chen,Chencan Fu,Keke He,Xiaobin Hu,Chengjie Wang

Main category: cs.CV

TL;DR: 提出了一种名为Soul的多模态驱动框架，用于高保真、长期的数字人动画生成，支持从单帧图像、文本和音频生成语义连贯的视频，并在唇形同步、表情生动性和身份保持方面表现优异。

Details

Motivation: 解决现有数字人动画生成中数据稀缺、长期一致性差、生成效率低以及缺乏统一评估基准的问题。 Method: 基于Wan2.2-5B模型构建，引入音频注入层、多训练策略和阈值感知的码本替换机制以提升长期一致性；采用步长/CFG蒸馏与轻量级VAE优化推理速度；构建了包含100万样本的Soul-1M数据集和Soul-Bench评估基准。 Result: 在视频质量、文本对齐、身份保持和唇形同步精度上显著优于当前主流开源和商业模型，推理速度提升11.4倍且质量损失可忽略，适用于虚拟主播、影视制作等实际场景。 Conclusion: Soul实现了高质量、长时程、多模态的数字人动画生成，结合大规模数据集和高效训练推理解决了关键挑战，具备广泛的应用前景。 Abstract: We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at https://zhangzjn.github.io/projects/Soul/

[247] Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Siyan Chen,Yanfei Chen,Ying Chen,Zhuo Chen,Feng Cheng,Xuyan Chi,Jian Cong,Qinpeng Cui,Qide Dong,Junliang Fan,Jing Fang,Zetao Fang,Chengjian Feng,Han Feng,Mingyuan Gao,Yu Gao,Qiushan Guo,Boyang Hao,Qingkai Hao,Bibo He,Qian He,Tuyen Hoang,Ruoqing Hu,Xi Hu,Weilin Huang,Zhaoyang Huang,Zhongyi Huang,Siqi Jiang,Wei Jiang,Yunpu Jiang,Zhuo Jiang,Ashley Kim,Jianan Kong,Zhichao Lai,Shanshan Lao,Ai Li,Feiya Li,Gen Li,Huixia Li,JiaShi Li,Liang Li,Ming Li,Tao Li,Xian Li,Xiaojie Li,Xiaoyang Li,Xingxing Li,Yameng Li,Yifu Li,Yiying Li,Chao Liang,Ying Liang,Zhiqiang Liang,Wang Liao,Yalin Liao,Heng Lin,Kengyu Lin,Shanchuan Lin,Xi Lin,Zhijie Lin,Feng Ling,Fangfang Liu,Gaohong Liu,Jiawei Liu,Jie Liu,Shouda Liu,Shu Liu,Sichao Liu,Songwei Liu,Xin Liu,Xue Liu,Yibo Liu,Zikun Liu,Zuxi Liu,Junlin Lyu,Lecheng Lyu,Qian Lyu,Han Mu,Xiaonan Nie,Jingzhe Ning,Xitong Pan,Yanghua Peng,Lianke Qin,Xueqiong Qu,Yuxi Ren,Yuchen Shen,Guang Shi,Lei Shi,Yan Song,Yinglong Song,Fan Sun,Li Sun,Renfei Sun,Zeyu Sun,Wenjing Tang,Zirui Tao,Feng Wang,Furui Wang,Jinran Wang,Junkai Wang,Ke Wang,Kexin Wang,Qingyi Wang,Rui Wang,Sen Wang,Shuai Wang,Tingru Wang,Weichen Wang,Xin Wang,Yanhui Wang,Yue Wang,Yuping Wang,Yuxuan Wang,Ziyu Wang,Guoqiang Wei,Wanru Wei,Di Wu,Guohong Wu,Hanjie Wu,Jian Wu,Jie Wu,Ruolan Wu,Xinglong Wu,Yonghui Wu,Ruiqi Xia,Liang Xiang,Fei Xiao,XueFeng Xiao,Pan Xie,Shuangyi Xie,Shuang Xu,Jinlan Xue,Bangbang Yang,Ceyuan Yang,Jiaqi Yang,Runkai Yang,Tao Yang,Yang Yang,Yihang Yang,ZhiXian Yang,Ziyan Yang,Yifan Yao,Zilyu Ye,Bowen Yu,Chujie Yuan,Linxiao Yuan,Sichun Zeng,Weihong Zeng,Xuejiao Zeng,Yan Zeng,Chuntao Zhang,Heng Zhang,Jingjie Zhang,Kuo Zhang,Liang Zhang,Liying Zhang,Manlin Zhang,Ting Zhang,Weida Zhang,Xiaohe Zhang,Xinyan Zhang,Yan Zhang,Yuan Zhang,Zixiang Zhang,Fengxuan Zhao,Huating Zhao,Yang Zhao,Hao Zheng,Jianbin Zheng,Xiaozheng Zheng,Yangyang Zheng,Yijie Zheng,Jiexin Zhou,Kuan Zhu,Shenhan Zhu,Wenjia Zhu,Benhui Zou,Feilong Zuo

Main category: cs.CV

TL;DR: Seedance 1.5 pro 是一个基于双分支Diffusion Transformer架构的联合音视频生成基础模型，具备高同步性、高质量生成和多语言唇音同步能力，并通过SFT与RLHF优化及推理加速框架提升实用性。

Details

Motivation: 为了实现高质量、高同步性的原生音视频联合生成，满足专业内容创作对多语言支持、叙事连贯性和实时性的需求。 Method: 采用双分支Diffusion Transformer架构，结合跨模态联合模块与多阶段数据流水线，并使用高质量数据集进行监督微调（SFT）和基于人类反馈的强化学习（RLHF），同时引入推理加速框架以提升生成速度。 Result: 实现了卓越的音视频同步效果、精准的多语言/方言唇音同步、动态电影级镜头控制和更强的叙事连贯性，推理速度提升超过10倍。 Conclusion: Seedance 1.5 pro 是一个高效、高质量的联合音视频生成模型，已在火山引擎上线，适用于专业级内容创作场景。 Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.

[248] TARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLMs for Video Understanding

Piyush Bagad,Andrew Zisserman

Main category: cs.CV

TL;DR: 提出了一种简单高效的方法TARA，用于将多模态大模型适配为时间感知的视频-文本嵌入模型，无需使用任何视频数据，在新提出的具有挑战性的时间对称动作基准上表现优异，并在否定理解、动词和副词理解等任务中展现出额外优势。

Details

Motivation: 构建一个通用的时间感知视频-文本嵌入模型用于检索任务，现有方法缺乏对时间动态性的有效建模，且通常依赖大量视频数据进行训练。 Method: 提出TARA（Time Aware Retrieval Adaptation）方法，通过设计特定的文本提示和对比学习策略，利用多模态大模型（MLLMs）的已有知识，将其适应到时间感知的视频-文本嵌入空间，整个过程无需使用视频数据。 Result: 在新提出的具有时间对称动作（chiral actions）的硬负样本基准上显著优于现有视频-文本模型；在标准基准上也取得强结果；在NegBench上显示否定感知能力；在动词和副词理解任务上达到SOTA性能。 Conclusion: TARA是一种强大且通用的时间感知视频-文本嵌入方法，无需视频数据即可实现最先进的零样本检索性能，并具备理解时间动态、否定以及细粒度动作描述的能力。 Abstract: Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware as shown in NegBench benchmark that evaluates negation in video retrieval, (ii) TARA achieves state of the art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state of the art zero-shot performance.

[249] Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains

Marianne Rakic,Siyu Gai,Etienne Chollet,John V. Guttag,Adrian V. Dalca

Main category: cs.CV

TL;DR: Pancakes是一种新框架，能针对未见过的医学图像自动生成多种语义一致的多标签分割图，支持多种分割协议，优于现有基础模型。

Details

Motivation: 现有的医学图像分割模型通常只支持单一协议或依赖繁琐的手动提示，无法灵活应对同一图像多种合理分割的需求。 Method: 提出Pancakes框架，通过新的问题建模，在无需人工干预的情况下，为新图像自动生成多个语义一致的多协议分割结果。 Result: 在七个保留数据集上的实验表明，Pancakes在生成多个语义连贯的全图像分割方面显著优于现有的基础模型。 Conclusion: Pancakes实现了现有基础模型无法达到的多协议自动分割能力，推动了医学图像分割的灵活性和实用性。 Abstract: A single biomedical image can be meaningfully segmented in multiple ways, depending on the desired application. For instance, a brain MRI can be segmented according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology, etc. Existing automatic segmentation models typically either (1) support only a single protocol, the one they were trained on, or (2) require labor-intensive manual prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. Pancakes introduces a new problem formulation that is not currently attainable by existing foundation models. In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations, that are semantically coherent across images.

[250] 3D Human-Human Interaction Anomaly Detection

Shun Maeda,Chunzhi Gu,Koichiro Kamide,Katsuya Hotta,Shangce Gao,Chao Zhang

Main category: cs.CV

TL;DR: 本文提出了一个新的任务——人-人交互异常检测（H2IAD），并设计了IADNet模型，通过时序注意力共享模块（TASM）和基于距离的关系编码模块（DREM）有效捕捉协作行为中的时空动态，显著优于现有单人异常检测方法。

Details

Motivation: 现有异常检测方法主要关注单个人的行为，难以准确识别源于人与人交互的异常行为，缺乏对交互中复杂且不对称动态的建模能力。 Method: 提出IADNet模型，包含时序注意力共享模块（TASM）以同步两人之间的运动相关性，并引入距离关系编码模块（DREM）捕捉空间社交线索，最后使用归一化流进行异常评分。 Result: 在多人交互动作数据集上的实验表明，IADNet在H2IAD任务上显著优于现有的人类中心异常检测基线方法。 Conclusion: 通过显式建模人-人交互的时空动态，H2IAD任务是可行且必要的，IADNet为协作场景下的异常行为检测提供了有效解决方案。 Abstract: Human-centric anomaly detection (AD) has been primarily studied to specify anomalous behaviors in a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we then propose Interaction Anomaly Detection Network (IADNet), which is formalized with a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people such that collaborative motion correlations can be effectively synchronized. Moreover, we notice that in addition to temporal dynamics, human interactions are also characterized by spatial configurations between two people. We thus introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. The normalizing flow is eventually employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing Human-centric AD baselines in H2IAD.

[251] MMhops-R1: Multimodal Multi-hop Reasoning

Tao Zhang,Ziqi Zhang,Zongyang Ma,Yuxin Chen,Bing Li,Chunfeng Yuan,Guangting Wang,Fengyun Rao,Ying Shan,Weiming Hu

Main category: cs.CV

TL;DR: 本文提出了MMhops，一个用于评估和促进多模态多跳推理的大规模基准，并设计了MMhops-R1模型，基于强化学习的多模态检索增强生成框架，能够动态规划推理路径并整合多层级信息，在复杂推理任务中表现优异且具备良好泛化能力。

Details

Motivation: 现有大模型主要局限于单步推理，缺乏评估和推动多模态多跳推理能力的复杂基准，因此需要构建更具挑战性的任务来驱动该领域发展。 Method: 提出MMhops基准，包含Bridging和Comparison两种任务形式，要求模型结合外部知识构建复杂的多步推理链；同时提出MMhops-R1框架，采用强化学习优化推理路径的自主规划、查询生成与多级信息融合。 Result: 实验表明MMhops-R1显著优于强基线模型，在MMhops上表现出色，并在固定跳数任务中展现出良好的泛化能力。 Conclusion: 本研究贡献了一个具有挑战性的新基准和一个高效的基线模型，验证了动态规划与多模态知识整合对复杂推理的重要性，相关代码、数据和权重将开源以推动后续研究。 Abstract: The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.

[252] Lighting in Motion: Spatiotemporal HDR Lighting Estimation

Christophe Bolduc,Julien Philip,Li Ma,Mingming He,Paul Debevec,Jean-François Lalonde

Main category: cs.CV

TL;DR: 提出了一种基于扩散模型的时空光照估计方法LiMo，通过生成多曝光镜面与漫反射球体并结合可微渲染生成HDRI贴图，在室内外场景中实现了高精度的光照细节与照度估计。

Details

Motivation: 现有的光照估计方法难以同时捕捉高频细节和准确照度，且空间条件控制不足，尤其在复杂几何环境下表现受限。 Method: 利用扩散先验，在包含时空光探针的大规模室内外场景数据集上微调扩散模型；引入新的几何条件以增强空间定位，并生成不同曝光下的镜面与漫反射球体，最后通过可微渲染融合为单一HDRI图。 Result: 在空间控制和预测准确性方面均达到最先进水平，验证了深度信息不足以实现精确条件控制，新几何条件显著提升性能。 Conclusion: LiMo通过改进条件输入和多分支预测，结合可微渲染，有效提升了时空光照估计的质量与精度，适用于复杂真实场景。 Abstract: We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.

[253] DA-SSL: self-supervised domain adaptor to leverage foundational models in turbt histopathology slides

Haoyue Zhang,Meera Chappidi,Erolcan Sayar,Helen Richards,Zhijun Chen,Lucas Liu,Roxanne Wadia,Peter A Humphrey,Fady Ghali,Alberto Contreras-Sanz,Peter Black,Jonathan Wright,Stephanie Harmon,Michael Haffner

Main category: cs.CV

TL;DR: 提出一种轻量级的域自适应自监督适配器（DA-SSL），用于增强病理基础模型在膀胱肿瘤组织碎片和电灼伪影等挑战性样本中的表现，无需微调基础模型。

Details

Motivation: 现有病理基础模型在少见癌种或含伪影的标本上表现受限，如经尿道膀胱肿瘤切除术（TURBT）样本因组织碎片化和电灼伪影导致域偏移问题，且未被广泛用于预训练。 Method: 设计一种域自适应的自监督学习适配器（DA-SSL），在不微调基础模型的前提下重新对齐其特征至TURBT域，并结合多实例学习框架用于治疗反应预测。 Result: 在多中心研究中，DA-SSL五折交叉验证AUC达0.77±0.04，外部测试准确率0.84，敏感性0.71，特异性0.91（基于多数投票）。 Conclusion: 轻量级自监督域适应方法可有效提升病理基础模型在临床复杂任务中的适用性，尤其适用于训练数据稀缺的特殊标本类型。 Abstract: Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts - these cancer types were rarely used for pretraining or specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT), which are essential for diagnosing muscle-invasive bladder cancer (MIBC), but contain fragmented tissue chips and electrocautery artifacts and were not widely used in publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77+/-0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is Available at https://github.com/zhanghaoyue/DA_SSL_TURBT.

[254] LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Jianxiong Gao,Zhaoxi Chen,Xian Liu,Junhao Zhuang,Chengming Xu,Jianfeng Feng,Yu Qiao,Yanwei Fu,Chenyang Si,Ziwei Liu

Main category: cs.CV

TL;DR: 提出LongVie 2，一种通过三阶段训练实现可控、高质量、长时间一致视频生成的端到端自回归框架，并构建LongVGenBench基准进行评估。

Details

Motivation: 现有视频生成模型在长期生成中难以同时保证可控性、视觉质量和时间一致性，需构建更统一的视频世界模型。 Method: 采用三阶段训练：多模态引导增强可控性；输入帧的退化感知训练提升长期视觉质量；历史上下文引导确保片段间时间一致性。 Result: 在LongVGenBench上实现SOTA性能，支持长达五分钟的连续高质量视频生成，在可控性、时间连贯性和视觉保真度方面表现优异。 Conclusion: LongVie 2推动了基于预训练视频生成系统的统一视频世界建模，是通向通用时空智能的重要进展。 Abstract: Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach-first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.

[255] DBT-DINO: Towards Foundation model based analysis of Digital Breast Tomosynthesis

Felix J. Dorfner,Manon A. Dorster,Ryan Connolly,Oscar Gentilhomme,Edward Gibbs,Steven Graham,Seth Wander,Thomas Schultz,Manisha Bahl,Dania Daye,Albert E. Kim,Christopher P. Bridge

Main category: cs.CV

TL;DR: 本文提出了DBT-DINO，首个用于数字乳腺断层合成（DBT）的自监督基础模型，基于DINOv2方法在大规模DBT数据上预训练，并在乳腺密度分类和乳腺癌风险预测任务中表现优异，但在病灶检测任务上提升有限，表明领域特定预训练对不同临床任务的效果存在差异。

Details

Motivation: 尽管基础模型在医学影像中展现出潜力，但尚未有针对数字乳腺断层合成（DBT）的基础模型。本文旨在填补该空白，并评估领域特定预训练对多类临床任务的影响。 Method: 采用DINOv2方法，在来自27,990名患者的487,975个DBT体积的超过2500万张2D切片上进行自监督预训练，构建DBT-DINO模型，并在乳腺密度分类、5年乳腺癌风险预测和病灶检测三个下游任务上进行评估。 Result: 在乳腺密度分类中，DBT-DINO准确率为0.79，优于DINOv2和DenseNet-121；在5年风险预测中，AUROC为0.78，略高于DINOv2（p=0.57）；在病灶检测中，敏感性为0.62，低于DINOv2的0.67（p=0.60），但在恶性病灶检测率上略优（78.8% vs 77.3%）。 Conclusion: DBT-DINO是首个针对DBT的基础模型，在非定位任务如分类和风险预测中表现良好，说明领域特定预训练的有效性；但在病灶检测等局部定位任务中优势不显著，提示仍需进一步的方法学改进。 Abstract: Foundation models have shown promise in medical imaging but remain underexplored for three-dimensional imaging modalities. No foundation model currently exists for Digital Breast Tomosynthesis (DBT), despite its use for breast cancer screening. To develop and evaluate a foundation model for DBT (DBT-DINO) across multiple clinical tasks and assess the impact of domain-specific pre-training. Self-supervised pre-training was performed using the DINOv2 methodology on over 25 million 2D slices from 487,975 DBT volumes from 27,990 patients. Three downstream tasks were evaluated: (1) breast density classification using 5,000 screening exams; (2) 5-year risk of developing breast cancer using 106,417 screening exams; and (3) lesion detection using 393 annotated volumes. For breast density classification, DBT-DINO achieved an accuracy of 0.79 (95\% CI: 0.76--0.81), outperforming both the MetaAI DINOv2 baseline (0.73, 95\% CI: 0.70--0.76, p<.001) and DenseNet-121 (0.74, 95\% CI: 0.71--0.76, p<.001). For 5-year breast cancer risk prediction, DBT-DINO achieved an AUROC of 0.78 (95\% CI: 0.76--0.80) compared to DINOv2's 0.76 (95\% CI: 0.74--0.78, p=.57). For lesion detection, DINOv2 achieved a higher average sensitivity of 0.67 (95\% CI: 0.60--0.74) compared to DBT-DINO with 0.62 (95\% CI: 0.53--0.71, p=.60). DBT-DINO demonstrated better performance on cancerous lesions specifically with a detection rate of 78.8\% compared to Dinov2's 77.3\%. Using a dataset of unprecedented size, we developed DBT-DINO, the first foundation model for DBT. DBT-DINO demonstrated strong performance on breast density classification and cancer risk prediction. However, domain-specific pre-training showed variable benefits on the detection task, with ImageNet baseline outperforming DBT-DINO on general lesion detection, indicating that localized detection tasks require further methodological development.

[256] Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models

Shweta Mahajan,Shreya Kadambi,Hoang Le,Munawar Hayat,Fatih Porikli

Main category: cs.CV

TL;DR: 本文提出了Do-Undo任务和基准，旨在提升视觉语言模型对由现实世界动作驱动的物理场景变换的理解与生成能力。

Details

Motivation: 现有的视觉语言模型在理解物理世界的因果关系方面存在不足，特别是在处理可逆的物理动作时表现不佳。 Method: 构建了一个大规模的真实视频中可逆动作数据集，并设计了一种训练策略来强制一致性，以实现稳健的动作定位。 Result: 实验表明当前模型在物理可逆性方面表现较差，突显了该任务对于具身AI、机器人学和物理感知生成建模的重要性。 Conclusion: Do-Undo为评估和推进多模态系统中的物理推理提供了一个直观的测试平台。 Abstract: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.

[257] SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning

Junchao Zhu,Ruining Deng,Junlin Guo,Tianyuan Yao,Chongyu Qu,Juming Xiong,Siqi Lu,Zhengyi Lu,Yanfan Zhu,Marilyn Lionts,Yuechen Yang,Yalin Zheng,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo

Main category: cs.CV

TL;DR: 本文提出了一种名为SCR2-ST的新框架，利用单细胞测序的先验知识指导空间转录组学（ST）的数据采集与表达预测，结合强化学习主动采样和混合回归-检索网络，在有限测序预算下显著提升了采样效率和预测精度。

Details

Motivation: 空间转录组学数据获取成本高，传统固定网格采样策略导致信息冗余，数据稀疏限制了现有方法性能；而单细胞测序数据丰富但未被充分利用，因此需要融合两者优势以提升ST效率与准确性。 Method: 提出SCR2-ST框架，包含两个核心组件：1）基于单细胞引导的强化学习主动采样（SCRL），结合单细胞基础模型嵌入和空间密度信息构建生物学合理的奖励信号；2）混合回归-检索预测网络（SCR2Net），通过回归建模与检索增强推理相结合，并引入主要细胞类型过滤机制抑制噪声匹配，利用检索到的表达谱作为软标签进行辅助监督。 Result: 在三个公开ST数据集上评估显示，SCR2-ST在采样效率和预测准确率方面均达到SOTA水平，尤其在低预算场景下表现突出。 Conclusion: SCR2-ST有效整合单细胞先验知识与空间信息，实现了高效、精准的空间转录组数据采集与重建，为降低实验成本和提升数据分析质量提供了新途径。 Abstract: Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: https://github.com/hrlblab/SCR2ST

[258] MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

Haoyu Fu,Diankun Zhang,Zongchuang Zhao,Jianfeng Cui,Hongwei Xie,Bing Wang,Guang Chen,Dingkang Liang,Xiang Bai

Main category: cs.CV

TL;DR: 本文提出了MindDrive，一种用于自动驾驶中视觉-语言-动作（VLA）模型的在线强化学习框架，通过将连续动作空间中的探索转化为基于离散语言决策的试错学习，解决了传统模仿学习中的分布偏移和因果混淆问题，并在Bench2Drive基准上取得了优异性能。

Details

Motivation: 现有VLA方法依赖模仿学习，存在分布偏移和因果混淆问题，且在线强化学习因在连续动作空间中探索效率低而难以应用。 Method: 提出MindDrive框架，使用一个具有两组LoRA参数的大语言模型：一个作为决策专家进行场景推理，另一个作为动作专家将语言化决策映射为可行轨迹；通过将轨迹级奖励反馈至语言推理空间，实现基于离散语言动作的在线强化学习。 Result: 在Bench2Drive基准上达到78.04的驾驶分数（DS）和55.09%的成功率（SR），实现了高效的探索与人类样式的驾驶行为。 Conclusion: MindDrive是首个成功将在线强化学习应用于自动驾驶VLA模型的工作，有效平衡了复杂场景下的最优决策、类人驾驶行为与探索效率。 Abstract: Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. The one LLM serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.

[259] Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All

Michal Nazarczuk,Thomas Tanay,Arthur Moreau,Zhensong Zhang,Eduardo Pérez-Pellitero

Main category: cs.CV

TL;DR: 本文提出了一种用于新视角合成的高质量动画电影数据集，包含多种动态场景和多模态标注，适用于4D场景重建与视图生成模型的研究。

Details

Motivation: 现有数据集在视觉真实感、细节丰富度和标注完整性方面存在不足，难以满足前沿视图合成与3D视觉模型的需求。 Method: 从高精度动画电影中生成数据集，提供RGB图像及深度、法线、分割、光流等多模态信息，并设计了密集、稀疏和单目视频三种基准测试场景。 Result: 该数据集具有高保真度、丰富的几何与运动信息，支持多种实验设置，为新视角合成和3D视觉研究提供了理想资源。 Conclusion: 所提出的数据集因其视觉丰富性、高质量标注和多样化实验配置，有望成为推动视图合成与3D视觉领域发展的关键资源。 Abstract: This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.

[260] Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

Wenhan Chen,Sezer Karaoglu,Theo Gevers

Main category: cs.CV

TL;DR: 提出Grab-3D，一种基于3D几何时序一致性的AI生成视频检测框架，利用消失点揭示生成视频中几何不一致性。

Details

Motivation: 现有AI生成视频检测方法对3D几何模式的探索有限，缺乏对几何一致性的深入分析。 Method: 引入消失点作为3D几何模式的显式表示，构建静态场景的AI生成视频数据集，设计包含几何位置编码、时序-几何注意力机制和基于EMA的几何分类头的几何感知Transformer。 Result: Grab-3D在多个指标上显著优于现有最先进检测器，并展现出对未见生成器的强跨域泛化能力。 Conclusion: 通过显式建模3D几何时序一致性，Grab-3D有效提升了AI生成视频的检测性能与鲁棒性。 Abstract: Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.

[261] AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection

Junwen Miao,Penghui Du,Yi Liu,Yu Wang,Yan Wang

Main category: cs.CV

TL;DR: 提出AgentIAD，一种基于工具驱动的代理框架，用于工业异常检测，通过感知缩放和比较检索实现多阶段视觉检查，在MMAD数据集上达到97.62%的分类准确率。

Details

Motivation: 工业异常检测因正常样本稀缺及缺陷细微局部而困难，现有单次视觉语言模型常忽略小异常且缺乏与标准正常模式比较机制。 Method: 设计包含感知缩放器（PZ）和比较检索器（CR）的AgentIAD框架，利用MMAD数据集构建感知与比较轨迹，分两阶段训练：监督微调和强化学习，并采用感知奖励与行为奖励联合优化。 Result: 在MMAD数据集上实现97.62%的分类准确率，超越以往基于MLLM的方法，具备透明且可解释的检测过程。 Conclusion: AgentIAD通过多阶段、工具协作的检查机制显著提升工业异常检测性能，兼具高精度与可解释性。 Abstract: Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.

[262] JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

Xiaohu Huang,Hao Zhou,Qiangpeng Yang,Shilei Wen,Kai Han

Main category: cs.CV

TL;DR: JoVA是一个用于联合视频-音频生成的统一框架，通过跨模态自注意力和嘴部区域损失实现高质量唇音同步，无需额外对齐模块。

Details

Motivation: 现有方法在生成与唇动同步的人类语音方面存在不足，且依赖复杂的多模态融合结构，影响模型简洁性。 Method: 在每个Transformer层中使用跨视频和音频token的联合自注意力机制，并引入基于面部关键点检测的嘴部区域损失来增强训练监督。 Result: 在多个基准上实验表明，JoVA在唇同步精度、语音质量和整体生成保真度方面优于或媲美现有最先进方法。 Conclusion: JoVA是一种简洁高效的统一视频-音频生成框架，实现了高质量的多模态生成。 Abstract: In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.

[263] Feedforward 3D Editing via Text-Steerable Image-to-3D

Ziqi Ma,Hongqiao Chen,Yisong Yue,Georgia Gkioxari

Main category: cs.CV

TL;DR: Steer3D是一种前馈方法，通过文本指令实现对图像到3D生成模型的编辑，具备高效、高保真和强一致性的特点。

Details

Motivation: 为了使AI生成的3D资产能够用于实际应用，需要具备便捷的编辑能力，尤其是通过自然语言进行操控。 Method: 受ControlNet启发，将文本控制引入图像到3D生成过程；构建可扩展的自动数据生成引擎，并采用基于流匹配和直接偏好优化（DPO）的两阶段训练策略。 Result: 相比现有方法，Steer3D在遵循语言指令和保持原始3D资产一致性方面表现更优，速度提升2.4倍至28.5倍，仅用10万数据即可为预训练模型添加文本控制模态。 Conclusion: Steer3D证明了可以通过大规模数据有效为预训练的图像到3D模型引入新的文本模态，实现快速且可控的3D资产编辑。 Abstract: Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: https://glab-caltech.github.io/steer3d/

[264] LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

Tianye Ding,Yiming Xie,Yiqing Liang,Moitreya Chatterjee,Pedro Miraldo,Huaizu Jiang

Main category: cs.CV

TL;DR: 本文提出了一种名为LASER的无需训练的框架，可将离线重建模型转换为流式系统，通过跨时间窗口对齐预测来解决现有方法因内存复杂度高或需重新训练而难以部署的问题。

Details

Motivation: 现有的前馈重建模型虽然重建质量高，但因二次内存复杂度无法处理流式视频；而现有流式方法需要大量重训练且未能充分利用先进离线模型的几何先验。 Method: 提出LASER框架，利用层间尺度对齐方法，在连续时间窗口之间对齐预测：将深度预测分层，计算每层的尺度因子，并在相邻窗口和时间戳间传播这些因子，以解决单目尺度模糊导致的层深度比例不一致问题。 Result: 实验表明，LASER在相机姿态估计和点云重建质量上达到最先进水平，同时在RTX A6000 GPU上以14 FPS运行，峰值内存仅6 GB，适用于公里级流式视频的实际部署。 Conclusion: LASER是一种高效、无需训练的流式重建框架，通过层级尺度对齐有效克服了离线模型在流式场景中的应用障碍，兼具高性能与低资源消耗。 Abstract: Recent feed-forward reconstruction models like VGGT and $π^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction %quality with offline models while operating at 14 FPS with 6 GB peak memory on a RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: $\href{https://neu-vi.github.io/LASER/}{\texttt{https://neu-vi.github.io/LASER/}}$

[265] I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

Lu Ling,Yunhao Ge,Yichen Sheng,Aniket Bera

Main category: cs.CV

TL;DR: 本文提出一种通过重新编程预训练的3D实例生成器来实现场景级学习的方法，利用其可迁移的空间先验知识，实现对未见布局和新物体组合的良好泛化能力。

Details

Motivation: 现有基于学习的方法依赖有限的场景数据集，空间理解受限，难以泛化到新布局。 Method: 将预训练的3D实例生成器重新编程为场景级学习器，用模型为中心的空间监督替代数据集驱动的监督，并采用以视角为中心的场景空间建模方式。 Result: 方法在随机组成的训练场景下仍能涌现出空间推理能力，能够从纯几何线索中推断邻近、支撑和对称关系，且实现完全前馈的通用场景生成。 Conclusion: 3D实例生成器是隐式的空间学习与推理模型，具备成为交互式3D场景理解和生成基础模型的潜力。 Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/

[266] Recurrent Video Masked Autoencoders

Daniel Zoran,Nikhil Parthasarathy,Yi Yang,Drew A Hudson,Joao Carreira,Andrew Zisserman

Main category: cs.CV

TL;DR: 提出了一种基于Transformer的循环神经网络模型RVM，通过不对称掩码预测任务学习视频表征，在动作识别、目标跟踪等任务上表现优异，且在小模型下具有高参数效率和线性计算成本。

Details

Motivation: 现有视频表征学习方法在处理时空结构时存在计算复杂度高或需要大量参数的问题，缺乏高效且通用的模型。 Method: 设计了一个名为RVM的循环视频掩码自编码器，采用Transformer-based RNN聚合图像特征，并通过像素重建目标进行不对称掩码预测训练。 Result: RVM在视频级任务（如动作识别、点/对象跟踪）上与VideoMAE、V-JEPA等SOTA模型性能相当，在几何与密集空间理解任务上优于DINOv2等图像模型；小模型下无需知识蒸馏即达30倍参数效率提升；支持长时间序列的稳定特征传播，计算成本为线性。 Conclusion: RVM是一种高效、通用的视频表征学习框架，兼具优异性能与高参数效率，克服了传统时空注意力架构的部分局限性。 Abstract: We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist'' encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.

[267] Towards Scalable Pre-training of Visual Tokenizers for Generation

Jingfeng Yao,Yuda Song,Yucong Zhou,Xinggang Wang

Main category: cs.CV

TL;DR: 本文提出了一种新的视觉标记化器预训练框架VTP，通过联合优化图像-文本对比、自监督和重建损失，解决了传统基于重建的预训练方法在生成任务上扩展性差的问题，实现了更优的语义表示和生成性能的高效扩展。

Details

Motivation: 传统的基于像素重建的视觉编码器预训练方法导致潜在空间偏向低级信息，无法有效提升生成质量，存在“预训练扩展问题”。 Method: 提出VTP框架，联合优化图像-文本对比损失、自监督损失和重建损失，以增强潜在空间的高层语义表达能力。 Result: VTP在ImageNet上达到78.2的零样本准确率和0.36的rFID，并在生成任务中收敛速度比先进蒸馏方法快4.1倍；仅增加预训练计算量，即可在下游生成任务中实现FID提升65.8%，而传统方法在低计算量时即趋于饱和。 Conclusion: 为实现高质量生成，视觉标记化器的预训练应聚焦于高层语义理解，VTP通过多任务联合学习实现了优异的可扩展性和生成性能，标志着从重建范式向理解驱动范式的转变。 Abstract: The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the ``pre-training scaling problem`` and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves 65.8\% FID improvement in downstream generation, while conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.

[268] LitePT: Lighter Yet Stronger Point Transformer

Yuanwen Yue,Damien Robert,Jianyuan Wang,Sunghwan Hong,Jan Dirk Wegner,Christian Rupprecht,Konrad Schindler

Main category: cs.CV

TL;DR: 提出了一种新的3D点云处理主干网络LitePT，早期使用卷积提取几何特征，深层使用注意力捕捉语义信息，并引入无需训练的PointROPE位置编码，在参数、速度和内存上均优于现有方法。

Details

Motivation: 现有3D点云网络中卷积和注意力的组合方式不明确，需探索更高效的架构设计。 Method: 分析不同模块的作用，提出早期用卷积、深层用注意力的结构，并设计PointROPE保持空间布局信息。 Result: LitePT比Point Transformer V3减少3.6倍参数，速度快2倍，内存少2倍，且性能相当或更优。 Conclusion: 卷积适合高分辨率低层几何提取，注意力适合低分辨率高层语义建模，结合二者优势可构建更高效网络。 Abstract: Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.

[269] DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

Susung Hong,Chongjian Ge,Zhifei Zhang,Jui-Hsien Wang

Main category: cs.CV

TL;DR: 提出DiffusionBrowser，一种模型无关的轻量级解码器框架，可在视频扩散模型生成过程中任意时间点提供快速、一致的多模态预览，并支持交互式生成控制与黑盒过程解析。

Details

Motivation: 现有视频扩散模型生成过程不透明、速度慢且缺乏用户交互，难以实时掌握生成状态并进行干预。 Method: 设计一个模型无关的轻量级解码器框架DiffusionBrowser，在去噪过程中的任意时间步或Transformer模块处生成RGB和场景内参等多模态预览视频，实现实时预览；通过随机性重注入和模态引导实现对中间噪声步骤的交互式引导。 Result: 该方法可在不到1秒内生成4秒视频的预览，速度超过4倍实时，预览在外观和运动上与最终视频保持一致；训练后的解码器可用于探测扩散模型内部机制，揭示场景、物体等细节在去噪过程中的构建方式。 Conclusion: DiffusionBrowser不仅实现了对视频扩散模型生成过程的高效可视化和交互控制，还为理解扩散模型内部工作机制提供了新工具，推动了可解释性和可控性的发展。 Abstract: Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.

Table of Contents

cs.CL [Back]

[1] Enhancing Urban Visual Place Recognition for Crowdsourced Flood Imagery via LLM-Guided Attention

[2] Reinforcement Learning for Latent-Space Thinking in LLMs

[3] KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document

[4] Direct Confidence Alignment: Aligning Verbalized Confidence with Internal Confidence In Large Language Models

[5] Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

[6] Benchmarking Contextual Understanding for In-Car Conversational Systems

[7] VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

[8] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

[9] Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

[10] Diffusion Language Model Inference with Monte Carlo Tree Search

[11] Semantic Distance Measurement based on Multi-Kernel Gaussian Processes

[12] Adversarially Probing Cross-Family Sound Symbolism in 27 Languages

[13] Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

[14] F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation

[15] SCIR: A Self-Correcting Iterative Refinement Framework for Enhanced Information Extraction Based on Schema

[16] Can GPT replace human raters? Validity and reliability of machine-generated norms for metaphors

[17] Large language models have learned to use language

[18] The American Ghost in the Machine: How language models align culturally and the effects of cultural prompting

[19] NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

[20] HyperEdit: Unlocking Instruction-based Text Editing in LLMs via Hypernetworks

[21] Coupled Variational Reinforcement Learning for Language Model General Reasoning

[22] Human-Inspired Learning for Large Language Models via Obvious Record and Maximum-Entropy Method Discovery

[23] StruProKGR: A Structural and Probabilistic Framework for Sparse Knowledge Graph Reasoning

[24] Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives

[25] Which Pieces Does Unigram Tokenization Really Need?

[26] LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases

[27] Modeling Authorial Style in Urdu Novels Using Character Interaction Graphs and Graph Neural Networks

[28] Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

[29] CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning

[30] NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

[31] Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining

[32] Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions

[33] State over Tokens: Characterizing the Role of Reasoning Tokens

[34] Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA

[35] Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects

[36] What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation

[37] Counting Clues: A Lightweight Probabilistic Baseline Can Match an LLM

[38] Building from Scratch: A Multi-Agent Framework with Human-in-the-Loop for Multilingual Legal Terminology Mapping

[39] QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

[40] Authors Should Annotate

[41] An Open and Reproducible Deep Research Agent for Long-Form Question Answering

[42] LLM Rationalis? Measuring Bargaining Capabilities of AI Negotiators

[43] Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing

[44] Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

[45] AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

[46] AIR: Post-training Data Selection for Reasoning via Attention Head Influence

[47] Integrating Causal Reasoning into Automated Fact-Checking

[48] MiniLingua: A Small Open-Source LLM for European Languages

[49] FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

[50] Detecting Emotion Drift in Mental Health Text Using Pre-Trained Transformers

[51] Large language models are not about language

[52] Scaling Laws for Code: Every Programming Language Matters

[53] Non-Resolution Reasoning: A Framework for Preserving Semantic Ambiguity in Language Models

[54] Advancing Bangla Machine Translation Through Informal Datasets

[55] SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping

[56] PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation

[57] Verifying Rumors via Stance-Aware Structural Modeling

[58] Memory in the Age of AI Agents

[59] ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

[60] Textual Gradients are a Flawed Metaphor for Automatic Prompt Optimization

[61] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

[62] Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

[63] Large-Language Memorization During the Classification of United States Supreme Court Cases

[64] Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation

[65] A stylometric analysis of speaker attribution from speech transcripts

[66] Towards Effective Model Editing for LLM Personalization

[67] Beyond surface form: A pipeline for semantic analysis in Alzheimer's Disease detection from spontaneous speech

cs.CV [Back]

[68] Explainable Adversarial-Robust Vision-Language-Action Model for Robotic Manipulation

[69] Temporal-Anchor3DLane: Enhanced 3D Lane Detection with Multi-Task Losses and LSTM Fusion

[70] Automated Plant Disease and Pest Detection System Using Hybrid Lightweight CNN-MobileViT Models for Diagnosis of Indigenous Crops

[71] Pseudo-Label Refinement for Robust Wheat Head Segmentation via Two-Stage Hybrid Training

[72] Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors

[73] mmWEAVER: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description

[74] Hot Hém: Sài Gòn Giũa Cái Nóng Hông Còng Bàng -- Saigon in Unequal Heat

[75] Microscopic Vehicle Trajectory Datasets from UAV-collected Video for Heterogeneous, Area-Based Urban Traffic

[76] Read or Ignore? A Unified Benchmark for Typographic-Attack Robustness and Text Recognition in Vision-Language Models

[77] CLARGA: Multimodal Graph Representation Learning over Arbitrary Sets of Modalities