cs.CL

[1] Entropy-Based Measurement of Value Drift and Alignment Work in Large Language Models

Samih Fadli

Main category: cs.CL

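TL;DR: Proposes an operational framework, based on the "Second Law of Intelligence", that enables run-time monitoring of value drift by measuring the ethical-entropy dynamics of large language models.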

Details

Motivation: Existing static benchmarks cannot adequately assess the safety of large language models in dynamic settings such as distribution shift, jailbreak attacks, and alignment degradation in deployment.

Method: Defines a five-way behavioral taxonomy, trains a classifier to estimate ethical entropy S(t) from model transcripts, measures the entropy dynamics of base and instruction-tuned variants of four frontier models under stress tests, and from these estimates the effective alignment work rate gamma_eff.

Result: Base models show sustained entropy growth, whereas instruction-tuned variants suppress drift and reduce ethical entropy by roughly 80%; a monitoring pipeline that can raise alerts is built on the entropy trajectories.

Conclusion: The framework makes the notion of ethical entropy operational and offers a practical approach to run-time monitoring of value drift in deployed large language models.

Abstract: Large language model safety is usually assessed with static benchmarks, but key failures are dynamic: value drift under distribution shift, jailbreak attacks, and slow degradation of alignment in deployment. Building on a recent Second Law of Intelligence that treats ethical entropy as a state variable which tends to increase unless countered by alignment work, we make this framework operational for large language models. We define a five-way behavioral taxonomy, train a classifier to estimate ethical entropy S(t) from model transcripts, and measure entropy dynamics for base and instruction-tuned variants of four frontier models across stress tests. Base models show sustained entropy growth, while tuned variants suppress drift and reduce ethical entropy by roughly eighty percent. From these trajectories we estimate an effective alignment work rate gamma_eff and embed S(t) and gamma_eff in a monitoring pipeline that raises alerts when entropy drift exceeds a stability threshold, enabling run-time oversight of value drift.
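The monitoring idea in this entry can be illustrated with a minimal sketch. It assumes that S(t) is the Shannon entropy of the classifier's label distribution within a window of transcripts, and that "drift" is the entropy increase between consecutive windows; the abstract does not give either definition, and the label names and threshold below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical label names: the abstract only says the taxonomy is five-way.
LABELS = ("c0", "c1", "c2", "c3", "c4")


def shannon_entropy(labels, classes=LABELS):
    """Shannon entropy (in nats) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    entropy = 0.0
    for cls in classes:
        p = counts.get(cls, 0) / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


def drift_alerts(windows, threshold):
    """Return (entropies, alert_indices).

    An alert is raised at window t when S(t) - S(t-1) exceeds the threshold.
    This drift definition is an assumption made for the sketch.
    """
    entropies = [shannon_entropy(w) for w in windows]
    alerts = [t for t in range(1, len(entropies))
              if entropies[t] - entropies[t - 1] > threshold]
    return entropies, alerts
```

A real pipeline would presumably smooth the trajectory and calibrate the threshold on held-out runs; the sketch only shows where an alert would fire.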

[2] Watermarks for Embeddings-as-a-Service Large Language Models

Anudeex Shetty

Main category: cs.CL

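TL;DR: Studies defenses against imitation attacks on LLM-based Embeddings-as-a-Service (EaaS), shows that existing watermarks can be removed by paraphrasing the input text, and proposes WET, a new watermarking technique based on linear transformation that is robust and highly verifiable.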

Details

Motivation: Existing EaaS watermarks are easily bypassed by paraphrasing during black-box imitation attacks, so a more robust watermarking mechanism is needed to protect providers' intellectual property.

Method: First shows experimentally that existing watermarks can be removed under a range of paraphrasing strategies; then proposes WET, which embeds the watermark through a linear transformation of the embeddings and verifies it by applying the reverse transformation and comparing similarity.

Result: Existing watermarks are bypassed by paraphrasing attacks in most cases; WET shows near-perfect watermark verifiability under paraphrasing attacks, and ablation studies confirm the contribution of its key components.

Conclusion: Paraphrasing is a major threat to current EaaS watermarks; WET offers a more robust alternative through linear transformation and an effective way to protect ownership of EaaS models.

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in natural language understanding and generation. Based on these LLMs, businesses have started to provide Embeddings-as-a-Service (EaaS), offering feature extraction capabilities (in the form of text embeddings) that benefit downstream natural language processing tasks. However, prior research has demonstrated that EaaS is vulnerable to imitation attacks, where an attacker clones the service's model in a black-box manner without access to the model's internal workings. In response, watermarks have been added to the text embeddings to protect the intellectual property of EaaS providers by allowing them to check for model ownership. This thesis focuses on defending against imitation attacks by investigating EaaS watermarks. To achieve this goal, we unveil novel attacks and propose and validate new watermarking techniques. Firstly, we show that existing EaaS watermarks can be removed through paraphrasing the input text when attackers clone the model during imitation attacks. Our study illustrates that paraphrasing can effectively bypass current state-of-the-art EaaS watermarks across various attack setups (including different paraphrasing techniques and models) and datasets in most instances. This demonstrates a new vulnerability in recent EaaS watermarking techniques. Subsequently, as a countermeasure, we propose a novel watermarking technique, WET (Watermarking EaaS with Linear Transformation), which employs linear transformation of the embeddings. Watermark verification is conducted by applying a reverse transformation and comparing the similarity between recovered and original embeddings. We demonstrate its robustness against paraphrasing attacks with near-perfect verifiability. We conduct detailed ablation studies to assess the significance of each component and hyperparameter in WET.
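The verify-by-inversion idea can be sketched as follows. The abstract does not say how WET constructs its transformation matrix, so this sketch swaps in a Householder reflection as the secret linear map, chosen only because it is orthogonal and its own inverse; the function names are hypothetical.

```python
import math


def householder_apply(v, x):
    """Apply H = I - 2 v v^T / (v^T v) to x. H is orthogonal and H @ H = I."""
    scale = 2.0 * sum(a * b for a, b in zip(v, x)) / sum(a * a for a in v)
    return [xi - scale * vi for xi, vi in zip(x, v)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(secret_v, original, suspect):
    """Undo the secret transform on the suspect embedding, then compare.

    A similarity near 1 suggests the suspect embedding carries the watermark.
    """
    recovered = householder_apply(secret_v, suspect)
    return cosine(recovered, original)


secret = [1.0, 0.0, 0.0, 0.0]            # provider's secret vector (assumption)
embedding = [1.0, 2.0, 3.0, 4.0]         # original, unwatermarked embedding
served = householder_apply(secret, embedding)  # what the service returns
```

Here `verify(secret, embedding, served)` recovers the original exactly, while an unwatermarked embedding scores lower. In an actual imitation-attack setting the suspect embeddings would come from a cloned model, so the match would be approximate.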

[3] Alleviating Choice Supportive Bias in LLM with Reasoning Dependency Generation

Nan Zhuang,Wenshuo Wang,Lekai Qian,Yuxiao Wang,Boyu Cao,Qi Liu

Main category: cs.CL

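TL;DR: Proposes Reasoning Dependency Generation (RDG), a framework that mitigates choice-supportive bias (CSB) in large language models by generating unbiased reasoning data for fine-tuning; it markedly improves model objectivity across experiments while preserving performance on a standard bias benchmark.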

Details

Motivation: Large language models exhibit choice-supportive bias in evaluation tasks, favoring the options they have chosen, which undermines the objectivity of AI-assisted decision making; existing debiasing methods mainly target social and demographic biases and offer little for cognitive biases.

Method: RDG automatically generates balanced reasoning QA pairs that explicitly model or decouple the dependencies among choices, evidence, and justifications, and builds a large-scale multi-domain fine-tuning dataset containing Contextual Dependency Data and Dependency Decouple Data.

Result: Models fine-tuned on RDG data improve by 81.5% in the memory-based experiments and by 94.3% in the evaluation-based experiment, while maintaining their performance on the standard BBQ benchmark.

Conclusion: RDG is presented as the first solution targeting choice-supportive bias in large language models; it opens a path to mitigating cognitive biases and supports more reliable AI-assisted decision systems.

Abstract: Recent studies have demonstrated that some Large Language Models exhibit choice-supportive bias (CSB) when performing evaluations, systematically favoring their chosen options and potentially compromising the objectivity of AI-assisted decision making. While existing debiasing approaches primarily target demographic and social biases, methods for addressing cognitive biases in LLMs remain largely unexplored. In this work, we present the first solution to address CSB through Reasoning Dependency Generation (RDG), a novel framework for generating unbiased reasoning data to mitigate choice-supportive bias through fine-tuning. RDG automatically constructs balanced reasoning QA pairs, explicitly (un)modeling the dependencies between choices, evidence, and justifications. Our approach is able to generate a large-scale dataset of QA pairs across domains, incorporating Contextual Dependency Data and Dependency Decouple Data. Experiments show that LLMs fine-tuned on RDG-generated data demonstrate an 81.5% improvement in memory-based experiments and a 94.3% improvement in the evaluation-based experiment, while maintaining similar performance on standard BBQ benchmarks. This work pioneers an approach for addressing cognitive biases in LLMs and contributes to the development of more reliable AI-assisted decision support systems.

[4] Enhancing Job Matching: Occupation, Skill and Qualification Linking with the ESCO and EQF taxonomies

Stylianos Saroglou,Konstantinos Diamantaras,Francesco Preta,Marina Delianidi,Apostolos Benisis,Christian Johannes Meyer

Main category: cs.CL

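TL;DR: Investigates how language models can improve the classification of labor market information by linking job vacancy texts to two major European frameworks, ESCO and EQF; compares Sentence Linking and Entity Linking, releases an open-source tool and two annotated datasets, and explores the use of generative large language models, advancing job entity extraction.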

Details

Motivation: To extract occupation and qualification information from job vacancy texts more accurately and align it with the European standard frameworks (ESCO and EQF), in support of labor market analysis and skill matching.

Method: Compares Sentence Linking and Entity Linking, develops an open-source tool that integrates both, builds two annotated datasets for evaluating how occupations and qualifications are represented, and explores different ways of applying generative large language models to the task.

Result: Presents effective methods for classifying occupations and qualifications, releases an open-source tool and high-quality annotated datasets for further research, and confirms the potential of large language models for the task.

Conclusion: The study advances automatic classification of labor market information and provides reusable computational infrastructure for understanding work, skills, and employment narratives in a digital economy.

Abstract: This study investigates the potential of language models to improve the classification of labor market information by linking job vacancy texts to two major European frameworks: the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and the European Qualifications Framework (EQF). We examine and compare two prominent methodologies from the literature: Sentence Linking and Entity Linking. In support of ongoing research, we release an open-source tool, incorporating these two methodologies, designed to facilitate further work on labor classification and employment discourse. To move beyond surface-level skill extraction, we introduce two annotated datasets specifically aimed at evaluating how occupations and qualifications are represented within job vacancy texts. Additionally, we examine different ways to utilize generative large language models for this task. Our findings contribute to advancing the state of the art in job entity extraction and offer computational infrastructure for examining work, skills, and labor market narratives in a digitally mediated economy. Our code is made publicly available: https://github.com/tabiya-tech/tabiya-livelihoods-classifier

[5] InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation

Faezeh Faez,Marzieh S. Tahaei,Yaochen Hu,Ali Pourranjbar,Mahdi Biparva,Mark Coates,Yingxue Zhang

Main category: cs.CL

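TL;DR: InvertiTune is a framework that combines controlled data generation with supervised fine-tuning: it extracts subgraphs from knowledge bases and generates matching text, improving both the quality and the efficiency of text-to-knowledge-graph construction.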

Details

Motivation: Existing text-to-knowledge-graph (Text2KG) methods rely on repeated LLM prompting, which is computationally expensive and tends to miss complex relations distributed throughout the text.

Method: The InvertiTune framework first systematically extracts subgraphs from large knowledge bases, applies noise filtering, and uses large language models to generate the corresponding natural text descriptions; it then uses this data for supervised fine-tuning of lightweight models that build a knowledge graph in a single inference pass.

Result: On CE12k, a dataset generated with the proposed pipeline, InvertiTune outperforms larger non-fine-tuned LLMs and state-of-the-art Text2KG methods, and it generalizes better across datasets on the CrossEval-1200 test set.

Conclusion: Realistic, high-quality training data is essential for building efficient, high-performing Text2KG systems.

Abstract: Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in automatic knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using the introduced pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while also demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.

[6] Identifying attributions of causality in political text

Paulina Garcia-Corral

Main category: cs.CL

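TL;DR: Proposes a framework built on a lightweight causal language model for detecting and parsing causal explanations in political text; it enables large-scale analysis of causal claims at low annotation cost, with good generalizability and accuracy.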

Details

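Motivation: Although explanations are central to how people understand politics, their systematic analysis in political science remains underdeveloped and fragmented, and general-purpose methodological tools are lacking.

Method: Trains a lightweight causal language model that automatically extracts structured causal claims (cause-effect pairs) from political text, producing a data set for downstream analysis.

Result: The method identifies causal explanations effectively in large text collections and, with modest annotation requirements, shows good accuracy and generalizability relative to human coding.

Conclusion: The framework provides a scalable, efficient, and reliable tool for analyzing causal explanations in political text and advances the systematic study of explanations in political science.

Abstract: Explanations are a fundamental element of how people make sense of the political world. Citizens routinely ask and answer questions about why events happen, who is responsible, and what could or should be done differently. Yet despite their importance, explanations remain an underdeveloped object of systematic analysis in political science, and existing approaches are fragmented and often issue-specific. I introduce a framework for detecting and parsing explanations in political text. To do this, I train a lightweight causal language model that returns a structured data set of causal claims in the form of cause-effect pairs for downstream analysis. I demonstrate how causal explanations can be studied at scale, and show the method's modest annotation requirements, generalizability, and accuracy relative to human coding.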

[7] Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs

Kunj Joshi,David A. Smith

Main category: cs.CL

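TL;DR: Proposes Randomized Masked Fine-Tuning (RMFT), a privacy-preserving fine-tuning technique that reduces LLM memorization of personally identifying information (PII) in training data at a small performance cost, and shows on the Enron Email Dataset that it offers a better privacy-utility tradeoff.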

Details

Motivation: Large language models tend to memorize personally identifying information (PII) from their training data, which creates serious security and privacy risks and calls for effective privacy-preserving methods.

Method: Proposes Randomized Masked Fine-Tuning (RMFT), which introduces a randomized masking mechanism during fine-tuning to reduce PII memorization, and evaluates the privacy-utility tradeoff with the MaxTER framework and the AURC metric.

Result: On the Enron Email Dataset, RMFT reduces the Total Extraction Rate by 80.81% and the Seen Extraction Rate by 80.17% relative to baseline fine-tuning, at the cost of a 5.73% increase in perplexity, and outperforms deduplication.

Conclusion: RMFT is an efficient privacy-preserving fine-tuning technique that substantially reduces PII memorization with little impact on model performance, offering a practical option for privacy protection in large models.

Abstract: Memorization in Natural Language Models, especially Large Language Models (LLMs), poses severe security and privacy risks, as models tend to memorize personally identifying information (PIIs) from training data. We introduce Randomized Masked Fine-Tuning (RMFT), a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact. Using the Enron Email Dataset, we demonstrate that RMFT achieves an 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate compared to baseline fine-tuning, outperforming deduplication methods while maintaining only a 5.73% increase in perplexity. We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.

[8] Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní

Nemika Tyagi,Nelvin Licona Guevara,Olga Kellert

Main category: cs.CL

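TL;DR: Presents an LLM-based automatic annotation pipeline for analyzing the sociolinguistic and topical features of Spanish-English and Spanish-Guaraní code-switched discourse, enabling reliable extraction of sociolinguistic patterns from large-scale data.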

Details

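Motivation: Traditional sociolinguistic analysis depends on manual annotation, which is slow and hard to scale to large corpora. The study aims to use large language models to automate the annotation of topic, genre, and pragmatic function in bilingual discourse, advancing computational methods for low-resource and cross-linguistic bilingual research.

Method: Uses large language models to automatically label the topic, genre, and discourse-pragmatic functions of 3,691 code-switched sentences, integrates demographic metadata from the Miami Bilingual Corpus, and adds new topic annotations to the Spanish-Guaraní dataset.

Result: The results reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence.

Conclusion: Large language models can reliably recover interpretable sociolinguistic patterns that traditionally required manual annotation, demonstrating their potential for cross-linguistic and low-resource bilingual research.

Abstract: This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.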

[9] PERCS: Persona-Guided Controllable Biomedical Summarization Dataset

Rohan Charudatt Salvi,Chirag Chawla,Dhruv Jain,Swapnil Panigrahi,Md Shad Akhtar,Shweta Yadav

Main category: cs.CL

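TL;DR: Introduces PERCS, a dataset of biomedical abstracts paired with simplified summaries personalized for different user groups, supporting research on controllable text simplification and the accessibility of health information.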

Details

Motivation: Most existing medical text simplification resources target a generic audience and overlook differences in users' medical knowledge and information needs, which limits how well information is conveyed.

Method: Introduces the PERCS dataset, which contains summaries of biomedical abstracts tailored to four personas (Laypersons, Premedical Students, Non-medical Researchers, and Medical Experts) and reviewed by physicians for factual accuracy and persona alignment. Technical validation uses readability, vocabulary, and content-depth measures, and four large language models are benchmarked.

Result: PERCS shows clear differences across personas in readability, vocabulary, and content depth; automatic evaluation of the LLMs provides baselines for future research and shows that models differ in comprehensiveness, readability, and faithfulness.

Conclusion: Personalized, controllable summarization is an effective route to more accessible medical information, and PERCS provides a high-quality resource and evaluation benchmark for biomedical text simplification aimed at users with different knowledge backgrounds.

Abstract: Automatic medical text simplification plays a key role in improving health literacy by making complex biomedical research accessible to diverse readers. However, most existing resources assume a single generic audience, overlooking the wide variation in medical literacy and information needs across user groups. To address this limitation, we introduce PERCS (Persona-guided Controllable Summarization), a dataset of biomedical abstracts paired with summaries tailored to four personas: Laypersons, Premedical Students, Non-medical Researchers, and Medical Experts. These personas represent different levels of medical literacy and information needs, emphasizing the need for targeted, audience-specific summarization. Each summary in PERCS was reviewed by physicians for factual accuracy and persona alignment using a detailed error taxonomy. Technical validation shows clear differences in readability, vocabulary, and content depth across personas. Along with describing the dataset, we benchmark four large language models on PERCS using automatic evaluation metrics that assess comprehensiveness, readability, and faithfulness, establishing baseline results for future research. The dataset, annotation guidelines, and evaluation materials are publicly available to support research on persona-specific communication and controllable biomedical summarization.

[10] Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning

Darshan Fofadiya

Main category: cs.CL

TL;DR: Proposes the Idea-Gated Transformer, which separates semantic planning from syntactic generation to mitigate topic drift during language model generation.

Details

Motivation: Autoregressive language models rely only on next-token prediction, which leads to topic drift; the goal is to strengthen global semantic consistency during generation.

Method: Adds an auxiliary "Idea Head" that predicts the bag-of-words distribution of a future context window, forming a "Concept Vector", and uses a differentiable gating mechanism to modulate the generation probabilities of tokens in the main vocabulary, suppressing semantically irrelevant ones.

Result: On WikiText-103 the model reaches validation perplexity comparable to a GPT-2 baseline but is significantly better at Domain Retention; qualitative and quantitative analysis shows that generation locks onto specific semantic clusters and resists associative drift.

Conclusion: The Idea-Gated Transformer offers a new, parameter-efficient architectural route to better topical consistency and controllability in language models.

Abstract: Autoregressive Language Models (LLMs) trained on Next-Token Prediction (NTP) often suffer from "Topic Drift" where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning (Holtzman et al., 2019). While scaling model size mitigates this (Brown et al., 2020), the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary "Idea Head" trained to predict the bag-of-words distribution for a future context window, creating a latent "Concept Vector" that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.
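One way to picture a differentiable vocabulary gate is the sketch below. It assumes the gate for each token is a sigmoid of the Idea Head's output and that gating is applied by adding the log of the gate to the token logit; the abstract does not specify the functional form, so both choices and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_logits(token_logits, idea_logits, eps=1e-6):
    """Soft mask: add log(sigmoid(idea)) to each next-token logit.

    Tokens the idea head considers unlikely in the upcoming window get a large
    negative offset; tokens it favors are left almost unchanged. The operation
    is smooth, so gradients can flow into both heads.
    """
    gated = []
    for token_logit, idea_logit in zip(token_logits, idea_logits):
        gate = 1.0 / (1.0 + math.exp(-idea_logit))
        gated.append(token_logit + math.log(gate + eps))
    return gated


# Three tokens with equal next-token logits; the idea head disfavors token 1.
probs = softmax(gate_logits([1.0, 1.0, 1.0], [5.0, -5.0, 5.0]))
```

With these inputs the middle token's probability is pushed far below the other two, which remain equal to each other.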

[11] From Hypothesis to Premises: LLM-based Backward Logical Reasoning with Selective Symbolic Translation

Qingchuan Li,Mingyue Cheng,Zirui Liu,Daoyu Wang,Yuting Zeng,Tongxuan Liu

Main category: cs.CL

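TL;DR: Proposes Hypothesis-driven Backward Logical Reasoning (HBLR), a framework that combines confidence-aware symbolic translation with backward reasoning to improve the accuracy and efficiency of logical reasoning over natural language.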

Details

Motivation: Most current large models use forward reasoning, which tends to produce redundant paths, hallucinated steps, and semantic drift, making reasoning inefficient and unreliable; a more reliable and efficient approach to logical reasoning is needed.

Method: In the translation phase of HBLR, only high-confidence text is converted into first-order logic while uncertain content stays in natural language, and a translation reflection module safeguards semantic fidelity. In the reasoning phase, HBLR reasons backward from the assumption that the conclusion is true, and a reasoning reflection module corrects flawed inference steps.

Result: Experiments on five reasoning benchmarks show that HBLR outperforms strong baselines in both accuracy and reasoning efficiency.

Conclusion: By combining symbolic translation with neural reflection and reasoning backward from the hypothesis, HBLR achieves more efficient and reliable logical reasoning and suggests a new direction for complex reasoning in language models.

Abstract: Logical reasoning is a core challenge in natural language understanding and a fundamental capability of artificial intelligence, underpinning scientific discovery, mathematical theorem proving, and complex decision-making. Despite the remarkable progress of large language models (LLMs), most current approaches still rely on forward reasoning paradigms, generating step-by-step rationales from premises to conclusions. However, such methods often suffer from redundant inference paths, hallucinated steps, and semantic drift, resulting in inefficient and unreliable reasoning. In this paper, we propose a novel framework, Hypothesis-driven Backward Logical Reasoning (HBLR). The core idea is to integrate confidence-aware symbolic translation with hypothesis-driven backward reasoning. In the translation phase, only high-confidence spans are converted into logical form, such as First-Order Logic (FOL), while uncertain content remains in natural language. A translation reflection module further ensures semantic fidelity by evaluating symbolic outputs and reverting lossy ones back to text when necessary. In the reasoning phase, HBLR simulates human deductive thinking by assuming the conclusion is true and recursively verifying its premises. A reasoning reflection module further identifies and corrects flawed inference steps, enhancing logical coherence. Extensive experiments on five reasoning benchmarks demonstrate that HBLR consistently outperforms strong baselines in both accuracy and efficiency.
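The "start from the conclusion and recursively verify its premises" step resembles classical backward chaining. The sketch below shows that classical procedure over propositional Horn rules only; it is not the HBLR implementation, which additionally involves LLM-based translation and reflection modules, and the rule format is an assumption.

```python
def prove(goal, facts, rules, seen=frozenset()):
    """Backward chaining over rules of the form (premises, conclusion).

    Starts from the goal and recursively tries to establish the premises of
    any rule that concludes it. `seen` holds goals already on the current
    proof path, which prevents infinite recursion on cyclic rules.
    """
    if goal in facts:
        return True
    if goal in seen:
        return False
    for premises, conclusion in rules:
        if conclusion != goal:
            continue
        if all(prove(p, facts, rules, seen | {goal}) for p in premises):
            return True
    return False


facts = {"rains"}
rules = [
    (("rains",), "wet"),        # rains -> wet
    (("wet",), "slippery"),     # wet -> slippery
]
```

Only rules relevant to the hypothesis are ever explored, which is the intuition behind the efficiency argument for backward over forward reasoning.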

[12] Nexus: Higher-Order Attention Mechanisms in Transformers

Hanting Chen,Chu Zhong,Kai Han,Yuchuan Tian,Yuchen Liang,Tianyu Guo,Xinghao Chen,Dacheng Tao,Yunhe Wang

Main category: cs.CL

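TL;DR: Proposes the Higher-Order Attention Network (Hon), which strengthens model expressivity through a recursive self-attention mechanism and overcomes the low-rank bottleneck of standard attention while adding only a constant number of extra parameters.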

Details

Motivation: Standard first-order attention is constrained by a low-rank bottleneck and struggles to capture complex multi-hop relationships within a single layer, which limits representational power.

Method: The Hon architecture uses nested self-attention loops to refine the Query and Key vectors dynamically, so that they aggregate global context and model high-order correlations before the final attention computation; a weight-sharing strategy keeps the design parameter-efficient.

Result: Theoretical analysis shows that Hon breaks the linear bottleneck of standard attention, and experiments on multiple benchmarks show that it outperforms standard Transformers.

Conclusion: Through recursive higher-order attention, Hon improves representational power and models complex dependencies better while remaining parameter-efficient.

Abstract: Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the Higher-Order Attention Network (Hon), a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Hon dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations prior to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Hon outperforms standard Transformers on multiple benchmarks.

[13] Characterizing Language Use in a Collaborative Situated Game

Nicholas Tomlin,Naitian Zhou,Eve Fleisig,Liangyuan Chen,Téa Wright,Lauren Vinh,Laura X. Ma,Seun Eisape,Ellie French,Tingting Du,Tianjiao Zhang,Alexander Koller,Alane Suhr

Main category: cs.CL

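TL;DR: Introduces the Portal Dialogue Corpus: 11.5 hours of spoken dialogue from the co-op mode of Portal 2, comprising 24.5K utterances. It surfaces rarely observed linguistic phenomena such as complex spatial reference, clarification and repair, and ad-hoc convention formation, and the corpus is publicly released with video, audio, transcripts, and game state data.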

Details

Motivation: Studying human language use in complex, situated, collaborative problem-solving requires a new corpus that reflects the characteristics of real cooperative communication.

Method: Collects 11.5 hours of spoken dialogue between players in the co-op mode of Portal 2 to build the Portal Dialogue Corpus, comprising 24.5K utterances, and analyzes player language and behavior.

Result: Identifies linguistic phenomena that rarely appear in existing chitchat or task-oriented dialogue corpora, such as complex spatial reference, clarification and repair, and ad-hoc convention formation.

Conclusion: By releasing the Portal Dialogue Corpus with its rich multimodal data, the work provides an important resource for future analysis of language use in collaborative problem-solving in complex environments.

Abstract: Cooperative video games, where multiple participants must coordinate by communicating and reasoning under uncertainty in complex environments, yield a rich source of language data. We collect the Portal Dialogue Corpus: a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game, comprising 24.5K total utterances. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation. To support future analyses of language use in complex, situated, collaborative problem-solving scenarios, we publicly release the corpus, which comprises player videos, audio, transcripts, game state data, and both manual and automatic annotations of language data.

[14] Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates

Yixing Xu,Chao Li,Xuanwu Yin,Spandan Tiwari,Dong Li,Ashish Sirasao,Emad Barsoum

Main category: cs.CL

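TL;DR: Proposes Dual LoRA, which improves low-rank adaptation (LoRA) by introducing an inductive bias and outperforms LoRA and its state-of-the-art variants on a range of NLP tasks with the same number of trainable parameters.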

Details

Motivation: LoRA's low-rank assumption leads to unsatisfactory performance; the paper aims to improve PEFT methods by better simulating the parameter updates of full fine-tuning.

Method: Splits the low-rank matrices into a magnitude group, which controls how far a parameter is updated, and a direction group, which controls the direction of the update; a ReLU function is added to the former and a sign function to the latter to simulate parameter updates more precisely.

Result: Experiments with GPT-2, RoBERTa, DeBERTa, and the LLaMA family on multiple NLP tasks, including NLG, NLU, and commonsense reasoning, show that Dual LoRA consistently outperforms LoRA and its variants.

Conclusion: By decomposing the low-rank matrices and adding simple but effective nonlinear operations, Dual LoRA markedly improves parameter-efficient fine-tuning and is an effective, general improvement over LoRA.

Abstract: Low-rank adaptation (LoRA) is one of the most popular methods among parameter-efficient fine-tuning (PEFT) methods to adapt pre-trained large language models (LLMs) to specific downstream tasks. However, the model trained based on LoRA often has an unsatisfactory performance due to its low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve the performance by incorporating an inductive bias into the original LoRA. Specifically, we separate low-rank matrices into two groups: the magnitude group to control whether or not and how far we should update a parameter and the direction group to decide whether this parameter should move forward or backward, to better simulate the parameter updating process of the full fine-tuning based on gradient-based optimization algorithms. We show that this can be simply achieved by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct several experiments over a wide range of NLP tasks, including natural language generation (NLG), understanding (NLU), and commonsense reasoning datasets on GPT-2, RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that we consistently outperform LoRA and its state-of-the-art variants with the same number of trainable parameters.
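A rough sketch of the magnitude/direction split follows. It assumes the weight update is the elementwise product ReLU(B_m A_m) * sign(B_d A_d), with one low-rank pair per group; the abstract names the ReLU and sign functions but not how the two groups are combined, so the combination rule, the matrix names, and the shapes are all assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x r) and b (r x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(x):
    return x if x > 0 else 0.0


def sign(x):
    return (x > 0) - (x < 0)


def dual_lora_update(mag_b, mag_a, dir_b, dir_a):
    """Assumed update: delta_W = ReLU(mag_b @ mag_a) * sign(dir_b @ dir_a).

    The magnitude group decides whether and how far each weight moves
    (ReLU gives zero or a positive step); the direction group decides
    whether the step is forward (+1) or backward (-1).
    """
    magnitude = matmul(mag_b, mag_a)
    direction = matmul(dir_b, dir_a)
    return [[relu(m) * sign(d) for m, d in zip(m_row, d_row)]
            for m_row, d_row in zip(magnitude, direction)]


# Rank-1 example for a 2 x 2 weight matrix.
delta = dual_lora_update(
    mag_b=[[1.0], [-1.0]], mag_a=[[2.0, 3.0]],
    dir_b=[[1.0], [1.0]], dir_a=[[-1.0, 1.0]],
)
```

The second row of `delta` is zero because ReLU suppresses the negative magnitudes, so those weights are not updated at all. Since sign has zero gradient almost everywhere, training through it needs some gradient treatment that the abstract does not describe.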

[15] PretrainZero: Reinforcement Active Pretraining

Xingrun Xing,Zhiyuan Fan,Jie Lou,Guoqi Li,Jiajun Zhang,Debing Zhang

Main category: cs.CL

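TL;DR: Proposes PretrainZero, a reinforcement active learning framework built on the pretraining corpus. Through active learning and self-supervision, and without verifiable labels or reward models, it pretrains reasoning models directly on a Wikipedia corpus with reinforcement learning and substantially improves the general reasoning ability of large models.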

Details

Motivation: Existing RL-based large models depend on verifiable reward signals in specific domains, which makes them hard to extend to general reasoning tasks and limits generalization. A framework is needed that learns actively from general corpora without external verification signals.

Method: PretrainZero combines active pretraining with self-supervised reinforcement learning: the model actively identifies reasonable and informative spans in the pretraining corpus and learns by RL to predict the masked content. Masked-span prediction serves as the self-supervised signal, and base models from 3B to 30B parameters are trained directly on unlabeled general text.

Result: In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 on MMLU-Pro, SuperGPQA, and the math average benchmarks, respectively; the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.

Conclusion: PretrainZero removes the traditional dependence on verification data, enables reinforcement pretraining on general corpora, improves the general reasoning ability of large models, and shows potential for transfer to downstream tasks.

Abstract: Mimicking human behavior to actively learn from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, placing a significant bottleneck to extend the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from pretraining corpus, and reason to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3 to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96 and 10.60 on MMLU-Pro, SuperGPQA and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.

[16] A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention

Di Xiu,Hongyin Tang,Bolin Rong,Lizhi Yan,Jingang Wang,Yifan Lu,Xunliang Cai

Main category: cs.CL

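TL;DR: Studies the effectiveness and theoretical mechanisms of Top-$k$ attention in the decoding and training phases of large language models. It shows that keeping only the pivotal context can match or even surpass full attention, examines how the precision of approximate algorithms affects downstream tasks, and offers a theoretical explanation from the perspective of entropy.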

Details

Motivation: Large language models are widely used for long-context modeling, but inference cost has become a bottleneck for agent and multimodal applications, which motivates more efficient attention mechanisms.

Method: Validates exact Top-$k$ decoding experimentally, explores a native Top-$k$ attention training strategy, analyzes how the precision of approximate Top-$k$ algorithms affects downstream tasks, and interprets the findings in terms of entropy.

Result: Keeping only the Keys most similar to the Query achieves performance comparable to or better than full attention at decoding time; using Top-$k$ attention consistently in training and inference significantly improves performance; downstream performance correlates positively with approximation fidelity; and models show reduced entropy after Top-$k$ attention SFT.

Conclusion: Top-$k$ attention lowers computational cost while preserving performance; training-inference consistency and high-fidelity approximation help unlock its potential, and the observed entropy reduction supports its effectiveness theoretically.

Abstract: Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling; however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-$k$ Attention operations facilitates the further unlocking of Top-$k$ Decoding's potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-$k$ Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.
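Exact Top-$k$ decoding for a single query can be sketched in a few lines. The sketch assumes scaled dot-product similarity and a softmax restricted to the selected Keys; the function name and the single-head, single-query setting are simplifications made here, not details from the report.

```python
import math


def topk_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score.

    Returns (output_vector, sorted_indices_of_kept_keys).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]

    # Exact top-k selection: indices of the k largest scores.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax restricted to the kept keys (numerically stabilized).
    peak = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - peak) for i in keep}
    total = sum(weights.values())

    dim = len(values[0])
    output = [sum(weights[i] / total * values[i][d] for i in keep)
              for d in range(dim)]
    return output, sorted(keep)


query = [1.0, 0.0]
keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out, kept = topk_attention(query, keys, values, k=1)
```

With `k=1` only the best-matching Key survives, so the output equals its Value; with `k=len(keys)` the function reduces to ordinary full attention. The full sort costs O(n log n) per query, which is the kind of overhead that motivates the approximate indexers the report evaluates.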

[17] Understanding LLM Reasoning for Abstractive Summarization

Haohan Yuan,Siu Cheung Hui,Haopeng Zhang

Main category: cs.CL

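TL;DR: Examines the reasoning capabilities of large language models (LLMs) in abstractive summarization and finds that reasoning is not universally helpful: its effect depends on the specific strategy and context.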

Details

Motivation: Although large language models excel at analytical tasks such as mathematics and code generation, the usefulness of their reasoning for abstractive summarization has not been adequately verified and calls for systematic study.

Method: Adapts general reasoning strategies to summarization and systematically compares 8 reasoning strategies and 3 Large Reasoning Models (LRMs) on 8 datasets, assessing summary quality and factual faithfulness.

Result: Explicit reasoning strategies improve fluency but reduce factual faithfulness, while implicit reasoning in LRMs shows the opposite pattern; increasing a model's internal reasoning budget does not improve factual consistency and can even hurt it.

Conclusion: Effective summarization requires faithful compression of information rather than overly creative reasoning, so applying reasoning to summarization means carefully trading off quality against faithfulness.

Abstract: While the reasoning capabilities of Large Language Models (LLMs) excel in analytical tasks such as mathematics and code generation, their utility for abstractive summarization remains widely assumed but largely unverified. To bridge this gap, we first tailor general reasoning strategies to the summarization domain. We then conduct a systematic, large scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, assessing both summary quality and faithfulness. Our findings show that reasoning is not a universal solution and its effectiveness is highly dependent on the specific strategy and context. Specifically, we observe a trade-off between summary quality and factual faithfulness: explicit reasoning strategies tend to improve fluency at the expense of factual grounding, while implicit reasoning in LRMs exhibits the inverse pattern. Furthermore, increasing an LRM's internal reasoning budget does not improve, and can even hurt, factual consistency, suggesting that effective summarization demands faithful compression rather than creative over-thinking.

[18] Fine-grained Narrative Classification in Biased News Articles

Zeba Afroz,Harsh Vardhan,Pawan Bhakuni,Aanchal Punia,Rajdeep Kumar,Md. Shad Akhtar

Main category: cs.CL

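TL;DR: This paper proposes fine-grained narrative classification for analyzing propaganda in Indian news media and builds INDI-PROP, the first ideologically grounded dataset with multi-level annotation, comprising 1,266 articles on two socio-political events: the CAA and the Farmers' protest.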

Details

Motivation: Existing work lacks fine-grained modeling of the complex narrative structures in news propaganda, especially an integrated analysis of multi-level bias and persuasive techniques. Method: Introduces the INDI-PROP dataset with three annotation levels (ideological article bias, event-specific narrative frames, and persuasive techniques) and designs FANTA and TPTC, two GPT-4o-mini-based multi-hop prompt reasoning frameworks, for the classification tasks. Result: Experiments show that FANTA and TPTC substantially outperform the baselines on article bias, narrative classification, and persuasive technique identification. Conclusion: Multi-level annotation combined with LLM-based reasoning frameworks reveals the ideological narrative structures of news text and their persuasive mechanisms more effectively. Abstract: Narratives are the cognitive and emotional scaffolds of propaganda. They organize isolated persuasive techniques into coherent stories that justify actions, attribute blame, and evoke identification with ideological camps. In this paper, we propose a novel fine-grained narrative classification in biased news articles. We also explore article-bias classification as the precursor task to narrative classification and fine-grained persuasive technique identification. We develop INDI-PROP, the first ideologically grounded fine-grained narrative dataset with multi-level annotation for analyzing propaganda in Indian news media. Our dataset INDI-PROP comprises 1,266 articles focusing on two polarizing socio-political events in recent times: CAA and the Farmers' protest. Each article is annotated at three hierarchical levels: (i) ideological article-bias (pro-government, pro-opposition, neutral), (ii) event-specific fine-grained narrative frames anchored in ideological polarity and communicative intent, and (iii) persuasive techniques. We propose FANTA and TPTC, two GPT-4o-mini guided multi-hop prompt-based reasoning frameworks for the bias, narrative, and persuasive technique classification. FANTA leverages multi-layered communicative phenomena by integrating information extraction and contextual framing for hierarchical reasoning. On the other hand, TPTC adopts systematic decomposition of persuasive cues via a two-stage approach. Our evaluation suggests substantial improvement over underlying baselines in each case.

[19] AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment

Ahmad Aghaebrahimian

Main category: cs.CL

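TL;DR: Proposes an interpretable, flexible framework for assessing the factual consistency of LLM-generated text; by decomposing text into atomic facts and introducing a weighted metric, it improves factuality assessment in both general and clinical domains.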

Details

Motivation: Existing evaluation metrics cannot adequately assess the factual consistency of LLM-generated content and lack interpretability; in high-stakes domains such as clinical applications, hallucination can have severe consequences. Method: Decomposes text into atomic facts, uses a flexible schema-free approach, introduces a weighted rather than absolute metric, and adds a mechanism to control assessment complexity in intricate domains. Result: Benchmarked on several general and clinical datasets; the results indicate that the framework identifies factual inconsistencies more accurately while remaining interpretable and adaptable. Conclusion: The proposed framework improves factuality assessment of LLM outputs, is particularly suited to high-stakes domains, and supports building more reliable fact-aware models in the future. Abstract: Large Language Models have significantly advanced natural language processing tasks, but remain prone to generating incorrect or misleading but plausible arguments. This issue, known as hallucination, is particularly concerning in high-stakes domains like clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, making diagnosing and mitigating errors difficult. We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts to address these limitations. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods with an absolute metric, we incorporate a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.

[20] Generative AI Practices, Literacy, and Divides: An Empirical Analysis in the Italian Context

Beatrice Savoldi,Giuseppe Attanasio,Olga Gorodetskaya,Marta Marchiori Manerba,Elisa Bassignana,Silvia Casola,Matteo Negri,Tommaso Caselli,Luisa Bentivogli,Alan Ramponi,Arianna Muti,Nicoletta Balbo,Debora Nozza

Main category: cs.CL

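TL;DR: Based on a survey of 1,906 Italian-speaking adults, this study provides the first comprehensive picture of generative AI (GenAI) adoption, usage patterns, and literacy in Italy. GenAI is widely used in work and personal life, including for sensitive tasks such as emotional support and medical advice, and is becoming a primary information source. However, users' digital literacy is generally low, making errors and misinformation hard to spot, and there is a significant gender gap: among older groups in particular, women's adoption rate is only half that of men. Literacy is a key predictor of adoption but does not fully explain the gap, indicating other barriers. The study calls for targeted education and deeper investigation of the underlying causes of unequal participation.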

Details

Motivation: Generative AI chatbots are changing digital interaction, but uneven adoption and limited awareness of their limitations may widen digital divides. Empirical research is therefore needed on actual usage, literacy levels, and social impact, particularly in under-studied countries such as Italy. Method: Using newly collected survey data from 1,906 Italian-speaking adults, the study applies descriptive statistics and regression analysis to map GenAI adoption rates, usage scenarios, frequency, and digital literacy in Italy, and examines the key factors behind adoption, especially gender and age differences. Result: GenAI is already widely adopted in Italy for varied work and personal purposes, including sensitive domains; it is replacing other technologies as a primary information source; overall digital literacy is low, so users struggle to identify misinformation; there is a significant gender gap, with older women adopting and using GenAI markedly less than men; and digital literacy predicts adoption but only partly explains the gender difference, suggesting barriers unrelated to competence. Conclusion: The spread of GenAI comes with low digital literacy and a marked risk of social inequality, especially a gender gap. Improving skills alone will not ensure equitable participation; targeted educational interventions must be combined with deeper study of structural barriers. Abstract: The rise of Artificial Intelligence (AI) language technologies, particularly generative AI (GenAI) chatbots accessible via conversational interfaces, is transforming digital interactions. While these tools hold societal promise, they also risk widening digital divides due to uneven adoption and low awareness of their limitations. This study presents the first comprehensive empirical mapping of GenAI adoption, usage patterns, and literacy in Italy, based on newly collected survey data from 1,906 Italian-speaking adults. Our findings reveal widespread adoption for both work and personal use, including sensitive tasks like emotional support and medical advice. Crucially, GenAI is supplanting other technologies to become a primary information source: this trend persists despite low user digital literacy, posing a risk as users struggle to recognize errors or misinformation. Moreover, we identify a significant gender divide, particularly pronounced in older generations, where women are half as likely to adopt GenAI and use it less frequently than men. While we find literacy to be a key predictor of adoption, it only partially explains this disparity, suggesting that other barriers are at play. Overall, our data provide granular insights into the multipurpose usage of GenAI, highlighting the dual need for targeted educational initiatives and further investigation into the underlying barriers to equitable participation that competence alone cannot explain.

[21] Evaluating Hydro-Science and Engineering Knowledge of Large Language Models

Shiruo Hu,Wenbo Shan,Yingjia Li,Zhiqi Wan,Xinpeng Yu,Yunjia Qi,Haotian Xia,Yang Xiao,Dingxiao Liu,Jiaru Wang,Chenxu Gong,Ruixi Zhang,Shuyue Wu,Shibo Cui,Chee Hui Lai,Wei Luo,Yubin He,Bin Xu,Jianshi Zhao

Main category: cs.CL

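TL;DR: This paper proposes Hydro-SE Bench, an evaluation benchmark for large language models (LLMs) in Hydro-Science and Engineering (Hydro-SE). It contains 4,000 multiple-choice questions across nine subfields and assesses basic conceptual knowledge, engineering application, and reasoning and calculation ability. Commercial LLMs reach accuracies of 0.74 to 0.80 and small-parameter models 0.41 to 0.68; models do well on tasks related to the physical sciences but fall short in specialized areas such as industry standards and hydraulic structures.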

Details

Motivation: Hydro-Science and Engineering is a multi-objective, interdisciplinary field that relies on collaborative expert decision-making, which is challenging for intelligent systems. Despite the rapid progress of LLMs, their knowledge and application abilities in this field have not been adequately evaluated, so a dedicated benchmark is needed. Method: Builds Hydro-SE Bench, a benchmark of 4,000 multiple-choice questions covering nine Hydro-SE subfields, and systematically evaluates several LLMs along three dimensions: basic knowledge, engineering application, and reasoning and calculation. Result: Commercial LLMs score 0.74 to 0.80 in accuracy and small-parameter LLMs 0.41 to 0.68; LLMs perform well in subfields related to the natural sciences but poorly on specialized content such as industry standards and hydraulic structures; scaling mainly strengthens reasoning and calculation, while practical engineering application still needs improvement. Conclusion: Hydro-SE Bench is an effective tool for evaluating LLMs in the Hydro-SE domain; it exposes current strengths and weaknesses, points model developers toward training targets, and gives Hydro-SE researchers practical guidance for applying LLMs. Abstract: Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligent systems. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields and enables evaluation of LLMs in aspects of basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The evaluation results on Hydro-SE Bench show that accuracy ranges from 0.74 to 0.80 for commercial LLMs and from 0.41 to 0.68 for small-parameter LLMs. While LLMs perform well in subfields closely related to natural and physical sciences, they struggle with domain-specific knowledge such as industry standards and hydraulic structures. Model scaling mainly improves reasoning and calculation abilities, but there is still great potential for LLMs to better handle problems in practical engineering application. This study highlights the strengths and weaknesses of LLMs for Hydro-SE tasks, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.

[22] Different types of syntactic agreement recruit the same units within large language models

Daria Kryvosheieva,Andrea de Varda,Evelina Fedorenko,Greta Tuckute

Main category: cs.CL

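TL;DR: Using a functional localization approach, this study finds that different syntactic phenomena in large language models, especially syntactic agreement, activate overlapping units, indicating that agreement forms a meaningful functional category within the models and that this pattern holds across languages.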

Details

Motivation: To investigate how large language models represent grammatical knowledge, and in particular whether different syntactic phenomena rely on shared or distinct model components. Method: Inspired by cognitive neuroscience, uses functional localization to identify the units most responsive to 67 English syntactic phenomena in seven open-weight models and performs causal analyses; extends the analysis to Russian, Chinese, and a cross-lingual study of 57 languages. Result: Different types of syntactic agreement (e.g., subject-verb, anaphor, determiner-noun) activate overlapping model units, indicating that agreement is a functional category within the models; the pattern is consistent across English, Russian, and Chinese, and structurally more similar languages share more units for subject-verb agreement. Conclusion: Syntactic agreement, a key marker of syntactic dependencies, forms a meaningful category in the representational space of LLMs, revealing how grammatical knowledge is organized inside the models. Abstract: Large language models (LLMs) can reliably distinguish grammatical from ungrammatical sentences, but how grammatical knowledge is represented within the models remains an open question. We investigate whether different syntactic phenomena recruit shared or distinct components in LLMs. Using a functional localization approach inspired by cognitive neuroscience, we identify the LLM units most responsive to 67 English syntactic phenomena in seven open-weight models. These units are consistently recruited across sentences containing the phenomena and causally support the models' syntactic performance. Critically, different types of syntactic agreement (e.g., subject-verb, anaphor, determiner-noun) recruit overlapping sets of units, suggesting that agreement constitutes a meaningful functional category for LLMs. This pattern holds in English, Russian, and Chinese; and further, in a cross-lingual analysis of 57 diverse languages, structurally more similar languages share more units for subject-verb agreement. Taken together, these findings reveal that syntactic agreement, a critical marker of syntactic dependencies, constitutes a meaningful category within LLMs' representational spaces.
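
One simple way to quantify how strongly two phenomena "recruit overlapping sets of units" is to select the most responsive units for each phenomenon and compare the two sets. The sketch below is only an illustration under stated assumptions: the top-fraction selection rule, the Jaccard index as the overlap measure, the unit ids, and the response scores are all hypothetical and are not the paper's localization procedure.

```python
def top_units(responses, fraction=0.25):
    """Return the ids of the most responsive units (top `fraction`)."""
    count = max(1, int(len(responses) * fraction))
    ranked = sorted(responses, key=responses.get, reverse=True)
    return set(ranked[:count])

def jaccard(a, b):
    """Overlap of two unit sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical per-unit response scores for two agreement phenomena.
subject_verb = {"u0": 0.9, "u1": 0.8, "u2": 0.1, "u3": 0.0,
                "u4": 0.2, "u5": 0.1, "u6": 0.0, "u7": 0.3}
anaphor = {"u0": 0.7, "u1": 0.1, "u2": 0.0, "u3": 0.6,
           "u4": 0.2, "u5": 0.1, "u6": 0.0, "u7": 0.3}

overlap = jaccard(top_units(subject_verb), top_units(anaphor))
```

Here each phenomenon selects two of eight units and they share one, so the overlap is one third; comparing such values across pairs of phenomena is what would separate a shared "agreement" category from unrelated phenomena.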

[23] AITutor-EvalKit: Exploring the Capabilities of AI Tutors

Numaan Naeem,Kaushal Kumar Maurya,Kseniia Petukhova,Ekaterina Kochmar

Main category: cs.CL

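TL;DR: AITutor-EvalKit is an application that uses language technology to evaluate the pedagogical quality of AI tutors. It offers demonstration, evaluation, model inspection, and data visualization features, targets education stakeholders and the *ACL community, supports learning, and can be used to collect user feedback and annotations.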

Details

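Motivation: To evaluate how well AI tutors actually teach, and to give researchers and educators interpretability and feedback mechanisms. Method: Builds an evaluation tool with language technology that integrates software demonstration, model inspection, and data visualization modules. Result: The resulting AITutor-EvalKit application can evaluate the pedagogical quality of AI tutors and supports user feedback and data annotation. Conclusion: AITutor-EvalKit is a practical tool for evaluating AI education systems that improves transparency and room for improvement, and suits both educational and research settings. Abstract: We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors and provides software for demonstration and evaluation, as well as model inspection and data visualization. This tool is aimed at education stakeholders as well as the *ACL community at large, as it supports learning and can also be used to collect user feedback and annotations.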

[24] DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue

Yijun Liao

Main category: cs.CL

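TL;DR: This paper proposes DZ-TDPO, a non-destructive alignment framework that addresses state inertia in long-context dialogue systems. It combines conflict-aware dynamic KL constraints with a learnable temporal attention bias, achieves state-of-the-art win rates on the MSC dataset, and reveals a capacity-stability trade-off.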

Details

Motivation: Long-context dialogue systems suffer from state inertia: static constraints prevent the model from updating historical context as user intent changes, which harms dialogue coherence and accuracy. Method: Proposes the DZ-TDPO framework, which introduces a conflict-aware dynamic KL-divergence constraint and a learnable temporal attention bias to adjust historical state non-destructively, improving alignment without harming general capabilities. Result: On the MSC dataset, DZ-TDPO reaches an 86.2% win rate with Phi-3.5 and a 99.4% win rate with Qwen2.5-7B with almost no increase in perplexity; this confirms a "capacity-stability trade-off" in which small models pay an alignment tax while large models align efficiently. Conclusion: Precise attention regulation, rather than destructive weight updates, can alleviate TAI (post-training alignment inertia) while preserving zero-shot generalization and general performance (e.g., MMLU) across model scales. Abstract: Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. To address this, we propose DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a learnable temporal attention bias. Experiments on the Multi-Session Chat (MSC) dataset demonstrate that DZ-TDPO achieves state-of-the-art win rates (86.2% on Phi-3.5) while maintaining robust zero-shot generalization. Crucially, our scaling analysis reveals a "Capacity-Stability Trade-off": while smaller models incur an "alignment tax" (perplexity surge) to overcome historical inertia, the larger Qwen2.5-7B model achieves near-perfect alignment (99.4% win rate) with negligible perplexity overhead. This confirms that TAI can be alleviated via precise attention regulation rather than destructive weight updates, preserving general capabilities (MMLU) across model scales. Code and data are available: https://github.com/lyj20071013/DZ-TDPO

[25] AR-Med: Automated Relevance Enhancement in Medical Search via LLM-Driven Information Augmentation

Chuyue Wang,Jie Feng,Yuxi Wu,Hang Zhang,Zhiguo Fan,Bing Cheng,Wei Lin

Main category: cs.CL

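TL;DR: This paper proposes AR-Med, a framework for automated relevance assessment in online medical search. Using retrieval augmentation and knowledge distillation together with the purpose-built, multi-expert annotated benchmark LocalQSMed, it achieves large-scale deployment while maintaining accuracy and reliability, and significantly improves search relevance and user satisfaction.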

Details

Motivation: Traditional methods struggle to understand complex and nuanced medical search queries, while applying LLMs directly brings factual hallucinations, gaps in specialized knowledge, and high cost, so a reliable and scalable solution is needed to improve search quality on online healthcare platforms. Method: Proposes the AR-Med framework, which uses retrieval augmentation to ground LLM reasoning in verified medical knowledge, designs a practical knowledge distillation scheme that compresses large teacher models into efficient student models for online serving, and builds the multi-expert annotated LocalQSMed benchmark for model iteration and offline-online performance alignment. Result: AR-Med exceeds 93% accuracy in offline tests, an absolute improvement of 24 percentage points over the original system, and also significantly improves search relevance and user satisfaction online. Conclusion: AR-Med offers a practical blueprint for trustworthy, scalable LLM-driven systems in real-world medical applications and addresses key challenges of deploying large models in a high-stakes domain. Abstract: Accurate and reliable search on online healthcare platforms is critical for user safety and service efficacy. Traditional methods, however, often fail to comprehend complex and nuanced user queries, limiting their effectiveness. Large language models (LLMs) present a promising solution, offering powerful semantic understanding to bridge this gap. Despite their potential, deploying LLMs in this high-stakes domain is fraught with challenges, including factual hallucinations, specialized knowledge gaps, and high operational costs. To overcome these barriers, we introduce AR-Med, a novel framework for Automated Relevance assessment for Medical search that has been successfully deployed at scale on the Online Medical Delivery Platforms. AR-Med grounds LLM reasoning in verified medical knowledge through a retrieval-augmented approach, ensuring high accuracy and reliability. To enable efficient online service, we design a practical knowledge distillation scheme that compresses large teacher models into compact yet powerful student models. We also introduce LocalQSMed, a multi-expert annotated benchmark developed to guide model iteration and ensure strong alignment between offline and online performance. Extensive experiments show AR-Med achieves an offline accuracy of over 93%, a 24% absolute improvement over the original online system, and delivers significant gains in online relevance and user satisfaction. Our work presents a practical and scalable blueprint for developing trustworthy, LLM-powered systems in real-world healthcare applications.

[26] Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

Jingyang Ou,Jiaqi Han,Minkai Xu,Shaoxuan Xu,Jianwen Xie,Stefano Ermon,Yi Wu,Chongxuan Li

Main category: cs.CL

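TL;DR: This paper proposes ESPO, a sequence-level reinforcement learning framework for diffusion large language models (dLLMs). By using the ELBO as a proxy for sequence likelihood, it resolves the mismatch between traditional token-level RL methods and dLLMs, and it significantly outperforms baselines on mathematical reasoning, code generation, and planning tasks.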

Details

Motivation: Diffusion large language models (dLLMs) lack the token-level conditional probabilities of autoregressive models, so traditional token-level RL methods cannot be applied directly; an RL framework suited to the dLLM generation process is needed. Method: Proposes ELBO-based Sequence-level Policy Optimization (ESPO), which treats the generation of an entire sequence as a single action, uses the ELBO as a tractable proxy for sequence-level likelihood, and adds per-token normalization of importance ratios and robust KL-divergence estimation to keep training stable. Result: On mathematical reasoning, code generation, and planning tasks, ESPO significantly outperforms token-level baselines, gaining 20-40 points on the Countdown task and showing consistent gains across several benchmarks. Conclusion: ESPO provides a principled and empirically effective sequence-level optimization paradigm for RL in dLLMs, supporting the advantage of sequence-level policy optimization for non-autoregressive language generation. Abstract: Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
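
To make "a single action" with "per-token normalization of importance ratios" concrete, the sketch below shows one plausible reading: the ELBO stands in for the sequence log-likelihood, and the log-ratio between the new and old policy is divided by the sequence length before exponentiating. The PPO-style clipping, the `eps` value, and both function names are assumptions added for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

def sequence_ratio(elbo_new, elbo_old, num_tokens):
    """Importance ratio for a whole sequence, normalized per token.

    The ELBO values act as proxies for log-likelihoods (assumption:
    both are sums over the same `num_tokens` positions).
    """
    return math.exp((elbo_new - elbo_old) / num_tokens)

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence (assumed form)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# A 100-token sequence whose ELBO improved by 10 nats under the new policy.
ratio = sequence_ratio(elbo_new=-90.0, elbo_old=-100.0, num_tokens=100)
objective = clipped_objective(ratio, advantage=1.0)
```

Without the division by `num_tokens`, the same 10-nat gap would give a ratio of `exp(10)`, far outside any clipping range, which is one intuition for why length normalization helps keep training stable.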

[27] In-Context Representation Hijacking

Itay Yona,Amir Sarid,Michael Karasik,Yossi Gandelsman

Main category: cs.CL

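TL;DR: Proposes Doublespeak, an optimization-free in-context representation hijacking attack that replaces harmful keywords with benign words in the context so that the model's internal semantics are overwritten, thereby bypassing safety alignment.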

Details

Motivation: To expose a new attack surface in the latent space of large language models and show where current safety alignment strategies fall short, especially their fragility at the representation level. Method: Systematically replaces harmful keywords with benign tokens across multiple in-context examples and uses interpretability tools to analyze how the semantic overwrite evolves layer by layer. Result: The attack is effective against models such as Llama-3.3-70B-Instruct, reaching a 74% attack success rate with a single-sentence context; it requires no optimization and transfers across models. Conclusion: Current safety alignment methods are insufficient against representation-level semantic hijacking, and defenses should be strengthened in the model's representation space. Abstract: We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.

[28] Enhancing Instruction-Following Capabilities in Seq2Seq Models: DoLA Adaptations for T5

Huey Sun,Anabel Yong,Lorenzo Gilly,Felipe Jin

Main category: cs.CL

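TL;DR: This paper applies the contrastive decoding strategy DoLa to encoder-decoder architectures (T5 and FLAN-T5) for the first time and studies its effect on instruction following. DoLa improves generation faithfulness on some tasks and hurts it on others, and a layer-by-layer analysis explains why.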

Details

Motivation: Existing contrastive decoding methods such as DoLa have only been implemented in decoder-only architectures and used to improve factuality; this work explores whether they apply to encoder-decoder models and how they affect instruction following. Method: Adapts DoLa to the T5 and FLAN-T5 model families, evaluates it across different tasks, and analyzes layer-by-layer logit evolution to understand how it affects output probabilities. Result: DoLa improves the faithfulness of generated text for some task categories but performs worse for others; the analysis shows that its effect is closely tied to logit changes across the model's internal layers. Conclusion: Contrastive decoding has potential in encoder-decoder architectures, but its effect varies by task, and a finer understanding of the mechanism is needed to use it well. Abstract: Contrastive decoding is a lightweight and effective inference-time method that improves the quality of text generation in Large Language Models. However, algorithms such as DoLa (Decoding by Contrastive Layers) have only been implemented in decoder-only architectures and studied for their impact on improving factuality. This work adapts DoLa for the T5 and FLAN-T5 model families and evaluates its impact on the models' instruction following capabilities, which to our knowledge is the first implementation of a contrastive decoding strategy in an encoder-decoder architecture. Our results show that DoLa improves the faithfulness of text generation for certain categories of tasks and harms others. To understand these results, we present a layer-by-layer analysis of logit evolution in a FLAN-T5 model to quantify DoLa's impact on token output probabilities.
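
The core idea behind layer-contrastive decoding can be sketched in a few lines: score each token by how much its log-probability grows between an early ("premature") layer and the final layer, and only consider tokens the final layer already finds plausible. The sketch assumes a single fixed early layer and a plausibility threshold `alpha`; how the early layer is chosen, and how this paper adapts the procedure to T5, are not covered by the abstract, so the function names and toy logits are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - lse for x in logits]

def contrast_layers(final_logits, early_logits, alpha=0.1):
    """Score tokens by final-layer minus early-layer log-probability.

    Tokens whose final-layer probability is below `alpha` times the top
    probability are masked out (plausibility constraint).
    """
    final_lp = log_softmax(final_logits)
    early_lp = log_softmax(early_logits)
    cutoff = max(final_lp) + math.log(alpha)
    return [f - e if f >= cutoff else float("-inf")
            for f, e in zip(final_lp, early_lp)]

# Token 1 gains the most probability between the early and final layer,
# so the contrast prefers it even though token 0 has the top final logit.
scores = contrast_layers([2.0, 1.0, -5.0], [2.0, 0.0, -5.0])
best = max(range(len(scores)), key=lambda i: scores[i])
```

The mask matters: without it, a token that is very unlikely at both layers could receive a large contrast score by chance.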

[29] Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

Kylie L. Anglin,Stephanie Milan,Brittney Hernandez,Claudia Ventura

Main category: cs.CL

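TL;DR: This paper presents an empirical framework for improving LLM performance on text classification through prompt engineering, specifically for identifying psychological constructs, and shows that the construct definition, task framing, and example selection in the prompt are critical to model performance.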

Details

Motivation: LLM classification in theory-driven fields such as psychology is strongly affected by prompt wording, yet existing research offers little on prompt optimization for such fields, so a systematic framework is needed to align model output with expert judgment. Method: Experimentally evaluates five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) in zero-shot and few-shot classification, combining human-crafted and automatically generated prompt variants that are selected on training-set performance and then validated. Result: The construct definition, task framing, and examples are the prompt features that matter most for classification performance; across three constructs and two models, few-shot prompts that combine codebook guidance with automatic prompt engineering produce the classifications closest to expert judgment. Conclusion: Researchers should generate and evaluate as many prompt variants as feasible, select the best prompt on empirical performance, and confirm it on an independent validation set, yielding theory-driven LLM prompt optimization that aligns closely with expert judgment. Abstract: Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
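
The recommended procedure (score every prompt variant on a training set, keep the best, then confirm it on a holdout set) can be sketched as follows. Simple agreement rate with expert labels is assumed as the metric, and `classify` is a hypothetical stand-in for an LLM call; the toy keyword "classifier" exists only so the example runs without a model.

```python
def agreement(prompt, classify, examples):
    """Fraction of examples where the model's label matches the expert label."""
    hits = sum(classify(prompt, text) == label for text, label in examples)
    return hits / len(examples)

def select_prompt(prompts, classify, train, holdout):
    """Pick the prompt with the best training agreement, then validate it."""
    best = max(prompts, key=lambda p: agreement(p, classify, train))
    return best, agreement(best, classify, holdout)

def toy_classify(prompt, text):
    # Hypothetical stand-in for an LLM: the "prompt" is just a keyword.
    return prompt in text

train = [("i feel sad", True), ("so sad today", True), ("happy day", False)]
holdout = [("sad again", True), ("all fine", False)]
best, holdout_score = select_prompt(["sad", "happy"], toy_classify, train, holdout)
```

Keeping the holdout set out of the selection step is the point of the design: the holdout score is then an honest estimate of how the chosen prompt will agree with experts on new texts.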

[30] Training and Evaluation of Guideline-Based Medical Reasoning in LLMs

Michael Staniek,Artem Sokolov,Stefan Riezler

Main category: cs.CL

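TL;DR: This paper proposes fine-tuning small language models to reason step by step according to medical consensus guidelines, enabling trustworthy early medical prediction, and combines them with a time series forecasting model to better infer the future values of sparse, irregularly sampled clinical variables.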

Details

Motivation: Machine learning performs well in early medical prediction but lacks explanations consistent with medical consensus, making it hard to earn clinicians' trust. This paper aims to make large language models (LLMs) follow medical consensus guidelines faithfully during reasoning, increasing the credibility of their predictions. Method: Converts medical consensus guidelines into instantiable inference rules and fine-tunes LLMs on electronic health records; uses the rules to automatically evaluate the model's reasoning (derivation correctness and value correctness), with the Sepsis-3 definition as the test case; and combines the LLM with a time series forecasting model in a multimodal framework to improve prediction of future clinical variables. Result: Small fine-tuned models follow consensus rules and their exceptions markedly better than one-shot prompting of much larger models and than models trained on medical texts, achieving nearly perfect derivation correctness on unseen data; adding the time series model improves prediction of sparsely sampled future variables. Conclusion: The main bottleneck for early prediction is not out-of-distribution generalization but forecasting into the future; fine-tuning models to learn and follow medical consensus rules is a key path to trustworthy AI, and temporal modeling further improves practical predictive performance. Abstract: Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.

[31] Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

Hongzhan Lin,Zhiqi Bai,Xinmiao Zhang,Sen Yang,Xiang Li,Siran Yang,Yunlong Xu,Jiaheng Liu,Yongchi Zhao,Jiamang Wang,Yuchi Xu,Wenbo Su,Bo Zheng

Main category: cs.CL

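TL;DR: Proposes FusedKV and FusedKV-Lite, which fuse the KV caches of the bottom and middle layers to cut the memory overhead of Transformer decoders, reducing cache memory by 50% while achieving lower perplexity than a standard Transformer.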

Details

Motivation: Existing cross-layer KV cache sharing methods (e.g., YOCO, CLA) typically underperform within-layer methods (e.g., GQA); the root cause needs to be understood and the performance improved. Method: Analyzes the information flow of top-layer KV and finds that values come mainly from the bottom layer while keys draw more on both bottom and middle layers; based on this, proposes FusedKV, a learnable fusion of the most informative KV from the bottom and middle layers that operates directly on post-RoPE keys to preserve positional information; and further proposes FusedKV-Lite, which uses bottom-layer values and middle-layer keys directly for better efficiency. Result: Experiments on LLMs from 332M to 4B parameters show lower validation perplexity than the standard Transformer decoder with 50% less cache memory; FusedKV-Lite slightly increases perplexity but further reduces I/O overhead. Conclusion: FusedKV and its lightweight variant FusedKV-Lite are efficient, high-performing alternatives to the standard Transformer architecture that ease the KV cache memory bottleneck for long sequences. Abstract: Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values of the top layers. Our preliminary analysis reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach, where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
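
The difference between the two variants can be illustrated with a toy sketch. The FusedKV-Lite part follows the abstract directly (top-layer keys from the middle layer, top-layer values from the bottom layer). The FusedKV part assumes the "learnable fusion" is a weighted sum with scalar weights, which is only one possible parameterization; the cache layout, function names, and weights are hypothetical.

```python
def fuse(bottom, middle, w_bottom, w_middle):
    """Weighted element-wise sum of two cached tensors (lists of vectors)."""
    return [[w_bottom * b + w_middle * m for b, m in zip(b_row, m_row)]
            for b_row, m_row in zip(bottom, middle)]

def fused_kv(cache, key_weights, value_weights):
    """Assumed FusedKV form: top-layer K/V as a learned mix of two layers."""
    keys = fuse(cache["bottom"]["k"], cache["middle"]["k"], *key_weights)
    values = fuse(cache["bottom"]["v"], cache["middle"]["v"], *value_weights)
    return keys, values

def fused_kv_lite(cache):
    """FusedKV-Lite: reuse middle-layer keys and bottom-layer values as-is."""
    return cache["middle"]["k"], cache["bottom"]["v"]

# Toy cache: one token position, 2-dimensional keys and values.
cache = {
    "bottom": {"k": [[1.0, 0.0]], "v": [[2.0, 2.0]]},
    "middle": {"k": [[0.0, 1.0]], "v": [[4.0, 0.0]]},
}
keys, values = fused_kv(cache, key_weights=(0.5, 0.5), value_weights=(1.0, 0.0))
lite_keys, lite_values = fused_kv_lite(cache)
```

In both variants the top layers store no cache of their own, which is where the memory saving comes from; the Lite variant also skips the mixing arithmetic, consistent with the lower I/O overhead the abstract reports.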

[32] BERnaT: Basque Encoders for Representing Natural Textual Diversity

Ekhi Azurmendi,Joseba Fernandez de Landa,Jaione Bengoetxea,Maite Heredia,Julen Etxaniz,Mikel Zubillaga,Ander Soraluze,Aitor Soroa

Main category: cs.CL

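TL;DR: This paper argues that language models should capture the full range of language variation rather than rely only on standardized text. The authors build new Basque corpora covering standard, social media, and historical sources and train the BERnaT family of models; models trained on the combined, diverse data perform better across task types without losing accuracy on standard benchmarks.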

Details

Motivation: Language model training often filters out non-standard language varieties, which weakens generalization and reinforces representational bias. This paper asks whether including more linguistic diversity makes models more robust and inclusive. Method: Taking Basque as the case study, builds new corpora of standard, social media, and historical text, pre-trains BERnaT encoder models in three configurations (standard, diverse, and combined), and proposes an evaluation framework that splits NLU tasks into standard and diverse subsets to measure linguistic generalization. Result: Models trained jointly on standard and diverse data outperform models trained only on standard data on all task types while keeping their accuracy on standard benchmarks. Conclusion: Including linguistic diversity helps build more inclusive and generalizable language models, and non-standard varieties deserve a place in training data. Abstract: Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.

[33] Is Lying Only Sinful in Islam? Exploring Religious Bias in Multilingual Large Language Models Across Major Religions

Kazi Abrab Hossain,Jannatul Somiya Mahmud,Maria Hossain Tuli,Anik Mitra,S. M. Taiabul Haque,Farig Y. Sadeque

Main category: cs.CL

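TL;DR: This paper introduces the BRAND dataset, which targets bias in multilingual large language models in religious contexts, focusing on the four main religions of South Asia (Buddhism, Christianity, Hinduism, and Islam). Models perform better in English than in Bengali and show persistent bias toward Islam.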

Details

Motivation: Current large language models are prone to misunderstandings on sensitive topics such as religion; in multilingual settings in particular, they represent religions inaccurately, which can lead to serious bias. Method: Builds BRAND, a bilingual religious accountable norm dataset with over 2,400 entries in English and Bengali covering the four main religions of South Asia, and evaluates models with three different types of prompts. Result: Models perform better in English than in Bengali and show systematic bias toward Islam even when answering religion-neutral questions. Conclusion: Multilingual models handle religion with persistent bias across languages, which highlights the difficulty of achieving fairness and accuracy in cross-lingual settings and the need to treat religion and spirituality more carefully in human-computer interaction design. Abstract: While recent developments in large language models have improved bias detection and classification, sensitive subjects like religion still present challenges because even minor errors can result in severe misunderstandings. In particular, multilingual models often misrepresent religions and have difficulties being accurate in religious contexts. To address this, we introduce BRAND (Bilingual Religious Accountable Norm Dataset), which focuses on the four main religions of South Asia (Buddhism, Christianity, Hinduism, and Islam) and contains over 2,400 entries; we use three different types of prompts in both English and Bengali. Our results indicate that models perform better in English than in Bengali and consistently display bias toward Islam, even when answering religion-neutral questions. These findings highlight persistent bias in multilingual models when similar questions are asked in different languages. We further connect our findings to the broader issues in HCI regarding religion and spirituality.

[34] Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study

Lifeng Chen,Ryan Lai,Tianming Liu

Main category: cs.CL

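TL;DR: This study presents a two-stage approach for adapting Qwen2.5-3B to Tibetan, a low-resource language: continual pretraining (CPT) establishes linguistic grounding, and supervised fine-tuning (SFT) then specializes the model for tasks and translation, substantially improving language modeling and translation performance.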

Details

Motivation: Data scarcity and cross-lingual drift make it hard to adapt large language models to low-resource languages such as Tibetan. The paper explores adaptation strategies that improve performance on a morphologically rich but underrepresented language. Method: A two-stage approach: continual pretraining (CPT) of Qwen2.5-3B to establish Tibetan linguistic grounding, followed by supervised fine-tuning (SFT) to strengthen task and Chinese-to-Tibetan translation ability, plus a layer-wise analysis on Qwen3-4B to understand the adaptation mechanism. Result: Perplexity drops markedly (2.98 → 1.54) and Chinese→Tibetan translation quality improves substantially (BLEU: 0.046 → 0.261; chrF: 2.2 → 6.6). The layer-wise analysis shows that adaptation concentrates in the embedding layer and output head, with mid-to-late MLP layers carrying domain-specific transformations. Conclusion: CPT builds a Tibetan semantic manifold, and SFT sharpens task alignment with minimal representational disruption. The study gives the first quantitative account of Tibetan adaptation dynamics in LLMs and a reproducible framework that can extend to other low-resource settings. Abstract: Adapting large language models (LLMs) to low-resource languages remains a major challenge due to data scarcity and cross-lingual drift. This work presents a two-stage adaptation of Qwen2.5-3B to Tibetan, a morphologically rich and underrepresented language. We employ Continual Pretraining (CPT) to establish Tibetan linguistic grounding, followed by Supervised Fine-Tuning (SFT) for task and translation specialization. Empirical evaluations demonstrate a consistent decrease in perplexity (from 2.98 $\rightarrow$ 1.54) and substantial improvements in Chinese$\rightarrow$Tibetan translation quality (BLEU: 0.046 $\rightarrow$ 0.261; chrF: 2.2 $\rightarrow$ 6.6). Layer-wise analysis across 435 layers in Qwen3-4B reveals that adaptation primarily concentrates on embedding and output heads, with mid--late MLP projections encoding domain-specific transformations. Our findings suggest that CPT constructs a Tibetan semantic manifold while SFT sharpens task alignment with minimal representational disruption. This study provides the first quantitative exploration of Tibetan adaptation dynamics for LLMs, and offers an open, reproducible framework for extending multilingual foundation models to low-resource settings.

[35] Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models

Taido Purason,Pavel Chizhov,Ivan P. Yamshchikov,Mark Fishel

Main category: cs.CL

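TL;DR: The paper proposes extending a vocabulary by continuing BPE training and removing redundant tokens with leaf-based pruning, so that pre-trained language models adapt more efficiently to new domains or languages.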

Details

Motivation: Tokenizer adaptation is essential when transferring pre-trained language models to new domains or languages. Existing vocabulary-extension methods tend to introduce many useless tokens and lack an effective pruning strategy. Method: The authors propose continued BPE training, which resumes the BPE merge learning process on new data, and leaf-based vocabulary pruning, which removes redundant tokens. Result: Experiments across multiple languages and model families show improved tokenization efficiency and better utilization of the added vocabulary, with model performance preserved. Conclusion: The methods provide practical tools for controlled vocabulary modification and improve tokenizer adaptation in cross-domain and cross-lingual settings. Abstract: Tokenizer adaptation plays an important role in transferring pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training, which adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source package.
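
To make the idea of continuing the BPE merge learning process concrete, here is a minimal sketch in plain Python. It is not the authors' released package: the function names `merge_pair`, `apply_merges`, and `continue_bpe`, the character-level starting point, and the tie-breaking rule are all assumptions made for illustration. The sketch segments new-domain words with the existing merges first and then keeps learning merges on top of them, so every added token is built from tokens the tokenizer can already produce.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with the merged token."""
    left, right = pair
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def apply_merges(word, merges):
    """Segment `word` by applying the merges in the order they were learned."""
    tokens = list(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


def continue_bpe(merges, corpus, num_new_merges):
    """Resume merge learning on new-domain words, starting from existing merges.

    Hypothetical helper: `corpus` is a list of words, and ties between equally
    frequent pairs are broken lexicographically so the result is deterministic.
    """
    merges = list(merges)  # do not mutate the caller's merge list
    freq = Counter(corpus)
    segmented = {w: apply_merges(w, merges) for w in freq}
    for _ in range(num_new_merges):
        pair_counts = Counter()
        for w, toks in segmented.items():
            for pair in zip(toks, toks[1:]):
                pair_counts[pair] += freq[w]
        if not pair_counts:
            break  # every word is already a single token
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        segmented = {w: merge_pair(toks, best) for w, toks in segmented.items()}
    return merges
```

Starting from the single merge `("l", "o")` and the words `lower`, `lowest`, `low`, `low`, two further merge steps in this toy version add `("lo", "w")` and then `("low", "e")`. The appended-tokenizer approach criticized in the abstract has no such guarantee that added tokens are reachable.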

[36] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

Ying Wang,Zhen Jin,Jiexiong Xu,Wenhai Lin,Yiquan Chen,Wenzhi Chen

Main category: cs.CL

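TL;DR: This paper presents AugServe, an efficient inference serving framework for augmented large language models that uses two-stage adaptive request scheduling and dynamic batching to substantially raise effective throughput and cut latency.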

Details

Motivation: Existing augmented-LLM inference systems suffer from head-of-line blocking caused by first-come-first-served scheduling and from static batching limits, so they struggle to meet the service-level objectives (SLOs) of latency-sensitive applications; effective throughput and service quality need to improve. Method: AugServe uses a two-stage adaptive request scheduling strategy: stage I orders requests using the inference features of augmented LLM requests, and stage II continuously refines those decisions with runtime information. It also adjusts token batching dynamically according to hardware status and real-time load. Result: Compared with vLLM and InferCept, AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput and reduces time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively. Conclusion: Adaptive scheduling and dynamic batching address head-of-line blocking and static configuration, substantially improving the efficiency of augmented LLM inference serving for latency-sensitive applications. Abstract: As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficiency and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maximize request handling within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, leading to queueing delays exceeding the SLOs for many requests; and (ii) a static batch token limit fails to adapt to fluctuating loads and hardware conditions. Both of these factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. Specifically, AugServe combines the inference features of augmented LLM requests to optimize the order of scheduling decisions (stage I). These decisions are continuously refined with runtime information (stage II), adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the token batching mechanism based on hardware status and real-time load, further enhancing throughput performance. Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
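
The link between head-of-line blocking and effective throughput can be illustrated with a toy single-server queue. The sketch below is not AugServe's scheduler: the function `requests_within_slo`, the shortest-first ordering, and the service times are assumptions chosen only to show why processing order changes how many requests finish inside the SLO.

```python
def requests_within_slo(service_times, order, slo):
    """Count requests that finish within `slo` time units.

    Toy model (an assumption, not the paper's setup): all requests arrive at
    time 0 and one server processes them back to back in the given order.
    """
    clock = 0
    met = 0
    for idx in order:
        clock += service_times[idx]  # completion time of this request
        if clock <= slo:
            met += 1
    return met


# One long request at the head of the queue, followed by three short ones.
service_times = [10, 1, 1, 1]
fcfs = list(range(len(service_times)))                           # arrival order
shortest_first = sorted(fcfs, key=lambda i: service_times[i])    # reordered queue
```

With an SLO of 5 time units, arrival order lets the long request block everything behind it and no request meets the SLO, while the reordered queue completes the three short requests in time. The total work is identical in both cases; only the number of requests served within the latency constraint changes, which is the quantity the abstract calls effective throughput.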

[37] Jina-VLM: Small Multilingual Vision Language Model

Andreas Koukounas,Georgios Mastrapas,Florian Hönicke,Sedigheh Eslami,Guillaume Roncari,Scott Martens,Han Xiao

Main category: cs.CL

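TL;DR: Jina-VLM is a 2.4B-parameter multilingual vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs.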

Details

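Motivation: The goal is an efficient, high-performing multilingual vision-language model that handles arbitrary-resolution images while retaining its text capabilities. Method: The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector for token-efficient image processing. Result: On standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while keeping competitive text-only performance. Conclusion: Jina-VLM performs strongly on multilingual visual question answering and leads current open 2B-scale VLMs. Abstract: We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.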

[38] SkillFactory: Self-Distillation For Learning Cognitive Behaviors

Zayne Sprague,Jack Lu,Manya Wadhwa,Sedrick Keh,Mengye Ren,Greg Durrett

Main category: cs.CL

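TL;DR: SkillFactory teaches a model cognitive skills through supervised fine-tuning (SFT) before reinforcement learning (RL), training on "silver" traces that the model itself generated and that were then rearranged, which improves generalization and robustness on complex tasks.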

Details

Motivation: When a base model does not exhibit certain cognitive skills, how can training still get the model to acquire them and use them effectively in subsequent reinforcement learning? Method: SkillFactory uses reasoning traces generated by the model itself during the SFT stage, rearranges them into training samples that display skills such as verification, backtracking, and retrying, uses these to initialize the model, and then runs RL. Result: Models initialized with SkillFactory generalize better to harder task variants after RL, despite lower pre-RL performance; analysis confirms that the models do use the cognitive skills and are more robust to regression on out-of-domain tasks. Conclusion: Introducing inductive biases through SFT before RL helps models acquire and apply complex cognitive skills more robustly. Abstract: Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.

cs.CV [Back]

[39] Hierarchical Process Reward Models are Symbolic Vision Learners

Shan Zhang,Aotian Chen,Kai Zou,Jindong Gu,Yuan Xue,Anton van den Hengel

Main category: cs.CV

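TL;DR: The paper proposes a new self-supervised symbolic auto-encoder that combines hierarchical process reward modeling with stabilization mechanisms to produce structured representations and reconstructions of diagrams, with strong results on several vision and reasoning tasks.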

Details

Motivation: Conventional pixel-based vision models struggle to provide interpretable understanding of diagrams, and symbolic computer vision needs a different learning paradigm to parse geometric elements and their relations. Method: The authors design a self-supervised symbolic auto-encoder built around Symbolic Hierarchical Process Reward Modeling; it encodes the geometric primitives in a diagram and their interrelationships, decodes them through an executable engine, and adds stabilization mechanisms to balance exploration and exploitation in reinforcement learning. Result: MSE on geometric diagram reconstruction falls by 98.2%; a 7B model surpasses GPT-4o by 0.6% on chart reconstruction; the MathGlance perception benchmark improves by +13%, and the MathVerse and GeoQA reasoning benchmarks each improve by +3%. Conclusion: The method combines the reasoning capabilities of neural networks with the interpretability of symbolic systems and markedly improves symbolic vision performance on reconstruction, perception, and reasoning tasks. Abstract: Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. Symbolic visual learners parse diagrams into geometric primitives (points, lines, and shapes), whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships within the latent space, and decodes them through our executable engine to reconstruct the input diagrams. Central to this architecture is Symbolic Hierarchical Process Reward Modeling, which applies hierarchical step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency. Since vanilla reinforcement learning exhibits poor exploration in the policy space during diagram reconstruction, we introduce stabilization mechanisms to balance exploration and exploitation. We fine-tune our symbolic encoder on downstream tasks, developing a neuro-symbolic system that integrates the reasoning capabilities of neural networks with the interpretability of symbolic models through reasoning-grounded visual rewards. Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: achieving a 98.2% reduction in MSE for geometric diagram reconstruction, surpassing GPT-4o by 0.6% with a 7B model on chart reconstruction, and improving by +13% on the MathGlance perception benchmark, and by +3% on MathVerse and GeoQA reasoning benchmarks.

[40] Drainage: A Unifying Framework for Addressing Class Uncertainty

Yasser Taha,Grégoire Montavon,Nils Körber

Main category: cs.CV

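TL;DR: The paper proposes a unified framework built on a "drainage node" that handles noisy labels, class ambiguity, and out-of-distribution sample rejection, substantially improving robustness and accuracy under a range of noise settings.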

Details

Motivation: Deep learning is unstable in the presence of noisy labels, class ambiguity, and anomalous samples, and lacks an effective unified solution, especially for instance-dependent and asymmetric noise. Method: A learnable "drainage node" is added at the network output to dynamically reallocate probability mass toward uncertainty, so ambiguous or noisy samples are diverted naturally while training stays end-to-end and differentiable. Result: With instance-dependent or asymmetric noise added to CIFAR-10/100, accuracy in the high-noise regime improves by up to 9%; on real-world datasets such as mini-WebVision, mini-ImageNet, and Clothing-1M the method matches or surpasses the state of the art; qualitative analysis shows the drainage node absorbing mislabeled and outlier samples. Conclusion: The drainage node is a general and scalable mechanism that improves robustness to noise and anomalies and extends to semi-supervised dataset cleaning and open-set recognition. Abstract: Modern deep learning faces significant challenges with noisy labels, class ambiguity, as well as the need to robustly reject out-of-distribution or corrupted samples. In this work, we propose a unified framework based on the concept of a "drainage node" which we add at the output of the network. The node serves to reallocate probability mass toward uncertainty, while preserving desirable properties such as end-to-end training and differentiability. This mechanism provides a natural escape route for highly ambiguous, anomalous, or noisy samples, particularly relevant for instance-dependent and asymmetric label noise. In systematic experiments involving the addition of varying proportions of instance-dependent noise or asymmetric noise to CIFAR-10/100 labels, our drainage formulation achieves an accuracy increase of up to 9% over existing approaches in the high-noise regime. Our results on real-world datasets, such as mini-WebVision, mini-ImageNet and Clothing-1M, match or surpass existing state-of-the-art methods. Qualitative analysis reveals a denoising effect, where the drainage neuron consistently absorbs corrupt, mislabeled, or outlier data, leading to more stable decision boundaries. Furthermore, our drainage formulation enables applications well beyond classification, with immediate benefits for web-scale, semi-supervised dataset cleaning, and open-set applications.
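
A rough picture of what an extra output node does at inference time is sketched below. The abstract does not give the training objective, so none is reproduced here; the function `predict_with_drainage` and its rejection rule (abstain when the drainage probability exceeds every class probability) are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_drainage(logits):
    """Classify with K class logits followed by one drainage logit.

    Returns (class_index_or_None, drainage_probability). The abstain rule is a
    hypothetical choice, not taken from the paper.
    """
    probs = softmax(logits)
    drain = probs[-1]  # probability mass routed to the drainage node
    best = max(range(len(probs) - 1), key=lambda k: probs[k])
    if drain > probs[best]:
        return None, drain  # treat the sample as ambiguous, noisy, or out-of-distribution
    return best, drain
```

Because the drainage logit shares the softmax with the class logits, any mass it takes is removed from the classes, which is one simple way to read "reallocating probability mass toward uncertainty".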

[41] Does Head Pose Correction Improve Biometric Facial Recognition?

Justin Norman,Hany Farid

Main category: cs.CV

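TL;DR: The study examines how AI-driven head-pose correction and image restoration affect facial recognition accuracy on real-world images, and finds that selectively combining CFR-GAN with CodeFormer yields meaningful improvements.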

Details

Motivation: Real-world images are often low quality, non-frontal, and occluded, which lowers the accuracy of existing facial recognition models; the paper asks whether image restoration can improve recognition. Method: A model-agnostic, large-scale forensic evaluation pipeline assesses three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). Result: Applying these techniques naively degrades recognition accuracy substantially, but selectively applying CFR-GAN combined with CodeFormer yields clear gains. Conclusion: Image restoration must be applied with care; a sensible combination of frontalization and feature enhancement can improve facial recognition accuracy in practice. Abstract: Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, which are often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.

[42] Flux4D: Flow-based Unsupervised 4D Reconstruction

Jingkang Wang,Henry Che,Yun Chen,Ze Yang,Lily Goli,Sivabalan Manivasagam,Raquel Urtasun

Main category: cs.CV

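TL;DR: The paper proposes Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes that is fully unsupervised, relying only on photometric losses and an "as static as possible" regularization to decompose dynamic elements efficiently and with high quality.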

Details

Motivation: Existing methods for large-scale dynamic scene reconstruction scale poorly, depend on annotations, or are sensitive to hyperparameters; a more efficient, unsupervised, and generalizable method is needed. Method: Flux4D directly predicts 3D Gaussians and their motion dynamics, uses photometric losses with an "as static as possible" regularization, and achieves fully unsupervised 4D reconstruction by training across many scenes. Result: On outdoor driving datasets it significantly outperforms existing methods in scalability, generalization, and reconstruction quality, and reconstructs dynamic scenes within seconds. Conclusion: Flux4D is an efficient, scalable, unsupervised framework for 4D reconstruction of large-scale dynamic scenes that decomposes dynamic elements directly from raw data and performs well on real-world scenes. Abstract: Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an "as static as possible" regularization, Flux4D learns to decompose dynamic elements directly from raw data without requiring pre-trained supervised models or foundational priors simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.

[43] Object Counting with GPT-4o and GPT-5: A Comparative Study

Richard Füzesséry,Kaziwa Saleh,Sándor Szénási,Zoltán Vámossy

Main category: cs.CV

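TL;DR: This study uses multimodal large language models (GPT-4o and GPT-5) for zero-shot object counting from text prompts alone, with no supervision or visual exemplars, and reports performance on datasets such as FSC-147 that matches or even exceeds existing methods.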

Details

Motivation: Existing zero-shot object counting methods depend on large amounts of annotated data and on visual exemplars, which limits generalization; the strong reasoning and understanding abilities of large language models motivate exploring them for unsupervised counting. Method: Two multimodal LLMs, GPT-4o and GPT-5, perform zero-shot object counting using only designed text prompts, and are evaluated and compared on the FSC-147 and CARPK datasets. Result: On FSC-147 both models perform on par with state-of-the-art zero-shot methods and in some cases better; results on CARPK are more limited. Conclusion: Multimodal LLMs can support zero-shot object counting through text prompts alone, which points toward less reliance on annotated data and visual exemplars. Abstract: Zero-shot object counting attempts to estimate the number of object instances belonging to novel categories that the vision model performing the counting has never encountered during training. Existing methods typically require large amounts of annotated data and often require visual exemplars to guide the counting process. However, large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggest the possibility of utilizing them for counting tasks without any supervision. In this work we aim to leverage the visual capabilities of two multi-modal LLMs, GPT-4o and GPT-5, to perform object counting in a zero-shot manner using only textual prompts. We evaluate both models on the FSC-147 and CARPK datasets and provide a comparative analysis. Our findings show that the models achieve performance comparable to the state-of-the-art zero-shot approaches on FSC-147 and, in some cases, even surpass them.

[44] LLM-Guided Material Inference for 3D Point Clouds

Nafiseh Izadyar,Teseo Schneider

Main category: cs.CV

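TL;DR: The paper proposes a two-stage method based on large language models that infers object semantics and material composition from 3D point clouds in a zero-shot manner, using the language model as a prior that bridges geometry and material understanding.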

Details

Motivation: Existing 3D shape datasets and models focus mostly on geometry and overlook the material properties that determine appearance, and reliable material annotations are lacking. Method: A two-stage LLM approach: the first stage predicts the object's semantics, and the second assigns plausible materials to each geometric segment conditioned on those semantics; the whole process needs no task-specific training and runs zero-shot. Result: Evaluated with an LLM-as-a-Judge on 1,000 shapes from Fusion/ABS and ShapeNet, both the semantic and material assignments are highly plausible. Conclusion: Language models can serve as general-purpose priors that connect geometric reasoning and material understanding in 3D data. Abstract: Most existing 3D shape datasets and models focus solely on geometry, overlooking the material properties that determine how objects appear. We introduce a two-stage large language model (LLM) based method for inferring material composition directly from 3D point clouds with coarse segmentations. Our key insight is to decouple reasoning about what an object is from what it is made of. In the first stage, an LLM predicts the object's semantics; in the second stage, it assigns plausible materials to each geometric segment, conditioned on the inferred semantics. Both stages operate in a zero-shot manner, without task-specific training. Because existing datasets lack reliable material annotations, we evaluate our method using an LLM-as-a-Judge implemented in DeepEval. Across 1,000 shapes from Fusion/ABS and ShapeNet, our method achieves high semantic and material plausibility. These results demonstrate that language models can serve as general-purpose priors for bridging geometric reasoning and material understanding in 3D data.

[45] 2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition

Liying Lu,Raphaël Achddou,Sabine Süsstrunk

Main category: cs.CV

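TL;DR: The paper proposes a general and practical noise synthesis method that needs only a single noisy image and a dark frame to generate realistic low-light noise for training denoisers, and reaches state-of-the-art performance on several benchmarks.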

Details

Motivation: Learning-based low-light denoisers need large paired sets of clean and noisy images, which are hard to collect, so a more practical noise synthesis method is needed as an alternative to large-scale acquisition. Method: Signal-dependent noise is modeled with a Poisson distribution, and a Fourier-domain spectral sampling algorithm models signal-independent noise accurately; a single noisy image and one dark frame suffice to synthesize noise with realistic spatial and statistical properties. Result: The method achieves state-of-the-art performance on multiple low-light denoising benchmarks, and the generated noise preserves the diversity and statistics of real sensor noise. Conclusion: The noise synthesis method is accurate and practical, needs neither complex parametric models nor large paired datasets, and markedly improves the training of low-light denoisers. Abstract: Raw images taken in low-light conditions are very noisy due to low photon count and sensor noise. Learning-based denoisers have the potential to reconstruct high-quality images. For training, however, these denoisers require large paired datasets of clean and noisy images, which are difficult to collect. Noise synthesis is an alternative to large-scale data acquisition: given a clean image, we can synthesize a realistic noisy counterpart. In this work, we propose a general and practical noise synthesis method that requires only one single noisy image and one single dark frame per ISO setting. We represent signal-dependent noise with a Poisson distribution and introduce a Fourier-domain spectral sampling algorithm to accurately model signal-independent noise. The latter generates diverse noise realizations that maintain the spatial and statistical properties of real sensor noise. As opposed to competing approaches, our method neither relies on simplified parametric models nor on large sets of clean-noisy image pairs. Our synthesis method is not only accurate and practical, it also leads to state-of-the-art performances on multiple low-light denoising benchmarks.
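
The split between signal-dependent and signal-independent noise can be sketched in a few lines of plain Python. This is a simplified stand-in, not the paper's method: the signal-independent part here is resampled pixel by pixel from a dark frame, which ignores the spatial structure that the Fourier-domain spectral sampler is designed to preserve. The names `sample_poisson` and `synthesize_noisy` and the `gain` parameter are assumptions.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method.

    Adequate for the small-to-moderate photon counts of a toy example.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def synthesize_noisy(clean, dark_frame, gain, rng):
    """Add shot noise and dark-frame noise to a flat list of clean pixel values.

    `gain` (sensor units per photon) is a hypothetical calibration parameter.
    """
    noisy = []
    for value in clean:
        photons = sample_poisson(value / gain, rng)  # signal-dependent part
        offset = rng.choice(dark_frame)              # signal-independent part (simplified)
        noisy.append(gain * photons + offset)
    return noisy
```

A call such as `synthesize_noisy(clean_pixels, dark_pixels, 2.0, random.Random(0))` returns one noisy realization; changing the seed gives another, which is how a synthesizer of this kind would produce many training pairs from one clean image.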

Haitian Zheng,Yuan Yao,Yongsheng Yu,Yuqian Zhou,Jiebo Luo,Zhe Lin

Main category: cs.CV

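TL;DR: PixPerfect is a pixel-level refinement framework targeting the pixel-level artifacts that latent diffusion models (LDMs) produce in local image editing. With a differentiable discriminative pixel space, realistic artifact simulation, and direct pixel-space refinement, it substantially improves visual fidelity and generalization on inpainting, object removal, and object insertion.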

Details

Motivation: In inpainting and local editing, latent diffusion models suffer from pixel-level inconsistencies caused by latent compression (color shifts, texture mismatches, and boundary seams), and existing remedies neither remove these artifacts reliably nor generalize well. Method: The PixPerfect framework combines (1) a differentiable discriminative pixel space that amplifies and suppresses color and texture discrepancies, (2) a training-time simulation pipeline covering a range of realistic artifacts, and (3) a direct pixel-space refinement scheme that applies broadly across LDM architectures and tasks. Result: Experiments on inpainting, object removal, and insertion benchmarks show that PixPerfect substantially improves perceptual quality and downstream editing performance over existing methods. Conclusion: PixPerfect sets a new standard of fidelity and robustness for LDM-based local image editing and generalizes well across models and tasks. Abstract: Latent Diffusion Models (LDMs) have markedly advanced the quality of image inpainting and local editing. However, the inherent latent compression often introduces pixel-level inconsistencies, such as chromatic shifts, texture mismatches, and visible seams along editing boundaries. Existing remedies, including background-conditioned latent decoding and pixel-space harmonization, usually fail to fully eliminate these artifacts in practice and do not generalize well across different latent representations or tasks. We introduce PixPerfect, a pixel-level refinement framework that delivers seamless, high-fidelity local edits across diverse LDM architectures and tasks. PixPerfect leverages (i) a differentiable discriminative pixel space that amplifies and suppresses subtle color and texture discrepancies, (ii) a comprehensive artifact simulation pipeline that exposes the refiner to realistic local editing artifacts during training, and (iii) a direct pixel-space refinement scheme that ensures broad applicability across diverse latent representations and tasks. Extensive experiments on inpainting, object removal, and insertion benchmarks demonstrate that PixPerfect substantially enhances perceptual fidelity and downstream editing performance, establishing a new standard for robust and high-fidelity localized image editing.

[47] PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispectral Remote Sensing Imagery

Mark Moussa,Andre Williams,Seth Roffe,Douglas Morton

Main category: cs.CV

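TL;DR: The paper presents PyroFocus, a two-stage deep learning pipeline for real-time wildfire detection and intensity estimation on airborne and spaceborne missions, which combines classification with regression or segmentation to improve efficiency.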

Details

Motivation: As wildfires grow more frequent and severe, processing hyperspectral data is constrained by high dimensionality and limited onboard resources, so low-latency, efficient real-time detection methods are needed. Method: The authors systematically evaluate several deep learning models (CNNs and Transformers) and propose the two-stage PyroFocus framework: multi-class fire classification first, then fire radiative power (FRP) regression or segmentation, which reduces inference time and compute cost. Data from NASA's MASTER sensor is used for validation. Result: The two-stage pipeline achieves a good trade-off among accuracy, inference latency, and resource efficiency, clearly outperforms conventional approaches, and suits edge deployment. Conclusion: PyroFocus offers an efficient, edge-deployable solution for future wildfire monitoring missions, with the potential for real-time wildfire detection and intensity estimation. Abstract: Rapid and accurate wildfire detection is crucial for emergency response and environmental management. In airborne and spaceborne missions, real-time algorithms must distinguish between no fire, active fire, and post-fire conditions, and estimate fire intensity. Multispectral and hyperspectral thermal imagers provide rich spectral information, but high data dimensionality and limited onboard resources make real-time processing challenging. As wildfires increase in frequency and severity, the need for low-latency and computationally efficient onboard detection methods is critical. We present a systematic evaluation of multiple deep learning architectures, including custom Convolutional Neural Networks (CNNs) and Transformer-based models, for multi-class fire classification. We also introduce PyroFocus, a two-stage pipeline that performs fire classification followed by fire radiative power (FRP) regression or segmentation to reduce inference time and computational cost for onboard deployment. Using data from NASA's MODIS/ASTER Airborne Simulator (MASTER), which is similar to a next-generation fire detection sensor, we compare accuracy, inference latency, and resource efficiency. Experimental results show that the proposed two-stage pipeline achieves strong trade-offs between speed and accuracy, demonstrating significant potential for real-time edge deployment in future wildfire monitoring missions.
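
The compute saving of a two-stage design comes from running the expensive second stage only where the first stage finds fire. The sketch below shows that control flow with toy stand-ins. Everything here is an assumption for illustration: `two_stage`, `toy_classify`, `toy_frp`, the thresholds, and the rule that only `active_fire` tiles reach stage two are not taken from the PyroFocus models.

```python
def two_stage(tiles, classify, estimate_frp):
    """Run a cheap classifier on every tile and the costly estimator only on fire tiles."""
    results = []
    for tile in tiles:
        label = classify(tile)  # stage 1: no_fire / active_fire / post_fire
        # Stage 2 is skipped for tiles without active fire (assumed gating rule).
        frp = estimate_frp(tile) if label == "active_fire" else None
        results.append((label, frp))
    return results


def toy_classify(tile):
    """Hypothetical stage-one stand-in: threshold the mean pixel value."""
    mean = sum(tile) / len(tile)
    if mean > 400:
        return "active_fire"
    if mean > 320:
        return "post_fire"
    return "no_fire"


def toy_frp(tile):
    """Hypothetical stage-two stand-in for an FRP regressor."""
    return max(tile) - 400
```

On a scene where most tiles contain no active fire, the second model runs on only a small fraction of the input, which is the source of the latency and resource savings the abstract describes.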

[48] SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding

Hongpei Zheng,Shijie Li,Yanran Li,Hujun Yin

Main category: cs.CV

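TL;DR: The paper introduces H$^2$U3D, a visual question answering dataset for house-scale 3D scene understanding, and SpatialReasoner, an active-perception framework for spatial reasoning. Trained in two stages, the framework reaches state-of-the-art performance while using only a few images.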

Details

Motivation: Current vision-language models have limited spatial reasoning ability in large-scale 3D environments; they are usually confined to room-scale scenes and cannot handle multi-floor, large residential environments. Method: The authors build the H$^2$U3D dataset of multi-floor, large-area 3D scenes, using an automated annotation pipeline to produce hierarchical coarse-to-fine visual representations and question-answer pairs with chain-of-thought annotations. They propose SpatialReasoner, which combines active perception with spatial tool invocation and is trained in two stages: a supervised cold start followed by reinforcement learning with an adaptive exploration reward. Result: On H$^2$U3D, SpatialReasoner clearly outperforms strong baselines such as GPT-4o and Gemini-2.5-Pro, using only 3-4 images on average where baselines need 16 or more, which confirms its efficient exploration. Conclusion: The work demonstrates the effectiveness of coarse-to-fine active exploration for large-scale 3D spatial reasoning and advances the use of vision-language models in complex real-world environments. Abstract: Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.

Thomas Monninger,Zihan Zhang,Steffen Staab,Sihao Ding

Main category: cs.CV

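TL;DR: The paper proposes NavMapFusion, a diffusion-based framework that fuses low-fidelity navigation maps with high-fidelity sensor data for robust online map construction, substantially improving the accuracy and timeliness of environment representations for autonomous driving.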

Details

Motivation: Conventional high-definition maps are hard to keep up to date in real time; widely available but lower-precision navigation maps can serve as a prior to guide online map construction from on-board sensors. Method: NavMapFusion is an iterative denoising framework based on a diffusion model, conditioned on high-fidelity sensor data and low-fidelity navigation maps; discrepancies between the prior map and perception are treated as noise within the diffusion process. Result: On the nuScenes benchmark, with coarse road lines from OpenStreetMap, NavMapFusion achieves a 21.4% relative improvement at a 100 m range, with larger gains at larger perception ranges, while maintaining real-time performance. Conclusion: Diffusion models provide a robust framework for map fusion; fusing low-precision priors with high-precision perception yields accurate, up-to-date environment representations and supports safer, more reliable autonomous driving. Abstract: Accurate environmental representations are essential for autonomous driving, providing the foundation for safe and efficient navigation. Traditionally, high-definition (HD) maps provide this representation of the static road infrastructure to the autonomous system a priori. However, because the real world is constantly changing, such maps must be constructed online from on-board sensor data. Navigation-grade standard-definition (SD) maps are widely available, but their resolution is insufficient for direct deployment. Instead, they can be used as a coarse prior to guide the online map construction process. We propose NavMapFusion, a diffusion-based framework that performs iterative denoising conditioned on high-fidelity sensor data and on low-fidelity navigation maps. This paper strives to answer: (1) How can coarse, potentially outdated navigation maps guide online map construction? (2) What advantages do diffusion models offer for map fusion? We demonstrate that diffusion-based map construction provides a robust framework for map fusion. Our key insight is that discrepancies between the prior map and online perception naturally correspond to noise within the diffusion process; consistent regions reinforce the map construction, whereas outdated segments are suppressed. On the nuScenes benchmark, NavMapFusion conditioned on coarse road lines from OpenStreetMap data reaches a 21.4% relative improvement at a 100 m perception range, and even stronger improvements at larger perception ranges, while maintaining real-time capabilities. By fusing low-fidelity priors with high-fidelity sensor data, the proposed method generates accurate and up-to-date environment representations, guiding towards safer and more reliable autonomous driving. The code is available at https://github.com/tmonnin/navmapfusion

[50] Step-by-step Layered Design Generation

Faizan Farooq Khan,K J Joseph,Koustava Goswami,Mohamed Elhoseiny,Balaji Vasan Srinivasan

Main category: cs.CV

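TL;DR: The paper proposes a new problem setting, Step-by-Step Layered Design Generation, and builds SLEDGE on a multimodal large model to update a design layer by layer in response to instructions; it also releases an accompanying benchmark and dataset.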

Details

Motivation: Existing design generation methods mostly treat the task as single-step generation, ignoring the step-by-step, iterative nature of design and failing to reflect the complexity of real design work. Method: The SLEDGE model uses a multimodal large language model to map each design instruction to an atomic, layered modification of the previous state, generating progressively by grounding each update in the instruction. Result: Extensive experiments on the newly proposed evaluation suite show that SLEDGE outperforms existing state-of-the-art methods on step-by-step design generation. Conclusion: The work advances a generation paradigm closer to real design workflows and gives the field a new research direction and benchmark. Abstract: Design generation, in its essence, is a step-by-step process where designers progressively refine and enhance their work through careful modifications. Despite this fundamental characteristic, existing approaches mainly treat design synthesis as a single-step generation problem, significantly underestimating the inherent complexity of the creative process. To bridge this gap, we propose a novel problem setting called Step-by-Step Layered Design Generation, which tasks a machine learning model with generating a design that adheres to a sequence of instructions from a designer. Leveraging recent advancements in multi-modal LLMs, we propose SLEDGE: Step-by-step LayEred Design GEnerator to model each update to a design as an atomic, layered change over its previous state, while being grounded in the instruction. To complement our new problem setting, we introduce a new evaluation suite, including a dataset and a benchmark. Our exhaustive experimental analysis and comparison with state-of-the-art approaches tailored to our new setup demonstrate the efficacy of our approach. We hope our work will attract attention to this pragmatic and under-explored research area.

[51] ProtoEFNet: Dynamic Prototype Learning for Inherently Interpretable Ejection Fraction Estimation in Echocardiography

Yeganeh Ghamary,Victoria Wu,Hooman Vaseli,Christina Luong,Teresa Tsang,Siavash Bigdeli,Purang Abolmaesumi

Main category: cs.CV

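TL;DR: The paper proposes ProtoEFNet, a video-based prototype learning model for continuous ejection fraction (EF) regression that uses dynamic spatiotemporal prototypes and a new PAS loss to achieve high accuracy together with clinical interpretability.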

Details

Motivation: Existing deep learning models for EF prediction lack transparency, and post-hoc explanation methods do not guide the model's internal reasoning, which limits clinical trust and adoption. Method: ProtoEFNet uses prototype learning to capture clinically meaningful cardiac motion patterns and introduces a Prototype Angular Separation (PAS) loss to strengthen discriminative representations across the continuous EF range. Result: On the EchonetDynamic dataset, ProtoEFNet matches the accuracy of its non-interpretable counterpart; the PAS loss raises the F1 score by 2%, from 77.67±2.68 to 79.64±2.10, and the model provides clinically relevant explanations. Conclusion: ProtoEFNet keeps predictive performance high while providing reliable inherent interpretability, which helps build clinical trust in AI models and supports their use in cardiac function assessment. Abstract: Ejection fraction (EF) is a crucial metric for assessing cardiac function and diagnosing conditions such as heart failure. Traditionally, EF estimation requires manual tracing and domain expertise, making the process time-consuming and subject to interobserver variability. Most current deep learning methods for EF prediction are black-box models with limited transparency, which reduces clinical trust. Some post-hoc explainability methods have been proposed to interpret the decision-making process after the prediction is made. However, these explanations do not guide the model's internal reasoning and therefore offer limited reliability in clinical applications. To address this, we introduce ProtoEFNet, a novel video-based prototype learning model for continuous EF regression. The model learns dynamic spatiotemporal prototypes that capture clinically meaningful cardiac motion patterns. Additionally, the proposed Prototype Angular Separation (PAS) loss enforces discriminative representations across the continuous EF spectrum. Our experiments on the EchonetDynamic dataset show that ProtoEFNet can achieve accuracy on par with its non-interpretable counterpart while providing clinically relevant insight. The ablation study shows that the proposed loss boosts performance with a 2% increase in F1 score from 77.67$\pm$2.68 to 79.64$\pm$2.10. Our source code is available at: https://github.com/DeepRCL/ProtoEF

[52] HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration

Seunghoi Kim,Henry F. J. Tregidgo,Chen Jin,Matteo Figini,Daniel C. Alexander

Main category: cs.CV

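TL;DR: Proposes HalluGen, a diffusion-based framework that generates controllable hallucinations for image restoration, and builds the first large-scale hallucination dataset for systematically evaluating and mitigating generative errors in safety-critical domains.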

Details

Motivation: Generative models tend to hallucinate in image restoration, which seriously harms reliability and trust in safety-critical domains such as medical imaging, yet the lack of labeled data has held back research on hallucination evaluation. Method: Proposes the HalluGen framework, which uses a diffusion model to synthesize realistic hallucinated images with controllable type, location, and severity, and builds a large-scale brain MRI hallucination dataset of 4,350 annotated images for evaluating hallucination detection and mitigation methods. Result: The generated images contain significant semantic errors (segmentation IoU drops from 0.86 to 0.36). The dataset is applied in two settings: improving the hallucination sensitivity of image quality metrics (SHAFE) and training reference-free hallucination detectors. Conclusion: HalluGen and its open dataset provide the first scalable foundation for evaluating hallucinations in safety-critical image restoration. Abstract: Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors. Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective. We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36). Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation. We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures. Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.
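The IoU figure quoted in this entry (0.86 falling to 0.36) is the standard intersection over union between a predicted and a reference segmentation mask. A minimal sketch of that metric, assuming binary masks stored as nested lists; the function name `mask_iou` and the toy masks are illustrative and are not taken from the HalluGen code:

```python
def mask_iou(pred, ref):
    """Intersection over union of two equally sized binary masks."""
    inter = union = 0
    for row_p, row_r in zip(pred, ref):
        for p, r in zip(row_p, row_r):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


# Toy 3x3 masks (hypothetical): the second one moves part of the structure.
reference = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
hallucinated = [[0, 1, 0],
                [0, 1, 0],
                [1, 1, 0]]

print(mask_iou(reference, reference))
print(mask_iou(hallucinated, reference))
```

A restoration that looks plausible but relocates or invents structure lowers this score even when pixel-level image quality metrics barely change, which is the gap the entry describes.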

[53] Hierarchical Attention for Sparse Volumetric Anomaly Detection in Subclinical Keratoconus

Lynn Kandakji,William Woof,Nikolas Pontikos

Main category: cs.CV

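TL;DR: Compares 16 modern deep learning architectures for detecting subclinical keratoconus (SKC) in 3D OCT volumes and finds that hierarchical attention models provide a better inductive bias, markedly improving sensitivity and specificity for weak, scattered lesions.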

Details

Motivation: Existing 2D/3D CNNs and ViTs are limited when detecting weak, non-contiguous early lesions in volumetric medical images: CNNs are constrained by local receptive fields, while the global attention of ViTs is too diffuse to capture sparse anomaly patterns. Method: Systematically compares 16 deep learning models spanning 2D/3D convolutional, hybrid, and volumetric transformer families for subclinical keratoconus detection on 3D anterior segment OCT data, and examines model behavior through attention analysis and representational similarity analysis. Result: Hierarchical attention models achieve 21-23% higher sensitivity and specificity than CNNs and ViTs while being more parameter-efficient. Mechanistic analysis shows that their windowed hierarchical structure matches the multi-slice extent of subclinical lesions, giving an intermediate spatial integration scale, and that weak signals require longer spatial integration distances. Conclusion: Hierarchical attention provides a principled and effective inductive bias for detecting early pathological change in 3D medical imaging and offers guidance for designing future volumetric anomaly detection systems. Abstract: The detection of weak, spatially distributed anomalies in volumetric medical imaging remains a major challenge. The subtle, non-adjacent nature of early disease signals is often lost due to suboptimal architectural inductive biases: 2D/3D CNNs impose strong locality, while ViTs diffuse unconstrained global attention. This conflict leaves the optimal inductive structure for robust, sparse volumetric pattern recognition unresolved. This study presents a controlled comparison of sixteen modern deep learning architectures spanning 2D/3D convolutional, hybrid, and volumetric transformer families for subclinical keratoconus (SKC) detection from 3D anterior segment OCT volumes. We demonstrate that hierarchical attention models offer a superior and more parameter-efficient inductive bias, surpassing the performance of both 2D and 3D CNNs and ViTs. Our results show 21-23% higher sensitivity and specificity in the sparse anomaly (subclinical) regime. Mechanistic analyses reveal that this advantage stems from precise spatial scale alignment: hierarchical windowing produces effective receptive fields matched to the intermediate, multi-slice extent of subclinical abnormalities. This avoids excessive CNN locality and diffuse global attention. Attention-distance measurements confirm a key insight into architectural adaptation: the required spatial integration length shifts significantly based on the signal strength, with subclinical cases necessitating longer integration compared to both healthy and manifest disease states. Representational similarity and auxiliary age/sex prediction tasks further support the generalizability of these inductive principles. The findings provide design guidance for future volumetric anomaly detection systems, establishing hierarchical attention as a principled and effective approach for early pathological change analysis in 3D medical imaging.

[54] SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation

Yu Yuan,Tharindu Wickremasinghe,Zeeshan Nadir,Xijun Wang,Yiheng Chi,Stanley H. Chan

Main category: cs.CV

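TL;DR: SeeU proposes a new 2D→4D→2D learning framework that reconstructs a continuous 4D dynamic world from sparse monocular images to generate physically consistent unseen visual content.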

Details

Motivation: Most existing methods operate in 2D space and ignore the 3D structure and temporal dynamics of the real world, which limits performance. SeeU aims to improve visual understanding and generation by modeling continuous 4D dynamics. Method: Uses a 2D→4D→2D framework: it first reconstructs the 4D scene from monocular video, then learns continuous dynamics on a low-rank representation with physical constraints, and finally rolls the scene forward and re-projects it to 2D to generate content at new viewpoints and times. Result: SeeU performs well on unseen temporal generation, unseen spatial generation, and video editing, producing results that are spatio-temporally continuous and physically consistent. Conclusion: Modeling continuous 4D dynamics improves the quality and plausibility of visual generation, and SeeU offers a new paradigm for future video understanding and generation. Abstract: Images and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns the continuous 4D dynamics and generates unseen visual content. The principle behind SeeU is a new 2D$\to$4D$\to$2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D$\to$4D). It then learns the continuous 4D dynamics on a low-rank representation with physical constraints (discrete 4D$\to$continuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D$\to$2D). By modeling dynamics in 4D, SeeU achieves continuous and physically consistent novel visual generation, demonstrating strong potential in multiple tasks including unseen temporal generation, unseen spatial generation, and video editing.

[55] A Hybrid Deep Learning Framework with Explainable AI for Lung Cancer Classification with DenseNet169 and SVM

Md Rashidul Islam,Bakary Gibba,Altagi Abdallah Bakheit Abdelgadir

Main category: cs.CV

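TL;DR: Proposes a deep learning system for automatic classification of lung CT images that combines DenseNet169 and SVM models for accurate, interpretable lung cancer diagnosis.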

Details

Motivation: To improve the accuracy and efficiency of early lung cancer diagnosis and reduce the time and errors of manual reading. Method: Uses the IQOTHNCCD dataset and classifies with DenseNet169 (with SE blocks, Focal Loss, and an FPN) and an SVM model built on MobileNetV2 feature extraction, adding Grad-CAM and SHAP for interpretability. Result: Both the DenseNet169 and SVM models reach 98% accuracy, supporting the effectiveness and robustness of the approach. Conclusion: The method performs well in accuracy, interpretability, and practicality, and has potential for clinical use. Abstract: Lung cancer is a highly lethal disease worldwide, and early diagnosis is crucial for increasing patient survival rates. Computed tomography (CT) scans are widely used for lung cancer diagnosis because they show lung structures in detail. However, manual interpretation is time-consuming and prone to human error. To address this challenge, the study proposes a deep learning-based automatic lung cancer classification system to enhance detection accuracy and interpretability. The study uses the IQOTHNCCD lung cancer dataset, a public CT scan dataset with cases categorized as Normal, Benign, and Malignant, and a DenseNet169 model that includes Squeeze-and-Excitation blocks for attention-based feature extraction, Focal Loss for handling class imbalance, and a Feature Pyramid Network (FPN) for multi-scale feature fusion. In addition, an SVM model was developed using MobileNetV2 for feature extraction, improving its classification performance. To enhance interpretability, the study integrated Grad-CAM to visualize decision-making regions in CT scans and SHAP (Shapley Additive Explanations) to explain feature contributions within the SVM model. In the evaluation, both the DenseNet169 and SVM models achieved 98% accuracy, suggesting their robustness for real-world medical practice. These results point to the potential of deep learning to improve lung cancer diagnosis through greater accuracy, transparency, and robustness.

Nan Zhou,Huandong Wang,Jiahao Li,Han Li,Yali Song,Qiuhua Wang,Yong Li,Xinlei Chen

Main category: cs.CV

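TL;DR: Presents FireSentry, a high-resolution multi-modal wildfire dataset, and FiReDiff, a dual-modality prediction paradigm, which together substantially improve the accuracy of fine-grained wildfire spread prediction.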

Details

Motivation: Existing research relies on low-resolution satellite data, which cannot capture fine local fire dynamics and so limits high-precision wildfire prediction. Method: Builds FireSentry, a UAV-based multi-modal dataset, and proposes the dual-modality FiReDiff method, which first generates infrared video sequences and then segments fire masks from the generated dynamics. Result: Applied to generative models, FiReDiff improves video quality (PSNR up 39.2%, SSIM up 36.1%, LPIPS improved by 50.0%, FVD reduced by 29.4%) and mask accuracy (AUPRC up 3.3%, F1 up 59.1%, IoU up 42.9%, MSE reduced by 62.5%). Conclusion: The FireSentry dataset and the FiReDiff paradigm together advance fine-grained wildfire forecasting and dynamic disaster simulation. Abstract: Fine-grained wildfire spread prediction is crucial for enhancing emergency response efficacy and decision-making precision. However, existing research predominantly focuses on coarse spatiotemporal scales and relies on low-resolution satellite data, capturing only macroscopic fire states while fundamentally constraining high-precision localized fire dynamics modeling capabilities. To bridge this gap, we present FireSentry, a provincial-scale multi-modal wildfire dataset characterized by sub-meter spatial and sub-second temporal resolution. Collected using synchronized UAV platforms, FireSentry provides visible and infrared video streams, in-situ environmental measurements, and manually validated fire masks. Building on FireSentry, we establish a comprehensive benchmark encompassing physics-based, data-driven, and generative models, revealing the limitations of existing mask-only approaches. Based on this analysis, we propose FiReDiff, a novel dual-modality paradigm that first predicts future video sequences in the infrared modality, and then precisely segments fire masks in the mask modality based on the generated dynamics. FiReDiff achieves state-of-the-art performance, with video quality gains of 39.2% in PSNR, 36.1% in SSIM, 50.0% in LPIPS, 29.4% in FVD, and mask accuracy gains of 3.3% in AUPRC, 59.1% in F1 score, 42.9% in IoU, and 62.5% in MSE when applied to generative models. The FireSentry benchmark dataset and FiReDiff paradigm collectively advance fine-grained wildfire forecasting and dynamic disaster simulation. The processed benchmark dataset is publicly available at: https://github.com/Munan222/FireSentry-Benchmark-Dataset.

[57] ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding

Lingjun Zhao,Yandong Luo,James Hay,Lu Gan

Main category: cs.CV

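TL;DR: ShelfGaussian is an open-vocabulary, multi-modal, Gaussian-based 3D scene understanding framework that uses off-the-shelf vision foundation models for joint 2D and 3D supervision, improving geometric quality and cross-modal perception and performing strongly on tasks such as zero-shot semantic occupancy prediction.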

Details

Motivation: Existing Gaussian methods for 3D scene modeling are limited either by closed-set semantic labels or by purely 2D self-supervision, which leads to weak rendering ability and degraded geometry and makes it hard to fuse multiple sensor modalities. A multi-modal Gaussian framework is therefore needed that fully exploits vision foundation models, supports open vocabularies, and offers good geometry and rendering. Method: Proposes a Multi-Modal Gaussian Transformer that lets Gaussians query features from several sensor modalities, and a Shelf-Supervised Learning Paradigm that uses vision foundation models to jointly optimize Gaussian parameters at the 2D image and 3D scene levels. Result: Achieves state-of-the-art zero-shot semantic occupancy prediction on Occ3D-nuScenes and is validated for real-world perception and planning on an unmanned ground vehicle (UGV) in urban environments. Conclusion: ShelfGaussian combines the strengths of Gaussian representations with the semantic power of vision foundation models to deliver high-performance, multi-modal, open-vocabulary 3D scene understanding with good practical potential. Abstract: We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, leading to degraded geometry and restricting them to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in-the-wild performance across diverse urban scenarios. Project website: https://lunarlab-gatech.github.io/ShelfGaussian/.

Yujian Zhao,Hankun Liu,Guanglin Niu

Main category: cs.CV

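TL;DR: Proposes the MOS framework to address the modality gap between optical and SAR images, achieving better performance in cross-modal ship re-identification.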

Details

Motivation: The large modality gap between optical and SAR images causes existing methods to perform poorly on cross-modal ship re-identification. Method: Proposes the MOS framework with two core components: Modality-Consistent Representation Learning (MCRL), which aligns features through SAR image denoising and a class-wise modality alignment loss, and Cross-modal Data Generation and Feature fusion (CDGF), which uses a Brownian bridge diffusion model to generate cross-modal samples and fuses them at inference to improve discriminability. Result: On the HOSS ReID dataset, MOS clearly outperforms state-of-the-art methods under all evaluation protocols, improving R1 accuracy by 3.0% (ALL to ALL), 6.2% (Optical to SAR), and 16.4% (SAR to Optical). Conclusion: MOS narrows the optical-SAR modality gap and enables more robust cross-modal ship re-identification, with good application prospects. Abstract: Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to mitigate the optical-SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature fusion (CDGF) leverages a Brownian bridge diffusion model to synthesize cross-modal samples, which are subsequently fused with original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving notable improvements of +3.0%, +6.2%, and +16.4% in R1 accuracy under the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.

[59] ViDiC: Video Difference Captioning

Jiangtao Wu,Shihao Li,Zhaozhou Bian,Yuanxing Zhang,Jialu Chen,Runzhe Wen,An Ping,Yiwen He,Jiakai Wang,Jiaheng Liu

Main category: cs.CV

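TL;DR: Introduces the Video Difference Captioning (ViDiC) task and the ViDiC-1K dataset for evaluating how well multimodal large models describe fine-grained similarities and differences between video pairs, along with a dual-checklist framework for reliable evaluation.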

Details

Motivation: Existing vision-language systems are weak at understanding spatiotemporal changes between dynamic scenes, and static image difference captioning methods cannot capture motion continuity or event evolution, so video differences need dedicated understanding and description. Method: Proposes the ViDiC task and the ViDiC-1K dataset, with 1,000 video pairs and more than 4,000 annotated items across 7 categories, and designs a dual-checklist evaluation framework based on LLM-as-a-Judge that measures the accuracy of similarity and difference descriptions separately. Result: Experiments on 19 multimodal models show a significant performance gap in comparative description and difference perception. Conclusion: ViDiC-1K can serve as a challenging benchmark that advances video understanding, edit awareness, and multimodal comparative reasoning. Abstract: Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.

[60] YOLOA: Real-Time Affordance Detection via LLM Adapter

Yuqi Ji,Junjie Ke,Lihuo He,Jun Liu,Kaifan Zhang,Yu-Kun Lai,Guiguang Ding,Xinbo Gao

Main category: cs.CV

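TL;DR: Proposes YOLO Affordance (YOLOA), a real-time model for joint object detection and affordance learning built on a large language model adapter, which addresses the split between "what", "where", and "how" in earlier methods and balances accuracy and speed on multiple benchmarks.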

Details

Motivation: Existing methods usually separate object detection from affordance learning or ignore the "what" and "where" information, and they lack interaction and real-time capability, which limits their use in embodied AI. Method: Proposes the YOLOA model with detection and affordance learning branches, in which a lightweight large language model adapter dynamically refines class priors, box offsets, and affordance gates during training so that the two tasks improve each other. Result: Reaches 52.8 and 73.1 mAP on ADG-Det and IIT-Heat respectively, with real-time performance of 89.77 FPS and up to 846.24 FPS for the lightweight variant. Conclusion: Through the LLM adapter, YOLOA jointly models object detection and affordance learning, achieves an excellent balance between accuracy and efficiency, and suits real-time embodied AI scenarios. Abstract: Affordance detection aims to jointly address the fundamental "what-where-how" challenge in embodied AI by understanding "what" an object is, "where" the object is located, and "how" it can be used. However, most affordance learning methods focus solely on "how" objects can be used while neglecting the "what" and "where" aspects. Other affordance detection methods treat object detection and affordance learning as two independent tasks, lacking effective interaction and real-time capability. To overcome these limitations, we introduce YOLO Affordance (YOLOA), a real-time affordance detection model that jointly handles these two tasks via a large language model (LLM) adapter. Specifically, YOLOA employs a lightweight detector consisting of object detection and affordance learning branches refined through the LLM Adapter. During training, the LLM Adapter interacts with object and affordance preliminary predictions to refine both branches by generating more accurate class priors, box offsets, and affordance gates. Experiments on our relabeled ADG-Det and IIT-Heat benchmarks demonstrate that YOLOA achieves state-of-the-art accuracy (52.8 / 73.1 mAP on ADG-Det / IIT-Heat) while maintaining real-time performance (up to 89.77 FPS, and up to 846.24 FPS for the lightweight variant). This indicates that YOLOA achieves an excellent trade-off between accuracy and efficiency.

[61] DM3D: Deformable Mamba via Offset-Guided Gaussian Sequencing for Point Cloud Understanding

Bin Liu,Chunyang Wang,Xuelian Liu

Main category: cs.CV

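TL;DR: Proposes DM3D, a deformable Mamba architecture for point cloud understanding whose adaptive serialization mechanism overcomes the dependence of conventional state space models on input order.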

Details

Motivation: Existing point cloud methods usually rely on predefined serialization strategies that cannot adapt to different geometric structures, which limits the use of state space models on point clouds. Method: Introduces an offset-guided Gaussian sequencing mechanism, comprising Gaussian-based KNN Resampling (GKR) and Gaussian-based Differentiable Reordering (GDR), that unifies local resampling and global reordering, and designs a Tri-Path Frequency Fusion module to enhance feature complementarity and reduce aliasing. Result: Experiments on several benchmark datasets show that DM3D reaches state-of-the-art performance in classification, few-shot learning, and part segmentation. Conclusion: Adaptive serialization unlocks the potential of state space models for point cloud understanding, and DM3D provides a new, efficient framework for point cloud modeling. Abstract: State Space Models (SSMs) demonstrate significant potential for long-sequence modeling, but their reliance on input order conflicts with the irregular nature of point clouds. Existing approaches often rely on predefined serialization strategies, which cannot adjust based on diverse geometric structures. To overcome this limitation, we propose DM3D, a deformable Mamba architecture for point cloud understanding. Specifically, DM3D introduces an offset-guided Gaussian sequencing mechanism that unifies local resampling and global reordering within a deformable scan. The Gaussian-based KNN Resampling (GKR) enhances structural awareness by adaptively reorganizing neighboring points, while the Gaussian-based Differentiable Reordering (GDR) enables end-to-end optimization of serialization order. Furthermore, a Tri-Path Frequency Fusion module enhances feature complementarity and reduces aliasing. Together, these components enable structure-adaptive serialization of point clouds. Extensive experiments on benchmark datasets show that DM3D achieves state-of-the-art performance in classification, few-shot learning, and part segmentation, demonstrating that adaptive serialization effectively unlocks the potential of SSMs for point cloud understanding.

[62] Generalization Evaluation of Deep Stereo Matching Methods for UAV-Based Forestry Applications

Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green

Main category: cs.CV

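TL;DR: Presents the first systematic zero-shot evaluation of eight state-of-the-art stereo matching methods for depth estimation in forest environments, reveals how the methods differ across domains, and identifies a baseline model suited to vegetation-dense scenes.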

Details

Motivation: Existing depth estimation methods are evaluated mainly on urban and indoor scenes, and their suitability for vegetation-dense forest environments has not been studied, which limits autonomous UAV operation in forestry. A systematic zero-shot evaluation for this setting is therefore needed. Method: Selects eight advanced stereo matching methods (covering iterative refinement, foundation model, and zero-shot adaptation approaches), trains them only on Scene Flow, and tests them zero-shot on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) and a new Canterbury forestry dataset of 5,313 image pairs captured with a ZED Mini camera at 1920x1080. Result: Foundation models excel on structured scenes (for example, BridgeDepth reaches 0.23 px EPE on ETH3D), while iterative methods are more robust across domains (IGEV++ ranges from 0.36 to 6.77 px). RAFT-Stereo fails catastrophically on ETH3D (26.23 px EPE, 98% error rate) but behaves normally on KITTI. On the forestry dataset, DEFOM shows the best overall performance, with better depth smoothness, occlusion handling, and cross-domain consistency. Conclusion: DEFOM is currently the most suitable gold-standard baseline for vegetation depth estimation, although it preserves slightly less detail than IGEV++. The severe RAFT-Stereo failure is a warning about generalization risks in specific scenes, and future work should emphasize the design and evaluation of depth estimation models for natural environments. Abstract: Autonomous UAV forestry operations require robust depth estimation methods with strong cross-domain generalization. However, existing evaluations focus on urban and indoor scenarios, leaving a critical gap for specialized vegetation-dense environments. We present the first systematic zero-shot evaluation of eight state-of-the-art stereo methods--RAFT-Stereo, IGEV, IGEV++, BridgeDepth, StereoAnywhere, DEFOM (plus baseline methods ACVNet, PSMNet, TCstereo)--spanning iterative refinement, foundation model, and zero-shot adaptation paradigms. All methods are trained exclusively on Scene Flow and evaluated without fine-tuning on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a novel 5,313-pair Canterbury forestry dataset captured with a ZED Mini camera (1920x1080). Performance reveals scene-dependent patterns: foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D, 0.83-1.07 px on KITTI; DEFOM: 0.35-4.65 px across benchmarks), while iterative methods maintain cross-domain robustness (IGEV++: 0.36-6.77 px; IGEV: 0.33-21.91 px). Critical finding: RAFT-Stereo exhibits catastrophic ETH3D failure (26.23 px EPE, 98 percent error rate) due to negative disparity predictions, while performing normally on KITTI (0.90-1.11 px). Qualitative evaluation on the Canterbury forestry dataset identifies DEFOM as the optimal gold-standard baseline for vegetation depth estimation, exhibiting superior depth smoothness, occlusion handling, and cross-domain consistency compared to IGEV++, despite IGEV++'s finer detail preservation.
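The figures in this entry are end-point errors (EPE) in pixels plus an error rate. A minimal sketch of both, assuming EPE is the mean absolute disparity difference over valid pixels and the error rate is the fraction of valid pixels whose error exceeds a threshold. The 1.0 px default threshold, the function name `disparity_metrics`, and the toy values are assumptions for illustration; the abstract does not state which threshold the paper uses.

```python
def disparity_metrics(pred, gt, bad_thresh=1.0):
    """Return (EPE in px, fraction of valid pixels with error > bad_thresh).

    Pixels whose ground-truth disparity is None are treated as invalid
    and skipped, as is common with sparse ground truth.
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    epe = sum(errors) / len(errors)
    bad_rate = sum(e > bad_thresh for e in errors) / len(errors)
    return epe, bad_rate


# Toy scanline (hypothetical values); the third pixel has no ground truth.
gt = [10.0, 12.0, None, 8.0, 20.0]
pred = [10.5, 11.0, 3.0, -4.0, 20.0]  # the negative disparity is a gross error

epe, bad = disparity_metrics(pred, gt)
print(epe, bad)
```

A single negative disparity dominates the mean here, which illustrates how a systematic sign failure like the one reported for RAFT-Stereo on ETH3D can push EPE far above the sub-pixel range.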

[63] Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features

Yuzhen Hu,Biplab Banerjee,Saurabh Prasad

Main category: cs.CV

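TL;DR: Proposes a lightweight, label-efficient hyperspectral image classification framework built on a pretrained diffusion model, which takes low-level spatial features from early denoising timesteps and fuses spectral information through a FiLM module to learn effectively from sparsely annotated data.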

Details

Motivation: Hyperspectral image classification faces low spatial resolution and sparse annotations, and existing methods depend on large amounts of labeled data, which makes them hard to apply in settings such as remote sensing where labeling is costly. Method: Extracts low-level spatial features from the decoder of a frozen diffusion model pretrained on natural images, at early denoising timesteps, and uses a FiLM mechanism to modulate those spatial features adaptively from spectral features, fusing spectral and spatial information. Result: On two recent hyperspectral datasets, the method outperforms state-of-the-art approaches using only the provided sparse training labels, and ablations confirm the value of diffusion feature transfer and spectral-aware fusion. Conclusion: Pretrained diffusion models can serve as general, label-efficient representation tools for remote sensing and broader scientific imaging tasks. Abstract: Hyperspectral imaging (HSI) enables detailed land cover classification, yet low spatial resolution and sparse annotations pose significant challenges. We present a label-efficient framework that leverages spatial features from a frozen diffusion model pretrained on natural images. Our approach extracts low-level representations from high-resolution decoder layers at early denoising timesteps, which transfer effectively to the low-texture structure of HSI. To integrate spectral and spatial information, we introduce a lightweight FiLM-based fusion module that adaptively modulates frozen spatial features using spectral cues, enabling robust multimodal learning under sparse supervision. Experiments on two recent hyperspectral datasets demonstrate that our method outperforms state-of-the-art approaches using only the provided sparse training labels. Ablation studies further highlight the benefits of diffusion-derived features and spectral-aware fusion. Overall, our results indicate that pretrained diffusion models can support domain-agnostic, label-efficient representation learning for remote sensing and broader scientific imaging tasks.
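FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters predicted from a conditioning input. A minimal per-pixel sketch, assuming the conditioning input is the pixel's spectrum and the scale and shift come from single linear layers. The layer shapes, the weights, and the names `linear` and `film` are hypothetical; the abstract does not describe the actual fusion module at this level of detail.

```python
def linear(weights, bias, x):
    """Dense layer: weights is C x B, x has length B, output has length C."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(spatial_feat, spectrum, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: out[c] = gamma[c] * feat[c] + beta[c]."""
    gamma = linear(w_gamma, b_gamma, spectrum)  # per-channel scale
    beta = linear(w_beta, b_beta, spectrum)     # per-channel shift
    return [g * f + b for g, f, b in zip(gamma, spatial_feat, beta)]


# Toy pixel: 2 frozen spatial channels, 3 spectral bands (hypothetical numbers).
feat = [2.0, -1.0]
spectrum = [0.1, 0.2, 0.3]
w_gamma = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
b_gamma = [1.0, 0.0]   # channel 0 is left as an identity scaling
w_beta = [[0.0, 0.0, 0.0], [0.0, 0.0, 10.0]]
b_beta = [0.0, 0.0]

out = film(feat, spectrum, w_gamma, b_gamma, w_beta, b_beta)
print(out)
```

Because only the small scale and shift layers are trained while the spatial features stay frozen, this kind of fusion adds few parameters, which fits the label-efficient setting the entry describes.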

[64] Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

Xieji Li,Siyuan Yan,Yingsheng Liu,H. Peter Soyer,Monika Janda,Victoria Mar,Zongyuan Ge

Main category: cs.CV

TL;DR: 提出了一种新的视觉-语言预训练框架MAGEN和O-MAKE，用于提升医学图像分析中噪声数据和长文本处理的效果，在皮肤病学领域实现了最先进的零样本性能。

Details

Motivation: 现有视觉-语言预训练方法在处理网络采集的 noisy 数据和复杂的非结构化长医学文本时存在困难。 Method: 设计了一个多智能体数据生成系统（MAGEN）来合成知识丰富的描述，并通过基于本体的多方面知识增强（O-MAKE）预训练方法分解长文本，实现全局和局部的细粒度对齐。 Result: 在八个数据集上验证了该框架的有效性，在疾病分类和跨模态检索任务中达到了最先进水平的零样本性能。 Conclusion: 所提出的MAGEN和O-MAKE框架显著提升了医学视觉-语言预训练的效果，尤其适用于处理噪声数据和复杂文本，具有良好的应用前景。 Abstract: Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.

[65] LM-CartSeg: Automated Segmentation of Lateral and Medial Cartilage and Subchondral Bone for Radiomics Analysis

Tongxu Zhang

Main category: cs.CV

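TL;DR: LM-CartSeg is a fully automatic pipeline for knee MRI cartilage/bone segmentation, lateral/medial compartmentalisation, and radiomics analysis that combines nnU-Net models with geometric post-processing rules to produce stable, high-quality ROIs and meaningful radiomic features across datasets.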

Details

Motivation: Existing knee MRI radiomics studies depend on manually drawn ROIs without quality control, which makes it hard to extract anatomically meaningful regions covering both cartilage and subchondral bone and limits multi-centre use. Method: Trains two 3D nnU-Net models on SKM-TEA and OAIZIB-CM respectively, fuses their zero-shot predictions at test time, and refines the result with geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial lateral/medial split based on PCA and k-means. It extracts 4,650 non-shape radiomic features from 10 ROIs and uses volume and thickness measures for quality control. Result: On the OAIZIB-CM test set, post-processing lowers macro ASSD from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm, with DSC 0.91; zero-shot DSC on SKI-10 is 0.80. The geometric L/M rule is stable across datasets, whereas a direct nnU-Net L/M segmentation confuses sides. Only 6%-12% of features per ROI correlate strongly with volume or thickness, indicating that most radiomic features carry discriminative information beyond size. Conclusion: LM-CartSeg automatically produces quality-controlled ROIs and discriminative radiomic features that go beyond simple morphometry, providing a practical basis for multi-centre knee osteoarthritis radiomics studies. Abstract: Background and Objective: Radiomics of knee MRI requires robust, anatomically meaningful regions of interest (ROIs) that jointly capture cartilage and subchondral bone. Most existing work relies on manual ROIs and rarely reports quality control (QC). We present LM-CartSeg, a fully automatic pipeline for cartilage/bone segmentation, geometric lateral/medial (L/M) compartmentalisation and radiomics analysis. Methods: Two 3D nnU-Net models were trained on SKM-TEA (138 knees) and OAIZIB-CM (404 knees). At test time, zero-shot predictions were fused and refined by simple geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial L/M split based on PCA and k-means. Segmentation was evaluated on an OAIZIB-CM test set (103 knees) and on SKI-10 (100 knees). QC used volume and thickness signatures. From 10 ROIs we extracted 4,650 non-shape radiomic features to study inter-compartment similarity, dependence on ROI size, and OA vs. non-OA classification on OAIZIB-CM. Results: Post-processing improved macro ASSD on OAIZIB-CM from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm, with DSC 0.91; zero-shot DSC on SKI-10 was 0.80. The geometric L/M rule produced stable compartments across datasets, whereas a direct L/M nnU-Net showed domain-dependent side swaps. Only 6 to 12 percent of features per ROI were strongly correlated with volume or thickness.
Radiomics-based models outperformed models restricted to size-linked features. Conclusions: LM-CartSeg yields automatic, quality-controlled ROIs and radiomic features that carry discriminative information beyond simple morphometry, providing a practical foundation for multi-centre knee OA radiomics studies.
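The "data-driven tibial L/M split based on PCA and k-means" can be illustrated on 2D point coordinates. A minimal sketch, assuming the split works by projecting voxel positions onto their first principal axis and clustering the projections into two groups. This is one plausible reading of the abstract, not the authors' implementation; the function names and toy points are hypothetical, and deciding which cluster is lateral and which is medial needs laterality information that is not modeled here.

```python
import math


def principal_axis(points):
    """Centroid and unit vector of largest variance for 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def two_means_1d(values, iters=50):
    """Plain k-means with k=2 on scalars, initialised at min and max."""
    c0, c1 = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        new0 = sum(g0) / len(g0) if g0 else c0
        new1 = sum(g1) / len(g1) if g1 else c1
        if (new0, new1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new0, new1
    return labels


def split_compartments(points):
    """Label each point 0 or 1 by clustering its principal-axis projection."""
    (mx, my), (ax, ay) = principal_axis(points)
    proj = [(x - mx) * ax + (y - my) * ay for x, y in points]
    return two_means_1d(proj)


# Two toy plateaus separated along x (hypothetical coordinates in mm).
pts = [(0.0, 0.0), (1.0, 0.2), (0.5, -0.1),
       (10.0, 0.1), (11.0, -0.2), (10.5, 0.0)]
labels = split_compartments(pts)
print(labels)
```

A geometric rule like this depends only on the shape of the segmented region, which is consistent with the entry's report that it stayed stable across datasets where a learned L/M segmentation swapped sides.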

[66] KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion Models

Rhys Newbury,Juyan Zhang,Tin Tran,Hanna Kurniawati,Dana Kulić

Main category: cs.CV

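TL;DR: Proposes an unsupervised framework that learns structured 3D keypoints from point cloud data and uses them to condition a diffusion model for shape reconstruction, improving keypoint consistency by 6 percentage points over prior methods.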

Details

Motivation: Existing unsupervised keypoint methods are not suited to unconditional generative settings, which restricts their use in modern 3D generative pipelines; this work aims to close that gap. Method: Designs an unsupervised framework that learns spatially structured 3D keypoints from point cloud data and uses them as a compact, interpretable representation to condition an Elucidated Diffusion Model (EDM) that reconstructs the full shape. Result: The learned keypoints show repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. The method performs well across several object categories and improves keypoint consistency by 6 percentage points over prior methods. Conclusion: The method achieves unsupervised 3D keypoint learning with a well-structured representation that suits 3D generation tasks. Abstract: Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.

[67] GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers

Zhiye Song,Steve Dai,Ben Keller,Brucek Khailany

Main category: cs.CV

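TL;DR: GalaxyDiT proposes a training-free method that accelerates video generation through guidance alignment and systematic proxy selection, substantially increasing inference speed while keeping quality high.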

Details

Motivation: Transformer-based diffusion models (DiTs) combined with classifier-free guidance (CFG) perform very well but are computationally expensive, and CFG in particular doubles the compute, which limits broad deployment. Method: Proposes GalaxyDiT, which uses guidance alignment to cut redundant computation and rank-order correlation analysis to choose the best proxy for video models of different sizes, enabling computational reuse without any additional training. Result: Achieves 1.87x and 2.37x speedups on Wan2.1-1.3B and Wan2.1-14B, with VBench-2.0 scores dropping by only 0.97% and 0.72%; at high speedup rates, PSNR exceeds prior methods by 5 to 10 dB. Conclusion: GalaxyDiT offers an efficient, plug-and-play acceleration scheme for large video diffusion models that greatly improves generation efficiency with minimal quality loss and supports practical deployment. Abstract: Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
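The statement that CFG doubles the required compute follows from the guidance formula itself, which combines a conditional and an unconditional prediction at every denoising step. A minimal sketch of standard classifier-free guidance with a toy stand-in model that counts forward passes; `CountingDenoiser`, the step count, and the guidance scale are illustrative assumptions, and the loop omits the sampler update. GalaxyDiT's own reuse strategy is not shown.

```python
class CountingDenoiser:
    """Stand-in for a diffusion transformer; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, prompt):
        self.calls += 1
        # Toy prediction: the conditional branch shifts the output by 1.
        shift = 0.0 if prompt is None else 1.0
        return [0.5 * v + shift for v in x]


def cfg_step(model, x, prompt, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    cond = model(x, prompt)   # forward pass 1: with the prompt
    uncond = model(x, None)   # forward pass 2: without the prompt
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


model = CountingDenoiser()
x = [2.0, 4.0]
steps = 50
for _ in range(steps):
    # A real sampler would also update x from the guided prediction.
    guided = cfg_step(model, x, "a prompt", scale=5.0)

print(model.calls)  # two forward passes per denoising step
```

Any scheme that can skip or reuse one of the two passes on some steps therefore removes a large share of the total cost, which is the opportunity the entry describes.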

[68] GeoVideo: Introducing Geometric Regularization into Video Generation Model

Yunpeng Bai,Shaoheng Fang,Chaohui Yu,Fan Wang,Qixing Huang

Main category: cs.CV

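TL;DR: This paper improves geometric consistency in video generation by adding per-frame depth prediction to latent diffusion models and using a multi-view geometric loss to align depth maps in a shared 3D coordinate system, which improves spatio-temporal coherence and structural plausibility.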

Details

Motivation: Most existing video generation methods operate in 2D pixel space and do not model 3D structure explicitly, which leads to temporally inconsistent geometry, implausible motion, and structural artifacts. Method: Add a depth prediction branch to a latent diffusion model and design a multi-view geometric loss that aligns the depth maps predicted across frames in a shared 3D coordinate system to strengthen structural consistency. Result: Experiments on multiple datasets show that the method generates more stable and more geometrically consistent videos than existing baselines. Conclusion: Depth-based geometric regularization bridges appearance generation and 3D structure modeling and clearly improves the physical plausibility and spatio-temporal consistency of generated videos. Abstract: Recent advances in video generation have enabled the synthesis of high-quality and visually realistic clips using diffusion transformer models. However, most existing approaches operate purely in the 2D pixel space and lack explicit mechanisms for modeling 3D structures, often resulting in temporally inconsistent geometries, implausible motions, and structural artifacts. In this work, we introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. We adopt depth as the geometric representation because of the great progress in depth prediction and its compatibility with image-based latent encoders. Specifically, to enforce structural consistency over time, we propose a multi-view geometric loss that aligns the predicted depth maps across frames within a shared 3D coordinate system. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved spatio-temporal coherence, shape consistency, and physical plausibility. Experiments across multiple datasets show that our approach produces significantly more stable and geometrically consistent results than existing baselines.

[69] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Haicheng Liao,Huanming Shen,Bonan Wang,Yongkang Li,Yihong Tang,Chengyue Wang,Dingyi Zhuang,Kehua Chen,Hai Yang,Chengzhong Xu,Zhenning Li

Main category: cs.CV

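TL;DR: ThinkDeeper is a visual grounding framework for autonomous driving that resolves ambiguity in natural-language commands with a Spatial-Aware World Model (SA-WM) capable of forward-looking reasoning and a hypergraph-guided decoder, achieving leading results on multiple benchmarks.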

Details

Motivation: Existing visual grounding methods handle ambiguous, context-dependent natural-language instructions poorly because they cannot model 3D spatial relations or scene evolution. Method: Propose the ThinkDeeper framework, which contains a Spatial-Aware World Model (SA-WM) that encodes the current scene into a command-aware latent state and rolls out a sequence of future states to provide forward-looking cues, together with a hypergraph-guided decoder that hierarchically fuses the multimodal input with the future states to capture higher-order spatial dependencies. Result: In evaluations on six benchmarks, ThinkDeeper ranks first on the Talk2Car leaderboard and outperforms state-of-the-art methods on DrivePilot, MoCAD, and RefCOCO/+/g, with strong robustness and efficiency in long-text, multi-agent, and ambiguous scenes. Conclusion: By introducing world-model-based, forward-looking spatial reasoning, ThinkDeeper improves the accuracy and robustness of visual grounding in driving scenes while remaining data-efficient. Abstract: Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

[70] Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

Shojiro Yamabe,Futa Waseda,Daiki Shiono,Tsubasa Takahashi

Main category: cs.CV

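TL;DR: This paper studies text-centric training of large vision-language models (LVLMs), where only textual descriptions and no real images are available, and proposes Text-Printed Image (TPI), which renders text onto a white canvas to create synthetic images; this bridges the image-text modality gap at low cost and is validated across multiple models and benchmarks.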

Details

Motivation: Collecting large numbers of annotated image-text pairs is costly and constrained by privacy, whereas text is easier to obtain and edit, so low-cost data scaling methods that train LVLMs from text alone are worth exploring. Method: Propose Text-Printed Image (TPI), which renders the textual description directly as a text image on a plain white background and thereby projects text into the image modality; the method integrates into existing LVLM training pipelines and can use LLMs to diversify and expand the text automatically. Result: Experiments on four models and seven benchmarks show that TPI improves text-centric training more effectively than synthetic images generated by a diffusion model, and confirm its usefulness as a low-cost data-augmentation strategy. Conclusion: TPI provides an efficient, low-cost solution for text-centric training and points toward fully automated data generation for LVLMs without real images. Abstract: Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. 
Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.

[71] Difference Decomposition Networks for Infrared Small Target Detection

Chen Hu,Mingyu Zhou,Shuai Yuan,Hongbo Hu,Xiangyu Qiu,Junhai Luo,Tian Pu,Xiyin Li

Main category: cs.CV

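TL;DR: This paper proposes BDM, a lightweight module based on basis decomposition for infrared small target detection (ISTD), and builds the SD²Net and STD²Net networks from it; both perform strongly on single-frame and multi-frame ISTD, with the multi-frame network clearly ahead of existing methods.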

Details

Motivation: Infrared small target detection suffers from indistinct target texture and severe background clutter, so targets are easily obscured by the background; targets must be enhanced and backgrounds suppressed. Method: Propose the Basis Decomposition Module (BDM), which decomposes a complex feature into several basis features to enhance key information and remove redundancy; extend it into the SD²M, SD³M, and TD²M modules, build SD²Net for single-frame detection, and add the motion-aware TD²M module to obtain STD²Net for multi-frame detection. Result: Experiments on SISTD and MISTD datasets show state-of-the-art results; SD²Net performs well on the single-frame task, while STD²Net reaches an mIoU of 87.68% on the multi-frame task, well above SD²Net's 64.97%. Conclusion: Basis-decomposition modules improve infrared small target detection, and adding temporal information makes the network considerably more robust to complex backgrounds. Abstract: Infrared small target detection (ISTD) faces two major challenges: a lack of discernible target texture and severe background clutter, which results in the background obscuring the target. To enhance targets and suppress backgrounds, we propose the Basis Decomposition Module (BDM) as an extensible and lightweight module based on basis decomposition, which decomposes a complex feature into several basis features and enhances certain information while eliminating redundancy. Extending BDM leads to a series of modules, including the Spatial Difference Decomposition Module (SD$^\mathrm{2}$M), Spatial Difference Decomposition Downsampling Module (SD$^\mathrm{3}$M), and Temporal Difference Decomposition Module (TD$^\mathrm{2}$M). Based on these modules, we develop the Spatial Difference Decomposition Network (SD$^\mathrm{2}$Net) for single-frame ISTD (SISTD) and the Spatiotemporal Difference Decomposition Network (STD$^\mathrm{2}$Net) for multi-frame ISTD (MISTD). SD$^\mathrm{2}$Net integrates SD$^\mathrm{2}$M and SD$^\mathrm{3}$M within an adapted U-shaped architecture. We employ TD$^\mathrm{2}$M to introduce motion information, which transforms SD$^\mathrm{2}$Net into STD$^\mathrm{2}$Net. Extensive experiments on SISTD and MISTD datasets demonstrate state-of-the-art (SOTA) performance. On the SISTD task, SD$^\mathrm{2}$Net performs well compared to most established networks. On the MISTD datasets, STD$^\mathrm{2}$Net achieves an mIoU of 87.68%, outperforming SD$^\mathrm{2}$Net, which achieves an mIoU of 64.97%. Our code is available at https://github.com/greekinRoma/IRSTD_HC_Platform.

[72] Procedural Mistake Detection via Action Effect Modeling

Wenliang Guo,Yujiang Pu,Yu Kong

Main category: cs.CV

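TL;DR: This paper proposes Action Effect Modeling (AEM), a unified framework that detects mistakes in procedural tasks by jointly modeling action execution and its outcome, with particular attention to the action effect that existing methods overlook. It combines visual and symbolic information into robust effect-aware representations in a shared latent space and adds a prompt-based detector for mistake detection, reaching state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks.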

Details

Motivation: Existing mistake detection methods focus on how an action is performed and ignore what it produces (the action effect), yet many mistakes show up in the outcome, such as a wrong object state or spatial relation. A method that models both execution and outcome is therefore needed to improve detection accuracy. Method: Propose the Action Effect Modeling (AEM) framework: first select the most informative effect frame, based on semantic relevance and visual quality, to identify the action outcome; then extract complementary cues from visual grounding and symbolic scene graphs and align them in a shared latent space to form robust effect-aware representations; finally, design a prompt-based detector that uses task-specific prompts to align each action segment with its intended semantics and detect mistakes. Result: Achieves state-of-the-art performance on the EgoPER and CaptainCook4D datasets under the one-class classification (OCC) setting, supporting the value of modeling both execution and outcome. Conclusion: Jointly modeling action execution and action effect makes mistake detection in procedural tasks more reliable, and effect-aware representations may benefit a wider range of applications. Abstract: Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.

[73] Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis

Zijian Gu,Yuxi Liu,Zhenhao Zhang,Song Wang

Main category: cs.CV

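TL;DR: Proposes fairness-aware low-rank adaptation methods (FR-LoRA, GR-LoRA, Hybrid-LoRA) that use a MaxAccGap loss and gradient reweighting to improve the diagnostic fairness of medical vision-language models across demographic groups, substantially reducing accuracy disparities with only 0.24% trainable parameters.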

Details

Motivation: Existing medical vision-language models show significant differences in diagnostic accuracy across demographic groups, so fairness must be improved while performance is preserved. Method: Propose three methods based on low-rank adaptation: FR-LoRA adds MaxAccGap loss as a fairness regularizer; GR-LoRA applies inverse frequency weighting to balance gradient contributions; Hybrid-LoRA combines both. A differentiable MaxAccGap loss enables end-to-end optimization of accuracy parity. Result: On 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% with an overall accuracy of 53.15%; strong regularization achieves the best fairness at minimal accuracy cost; race-specific optimization reduces disparity by 60%; the approach needs only 0.24% trainable parameters. Conclusion: The proposed methods improve cross-group diagnostic fairness of medical VLMs at very low parameter cost and are practical for deployment in resource-constrained healthcare settings. Abstract: Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
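
The two fairness ingredients named in this abstract can be sketched in a few lines. The abstract gives no formulas, so this is a minimal illustration under stated assumptions: `inverse_frequency_weights` uses the common normalization n / (k * count), and `max_accuracy_gap` is the plain, non-differentiable evaluation metric (best-group accuracy minus worst-group accuracy), not the authors' differentiable MaxAccGap loss. Both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each group by n / (k * count) so rare groups count more.

    With this (assumed) normalization, the weights average to 1 over samples.
    """
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return {g: n / (k * c) for g, c in counts.items()}


def max_accuracy_gap(y_true: Sequence[int],
                     y_pred: Sequence[int],
                     groups: Sequence[Hashable]) -> float:
    """Largest difference in accuracy between any two demographic groups."""
    correct: Counter = Counter()
    total: Counter = Counter()
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    accuracies = [correct[g] / total[g] for g in total]
    return max(accuracies) - min(accuracies)
```

A gap of 0 means every group is diagnosed with the same accuracy; a training-time version would need a smooth surrogate for the per-sample correctness indicator, which the abstract does not specify.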

[74] Towards Object-centric Understanding for Instructional Videos

Wenliang Guo,Yu Kong

Main category: cs.CV

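TL;DR: This paper proposes an object-centric paradigm for understanding procedural activities, introducing the Object-IVQA benchmark and an agent framework that integrates planning, perception, analysis, and generation tools, and improves reasoning along four dimensions such as object state evolution in complex real-world tasks.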

Details

Motivation: Existing action-centric methods cope poorly with the flexibility of real procedures, where step order changes with object states, so a shift to object-centric reasoning is needed. Method: Introduce Object-IVQA, a long-form video benchmark with 107 videos and 514 open-ended questions, and design an agent framework that integrates object-centric planning, perception, analysis, and generation tools, supporting explicit evidence retrieval and multi-hop reasoning across segments. Result: Experiments show that existing large vision-language models perform poorly at object-level recognition and reasoning, while the proposed framework achieves a substantial improvement on Object-IVQA. Conclusion: The object-centric paradigm is better suited to understanding complex procedural activities, and the proposed framework offers a workable path toward future assistive AI. Abstract: Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning, including state evolution, precondition verification, counterfactual reasoning and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle in object-level recognition and reasoning, whereas our framework achieves substantial improvement.

[75] NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Models with Searchable Adaptation

Renqi Chen,Haoyang Su,Shixiang Tang

Main category: cs.CV

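TL;DR: Proposes NAS-LoRA, a low-rank adaptation method combined with neural architecture search that improves SAM's adaptation to specific domains (such as medical and agricultural imaging) by introducing prior knowledge into weight updates, improving semantic segmentation while lowering training cost.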

Details

Motivation: SAM's Transformer encoder lacks spatial priors within image patches, which makes high-level semantic information hard to acquire and limits adaptation to specialized domains. Existing LoRA methods do not introduce inductive bias effectively. Method: Propose NAS-LoRA, which inserts a lightweight Neural Architecture Search (NAS) block between LoRA's encoder and decoder to dynamically optimize the prior knowledge integrated into weight updates, and design a stage-wise optimization strategy that balances weight updates and architectural adjustments in the ViT encoder. Result: Experiments show that NAS-LoRA outperforms existing PEFT methods and reduces training cost by 24.14% without increasing inference cost. Conclusion: By introducing learnable architectural priors, NAS-LoRA narrows the semantic gap between pre-trained SAM and specialized domains, supporting the potential of combining NAS with PEFT to improve the adaptability of visual foundation models. Abstract: The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhance SAM's adaptation performance on diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Various experiments demonstrate that NAS-LoRA improves on existing PEFT methods while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.

[76] EEA: Exploration-Exploitation Agent for Long Video Understanding

Te Yang,Xiangyu Zhu,Bo Wang,Quan Chen,Peng Jiang,Zhen Lei

Main category: cs.CV

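TL;DR: This paper proposes EEA, a new video agent framework that balances exploration and exploitation through semantic guidance and hierarchical tree search, improving the efficiency and information coverage of long video understanding.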

Details

Motivation: Existing long video understanding methods either incur heavy computational overhead or balance exploration and exploitation poorly, which leads to incomplete information coverage and inefficiency; this paper addresses that challenge. Method: EEA autonomously discovers and dynamically updates task-relevant semantic queries and selects matching video frames as semantic anchors; during tree search it prioritizes semantically relevant frames while still covering unknown regions, and it combines intrinsic rewards from vision-language models with semantic priors, modeling uncertainty explicitly to evaluate segments stably and precisely. Result: Experiments on multiple long-video benchmarks show that EEA outperforms existing methods in both performance and computational efficiency. Conclusion: With a semantically guided hierarchical search strategy, EEA locates key information in long videos efficiently and with broad coverage. Abstract: Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves exploration-exploitation balance through semantic guidance with a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty to achieve stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.

[77] Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation

Seogkyu Jeon,Kibeom Hong,Hyeran Byun

Main category: cs.CV

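TL;DR: This paper proposes DPMFormer, a new framework for domain generalized semantic segmentation that addresses visual-textual semantic misalignment through domain-aware prompt learning, contrastive learning, and consistency learning, and achieves state-of-the-art results on multiple benchmarks.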

Details

Motivation: Existing domain generalized semantic segmentation methods based on vision-language models overlook the semantic misalignment between visual and textual contexts; because they rely on a fixed prompt learned on a single source domain, their generalization is limited. Method: Propose DPMFormer, which uses domain-aware prompt learning to align visual and textual semantics, domain-aware contrastive learning with texture perturbation to simulate features from multiple domains, and domain-robust consistency learning to reduce the discrepancy between predictions on original and augmented images. Result: Achieves state-of-the-art performance on multiple DGSS benchmarks, supporting the effectiveness and robustness of the method. Conclusion: Through dynamic prompts and several complementary learning strategies, DPMFormer improves the cross-domain generalization of semantic segmentation. Abstract: Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). Firstly, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. To capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize discrepancies between predictions on the original and the augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.

[78] AfroBeats Dance Movement Analysis Using Computer Vision: A Proof-of-Concept Framework Combining YOLO and Segment Anything Model

Kwaku Opoku-Ware,Gideon Opoku

Main category: cs.CV

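TL;DR: This study proposes a dance movement analysis framework that combines YOLOv8/v11 with SAM to track and quantify dancer movement without markers or special equipment; preliminary validation on an AfroBeats dance video shows high detection precision and good segmentation, and the framework extracts quantitative indicators such as step counts and space usage.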

Details

Motivation: Traditional dance analysis relies on manual observation or marker-based motion capture, which is expensive and hard to quantify. This paper explores an automated, low-cost, computer-vision-based way to quantify dance movement. Method: Use YOLOv8 and YOLOv11 for dancer detection and the Segment Anything Model (SAM) for pixel-level segmentation; on top of this, track dancer positions, identify step rhythm, and compute spatial coverage and motion intensity, forming a quantification pipeline that needs no specialized equipment. Result: On a single 49-second Ghanaian AfroBeats dance video, the system reaches about 94% detection precision and 89% recall on manually inspected samples; SAM segmentation reaches an IoU of about 83% against visual inspection; the dancer classified as primary performed 23% more steps, with 37% higher motion intensity and 42% more performance space used, than dancers classified as secondary. Conclusion: The framework shows that marker-free automated dance analysis is technically feasible and can extract meaningful quantitative features, but it needs larger datasets and more rigorous validation. Abstract: This paper presents a preliminary investigation into automated dance movement analysis using contemporary computer vision techniques. We propose a proof-of-concept framework that integrates YOLOv8 and v11 for dancer detection with the Segment Anything Model (SAM) for precise segmentation, enabling the tracking and quantification of dancer movements in video recordings without specialized equipment or markers. Our approach identifies dancers within video frames, counts discrete dance steps, calculates spatial coverage patterns, and measures rhythm consistency across performance sequences. Testing this framework on a single 49-second recording of Ghanaian AfroBeats dance demonstrates technical feasibility, with the system achieving approximately 94% detection precision and 89% recall on manually inspected samples. The pixel-level segmentation provided by SAM, achieving approximately 83% intersection-over-union with visual inspection, enables motion quantification that captures body configuration changes beyond what bounding-box approaches can represent. Analysis of this preliminary case study indicates that the dancer classified as primary by our system executed 23% more steps with 37% higher motion intensity and utilized 42% more performance space compared to dancers classified as secondary. However, this work represents an early-stage investigation with substantial limitations including single-video validation, absence of systematic ground truth annotations, and lack of comparison with existing pose estimation methods. 
We present this framework to demonstrate technical feasibility, identify promising directions for quantitative dance metrics, and establish a foundation for future systematic validation studies.
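
The three evaluation quantities quoted in this abstract (detection precision, recall, and mask intersection-over-union) follow standard definitions, sketched below. The function names and the representation of a mask as a set of (row, column) pixel coordinates are assumptions made for illustration; the abstract does not describe the authors' evaluation code.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Detection precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def mask_iou(predicted: Set[Pixel], reference: Set[Pixel]) -> float:
    """Intersection-over-union of two masks given as sets of pixel coordinates."""
    union = predicted | reference
    if not union:
        return 1.0  # two empty masks agree trivially
    return len(predicted & reference) / len(union)
```

Precision penalizes spurious detections while recall penalizes missed dancers, so reporting both (as the abstract does) is more informative than a single accuracy figure.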

[79] CSMapping: Scalable Crowdsourced Semantic Mapping and Topology Inference for Autonomous Driving

Zhijian Qiao,Zehuan Yu,Tong Li,Chih-Chung Chou,Wenchao Ding,Shaojie Shen

Main category: cs.CV

TL;DR: 提出CSMapping系统，利用潜在扩散模型和鲁棒优化方法，从噪声众包数据中生成高质量语义地图和拓扑中心线，性能随数据量增加而持续提升。

Details

Motivation: Noise from low-cost sensors limits the quality of crowdsourced autonomous driving maps; even as data volume grows, map accuracy and structural plausibility are hard to guarantee. Method: Propose CSMapping: 1) train a latent diffusion model on HD maps to learn a generative prior of real map structure, and incorporate the prior through constrained MAP optimization without paired supervision; 2) initialize with a vectorized mapping module followed by diffusion inversion; 3) optimize efficiently with Gaussian-basis reparameterization, projected gradient descent, and a latent-space factor graph; 4) apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories to generate topological centerlines. Result: Achieves state-of-the-art semantic and topological mapping performance on nuScenes, Argoverse 2, and a large proprietary dataset, with good scalability and robustness and consistent quality gains from small to large data volumes. Conclusion: CSMapping overcomes the noise of low-quality sensors and lets crowdsourced map quality keep improving with data volume, offering a feasible route to large-scale HD map construction. Abstract: Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent with multi-start, and latent-space factor-graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.
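
To make the clustering step concrete, here is a minimal sketch of confidence-weighted k-medoids on 2D points. It is an illustration under stated assumptions, not CSMapping's algorithm: the abstract clusters whole trajectories and follows with kinematic refinement, neither of which is reproduced here, and the function name, the Euclidean distance, and the initialization from the k most confident points are all hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def weighted_k_medoids(points: Sequence[Point],
                       weights: Sequence[float],
                       k: int,
                       max_iters: int = 20) -> Tuple[List[int], List[int]]:
    """Return (medoid indices, per-point medoid index).

    Each cluster's medoid is the member minimizing the confidence-weighted
    sum of distances to the other members, so confident samples pull harder.
    """
    # Assumed initialization: start from the k most confident points.
    medoids = sorted(range(len(points)), key=lambda i: -weights[i])[:k]
    labels: List[int] = []
    for _ in range(max_iters):
        # Assignment step: attach every point to its nearest medoid.
        labels = [min(medoids, key=lambda m: math.dist(p, points[m]))
                  for p in points]
        # Update step: re-pick the medoid inside each cluster.
        new_medoids = []
        for m in medoids:
            members = [i for i, lab in enumerate(labels) if lab == m]
            if not members:  # can happen when two medoids coincide
                new_medoids.append(m)
                continue
            best = min(members, key=lambda c: sum(
                weights[j] * math.dist(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return medoids, labels
```

Unlike k-means, a medoid is always one of the observed samples, which keeps each cluster representative on a path that some vehicle actually drove rather than on an average that may cut across lanes.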

[80] FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

Yiyi Cai,Yuhan Wu,Kunhang Li,You Zhou,Bo Zheng,Haiyang Liu

Main category: cs.CV

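TL;DR: FloodDiffusion is a new framework for text-driven, streaming human motion generation that uses a tailored diffusion forcing method to produce low-latency, continuous, text-aligned motion sequences.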

Details

Motivation: Existing methods for continuous motion generation under time-varying text control suffer from high latency or incoherent output, so a method that generates high-quality motion sequences in real time is needed. Method: Propose the FloodDiffusion framework with three key changes, namely bi-directional attention, a lower triangular time scheduler, and continuous injection of text conditioning, so that diffusion forcing can model real motion distributions. Result: Reaches an FID of 0.057 on the HumanML3D benchmark and shows for the first time that a diffusion-forcing-based framework can achieve state-of-the-art performance on streaming motion generation. Conclusion: With these targeted changes, FloodDiffusion markedly improves diffusion forcing for streaming, text-driven human motion generation, offering an efficient solution for real-time human-computer interaction. Abstract: We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that to guarantee modeling the output distribution, the vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) utilize a continuous time-varying way to introduce text conditioning. With these improvements, we demonstrate for the first time that the diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available. https://shandaai.github.io/FloodDiffusion/

[81] OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Zhishan Zhou,Siyuan Wei,Zengran Wang,Chunjie Wang,Xiaosheng Yan,Xiao Liu

Main category: cs.CV

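TL;DR: This paper proposes OpenTrack3D, an open-vocabulary 3D instance segmentation framework that needs no predefined proposals or meshes; a visual-spatial tracker builds cross-view consistent object proposals online, and a multi-modal large language model strengthens textual reasoning for complex queries, yielding strong performance and generalization across diverse real scenes.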

Details

Motivation: Existing methods generalize poorly in mesh-free environments and struggle with compositional and functional user queries, mainly because of dataset-specific proposal generation and the weak textual reasoning of CLIP classifiers. Method: Generate masks with a 2D open-vocabulary segmenter and lift them to 3D point clouds using depth; extract instance features from DINO feature maps; design a visual-spatial tracker that fuses visual and spatial cues to maintain instance consistency; replace CLIP with a multi-modal large language model for stronger textual reasoning, with an optional superpoint module to refine results. Result: Experiments on ScanNet200, Replica, ScanNet++, and SceneFun3D show state-of-the-art accuracy and generalization. Conclusion: OpenTrack3D enables efficient open-vocabulary 3D instance segmentation in mesh-free, diverse environments, and its online proposal generation and stronger textual reasoning increase its practical potential. Abstract: Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. 
Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.

[82] CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding

Huy Quang Ung,Guillaume Habault,Yasutaka Nishimura,Hao Niu,Roberto Legaspi,Tomoki Oya,Ryoichi Kojima,Masato Taya,Chihiro Ono,Atsunori Minamikawa,Yan Liu

Main category: cs.CV

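TL;DR: This paper introduces CartoMapQA, a benchmark for evaluating how well vision-language models (LVLMs) understand maps, and reveals the limits of current models in map semantics, geospatial reasoning, and OCR errors.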

Details

Motivation: Despite progress in vision-language models, their ability to understand maps, a specific type of image, has not been adequately explored, and a dedicated benchmark is needed. Method: Build a dataset of more than 2000 samples covering map question-answering tasks of varying difficulty, and systematically evaluate open-source and proprietary LVLMs. Result: Experiments show that current LVLMs have clear weaknesses in symbol recognition, information extraction, scale interpretation, and route reasoning, and are strongly affected by OCR errors. Conclusion: CartoMapQA is an effective tool for improving the map understanding of LVLMs and supports applications such as navigation, geographic search, and urban planning. Abstract: The rise of Large Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git

[83] Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation

Subin Kim,Sangwoo Mo,Mamshad Nayeem Rizve,Yiran Xu,Difan Liu,Jinwoo Shin,Tobias Hinz

Main category: cs.CV

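TL;DR: Proposes PRIS, a framework that redesigns the prompt dynamically at inference time with the help of a fine-grained verifier, so that generated visuals align precisely with user intent.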

Details

Motivation: Existing text-to-visual generation methods hit a performance plateau when scaled because the prompt is kept fixed, which makes continued quality gains difficult. Method: Propose Prompt Redesign for Inference-time Scaling (PRIS), which revises the prompt during inference based on the generated results, and introduce an element-level factual correction verifier that evaluates, at a fine-grained level, how well prompt attributes match the generated content. Result: Clear gains on text-to-image and text-to-video tasks, including a 15% improvement on VBench 2.0, supporting the value of jointly scaling prompts and visual generation. Conclusion: Jointly scaling prompts and visual generation at inference time makes better use of scaling laws and aligns output more precisely with user intent. Abstract: Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.

[84] Optical Context Compression Is Just (Bad) Autoencoding

Ivan Yee Lee,Cheng Yang,Taylor Berg-Kirkpatrick

Main category: cs.CV

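TL;DR: This paper questions the advantage of vision-based text compression for language modeling, finding that simple alternatives match or surpass the vision encoder at text reconstruction and outperform it at language modeling.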

Details

Motivation: To test the assumption, prompted by DeepSeek-OCR, that vision-based compression genuinely benefits language modeling. Method: Compares the vision encoder against parameter-free mean pooling and a learned hierarchical encoder on text reconstruction and language modeling at matched compression ratios. Result: The simple methods match or surpass the vision encoder on text reconstruction and clearly outperform it on language modeling, where the vision-based approach fails to beat even truncation. Conclusion: Enthusiasm for optical context compression currently outpaces the evidence; vision-based compression is not necessarily an effective route to better context handling in language models. Abstract: DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives (parameter-free mean pooling and a learned hierarchical encoder), we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling, where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
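
To make the two simplest baselines named in the abstract concrete, here is a minimal sketch of parameter-free mean pooling and of truncation at a matched compression ratio. It is an illustration only, not the authors' released code: the function names, the fixed non-overlapping window, and the choice to keep the most recent tokens when truncating are all assumptions.

```python
def mean_pool(tokens, ratio):
    """Compress a sequence of embedding vectors by averaging each
    consecutive window of `ratio` vectors into a single vector.
    (Assumption: fixed, non-overlapping windows; the last may be shorter.)"""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    pooled = []
    for start in range(0, len(tokens), ratio):
        window = tokens[start:start + ratio]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled


def truncate(tokens, ratio):
    """Truncation baseline at the same budget: keep ceil(len / ratio)
    vectors and drop the rest.
    (Assumption: the most recent vectors are the ones kept.)"""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    budget = -(-len(tokens) // ratio)  # ceiling division
    return tokens[-budget:] if budget else []
```

Both functions return the same number of vectors for a given `ratio`, which is what "matched compression ratio" means in this sketch: the downstream model sees an equally short context either way.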

[85] CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

Ruoxuan Zhang,Bin Wen,Hongxia Xie,Yi Yao,Songhan Zuo,Jian-Yu Jiang-Lin,Hong-Han Shuai,Wen-Huang Cheng

Main category: cs.CV

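TL;DR: This paper proposes CookAnything, a diffusion-based framework that generates coherent, semantically distinct multi-step image sequences from cooking instructions of arbitrary length, addressing the limited flexibility and consistency of existing methods.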

Details

Motivation: Existing text-to-image models struggle with the structured, multi-step nature of cooking procedures, and current recipe illustration methods cannot adapt to a varying number of steps or control consistency across steps. Method: Proposes the CookAnything framework with three key components: Step-wise Regional Control (SRC), which aligns textual steps with image regions; Flexible RoPE, a positional encoding that improves temporal coherence and spatial diversity; and Cross-Step Consistency Control (CSCC), which keeps ingredients consistent across steps. Result: On recipe illustration benchmarks, CookAnything outperforms existing methods in both training-based and training-free settings, producing more coherent and semantically distinct image sequences. Conclusion: CookAnything delivers scalable, high-quality visual synthesis for cooking instructions of arbitrary length, with broad potential in instructional media and procedural content creation. Abstract: Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.

[86] Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Zirun Guo,Minjie Hong,Feng Zhang,Kai Jia,Tao Jin

Main category: cs.CV

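TL;DR: This paper proposes CodeVision, a code-as-tool framework for multimodal large language models in which the model generates code to invoke arbitrary image operations, improving robustness to orientation changes and natural corruptions as well as tool-composition ability.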

Details

Motivation: Existing MLLMs rely on a limited tool set when processing images and are highly brittle under orientation changes or natural corruptions, lacking the robustness and scalability that real applications require. Method: Proposes the CodeVision framework, which uses code as a universal interface for invoking image operations, trained in two stages: supervised fine-tuning (SFT) on high-quality multi-turn interaction data, followed by reinforcement learning (RL) with a dense process reward to encourage strategic tool use. Result: Validated on the Qwen2.5-VL and Qwen3-VL model series, the approach significantly improves performance and exhibits flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback; new SFT and RL datasets and a challenging benchmark suite are also constructed. Conclusion: By treating code as the tool, CodeVision enables more flexible, scalable, and robust visual reasoning, moving MLLMs closer to real-world use. Abstract: Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.

[87] V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention

Nan Sun,Zhenyu Zhang,Xixun Lin,Kun Wang,Yanmin Shang,Naibin Gu,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang,Yanan Cao

Main category: cs.CV

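TL;DR: This paper proposes V-ITI, a lightweight visual inference-time intervention framework that mitigates visual neglect in multimodal large language models (MLLMs) and thereby reduces hallucinations.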

Details

Motivation: MLLMs perform well on vision-language tasks, but hallucinations caused by visual neglect undermine their reliability in precision-sensitive domains. Existing methods typically intervene in attention scores or output logits but overlook the question of when to intervene, which leads to over-intervention and extra computational overhead. Method: By studying the mechanism of visual neglect, the authors find that it can be accurately detected from head-level activation patterns. They propose V-ITI, which combines a Visual Neglect Detector based on head-level discriminative probes with a Visual Recall Intervenor that modulates activations using prestored visual activation information only when neglect is detected. Result: Experiments on eight benchmarks and multiple MLLM families show that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance. Conclusion: By precisely detecting and intervening on visual neglect, V-ITI reduces hallucinations in MLLMs without over-intervention or significant computational cost, improving both reliability and efficiency. Abstract: Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with input visuals, that undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene", which leads to the "over-intervention" problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when the visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.

[88] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Zichuan Lin,Yicheng Liu,Yang Yang,Lvfang Tao,Deheng Ye

Main category: cs.CV

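TL;DR: This paper proposes AdaptVision, an efficient vision-language model paradigm that acquires visual tokens adaptively in a coarse-to-fine manner, reducing computational overhead while maintaining high performance.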

Details

Motivation: Existing efficient VLM methods compress visual tokens at a fixed ratio and cannot adapt to varying task requirements, motivating a method that decides autonomously on the minimum number of visual tokens each sample needs. Method: AdaptVision first processes compressed visual tokens from a low-resolution image and, when necessary, invokes a bounding-box tool to crop key regions and selectively acquire more visual information. It is trained with a reinforcement learning framework built around Decoupled Turn Policy Optimization (DTPO), which splits the learning objective into tool utilization and accuracy improvement and computes a separate advantage estimate for each. Result: Experiments on multiple VQA benchmarks show that AdaptVision outperforms state-of-the-art efficient VLM methods while consuming substantially fewer visual tokens. Conclusion: Through adaptive visual token acquisition, AdaptVision greatly improves computational efficiency without sacrificing accuracy, offering a new direction for efficient VLM design. Abstract: Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO.
Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
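
The idea of computing separate advantages for tokens tied to each objective can be illustrated with a small sketch. It assumes a GRPO-style group normalization (reward minus group mean, divided by group standard deviation) applied independently to a tool reward and an answer reward. The reward names, the per-token role labels, and the data layout are hypothetical, and the actual DTPO objective may differ from this.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardise each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_token_advantages(rollouts):
    """Assign each generated token the advantage of its own objective.

    rollouts: list of dicts (hypothetical layout) with keys
      'tool_reward'   - scalar reward for tool use in this rollout
      'answer_reward' - scalar reward for answer correctness
      'token_roles'   - list of 'tool' or 'answer', one entry per token
    Returns one list of per-token advantages per rollout.
    """
    tool_adv = group_advantages([r["tool_reward"] for r in rollouts])
    answer_adv = group_advantages([r["answer_reward"] for r in rollouts])
    return [
        [tool_adv[i] if role == "tool" else answer_adv[i]
         for role in r["token_roles"]]
        for i, r in enumerate(rollouts)
    ]
```

In a single-advantage scheme every token of a rollout would share one scalar. Here a rollout that used the tool well but answered wrongly gets a positive signal on its tool-call tokens and a negative one on its answer tokens.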

[89] Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Wei Chee Yew,Hailun Xu,Sanjay Saha,Xiaotian Fan,Hiok Hian Ong,David Yuchen Wang,Kanchan Sarkar,Zhenheng Yang,Danhui Guan

Main category: cs.CV

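TL;DR: Presents a hybrid content moderation framework for large-scale user-generated video platforms, livestreaming in particular, that combines supervised classification with reference-based similarity matching to detect both known violations and novel edge cases. A multimodal large language model boosts the accuracy of each module. At 80% precision the classification pipeline reaches 67% recall and the similarity pipeline 76% recall, and A/B tests show a 6-8% reduction in views of unwanted livestreams.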

Details

Motivation: Moderation in livestreaming must be timely, multimodal, and able to cope with constantly evolving unwanted content. Traditional methods struggle to handle both known violations and novel, subtle ones, so a more robust and scalable solution is needed. Method: Builds a hybrid moderation framework: a supervised classification model detects known violations, while reference-based similarity matching identifies novel or subtle ones. Multimodal inputs (text, audio, visual) feed both pipelines, and a multimodal large language model (MLLM) distills knowledge into each module to improve accuracy while keeping inference lightweight. Result: In production, the classification pipeline achieves 67% recall at 80% precision and the similarity pipeline achieves 76% recall at 80% precision; large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. Conclusion: The hybrid, multimodal moderation framework is scalable and adaptable, handles both explicit violations and emerging adversarial behavior, and offers a practical, efficient solution for content governance on large livestreaming platforms. Abstract: Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
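
The "recall at 80% precision" figures above refer to a standard operating-point metric. The sketch below shows one common way to compute it from scored examples: sweep the decision threshold from high to low and keep the best recall among thresholds whose precision meets the target. The function name and the tie handling are illustrative choices, not taken from the paper.

```python
def recall_at_precision(scores, labels, target_precision=0.80):
    """Return (recall, threshold) for the highest recall achievable while
    precision stays at or above `target_precision`, or (0.0, None) if no
    threshold qualifies. `labels` are 1 (violation) or 0 (benign)."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0, None
    best = (0.0, None)
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Examples with equal scores share a threshold; evaluate only once
        # the whole tie group has been counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= target_precision and recall > best[0]:
            best = (recall, score)
    return best
```

Fixing precision and reporting recall suits moderation settings where the tolerated false-positive rate is set by policy. The two pipelines can then be compared by how many true violations each catches at the same precision.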

[90] Stable Signer: Hierarchical Sign Language Generative Model

Sen Fang,Yalin Feng,Hongbin Zhong,Yanxin Zhang,Dimitris N. Metaxas

Main category: cs.CV

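TL;DR: This paper proposes Stable Signer, a sign language generation model that streamlines the traditional pipeline by recasting text-to-sign-video generation as a hierarchical end-to-end task. Combined with the new SLUL module and SLP-MoE expert block, it significantly improves generation quality and multi-style capability.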

Details

Motivation: Traditional methods accumulate errors across the text conversion, pose generation, and video rendering stages, which has slowed progress in sign language generation; a more efficient and accurate end-to-end model is needed. Method: Reduces the SLP task to two stages, text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid. It proposes the SLUL module for text understanding, uses the SLP-MoE hand gesture rendering expert block to generate gesture videos, and trains SLUL with the SAGM Loss. Result: Stable Signer improves on current SOTA generation methods by 48.6% and generates high-quality, multi-style sign language videos end to end. Conclusion: Through structural simplification and new modules, Stable Signer reduces error accumulation and improves the accuracy and naturalness of generated sign language videos, advancing the field of Sign Language Production. Abstract: Sign Language Production (SLP) is the process of converting complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, Pose2Vid stages, and some concentrated on Prompt2Gloss and Text2Avatar stages. However, this field has made slow progress due to the inaccuracy of text conversion, pose generation, and the rendering of poses into real human videos in these stages, resulting in gradually accumulating errors. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines the SLP task as a hierarchical end-to-end generation task that only includes text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid. It performs text understanding through our proposed Sign Language Understanding Linker, called SLUL, and generates hand gestures through the SLP-MoE hand gesture rendering expert block, producing high-quality, multi-style sign language videos end to end. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.

[91] GAOT: Generating Articulated Objects Through Text-Guided Diffusion Models

Hao Sun,Lei Fan,Donglin Di,Shaohui Liu

Main category: cs.CV

TL;DR: Proposes GAOT, a framework that generates 3D articulated objects from text by combining diffusion models with hypergraph learning in three phases.

Details

Motivation: Existing models struggle to connect textual descriptions with 3D articulated object representations and lack the ability to generate conditioned on text prompts. Method: A three-phase framework: (1) fine-tune a point cloud generation model to produce coarse objects from text; (2) refine them with a hypergraph-based learning method that represents object parts as graph vertices; (3) use a diffusion model to generate the joints, represented as graph edges, from the parts. Result: Experiments on the PartNet-Mobility dataset show that the method outperforms previous approaches in both qualitative and quantitative evaluation. Conclusion: GAOT bridges the gap between textual descriptions and 3D articulated object generation and performs strongly on complex structure generation. Abstract: Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts, leveraging diffusion models and hypergraph learning. First, we fine-tune a point cloud generation model to produce a coarse representation of objects from text prompts. Given the inherent connection between articulated objects and graph structures, we design a hypergraph-based learning method to refine these coarse representations, representing object parts as graph vertices. Finally, leveraging a diffusion model, the joints of articulated objects, represented as graph edges, are generated based on the object parts. Extensive qualitative and quantitative experiments on the PartNet-Mobility dataset demonstrate the effectiveness of our approach, achieving superior performance over previous methods.

[92] Global-Local Aware Scene Text Editing

Fuxiang Yang,Tonghua Su,Donglin Di,Yin Chen,Xiangqian Wu,Zhongjie Wang,Lei Fan

Main category: cs.CV

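TL;DR: Proposes GLASTE, an end-to-end global-local aware scene text editing framework that addresses two weaknesses of existing methods: edited text that is inconsistent with its surroundings, and poor handling of changes in text length.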

Details

Motivation: Existing scene text editing methods fall short in keeping the edited region consistent with its surroundings and in handling text of different lengths. Method: Designs a global-local combination structure and joint global and local losses, and enhances text image features; represents text style as a vector independent of image size; and uses an affine fusion that preserves the aspect ratio of the target text. Result: Experiments on real-world datasets show that GLASTE outperforms previous methods in both quantitative metrics and visual quality, and mitigates the inconsistency and length-insensitivity problems. Conclusion: By fusing global context with local detail, GLASTE significantly improves the quality and robustness of scene text editing. Abstract: Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which simultaneously incorporates high-level global contextual information along with delicate local features. Specifically, we design a global-local combination structure, joint global and local losses, and enhance text image features to ensure consistency in text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of the image size, which can be transferred to target text images of various sizes. We use an affine fusion to fill target text images into the editing patch while maintaining their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
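
The abstract's point about filling a target text image into the editing patch without changing its aspect ratio can be pictured as a uniform-scale affine map. The sketch below computes such a map, assuming the text image is scaled to fit and then centered in the patch. The centering choice and the function name are assumptions for illustration; the paper's affine fusion operates inside the network and may differ.

```python
def fit_affine(src_w, src_h, dst_w, dst_h):
    """Return (scale, tx, ty) such that a source pixel (x, y) maps to
    (scale * x + tx, scale * y + ty) inside a dst_w x dst_h patch.

    A single uniform scale is used for both axes, so the aspect ratio of
    the source text image is unchanged; the result is centered
    (an assumption of this sketch)."""
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("all dimensions must be positive")
    scale = min(dst_w / src_w, dst_h / src_h)  # largest scale that still fits
    tx = (dst_w - scale * src_w) / 2           # horizontal centering offset
    ty = (dst_h - scale * src_h) / 2           # vertical centering offset
    return scale, tx, ty
```

For example, a 200x50 text image placed into a 100x40 patch is scaled by 0.5 to 100x25 and shifted down by 7.5 pixels. A longer replacement string therefore shrinks uniformly and its glyphs are not stretched, which is the length-sensitivity issue the abstract describes.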

[93] UniComp: Rethinking Video Compression Through Informational Uniqueness

Chao Yuan,Shimin Chen,Minliang Lin,Limeng Qiao,Guanglu Wan,Lin Ma

Main category: cs.CV

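TL;DR: This paper proposes UniComp, a video compression framework driven by information uniqueness, which optimizes the information fidelity of visual representations by minimizing the conditional entropy between retained and full tokens.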

Details

Motivation: Existing attention-based compression methods do not adequately account for the intrinsic redundancy among tokens, which hurts the balance between reconstruction accuracy and computational efficiency. Method: From an information-theoretic perspective, introduces information uniqueness to measure redundancy among tokens, and on that basis designs three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Result: Experiments show that UniComp preserves essential visual tokens more effectively under limited computational budgets and consistently outperforms existing compression methods. Conclusion: Information uniqueness is a key factor in the effectiveness of token compression, and UniComp offers a new approach to efficient video compression. Abstract: Distinct from attention-based compression methods, this paper presents an information uniqueness driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from the information-theoretic perspective, we formulate the vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness to measure intrinsic redundancy among tokens to link with reconstruction error. Based on uniqueness, we design three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.
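
The abstract does not spell out how information uniqueness is computed, so the following is only a rough intuition pump and not UniComp's definition. It treats a token as unique to the extent that it is dissimilar (by cosine similarity) from every token already kept, and greedily fills a fixed budget with the most unique tokens. The similarity measure, the greedy strategy, and seeding with the first token are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_unique_tokens(tokens, budget):
    """Greedily pick up to `budget` token indices, each time adding the
    token that is least similar to everything already kept.
    (Hypothetical stand-in for a uniqueness-driven selection rule.)"""
    if not tokens or budget <= 0:
        return []
    kept = [0]  # assumption: seed the kept set with the first token
    # uniqueness[i] = 1 - (max cosine similarity of token i to any kept token)
    uniqueness = [1.0 - cosine(t, tokens[0]) for t in tokens]
    while len(kept) < min(budget, len(tokens)):
        candidates = [i for i in range(len(tokens)) if i not in kept]
        nxt = max(candidates, key=lambda i: uniqueness[i])
        kept.append(nxt)
        for i, t in enumerate(tokens):
            uniqueness[i] = min(uniqueness[i], 1.0 - cosine(t, tokens[nxt]))
    return sorted(kept)
```

The contrast with attention-based pruning is that two tokens may both score highly yet be near-duplicates of each other. A redundancy-aware rule keeps only one of them and spends the remaining budget on content the kept set cannot already account for.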

[94] Cross-Stain Contrastive Learning for Paired Immunohistochemistry and Histopathology Slide Representation Learning

Yizhi Zhang,Lei Fan,Zhulin Tao,Donglin Di,Yang Song,Sidong Liu,Cong Cong

Main category: cs.CV

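TL;DR: This paper proposes Cross-Stain Contrastive Learning (CSCL), a framework that uses a paired five-stain dataset to produce transferable whole-slide image representations, yielding consistent gains on cancer subtype classification, IHC biomarker status prediction, and survival analysis.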

Details

Motivation: Severe inter-stain misalignment and scarce data in multi-stain datasets limit joint feature learning across H&E and immunohistochemistry (IHC), holding back high-quality, transferable pathology image representations. Method: Curates a paired whole-slide image dataset covering five stains (H&E, HER2, KI67, ER, PGR) and proposes CSCL, a two-stage pretraining framework. Stage one uses patch-wise contrastive alignment to make H&E features more compatible with IHC features. Stage two applies Multiple Instance Learning (MIL), with a cross-stain attention fusion module to integrate stain-specific features and a global alignment module to keep slide-level embeddings consistent across stains. Result: Consistent gains on cancer subtype classification, IHC biomarker status classification, and survival prediction indicate that the method produces high-quality, transferable H&E slide-level representations. Conclusion: By addressing cross-stain misalignment and fusing multi-stain information, CSCL offers a more robust and general representation learning approach for computational pathology. Abstract: Universal, transferable whole-slide image (WSI) representations are central to computational pathology. Incorporating multiple markers (e.g., immunohistochemistry, IHC) alongside H&E enriches H&E-based features with diverse, biologically meaningful information. However, progress is limited by the scarcity of well-aligned multi-stain datasets. Inter-stain misalignment shifts corresponding tissue across slides, hindering consistent patch-level features and degrading slide-level embeddings. To address this, we curated a slide-level aligned, five-stain dataset (H&E, HER2, KI67, ER, PGR) to enable paired H&E-IHC learning and robust cross-stain representation. Leveraging this dataset, we propose Cross-Stain Contrastive Learning (CSCL), a two-stage pretraining framework with a lightweight adapter trained using patch-wise contrastive alignment to improve the compatibility of H&E features with corresponding IHC-derived contextual cues, and slide-level representation learning with Multiple Instance Learning (MIL), which uses a cross-stain attention fusion module to integrate stain-specific patch features and a cross-stain global alignment module to enforce consistency among slide-level embeddings across different stains. Experiments on cancer subtype classification, IHC biomarker status classification, and survival prediction show consistent gains, yielding high-quality, transferable H&E slide-level representations. The code and data are available at https://github.com/lily-zyz/CSCL.

[95] Dynamic Optical Test for Bot Identification (DOT-BI): A simple check to identify bots in surveys and online processes

Malte Bleeker,Mauro Gotsch

Main category: cs.CV

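TL;DR: Proposes DOT-BI, a dynamic optical test that uses human motion perception to tell humans from automated systems. Preliminary studies show a high human success rate and good usability, while the state-of-the-art AI models tested fail to solve it.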

Details

Motivation: Bots impersonating humans in online surveys and automated processes call for a verification method that is simple, fast, and hard for AI to defeat. Method: Designs the Dynamic Optical Test for Bot Identification (DOT-BI), in which a hidden number shares the same black-and-white pixel texture as its background and becomes perceptible to humans only through differences in motion and scale, while frame-by-frame algorithmic processing yields no meaningful signal. Evaluation covers tests on state-of-the-art multimodal AI models and online and lab studies with human participants. Result: Advanced AI models such as GPT-5-Thinking and Gemini 2.5 Pro fail to extract the hidden number; in the online survey 99.5% of human participants solved the task, taking 10.7 seconds on average; the lab study found no negative effects on usability. Conclusion: DOT-BI is an efficient, user-friendly CAPTCHA-style method that withstands the current AI models tested and has potential for wide deployment in practice. Abstract: We propose the Dynamic Optical Test for Bot Identification (DOT-BI): a quick and easy method that uses human perception of motion to differentiate between human respondents and automated systems in surveys and online processes. In DOT-BI, a 'hidden' number is displayed with the same random black-and-white pixel texture as its background. Only the difference in motion and scale between the number and the background makes the number perceptible to humans across frames, while frame-by-frame algorithmic processing yields no meaningful signal. We conducted two preliminary assessments. Firstly, state-of-the-art, video-capable, multimodal models (GPT-5-Thinking and Gemini 2.5 Pro) fail to extract the correct value, even when given explicit instructions about the mechanism. Secondly, in an online survey (n=182), 99.5% (181/182) of participants solved the task, with an average end-to-end completion time of 10.7 seconds; a supervised lab study (n=39) found no negative effects on perceived ease-of-use or completion time relative to a control. We release code to generate tests and 100+ pre-rendered variants to facilitate adoption in surveys and online processes.
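
The mechanism described above, a glyph that shares its background's random texture and is revealed only by motion, can be sketched in a few lines. This is an illustrative toy and not the authors' released generator. It uses horizontal motion only (the paper also mentions scale), and the texture layout, drift directions, and function name are assumptions.

```python
import random


def make_frames(mask, n_frames, seed=0):
    """Generate binary frames in which the region marked by `mask`
    (a 2D list of 0/1) is visible only through motion.

    Two independent random black/white textures are used: inside the mask
    the glyph texture drifts one pixel left per frame, outside it the
    background texture drifts one pixel right per frame. Every individual
    frame is therefore a field of independent random bits."""
    rng = random.Random(seed)
    h, w = len(mask), len(mask[0])
    background = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    glyph = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    frames = []
    for t in range(n_frames):
        frame = [
            [glyph[y][(x + t) % w] if mask[y][x] else background[y][(x - t) % w]
             for x in range(w)]
            for y in range(h)
        ]
        frames.append(frame)
    return frames
```

In this toy version each frame draws every pixel from a distinct position of one of two independent random textures, so a single frame contains no trace of the mask. The glyph's outline appears only when consecutive frames are compared, because that is where the two opposing drift directions meet.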

[96] Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation

Yuchen Deng,Xiuyang Wu,Hai-Tao Zheng,Jie Wang,Feidiao Yang,Yuxing Han

Main category: cs.CV

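TL;DR: This paper proposes BBF, a context-aware video frame interpolation framework that can be guided by audio and visual semantics and accepts multimodal conditioning, outperforming existing methods on both generic and audio-visual synchronized interpolation.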

Details

Motivation: Existing diffusion-based methods still struggle to cover diverse application scenarios when handling complex non-linear motion, perform poorly on fine-grained motion such as audio-visual synchronized interpolation, and lack a unified framework that supports multiple conditions. Method: (1) Enhances the input design of the interpolation model to support several conditioning modalities, including text, audio, images, and video; (2) proposes a decoupled multimodal fusion mechanism that injects the different conditional signals sequentially into a DiT backbone; (3) adopts a progressive multi-stage training paradigm in which a start-end frame difference embedding dynamically adjusts data sampling and loss weighting. Result: Experiments show that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation, providing a unified interpolation framework under coordinated multi-channel conditioning. Conclusion: Through context awareness and multimodal fusion, BBF improves interpolation quality and temporal consistency under complex motion, offering a general and efficient solution for multi-condition video interpolation. Abstract: Handling fast, complex, and highly non-linear motion patterns has long posed challenges for video frame interpolation. Although recent diffusion-based approaches improve upon traditional optical-flow-based methods, they still struggle to cover diverse application scenarios and often fail to produce sharp, temporally consistent frames in fine-grained motion tasks such as audio-visual synchronized interpolation. To address these limitations, we introduce BBF (Beyond Boundary Frames), a context-aware video frame interpolation framework, which can be guided by audio/visual semantics. First, we enhance the input design of the interpolation model so that it can flexibly handle multiple conditional modalities, including text, audio, images, and video. Second, we propose a decoupled multimodal fusion mechanism that sequentially injects different conditional signals into a DiT backbone. Finally, to maintain the generation abilities of the foundation model, we adopt a progressive multi-stage training paradigm, where the start-end frame difference embedding is used to dynamically adjust both the data sampling and the loss weighting. Extensive experimental results demonstrate that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation tasks, establishing a unified framework for video frame interpolation under coordinated multi-channel conditioning.

[97] Harnessing Hypergraphs in Geometric Deep Learning for 3D RNA Inverse Folding

Guang Yang,Lei Fan

Main category: cs.CV

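TL;DR: This paper proposes HyperRNA, a generative model that uses hypergraphs and an encoder-decoder architecture to tackle RNA inverse folding, designing RNA sequences that fold into target secondary structures more effectively.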

Details

Motivation: RNA inverse folding is a key challenge in RNA design: the complex relationship between sequence and structure makes it hard to design sequences that fold into a specified structure. Method: Proposes the HyperRNA framework with preprocessing, encoding, and decoding stages: graph structures are built from a 3-bead coarse-grained representation, an attention embedding module and a hypergraph-based encoder capture higher-order dependencies, and the RNA sequence is generated autoregressively. Result: Experiments on the PDBBind and RNAsolo datasets show that HyperRNA outperforms existing methods on RNA sequence generation and RNA-protein complex sequence generation. Conclusion: By using hypergraphs to model complex interactions in RNA, HyperRNA performs strongly on RNA inverse folding and demonstrates the potential of hypergraphs in RNA engineering. Abstract: The RNA inverse folding problem, a key challenge in RNA design, involves identifying nucleotide sequences that can fold into desired secondary structures, which are critical for ensuring molecular stability and function. The inherent complexity of this task stems from the intricate relationship between sequence and structure, making it particularly challenging. In this paper, we propose a framework, named HyperRNA, a generative model with an encoder-decoder architecture that leverages hypergraphs to design RNA sequences. Specifically, our HyperRNA model consists of three main components: preprocessing, encoding and decoding. In the preprocessing stage, graph structures are constructed by extracting the atom coordinates of RNA backbone based on 3-bead coarse-grained representation. The encoding stage processes these graphs, capturing higher order dependencies and complex biomolecular interactions using an attention embedding module and a hypergraph-based encoder. Finally, the decoding stage generates the RNA sequence in an autoregressive manner. We conducted quantitative and qualitative experiments on the PDBBind and RNAsolo datasets to evaluate the inverse folding task for RNA sequence generation and RNA-protein complex sequence generation. The experimental results demonstrate that HyperRNA not only outperforms existing RNA design methods but also highlights the potential of leveraging hypergraphs in RNA engineering.

[98] CloseUpAvatar: High-Fidelity Animatable Full-Body Avatars with Mixture of Multi-Scale Textures

David Svitov,Pietro Morerio,Lourdes Agapito,Alessio Del Bue

Main category: cs.CV

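TL;DR: Proposes CloseUpAvatar, which represents a human avatar with learnable textured planes and adapts the mix of low- and high-frequency textures to camera distance, giving high-quality close-up rendering across a wider range of camera views.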

Details

Motivation: Existing methods struggle to maintain high-quality close-up rendering under complex camera motion, and in particular to balance efficiency and detail across near and far views. Method: Represents the avatar as a set of textured planes with two sets of learnable textures capturing low- and high-frequency detail, and adjusts the texture weights dynamically according to the distance between the camera and the surface. Result: Experiments on the high-resolution ActorsHQ dataset show that the method outperforms existing approaches across a variety of camera positions while balancing rendering quality and frame rate. Conclusion: With its distance-aware texture switching, CloseUpAvatar achieves more realistic and flexible avatar rendering at high FPS, suitable for high-quality playback across a wide range of camera views. Abstract: We present CloseUpAvatar, a novel approach for articulated human avatar representation dealing with more general camera motions, while preserving rendering quality for close-up views. CloseUpAvatar represents an avatar as a set of textured planes with two sets of learnable textures for low and high-frequency detail. The method automatically switches to high-frequency textures only for cameras positioned close to the avatar's surface and gradually reduces their impact as the camera moves farther away. Such parametrization of the avatar enables CloseUpAvatar to adjust rendering quality based on camera distance, ensuring realistic rendering across a wider range of camera orientations than previous approaches. We provide experiments using the ActorsHQ dataset with high-resolution input images. CloseUpAvatar demonstrates both qualitative and quantitative improvements over existing methods in rendering from a wide range of novel camera positions, while maintaining high FPS by limiting the number of required primitives.
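
The distance-dependent switching described above can be sketched as a weight that is 1 near the surface and fades smoothly to 0 far away. The `near` and `far` thresholds, the smoothstep falloff, and the additive combination of the two textures are assumptions made for illustration; the abstract does not give the paper's actual weighting function.

```python
def high_freq_weight(distance, near=0.5, far=2.0):
    """Weight in [0, 1] for the high-frequency texture: 1 at or inside
    `near`, 0 at or beyond `far`, with a smoothstep falloff in between.
    (`near` and `far` are hypothetical thresholds in scene units.)"""
    if far <= near:
        raise ValueError("far must be greater than near")
    t = (far - distance) / (far - near)
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)


def blend_texel(low, high, distance, near=0.5, far=2.0):
    """Combine a low-frequency base colour with a high-frequency detail
    residual, scaled by the distance-dependent weight.
    (Assumption: detail is added on top of the base.)"""
    w = high_freq_weight(distance, near, far)
    return tuple(l + w * h for l, h in zip(low, high))
```

A smooth falloff rather than a hard switch matches the abstract's description of gradually reducing the high-frequency contribution as the camera moves away, and it avoids a visible pop at a fixed distance.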

[99] HBFormer: A Hybrid-Bridge Transformer for Microtumor and Miniature Organ Segmentation

Fuchen Zheng,Xinyi Chen,Weixuan Li,Quanjun Li,Junhua Zhou,Xiaojiao Guo,Xuhang Chen,Chi-Man Pun,Shoujun Zhou

Main category: cs.CV

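TL;DR: This paper proposes HBFormer, a Hybrid-Bridge Transformer architecture that addresses the difficulty local attention has in fusing global context for medical image segmentation, and is particularly suited to segmenting microtumors and miniature organs.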

Details

Motivation: Vision Transformers based on windowed self-attention perform well in medical image segmentation, but their localized attention struggles to fuse local detail with global context, limiting precise segmentation of tiny structures. Method: Proposes HBFormer, which combines a U-shaped encoder-decoder with a Swin Transformer backbone and introduces a Multi-Scale Feature Fusion (MFF) decoder as the "bridge"; channel and spatial attention modules built from dilated and depth-wise convolutions fuse multi-scale features with global context. Result: Experiments on several challenging medical image segmentation datasets (multi-organ, liver tumor, bladder tumor) show that HBFormer achieves state-of-the-art performance, especially on microtumor and miniature organ segmentation. Conclusion: With its hybrid-bridge design and MFF decoder, HBFormer fuses local detail with global context and significantly improves segmentation accuracy for tiny structures in medical images. Abstract: Medical image segmentation is a cornerstone of modern clinical diagnostics. While Vision Transformers that leverage shifted window-based self-attention have established new benchmarks in this field, they are often hampered by a critical limitation: their localized attention mechanism struggles to effectively fuse local details with global context. This deficiency is particularly detrimental to challenging tasks such as the segmentation of microtumors and miniature organs, where both fine-grained boundary definition and broad contextual understanding are paramount. To address this gap, we propose HBFormer, a novel Hybrid-Bridge Transformer architecture. The 'Hybrid' design of HBFormer synergizes a classic U-shaped encoder-decoder framework with a powerful Swin Transformer backbone for robust hierarchical feature extraction. The core innovation lies in its 'Bridge' mechanism, a sophisticated nexus for multi-scale feature integration. This bridge is architecturally embodied by our novel Multi-Scale Feature Fusion (MFF) decoder. Departing from conventional symmetric designs, the MFF decoder is engineered to fuse multi-scale features from the encoder with global contextual information. It achieves this through a synergistic combination of channel and spatial attention modules, which are constructed from a series of dilated and depth-wise convolutions. These components work in concert to create a powerful feature bridge that explicitly captures long-range dependencies and refines object boundaries with exceptional precision.
Comprehensive experiments on challenging medical image segmentation datasets, including multi-organ, liver tumor, and bladder tumor benchmarks, demonstrate that HBFormer achieves state-of-the-art results, showcasing its outstanding capabilities in microtumor and miniature organ segmentation. Code and models are available at: https://github.com/lzeeorno/HBFormer.

[100] Memory-Guided Point Cloud Completion for Dental Reconstruction

Jianan Sun,Yukang Huang,Dongzhihan Wang,Mingyu Fan

Main category: cs.CV

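TL;DR: Proposes a retrieval-augmented framework for tooth point cloud completion that adds a learnable prototype memory module, using cross-sample shape priors to improve completion accuracy and detail recovery.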

Details

Motivation: Partial dental point clouds have large missing regions due to occlusion and limited scanning views, which biases the features of encoder-decoder models and leads decoders to hallucinate structures in the absence of effective shape priors. Method: Adds a learnable prototype memory to the encoder-decoder framework. After the partial input is encoded into a global descriptor, the nearest memory prototype is retrieved and fused with the query feature through a confidence-gating mechanism before decoding. The memory is optimized end to end and self-organizes into reusable tooth-shape prototypes. Result: Experiments on a self-processed Teeth3DS benchmark show consistently lower Chamfer Distance, and visualizations show sharper cusps, ridges, and interproximal transitions. The module is plug-and-play, compatible with common completion backbones, and needs no additional loss functions. Conclusion: By introducing a structural-prior memory module that requires no tooth-position labels, the method stabilizes inference in missing regions and frees decoder capacity for detail recovery, providing a simple yet effective solution for dental point cloud completion. Abstract: Partial dental point clouds often suffer from large missing regions caused by occlusion and limited scanning views, which bias encoder-only global features and force decoders to hallucinate structures. We propose a retrieval-augmented framework for tooth completion that integrates a prototype memory into standard encoder-decoder pipelines. After encoding a partial input into a global descriptor, the model retrieves the nearest manifold prototype from a learnable memory and fuses it with the query feature through confidence-gated weighting before decoding. The memory is optimized end-to-end and self-organizes into reusable tooth-shape prototypes without requiring tooth-position labels, thereby providing structural priors that stabilize missing-region inference and free decoder capacity for detail recovery. The module is plug-and-play and compatible with common completion backbones, while keeping the same training losses. Experiments on a self-processed Teeth3DS benchmark demonstrate consistent improvements in Chamfer Distance, with visualizations showing sharper cusps, ridges, and interproximal transitions. Our approach provides a simple yet effective way to exploit cross-sample regularities for more accurate and faithful dental point-cloud completion.
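
The retrieve-then-fuse step described above can be sketched as a nearest-prototype lookup followed by a gated blend. In the paper the memory and the gate are learned end to end. Here, purely for illustration, retrieval uses cosine similarity and the gate is the clamped similarity itself, so both choices, along with the function names, are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_and_fuse(query, memory):
    """Retrieve the prototype most similar to `query` and blend the two.

    query:  global descriptor of the partial input (list of floats)
    memory: list of prototype vectors of the same dimension
    Returns (fused, index, gate). The gate in [0, 1] stands in for the
    learned confidence: a high value leans on the retrieved prototype,
    a low value leaves the query feature mostly unchanged."""
    if not memory:
        raise ValueError("memory must contain at least one prototype")
    sims = [cosine(query, p) for p in memory]
    idx = max(range(len(memory)), key=lambda i: sims[i])
    gate = max(0.0, min(1.0, sims[idx]))
    fused = [(1.0 - gate) * q + gate * p for q, p in zip(query, memory[idx])]
    return fused, idx, gate
```

The gating matters because a partial scan that matches no stored prototype well should not be pulled toward an unrelated shape. With a low gate the decoder sees essentially the original descriptor.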

[101] Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding

Haoran Zhou,Gim Hee Lee

Main category: cs.CV

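TL;DR: This paper presents Motion4D, a framework that integrates priors from 2D foundation models into a unified 4D Gaussian Splatting representation and uses sequential and global optimization to model dynamic scenes with 3D consistency.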

Details

Motivation: Existing 2D vision foundation models lack 3D consistency when analyzing dynamic scenes, which causes spatial misalignment and temporal flickering and makes it hard to accurately capture geometry and motion in complex 3D environments. Method: A two-part iterative optimization framework: (1) sequential optimization, which updates the motion and semantic fields in consecutive stages to maintain local consistency; and (2) global optimization, which jointly refines all attributes for long-term coherence. A 3D confidence map dynamically adjusts the motion priors, and adaptive resampling is driven by RGB and semantic errors. Semantic consistency is achieved by alternately optimizing the semantic fields and the SAM2 prompts. Result: Significantly outperforms 2D foundation models and existing 3D methods on tasks such as point tracking, video object segmentation, and novel view synthesis. Conclusion: Motion4D effectively fuses 2D priors with a 4D representation, achieving high-quality, 3D-consistent dynamic scene modeling while retaining strong generalization. Abstract: Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under-represented regions based on per-pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating prompts of SAM2. Extensive evaluations demonstrate that our Motion4D significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis. Our code is available at https://hrzhou2.github.io/motion4d-web/.

[102] LAMP: Language-Assisted Motion Planning for Controllable Video Generation

Muhammed Burak Kizil,Enes Sanli,Niloy J. Mitra,Erkut Erdem,Aykut Erdem,Duygu Ceylan

Main category: cs.CV

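TL;DR: This paper presents LAMP, a framework that uses large language models to turn natural language descriptions into 3D motion trajectories, giving precise control over dynamic objects and camera motion; it is the first unified framework to generate both object and camera motion directly from natural language.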

Details

Motivation: Motion-control interfaces in existing video generation are limited and cannot meet the needs of complex, cinematic scene creation, particularly for natural-language-driven control of object and camera motion. Method: LAMP, an LLM-based motion planning method, defines a motion domain-specific language (DSL) inspired by cinematography conventions, uses the program synthesis capabilities of LLMs to convert natural language into structured motion programs, and deterministically maps these programs to 3D trajectories. A large-scale dataset pairing text descriptions with motion programs and 3D trajectories is also constructed. Result: Experiments show that LAMP outperforms state-of-the-art methods in motion controllability and alignment with user intent, and accurately generates trajectories for dynamic objects and relatively defined cameras that match the natural language description. Conclusion: LAMP is the first framework to generate 3D motion trajectories for both objects and cameras directly from natural language, substantially improving motion control and usability in video generation. Abstract: Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, which leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications.

[103] ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation

Yaokun Li,Shuaixian Wang,Mantang Guo,Jiehui Huang,Taojun Ding,Mu Hu,Kaixuan Wang,Shaojie Shen,Guang Tan

Main category: cs.CV

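TL;DR: ReCamDriving is a purely vision-based, camera-controlled framework for novel-trajectory video generation that uses 3DGS renderings as dense, scene-complete geometric guidance to achieve precise viewpoint-controlled generation.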

Details

Motivation: Existing repair-based methods struggle to restore complex artifacts, and LiDAR-based methods rely on sparse, incomplete cues; neither supports precise camera control and structural consistency in complex driving scenes. Method: 3DGS-based two-stage training: the first stage uses camera poses for coarse control, and the second stage adds 3DGS renderings for fine-grained geometric and viewpoint guidance. A cross-trajectory data curation strategy is also proposed and used to build the ParaDrive dataset, which contains 110K parallel-trajectory video pairs. Result: Achieves state-of-the-art camera controllability and structural consistency across experiments, and effectively narrows the train-test gap in camera transformations. Conclusion: By combining 3DGS renderings with two-stage training, ReCamDriving achieves high-quality, controllable novel-view driving video generation in a purely vision-based setting, advancing visual generative models for autonomous driving simulation. Abstract: We propose ReCamDriving, a purely vision-based, camera-controlled novel-trajectory video generation framework. While repair-based methods fail to restore complex artifacts and LiDAR-based approaches rely on sparse and incomplete cues, ReCamDriving leverages dense and scene-complete 3DGS renderings for explicit geometric guidance, achieving precise camera-controllable generation. To mitigate overfitting to restoration behaviors when conditioned on 3DGS renderings, ReCamDriving adopts a two-stage training paradigm: the first stage uses camera poses for coarse control, while the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance. Furthermore, we present a 3DGS-based cross-trajectory data curation strategy to eliminate the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos. Based on this strategy, we construct the ParaDrive dataset, containing over 110K parallel-trajectory video pairs. Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.

[104] FeatureLens: A Highly Generalizable and Interpretable Framework for Detecting Adversarial Examples Based on Image Features

Zhigang Yang,Yuan Liu,Jiawei Zhang,Puning Zhang,Xinqiang Ma

Main category: cs.CV

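TL;DR: FeatureLens is a lightweight framework that detects adversarial attacks by extracting image features and feeding them to shallow classifiers, maintaining high accuracy while offering good interpretability, generalization, and computational efficiency.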

Details

Motivation: Existing adversarial-attack detection methods rely on complex, hard-to-interpret models, which limits interpretability and generalization. Method: The FeatureLens framework consists of an Image Feature Extractor (IFE) and shallow classifiers (such as SVM, MLP, or XGBoost), and uses only 51-dimensional features for anomaly detection. Result: Detection accuracy is 97.8% to 99.75% in closed-set evaluation and 86.17% to 99.6% in generalization evaluation, across FGSM, PGD, CW, and DAmageNet attacks. Conclusion: FeatureLens strikes a good balance among detection performance, interpretability, generalization, and computational efficiency, offering a practical path toward transparent and effective adversarial defense. Abstract: Despite the remarkable performance of deep neural networks (DNNs) in image classification, their vulnerability to adversarial attacks remains a critical challenge. Most existing detection methods rely on complex and poorly interpretable architectures, which compromise interpretability and generalization. To address this, we propose FeatureLens, a lightweight framework that acts as a lens to scrutinize anomalies in image features. Comprising an Image Feature Extractor (IFE) and shallow classifiers (e.g., SVM, MLP, or XGBoost) with model sizes ranging from 1,000 to 30,000 parameters, FeatureLens achieves high detection accuracy ranging from 97.8% to 99.75% in closed-set evaluation and 86.17% to 99.6% in generalization evaluation across FGSM, PGD, CW, and DAmageNet attacks, using only 51-dimensional features. By combining strong detection performance with excellent generalization, interpretability, and computational efficiency, FeatureLens offers a practical pathway toward transparent and effective adversarial defense.

[105] MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms

Jiahao Zhang,Xiao Zhao,Guangyu Gao

Main category: cs.CV

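TL;DR: This paper proposes MKSNet, a new network architecture that improves small object detection in remote sensing imagery through a multi-kernel selection mechanism and a dual attention mechanism.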

Details

Motivation: In remote sensing images, high resolution and small target size cause information loss in deep features, and complex backgrounds and spatial redundancy add further difficulty, so existing CNN models struggle to detect small objects effectively. Method: A Multi-Kernel Selection (MKS) mechanism uses large convolutional kernels to capture rich contextual information and adaptively selects the kernel size. A dual attention mechanism combining spatial and channel attention strengthens feature representation in key regions and suppresses background noise. Result: Experiments on the DOTA-v1.0 and HRSC2016 benchmarks show that MKSNet significantly outperforms existing state-of-the-art models in small object detection. Conclusion: MKSNet effectively handles the challenges of multi-scale, high-resolution remote sensing imagery and delivers superior small object detection. Abstract: Deep convolutional neural networks (DCNNs) have substantially advanced object detection capabilities, particularly in remote sensing imagery. However, challenges persist, especially in detecting small objects where the high resolution of these images and the small size of target objects often result in a loss of critical information in the deeper layers of conventional CNNs. Additionally, the extensive spatial redundancy and intricate background details typical in remote-sensing images tend to obscure these small targets. To address these challenges, we introduce Multi-Kernel Selection Network (MKSNet), a network architecture featuring a novel Multi-Kernel Selection (MKS) mechanism. The MKS mechanism utilizes large convolutional kernels to effectively capture an extensive range of contextual information. This design allows for adaptive kernel size selection, significantly enhancing the network's ability to dynamically process and emphasize crucial spatial details for small object detection. Furthermore, MKSNet also incorporates a dual attention mechanism, merging spatial and channel attention modules. The spatial attention module adaptively fine-tunes the spatial weights of feature maps, focusing more intensively on relevant regions while mitigating background noise. Simultaneously, the channel attention module optimizes channel information selection, improving feature representation and detection accuracy. Empirical evaluations on the DOTA-v1.0 and HRSC2016 benchmarks demonstrate that MKSNet substantially surpasses existing state-of-the-art models in detecting small objects in remote sensing images. These results highlight MKSNet's superior ability to manage the complexities associated with multi-scale and high-resolution image data, confirming its effectiveness and innovation in remote sensing object detection.

[106] Multi-Scale Visual Prompting for Lightweight Small-Image Classification

Salim Khazem

Main category: cs.CV

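TL;DR: This paper proposes Multi-Scale Visual Prompting (MSVP), which fuses global, mid-scale, and local prompt maps into the input image and significantly improves performance on small-image benchmarks while adding very few parameters.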

Details

Motivation: Visual prompting has been applied to large models and high-resolution datasets, but small-image benchmarks such as MNIST and CIFAR-10 have received little attention; this paper aims to fill that gap. Method: MSVP is a lightweight multi-scale prompting module that fuses learnable prompt maps at different scales into the input through a 1×1 convolution, and works with both CNN and Vision Transformer backbones. Result: Performance improves on MNIST, Fashion-MNIST, and CIFAR-10 with very low computational overhead, and ablations and visualizations confirm the module's effectiveness. Conclusion: Multi-scale prompting provides an effective inductive bias for low-resolution images and is a general, efficient way to adapt small-image models. Abstract: Visual prompting has recently emerged as an efficient strategy to adapt vision models using lightweight, learnable parameters injected into the input space. However, prior work mainly targets large Vision Transformers and high-resolution datasets such as ImageNet. In contrast, small-image benchmarks like MNIST, Fashion-MNIST, and CIFAR-10 remain widely used in education, prototyping, and research, yet have received little attention in the context of prompting. In this paper, we introduce Multi-Scale Visual Prompting (MSVP), a simple and generic module that learns a set of global, mid-scale, and local prompt maps fused with the input image via a lightweight $1 \times 1$ convolution. MSVP is backbone-agnostic, adds less than $0.02\%$ parameters, and significantly improves performance across CNN and Vision Transformer backbones. We provide a unified benchmark on MNIST, Fashion-MNIST, and CIFAR-10 using a simple CNN, ResNet-18, and a small Vision Transformer. Our method yields consistent improvements with negligible computational overhead. We further include ablations on prompt scales, fusion strategies, and backbone architectures, along with qualitative analyses using prompt visualizations and Grad-CAM. Our results demonstrate that multi-scale prompting provides an effective inductive bias even on low-resolution images.
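The fusion of multi-scale prompt maps through a 1×1 convolution can be illustrated on a toy single-channel image. The sketch below is an assumption-laden reading of this entry, not the paper's code: prompts are upsampled by nearest neighbour, stacked with the image as extra channels, and reduced back to one channel. The helper names `upsample_nearest` and `fuse_1x1` and all numeric values are hypothetical.

```python
def upsample_nearest(grid, size):
    """Nearest-neighbour upsample of a square 2D list to size x size."""
    n = len(grid)
    return [[grid[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]

def fuse_1x1(channels, weights, bias=0.0):
    """1x1 convolution with one output channel: a per-pixel weighted sum over input channels."""
    size = len(channels[0])
    return [[sum(w * ch[r][c] for w, ch in zip(weights, channels)) + bias
             for c in range(size)] for r in range(size)]

size = 4
image = [[float(r * size + c) for c in range(size)] for r in range(size)]

# Prompt maps at three scales (learnable in practice, fixed here).
global_prompt = [[0.5]]                       # 1x1: one value for the whole image
mid_prompt = [[1.0, 2.0], [3.0, 4.0]]         # 2x2: one value per quadrant
local_prompt = [[0.0] * size for _ in range(size)]  # full resolution

prompts = [upsample_nearest(p, size)
           for p in (global_prompt, mid_prompt, local_prompt)]

# Weight 1 on the image and 0 on the prompts reproduces the input exactly.
out_identity = fuse_1x1([image] + prompts, [1.0, 0.0, 0.0, 0.0])
# Small prompt weights perturb the input slightly.
out = fuse_1x1([image] + prompts, [1.0, 0.1, 0.1, 0.1])
```

In this toy setting the fusion layer has four weights and one bias, and the prompts hold 1 + 4 + 16 values, which gives a feel for why such a module adds very few parameters relative to a backbone.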

[107] ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

Qi'ao Xu,Tianwen Qian,Yuqian Fu,Kailing Li,Yang Jiao,Jiacheng Zhang,Xiaoling Wang,Liang He

Main category: cs.CV

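TL;DR: This paper presents ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos; it emphasizes task-oriented reasoning, explicit-implicit dual grounding, and one-to-many grounding, and aims to advance the integration of perception and interaction in embodied intelligence.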

Details

Motivation: Existing STVG research focuses mainly on object-descriptive instructions and overlooks the task-oriented reasoning that embodied agents need for goal-directed interaction, so a benchmark closer to real task requirements is needed. Method: ToG-Bench is built on ScanNet videos and contains 100 annotated clips and 2,704 task-oriented instructions, produced by a semi-automated pipeline that combines foundation-model annotation with human refinement; task-level evaluation metrics suited to multi-object and explicit-implicit grounding are also proposed. Result: Experiments show substantial performance gaps for current state-of-the-art MLLMs in explicit-implicit reasoning and multi-object grounding, confirming how challenging task-oriented STVG is. Conclusion: ToG-Bench provides a new benchmark for task-oriented spatio-temporal grounding, exposes the shortcomings of existing models in connecting visual perception with task reasoning, and advances embodied intelligence. Abstract: A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One-to-Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: https://github.com/qaxuDev/ToG-Bench.

[108] Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning

Ge-Peng Ji,Jingyi Liu,Deng-Ping Fan,Nick Barnes

Main category: cs.CV

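TL;DR: This study presents Colon-X, an open initiative to advance multimodal intelligence in colonoscopy. It builds the large-scale multimodal dataset ColonVQA, explores the shift from multimodal understanding to clinical reasoning, and proposes the ColonR1 model, which significantly improves reasoning performance under data-scarce conditions.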

Details

Motivation: Current multimodal large models are not robust or trustworthy enough for clinical colonoscopy, and clinical reasoning has not been studied systematically; a specialized dataset and models capable of clinical reasoning are needed. Method: Builds the ColonVQA dataset with more than 1.1 million visual question answering entries; evaluates 22 multimodal large language models under human-induced perturbations; builds the ColonReason reasoning dataset, annotated through a multi-expert debating pipeline; and proposes ColonR1, trained with task-adaptive rewarding and gradient-stable optimization. Result: The evaluation shows that existing MLLMs produce clinical outputs with poor robustness. Under data-scarce conditions, ColonR1 reaches 56.61% overall accuracy, 25.22% above supervised fine-tuning, and sets a new reasoning baseline for multimodal colonoscopy analysis. Conclusion: Colon-X lays an important foundation for multimodal intelligence in colonoscopy, and ColonR1 shows the potential of clinical reasoning to improve model performance, moving the field from perceptual understanding toward trustworthy clinical decision-making. Abstract: In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy - evolving from multimodal understanding to clinical reasoning: (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.

[109] ConvRot: Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers

Feice Huang,Zuliang Han,Xing Zhou,Yihuang Chen,Lifei Zhu,Haoqian Wang

Main category: cs.CV

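TL;DR: This paper proposes ConvRot, a group-wise rotation-based quantization method that uses the regular Hadamard transform (RHT) to suppress row-wise and column-wise outliers in diffusion transformers, and designs ConvLinear4bit, a plug-and-play module for W4A4 inference that substantially improves speed and memory efficiency without retraining.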

Details

Motivation: As diffusion transformers grow, memory footprint and inference latency become serious problems; existing rotation-based quantization methods handle row-wise outliers poorly, incur high computational overhead, and are hard to apply directly to diffusion models. Method: ConvRot applies group-wise rotation with a regular Hadamard transform to suppress row-wise and column-wise outliers, reducing complexity from quadratic to linear. The ConvLinear4bit module integrates rotation, quantization, GEMM, and dequantization to enable W4A4 inference without fine-tuning. Result: Experiments on FLUX.1-dev show a 2.26x speedup and a 4.05x memory reduction while preserving good image generation quality. Conclusion: This is the first work to apply rotation-based quantization to diffusion transformers for plug-and-play W4A4 inference, effectively addressing efficiency problems in deploying large models. Abstract: Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining and preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26$\times$ speedup and 4.05$\times$ memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
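Why a rotation helps 4-bit quantization can be seen on a toy vector with one outlier. The sketch below is illustrative only and makes several assumptions: it uses the standard Sylvester Hadamard matrix as a stand-in (not the paper's regular Hadamard matrix), symmetric per-group int4 quantization, and a group size of 4. The function names are hypothetical.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def rotate_groups(x, h):
    """Apply the same small rotation to each consecutive group of len(h) entries."""
    g = len(h)
    out = []
    for start in range(0, len(x), g):
        block = x[start:start + g]
        out.extend(sum(h[i][j] * block[j] for j in range(g)) for i in range(g))
    return out

def quantize_groups(x, g):
    """Symmetric 4-bit quantization with one scale per group; returns dequantized values."""
    out = []
    for start in range(0, len(x), g):
        block = x[start:start + g]
        scale = max(abs(v) for v in block) / 7.0 or 1.0
        out.extend(max(-8, min(7, round(v / scale))) * scale for v in block)
    return out

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

g = 4
h = hadamard(g)
x = [0.5, -0.4, 0.45, 8.0, 0.3, -0.2, 0.1, 0.25]  # 8.0 is the outlier

# Direct quantization: the outlier inflates the scale and flattens its neighbours.
err_direct = mse(x, quantize_groups(x, g))

# Rotate, quantize, rotate back. The normalized Sylvester matrix is symmetric
# and orthogonal, so it is its own inverse.
rotated = rotate_groups(x, h)
err_rotated = mse(x, rotate_groups(quantize_groups(rotated, g), h))
```

Because the rotation is block-diagonal with a fixed group size, its cost grows linearly with the vector length rather than quadratically as a full dense rotation would; that is one plausible reading of the complexity claim in this entry, not a statement of how the paper achieves it.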

[110] GaussianBlender: Instant Stylization of 3D Gaussians with Disentangled Latent Spaces

Melis Ocal,Xiaoyan Xing,Yue Li,Ngo Anh Vien,Sezer Karaoglu,Theo Gevers

Main category: cs.CV

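TL;DR: This paper presents GaussianBlender, a feed-forward framework for text-driven 3D stylization that delivers real-time, high-fidelity, multi-view consistent results and overcomes the limits of existing methods in large-scale production.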

Details

Motivation: Existing text-to-3D stylization methods depend on 2D image editors, require time-consuming per-asset optimization, and suffer from multi-view inconsistency, so they cannot meet large-scale production needs. Method: GaussianBlender learns structured, disentangled latent spaces from spatially grouped 3D Gaussians and applies text-conditioned edits with a latent diffusion model. Result: Experiments show that GaussianBlender performs edits instantly at inference and achieves high-fidelity, geometry-preserving, multi-view consistent stylization, outperforming methods that need test-time optimization. Conclusion: GaussianBlender offers a workable approach to practical large-scale 3D stylization and makes 3D content creation more accessible. Abstract: 3D stylization is central to game development, virtual reality, and digital arts, where the demand for diverse assets calls for scalable methods that support fast, high-fidelity manipulation. Existing text-to-3D stylization methods typically distill from 2D image editors, requiring time-intensive per-asset optimization and exhibiting multi-view inconsistency due to the limitations of current text-to-image models, which makes them impractical for large-scale production. In this paper, we introduce GaussianBlender, a pioneering feed-forward framework for text-driven 3D stylization that performs edits instantly at inference. Our method learns structured, disentangled latent spaces with controlled information sharing for geometry and appearance from spatially-grouped 3D Gaussians. A latent diffusion model then applies text-conditioned edits on these learned representations. Comprehensive evaluations show that GaussianBlender not only delivers instant, high-fidelity, geometry-preserving, multi-view consistent stylization, but also surpasses methods that require per-instance test-time optimization - unlocking practical, democratized 3D stylization at scale.

[111] Active Visual Perception: Opportunities and Challenges

Yian Li,Xiaoyu Guo,Hao Zhang,Shuiwang Li,Xiaowei Dai

Main category: cs.CV

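TL;DR: This paper surveys the opportunities and challenges of active visual perception, covering its ability to gather information through active interaction in complex environments and its potential in robotics, autonomous driving, and related fields.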

Details

Motivation: Active visual perception can overcome the lack of information that static perception systems face in complex, dynamic environments, improving a system's understanding of and response to its surroundings. Method: Reviews existing research and analyzes technical progress and methods in multimodal perception, real-time processing, and dynamic decision-making. Result: Summarizes the key application scenarios of active visual perception and identifies the main challenges: real-time processing, efficient decision-making, and multi-sensor fusion. Conclusion: Active visual perception has broad application prospects, but algorithmic efficiency, system integration, and robustness in real deployments must be improved before it can be widely adopted. Abstract: Active visual perception refers to the ability of a system to dynamically engage with its environment through sensing and action, allowing it to modify its behavior in response to specific goals or uncertainties. Unlike passive systems that rely solely on visual data, active visual perception systems can direct attention, move sensors, or interact with objects to acquire more informative data. This approach is particularly powerful in complex environments where static sensing methods may not provide sufficient information. Active visual perception plays a critical role in numerous applications, including robotics, autonomous vehicles, human-computer interaction, and surveillance systems. However, despite its significant promise, there are several challenges that need to be addressed, including real-time processing of complex visual data, decision-making in dynamic environments, and integrating multimodal sensory inputs. This paper explores both the opportunities and challenges inherent in active visual perception, providing a comprehensive overview of its potential, current research, and the obstacles that must be overcome for broader adoption.

[112] Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic, Interpretable, Perceptual Metric Between Images

Paula Seidler,Neill D. F. Campbell,Ivor J A Simpson

Main category: cs.CV

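TL;DR: This paper presents the Structured Uncertainty Similarity Score (SUSS), an image similarity metric that combines interpretability with perceptual alignment; it models perceptual components through generative self-supervised learning and produces scores closely aligned with human vision.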

Details

Motivation: Existing deep perceptual metrics such as LPIPS perform well but lack interpretability, while hand-crafted metrics such as SSIM are interpretable but model perceptual properties poorly; a method that combines both strengths is needed. Method: SUSS represents an image as structured multivariate Normal distributions over multiple perceptual components, trains them in a self-supervised generative manner to assign high likelihood to augmentations imperceptible to the human eye, and computes the final score as a weighted sum of component log-probabilities with weights learned from human perceptual data. Result: SUSS is well calibrated perceptually across diverse distortion types, provides localized and interpretable explanations of its similarity judgments, and shows stable optimization behavior and competitive performance when used as a perceptual loss for downstream tasks. Conclusion: SUSS is an image similarity metric that aligns closely with human perception, is highly interpretable, and is suitable for both training and evaluation, providing a transparent and reliable new tool for perceptual quality assessment. Abstract: Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties. We introduce the Structured Uncertainty Similarity Score (SUSS); it models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling. SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
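The score described in this entry, a weighted sum of component log-probabilities under structured Normal distributions, can be written compactly. The notation below is assumed for illustration and is not taken from the paper: $K$ perceptual components, $r_k(x, y)$ the residual of component $k$ between reference image $x$ and comparison image $y$, $\mu_k(x)$ and $\Sigma_k(x)$ the image-specific Normal parameters, and $w_k$ the weights learned from human perceptual data.

```latex
\mathrm{SUSS}(x, y) \;=\; \sum_{k=1}^{K} w_k \,
  \log \mathcal{N}\!\bigl( r_k(x, y) \,;\, \mu_k(x),\, \Sigma_k(x) \bigr),
\qquad
\log \mathcal{N}(r; \mu, \Sigma)
  = -\tfrac{1}{2} (r - \mu)^{\top} \Sigma^{-1} (r - \mu)
    - \tfrac{1}{2} \log \det (2 \pi \Sigma).
```

Under this reading, factoring $\Sigma_k = L_k L_k^{\top}$ gives a whitened residual $L_k^{-1}(r_k - \mu_k)$, which is one way to interpret the "decorrelated residuals" the abstract mentions for inspection.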

[113] DINO-RotateMatch: A Rotation-Aware Deep Framework for Robust Image Matching in Large-Scale 3D Reconstruction

Kaichen Zhang,Tianxiang Sheng,Xuanming Shi

Main category: cs.CV

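TL;DR: DINO-RotateMatch is an image matching framework for large-scale 3D reconstruction from Internet images; it combines DINO-based retrieval of semantically related image pairs with rotation-aware keypoint matching to improve matching accuracy.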

Details

Motivation: In large-scale 3D reconstruction from unstructured Internet images, traditional image matching methods face large viewpoint changes, rotation sensitivity, and low computational efficiency, so a more robust and scalable solution is needed. Method: The DINO-RotateMatch framework uses DINO for semantics-driven image pair retrieval, introduces a rotation-augmented, dataset-adaptive pairing strategy, and combines ALIKED with LightGlue for rotation-aware keypoint extraction and matching. Result: Validated on the Kaggle Image Matching Challenge 2025, where it significantly improves mean Average Accuracy (mAA) and earns a Silver Award (47th of 943). Conclusion: Combining self-supervised global descriptors with rotation-enhanced local matching effectively improves the robustness and scalability of image matching for large-scale 3D reconstruction. Abstract: This paper presents DINO-RotateMatch, a deep-learning framework designed to address the challenges of image matching in large-scale 3D reconstruction from unstructured Internet images. The method integrates a dataset-adaptive image pairing strategy with rotation-aware keypoint extraction and matching. DINO is employed to retrieve semantically relevant image pairs in large collections, while rotation-based augmentation captures orientation-dependent local features using ALIKED and LightGlue. Experiments on the Kaggle Image Matching Challenge 2025 demonstrate consistent improvements in mean Average Accuracy (mAA), achieving a Silver Award (47th of 943 teams). The results confirm that combining self-supervised global descriptors with rotation-enhanced local matching offers a robust and scalable solution for large-scale 3D reconstruction.

[114] PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention

Ziwen Li,Xin Wang,Hanlue Zhang,Runnan Chen,Runqi Lin,Xiao He,Han Huang,Yandong Guo,Fakhri Karray,Tongliang Liu,Mingming Gong

Main category: cs.CV

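TL;DR: This paper proposes PosA-VLA, an efficient framework that anchors visual attention through pose-conditioned supervision, improving the precision and efficiency of action generation by vision-language-action models in target-oriented tasks.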

Details

Motivation: Because of their spatially uniform perception field, existing VLA models are easily distracted by irrelevant objects, which produces redundant or unstable actions and makes real-time requirements hard to meet. Method: A pose-conditioned anchor attention mechanism keeps attention on task-relevant regions with a lightweight architecture and no additional perception modules, and strengthens the alignment between instruction semantics and visual cues. Result: Experiments show that the method executes behavior precisely, efficiently, and robustly across multiple robotic manipulation benchmarks and complex environments. Conclusion: By guiding perceptual attention, PosA-VLA markedly improves the action consistency and real-time performance of VLA models and shows good potential for practical use. Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.

[115] Out-of-the-box: Black-box Causal Attacks on Object Detectors

Melane Navaratnarajah,David A. Kelly,Hana Chockler

Main category: cs.CV

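TL;DR: This paper presents BlackCAtt, a black-box attack method that uses causal pixel sets to mount explainable, imperceptible, and reproducible attacks on object detectors of different architectures.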

Details

Motivation: Existing adversarial perturbation methods are mostly white-box and architecture-specific, and the reasons for their effectiveness are poorly understood, which limits progress on defenses. Method: BlackCAtt identifies minimal, causally sufficient pixel sets and combines them with the bounding boxes produced by object detectors to construct black-box attacks that remove, modify, or add detections. Result: On the COCO test set, BlackCAtt is 2.7 times better than the baseline at removing a detection, 3.86 times better at changing a detection, and 5.75 times better at triggering spurious detections; the attacked images are very close to the originals and hard to notice. Conclusion: Attacks based on causal pixels are more precise and less perceptible. BlackCAtt generalizes across models and architectures, sheds light on why adversarial attacks work, and can inform the design of later defenses. Abstract: Adversarial perturbations are a useful way to expose vulnerabilities in object detectors. Existing perturbation methods are frequently white-box and architecture specific. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and analyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and a tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. BlackCAtt combines causal pixels with bounding boxes produced by object detectors to create adversarial attacks that lead to the loss, modification or addition of a bounding box. BlackCAtt works across different object detectors of different sizes and architectures, treating the detector as a black box. We compare the performance of BlackCAtt with other black-box attack methods and show that identification of causal pixels leads to more precisely targeted and less perceptible attacks. On the COCO test dataset, our approach is 2.7 times better than the baseline in removing a detection, 3.86 times better in changing a detection, and 5.75 times better in triggering new, spurious, detections. The attacks generated by BlackCAtt are very close to the original image, and hence imperceptible, demonstrating the power of causal pixels.

[116] Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification

Jiaze Li,Yan Lu,Bin Liu,Guojun Yin,Mang Ye

Main category: cs.CV

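TL;DR: Proposes a Dual-level Modality Debiasing Learning (DMDL) framework that uses a causality-inspired adjustment intervention module and a collaborative bias-free training strategy to mitigate modality bias in unsupervised visible-infrared person re-identification.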

Details

Motivation: In the existing two-stage learning pipeline, modality-specific bias introduced during single-modality learning propagates into the cross-modality learning stage, harming identity discrimination and generalization. Method: At the model level, a Causality-inspired Adjustment Intervention (CAI) module replaces likelihood-based modeling with causal modeling. At the optimization level, a Collaborative Bias-free Training (CBT) strategy combines modality-specific augmentation, label refinement, and feature alignment to block bias propagation. Result: Experiments on multiple benchmark datasets confirm the effectiveness of DMDL, which learns modality-invariant features and significantly improves generalization. Conclusion: Through debiasing at both the model and optimization levels, DMDL effectively mitigates modality bias in USL-VI-ReID and offers a new direction for cross-modality learning. Abstract: The two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID). It first performs single-modality learning and then performs cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned in the single-modality training naturally propagate into the following cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced, leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL enables modality-invariant feature learning and a more generalized model.

[117] Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models

Korada Sri Vardhana,Shrikrishna Lolla,Soma Biswas

Main category: cs.CV

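TL;DR: This paper presents SelfDebias, an unsupervised test-time debiasing method for text-to-image diffusion models that use a UNet as the noise predictor. It identifies semantic clusters in an image encoder's embedding space to guide the diffusion process, minimizing the KL divergence between the output distribution and the uniform distribution. It needs no human-annotated data or external classifiers, and it reduces several kinds of bias while preserving image quality.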

Details

Motivation: Large-scale training data such as LAION-5B contains many biases, and existing text-to-image diffusion models learn and reproduce them, yielding stereotypical outputs. A general debiasing method that needs no supervision is therefore required. Method: At inference time, SelfDebias automatically identifies semantic clusters in an image encoder's embedding space and uses them to guide the diffusion process, debiasing by minimizing the KL divergence between the output distribution and the uniform distribution. The procedure needs no human annotation or extra classifiers and is fully unsupervised. Result: Experiments show that SelfDebias generalizes well across prompts and model architectures (both conditional and unconditional), and effectively reduces bias along key demographic dimensions and for abstract concepts while preserving visual fidelity. Conclusion: SelfDebias is a general, fully unsupervised test-time debiasing method that mitigates multiple biases in diffusion models without sacrificing image quality and has good application prospects. Abstract: Text-to-image (T2I) diffusion models have achieved widespread success due to their ability to generate high-resolution, photorealistic images. These models are trained on large-scale datasets, like LAION-5B, often scraped from the internet. However, since this data contains numerous biases, the models inherently learn and reproduce them, resulting in stereotypical outputs. We introduce SelfDebias, a fully unsupervised test-time debiasing method applicable to any diffusion model that uses a UNet as its noise predictor. SelfDebias identifies semantic clusters in an image encoder's embedding space and uses these clusters to guide the diffusion process during inference, minimizing the KL divergence between the output distribution and the uniform distribution. Unlike supervised approaches, SelfDebias does not require human-annotated datasets or external classifiers trained for each generated concept. Instead, it is designed to automatically identify semantic modes. Extensive experiments show that SelfDebias generalizes across prompts and diffusion model architectures, including both conditional and unconditional models. It effectively debiases images not only along key demographic dimensions, while maintaining the visual fidelity of the generated images, but also along more abstract concepts for which identifying biases is itself challenging.
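The objective named in this entry, the KL divergence between the output distribution and the uniform distribution, is easy to compute once generated images have been assigned to semantic clusters. The sketch below only illustrates that quantity under an assumption: the output distribution is estimated from hard cluster counts. It does not reproduce how SelfDebias finds clusters or steers the diffusion process, and `kl_to_uniform` is a hypothetical name.

```python
import math

def kl_to_uniform(counts):
    """KL(p || u), where p is the empirical cluster distribution and u is uniform.

    With k clusters, KL(p || u) = sum_i p_i * log(p_i * k).
    Empty clusters contribute zero (the limit of p * log p as p -> 0).
    """
    total = sum(counts)
    k = len(counts)
    kl = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            kl += p * math.log(p * k)
    return kl

balanced = kl_to_uniform([5, 5])   # outputs spread evenly over two clusters
skewed = kl_to_uniform([9, 1])     # outputs concentrated in one cluster
collapsed = kl_to_uniform([10, 0]) # every output in the same cluster
```

The value is zero for a perfectly balanced batch and grows up to log(k) when all outputs fall into one cluster, so pushing it toward zero corresponds to spreading generations evenly across the discovered semantic modes.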

[118] Research on Brain Tumor Classification Method Based on Improved ResNet34 Network

Yufeng Li,Wenchao Zhao,Bo Dang,Weimin Wang

Main category: cs.CV

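TL;DR: This paper proposes a brain tumor classification model based on an improved ResNet34 network. It adds multi-scale feature extraction and a channel attention mechanism, improving classification accuracy while reducing the number of model parameters.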

Details

Motivation: To improve the efficiency and accuracy of brain tumor medical image classification, overcoming the time and labor cost of traditional manual methods and the limited accuracy of shallow convolutional neural networks. Method: With ResNet34 as the backbone, the model uses a multi-scale input module as the first layer and an Inception v2 module as the residual downsampling layer, and adds a channel attention mechanism that weights feature maps along the channel dimension to strengthen key features. Result: Five-fold cross-validation shows an average classification accuracy of about 98.8% for the improved model, 1% higher than the original ResNet34, with only 80% of the original model's parameters. Conclusion: The improved ResNet34 model raises classification accuracy while reducing parameter count, giving more efficient and more accurate brain tumor image classification. Abstract: Previously, image interpretation in radiology relied heavily on manual methods. However, manual classification of brain tumor medical images is time-consuming and labor-intensive. Even with shallow convolutional neural network models, the accuracy is not ideal. To improve the efficiency and accuracy of brain tumor image classification, this paper proposes a brain tumor classification model based on an improved ResNet34 network. This model uses the ResNet34 residual network as the backbone network and incorporates multi-scale feature extraction. It uses a multi-scale input module as the first layer of the ResNet34 network and an Inception v2 module as the residual downsampling layer. Furthermore, a channel attention mechanism module assigns different weights to different channels of the image from a channel domain perspective, obtaining more important feature information. The results of five-fold cross-validation show that the average classification accuracy of the improved network model is approximately 98.8%, which is 1% higher than ResNet34, while the model uses only 80% of the parameters of the original model. Therefore, the improved network model both improves accuracy and reduces model size, achieving better classification with fewer parameters.

[119] LSRS: Latent Scale Rejection Sampling for Visual Autoregressive Modeling

Hong-Kai Zheng,Piji Li

Main category: cs.CV

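TL;DR: Proposes Latent Scale Rejection Sampling (LSRS), which progressively refines token maps at the latent scales during inference to improve the generation quality of visual autoregressive (VAR) models, lowering FID with only a small computational overhead.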

Details

Motivation: VAR models generate images efficiently by decoding multiple scales in parallel, but parallel sampling within a scale can introduce structural errors that hurt generation quality, so autoregressive error accumulation needs to be mitigated. Method: LSRS uses a lightweight scoring model to evaluate multiple candidate token maps generated at each scale, selects a high-quality map for subsequent scale generation, and prioritizes the early scales that matter most for structural consistency. Result: On VAR-d30, FID drops from 1.95 to 1.78 with only a 1% increase in inference time, and further to 1.66 with a 15% increase. Conclusion: LSRS is an efficient test-time scaling method that improves the image generation quality of VAR models, especially structural coherence, at very little additional compute cost. Abstract: The Visual Autoregressive (VAR) modeling approach for image generation proposes autoregressive processing across hierarchical scales, decoding multiple tokens per scale in parallel. This method achieves high-quality generation while accelerating synthesis. However, parallel token sampling within a scale may lead to structural errors, resulting in suboptimal generated images. To mitigate this, we propose Latent Scale Rejection Sampling (LSRS), a method that progressively refines token maps in the latent scale during inference to enhance VAR models. Our method uses a lightweight scoring model to evaluate multiple candidate token maps sampled at each scale, selecting the high-quality map to guide subsequent scale generation. By prioritizing early scales critical for structural coherence, LSRS effectively mitigates autoregressive error accumulation while maintaining computational efficiency. Experiments demonstrate that LSRS significantly improves VAR's generation quality with minimal additional computational overhead. For the VAR-d30 model, LSRS increases the inference time by merely 1% while reducing its FID score from 1.95 to 1.78. When the inference time is increased by 15%, the FID score can be further reduced to 1.66. LSRS offers an efficient test-time scaling solution for enhancing VAR-based generation.
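The per-scale selection loop described in the abstract (sample several candidate token maps at a scale, score them, keep the best one to condition the next scale) can be sketched roughly as follows. Everything here is a toy stand-in: `select_per_scale`, `toy_sample`, and `toy_score` are hypothetical names, the sampler draws random integers instead of running a VAR model, and the scorer is a plain average rather than the paper's learned scoring model. Spending more candidates on early scales mirrors the stated priority on early scales, but the specific counts are assumptions.

```python
import random

def select_per_scale(sample_fn, score_fn, candidates_per_scale, seed=0):
    """Keep the best-scoring candidate token map at each scale, coarse to fine."""
    rng = random.Random(seed)
    chosen = []  # token maps selected so far; later scales condition on these
    for scale, num_candidates in enumerate(candidates_per_scale):
        candidates = [sample_fn(scale, chosen, rng) for _ in range(num_candidates)]
        chosen.append(max(candidates, key=score_fn))
    return chosen

def toy_sample(scale, prefix, rng):
    # Hypothetical sampler: a (scale+1) x (scale+1) token map, flattened
    side = scale + 1
    return [rng.randint(0, 9) for _ in range(side * side)]

def toy_score(token_map):
    # Hypothetical scorer standing in for a learned quality model
    return sum(token_map) / len(token_map)

# More candidates at early scales, fewer at later ones
maps = select_per_scale(toy_sample, toy_score, candidates_per_scale=[4, 2, 1])
```

Because candidates are drawn only where the count exceeds one, the extra cost is controlled entirely by `candidates_per_scale`, which is one way to trade inference time against quality.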

[120] HieroGlyphTranslator: Automatic Recognition and Translation of Egyptian Hieroglyphs to English

Ahmed Nasser,Marwan Mohamed,Alaa Sherif,Basmala Mahmoud,Shereen Yehia,Asmaa Saad,Mariam S. El-Rahmany,Ensaf H. Mohamed

Main category: cs.CV

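TL;DR: Proposes a deep-learning method for automatically recognizing and translating ancient Egyptian hieroglyphs. It works in three stages (segmentation, symbol mapping, and translation) using Contour, Detectron2, and a CNN model, and reaches a BLEU score of 42.2, better than previous work.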

Details

Motivation: Translating ancient Egyptian hieroglyphs is difficult, in part because a single glyph can have multiple meanings. Existing methods have limited effectiveness, so more efficient automated translation is needed. Method: A three-stage approach: glyph segmentation with Contour and Detectron2, mapping of symbols to Gardiner codes, and image-to-English translation with a CNN model. Result: In experiments on the Morris Franken and EgyptianTranslation datasets, the model achieved a BLEU score of 42.2, clearly better than earlier results. Conclusion: The method improves automatic translation of hieroglyph images into English and shows the potential of deep learning for translating ancient scripts. Abstract: Egyptian hieroglyphs, the ancient Egyptian writing system, are composed entirely of drawings. Translating these glyphs into English poses various challenges, including the fact that a single glyph can have multiple meanings. Deep learning translation applications are evolving rapidly, producing remarkable results that significantly impact our lives. In this research, we propose a method for the automatic recognition and translation of ancient Egyptian hieroglyphs from images to English. This study utilized two datasets for classification and translation: the Morris Franken dataset and the EgyptianTranslation dataset. Our approach is divided into three stages: segmentation (using Contour and Detectron2), mapping symbols to Gardiner codes, and translation (using a CNN model). The model achieved a BLEU score of 42.2, a significant result compared to previous research.

[121] A Robust Camera-based Method for Breath Rate Measurement

Alexey Protopopov

Main category: cs.CV

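TL;DR: This paper proposes a new video-based method for measuring human breath rate that combines mathematical transforms to achieve high accuracy, with a mean absolute error of only 0.57 respirations per minute, a relative deviation below 5%, and strong robustness to subject movement.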

Details

Motivation: Existing video-based breath rate methods are mostly tested under ideal conditions or are not accurate enough, which makes them hard to apply in real settings. A more robust, accurate method with low hardware requirements is needed. Method: The method combines several mathematical transforms to extract the breathing signal from video and uses filtering and signal processing to suppress motion interference before estimating the breath rate. Result: Tested on more than 2.5 hours of video from 14 volunteers, the method reached a mean absolute error of 0.57 respirations per minute and a relative deviation below 5% against reference data, outperforming previous methods. Conclusion: The method enables accurate remote breath rate monitoring while subjects move freely, which makes it practical and broadly applicable. Abstract: Proliferation of cheap and accessible cameras makes it possible to measure a subject's breath rate from video footage alone. Recent works on this topic have proposed a variety of approaches for accurately measuring human breath rate; however, they are either tested in near-ideal conditions or produce results that are not sufficiently accurate. The present study proposes a more robust method to measure breath rate in humans with minimal hardware requirements, using a combination of mathematical transforms, with a relative deviation from the ground truth of less than 5%. The method was tested on videos taken from 14 volunteers with a total duration of over 2 hours 30 minutes. The obtained results were compared to reference data and the average mean absolute error was found to be 0.57 respirations per minute, which is noticeably better than the results from previous works. The breath rate measurement method proposed in the present article is more resistant to distortions caused by subject movement and thus allows one to remotely measure the subject's breath rate without any significant limitations on the subject's behavior.

[122] Lean Unet: A Compact Model for Image Segmentation

Ture Hassler,Ida Åkerholm,Marcus Nordström,Gabriele Balletti,Orcun Goksel

Main category: cs.CV

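TL;DR: Proposes a lightweight Unet architecture (LUnet) that keeps the channel count constant and simplifies the hierarchy, greatly reducing the parameter count while matching or exceeding standard Unet and adaptively pruned networks.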

Details

Motivation: The high memory footprint of existing Unet architectures limits batch size and increases inference latency. Channel pruning can compress the model, but its optimization is slow and generalizes poorly, so a more efficient and general compression approach is needed. Method: LUnet uses a compact, flat hierarchy in which channels are no longer doubled as resolution is halved. The authors also show that the final structure affects performance more than the pruning strategy itself, and evaluate on a public MRI dataset and internal CT datasets. Result: With over 30 times fewer parameters, LUnet performs comparably to or better than conventional Unet and adaptively pruned networks. The constant-channel structure outperforms standard Unet at the same total parameter count, and skip connections allow the bottleneck channels to be greatly reduced. Conclusion: The final network structure matters more than the pruning strategy. With its simple design, LUnet achieves efficient semantic segmentation at a very low parameter count and suits medical image analysis tasks. Abstract: Unet and its variations have been standard in semantic image segmentation, especially for computer assisted radiology. Current Unet architectures iteratively downsample spatial resolution while increasing channel dimensions to preserve information content. Such a structure demands a large memory footprint, limiting training batch sizes and increasing inference latency. Channel pruning compresses Unet architecture without accuracy loss, but requires lengthy optimization and may not generalize across tasks and datasets. By investigating Unet pruning, we hypothesize that the final structure is the crucial factor, not the channel selection strategy of pruning. Based on our observations, we propose a lean Unet architecture (LUnet) with a compact, flat hierarchy where channels are not doubled as resolution is halved. We evaluate on a public MRI dataset allowing comparable reporting, as well as on two internal CT datasets. We show that a state-of-the-art pruning solution (STAMP) mainly prunes from the layers with the highest number of channels. Comparatively, simply eliminating a random channel at the pruning-identified layer or at the largest layer achieves similar or better performance. Our proposed LUnet with fixed architectures and over 30 times fewer parameters achieves performance comparable to both conventional Unet counterparts and data-adaptively pruned networks. The proposed lean Unet with constant channel count across layers requires far fewer parameters while achieving performance superior to standard Unet for the same total number of parameters.
Skip connections allow Unet bottleneck channels to be largely reduced, unlike standard encoder-decoder architectures requiring increased bottleneck channels for information propagation.
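To see why a flat hierarchy saves so many parameters, the following Python sketch counts the weights of a toy encoder with two 3x3 convolutions per level, once with channels doubling at each level and once with a constant width. The channel widths (64 up to 1024, versus a constant 64), the single input channel, and the helper names are illustrative assumptions, not the configurations reported for LUnet, and the resulting ratio is not meant to reproduce the paper's 30x figure.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return c_in * c_out * k * k + c_out

def encoder_params(channels, in_channels=1):
    """Total parameters for an encoder with two 3x3 convs per resolution level."""
    total, c_prev = 0, in_channels
    for c in channels:
        total += conv_params(c_prev, c) + conv_params(c, c)
        c_prev = c
    return total

# Assumed widths: a conventional doubling schedule vs. a flat one
doubling = encoder_params([64, 128, 256, 512, 1024])
flat = encoder_params([64, 64, 64, 64, 64])
ratio = doubling / flat
```

Convolution parameters scale with the product of input and output channels, so the deepest, widest levels dominate the total in the doubling schedule; keeping the width constant removes exactly those terms.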

[123] Heatmap Pooling Network for Action Recognition from RGB Videos

Mengyuan Liu,Jinfu Liu,Yongkang Jiang,Bin He

Main category: cs.CV

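TL;DR: Proposes HP-Net, a new heatmap pooling network for human action recognition in videos. A feedback pooling module extracts information-rich, robust, and concise features, which are combined with multimodal data to improve recognition, achieving superior results on several benchmarks.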

Details

Motivation: Existing RGB-video action recognition methods suffer from information redundancy, susceptibility to noise, and high storage costs, so useful information needs to be extracted more efficiently. Method: HP-Net includes a feedback pooling module that extracts compact heatmap-pooled features of human body regions, plus a spatial-motion co-learning module and a text refinement modulation module that fuse multimodal information for action recognition. Result: The method is validated on NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome, and UAV-Human, where it outperforms existing action recognition methods. Conclusion: HP-Net extracts robust, compact features and, combined with multimodal data, improves the accuracy and stability of video action recognition. Abstract: Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate obvious performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks, namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome and UAV-Human, consistently verify the effectiveness of our HP-Net, which outperforms the existing human action recognition methods. Our code is publicly available at: https://github.com/liujf69/HPNet-Action.

[124] CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation

Letian Zhou,Songhua Liu,Xinchao Wang

Main category: cs.CV

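TL;DR: This paper proposes Core Distribution Alignment (CoDA), a dataset distillation framework that uses an off-the-shelf text-to-image model. Without pre-training a generative model on the target dataset, it discovers and aligns with the target dataset's "intrinsic core distribution" to achieve efficient, high-performing dataset distillation.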

Details

Motivation: Existing dataset distillation (DD) methods either rely on a diffusion model pre-trained on the full target dataset, which defeats the purpose of DD and is costly, or use general text-to-image models that suffer from distribution mismatch and perform poorly. A new method is needed that avoids target-specific training while still capturing the target data's semantics. Method: CoDA first identifies the target dataset's "intrinsic core distribution" with a robust density-based mechanism, then steers the text-to-image model during generation so that its samples align with this core distribution, bridging the gap between general priors and target-specific semantics. Result: Without target-specific training, CoDA matches or exceeds earlier methods that rely on such training across multiple benchmarks, including ImageNet-1K and its subsets, and sets a new state of the art on ImageNet-1K with 60.4% accuracy at 50 images per class. Conclusion: CoDA achieves efficient dataset distillation using only an off-the-shelf text-to-image model, resolving the tension between cost and distribution matching in earlier methods and offering a new paradigm for dataset compression in low-resource settings. Abstract: Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets.
Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: https://github.com/zzzlt422/CoDA

[125] PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation

Hania Ghouse,Maryam Alsharqi,Farhad R. Nezami,Muzammil Behzad

Main category: cs.CV

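TL;DR: PULSE is a multi-task vision-language framework that unifies anatomical segmentation, disease classification, and clinical report generation for cardiac image analysis, using self-supervised representations and a composite supervision strategy to generalize strongly across modalities and datasets.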

Details

Motivation: Existing cardiac image analysis methods are fragmented across tasks, and no framework unifies anatomical segmentation, disease classification, and clinical report generation while generalizing across modalities. Method: PULSE builds on self-supervised representations and a composite supervision strategy (region overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement). A multi-scale token reconstruction decoder performs segmentation, while shared global representations support classification and text generation. Result: PULSE shows strong segmentation accuracy, classification performance, and clinical report generation on multiple cardiac image datasets, and adapts well to new modalities with little supervision. Conclusion: PULSE provides unified multi-task modeling for cardiac image analysis, learns task-invariant cardiac priors, and advances scalable, foundation-style cardiac analysis frameworks. Abstract: Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output, allowing the model to transition from pixels to structures and finally to clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation-style cardiac analysis framework.

[126] Traffic Image Restoration under Adverse Weather via Frequency-Aware Mamba

Liwen Pan,Longguang Wang,Guangwei Gao,Jun Wang,Jun Shi,Juncheng Li

Main category: cs.CV

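TL;DR: This paper proposes Frequency-Aware Mamba (FAMamba), a new framework that combines frequency-domain guidance with sequence modeling to restore traffic images degraded by adverse weather.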

Details

Motivation: Existing methods focus mainly on spatial-domain modeling and neglect frequency-domain priors. The Mamba architecture excels at modeling long-range dependencies, but its potential for frequency-domain feature extraction has not been explored. Method: FAMamba comprises a Dual-Branch Feature Extraction Block (DFEB) and a Prior-Guided Block (PGB), together with an Adaptive Frequency Scanning Mechanism (AFSM) that dynamically adjusts scan paths according to sub-band texture distributions and supports high-frequency residual learning. Result: Experiments show that FAMamba is efficient and effective for image restoration and reconstructs high-quality images with precise detail. Conclusion: By fusing frequency-domain priors with the Mamba architecture, FAMamba improves restoration of traffic images under adverse weather and offers a new option for future intelligent transportation systems. Abstract: Traffic image restoration under adverse weather conditions remains a critical challenge for intelligent transportation systems. Existing methods primarily focus on spatial-domain modeling but neglect frequency-domain priors. Although the emerging Mamba architecture excels at long-range dependency modeling through patch-wise correlation analysis, its potential for frequency-domain feature extraction remains unexplored. To address this, we propose Frequency-Aware Mamba (FAMamba), a novel framework that integrates frequency guidance with sequence modeling for efficient image restoration. Our architecture consists of two key components: (1) a Dual-Branch Feature Extraction Block (DFEB) that enhances local-global interaction via bidirectional 2D frequency-adaptive scanning, dynamically adjusting traversal paths based on sub-band texture distributions; and (2) a Prior-Guided Block (PGB) that refines texture details through wavelet-based high-frequency residual learning, enabling high-quality image reconstruction with precise details. Meanwhile, we design a novel Adaptive Frequency Scanning Mechanism (AFSM) for the Mamba architecture, which enables the Mamba to achieve frequency-domain scanning across distinct subgraphs, thereby fully leveraging the texture distribution characteristics inherent in subgraph structures. Extensive experiments demonstrate the efficiency and effectiveness of FAMamba.

[127] Prostate biopsy whole slide image dataset from an underrepresented Middle Eastern population

Peshawa J. Muhammad Ali,Navin Vincent,Saman S. Abdulla,Han N. Mohammed Fadhl,Anders Blilie,Kelvin Szolnoky,Julia Anna Mielcarz,Xiaoyi Ji,Kimmo Kartasalo,Abdulbasit K. Al-Talabani,Nita Mulliqi

Main category: cs.CV

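TL;DR: This paper presents a dataset of 339 whole-slide images of prostate needle biopsies from Erbil, Iraq, intended to support the development and validation of pathology AI models across globally diverse populations.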

Details

Motivation: Existing public pathology datasets mainly represent Western populations and under-represent regions such as the Middle East, which limits the generalizability of AI models. More diverse datasets are needed to improve global applicability. Method: The authors collected 339 whole-slide images from 185 patients, with Gleason scores and ISUP grades assigned independently by three pathologists. Slides were scanned on three different scanners (Leica, Hamamatsu, and Grundium), and all data are de-identified and provided in native formats. Result: The dataset supports evaluation of inter-pathologist grading concordance, color normalization methods, and cross-scanner model robustness. Data will be deposited in the Bioimage Archive and released under a CC BY 4.0 license. Conclusion: The dataset fills a gap in digital pathology data from the Middle East and helps improve the generalizability and fairness of AI models across diverse global populations. Abstract: Artificial intelligence (AI) is increasingly used in digital pathology. Publicly available histopathology datasets remain scarce, and those that do exist predominantly represent Western populations. Consequently, the generalizability of AI models to populations from less digitized regions, such as the Middle East, is largely unknown. This motivates the public release of our dataset to support the development and validation of pathology AI models across globally diverse populations. We present 339 whole-slide images of prostate core needle biopsies from a consecutive series of 185 patients collected in Erbil, Iraq. The slides are associated with Gleason scores and International Society of Urological Pathology grades assigned independently by three pathologists. Scanning was performed using two high-throughput scanners (Leica and Hamamatsu) and one compact scanner (Grundium). All slides were de-identified and are provided in their native formats without further conversion. The dataset enables grading concordance analyses, color normalization, and cross-scanner robustness evaluations. Data will be deposited in the Bioimage Archive (BIA) under accession code: to be announced (TBA), and released under a CC BY 4.0 license.

[128] Diminishing Returns in Self-Supervised Learning

Oli Bridge,Huey Sun,Botond Branyicskai-Nagy,Charles D'Ornano,Shomit Basu

Main category: cs.CV

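TL;DR: Small-scale vision transformers benefit from targeted pre-training, but intermediate fine-tuning can hurt downstream performance because of differences in task mechanics.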

Details

Motivation: To explore the marginal benefits of different pre-training and fine-tuning strategies for small-scale vision transformers. Method: Experiments with three distinct pre-training, intermediate fine-tuning, and downstream datasets, analyzing their effect on a 5M-parameter ViT. Result: Pre-training and fine-tuning help but with diminishing returns, and intermediate fine-tuning can harm performance when tasks are mismatched. Conclusion: Small-scale ViTs need careful choice of data and training pipeline, avoiding ineffective or harmful stacking of intermediate tasks. Abstract: While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast amount of parameters and training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that pre-training and fine-tuning always help our model but have diminishing returns, whereas intermediate fine-tuning can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.

[129] An Automated Framework for Large-Scale Graph-Based Cerebrovascular Analysis

Daniele Falcetta,Liane S. Canas,Lorenzo Suppa,Matteo Pentassuglia,Jon Cleary,Marc Modat,Sébastien Ourselin,Maria A. Zuluaga

Main category: cs.CV

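TL;DR: CaravelMetrics is a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations and supports multiscale extraction of cerebrovascular features.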

Details

Motivation: To enable quantitative, reproducible, multiscale analysis of cerebrovascular structure, particularly for population-level studies of vascular health and aging. Method: The framework combines atlas-based regional parcellation, centerline extraction, and graph construction to build vessel graphs from 3D TOF-MRA scans and computes 15 morphometric, topological, fractal, and geometric features, either globally or by region. Result: Applied to 570 subjects from the IXI dataset, the framework captured age- and sex-related vascular variation and a positive association between education and vascular complexity, with good reproducibility. Conclusion: CaravelMetrics offers a scalable, fully automated approach to cerebrovascular feature extraction, suited to normative modeling of vascular health and large population studies. Abstract: We present CaravelMetrics, a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations. The framework integrates atlas-based regional parcellation, centerline extraction, and graph construction to compute fifteen morphometric, topological, fractal, and geometric features. The features can be estimated globally from the complete vascular network or regionally within arterial territories, enabling multiscale characterization of cerebrovascular organization. Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), CaravelMetrics yields reproducible vessel graphs capturing age- and sex-related variations and education-associated increases in vascular complexity, consistent with findings reported in the literature. The framework provides a scalable and fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.

[130] Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy

Jorge Tapias Gomez,Despoina Kanata,Aneesh Rangnekar,Christina Lee,Julio Garcia-Aguilar,Joshua Jesse Smith,Harini Veeraraghavan

Main category: cs.CV

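TL;DR: Proposes SSDCA, a Siamese Swin Transformer with dual cross-attention, which uses longitudinal endoscopic images to distinguish clinical complete response (cCR) from local regrowth (LR) in rectal cancer patients after total neoadjuvant treatment, with strong and robust performance on a test set of 62 patients.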

Details

Motivation: An objective, accurate method is needed to detect local regrowth (LR) early from follow-up endoscopy images under a watch-and-wait strategy, in order to prevent distant metastases and guide clinical management. Method: SSDCA combines Swin Transformer feature extraction with a dual cross-attention mechanism that fuses endoscopic images from two time points and predicts treatment response without spatial alignment. The model was trained on data from 135 patients and evaluated on an independent set of 62 patients. Result: On the test set SSDCA reached a balanced accuracy of 81.76% ± 0.04, sensitivity of 90.07% ± 0.08, and specificity of 72.86% ± 0.05, outperforming baselines. UMAP analysis showed good inter-class separation and intra-class compactness, and the model was robust to artifacts such as blood and stool. Conclusion: SSDCA distinguishes cCR from LR effectively and has potential clinical use for non-invasive, accurate monitoring of rectal cancer patients under watch-and-wait. Abstract: Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objective and accurate methods for early detection of local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain agnostic features and enhance robustness to imaging variations. Dual cross attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% ± 0.04), sensitivity (90.07% ± 0.08), and specificity (72.86% ± 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 ± 0.18) and minimal intra-cluster dispersion (1.07 ± 0.19) with SSDCA, confirming discriminative representation learning.

[131] Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence

Shuai Yang,Junxin Lin,Yifan Zhou,Ziwei Liu,Chen Change Loy

Main category: cs.CV

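TL;DR: This paper proposes FRESCO, which combines intra-frame and inter-frame correspondence to strengthen spatial-temporal consistency in zero-shot video generation, improving the visual coherence of video editing and translation.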

Details

Motivation: Existing zero-shot text-to-video diffusion approaches constrain cross-frame correspondence only weakly in their attention mechanisms, causing temporal inconsistency that degrades the coherence of generated videos. Method: FRESCO combines intra-frame and inter-frame correspondence into a stronger spatial-temporal constraint and explicitly optimizes features, going beyond attention guidance alone. Result: The method is validated on two zero-shot tasks, video-to-video translation and text-guided video editing, where experiments show higher quality and better spatial-temporal consistency. Conclusion: With a more robust spatial-temporal constraint, FRESCO clearly outperforms existing zero-shot methods and advances training-free video generation. Abstract: The remarkable success in text-to-image diffusion models has motivated extensive investigation of their potential for video applications. Zero-shot techniques aim to adapt image diffusion models for videos without requiring further model training. Recent methods largely emphasize integrating inter-frame correspondence into attention mechanisms. However, the soft constraint applied to identify the valid features to attend is insufficient, which could lead to temporal inconsistency. In this paper, we present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. This enhancement ensures a consistent transformation of semantically similar content between frames. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video, significantly enhancing the visual coherence of manipulated videos. We verify FRESCO adaptations on two zero-shot tasks of video-to-video translation and text-guided video editing. Comprehensive experiments demonstrate the effectiveness of our framework in generating high-quality, coherent videos, highlighting a significant advance over current zero-shot methods.

[132] UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework

Youxin Pang,Yong Zhang,Ruizhi Shao,Xiang Deng,Feng Gao,Xu Xiaoming,Xiaoming Wei,Yebin Liu

Main category: cs.CV

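TL;DR: This paper proposes UniMo, an autoregressive model that for the first time jointly models 2D human videos and 3D human motions in a unified framework, able to both generate and understand the two modalities.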

Details

Motivation: Existing methods mostly generate one modality conditioned on the other, or combine one of them with other modalities such as text or audio. Unified modeling of 2D video and 3D motion remains largely unexplored because of their substantial structural and distributional differences. Method: Inspired by the multimodal fusion ability of large language models, UniMo models 2D video and 3D motion as a unified token sequence with separate embedding layers to reduce the distribution gap. It uses a sequence modeling strategy that integrates two distinct tasks in one framework, and a new 3D motion tokenizer in which a single VQ-VAE produces quantized motion tokens and multiple expert decoders handle body shape, translation, global orientation, and body pose for reliable 3D motion reconstruction. Result: Extensive experiments show that the method generates corresponding videos and motions simultaneously and performs accurate motion capture, with strong generation quality and cross-modal consistency. Conclusion: The work taps the ability of large language models to fuse diverse data types, offers a new route for integrating human-centric information into existing models, and may advance multimodal, controllable joint modeling of humans, objects, and scenes. Abstract: We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the LLM's ability to unify different modalities, our method models videos and 3D motions as a unified token sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, proving the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shapes, translation, global orientation, and body poses for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.

[133] Beyond the Ground Truth: Enhanced Supervision for Image Restoration

Donghun Ryou,Inju Ha,Sanghyeok Chu,Bohyung Han

Main category: cs.CV

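TL;DR: Proposes a method that strengthens the supervision signal for real-world image restoration by fusing super-resolved images with the ground truth through adaptive frequency-domain masks, improving the performance of existing models.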

Details

Motivation: Because of practical constraints in data acquisition, the ground-truth images in real-world restoration datasets are of limited quality, which caps model performance. A way to enhance these ground-truth images and provide better supervision is therefore needed. Method: A new framework uses a conditional frequency mask generator to learn adaptive frequency masks, which guide the fusion of frequency components from the original ground truth and its super-resolved variants. The resulting perceptually enhanced ground-truth images are used to train a lightweight output refinement network. Result: Experiments show consistent improvement in restoration quality, and user studies confirm the effectiveness of both supervision enhancement and output refinement. Conclusion: The framework strengthens the supervision signal, improves real-world image restoration, and integrates seamlessly with existing models. Abstract: Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth images using super-resolution by incorporating adaptive frequency masks, which are learned by a conditional frequency mask generator. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. Code is available at https://github.com/dhryougit/Beyond-the-Ground-Truth.
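The frequency-domain mixup mentioned in the abstract can be illustrated with a small NumPy sketch that blends the spectra of a ground-truth image and a super-resolved variant under a mask. This is only a sketch under stated assumptions: the paper learns its masks with a conditional generator, whereas the fixed radial high-pass mask here, the function names, the cutoff value, and the random 8x8 arrays standing in for images are all hypothetical.

```python
import numpy as np

def frequency_mixup(gt, sr, mask):
    """Blend two same-sized grayscale images in the 2D Fourier domain.

    mask is in [0, 1]: 0 keeps the ground-truth component at that frequency,
    1 takes the super-resolved component instead.
    """
    f_gt = np.fft.fft2(gt)
    f_sr = np.fft.fft2(sr)
    fused = (1.0 - mask) * f_gt + mask * f_sr
    # A mask symmetric in frequency keeps the fused spectrum Hermitian,
    # so the inverse transform is real up to rounding error
    return np.real(np.fft.ifft2(fused))

def radial_highpass_mask(shape, cutoff):
    """Assumed fixed mask: 1 above the cutoff radius (cycles/pixel), else 0."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return (np.sqrt(fx ** 2 + fy ** 2) > cutoff).astype(float)

rng = np.random.default_rng(0)
gt = rng.random((8, 8))   # stand-in for a ground-truth image
sr = rng.random((8, 8))   # stand-in for its super-resolved variant
enhanced = frequency_mixup(gt, sr, radial_highpass_mask(gt.shape, cutoff=0.25))
```

With this particular mask, low frequencies (overall structure) come from the original ground truth and high frequencies (fine detail) come from the super-resolved variant, which matches the stated goal of keeping content consistent while enriching detail.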

[134] MUT3R: Motion-aware Updating Transformer for Dynamic 3D Reconstruction

Guole Shen,Tianchen Deng,Xingrui Qin,Nailin Wang,Jianyu Wang,Yanbo Wang,Yongtao Chen,Hesheng Wang,Jingchuan Wang

Main category: cs.CV

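TL;DR: This paper proposes MUT3R, a training-free framework that uses motion cues implicit in the attention mechanism to suppress the influence of dynamic regions, improving the stability and temporal consistency of 3D reconstruction in dynamic scenes.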

Details

Motivation: Existing stateful recurrent methods perform well on static 3D reconstruction, but non-rigid motion regions corrupt attention propagation and degrade performance in dynamic scenes. The authors aim to exploit motion cues that the pretrained transformer already encodes but never explicitly uses. Method: By analyzing aggregated self-attention maps across layers, the authors find that dynamic regions are naturally down-weighted, which yields an implicit motion cue. They then design an attention-level gating module that suppresses the propagation of dynamic content in the early transformer layers during inference, without any retraining or fine-tuning. Result: MUT3R improves temporal consistency and camera pose robustness across multiple dynamic benchmarks, supporting its effectiveness for streaming reconstruction. Conclusion: MUT3R offers a simple, training-free way for a pretrained transformer to diagnose and correct its own motion-induced errors, opening a path toward motion-aware streaming 3D reconstruction. Abstract: Recent stateful recurrent neural networks have achieved remarkable progress on static 3D reconstruction but remain vulnerable to motion-induced artifacts, where non-rigid regions corrupt attention propagation between the spatial memory and image features. By analyzing the internal behaviors of the state and image token updating mechanism, we find that aggregating self-attention maps across layers reveals a consistent pattern: dynamic regions are naturally down-weighted, exposing an implicit motion cue that the pretrained transformer already encodes but never explicitly uses. Motivated by this observation, we introduce MUT3R, a training-free framework that applies the attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. Our attention-level gating module suppresses the influence of dynamic regions before their artifacts propagate through the feature hierarchy. Notably, we do not retrain or fine-tune the model; we let the pretrained transformer diagnose its own motion cues and correct itself. This early regulation stabilizes geometric reasoning in streaming scenarios and leads to improvements in temporal consistency and camera pose robustness across multiple dynamic benchmarks, offering a simple and training-free pathway toward motion-aware streaming reconstruction.

[135] TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

Tao Wu,Li Yang,Gen Zhan,Yiting Liao,Junlin Li,Deliang Fu,Li Zhang,Limin Wang

Main category: cs.CV

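TL;DR: This paper proposes TempR1, a temporal-aware multi-task reinforcement learning framework for multimodal large language models. By building a diverse corpus of temporal tasks and adapting the GRPO algorithm, it significantly improves generalization and performance on temporal understanding tasks.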

Details

Motivation: Existing reinforcement learning methods for improving the temporal understanding of multimodal large language models are limited in task types and data, so they generalize poorly to diverse temporal scenarios. A more general and systematic solution is needed. Method: The TempR1 framework builds a multi-task corpus covering diverse temporal structures and semantics, and uses the Group Relative Policy Optimization (GRPO) algorithm for stable and effective cross-task optimization. Temporal tasks are grouped into three correspondence types between predicted intervals and ground-truth instances, and a tailored localization reward is designed for each type. Result: TempR1 achieves state-of-the-art performance on multiple benchmarks, and joint optimization produces a synergistic effect that significantly improves both single-task performance and overall generalization. Conclusion: TempR1 provides a scalable and principled paradigm for temporal reasoning in multimodal large language models and strengthens their understanding of complex temporal structure. Abstract: Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
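
The abstract for [135] does not spell out the localization rewards. A common building block for rewards of this kind is temporal IoU between a predicted interval and a ground-truth interval, so the sketch below shows that quantity plus one hypothetical way to extend it to several ground-truth instances. Both `temporal_iou` and `localization_reward` are illustrative names, and the best-match averaging is an assumption rather than the reward used in the paper.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two intervals given as (start, end)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def localization_reward(preds, gts):
    """Hypothetical reward: mean, over ground-truth instances, of the best IoU
    achieved by any predicted interval."""
    if not preds or not gts:
        return 0.0
    return sum(max(temporal_iou(p, g) for p in preds) for g in gts) / len(gts)


# One prediction matches the first instance exactly and misses the second.
reward = localization_reward([(0.0, 10.0)], [(0.0, 10.0), (20.0, 30.0)])
```

Here the reward is 0.5, because one of the two ground-truth instances is fully covered and the other is missed. A one-to-one task would use a single prediction and a single ground-truth interval, in which case the reward reduces to plain temporal IoU.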

[136] Training for Identity, Inference for Controllability: A Unified Approach to Tuning-Free Face Personalization

Lianyu Pang,Ji Zhou,Qiping Wang,Baoquan Zhao,Zhenguo Yang,Qing Li,Xudong Mao

Main category: cs.CV

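TL;DR: This paper proposes UniID, a unified tuning-free face personalization framework that integrates text-embedding and adapter-based approaches to achieve flexible text control while maintaining high identity fidelity.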

Details

Motivation: Existing tuning-free face personalization methods struggle to achieve both identity fidelity and text controllability; this paper aims to resolve that trade-off. Method: UniID is a unified framework that uses an identity-focused learning scheme during training and a normalized rescaling mechanism at inference to recover text controllability and strengthen identity signals. Result: Experiments against six state-of-the-art methods show that UniID achieves better identity preservation and text controllability. Conclusion: By combining the two paradigms with a matched training-inference strategy, UniID delivers high-quality face personalization with good text control. Abstract: Tuning-free face personalization methods have developed along two distinct paradigms: text embedding approaches that map facial features into the text embedding space, and adapter-based methods that inject features through auxiliary cross-attention layers. While both paradigms have shown promise, existing methods struggle to simultaneously achieve high identity fidelity and flexible text controllability. We introduce UniID, a unified tuning-free framework that synergistically integrates both paradigms. Our key insight is that when merging these approaches, they should mutually reinforce only identity-relevant information while preserving the original diffusion prior for non-identity attributes. We realize this through a principled training-inference strategy: during training, we employ an identity-focused learning scheme that guides both branches to capture identity features exclusively; at inference, we introduce a normalized rescaling mechanism that recovers the text controllability of the base diffusion model while enabling complementary identity signals to enhance each other. This principled design enables UniID to achieve high-fidelity face personalization with flexible text controllability. Extensive experiments against six state-of-the-art methods demonstrate that UniID achieves superior performance in both identity preservation and text controllability. Code will be available at https://github.com/lyuPang/UniID

[137] BlurDM: A Blur Diffusion Model for Image Deblurring

Jin-Ting He,Fu-Jen Tsai,Yan-Tsung Peng,Min-Hung Chen,Chia-Wen Lin,Yen-Yu Lin

Main category: cs.CV

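TL;DR: Proposes a new Blur Diffusion Model (BlurDM) that builds the blur formation mechanism into diffusion through a dual-diffusion forward process, jointly denoising and deblurring, and significantly improves dynamic scene deblurring.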

Details

Motivation: Existing diffusion models for image deblurring do not fully exploit the intrinsic nature of the blurring process, which limits their potential, so a diffusion model that incorporates the real blur formation mechanism is needed. Method: BlurDM uses a dual-diffusion forward process that diffuses both noise and blur onto the image. For the reverse process, a joint denoising and deblurring formulation is derived, and the model is run in latent space for efficiency, serving as a prior generation module for deblurring networks. Result: Experiments on four benchmark datasets show that BlurDM significantly and consistently improves existing deblurring methods. Conclusion: By modeling the blur formation process intrinsically, BlurDM combines diffusion models with physical imaging priors and offers a new, efficient solution for dynamic scene deblurring. Abstract: Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address this, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The source code is available at https://github.com/Jin-Ting-He/BlurDM.

[138] DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment

Sheng-Hao Liao,Shang-Fu Chen,Tai-Ming Huang,Wen-Huang Cheng,Kai-Lung Hua

Main category: cs.CV

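TL;DR: Proposes DirectDrag, a drag-based image editing framework that needs neither masks nor text prompts. Automatic soft mask generation and a readout-guided feature alignment mechanism enable high-fidelity, precise interactive editing.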

Details

Motivation: Existing drag-based image editing methods built on generative models rely on manually provided masks and text prompts to preserve semantic fidelity and motion precision, which limits editing freedom and convenience. Removing these constraints leads to visual artifacts and poor spatial control. Method: Two key modules are designed: 1) an Auto Soft Mask Generation module that infers editable regions from point displacement and automatically localizes deformation along movement paths; 2) a Readout-Guided Feature Alignment mechanism that uses intermediate diffusion activations to maintain structural consistency. The whole framework performs precise edits without manual masks or prompts. Result: Experiments on DragBench and real-world scenarios show that DirectDrag, without manual masks or prompts, still achieves better image quality than existing methods and competitive drag accuracy. Conclusion: DirectDrag addresses drag editing under mask-free and prompt-free conditions, balancing high fidelity, robustness, and user interaction, and advances the use of generative models in interactive image editing. Abstract: Drag-based image editing using generative models provides intuitive control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask- and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that intelligently infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model's inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual mask or prompt, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation. Project Page: https://frakw.github.io/DirectDrag/. Code is available at: https://github.com/frakw/DirectDrag.

[139] DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation

Zexin Lin,Hawen Wan,Yebin Zhong,Xiaoqiang

Main category: cs.CV

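TL;DR: This paper proposes DIQ-H, the first benchmark for evaluating the robustness of vision-language models under dynamic visual degradation, and reveals significant weaknesses of existing models in error recovery and temporal consistency.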

Details

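Motivation: Existing vision-language model (VLM) benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are common in real-world settings. In safety-critical applications such as autonomous driving in particular, transient visual corruption can cause persistent hallucination errors. A new benchmark is therefore needed to evaluate VLM robustness under dynamic degradation. Method: The DIQ-H benchmark applies physics-based degradations (such as motion blur, sensor noise, and compression artifacts) and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. It also introduces Uncertainty-Guided Iterative Refinement (UIR), which uses lightweight VLMs with uncertainty filtering to generate reliable pseudo-annotations. Result: Experiments on 16 state-of-the-art VLMs show that even GPT-4o reaches only a 78.5% recovery rate, and open-source models generally score below 60% on temporal consistency; UIR yields a 15.3% accuracy improvement over conventional annotation strategies. Conclusion: DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployment, exposes serious shortcomings of current models under dynamic visual degradation, and encourages future work on temporal robustness and error recovery mechanisms. Abstract: Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.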

[140] Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation

Hang Xu,Linjiang Huang,Feng Zhao

Main category: cs.CV

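TL;DR: This paper proposes a new test-time scaling method for text-to-image diffusion models. By introducing step-based text embedding perturbation and frequency-guided noise schedules, it significantly improves generation quality and diversity with almost no additional computation.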

Details

Motivation: Existing work focuses mainly on search strategies and reward models and overlooks how the stochasticity of noise in T2I diffusion models affects performance. This paper explores the effect of that randomness and proposes a new form of randomness, text embedding perturbation, to improve generative diversity and quality. Method: 1. Introduce step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; 2. Adapt the perturbation intensity according to each frequency's contribution to generation and its tolerance to perturbation. Result: The method achieves significant improvements on multiple benchmarks with almost no additional compute. Conclusion: Text embedding perturbation and spatial noise behave in complementary ways in the frequency domain and together improve T2I diffusion generation; the approach is efficient and can be integrated into existing TTS methods. Abstract: Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristic of noise in T2I diffusion models on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new form of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these forms of randomness and their impact on generation, and find that the two forms of randomness exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embedding demonstrates varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) Introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation. (2) Adapting the perturbation intensity selectively based on their frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at https://github.com/xuhang07/TEP-Diffusion.

[141] Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

Jialuo Li,Bin Li,Jiahao Li,Yan Lu

Main category: cs.CV

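TL;DR: This paper proposes DIG, a training-free adaptive video frame selection framework that chooses between uniform sampling and query-aware sampling according to the query type (global or localized), efficiently improving large-model performance on long-form video understanding.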

Details

Motivation: Existing long-form video understanding methods are constrained by context length and compute cost, and complex query-aware mechanisms add further overhead. This paper asks whether every query needs such mechanisms and proposes a more efficient alternative. Method: A query typology is first established that distinguishes global from localized queries. DIG then applies uniform sampling to global queries and a dedicated query-relevant frame extraction pipeline to localized queries, giving adaptive frame selection. Result: Experiments on three long-form video understanding benchmarks show that DIG outperforms existing baselines across settings and keeps improving LMM performance even when the input is scaled to 256 frames. Conclusion: Not every query requires a complex frame retrieval mechanism; a query-type-adaptive strategy such as DIG balances efficiency and performance and offers a better solution for long-form video understanding. Abstract: The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
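
The dispatch described in [141] can be illustrated with a small sketch: uniform sampling for global queries, and relevance-based selection for localized ones. The abstract does not describe how queries are classified or how relevance is scored, so the `query_type` label and the per-frame `relevance` scores are assumed inputs here, and top-k selection merely stands in for the paper's specialized pipeline.

```python
def uniform_indices(num_frames, k):
    """Pick k evenly spaced frames (the centre of each of k equal segments)."""
    k = min(k, num_frames)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def top_k_indices(relevance, k):
    """Pick the k frames with the highest relevance, returned in temporal order."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])


def select_frames(query_type, num_frames, k, relevance=None):
    """Dispatch on an externally supplied query type ("global" or "localized")."""
    if query_type == "global":
        return uniform_indices(num_frames, k)
    if relevance is None:
        raise ValueError("localized queries need per-frame relevance scores")
    return top_k_indices(relevance, k)


global_pick = select_frames("global", num_frames=8, k=4)
local_pick = select_frames("localized", num_frames=4, k=2,
                           relevance=[0.1, 0.9, 0.2, 0.8])
```

The point of the design is that the cheap branch handles every query that does not need frame-level grounding, so the cost of scoring frames against the query is paid only for localized queries.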

[142] On the Temporality for Sketch Representation Learning

Marcelo Isaias de Moraes Junior,Moacir Antonelli Ponti

Main category: cs.CV

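TL;DR: This paper examines whether it is justified to treat sketches as sequences and studies how different orderings affect sketch representation learning. It finds that absolute coordinates outperform relative ones, that non-autoregressive decoders perform better, and that the importance of temporality depends on the task and the type of ordering.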

Details

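Motivation: To understand the true impact of the temporal aspect on representation quality in sketch representation learning, and the role of internal orderings. Method: Compare traditional positional encodings, absolute versus relative coordinates, and autoregressive versus non-autoregressive decoders to analyze how different orderings affect sketch modeling. Result: Absolute coordinates consistently outperform relative coordinates; non-autoregressive decoders outperform autoregressive ones; the importance of temporality depends on the ordering considered and the task evaluated. Conclusion: Modeling sketches as sequences is justified, but its effectiveness depends on the specific ordering and the downstream task; absolute coordinates and non-autoregressive architectures have the advantage. Abstract: Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.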
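
To make the absolute-versus-relative distinction in [142] concrete, the sketch below converts a stroke between the two encodings. It assumes the usual convention in which the relative form stores the first point as-is and every later point as an offset from its predecessor; the abstract does not specify the exact encoding used in the paper, and the function names are illustrative.

```python
def to_relative(points):
    """Absolute (x, y) points -> first point followed by successive offsets."""
    if not points:
        return []
    relative = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        relative.append((x1 - x0, y1 - y0))
    return relative


def to_absolute(relative):
    """Inverse of to_relative: accumulate the offsets back into positions."""
    absolute = []
    x = y = 0
    for i, (dx, dy) in enumerate(relative):
        if i == 0:
            x, y = dx, dy
        else:
            x, y = x + dx, y + dy
        absolute.append((x, y))
    return absolute


stroke = [(1, 2), (4, 6), (4, 3)]
offsets = to_relative(stroke)      # [(1, 2), (3, 4), (0, -3)]
restored = to_absolute(offsets)    # [(1, 2), (4, 6), (4, 3)]
```

One practical difference is that with relative offsets each position depends on all preceding points, so reordering the sequence changes every decoded position, whereas absolute points can be reordered freely.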

[143] Emergent Outlier View Rejection in Visual Geometry Grounded Transformers

Jisang Han,Sunghwan Hong,Jaewoo Jung,Wooseok Jang,Honggyu An,Qianqian Wang,Seungryong Kim,Chen Feng

Main category: cs.CV

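TL;DR: This paper finds that existing feed-forward 3D reconstruction models such as VGGT, despite having no explicit outlier-rejection mechanism, can naturally distinguish distractor images. Analysis identifies a specific layer that suppresses outlier features, and this property is used to reject outlier views without fine-tuning, significantly improving the robustness of 3D reconstruction on in-the-wild images.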

Details

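Motivation: In real-world image collections, noisy images (irrelevant inputs with no view overlap) severely harm the reliability of 3D reconstruction. Traditional SfM pipelines rely on geometric verification and outlier rejection, while feed-forward models lack such mechanisms and degrade as a result. This motivates asking whether feed-forward models have an implicit ability to handle noise. Method: An in-depth analysis of feed-forward models such as VGGT under varying proportions of synthetic distractor images identifies a key layer with outlier-suppressing behavior. Further probing of that layer's internal representations shows that it encodes discriminative features that filter noise effectively, which are then used directly for outlier-view rejection in feed-forward 3D reconstruction, with no extra training or supervision. Result: Experiments show that this implicit filtering mechanism is consistent and generalizes well across controlled and in-the-wild datasets, improving reconstruction completeness and accuracy. Conclusion: Feed-forward 3D reconstruction models contain a noise-filtering ability that was never explicitly designed; exploiting the internal representation of a specific layer gives efficient outlier-view rejection and a new way to improve robustness in real-world scenes. Abstract: Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images: irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that existing feed-forward reconstruction models, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.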
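
The abstract for [143] does not say how the layer's representations are turned into a keep-or-reject decision. As a purely illustrative sketch, the code below assumes each view has already been summarised as one feature vector taken from that layer, scores each view by its mean cosine similarity to the other views, and drops views that fall below a threshold. The scoring rule, the threshold, and the function names are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def keep_inlier_views(view_features, threshold=0.5):
    """Return indices of views whose mean similarity to the others is high enough."""
    kept = []
    for i, feat in enumerate(view_features):
        sims = [cosine(feat, other)
                for j, other in enumerate(view_features) if j != i]
        score = sum(sims) / len(sims) if sims else 1.0
        if score >= threshold:
            kept.append(i)
    return kept


# Three mutually consistent views and one distractor pointing elsewhere.
features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0]]
inliers = keep_inlier_views(features)
```

With these toy features the fourth view is dropped because it is nearly orthogonal to the rest. A fixed threshold is the simplest choice; with a large share of distractors the mean similarity of true inliers also falls, so a relative or clustering-based rule may be more appropriate.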

[144] Learning Group Actions In Disentangled Latent Image Representations

Farhana Hossain Swarnali,Miaomiao Zhang,Tonmoy Hossain

Main category: cs.CV

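TL;DR: Proposes an end-to-end framework that, for the first time, automatically learns group actions on latent image manifolds and discovers transformation-relevant structure without manual intervention.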

Details

Motivation: Existing methods either apply group actions in the high-dimensional data space or require manual partitioning of latent variables into subspaces, which makes it hard to disentangle the subspace affected by transformations and limits flexibility. Method: Learnable binary masks with straight-through estimation dynamically partition the latent representation into transformation-sensitive and invariant components, and a unified optimization framework jointly learns latent disentanglement and group transformation mappings. Result: The method is validated on five 2D/3D image datasets, where it automatically learns disentangled latent factors for group actions; downstream classification tasks confirm the effectiveness of the learned representations. Conclusion: The framework integrates seamlessly with standard encoder-decoder architectures and provides automatic, flexible, and robust learning of group actions in latent representations. Abstract: Modeling group actions on latent representations enables controllable transformations of high-dimensional image data. Prior works applying group-theoretic priors or modeling transformations typically operate in the high-dimensional data space, where group actions apply uniformly across the entire input, making it difficult to disentangle the subspace that varies under transformations. While latent-space methods offer greater flexibility, they still require manual partitioning of latent variables into equivariant and invariant subspaces, limiting the ability to robustly learn and operate group actions within the representation space. To address this, we introduce a novel end-to-end framework that for the first time learns group actions on latent image manifolds, automatically discovering transformation-relevant structures without manual intervention. Our method uses learnable binary masks with straight-through estimation to dynamically partition latent representations into transformation-sensitive and invariant components. We formulate this within a unified optimization framework that jointly learns latent disentanglement and group transformation mappings. The framework can be seamlessly integrated with any standard encoder-decoder architecture. We validate our approach on five 2D/3D image datasets, demonstrating its ability to automatically learn disentangled latent factors for group actions in diverse data, while downstream classification tasks confirm the effectiveness of the learned representations. Our code is publicly available at https://github.com/farhanaswarnali/Learning-Group-Actions-In-Disentangled-Latent-Image-Representations .
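
The straight-through estimator mentioned in [144] can be written out by hand without an autograd library. In the forward pass the mask is a hard 0/1 threshold of learnable logits; in the backward pass the threshold is treated as the identity, so the gradient with respect to the mask is passed unchanged to the logits. The sketch below shows both passes and the resulting split of a latent vector. The threshold at zero and the function names are assumptions; the abstract does not describe the paper's exact parameterisation.

```python
def hard_mask(logits, threshold=0.0):
    """Forward pass: binarise learnable logits into a 0/1 mask."""
    return [1.0 if value > threshold else 0.0 for value in logits]


def straight_through_grad(grad_wrt_mask):
    """Backward pass: pretend the threshold was the identity function,
    so the gradient reaching the logits equals the gradient at the mask."""
    return list(grad_wrt_mask)


def partition(latent, mask):
    """Split a latent vector into transformation-sensitive and invariant parts."""
    sensitive = [z * m for z, m in zip(latent, mask)]
    invariant = [z * (1.0 - m) for z, m in zip(latent, mask)]
    return sensitive, invariant


mask = hard_mask([-1.0, 0.5, 2.0])                 # [0.0, 1.0, 1.0]
sensitive, invariant = partition([3.0, 4.0, 5.0], mask)
logit_grad = straight_through_grad([0.1, -0.2, 0.3])
```

In a framework with automatic differentiation the same effect is usually obtained by adding the hard mask to a soft surrogate with the difference detached from the graph, so that the forward value is binary while gradients follow the surrogate.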

[145] Ultra-lightweight Neural Video Representation Compression

Ho Man Kwan,Tianhao Peng,Ge Gao,Fan Zhang,Mike Nilsson,Andrew Gower,David Bull

Main category: cs.CV

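TL;DR: This paper proposes NVRC-Lite, an end-to-end video compression framework for lightweight implicit neural representations. By introducing multi-scale feature grids and an octree-based context model for entropy coding, it reduces computational complexity while significantly improving compression performance and coding speed.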

Details

Motivation: Existing INR-based video compression methods perform well but have high computational complexity and inefficient entropy coding; autoregressive models in particular make coding slow, which limits practical use. A more efficient, lighter INR compression framework is therefore needed. Method: NVRC-Lite extends NVRC in two ways: 1) it introduces multi-scale feature grids, using higher-resolution grids to improve INR performance at low complexity; 2) it designs an octree-based context model for entropy coding high-dimensional feature grids, replacing conventional autoregressive models to speed up coding. Result: Experiments show that, compared with C3, a leading lightweight INR-based video codec, NVRC-Lite achieves BD-rate savings of up to 21.03% in PSNR and 23.06% in MS-SSIM, with 8.4x faster encoding and 2.5x faster decoding. Conclusion: Through architectural optimization and an efficient entropy coding strategy, NVRC-Lite moves INR-based video compression toward lighter and faster operation, combining high compression efficiency with practicality and offering a feasible route to deploying neural video compression. Abstract: Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrate multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.

[146] C3G: Learning Compact 3D Representations with 2K Gaussians

Honggyu An,Jaewoo Jung,Mungyeom Kim,Sunghwan Hong,Chaehyun Kim,Kazumi Fukuda,Minkyeong Jeon,Jisang Han,Takuya Narihira,Hyuna Ko,Junsu Kim,Yuki Mitsufuji,Seungryong Kim

Main category: cs.CV

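TL;DR: This paper proposes C3G, a novel feed-forward framework for efficiently reconstructing and understanding 3D scenes from unposed sparse views. It generates compact 3D Gaussians at essential locations and uses learnable tokens to aggregate multi-view features, significantly reducing redundancy and improving feature fidelity and memory efficiency.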

Details

Motivation: Existing methods based on per-pixel 3D Gaussian Splatting produce large numbers of redundant Gaussians, causing high memory overhead and poor multi-view feature aggregation, which hurts novel view synthesis and scene understanding. Method: The C3G framework introduces learnable tokens that aggregate multi-view features through self-attention and guide the generation of compact 3D Gaussians only at necessary spatial locations; the learned attention patterns are then reused for efficient feature lifting. Result: Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation show that the method maintains high-quality reconstruction while significantly improving memory efficiency and feature fidelity. Conclusion: A compact yet geometrically meaningful representation is sufficient for high-quality 3D scene reconstruction and understanding, and C3G offers a new solution for efficient 3D scene modeling. Abstract: Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.

[147] PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Xiaolong Li,Youping Gu,Xi Lin,Weijie Wang,Bohan Zhuang

Main category: cs.CV

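TL;DR: Proposes Pyramid Sparse Attention (PSA), which uses multi-level pooled KV representations to achieve fine-grained sparsity, reducing information loss and improving the efficiency-quality trade-off for video understanding and generation.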

Details

Motivation: Existing sparse attention methods use binary masks, which cause severe information loss at high sparsity; a finer-grained sparsity mechanism is needed to mitigate this. Method: PSA introduces multi-level pooled KV representations, with each query block dynamically assigning different pooling levels to critical and non-critical KV blocks, combined with a hardware-friendly kernel based on a decoupled block-tile design for efficient computation. Result: On video understanding and generation tasks, PSA matches or significantly outperforms existing sparse attention methods while remaining efficient, showing a better efficiency-quality trade-off. Conclusion: With a design analogous to fixed-point quantization and feature pyramid networks, PSA mitigates information loss in sparse attention and serves as a general, efficient sparse attention module. Abstract: Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance to existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
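
The level-allocation idea in [147] can be illustrated with a toy sketch: for one query block, KV blocks are ranked by an importance score, the most important are kept at full resolution (level 0), the next group is mean-pooled more aggressively, and the rest are pruned. The abstract does not say how importance is estimated or how many blocks go to each level, so the `importance` scores, the per-level budget, and the pooling factor of 2 per level are all assumptions made for illustration.

```python
def assign_levels(importance, budget_per_level):
    """Rank KV blocks by importance and hand out pooling levels.

    budget_per_level[l] is how many blocks receive level l; blocks left over
    get None, meaning they are pruned entirely.
    """
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    levels = [None] * len(importance)
    position = 0
    for level, count in enumerate(budget_per_level):
        for block in order[position:position + count]:
            levels[block] = level
        position += count
    return levels


def mean_pool(block, level):
    """Average consecutive groups of 2**level token vectors within a block."""
    group = 2 ** level
    pooled = []
    for start in range(0, len(block), group):
        chunk = block[start:start + group]
        pooled.append([sum(column) / len(chunk) for column in zip(*chunk)])
    return pooled


levels = assign_levels([0.9, 0.1, 0.5, 0.3], budget_per_level=[1, 2])
keys = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pooled_keys = mean_pool(keys, level=1)   # four tokens become two
```

Compared with a binary keep-or-drop mask, a block assigned level 1 still contributes half as many tokens to attention instead of none, which is the interpolation between full retention and complete pruning that the abstract describes.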

[148] Fast & Efficient Normalizing Flows and Applications of Image Generative Models

Sandeep Nagar

Main category: cs.CV

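TL;DR: This thesis contributes both to the efficiency of generative models and to practical computer vision applications. It presents six key improvements to normalizing flow architectures and applies generative models to agricultural quality assessment, geological mapping, privacy protection for autonomous driving data, and art restoration.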

Details

Motivation: To improve the computational efficiency of generative models, especially normalizing flows, and to extend their use to real-world computer vision problems involving data scarcity, class imbalance, privacy protection, and multiple types of degradation. Method: Six improvements to normalizing flows are proposed: invertible 3x3 convolution layers, a more efficient Quad-coupling layer, a parallel inversion algorithm for kxk convolutional layers, a fast backpropagation algorithm for the inverse of convolution, the use of the inverse of convolution in the forward pass together with the Inverse-Flow training method, and Affine-StableSR, a lightweight super-resolution model based on pre-trained weights and normalizing flow layers. On the application side, conditional GANs, stacked autoencoders, and Stable Diffusion-based image inpainting are used for specific problems. Result: On the efficiency side, the models achieve faster inference and training with fewer parameters. On the application side, the work demonstrates seed purity testing, geological feature extraction, privacy protection for autonomous driving data (face and license plate replacement), and restoration of artworks with multiple degradation types, with good performance and practicality. Conclusion: The proposed normalizing flow improvements significantly increase model efficiency, and their application to several real-world vision tasks supports the potential and broad applicability of generative models for complex practical problems. Abstract: This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) development of invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) introduction of a more efficient Quad-coupling layer; 3) design of a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) use of the inverse of convolution, in Inverse-Flow, for the forward pass, trained with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance.
The second part presents: 1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) Stable Diffusion-based image inpainting to replace detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.

[149] RELIC: Interactive Video World Model with Long-Horizon Memory

Yicong Hong,Yiqun Mei,Chongjian Ge,Yiran Xu,Yang Zhou,Sai Bi,Yannick Hold-Geoffroy,Mike Roberts,Matthew Fisher,Eli Shechtman,Kalyan Sunkavalli,Feng Liu,Zhengqi Li,Hao Tan

Main category: cs.CV

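TL;DR: RELIC is a unified framework that uses compressed historical latent tokens and a causal student generator to enable memory-aware, long-duration scene exploration in real time, with precise user control and 3D-consistent content retrieval.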

Details

Motivation: Existing methods usually address only one of real-time long-horizon streaming, consistent spatial memory, or precise user control, and struggle to satisfy all three at once; long-term memory mechanisms in particular often hurt real-time performance. Method: Building on autoregressive video-diffusion distillation, the model represents efficiently compressed long-term memory in a KV cache that encodes both relative actions and absolute camera poses, and converts a bidirectional teacher model into a causal student generator through a new memory-efficient self-forcing paradigm, enabling long-sequence generation and full-context distillation. Result: RELIC generates in real time at 16 FPS and shows more accurate action following, more stable long-horizon streaming, and stronger spatial-memory retrieval than prior work. Conclusion: RELIC addresses the three main challenges of interactive world modeling together and provides a strong foundation for the next generation of interactive world models. Abstract: A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.

[150] SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

Siyi Chen,Mikaela Angelina Uy,Chan Hee Song,Faisal Ladhak,Adithyavairavan Murali,Qing Qu,Stan Birchfield,Valts Blukis,Jonathan Tremblay

Main category: cs.CV

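TL;DR: This paper proposes Double Interactive Reinforcement Learning (DIRL), a two-phase training framework that lets vision-language models (VLMs) coordinate multiple visual tools through interactive exploration and feedback, improving their metric spatial reasoning in embodied applications.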

Details

Motivation: Existing VLMs do well on qualitative visual understanding but poorly on embodied tasks that require precise spatial reasoning. External tools (depth estimation, segmentation, and so on) can add this capability, but current methods depend on handcrafted prompts or fixed tool pipelines, which limits the model's ability to discover optimal tool-use strategies on its own. Reinforcement learning could address this, but the large search space has restricted it in multi-tool settings. Method: The DIRL framework has a teaching phase and an exploration phase. In the teaching phase, demonstrations from a single-tool specialist trained with reinforcement learning are combined with execution traces from a frontier model that uses all tools. In the exploration phase, the model further refines its multi-tool coordination through continued reinforcement learning. The resulting model, SpaceTools, has tool-augmented spatial reasoning ability and is evaluated in simulation and on a real robot. Result: SpaceTools reaches state-of-the-art performance on several spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK), improves on RoboSpatial by +12% over pure supervised fine-tuning (SFT) and by +16% over the RL baseline, and performs reliable manipulation on a real 7-DOF robot. Conclusion: DIRL gives VLMs an effective path to flexible, efficient multi-tool coordination, substantially improves their performance on complex spatial reasoning and embodied interaction tasks, and shows the potential of reinforcement learning for open-ended tool use. Abstract: Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.

[151] PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design

Jiazhe Wei,Ken Li,Tianyu Lao,Haofan Wang,Liang Wang,Caifeng Shan,Chenyang Si

Main category: cs.CV

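TL;DR: PosterCopilot is a new framework for improving layout reasoning and controllable editing in graphic design. A three-stage training strategy strengthens a large model's geometric understanding and aesthetic reasoning, and pairing the model with generative models enables iterative design with precise, layer-level control.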

Details

Motivation: Existing automatic graphic design methods based on large models often produce geometrically inaccurate layouts and lack the layer-by-layer iterative editing that professional workflows require, so they fall short of real design needs. Method: The paper proposes a three-stage training strategy: perturbed supervised fine-tuning, reinforcement learning for visual-reality alignment, and reinforcement learning from aesthetic feedback. It also builds a workflow that couples the trained LMM-based design model with generative models to support layer-controllable, iterative editing. Result: Experiments show that PosterCopilot generates layouts that are more geometrically accurate and of higher aesthetic quality, and that it supports fine-grained element adjustment while preserving overall visual consistency. Conclusion: PosterCopilot substantially improves the practicality of automated graphic design in professional settings and offers an effective route toward controllable, iterative intelligent design systems. Abstract: Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.

[152] SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows

Qinyu Zhao,Guangting Zheng,Tao Yang,Rui Zhu,Xingjian Leng,Stephen Gould,Liang Zheng

Main category: cs.CV

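TL;DR: This paper proposes SimFlow, a simple but effective method that fixes the variance in the VAE encoder to a constant (e.g., 0.5). This removes the reliance of Normalizing Flows for image generation on complex data augmentation and a frozen pretrained VAE, and yields better reconstruction and generation. On ImageNet 256×256, SimFlow reaches a gFID of 2.15, which improves to 1.91 when combined with REPA-E.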

Details

Motivation: Existing Normalizing Flow methods usually rely on noise-based data augmentation or on a frozen pretrained VAE encoder, which complicates the pipeline and limits generation quality. This paper aims to simplify training and improve both generation and reconstruction. Method: The paper proposes SimFlow, which fixes the variance output by the VAE encoder to a constant (e.g., 0.5). This avoids extra noising and denoising steps and lets the VAE and the NF be trained jointly end to end. The method simplifies the ELBO objective and improves training stability. Result: On the ImageNet 256×256 image generation task, SimFlow obtains a gFID of 2.15, better than STARFlow (2.40); combined with REPA-E, the score drops further to 1.91, the best among current Normalizing Flow methods. Conclusion: Fixing the VAE variance is a simple but powerful strategy that improves the generation quality and training stability of Normalizing Flows and suggests a new approach to jointly optimizing NFs and VAEs. Abstract: Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior works usually suffer from two limitations. First, they add random noise to training samples or VAE latents as data augmentation, introducing complex pipelines including extra noising and denoising steps. Second, they use a pretrained and frozen VAE encoder, resulting in suboptimal reconstruction and generation quality. In this paper, we find that the two issues can be solved in a very simple way: just fixing the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5). On the one hand, this method allows the encoder to output a broader distribution of tokens and the decoder to learn to reconstruct clean images from the augmented token distribution, avoiding additional noise or denoising design. On the other hand, fixed variance simplifies the VAE evidence lower bound, making it stable to train an NF with a VAE jointly. On the ImageNet $256 \times 256$ generation task, our model SimFlow obtains a gFID score of 2.15, outperforming the state-of-the-art method STARFlow (gFID 2.40). Moreover, SimFlow can be seamlessly integrated with the end-to-end representation alignment (REPA-E) method and achieves an improved gFID of 1.91, setting a new state of the art among NFs.
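The fixed-variance idea in the SimFlow abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a diagonal Gaussian posterior N(mu, c·I) with c = 0.5 (the example value given in the abstract), and the names `sample_latent`, `gaussian_entropy`, and the toy tensor shapes are hypothetical. One way to read the claim that a fixed variance simplifies the evidence lower bound is that the posterior entropy no longer depends on the encoder output, which the second function makes explicit.

```python
import numpy as np

FIXED_VARIANCE = 0.5  # example constant quoted in the abstract

def sample_latent(mu, rng, variance=FIXED_VARIANCE):
    """Reparameterized sample z = mu + sqrt(variance) * eps.

    The encoder is assumed to predict only the mean; the variance is a
    constant rather than a second encoder output.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.sqrt(variance) * eps

def gaussian_entropy(dim, variance=FIXED_VARIANCE):
    """Entropy (in nats) of N(mu, variance * I) in `dim` dimensions.

    The value does not depend on mu, so with a fixed variance this term
    is a constant of the training objective.
    """
    return 0.5 * dim * np.log(2.0 * np.pi * np.e * variance)

rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 16))  # hypothetical encoder means: batch of 4, 16 latent dims
z = sample_latent(mu, rng)         # noisy latents that a decoder would reconstruct from
entropy = gaussian_entropy(mu.shape[-1])
```

In this reading, the sampling noise itself plays the role of the augmentation, so no separate noising or denoising stage appears in the sketch.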

[153] Unique Lives, Shared World: Learning from Single-Life Videos

Tengda Han,Sayna Ebrahimi,Dilara Gokay,Li Yang Ku,Maks Ovsjanikov,Iva Babukova,Daniel Zoran,Viorica Patraucean,Joao Carreira,Andrew Zisserman,Dima Damen

Main category: cs.CV

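TL;DR: The paper proposes the "single-life" learning paradigm, in which a vision model is trained with self-supervision on the egocentric videos of one individual. Models trained on different individuals still develop a highly aligned geometric understanding and transfer effectively to downstream tasks, and up to 30 hours of one person's video performs comparably to 30 hours of diverse web data.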

Details

Motivation: The paper explores whether self-supervised representation learning is possible using only the everyday visual data of a single individual, challenging the conventional paradigm of training vision models on large, diverse datasets. Method: It proposes a single-life learning framework that uses the multiple viewpoints in one individual's continuous egocentric video for self-supervised training, and introduces a cross-attention-based metric to assess the functional alignment of the internal representations of different models. Result: 1) Models trained independently on data from different individuals show a highly consistent geometric understanding; 2) representations learned by single-life models transfer effectively to downstream tasks such as depth estimation; 3) training on up to 30 hours of video from one person's life performs comparably to training on 30 hours of diverse web data. Conclusion: The shared structure of the world provides a strong signal for visual representation learning within a single life, making it possible to learn general, aligned visual representations from individual experience. Abstract: We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.

Table of Contents

cs.CL [Back]

[1] Entropy-Based Measurement of Value Drift and Alignment Work in Large Language Models

[2] Watermarks for Embeddings-as-a-Service Large Language Models

[3] Alleviating Choice Supportive Bias in LLM with Reasoning Dependency Generation

[4] Enhancing Job Matching: Occupation, Skill and Qualification Linking with the ESCO and EQF taxonomies

[5] InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation

[6] Identifying attributions of causality in political text

[7] Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs

[8] Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní

[9] PERCS: Persona-Guided Controllable Biomedical Summarization Dataset

[10] Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning

[11] From Hypothesis to Premises: LLM-based Backward Logical Reasoning with Selective Symbolic Translation

[12] Nexus: Higher-Order Attention Mechanisms in Transformers

[13] Characterizing Language Use in a Collaborative Situated Game

[14] Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates

[15] PretrainZero: Reinforcement Active Pretraining

[16] A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention

[17] Understanding LLM Reasoning for Abstractive Summarization

[18] Fine-grained Narrative Classification in Biased News Articles

[19] AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment

[20] Generative AI Practices, Literacy, and Divides: An Empirical Analysis in the Italian Context

[21] Evaluating Hydro-Science and Engineering Knowledge of Large Language Models

[22] Different types of syntactic agreement recruit the same units within large language models

[23] AITutor-EvalKit: Exploring the Capabilities of AI Tutors

[24] DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue

[25] AR-Med: Automated Relevance Enhancement in Medical Search via LLM-Driven Information Augmentation

[26] Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

[27] In-Context Representation Hijacking

[28] Enhancing Instruction-Following Capabilities in Seq2Seq Models: DoLA Adaptations for T5

[29] Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

[30] Training and Evaluation of Guideline-Based Medical Reasoning in LLMs

[31] Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

[32] BERnaT: Basque Encoders for Representing Natural Textual Diversity

[33] Is Lying Only Sinful in Islam? Exploring Religious Bias in Multilingual Large Language Models Across Major Religions

[34] Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study

[35] Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models

[36] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

[37] Jina-VLM: Small Multilingual Vision Language Model

[38] SkillFactory: Self-Distillation For Learning Cognitive Behaviors

cs.CV [Back]

[39] Hierarchical Process Reward Models are Symbolic Vision Learners

[40] Drainage: A Unifying Framework for Addressing Class Uncertainty

[41] Does Head Pose Correction Improve Biometric Facial Recognition?

[42] Flux4D: Flow-based Unsupervised 4D Reconstruction

[43] Object Counting with GPT-4o and GPT-5: A Comparative Study

[44] LLM-Guided Material Inference for 3D Point Clouds

[45] 2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition

[46] PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement

[47] PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispectral Remote Sensing Imagery

[48] SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding

[49] NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction

[50] Step-by-step Layered Design Generation

[51] ProtoEFNet: Dynamic Prototype Learning for Inherently Interpretable Ejection Fraction Estimation in Echocardiography

[52] HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration

[53] Hierarchical Attention for Sparse Volumetric Anomaly Detection in Subclinical Keratoconus

[54] SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation

[55] A Hybrid Deep Learning Framework with Explainable AI for Lung Cancer Classification with DenseNet169 and SVM

[56] FireSentry: A Multi-Modal Spatio-temporal Benchmark Dataset for Fine-Grained Wildfire Spread Forecasting

[57] ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding

[58] MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification

[59] ViDiC: Video Difference Captioning

[60] YOLOA: Real-Time Affordance Detection via LLM Adapter

[61] DM3D: Deformable Mamba via Offset-Guided Gaussian Sequencing for Point Cloud Understanding

[62] Generalization Evaluation of Deep Stereo Matching Methods for UAV-Based Forestry Applications

[63] Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features

[64] Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

[65] LM-CartSeg: Automated Segmentation of Lateral and Medial Cartilage and Subchondral Bone for Radiomics Analysis

[66] KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion Models

[67] GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers

[68] GeoVideo: Introducing Geometric Regularization into Video Generation Model

[69] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

[70] Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

[71] Difference Decomposition Networks for Infrared Small Target Detection

[72] Procedural Mistake Detection via Action Effect Modeling

[73] Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis

[74] Towards Object-centric Understanding for Instructional Videos

[75] NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Models with Searchable Adaptation

[76] EEA: Exploration-Exploitation Agent for Long Video Understanding

[77] Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation