cs.CL

[1] Verify as You Go: An LLM-Powered Browser Extension for Fake News Detection

Dorsaf Sallami,Esma Aïmeur

Main category: cs.CL

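TL;DR: This paper presents Aletheia, a browser extension that uses Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) for explainable fake news detection, with a Discussion Hub and a Stay Informed feature to improve user engagement and transparency; experiments and a user study indicate strong detection performance, usability, and trustworthiness.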

Details

Motivation: Existing browser extensions for fake news detection suffer from opaque model behavior, limited explanatory support, and low user engagement, so a more transparent, explainable, and user-centered solution is needed. Method: Design and implement the Aletheia browser extension, combining Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) for fake news detection and evidence-based explanations; integrate an interactive Discussion Hub and a Stay Informed feature; run technical performance experiments and a user study with 250 participants. Result: Aletheia outperforms state-of-the-art baselines on fake news detection; the user study indicates high usability, good perceived effectiveness, and strong trust. Conclusion: Aletheia offers a workable way to address fake news in the digital age that balances technical performance with transparent human-AI collaboration, and it supports the practical value of the RAG+LLM paradigm for explainable content credibility assessment. Abstract: The rampant spread of fake news in the digital age poses serious risks to public trust and democratic institutions, underscoring the need for effective, transparent, and user-centered detection tools. Existing browser extensions often fall short due to opaque model behavior, limited explanatory support, and a lack of meaningful user engagement. This paper introduces Aletheia, a novel browser extension that leverages Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to detect fake news and provide evidence-based explanations. Aletheia further includes two interactive components: a Discussion Hub that enables user dialogue around flagged content and a Stay Informed feature that surfaces recent fact-checks. Through extensive experiments, we show that Aletheia outperforms state-of-the-art baselines in detection performance. Complementing this empirical evaluation, a user study with 250 participants confirms the system's usability and perceived effectiveness, highlighting its potential as a transparent tool for combating online fake news.

[2] Attention Meets Reachability: Structural Equivalence and Efficiency in Grammar-Constrained LLM Decoding

Faruk Alpay,Bilge Senturk

Main category: cs.CL

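TL;DR: This paper studies grammar-constrained decoding (GCD) as a coupling between an autoregressive language model and a reachability oracle over a pushdown system compiled from a context-free grammar (CFG). It proves an oracle invariance theorem, analyzes how equivalent grammars differ in state space and online ambiguity cost, and introduces a structural ambiguity cost (SAC) measure. It also gives complexity lower bounds, an existence proof for minimal-SAC grammars, and statistical distortion bounds for hard-masked decoding, then integrates the theory with Transformer and MoE architectures, connecting SAC to performance modeling and automated grammar optimization.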

Details

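Motivation: Address the efficiency and consistency problems in grammar-constrained decoding that arise when grammars are language-equivalent but differ in compiled state space and online computational cost, and establish a theoretical basis for efficient, parse-preserving, real-time grammar-guided generation. Method: Build a formal GCD framework based on pushdown systems; prove an oracle invariance theorem; define and quantify the structural ambiguity cost (SAC); analyze state blowup and SAC growth for canonical grammars such as $a^n b^n$; derive engine-independent computational lower bounds; characterize the true conditional sampling distribution via a Doob $h$-transform; model latency and performance for Transformer and MoE architectures. Result: 1) Language-equivalent CFGs induce identical logit masks but can have very different state spaces and SAC; 2) SAC is $O(1)$ per token under right-recursion but $\Theta(t^2)$ per token under concatenation; 3) any sound, retrieval-efficient online masking engine must incur $\Omega(t^2)$ work per step on a certain constant-size CFG family; 4) minimal-SAC grammars exist within bounded rewrite families; 5) upper bounds on the KL and TV distortion of hard-masked decoding are given; 6) SAC is tied into a unified framework for latency modeling, predictive performance, and automated grammar optimization. Conclusion: The efficiency of grammar-constrained decoding depends on the structural properties of the grammar, not only on the language it expresses; SAC is a central metric for online decoding cost; the theory provides verifiable complexity guarantees, distortion control, and optimization paths for grammar-guided generation. Abstract: We study grammar-constrained decoding (GCD) as a coupling between an autoregressive next-token distribution and a reachability oracle over a pushdown system compiled from a context-free grammar (CFG). We prove an oracle invariance theorem: language-equivalent grammars induce identical admissible next-token sets for every prefix, hence identical logit masks, yet can yield provably different compiled state spaces and online ambiguity costs. We give exact control-state blowup counts for the canonical $a^n b^n$ language under redundant nonterminal delegation, and introduce a left-to-right structural ambiguity cost (SAC) measuring incremental packed-parse-forest growth per token. For two equivalent grammars over all finite strings, SAC is $O(1)$ per token under right-recursion but $\Theta(t^2)$ per token and $\Theta(n^3)$ cumulatively under concatenation. We establish engine-independent lower bounds: any sound, retrieval-efficient, parse-preserving online masking engine must incur $\Omega(t^2)$ work per token on a specific constant-size CFG family, unconditionally within this model. We define decoding-cost equivalence classes of grammars and prove existence of minimal-SAC representatives within bounded rewrite families. 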
Finally, we characterize the true conditional sampler via a Doob $h$-transform and derive sharp one-step KL and total-variation distortion bounds for hard-masked decoding in terms of survival-probability spread among admissible next tokens. We integrate these results with Transformer and Mixture-of-Experts architectures, derive latency envelopes in terms of vocabulary size, active state sets, and beam width, and connect SAC to instrumentation-based predictive performance models and automated grammar optimization.
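
The gap between hard-masked decoding and the true conditional sampler can be illustrated on a toy next-token distribution. The sketch below is not the paper's implementation: the probabilities `p`, the survival probabilities `h` (the chance that a prefix extended by a token can still be completed to a grammatical string), and all function names are assumptions made for illustration. Hard masking renormalizes `p` over tokens with `h > 0`, while the Doob $h$-transform reweights each token by `h` before renormalizing.

```python
import math

def hard_mask(p, h):
    """Hard-masked decoding: keep admissible tokens (h > 0) and renormalize p."""
    kept = {tok: p[tok] for tok in p if h[tok] > 0.0}
    z = sum(kept.values())
    return {tok: w / z for tok, w in kept.items()}

def h_transform(p, h):
    """True conditional sampler: reweight p by survival probability h, then renormalize."""
    weighted = {tok: p[tok] * h[tok] for tok in p if h[tok] > 0.0}
    z = sum(weighted.values())
    return {tok: w / z for tok, w in weighted.items()}

def kl(a, b):
    """KL(a || b) for two distributions over the same admissible tokens."""
    return sum(a[t] * math.log(a[t] / b[t]) for t in a if a[t] > 0.0)

def tv(a, b):
    """Total-variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a[t] - b[t]) for t in a)

# Toy values (assumed): token "c" is inadmissible, "a" survives far more often than "b".
p = {"a": 0.5, "b": 0.3, "c": 0.2}
h = {"a": 0.9, "b": 0.1, "c": 0.0}

q = hard_mask(p, h)         # what a logit mask samples from
p_star = h_transform(p, h)  # what exact conditioning would sample from
one_step_tv = tv(p_star, q)
one_step_kl = kl(p_star, q)
```

When every admissible token has the same survival probability, the two distributions coincide and both distortions are zero, which matches the abstract's framing of the bounds in terms of survival-probability spread. The sketch assumes at least one admissible token carries positive probability.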

[3] NOTAI.AI: Explainable Detection of Machine-Generated Text via Curvature and Feature Attribution

Oleksandr Marchenko Breneur,Adelaide Danilov,Aria Nourbakhsh,Salima Lamsiyah

Main category: cs.CL

TL;DR: NOTAI.AI is an explainable framework for detecting AI-generated text by combining curvature-based, neural, and stylometric features in an XGBoost classifier, enhanced with SHAP-based interpretability and LLM-generated natural-language explanations, deployed as an interactive web application.

Details

Motivation: To improve transparency and trust in AI-generated text detection by providing interpretable, feature-level explanations alongside high detection accuracy. Method: Integrates 17 interpretable features—including Conditional Probability Curvature, ModernBERT score, readability metrics, and stylometric cues—into an XGBoost meta-classifier; applies SHAP for local/global attribution; and uses an LLM to generate structured natural-language rationales. Result: A deployable, real-time web application that delivers accurate detection with interactive visualizations, feature inspection, and human-readable explanations. Conclusion: NOTAI.AI bridges the gap between detection performance and explainability, offering a practical, open, and user-facing solution for trustworthy AI-text detection. Abstract: We present NOTAI.AI, an explainable framework for machine-generated text detection that extends Fast-DetectGPT by integrating curvature-based signals with neural and stylometric features in a supervised setting. The system combines 17 interpretable features, including Conditional Probability Curvature, ModernBERT detector score, readability metrics, and stylometric cues, within a gradient-boosted tree (XGBoost) meta-classifier to determine whether a text is human- or AI-generated. Furthermore, NOTAI.AI applies Shapley Additive Explanations (SHAP) to provide both local and global feature-level attribution. These attributions are further translated into structured natural-language rationales through an LLM-based explanation layer, which enables user-facing interpretability. The system is deployed as an interactive web application that supports real-time analysis, visual feature inspection, and structured evidence presentation. A web interface allows users to input text and inspect how neural and statistical signals influence the final decision. The source code and demo video are publicly available to support reproducibility.

[4] Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs

Patrick Ahrend,Tobias Eder,Xiyang Yang,Zhiyi Pan,Georg Groh

Main category: cs.CL

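TL;DR: This paper studies how Chain-of-Thought (CoT) prompting, while improving LLM reasoning, increases the risk of leaking personally identifiable information (PII) during reasoning, and it presents a model-agnostic evaluation framework together with a benchmark of several lightweight inference-time gatekeepers.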

Details

Motivation: Chain-of-Thought (CoT) prompting improves LLM reasoning but can re-expose sensitive personally identifiable information (PII) in reasoning traces and outputs, even when the model is explicitly instructed not to restate PII. This risk arises directly at inference time and calls for systematic evaluation and mitigation. Method: Propose a model-agnostic PII leakage evaluation framework that (i) defines risk-weighted, token-level leakage events across 11 PII types, (ii) traces leakage curves under different CoT budgets, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. Four lightweight inference-time gatekeepers (a rule-based detector, TF-IDF + logistic regression, a GLiNER-based NER model, and an LLM-as-judge) are benchmarked using risk-weighted F1, Macro-F1, and recall. Result: CoT consistently increases PII leakage, especially for high-risk categories; leakage depends strongly on model family and CoT budget, and a larger budget can either amplify or attenuate leakage depending on the base model. None of the four gatekeepers dominates across all models and budgets. Conclusion: Hybrid, style-adaptive inference-time gatekeeping policies are needed to balance utility against privacy risk under a common, reproducible protocol. Abstract: Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
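
A risk-weighted leakage score and a leakage curve over CoT budgets could be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's framework: the three PII types and their weights in `RISK_WEIGHTS` are hypothetical (the paper uses 11 types whose weights are not given here), leakage is counted per literal occurrence of a PII value rather than per token, and the traces are invented examples.

```python
import re

# Hypothetical risk weights for three PII types; the paper's taxonomy is not reproduced here.
RISK_WEIGHTS = {"ssn": 3.0, "email": 2.0, "name": 1.0}

def leakage_score(trace, pii_items):
    """Sum risk weights over every literal, case-insensitive occurrence of a PII value."""
    score = 0.0
    for pii_type, value in pii_items:
        hits = len(re.findall(re.escape(value), trace, flags=re.IGNORECASE))
        score += RISK_WEIGHTS[pii_type] * hits
    return score

def leakage_curve(traces_by_budget, pii_items):
    """Map each CoT token budget to the leakage score of the trace produced under it."""
    return {
        budget: leakage_score(trace, pii_items)
        for budget, trace in sorted(traces_by_budget.items())
    }

# Invented example: PII present in the prompt, and traces at two reasoning budgets.
pii = [("name", "Ada Lovelace"), ("email", "ada@example.com")]
traces = {
    0: "The user asks about a duplicate charge.",
    256: "Ada Lovelace (ada@example.com) was charged twice, so Ada Lovelace is owed a refund.",
}
curve = leakage_curve(traces, pii)
```

Exact string matching only catches verbatim restatements; paraphrased or partially masked PII would need a detector such as the gatekeepers the abstract lists.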

[5] The Fragility Of Moral Judgment In Large Language Models

Tom van Nuenen,Pratik S. Sachdeva

Main category: cs.CL

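TL;DR: This paper proposes a perturbation framework for testing the stability and manipulability of large language model (LLM) moral judgments. It finds that judgments depend heavily on narrative form (such as point-of-view shifts) and on the evaluation protocol rather than on moral substance, which exposes reliability and fairness risks when LLMs are used for moral guidance.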

Details

Motivation: People increasingly rely on large language models for moral and interpersonal advice, yet these models cannot ask about missing context and can only judge dilemmas as given; the authors test whether such moral judgments are stable, reliable, and reproducible. Method: Construct three families of content perturbations (surface edits, point-of-view shifts, persuasion cues) and three protocol perturbations (output ordering, instruction placement, unstructured prompting), then run a stability analysis over roughly 129,000 judgments generated by four mainstream models on 2,939 r/AmItheAsshole dilemmas. Result: Surface perturbations produce a low flip rate (7.5%), whereas point-of-view shifts produce a high one (24.3%); 37.9% of dilemmas are robust to surface noise yet sensitive to perspective; morally ambiguous cases flip most easily; persuasion cues cause systematic directional shifts; protocol differences dominate, with only 67.6% agreement between structured protocols and only 35.7% of units matching across all three protocols. Conclusion: LLM moral judgments are co-produced by narrative form and task framing, so outputs reflect presentation skill more than moral substance, which poses serious challenges for reproducibility, fairness, and trustworthiness in deployment. Abstract: People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluated all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. 
Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
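
The two stability statistics quoted above, flip rate and Cohen's kappa, can be computed as in the sketch below. It uses the standard definitions of both quantities, but the verdict labels and the two toy label lists are invented for illustration, and the paper's exact aggregation over models and scenarios is not reproduced.

```python
from collections import Counter

def flip_rate(baseline, perturbed):
    """Fraction of items whose verdict changes between baseline and perturbed runs."""
    flips = sum(b != p for b, p in zip(baseline, perturbed))
    return flips / len(baseline)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented verdicts for four dilemmas under two conditions.
baseline = ["YTA", "NTA", "NTA", "NAH"]
perturbed = ["YTA", "NTA", "YTA", "NAH"]

rate = flip_rate(baseline, perturbed)
kappa = cohen_kappa(baseline, perturbed)
```

Kappa is undefined when chance agreement equals 1 (both raters always give the same single label), so a production version would need to guard that division.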

[6] FreeTxt-Vi: A Benchmarked Vietnamese-English Toolkit for Segmentation, Sentiment, and Summarisation

Hung Nguyen Huy,Mo El-Haj,Dawn Knight,Paul Rayson

Main category: cs.CL

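TL;DR: FreeTxt-Vi is an open-source, web-based bilingual (Vietnamese-English) text analysis toolkit for building and analysing free-text corpora without programming skills. Its core is a unified bilingual NLP pipeline that integrates a hybrid segmentation strategy, a fine-tuned sentiment classifier, and a fine-tuned summarisation model, and it matches or exceeds mainstream baselines on segmentation, sentiment analysis, and summarisation.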

Details

Motivation: Lower the technical barrier to multilingual text analysis and support reproducible research and resource building for Vietnamese, a widely spoken language that remains under-resourced in NLP. Method: Design a unified bilingual NLP pipeline that combines hybrid VnCoreNLP and BPE segmentation, a fine-tuned TabularisAI sentiment classifier, and a fine-tuned Qwen2.5 abstractive summarisation model, and integrate corpus linguistics features such as concordancing, keyword analysis, word relation exploration, and interactive visualisation. Result: Across the three evaluations (segmentation, sentiment analysis, summarisation), FreeTxt-Vi matches or outperforms mainstream baselines in both Vietnamese and English. Conclusion: By combining corpus linguistics with current NLP techniques, FreeTxt-Vi provides an easy-to-use, reproducible, multilingual text analysis solution for education, digital humanities, cultural heritage, and the social sciences. Abstract: FreeTxt-Vi is a free and open-source web-based toolkit for creating and analysing bilingual Vietnamese-English text collections. Positioned at the intersection of corpus linguistics and natural language processing (NLP), it enables users to build, explore, and interpret free-text data without requiring programming expertise. The system combines corpus analysis features such as concordancing, keyword analysis, word relation exploration, and interactive visualisation with transformer-based NLP components for sentiment analysis and summarisation. A key contribution of this work is the design of a unified bilingual NLP pipeline that integrates a hybrid VnCoreNLP and Byte Pair Encoding (BPE) segmentation strategy, a fine-tuned TabularisAI sentiment classifier, and a fine-tuned Qwen2.5 model for abstractive summarisation. Unlike existing text analysis platforms, FreeTxt-Vi is evaluated as a set of language processing components. We conduct a three-part evaluation covering segmentation, sentiment analysis, and summarisation, and show that our approach achieves competitive or superior performance compared to widely used baselines in both Vietnamese and English. By reducing technical barriers to multilingual text analysis, FreeTxt-Vi supports reproducible research and promotes the development of language resources for Vietnamese, a widely spoken but underrepresented language in NLP. The toolkit is applicable to domains including education, digital humanities, cultural heritage, and the social sciences, where qualitative text data are common but often difficult to process at scale.

[7] Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis

Hazem Amamou,Stéphane Gagnon,Alan Davoust,Anderson R. Avila

Main category: cs.CL

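TL;DR: This paper uses the RGB benchmark to evaluate LLMs in four scenarios (noise robustness, information integration, negative rejection, and counterfactual robustness), compares the RGB RAG baseline with the knowledge-graph-based GraphRAG system, and tests three GraphRAG customizations aimed at robustness; the results show improvements over the baseline.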

Details

Motivation: Address the drop in LLM response quality caused by inconsistent retrieved information in RAG systems, and improve their reliability in real-world scenarios. Method: Using the RGB benchmark, compare the RGB RAG baseline with GraphRAG (including three customized variants) along four dimensions: noise robustness, information integration, negative rejection, and counterfactual robustness. Result: GraphRAG and its three customizations improve on the RGB baseline on several robustness measures, supporting the role of knowledge-graph-enhanced retrieval in stabilizing RAG systems. Conclusion: Structured retrieval over a knowledge graph can improve the robustness of RAG systems and offers a workable path and design insights for building more reliable RAG systems for real-world use. Abstract: Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However, inconsistent retrieved information can negatively affect LLM responses. The Retrieval-Augmented Generation Benchmark (RGB) was introduced to evaluate the robustness of RAG systems under such conditions. In this work, we use the RGB corpus to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. We perform a comparative analysis between the RGB RAG baseline and GraphRAG, a knowledge-graph-based retrieval system. We test three GraphRAG customizations to improve robustness. Results show improvements over the RGB baseline and provide insights for designing more reliable RAG systems for real-world scenarios.

[8] Cultural Perspectives and Expectations for Generative AI: A Global Survey Approach

Erin van Liemt,Renee Shelby,Andrew Smart,Sinchana Kumbale,Richard Zhang,Neha Dixit,Qazi Mamunur Rashid,Jamila Smith-Loud

Main category: cs.CL

TL;DR: This paper conducts a global survey to understand how different cultures perceive and define 'culture' in the context of Generative AI, and proposes recommendations for culturally responsible GenAI development.

Details

Motivation: There is a lack of empirical evidence on global attitudes regarding how Generative AI should represent cultures. Method: A large-scale global survey across Europe, North and South America, Asia, and Africa, collecting data on cultural meanings and expectations for GenAI representation; working definitions of culture were distilled directly from participants. Result: Identification of conceptual complexities of culture across communities and insights into how GenAI should handle cultural artifacts, concepts, and values. Conclusion: Proposes recommendations including participatory development, attention to non-geographic cultural dimensions (e.g., religion, tradition), and a cultural "redlines" sensitivity framework. Abstract: There is a lack of empirical evidence about global attitudes around whether and how GenAI should represent cultures. This paper assesses understandings and beliefs about culture as it relates to GenAI from a large-scale global survey. We gathered data about what culture means to different groups, and about how GenAI should approach the representation of cultural artifacts, concepts, or values. We distill working definitions of culture directly from these communities to build an understanding of its conceptual complexities and how they relate to representations in Generative AI. We survey from across parts of Europe, North and South America, Asia, and Africa. We conclude with a set of recommendations for Culture and GenAI development. These include participatory approaches, prioritizing specific cultural dimensions beyond geography, such as religion and tradition, and a sensitivity framework for addressing cultural "redlines".

[9] Structured Multidimensional Representation Learning for Large Language Models

Alaa El Ichi,Khalide Jbilou,Mohamed El Guide,Franck Dufrenois

Main category: cs.CL

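TL;DR: This paper proposes a structured spectral factorization of the embedding space based on the L-product for third-order tensors, yielding the Tensor Transformer (L-Transformer). While preserving standard Transformer semantics, it decomposes the encoder into p independent spectral sub-transformers, cutting encoder parameters to roughly 1/p of the original, and introduces a frequency-dependent inductive bias to improve generalization.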

Details

Motivation: Transformer models have large parameter counts and redundancy in the embedding dimension, which limits scalability and efficiency; a structured method that compresses parameters without hurting performance is needed. Method: Define a spectral factorization of the embedding space based on the L-product for third-order tensors, reshape token representations into spectral tensor slices, and perform attention and feed-forward operations in the transform domain; instantiate the transform with a real-valued Discrete Cosine Transform (DCT) so the model stays differentiable and compatible with existing training pipelines. Result: On IMDB and AG News, encoder parameters drop by up to 75% at p=4; accuracy on IMDB matches or slightly exceeds the baseline, while AG News shows a small accuracy drop at moderate width that returns to baseline level at BERT-base width (d=768); slice-dependent frequency scaling is introduced to improve generalization. Conclusion: The L-Transformer achieves parameter compression while preserving semantics through spectral-domain decomposition, and improves generalization through a frequency-domain inductive bias, combining theoretical guarantees with practical value as a lightweight Transformer design. Abstract: Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which yields approximately 1/p reduction (up to lower-order terms such as biases and normalization parameters) in encoder parameters under fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG News show that the proposed model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. 
On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AG News at moderate width we observe a small accuracy decrease in exchange for a fourfold encoder reduction; at BERT-base width (d=768), performance returns to parity.
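
The arithmetic behind the 75% figure at p=4 can be checked with a simple weight count. The sketch below is a back-of-the-envelope model, not the paper's accounting: it assumes a standard encoder layer with four d-by-d attention projections and a feed-forward block with hidden size 4d, ignores biases and normalization parameters (the lower-order terms the abstract mentions), and treats the L-Transformer layer as p parallel layers of width d/p.

```python
def encoder_layer_weights(d, ffn_mult=4):
    """Weight count of one encoder layer: Q, K, V, output projections plus a two-matrix FFN."""
    attention = 4 * d * d
    feed_forward = 2 * d * (ffn_mult * d)
    return attention + feed_forward

def spectral_encoder_weights(d, p, ffn_mult=4):
    """Weight count when the layer is split into p independent sub-layers of width d / p."""
    if d % p != 0:
        raise ValueError("embedding size d must be divisible by p")
    return p * encoder_layer_weights(d // p, ffn_mult)

d, p = 768, 4                            # BERT-base width and the p used in the experiments
dense = encoder_layer_weights(d)         # 12 * d^2 weights
split = spectral_encoder_weights(d, p)   # 12 * d^2 / p weights
reduction = 1 - split / dense            # fraction of encoder weights removed
```

Because every weight matrix scales quadratically in width, p blocks of width d/p hold p * (d/p)^2 = d^2/p entries each, so the encoder keeps about 1/p of its weights and the reduction is 1 - 1/p.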

[10] Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní

Samantha Adorno,Akshata Kishore Moharir,Ratna Kandala

Main category: cs.CL

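TL;DR: Using Guaraní, an official language of Paraguay, as a case study, this paper argues that the text-centric design paradigm of current AI systems neglects oral languages and the needs of indigenous communities. It proposes an oral-first multi-agent architecture that emphasizes turn-taking, repair, and shared context, and argues that spoken conversation itself should be a core design requirement for culturally grounded AI.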

Details

Motivation: Current AI and HCI systems are presented as universal, yet their design remains text-centric and fails to meet the needs of communities that primarily use spoken and indigenous languages, with serious gaps in language support, data sovereignty, and diglossia. Method: Using Guaraní as a case study, propose an oral-first multi-agent architecture that decouples natural language understanding from agents for conversation state management and community-led governance, emphasizing core elements of spoken interaction such as turn-taking, repair, and shared context. Result: A technical framework that respects indigenous data sovereignty and diglossia, supporting the view that spoken interaction can be a first-class design element of AI systems rather than a text-to-speech add-on. Conclusion: For AI to be culturally grounded, it must shift from fitting oral languages into text-centric systems to treating spoken conversation as the primary design premise, so that it empowers rather than marginalizes diverse linguistic practices. Abstract: Although artificial intelligence (AI) and Human-Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text-first, underserving primarily oral languages and indigenous communities. This position paper uses Guaraní, an official and widely spoken language of Paraguay, as a case study to argue that language support in AI remains insufficient unless it aligns with lived oral practices. As an alternative to the standard "text-to-speech" pipeline, we propose an oral-first multi-agent architecture. By decoupling Guaraní natural language understanding from dedicated agents for conversation state and community-led governance, we demonstrate a technical framework that respects indigenous data sovereignty and diglossia. Our work moves beyond mere recognition to focus on turn-taking, repair, and shared context as the primary locus of interaction. We conclude that for AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.

[11] CodeScout: Contextual Problem Statement Enhancement for Software Agents

Manan Suri,Xiangci Li,Mehdi Shojaie,Songyang Han,Chao-Chun Hsu,Shweta Garg,Aniket Anand Deshmukh,Varun Kumar

Main category: cs.CL

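TL;DR: This paper proposes CodeScout, which performs lightweight pre-exploration of the target codebase to turn vague user requests into structured, actionable problem statements, improving how well AI coding assistants resolve underspecified tasks.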

Details

Motivation: Existing AI code assistance tools perform poorly on vague problems that lack context and requirements, often failing through over-exploration or repeated ineffective attempts. Method: CodeScout uses a contextual query refinement strategy: it first scopes the relevant context in the target codebase, then analyzes potential fixes and exploration opportunities from multiple perspectives, and finally synthesizes an enhanced problem statement with reproduction steps, expected behaviors, and exploration hints. Result: On the SWEBench-Verified benchmark, resolution rates improve by 20% over the baseline, with up to 27 additional issues resolved. Conclusion: Systematic query refinement through contextual analysis is a promising direction for improving AI code assistance. Abstract: Current AI-powered code assistance tools often struggle with poorly-defined problem statements that lack sufficient task context and requirements specification. Recent analysis of software engineering agents reveals that failures on such underspecified requests are highly correlated with longer trajectories involving either over-exploration or repeated attempts at applying the same fix without proper evolution or testing, leading to suboptimal outcomes across software development tasks. We introduce CodeScout, a contextual query refinement approach that systematically converts underspecified user requests into comprehensive, actionable problem statements through lightweight pre-exploration of the target codebase. Our key innovation is demonstrating that structured analysis before task execution can supplement existing agentic capabilities without requiring any modifications to their underlying scaffolds. CodeScout performs targeted context scoping, conducts multi-perspective analysis examining potential fixes and exploration opportunities, then synthesizes these insights into enhanced problem statements with reproduction steps, expected behaviors, and targeted exploration hints. This pre-exploration directly addresses the identified failure patterns by reducing non-converging agent trajectories while clarifying user intent in natural language space. We evaluate CodeScout using state-of-the-art agentic scaffolds and language models on SWEBench-Verified, demonstrating a 20% improvement in resolution rates with up to 27 additional issues resolved compared to the default baseline method. 
Our results suggest that systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities.

[12] NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

Genet Asefa Gesese,Zongxiong Chen,Shufan Jiang,Mary Ann Tan,Zhaotai Liu,Sonja Schimmler,Harald Sack

Main category: cs.CL

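TL;DR: This paper introduces the NERdME dataset of 200 manually annotated README files, which fills a gap in scholarly information extraction (SIE) for implementation-level details in code repositories, and shows its usefulness in downstream tasks such as entity linking.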

Details

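Motivation: Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook the key information in implementation-level documents such as the README files of code repositories; the free-form Markdown of READMEs lacks semantic structure, which makes automatic extraction difficult. Method: Build the NERdME dataset of 200 manually annotated README files with over 10,000 labeled spans and 10 entity types; run baseline experiments with large language models and fine-tuned transformer models, complemented by a downstream entity-linking experiment. Result: The baselines reveal clear differences between paper-level and implementation-level entities; the entity-linking experiment shows that entities extracted from READMEs can support the discovery of artifacts such as tools and datasets, as well as metadata integration. Conclusion: NERdME extends SIE benchmarks, highlights the importance of implementation-level entities, and provides a new resource and direction for research software engineering and knowledge graph construction. Abstract: Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment was conducted to demonstrate that entities derived from READMEs can support artifact discovery and metadata integration.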

[13] PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models

Samah Fodeh,Linhai Ma,Ganesh Puthiaraju,Srivani Talakokkul,Afshan Khan,Ashley Hagaman,Sarah Lowe,Aimee Roundtree

Main category: cs.CL

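TL;DR: This paper introduces the PVminer benchmark and the PVminerLLM model for structured extraction of patient voice signals from patient-generated text. PVminerLLM substantially outperforms baselines on multiple metrics, and smaller models already perform well, which supports large-scale analysis of non-clinical factors that affect health.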

Details

Motivation: Patient-generated text contains key information about patients' lived experiences, social circumstances, and engagement in care, but these patient voice signals are rarely available in structured form, which limits their use in patient-centered research and clinical quality improvement; reliable large-scale extraction methods are needed. Method: Build the PVminer benchmark dataset and propose PVminerLLM, a supervised fine-tuned large language model for structured extraction of Codes, Sub-codes, and evidence Spans from patient text. Result: Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, reaching up to 83.82% F1 (Code), 80.74% F1 (Sub-code), and 87.03% F1 (Span); smaller models also perform well, which supports the feasibility of lightweight solutions. Conclusion: PVminerLLM provides an efficient and practical route to analyzing social and experiential signals in patient text at scale, helping to understand and address the non-clinical factors that influence health outcomes. Abstract: Motivation: Patient-generated text contains critical information about patients' lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in patient-centered outcomes research and clinical quality improvement. Reliable extraction of such information is therefore essential for understanding and addressing non-clinical drivers of health outcomes at scale. Results: We introduce PVminer, a benchmark for structured extraction of patient voice, and propose PVminerLLM, a supervised fine-tuned large language model tailored to this task. Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction. Notably, strong performance is achieved even with smaller models, demonstrating that reliable patient voice extraction is feasible without extreme model scale. These results enable scalable analysis of social and experiential signals embedded in patient-generated text. Availability and Implementation: Code, evaluation scripts, and trained LLMs will be released publicly. Annotated datasets will be made available upon request for research use. Keywords: Large Language Models, Supervised Fine-Tuning, Medical Annotation, Patient-Generated Text, Clinical NLP

[14] Tutor Move Taxonomy: A Theory-Aligned Framework for Analyzing Instructional Moves in Tutoring

Zhuqian Zhou,Kirk Vanacore,Tamisha Thompson,Jennifer St John,Rene Kizilcec

Main category: cs.CL

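TL;DR: This paper presents a tutor move taxonomy for large-scale analysis of tutoring dialogue, covering four categories of tutor behavior (tutoring support, learning support, social-emotional and motivational support, and logistical support) and supporting AI-based annotation and computational modeling.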

Details

Motivation: Understanding effective tutoring requires systematic analysis of tutors' instructional actions during learning interactions, and existing methods struggle to support large-scale, standardized analysis. Method: Build the taxonomy through a hybrid deductive-inductive process: first synthesize research from cognitive science, the learning sciences, classroom discourse analysis, and intelligent tutoring systems into a preliminary framework, then refine it through iterative coding of authentic tutoring transcripts by expert annotators. Result: A tutor move taxonomy with four categories (tutoring support, learning support, social-emotional and motivational support, logistical support), in which learning support is further organized along a spectrum of student engagement (for example, eliciting reasoning versus direct explanation). Conclusion: The taxonomy provides an actionable, extensible annotation framework for tutoring dialogue that supports AI-assisted annotation, computational modeling of tutoring strategies, and empirical study of how tutoring behaviors relate to learning outcomes. Abstract: Understanding what makes tutoring effective requires methods for systematically analyzing tutors' instructional actions during learning interactions. This paper presents a tutor move taxonomy designed to support large-scale analysis of tutoring dialogue within the National Tutoring Observatory. The taxonomy provides a structured annotation framework for labeling tutors' instructional moves during one-on-one tutoring sessions. We developed the taxonomy through a hybrid deductive-inductive process. First, we synthesized research from cognitive science, the learning sciences, classroom discourse analysis, and intelligent tutoring systems to construct a preliminary framework of tutoring moves. We then refined the taxonomy through iterative coding of authentic tutoring transcripts conducted by expert annotators with extensive instructional and qualitative research experience. The resulting taxonomy organizes tutoring behaviors into four categories: tutoring support, learning support, social-emotional and motivational support, and logistical support. Learning support moves are further organized along a spectrum of student engagement, distinguishing between moves that elicit student reasoning and those that provide direct explanation or answers. By defining tutoring dialogue in terms of discrete instructional actions, the taxonomy enables scalable annotation using AI, computational modeling of tutoring strategies, and empirical analysis of how tutoring behaviors relate to learning outcomes.

[15] RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

Yuhang Liu,Ruijie Wang,Yunlong Chu,Bing Hao,Yumeng Lin,Shengzhong Liu,Minglai Shao

Main category: cs.CL

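TL;DR: This paper proposes RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning that dynamically assigns strong or lightweight models within the graph to balance performance against cost, substantially reducing token consumption and improving accuracy on multiple benchmarks.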

Details

Motivation: Existing graph-structured reasoning methods such as GoT and AGoT can improve accuracy on some tasks but suffer from high token overhead, high latency, and unstable performance across tasks. The underlying cause is heterogeneity across stages and nodes of the reasoning pipeline: planning and synthesis need strong models, whereas many intermediate subtasks can be solved efficiently by lightweight models. Method: RouteGoT performs in-graph routing: it prioritizes strong models for high-quality planning and final synthesis, and dynamically assigns lightweight models or low-cost strategies to leaf subtasks based on predicted difficulty; it also adds a global inference scheduler with explicit budget constraints that controls graph expansion to meet a user-specified token budget. Result: Across reasoning, retrieval, and multi-hop QA benchmarks, RouteGoT matches or exceeds AGoT's accuracy, improving accuracy by 8.1 percentage points on average while reducing output tokens by 79.1%; compared with other routing baselines, it achieves a better cost-accuracy trade-off and is more robust across budgets and tasks. Conclusion: Through node-level model adaptation and budget-aware graph scheduling, RouteGoT eases the efficiency bottleneck of graph-structured reasoning and offers a new approach to building efficient, controllable, and robust LLM reasoning systems. Abstract: Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions, sometimes underperforming simpler Chain-of-Thought (CoT) or direct input-output prompting (IO). We attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines: high-quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs. 
Experiments across reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average accuracy improvement of 8.1 percentage points and a 79.1% reduction in output tokens compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, demonstrating improved robustness under varying budget targets and tasks.
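
The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` type, the per-call token costs in `COST`, the difficulty threshold, and the "hardest leaves first" ordering are all assumptions chosen for illustration. The sketch only shows the general shape of the technique: strong models for planning and synthesis, difficulty-based assignment for leaves, and a hard token budget.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str          # "plan", "leaf", or "synthesis"
    difficulty: float  # predicted difficulty in [0, 1] (hypothetical predictor output)


# Hypothetical per-call token costs; real costs depend on the models used.
COST = {"strong": 1000, "light": 200}


def route(nodes, budget, threshold=0.7):
    """Assign "strong", "light", or "skip" to each node under a token budget."""
    assignment = {}
    remaining = budget
    # Planning and synthesis nodes always get the strong model.
    for node in nodes:
        if node.role in ("plan", "synthesis"):
            assignment[node.name] = "strong"
            remaining -= COST["strong"]
    if remaining < 0:
        raise ValueError("budget too small for planning and synthesis")
    # Leaves are served hardest-first (an assumption of this sketch).
    leaves = sorted((n for n in nodes if n.role == "leaf"),
                    key=lambda n: -n.difficulty)
    for node in leaves:
        choice = "strong" if node.difficulty >= threshold else "light"
        if COST[choice] > remaining:
            choice = "light"          # fall back to the cheaper model
        if COST[choice] > remaining:
            assignment[node.name] = "skip"  # stop expanding the graph here
            continue
        assignment[node.name] = choice
        remaining -= COST[choice]
    return assignment, remaining
```

With a budget of 3,300 tokens, one planning node, one synthesis node, and three leaves, the two structural nodes consume 2,000 tokens and the leaves compete for the remaining 1,300, so at most one leaf can still be routed to the strong model.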

[16] HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models

Shize Liang,Hongzhi Wang

Main category: cs.CL

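TL;DR: This paper proposes HART, a framework for fine-grained attribution of LLM hallucinations and retrieval of supporting or refuting evidence. It improves interpretability and traceability through four-stage structured modeling (span localization, mechanism attribution, evidence retrieval, causal tracing) and builds the first structured annotated dataset for hallucination tracing.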

Details

Motivation: Existing hallucination attribution methods struggle to establish span-level structured correspondences among hallucination types, error generation mechanisms, and external factual evidence, which leaves hallucinated spans poorly explained and evidence hard to trace. Method: HART formalizes hallucination tracing as a structured modeling task with four stages (span localization, mechanism attribution, evidence retrieval, and causal tracing) and builds the first structured dataset that jointly annotates hallucination types, error mechanisms, and counterfactual evidence. Result: Experiments on the authors' dataset show that HART substantially outperforms strong retrieval baselines such as BM25 and DPR, supporting its effectiveness and generalization for hallucination analysis and evidence alignment. Conclusion: HART offers an interpretable, traceable, fine-grained analysis paradigm for LLM hallucinations, moving hallucination attribution from semantic matching or representation-level discrimination toward structured causal modeling. Abstract: Large language models (LLMs) have demonstrated remarkable performance in text generation and knowledge-intensive question answering. Nevertheless, they are prone to producing hallucinated content, which severely undermines their reliability in high-stakes application domains. Existing hallucination attribution approaches, based on either external knowledge retrieval or internal model mechanisms, primarily focus on semantic similarity matching or representation-level discrimination. As a result, they have difficulty establishing structured correspondences at the span level between hallucination types, underlying error generation mechanisms, and external factual evidence, thereby limiting the interpretability of hallucinated fragments and the traceability of supporting or opposing evidence. To address these limitations, we propose HART, a fine-grained hallucination attribution and evidence retrieval framework for large language models. HART formalizes hallucination tracing as a structured modeling task comprising four stages: span localization, mechanism attribution, evidence retrieval, and causal tracing. Based upon this formulation, we develop the first structured dataset tailored for hallucination tracing, in which hallucination types, error mechanisms, and sets of counterfactual evidence are jointly annotated to enable causal-level interpretability evaluation. Experimental results on the proposed dataset demonstrate that HART substantially outperforms strong retrieval baselines, including BM25 and DPR, validating the effectiveness and generalization capability of the proposed tracing paradigm for hallucination analysis and evidence alignment.

[17] ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

Juyong Jiang,Jiasi Shen,Sunghun Kim,Kang Min Yoo,Jeonghoon Kim,Sungju Kim

Main category: cs.CL

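TL;DR: This paper proposes ReflexiCoder, a reinforcement learning framework that gives LLMs intrinsic self-reflection and self-correction abilities without relying on external feedback or execution engines. It reaches state-of-the-art performance among open-source models on several code generation benchmarks while markedly reducing inference compute overhead.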

Details

Motivation: Standard single-forward-pass generation ("System 1") hits a performance ceiling on complex algorithmic tasks, and existing iterative refinement methods depend on external oracles, execution feedback, or costly prompt-response cycles, so they lack an internalized, autonomous ability to revise their own reasoning. Method: ReflexiCoder uses an RL-zero training paradigm with granular reward functions to internalize the full trajectory of initial generation, bug- and optimization-aware reflection, and self-correction into the model's weights, yielding a fully autonomous reason-reflect-correct trajectory. Result: In a single-attempt setting, ReflexiCoder-8B reaches 94.51% on HumanEval, 81.80% on MBPP, 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces. This is state of the art among open-source models in the 1.5B-14B range and rivals GPT-5.1, with inference token overhead reduced by about 40%. Conclusion: ReflexiCoder internalizes structured reflection and self-correction into the model and removes the dependence on external signals, supporting the effectiveness and efficiency of autonomous reasoning for LLM code generation. Abstract: While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from external-dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. 
Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.

[18] ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

Mingluo Su,Huan Wang

Main category: cs.CL

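TL;DR: This paper proposes ROSE, which reorders the pruning sequence in SparseGPT so that weights with larger potential pruning error are pruned first, substantially improving LLM pruning performance.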

Details

Motivation: SparseGPT uses a fixed left-to-right pruning order, which performs poorly when the weights exhibit columnar patterns; this paper studies and optimizes the pruning order to improve performance. Method: ROSE first performs pre-pruning to identify candidate weights and estimates column-level and block-level pruning loss. It then applies two-level reordering: columns within each block are reordered in descending order of column loss, and blocks are reordered by block loss. A relative-range-of-block-loss metric is introduced to identify columnar layers adaptively. Result: On mainstream LLMs, including LLaMA2-7B/13B/70B, LLaMA3-8B, and Mistral-7B, ROSE outperforms the original SparseGPT and other pruning methods. Conclusion: Pruning order strongly affects Hessian-based one-shot pruning; with structure-aware dynamic reordering, ROSE mitigates the degradation caused by columnar patterns and is a more robust and efficient LLM pruning method. Abstract: Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one-shot pruning is to leverage second-order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre-pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two-level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at https://github.com/mingluo-su/ROSE.
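
The two-level reordering described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the released ROSE code: the per-column losses are taken as given inputs, block loss is assumed to be the sum of its column losses, blocks are assumed to be sorted in descending order of block loss, and "relative range" is assumed to mean (max - min) / mean. The abstract fixes only the descending order within blocks.

```python
def reorder(column_loss, block_size):
    """Return (column permutation, per-block losses) for two-level reordering."""
    n = len(column_loss)
    # Split column indices into consecutive blocks.
    blocks = [list(range(start, min(start + block_size, n)))
              for start in range(0, n, block_size)]
    # Level 1: inside each block, columns in descending order of column loss.
    blocks = [sorted(block, key=lambda j: -column_loss[j]) for block in blocks]
    # Assumption: a block's loss is the sum of its column losses.
    block_loss = [sum(column_loss[j] for j in block) for block in blocks]
    # Level 2 (assumed direction): blocks in descending order of block loss.
    block_order = sorted(range(len(blocks)), key=lambda i: -block_loss[i])
    permutation = [j for i in block_order for j in blocks[i]]
    return permutation, block_loss


def relative_range(block_loss):
    """Assumed definition: (max - min) / mean of the block losses."""
    mean = sum(block_loss) / len(block_loss)
    return (max(block_loss) - min(block_loss)) / mean if mean > 0 else 0.0
```

A layer whose block losses are nearly uniform yields a relative range close to zero, while a layer dominated by a few high-loss blocks yields a large value; under the assumptions above, that is the signal one could threshold to decide whether a layer is columnar and worth reordering.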

[19] Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

Changcheng Li,Jiancan Wu,Hengheng Zhang,Zhengsu Chen,Guo An,Junxiang Qiu,Xiang Wang,Qi Tian

Main category: cs.CL

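TL;DR: This paper proposes CoCA, a reinforcement learning framework built on a "confidence-first" paradigm: before generating an answer, the model outputs an estimate of its probability of answering correctly. Confidence calibration and answer accuracy are jointly optimized through segmented credit assignment.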

Details

Motivation: Existing uncertainty estimation methods for LLMs are mostly "answer-first": they generate an answer and then assess its confidence, which limits usability in real deployments. This paper explores a more practical and reliable "confidence-first" paradigm. Method: The paper proposes CoCA (Co-optimized Confidence and Answers), a GRPO-based reinforcement learning framework that defines separate reward functions and group-relative advantages for the confidence segment and the answer segment, so that calibration and answer accuracy are optimized together. Result: Experiments on math, code, and factual QA benchmarks show that CoCA improves confidence calibration and uncertainty discrimination while preserving answer quality. Conclusion: The confidence-first paradigm is feasible and effective; CoCA offers a new approach to LLM uncertainty modeling and broadens applicability to downstream tasks that need reliable confidence estimates, such as safety-critical applications. Abstract: Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer, which measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model's probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
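
A rough sketch of segmented credit assignment follows. The reward definitions here are assumptions, not taken from the paper: the answer reward is plain correctness, and the confidence reward is a negative squared error against the group's empirical accuracy, which stands in for "the probability of answering correctly under the current policy". Only the general structure, separate rewards and separate group-relative advantages per segment, comes from the abstract.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def segmented_advantages(confidences, correct):
    """Return (confidence-segment advantages, answer-segment advantages).

    confidences: stated confidence in [0, 1] for each sampled rollout
    correct:     1 if that rollout's answer was right, else 0
    """
    # Assumed calibration target: empirical accuracy of the sampled group.
    accuracy = mean(correct)
    confidence_rewards = [-(c - accuracy) ** 2 for c in confidences]
    answer_rewards = [float(y) for y in correct]
    # Each segment gets its own advantage, so a well-calibrated confidence
    # is credited even when the answer that follows it is wrong.
    return group_relative(confidence_rewards), group_relative(answer_rewards)
```

In training, the first list would weight the log-probabilities of the confidence tokens and the second those of the answer tokens. Keeping the two signals apart is one way to prevent the model from raising its answer reward by manipulating the stated confidence, or the reverse.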

[20] VerChol -- Grammar-First Tokenization for Agglutinative Languages

Prabhu Raja

Main category: cs.CL

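TL;DR: This paper proposes a new tokenization method for agglutinative languages, aiming to fix the way BPE and similar methods break morpheme boundaries and inflate token counts in such languages.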

Details

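Motivation: Mainstream tokenization methods such as BPE are script-agnostic and optimized for English-like morphology. They handle agglutinative languages (e.g., the Dravidian, Turkic, and Uralic families) poorly, because a single word in these languages carries many grammatical signals (tense, person, case, postpositions, and so on); the result is severed morpheme boundaries and inflated token counts. Method: The paper proposes a tokenization method designed for agglutinative languages that emphasizes preserving morpheme boundaries and reducing token counts; the abstract does not give technical details. Result: The method is expected to improve tokenization of agglutinative languages in LLMs and to strengthen model understanding and generation for morphologically rich languages. Conclusion: Tokenization for agglutinative languages needs to move beyond the conventional BPE paradigm toward morphology-aware strategies that better support multilingual LLMs. Abstract: Tokenization is the foundational step in all large language model (LLM) pipelines, yet the dominant approach, Byte Pair Encoding (BPE) and its variants, is inherently script-agnostic and optimized for English-like morphology. For agglutinative languages, a typological class encompassing the Dravidian family (Tamil, Kannada, Telugu, Malayalam), Turkic languages (Turkish, Azerbaijani, Uzbek), Uralic languages (Finnish, Hungarian, Estonian), Korean, Japanese, Swahili, Basque, and others, a single word may encode root, tense, aspect, person, number, gender agreement, case, and postpositions into one orthographic unit. Statistical tokenizers fragment these words into byte-pair chunks that sever morpheme boundaries and inflate token counts.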

[21] Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Junjie Li,Xinrui Guo,Yuhao Wu,Roy Ka-Wei Lee,Hongzhi Li,Yutao Xie

Main category: cs.CL

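TL;DR: This paper presents the ConStory-Bench benchmark and the ConStory-Checker detection tool to systematically evaluate narrative consistency in long-form story generation by LLMs, and reports how consistency errors are distributed and what drives them.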

Details

Motivation: Existing story generation evaluations focus mainly on plot quality and fluency and overlook the consistency errors (factual, character, temporal, and others) that are common in long-form narratives. Method: The authors build ConStory-Bench, with 2,000 prompts covering four task scenarios; define a taxonomy of five error categories with 19 subtypes; develop ConStory-Checker, an automated detector grounded in textual evidence; and evaluate a range of LLMs through five research questions. Result: Consistency errors are most common in the factual and temporal dimensions, cluster around the middle of narratives, are more likely in segments with high token-level entropy, and certain error types tend to co-occur. Conclusion: The work fills a gap in evaluating long-form narrative consistency and provides a quantifiable evaluation framework and empirical evidence for improving it. Abstract: What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.

[22] Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions

Hussein Ghaly

Main category: cs.CL

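TL;DR: This paper proposes a method that exploits LLM performance variability to build ensemble systems for accurate, efficient semantic tagging and data cleaning of UN Security Council resolutions. It introduces two new metrics, CPR and TWF, to curb hallucinations and unnecessary additions or omissions, yielding a low-cost, reliable semantic tagging system.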

Details

Motivation: LLMs are prone to hallucinations, redundant additions, and omissions in semantic tagging, which limits their use in settings that demand high reliability, such as processing UN resolutions. Method: The authors build an ensemble over several LLMs (e.g., GPT-4.1 and GPT-4.1-mini), introduce two evaluation metrics, Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), and select the best output across multiple runs using these metrics. Result: GPT-4.1 performs best on both cleaning (CPR 84.9%) and semantic tagging (CPR 99.99%, TWF 99.92%); GPT-4.1-mini reaches comparable performance at 20% of the cost; the ensemble reliably selects the best cleaned and tagged outputs. Conclusion: Combining performance-variability modeling, new evaluation metrics, and lighter substitute models makes it possible to build LLM semantic tagging systems that are accurate, robust, and cost-effective. Abstract: This paper introduces a new methodology for using LLM-based systems for accurate and efficient semantic tagging of UN Security Council resolutions. The main goal is to leverage LLM performance variability to build ensemble systems for data cleaning and semantic tagging tasks. We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement. These metrics allow the selection of the best output from multiple runs of several GPT models. GPT-4.1 achieved the highest metrics for both tasks (Cleaning: CPR 84.9% - Semantic Tagging: CPR 99.99% and TWF 99.92%). In terms of cost, smaller models, such as GPT-4.1-mini, achieved comparable performance to the best model in each task at only 20% of the cost. These metrics ultimately allowed the ensemble to select the optimal output (both cleaned and tagged content) for all the LLM models involved, across multiple runs. With this ensemble design and the use of metrics, we create a reliable LLM system for performing semantic tagging on challenging texts.

[23] InfoGatherer: Principled Information Seeking via Evidence Retrieval and Strategic Questioning

Maksym Taranukhin,Shuyue Stella Li,Evangelos Milios,Geoff Pleiss,Yulia Tsvetkov,Vered Shwartz

Main category: cs.CL

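TL;DR: This paper proposes InfoGatherer, a framework that combines document retrieval with follow-up questions to the user to provide more reliable and interpretable document-grounded QA decision support in high-stakes domains such as law and medicine. It models uncertainty with Dempster-Shafer evidence theory instead of relying on implicit LLM confidence signals, which reduces incorrect, overconfident answers and lowers the number of interaction turns.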

Details

Motivation: When the user's initial query is underspecified, a single retrieval pass in existing document QA systems cannot support reliable decisions, and follow-up questioning driven by implicit LLM confidence signals lacks interpretability and structure: it cannot clearly identify what is unknown, which missing items matter most, or when to stop asking. Method: InfoGatherer builds a structured evidential network and uses Dempster-Shafer theory to model and fuse incomplete or contradictory evidence from retrieved documents and user follow-ups, without forcing premature convergence to a definitive answer; the two sources (documents and user) jointly fill in missing information. Result: On legal and medical tasks, InfoGatherer outperforms strong baselines while requiring fewer interaction turns. Conclusion: Grounding uncertainty modeling in formal evidence theory rather than heuristic LLM signals improves the trustworthiness and interpretability of LLM decision support in high-stakes domains. Abstract: LLMs are increasingly deployed in high-stakes domains such as medical triage and legal assistance, often as document-grounded QA systems in which a user provides a description, relevant sources are retrieved, and an LLM generates a prediction. In practice, initial user queries are often underspecified, and a single retrieval pass is insufficient for reliable decision-making, leading to incorrect and overly confident answers. While follow-up questioning can elicit missing information, existing methods typically depend on implicit, unstructured confidence signals from the LLM, making it difficult to determine what remains unknown, what information matters most, and when to stop asking questions. We propose InfoGatherer, a framework that gathers missing information from two complementary sources: retrieved domain documents and targeted follow-up questions to the user. InfoGatherer models uncertainty using Dempster-Shafer belief assignments over a structured evidential network, enabling principled fusion of incomplete and potentially contradictory evidence from both sources without prematurely collapsing to a definitive answer. Across legal and medical tasks, InfoGatherer outperforms strong baselines while requiring fewer turns. By grounding uncertainty in formal evidential theory rather than heuristic LLM signals, InfoGatherer moves towards trustworthy, interpretable decision support in domains where reliability is critical.
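
To make the evidence-fusion step concrete, here is a minimal sketch of Dempster's rule of combination, the standard operator for merging two Dempster-Shafer mass functions. It does not reproduce InfoGatherer's structured evidential network; the hypotheses ("flu", "cold") and the mass values are invented for illustration, with one mass function standing in for retrieved documents and the other for a user's follow-up answer.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: fuse two mass functions over the same frame.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses of one function sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            fused[intersection] = fused.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to incompatible hypotheses counts as conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


# Illustrative inputs (not from the paper).
FRAME = frozenset({"flu", "cold"})
from_documents = {frozenset({"flu"}): 0.6, FRAME: 0.4}
from_user = {frozenset({"cold"}): 0.5, FRAME: 0.5}
fused = combine(from_documents, from_user)
```

The mass left on the whole frame after fusion represents what is still unknown. A system in this style could keep asking follow-up questions while that residual mass stays high, rather than committing to an answer early.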

[24] Learning Next Action Predictors from Human-Computer Interaction

Omar Shaikh,Valentin Teutschbein,Kanishk Gandhi,Yikun Chi,Nick Haber,Thomas Robinson,Nilam Ram,Byron Reeves,Sherry Yang,Michael S. Bernstein,Diyi Yang

Main category: cs.CL

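TL;DR: This paper introduces a new task, Next Action Prediction (NAP), which predicts a user's next action from multimodal interaction data such as screenshots, clicks, and sensor data. It builds a large-scale, real-world dataset and proposes LongNAP, a model that combines parametric and in-context learning and clearly outperforms baselines in prediction accuracy and cross-user generalization.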

Details

Motivation: Truly proactive AI systems must anticipate user behavior, and sparse text prompts alone are not enough; the system has to reason over the full context of what the user sees and does. Method: The paper defines the NAP task; uses vision-language models to automatically annotate one month of phone usage from 20 users (1,800 hours of screen time, over 360K actions); releases an open-source annotation pipeline; and designs LongNAP, which combines policy-gradient training, generation of user-specific reasoning traces, retrieval of past traces, and in-context application of the retrieved traces. Result: Under LLM-as-judge evaluation, LongNAP outperforms supervised finetuning and prompted baselines by 79% and 39%, respectively, and generalizes to held-out users. Overall, 17.1% of predicted trajectories are well aligned with the user's actual next actions (score ≥ 0.5), rising to 26% among highly confident predictions. Conclusion: Predicting future actions from the full context of user behavior is now a viable research direction with substantial potential. Abstract: Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts -- it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held-out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. 
Despite this, 17.1% of LongNAP's predicted trajectories are well-aligned with what a user does next (LLM-judge score ≥ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.

[25] Addressing the Ecological Fallacy in Larger LMs with Human Context

Nikita Soni,Dhruv Vijay Kunjadiya,Pratham Piyush Shah,Dikshya Mohanty,H. Andrew Schwartz,Niranjan Balasubramanian

Main category: cs.CL

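TL;DR: This paper examines whether modeling an author's language context (HuLM/HuFT) in a larger language model (an 8B Llama) can overcome the ecological fallacy. Introducing author context during fine-tuning alone (HuFT) already improves performance, and QLoRA-based continued HuLM pre-training additionally yields generalized gains across multiple downstream tasks.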

Details

Motivation: Language model training and inference ignore the dependence among multiple texts written by the same author, a form of ecological fallacy. Prior work showed that addressing it markedly improves small models; this paper tests whether the benefit carries over to a larger model (an 8B Llama). Method: The paper uses the HuLM (Human-aware Language Modeling) task to model an author's language context, in particular the author's other temporally ordered texts, during pre-training and fine-tuning. QLoRA is used for parameter-efficient fine-tuning (HuFT) and continued pre-training, and linear classifiers are evaluated on multiple downstream tasks. Result: HuFT (fine-tuning with author context) alone outperforms standard fine-tuning; QLoRA-based continued HuLM pre-training yields improvements on eight downstream tasks using only linear classifiers, supporting the effectiveness and generality of human-aware modeling. Conclusion: Modeling the context of the people who generate language (the authors) is important for improving LLM performance and generalization and should be a key direction for future language modeling. Abstract: Language model training and inference ignore a fundamental linguistic fact -- there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of ecological fallacy can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author's language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author's language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT: Human-aware Fine-Tuning). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training results in a human-aware model generalizable for improved performance over eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.

[26] Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling

Chanhui Zhu

Main category: cs.CL

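TL;DR: This paper proposes a structured style-rewrite framework that explicitly disentangles style into lexical, syntactic, and pragmatic dimensions and adds implicit style control via Chain-of-Thought (CoT) distillation, substantially improving the style consistency and semantic fidelity of small language models in role-playing.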

Details

Motivation: Small language models struggle to maintain highly stylized personas in role-playing, mainly because of data scarcity and the complexity of style disentanglement; standard supervised fine-tuning tends to produce "Out-Of-Character" (OOC) output. Method: The paper proposes a structured style-rewrite framework that (1) explicitly disentangles style into lexical (PMI), syntactic (PCFG rules), and pragmatic dimensions, and (2) introduces an implicit style conditioning strategy based on CoT distillation, using reasoning traces as an inductive bias to align latent representations. Result: On a highly stylized anime-character task, a Qwen-1.7B model trained with this method outperforms larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity, and it can be deployed on consumer hardware. Conclusion: The method offers a data-efficient, interpretable, and lightweight paradigm for style modeling that supports practical use of small models in style-sensitive tasks such as role-playing. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing (RP); however, adapting small Language Models (SLMs) to highly stylized personas remains a challenge due to data scarcity and the complexity of style disentanglement. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics while failing to reproduce the intricate syntactic and pragmatic nuances of a character, leading to "Out-Of-Character" (OOC) generation. To address this, we propose a Structured Style-Rewrite Framework that explicitly disentangles style into three interpretable dimensions: lexical signatures (via PMI), syntactic patterns (grounded in PCFG rules), and pragmatic style. Furthermore, we introduce an implicit style conditioning strategy via Chain-of-Thought (CoT) distillation. By leveraging explicit reasoning traces during training as a strong inductive bias, our approach aligns the model's latent representations with structured style features, enabling high-fidelity stylized generation without requiring explicit reasoning tokens during inference. Extensive experiments on a specific high-stylization domain (anime characters) demonstrate that our method enables a Qwen-1.7B model to outperform significantly larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity. Our approach offers a data-efficient paradigm for democratizing inference and deployment on consumer hardware.
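
The "lexical signatures (via PMI)" step can be illustrated with a small sketch. The abstract does not say how the paper tokenizes text or which corpus serves as the reference, so the setup below is an assumption: pointwise mutual information between a word and a character is computed as log(p(word | character) / p(word)), where p(word) is estimated over the character's lines plus a background corpus. The example tokens are invented.

```python
import math
from collections import Counter


def pmi_scores(character_tokens, background_tokens):
    """PMI of each word with the character: log(p(w | character) / p(w)).

    p(w) is estimated over character_tokens + background_tokens, so every
    word the character uses has a non-zero overall probability.
    """
    all_tokens = character_tokens + background_tokens
    char_counts = Counter(character_tokens)
    all_counts = Counter(all_tokens)
    n_char, n_all = len(character_tokens), len(all_tokens)
    return {
        word: math.log((count / n_char) / (all_counts[word] / n_all))
        for word, count in char_counts.items()
    }


# Invented toy data: a character with a verbal tic versus neutral text.
scores = pmi_scores(["nya", "hello", "nya"], ["hello", "world", "hello"])
signature = [w for w, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]
```

Words with positive PMI are over-represented in the character's speech relative to the background and are candidates for a lexical signature; words with negative PMI are generic. In practice a minimum-count filter would be needed, since PMI is unstable for rare words.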

[27] Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models

Nikita Soni,August Håkan Nilsson,Syeda Mahwish,Vasudha Varadarajan,H. Andrew Schwartz,Ryan L. Boyd

Main category: cs.CL

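TL;DR: This paper combines psychological theory with computational modeling to predict individual well-being from longitudinal social media data and to identify adaptive and maladaptive self-states. The approach integrates personality traits with situational language features based on the DIAMONDS framework and compares them with embeddings from a psychometrically informed language model. Theory-driven features remain competitive while being more interpretable.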

Details

Motivation: Mental health is a dynamic process shaped by the interaction of individual traits and situations. Existing computational methods lack theoretical grounding and interpretability, so psychological theory is needed to make models more context-sensitive and human-understandable. Method: The authors build interpretable models that integrate person-level psychological traits (e.g., resilience, cognitive distortions) with language-inferred situational features based on the Situational 8 DIAMONDS framework, and compare them against embeddings from a psychometrically informed language model. Result: Theory-driven features match black-box embeddings in predicting well-being while offering greater interpretability; qualitative analysis shows that the most predictive features are psychologically coherent. Conclusion: Tightly combining computational modeling with psychological theory enables more accurate, interpretable, and context-sensitive assessment of dynamic mental states. Abstract: Mental health is not a fixed trait but a dynamic process shaped by the interplay between individual dispositions and situational contexts. Building on interactionist and constructionist psychological theories, we develop interpretable models to predict well-being and identify adaptive and maladaptive self-states in longitudinal social media data. Our approach integrates person-level psychological traits (e.g., resilience, cognitive distortions, implicit motives) with language-inferred situational features derived from the Situational 8 DIAMONDS framework. We compare these theory-grounded features to embeddings from a psychometrically-informed language model that captures temporal and individual-specific patterns. Results show that our principled, theory-driven features provide competitive performance while offering greater interpretability. Qualitative analyses further highlight the psychological coherence of features most predictive of well-being. These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.

[28] Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL

Bingfeng Chen,Shaobin Shi,Yongqi Luo,Boyan Xu,Ruichu Cai,Zhifeng Hao

Main category: cs.CL

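TL;DR: This paper proposes Track-SQL, a framework whose dual extractive modules strengthen a generative language model's ability to track context and database schema changes in multi-turn Text-to-SQL, reaching state-of-the-art performance on the SparC and CoSQL datasets.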

Details

Motivation: Generative language models perform well on single-turn Text-to-SQL but degrade in multi-turn settings because they struggle to model contextual dependencies and dynamic schema linking. Method: Track-SQL combines a Semantic-enhanced Schema Extractor and a Schema-aware Context Extractor, which together improve the generative model's handling of schema and context changes across turns. Result: Track-SQL reaches state-of-the-art execution accuracy on SparC and CoSQL; ablation studies show execution accuracy gains of 7.1% and 9.55%, respectively. Conclusion: The dual extractive modules compensate for the context- and schema-tracking weaknesses of generative models in multi-turn Text-to-SQL and substantially improve performance. Abstract: Generative language models have shown significant potential in single-turn Text-to-SQL. However, their performance does not extend equivalently to multi-turn Text-to-SQL. This is primarily due to generative language models' inadequacy in handling the complexities of context information and dynamic schema linking in multi-turn interactions. In this paper, we propose a framework named Track-SQL, which enhances generative language models with dual-extractive modules designed to track schema and contextual changes in multi-turn Text-to-SQL. Specifically, Track-SQL incorporates a Semantic-enhanced Schema Extractor and a Schema-aware Context Extractor. Experimental results demonstrate that Track-SQL achieves state-of-the-art performance on the SparC and CoSQL datasets. Furthermore, detailed ablation studies reveal that Track-SQL significantly improves execution accuracy in multi-turn interactions by 7.1% and 9.55% on these datasets, respectively. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/Track-SQL.

[29] MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Yang Liu,Jinxuan Cai,Yishen Li,Qi Meng,Zedi Liu,Xin Li,Chen Qian,Chuan Shi,Cheng Yang

Main category: cs.CL

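TL;DR: This paper presents MASFactory, a graph-centric framework for orchestrating LLM-based multi-agent systems (MAS). Its Vibe Graphing approach turns natural-language intent into editable, executable workflow graphs, and the framework supports reusable components, pluggable context integration, and visual interaction.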

Details

Motivation: Existing LLM-based MAS frameworks require substantial manual effort to implement complex graph workflows, offer limited reuse, and make it hard to integrate heterogeneous external context sources. Method: MASFactory is built around Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. It also provides reusable components, a pluggable context integration mechanism, and visualization tools for topology preview, runtime tracing, and human-in-the-loop interaction. Result: Evaluation on seven public benchmarks validates both the reproduction consistency of MASFactory for representative MAS methods and the effectiveness of Vibe Graphing. Conclusion: MASFactory lowers the barrier to building complex MAS workflows and improves reusability and context integration, advancing the practical use of graph-structured MAS. Abstract: Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory) and video (https://youtu.be/ANynzVfY32k) are publicly available.

[30] ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Xingjian Tao,Yiwei Wang,Yujun Cai,Yifan Song,Jing Tang

Main category: cs.CL

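TL;DR: This paper proposes ViewFusion, a two-stage framework (cross-view spatial pre-alignment followed by question-driven reasoning) that improves multi-view spatial reasoning. It is trained with synthetic supervision and GRPO reinforcement learning and improves accuracy on MMSI-Bench.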

Details

Motivation: Current vision-language models make poor use of cross-view relations in multi-view spatial reasoning and tend to rely on single-image shortcuts, which makes them fragile under viewpoint transformations and in occlusion-sensitive cases. Method: ViewFusion has two stages. The first performs explicit cross-view spatial pre-alignment, building an intermediate workspace that goes beyond re-description; the second performs question-driven reasoning conditioned on that workspace. Training uses synthetic reasoning supervision followed by GRPO reinforcement learning, which improves correctness and stabilizes the two-stage behavior. Result: On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment. Conclusion: Explicitly separating spatial pre-alignment from question answering, together with a targeted training strategy, strengthens multi-view spatial understanding. Abstract: Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.

[31] Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring

Jonas Kubesch,Lena Huber,Clemens Havas

Main category: cs.CL

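TL;DR: This paper examines rubric-based automated scoring of Austrian A-level German essays with open-weight LLMs. The models agree with the human rater on at most 40.6% of rubric sub-dimensions and match only 32.8% of final grades, which is not yet sufficient for real-world grading.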

Details

Motivation: The work aims to reduce teachers' grading workload and subjective bias, and to test whether LLMs can perform rubric-based automated essay scoring in a non-English, niche educational setting (the Austrian A-level German exam). Method: Four open-weight LLMs (DeepSeek-R1 32B, Qwen3 30B, Mixtral 8x7B, Llama3.3 70B) were used, under different contexts and prompting strategies, to grade 101 anonymized Austrian A-level German essays covering three text types according to the official rubric; agreement with human expert grades was then measured. Result: The LLMs reached at most 40.6% agreement with the human rater on rubric sub-dimensions, and only 32.8% of final grades matched. Smaller models can understand and apply standardized rubrics but are not accurate enough. Conclusion: Current open-weight LLMs are not yet reliable enough to replace human grading of Austrian A-level German essays in real teaching settings; better domain adaptation and scoring consistency are needed. Abstract: Automated Essay Scoring (AES) has been explored for decades with the goal of supporting teachers by reducing grading workload and mitigating subjective biases. While early systems relied on handcrafted features and statistical models, recent advances in Large Language Models (LLMs) have made it possible to evaluate student writing with unprecedented flexibility. This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation. A dataset of 101 anonymised student exams across three text types was processed and evaluated. Four LLMs, DeepSeek-R1 32B, Qwen3 30B, Mixtral 8x7B and Llama3.3 70B, were evaluated with different contexts and prompting strategies. The LLMs were able to reach a maximum of 40.6% agreement with the human rater in the rubric-provided sub-dimensions, and only 32.8% of final grades matched the ones given by a human expert. The results indicate that even though smaller models are able to use standardised rubrics for German essay grading, they are not accurate enough to be used in a real-world grading environment.

[32] Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality

Xi Wang,Mengdie Zhuang,Jiqun Liu

Main category: cs.CL

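TL;DR: This study uses continued pre-training on domain-specific text to simulate the accumulation of experience in LLMs and quantifies the resulting personality traits with the Machine Personality Inventory (MPI). Model competence is bimodal, peaking at "Expressive Generalists" and "Suppressed Specialists"; reduced social traits improve complex reasoning; and the study reports a causal link between the linguistic features of training data and model personality and capability.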

Details

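Motivation: Human problem-solving is enriched by diverse styles and personality traits, yet LLM development has relied mainly on uniform performance benchmarks that favour specific behavioural tendencies such as assertiveness, overlooking how personality diversity affects machine intelligence. Method: Uses unsupervised continued pre-training to expose models to domain-specific texts, simulating the accumulation of experience; adapts the Big Five framework via the Machine Personality Inventory (MPI) to quantify the models' personality traits and analyses their relationship to linguistic style and reasoning behaviour. Result: Model competence is bimodal, peaking at "Expressive Generalists" and "Suppressed Specialists"; a "Suppression Advantage" is identified in which reduced social traits improve complex reasoning performance; a causal link is established between linguistic features of the training data, such as imperative frequency, and personality-related indicators such as lexical diversity. Conclusion: Model personality can be shaped in a data-driven way and significantly influences reasoning behaviour; "Personality Engineering" may become a new paradigm for improving specific LLM capabilities. Abstract: Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such as assertiveness. To investigate how diverse experiences shape machine personality and influence problem-solving, this study employs continued pre-training to expose models to domain-specific texts in an unsupervised manner, simulating the accumulation of experience. By adapting the Big Five framework via the Machine Personality Inventory (MPI), we quantify the personality traits of these model variants and analyse their relationship to linguistic style and reasoning behaviour. The findings reveal that model competence is bimodal, peaking at "Expressive Generalists" and "Suppressed Specialists," while identifying a "Suppression Advantage" where reduced social traits enhance complex reasoning performance. This study further establishes a causal link between training data linguistics, such as imperative frequency, and lexical diversity, providing a roadmap for "Personality Engineering".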

[33] Making Implicit Premises Explicit in Logical Understanding of Enthymemes

Xuyao Feng,Anthony Hunter

Main category: cs.CL

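TL;DR: This paper proposes a pipeline that combines large language models with neuro-symbolic reasoning to translate enthymemes (arguments with implicit premises) in text into logical formulas and to verify logical entailment.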

Details

Motivation: Existing NLP methods can identify enthymemes but do not decode their underlying logic, while logic-based approaches assume a sufficiently complete knowledge base; there is no systematic method for translating the textual components of an enthymeme into logical formulas and then showing entailment. Method: A three-stage pipeline: (1) an LLM generates the implicit premises; (2) another LLM translates the natural language into logical formulas; (3) a neuro-symbolic reasoner based on a SAT solver determines entailment. Result: Evaluated on two enthymeme datasets, the pipeline shows promising precision, recall, F1-score, and accuracy on the implicit-premise selection task. Conclusion: The pipeline bridges NLP and logical reasoning and offers a workable route to the automatic identification and logical decoding of implicit premises. Abstract: Real-world arguments in text and dialogues are normally enthymemes (i.e. some of their premises and/or claims are implicit). Natural language processing (NLP) methods for handling enthymemes can potentially identify enthymemes in text but they do not decode their underlying logic, whereas logic-based approaches for handling them assume a knowledge base with sufficient formulae that can be used to decode them via abduction. There is therefore a lack of a systematic method for translating textual components of an enthymeme into a logical argument and generating the logical formulae required for their decoding, and thereby showing logical entailment. To address this, we propose a pipeline that integrates: (1) a large language model (LLM) to generate intermediate implicit premises based on the explicit premise and claim; (2) another LLM to translate the natural language into logical formulas; and (3) a neuro-symbolic reasoner based on a SAT solver to determine entailment. We evaluate our pipeline on two enthymeme datasets, demonstrating promising performance in selecting the correct implicit premise, as measured by precision, recall, F1-score, and accuracy.
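The final entailment step of such a pipeline can be illustrated with a toy propositional check. This is only a sketch and not the authors' reasoner: it swaps the SAT solver for brute-force truth-table enumeration, and the formula encoding (Python callables over a dictionary of atoms) and the example atoms `rain` and `cancelled` are assumptions made purely for illustration.

```python
from itertools import product

def entails(premises, claim, atoms):
    """Return True if every assignment satisfying all premises also satisfies claim.

    This is the same question a SAT-based reasoner answers by testing whether
    premises AND (NOT claim) is unsatisfiable.
    """
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(p(world) for p in premises) and not claim(world):
            return False  # counter-model found, so no entailment
    return True

# Hypothetical enthymeme: "It is raining, so the match is cancelled."
atoms = ["rain", "cancelled"]
explicit = lambda w: w["rain"]
claim = lambda w: w["cancelled"]
# Candidate implicit premise: "If it rains, the match is cancelled."
implicit = lambda w: (not w["rain"]) or w["cancelled"]

without_premise = entails([explicit], claim, atoms)
with_premise = entails([explicit, implicit], claim, atoms)
print(without_premise, with_premise)
```

Under these assumptions, a candidate implicit premise is acceptable when adding it turns a non-entailment into an entailment, which is one plausible way to select among generated premises.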

[34] Diffusion Language Models Are Natively Length-Aware

Vittorio Rossi,Giacomo Cirò,Davide Beltrame,Luca Gandolfi,Paul Röttger,Dirk Hovy

Main category: cs.CL

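TL;DR: This paper proposes a zero-shot method that dynamically crops the context window of diffusion language models (DLMs) to cut redundant computation on short responses, substantially reducing FLOPs with almost no loss in performance and with improved performance on some tasks.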

Details

Motivation: Diffusion language models (DLMs) always operate over a fixed maximum-length context for a predetermined number of denoising steps, which wastes computation on the many short responses typical of reasoning and chat tasks. Method: Building on the conjecture that the latent prompt representation can predict the output length, the paper designs a zero-shot mechanism that dynamically crops the context window before generation, reducing the number of diffusion steps required. Result: On four benchmarks (GSM8K, HumanEval, IfEval, and LongFormQA), FLOPs drop significantly with no statistically significant performance degradation, and performance improves significantly on 2 of the 4 tasks. Conclusion: Estimating response length from the latent prompt representation and cropping the context accordingly is an efficient and practical strategy that offers an important optimization path for deploying DLMs. Abstract: Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks -- GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) -- revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.

[35] A Causal Graph Approach to Oppositional Narrative Analysis

Diego Revilla,Martin Fernandez-de-Retana,Lingfeng Chen,Aritz Bilbao-Jayo,Miguel Fernandez-de-Retana

Main category: cs.CL

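TL;DR: This paper proposes a graph-based framework for detecting, analyzing, and classifying oppositional narratives and their underlying entities. Narratives are represented as entity-interaction graphs, and node-level causal estimation is used to build a minimal causal subgraph that yields better oppositional-thinking classification.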

Details

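Motivation: Existing text-analysis methods rely on data annotated within predefined ontologies, which tends to embed human bias, and they perform only unstructured, linear pattern recognition rather than modeling the structured interactions between entities that naturally emerge in discourse. Method: Proposes a graph-based framework that represents narratives as entity-interaction graphs, incorporates causal estimation at the node level to distill the sentence graph into a minimal causal subgraph, and builds a new classification pipeline on this representation. Result: The proposed method outperforms existing approaches on the oppositional thinking classification task. Conclusion: Structured representations based on graphs and causal inference capture entity relations in oppositional narratives more effectively, improving classification performance and interpretability. Abstract: Current methods for textual analysis rely on data annotated within predefined ontologies, often embedding human bias within black-box models. Despite achieving near-perfect performance, these approaches exploit unstructured, linear pattern recognition rather than modeling the structured interactions between entities that naturally emerge in discourse. In this work, we propose a graph-based framework for the detection, analysis, and classification of oppositional narratives and their underlying entities by representing narratives as entity-interaction graphs. Moreover, by incorporating causal estimation at the node level, our approach derives a causal representation of each contribution to the final classification by distilling the constructed sentence graph into a minimal causal subgraph. Building upon this representation, we introduce a classification pipeline that outperforms existing approaches to the oppositional thinking classification task.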

[36] CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Mohammed Baharoon,Thibault Heintz,Siavash Raissi,Mahmoud Alabbad,Mona Alhammad,Hassan AlOmaish,Sung Eun Kim,Oishi Banerjee,Pranav Rajpurkar

Main category: cs.CL

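TL;DR: CRIMSON is a clinically grounded evaluation framework for chest X-ray report generation that assesses diagnostic correctness, contextual relevance, and patient safety. It incorporates clinical context (such as age and indication) and guideline-driven severity weighting, and achieves strong clinical alignment through a multi-dimensional error taxonomy and radiologist validation.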

Details

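Motivation: Existing evaluation metrics lack clinical context, are easily swayed by irrelevant or normal findings, and cannot distinguish errors by clinical severity, so they do not reflect how report quality affects diagnosis and treatment. Method: Builds the CRIMSON framework: defines three dimensions (diagnostic correctness, contextual relevance, patient safety); establishes a clinical error taxonomy covering false findings, missing findings, and eight attribute-level errors; assigns each finding one of four clinical significance levels (urgent, actionable non-urgent, non-actionable, expected/benign) based on a guideline developed with attending cardiothoracic radiologists, enabling severity-weighted scoring; and introduces two new benchmarks, RadJudge (challenging pass-fail scenarios) and RadPref (over 100 pairwise preference cases with structured error categorization and 1-5 quality ratings). Result: On ReXVal, CRIMSON aligns strongly with the clinically significant error counts annotated by six radiologists (Kendall's tau 0.61-0.71; Pearson's r 0.71-0.84); on RadJudge it agrees with expert judgment; on RadPref it achieves the strongest alignment with the overall quality preferences of three radiologists; the metric, the benchmarks, and a fine-tuned MedGemma model are released. Conclusion: CRIMSON markedly improves the clinical credibility and practicality of evaluating chest X-ray report generation, offering an interpretable, severity-aware, expert-validated evaluation standard for medical AI reporting systems. Abstract: We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment.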
In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
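The rank-correlation figures quoted for ReXVal can be made concrete with a small example. The sketch below computes Kendall's tau-a in pure Python; the abstract does not say which tau variant or tie handling was used, so tau-a is an assumption, and the `metric_penalty` and `error_counts` values are invented for illustration only.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # tied pairs (sign == 0) count towards neither total
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical data: a metric's penalty per report and the number of
# clinically significant errors a radiologist counted in the same report.
metric_penalty = [0.1, 0.3, 0.8, 0.6]
error_counts = [0, 1, 2, 3]

tau = kendall_tau_a(metric_penalty, error_counts)
print(round(tau, 3))
```

A value near 1 means the metric ranks reports in nearly the same order as the radiologist error counts; a value near 0 means the two rankings are unrelated.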

[37] MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

Naifan Zhang,Ruihan Sun,Jinwei Su,Hengjie Yang,Zhengyuan Pan,Zhaohan Chen,Xiaofan Zhang

Main category: cs.CL

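TL;DR: This paper proposes MAPO, an efficient critic-free reinforcement learning algorithm that uses dense process feedback from a judge model and Monte Carlo returns to propagate long-horizon effects. A mixed advantage estimator that combines turn-level and batch-level normalization gives stable, fine-grained, and scalable credit assignment, and improves performance and training stability on several subjective multi-turn dialogue benchmarks.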

Details

Motivation: Subjective multi-turn dialogue tasks such as emotional support need conversational policies that adapt to evolving user states and optimize long-horizon interaction quality, but existing reinforcement learning methods lack reliable intermediate process supervision: outcome-only rewards collapse credit assignment across turns, while naive turn-level sampling incurs prohibitive interaction costs. Method: Proposes MAPO: (1) it is critic-free and uses dense process feedback directly from a judge model; (2) it propagates long-horizon effects through Monte Carlo returns; (3) it uses a mixed advantage estimator that combines turn-level and batch-level normalization to stabilize optimization. Result: Across several subjective dialogue benchmarks (EMPA, EmoBench, EQ-Bench) and model scales from 7B to 32B, MAPO outperforms outcome-only GRPO and single-level normalization baselines; on EMPA it improves rates by up to 9 points and dialogue scores by up to +43.2 over the 7B base model; although trained only on EMPA-style environments, it generalizes to unseen benchmarks (up to +4 on EmoBench and +3.5 on EQ-Bench). Conclusion: Dense process supervision combined with mixed-level normalization enables effective and scalable reinforcement learning for subjective, open-ended multi-turn dialogue. Abstract: Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench.
Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
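To make the idea of a mixed advantage estimator more tangible, here is a minimal sketch in plain Python. The abstract does not give MAPO's formulas, so everything specific here is an assumption: the convex mixing weight `alpha`, the discount `gamma`, the choice to normalise turn-level returns across dialogues at the same turn index, and the hypothetical `judge_scores`. It should be read as one plausible instance of combining turn-level and batch-level normalization over Monte Carlo returns, not as the authors' estimator.

```python
from statistics import mean, pstdev

def mc_returns(rewards, gamma=1.0):
    """Discounted Monte Carlo return for every turn of one dialogue."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]

def zscore(values, eps=1e-8):
    """Standardise values; eps guards against zero variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def mixed_advantages(batch_rewards, alpha=0.5, gamma=1.0):
    """Blend turn-level and batch-level normalised returns (assumed convex mix)."""
    returns = [mc_returns(r, gamma) for r in batch_rewards]
    # Batch-level: normalise over every (dialogue, turn) pair in the batch.
    flat_norm = iter(zscore([g for traj in returns for g in traj]))
    batch_norm = [[next(flat_norm) for _ in traj] for traj in returns]
    # Turn-level: normalise over the dialogues that reached the same turn index.
    turn_norm = [[0.0] * len(traj) for traj in returns]
    for t in range(max(len(traj) for traj in returns)):
        members = [i for i, traj in enumerate(returns) if len(traj) > t]
        scores = zscore([returns[j][t] for j in members])
        for i, z in zip(members, scores):
            turn_norm[i][t] = z
    return [
        [alpha * tn + (1 - alpha) * bn for tn, bn in zip(t_row, b_row)]
        for t_row, b_row in zip(turn_norm, batch_norm)
    ]

# Hypothetical per-turn judge scores for three dialogues of different lengths.
judge_scores = [[0.2, 0.5, 0.9], [0.1, 0.1], [0.6, 0.4, 0.3]]
advantages = mixed_advantages(judge_scores)
```

Setting `alpha=1.0` or `alpha=0.0` recovers the two single-level normalization schemes, which makes it easy to compare the mix against either one in this toy setting.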

[38] Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models

Luis de-Marcos,Manuel Goyanes,Adrián Domínguez-Díaz

Main category: cs.CL

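TL;DR: This paper proposes the AI-CROWD protocol, which uses the collective outputs of multiple large language models (LLMs) to produce consensus labels that approximate ground truth in large-scale content analysis, easing the time, cost, and consistency problems of human annotation.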

Details

Motivation: Large-scale content analysis is often limited by the absence of observable ground-truth labels, and building gold-standard labels by hand is costly, slow, and inconsistent. Method: Proposes the AI-CROWD protocol: it aggregates the outputs of an ensemble of LLMs by majority voting and uses diagnostic metrics to analyze agreement and disagreement between models, identifying high-confidence classifications and flagging potential ambiguity or model-specific bias. Result: AI-CROWD produces a reliable consensus-based approximation of the labels, identifies high-confidence classifications, and exposes annotation uncertainty and model bias. Conclusion: AI-CROWD offers a scalable, interpretable, and robust approach to automated annotation for large-scale content analysis when no ground truth is available. Abstract: Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time, cost, and consistency challenges. To overcome this barrier, we introduce the AI-CROWD protocol, which approximates ground truth by leveraging the collective outputs of an ensemble of large language models (LLMs). Rather than asserting that the resulting labels are true ground truth, the protocol generates a consensus-based approximation derived from convergent and divergent inferences across multiple models. By aggregating outputs via majority voting and interrogating agreement/disagreement patterns with diagnostic metrics, AI-CROWD identifies high-confidence classifications while flagging potential ambiguity or model-specific biases.
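The aggregation step can be sketched in a few lines of Python. The abstract names majority voting but does not specify the diagnostic metrics, so the vote share used here as an agreement score, the `AGREEMENT_THRESHOLD` cut-off, and the example codings are all assumptions made for illustration.

```python
from collections import Counter

def consensus(labels):
    """Return (majority label, or None on a tie, and the top label's vote share)."""
    counts = Counter(labels)
    top_votes = max(counts.values())
    winners = [label for label, votes in counts.items() if votes == top_votes]
    label = winners[0] if len(winners) == 1 else None
    return label, top_votes / len(labels)

# Hypothetical codings of two items by eleven models.
codings = {
    "item-1": ["positive"] * 9 + ["negative"] * 2,
    "item-2": ["positive"] * 6 + ["negative"] * 5,
}
AGREEMENT_THRESHOLD = 0.7  # assumed cut-off for a high-confidence label

results = {}
for item, labels in codings.items():
    label, share = consensus(labels)
    # Store the consensus label, its vote share, and the confidence flag.
    results[item] = (label, share, share >= AGREEMENT_THRESHOLD)
```

In this toy run the first item would be kept as a high-confidence label, while the second, with a 6-to-5 split, would be flagged for closer inspection rather than treated as settled.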

[39] LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Koki Itai,Shunichi Hasegawa,Yuta Yamamoto,Gouki Minegishi,Masaki Otsuki

Main category: cs.CL

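TL;DR: This paper introduces LIT-RAGBench, a comprehensive benchmark for the Generator in RAG that covers five capability categories: Integration, Reasoning, Logic, Table, and Abstention. It uses fictional scenarios so that answers must be grounded in the provided external documents, and it provides Japanese and English datasets together with an LLM-as-a-Judge evaluation scheme.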

Details

Motivation: Existing benchmarks for RAG Generators cover a limited range of capabilities, and none evaluates multiple capabilities (such as long-context integration, multi-step reasoning, table interpretation, and abstention when evidence is missing) under unified conditions, so they fall short of practical deployment needs. Method: Builds a new benchmark, LIT-RAGBench, with five categories (Integration, Reasoning, Logic, Table, Abstention), each divided into practical evaluation aspects; uses fictional entities and scenarios so that answers must be grounded in the given documents; constructs 114 human-written Japanese questions and an English version produced by machine translation with human curation; scores with LLM-as-a-Judge. Result: Across API-based and open-weight models, no model exceeds 90% overall accuracy; models differ markedly across categories, making their strengths and weaknesses clearly identifiable. Conclusion: LIT-RAGBench fills the gap in joint multi-capability evaluation of RAG Generators, provides a quantifiable, fine-grained tool for model selection in practical RAG systems and for building RAG-specialized models, and is released with its dataset and code. Abstract: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models.
We release LIT-RAGBench, including the dataset and evaluation code, at https://github.com/Koki-Itai/LIT-RAGBench.
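Category-wise and overall accuracy, as reported for this benchmark, amount to a simple aggregation over judge verdicts. The sketch below is not taken from the released evaluation code: the `(category, is_correct)` record format is an assumption, each question is assumed to carry a single category label even though the benchmark also covers cross-category patterns, and the verdicts are made up.

```python
from collections import defaultdict

def accuracy_report(judgements):
    """Aggregate (category, is_correct) verdicts into per-category and overall accuracy."""
    per_category = defaultdict(list)
    for category, is_correct in judgements:
        per_category[category].append(is_correct)
    report = {cat: sum(flags) / len(flags) for cat, flags in per_category.items()}
    report["overall"] = sum(ok for _, ok in judgements) / len(judgements)
    return report

# Hypothetical judge verdicts for one model.
verdicts = [
    ("Integration", True), ("Integration", False),
    ("Table", True), ("Table", True),
    ("Abstention", False),
]
report = accuracy_report(verdicts)
print(report)
```

Reporting both levels matters because a model can reach a respectable overall score while failing an entire category such as Abstention.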

[40] FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

Qihang Fan,Huaibo Huang,Zhiying Wu,Juqiu Wang,Bingning Wang,Ran He

Main category: cs.CL

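TL;DR: This paper proposes FlashPrefill, a framework that achieves ultra-fast prefilling through instantaneous pattern discovery and dynamic thresholding, substantially improving efficiency in long-context modeling while remaining effective on both long and short sequences.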

Details

Motivation: Long-context modeling is essential for large language models, but the quadratic complexity of attention, especially in the prefill phase, is a critical bottleneck; existing sparse attention methods typically suffer from either high search latency or insufficient sparsity. Method: Proposes FlashPrefill, which uses a fast block-searching technique to locate dynamic vertical, slash, and block-sparse attention patterns simultaneously, and introduces a dynamic thresholding mechanism that needs no sorting or accumulation of attention scores, eliminating the long-tail distribution and increasing sparsity. Result: Achieves a 27.78x speedup on 256K sequences and still maintains a 1.71x speedup at a 4K context length, where existing methods lose efficiency. Conclusion: FlashPrefill greatly reduces prefill computation while preserving accuracy, combining efficiency, robustness, and practicality across sequence scales for long-context LLM deployment. Abstract: Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.

[41] SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

Yunlong Chu,Minglai Shao,Yuhang Liu,Bing Hao,Yumeng Lin,Jialu Wang,Ruijie Wang

Main category: cs.CL

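TL;DR: SPOT is a new latent reasoning framework that compresses chain-of-thought (CoT) while improving interpretability and reasoning performance, using span-level semantic alignment and a frozen-head decoding constraint.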

Details

Motivation: Existing latent reasoning methods have two shortcomings: they rely on rigid point-to-point alignment, which struggles to capture the dense semantics of variable-length reasoning segments, and they lack interpretability, because latent states are hard to decode or audit under the pretrained language model head. Method: Proposes the SPOT framework, whose core components are: (1) Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment; (2) a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, supporting keyword interpretations. Result: On reasoning benchmarks, SPOT improves accuracy by 2.3 points on average, reduces generated tokens by 37.5%, and provides faithful, readable semantic interpretations of the latent reasoning. Conclusion: SPOT lowers inference cost while preserving performance and interpretability, offering a new approach to efficient and transparent latent reasoning. Abstract: Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts.
Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
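For readers unfamiliar with Sinkhorn optimal transport, the following pure-Python sketch shows the standard entropic-regularised Sinkhorn iteration that produces a soft matching between two small sets. It is not SPOT's training objective: the uniform marginals, the regularisation strength `epsilon`, and the hand-written cost matrix (standing in for distances between two hypothetical pause tokens and four reasoning-segment tokens) are all assumptions for illustration.

```python
import math

def sinkhorn(cost, epsilon=0.5, n_iters=200):
    """Entropic optimal transport between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each row item (e.g. pause tokens)
    b = [1.0 / m] * m  # mass on each column item (e.g. segment tokens)
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical costs: row 0 is close to columns 0-1, row 1 to columns 2-3.
cost = [
    [0.1, 0.2, 0.9, 0.8],
    [0.9, 0.7, 0.2, 0.1],
]
plan = sinkhorn(cost)
for row in plan:
    print([round(p, 3) for p in row])
```

Each row of the resulting plan spreads its mass over several columns rather than picking a single one, which is the sense in which the matching is soft and can cover a whole span instead of one end-of-step position.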

[42] Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion

Hari Shankar,Vedanta S P,Sriharini Margapuri,Debjani Mazumder,Ponnurangam Kumaraguru,Abhijnan Chakraborty

Main category: cs.CL

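TL;DR: This paper presents a multilingual audit of the religious and cultural alignment of several mainstream large language models across India, East Asia, and Southeast Asia. The models broadly match public attitudes on general social issues but deviate markedly on religious viewpoints, especially those of minority groups, and tend to reinforce negative stereotypes; lightweight interventions help only partly, downstream bias evaluations show persistent harms, and the authors call for systematic, regionally grounded alignment audits.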

Details

Motivation: LLMs rely on predominantly English-centric training data and may therefore be misaligned with the diverse cultural values of the places where they are deployed, with a particular risk of cultural mismatch in sensitive domains such as religion. Method: Analyzes the models' internal representations using log-probs/logits and compares their opinion distributions on religious questions with real public attitudes in each region; additionally tests lightweight interventions such as demographic priming and native-language prompting, and evaluates downstream effects on several regional bias benchmarks (CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ). Result: Mainstream LLMs consistently deviate from public attitudes on religious viewpoints, especially those of minority religious groups, often amplifying negative stereotypes; lightweight interventions only partially mitigate the problem; downstream bias evaluations confirm persistent harms and under-representation in sensitive contexts. Conclusion: A systematic, regionally grounded cultural-alignment audit framework is needed to ensure fair and reliable deployment of LLMs across the world's diverse cultural settings. Abstract: Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies. In this paper, we present a comprehensive, multilingual audit of the cultural alignment of contemporary LLMs including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 across India, East Asia and Southeast Asia. Our study specifically focuses on the sensitive domain of religion as the prism for broader alignment. To facilitate this, we conduct a multi-faceted analysis of every LLM's internal representations, using log-probs/logits, to compare the model's opinion distributions against ground-truth public attitudes. We find that while the popular models generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially those of minority groups, often amplifying negative stereotypes. Lightweight interventions, such as demographic priming and native language prompting, partially mitigate but do not eliminate these cultural gaps. We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts. Our findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs.

[43] The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks

Anca Dinu,Andreiana Mihail,Andra-Maria Florescu,Claudiu Creanga

Main category: cs.CL

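TL;DR: This study examines the artificial visual creativity of ChatGPT when it generates new images that pastiche original artworks (paintings, drawings, sculptures, and installations). Combining artists' evaluations with computational analysis, it finds that current AI does reasonably well on color and texture similarity but falls well short on composition, concept, and perception, and it therefore proposes a multi-metric "style transfer dashboard" for evaluating AI pastiches.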

Details

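Motivation: Investigate the ability and limits of large language models such as ChatGPT to produce intentional stylistic pastiches in the visual arts, in particular whether their re-creations of contemporary artworks have genuine artistic value. Method: Twelve artists from five European countries each contributed three original artworks and graded and commented on the ChatGPT-generated pastiches; in parallel, computational methods quantified the similarity between originals and AI-generated images in color, texture, composition, style, and other dimensions. Result: The AI-generated images resemble the originals to some degree in color and texture but differ substantially in composition, conceptual expression, and perception; the artists generally found the AI pastiches lacking in dimensionality, context, and intentional sense, closer to an approximate quotation than to an emotion-evoking artistic re-creation. Conclusion: A single style metric is not enough to evaluate the quality of AI pastiches, and a "style transfer dashboard" of complementary metrics should be used; ChatGPT still struggles to achieve genuinely artistic translation and re-creation in visual creative tasks. Abstract: This study explores artificial visual creativity, focusing on ChatGPT's ability to generate new images intentionally pastiching original artworks such as paintings, drawings, sculptures and installations. The process involved twelve artists from Romania, Bulgaria, France, Austria, and the United Kingdom, each invited to contribute three of their artworks and to grade and comment on the AI-generated versions. The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions. The results point to a significant gap between color- and texture-based similarity on the one hand and compositional, conceptual, and perceptual similarity on the other. Consequently, we advocate for the use of a "style transfer dashboard" of complementary metrics to evaluate the similarity between pastiches and originals, rather than using a single style metric. The artists' comments revealed limitations of ChatGPT's pastiches after contemporary artworks, which were perceived by the authors of the originals as lacking dimensionality, context, and intentional sense, and seeming more of a paraphrase or an approximate quotation rather than a valuable, emotion-evoking artwork.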

[44] Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI

Tanjim Taharat Aurpa

Main category: cs.CL

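TL;DR: This paper formulates the Mathematical Entity Relation Extraction (MERE) task and uses Transformer models such as BERT to extract operands (entities) and operators (relations) automatically from mathematical text, reaching 99.39% accuracy, and applies SHAP for explainability analysis to improve model transparency and trust.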

Details

Motivation: Understanding mathematical text is challenging because it contains specialized entities and complex relationships, so explainable, high-accuracy automated methods are needed to support intelligent education and knowledge graph construction. Method: Formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task with operands as entities and operators as relations; applies Transformer models such as BERT for relation extraction and uses SHAP for explainability analysis. Result: BERT reaches 99.39% accuracy on the MERE task; the SHAP analysis shows how specific textual and mathematical features influence relation prediction, supporting the model's interpretability and reliability. Conclusion: By combining a task-specific dataset, Transformer modeling, and XAI, the work provides an effective and interpretable MERE framework that supports automated problem solving, knowledge graph construction, and intelligent educational systems. Abstract: Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task, where operands are treated as entities and operators as their relationships. Transformer-based models are applied to automatically extract these relations from mathematical text, with Bidirectional Encoder Representations from Transformers (BERT) achieving the best performance, reaching an accuracy of 99.39%. To enhance transparency and trust in the model's predictions, Explainable Artificial Intelligence (XAI) is incorporated using Shapley Additive Explanations (SHAP). The explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior. By combining transformer-based learning, a task-specific dataset, and explainable modeling, this work offers an effective and interpretable framework for MERE, supporting future applications in automated problem solving, knowledge graph construction, and intelligent educational systems.

[45] Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task

Hirohiko Abe,Kentaro Ozeki,Risako Ando,Takanobu Morishita,Koji Mineshima,Mitsuhiro Okada

Main category: cs.CL

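TL;DR: This study builds a new Wason Selection Task dataset that encodes deontic modality and systematically examines how large language models (LLMs) perform deontic conditional reasoning. The models reason better with deontic rules than with descriptive rules, and their errors resemble human matching bias rather than confirmation bias.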

Details

Motivation: Although prior studies have compared LLM and human reasoning, the domain specificity of LLM reasoning, in particular normative (deontic) versus purely formal contexts, remains underexplored. Method: Builds a new Wason Selection Task dataset that explicitly encodes deontic modality to distinguish deontic from descriptive conditionals; uses it to evaluate LLMs' conditional reasoning under deontic rules; analyzes whether error patterns fit confirmation bias or matching bias better. Result: LLMs reason better with deontic rules than with descriptive rules; their errors show the signature of matching bias (ignoring negation and preferring lexical matches) rather than confirmation bias; overall, their behaviour parallels classic human biases in this paradigm. Conclusion: LLM reasoning performance varies systematically with rule type and its error patterns can parallel well-known human cognitive biases, which suggests a degree of cognitive realism rather than reliance on statistical pattern matching alone. Abstract: As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often performs well in domain-specific settings, particularly in normative rather than purely formal contexts. Although prior studies have compared LLM and human reasoning, the domain specificity of LLM reasoning remains underexplored. In this study, we introduce a new Wason Selection Task dataset that explicitly encodes deontic modality to systematically distinguish deontic from descriptive conditionals, and use it to examine LLMs' conditional reasoning under deontic rules. We further analyze whether observed error patterns are better explained by confirmation bias (a tendency to seek rule-supporting evidence) or by matching bias (a tendency to ignore negation and select items that lexically match elements of the rule). Results show that, like humans, LLMs reason better with deontic rules and display matching-bias-like errors. Together, these findings suggest that the performance of LLMs varies systematically across rule types and that their error patterns can parallel well-known human biases in this paradigm.
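The difference between the two bias accounts becomes clear once negations are added to the rule. The sketch below encodes the standard textbook analysis of the four-card selection task, not the paper's dataset format: cards are named by the literal they show, and the function name and card labels are assumptions chosen for illustration.

```python
def selection_predictions(antecedent_negated=False, consequent_negated=False):
    """Card choices predicted for a rule 'if [not] P then [not] Q'.

    Cards are named by the literal they display: "P", "not-P", "Q", "not-Q".
    """
    def literal(atom, negated):
        return f"not-{atom}" if negated else atom

    true_antecedent = literal("P", antecedent_negated)
    true_consequent = literal("Q", consequent_negated)
    false_consequent = literal("Q", not consequent_negated)
    return {
        # Logically correct: the cards that could falsify the conditional.
        "correct": {true_antecedent, false_consequent},
        # Confirmation bias: cards that could support the rule.
        "confirmation": {true_antecedent, true_consequent},
        # Matching bias: cards naming the rule's atoms, ignoring negation.
        "matching": {"P", "Q"},
    }

affirmative = selection_predictions()                        # if P then Q
negated = selection_predictions(consequent_negated=True)     # if P then not Q
print(affirmative)
print(negated)
```

For the plain rule "if P then Q" the two biases predict the same wrong choice, so only rules with negated components, such as "if P then not Q", can tell them apart.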

[46] From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Minh Hoang Nguyen,Vu Hoang Pham,Xuan Thanh Huynh,Phuc Hong Mai,Vinh The Nguyen,Quang Nhut Huynh,Huy Tien Nguyen,Tung Le

Main category: cs.CL

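TL;DR: This paper presents the first unified empirical comparison of four mainstream LLM-based automated essay scoring (AES) approaches on IELTS Writing Task 2. The combination of k-SFT and RAG performs best (F1 of 93%), and the study exposes the trade-offs between accuracy, cost, and robustness across methods.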

Details

Motivation: Prior work mostly examines individual techniques in isolation and lacks a systematic comparison of the relative merits of LLM-based AES methods for English as a Second Language (L2) writing. Method: On a unified IELTS Writing Task 2 benchmark, compares four LLM-based AES paradigms: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning with Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning (SFT) combined with Direct Preference Optimization (DPO) and RAG. Result: The methods show clear accuracy-cost-robustness trade-offs; the best configuration, k-SFT with RAG, reaches an F1-score of 93%. Conclusion: The study provides the first unified empirical comparison of modern LLM-based AES strategies for English L2 writing and supports their potential for automated grading. Abstract: Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task 2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods; the best configuration, integrating k-SFT and RAG, achieves the strongest overall results with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2, showing promising potential in auto-grading writing tasks. Code is public at https://github.com/MinhNguyenDS/LLM_AES-EnL2

[47] Abductive Reasoning with Syllogistic Forms in Large Language Models

Hirohiko Abe,Risako Ando,Takanobu Morishita,Kentaro Ozeki,Koji Mineshima,Mitsuhiro Okada

Main category: cs.CL

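TL;DR: This paper examines how large language models (LLMs) perform abductive reasoning by converting a syllogistic dataset into a form suitable for abduction, tests whether they show human-like biases, and stresses the importance of contextualized reasoning beyond formal deduction.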

Details

Motivation: Criticizing LLMs for human-like cognitive biases may be unfair, because human reasoning involves not only deduction but also abduction from limited information; the abilities and biases of LLMs in abductive reasoning need systematic evaluation. Method: Converts a standard syllogistic dataset into an abductive reasoning dataset and uses it to test and analyze state-of-the-art LLMs. Result: LLMs do show specific biases in abductive reasoning, and their performance depends markedly on their ability to model context. Conclusion: LLMs have not yet reached human-like abductive reasoning, and stronger modeling of contextualized, informal reasoning is needed to bridge the gap between machine and human cognition. Abstract: Research in AI using Large Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern. Prior studies have indicated that LLMs and humans share similar biases, such as dismissing logically valid inferences that contradict common beliefs. However, criticizing LLMs for these biases might be unfair, considering our reasoning not only involves formal deduction but also abduction, which draws tentative conclusions from limited information. Abduction can be regarded as the inverse form of syllogism in its basic structure, that is, a process of drawing a minor premise from a major premise and conclusion. This paper explores the accuracy of LLMs in abductive reasoning by converting a syllogistic dataset into one suitable for abduction. It aims to investigate whether the state-of-the-art LLMs exhibit biases in abduction and to identify potential areas for improvement, emphasizing the importance of contextualized reasoning beyond formal deduction. This investigation is vital for advancing the understanding and application of LLMs in complex reasoning tasks, offering insights into bridging the gap between machine and human cognition.

[48] PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations

Vittoria Vineis,Matteo Silvestri,Lorenzo Antonelli,Filippo Betello,Gabriele Tolomei

Main category: cs.CL

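TL;DR: This paper proposes the PONTE framework, which uses a human-in-the-loop, closed-loop validation and adaptation mechanism to produce personalized, trustworthy natural-language XAI explanations, substantially improving their completeness, stylistic consistency, and trustworthiness.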

Details

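Motivation: Most existing XAI methods follow a one-size-fits-all paradigm that ignores user differences in expertise, goals, and cognitive needs, while relying on large language models to generate explanations raises faithfulness and hallucination problems. Method: The PONTE framework has three parts: (i) a low-dimensional preference model that captures the user's stylistic requirements; (ii) a preference-conditioned generator grounded in structured XAI artifacts; and (iii) verification modules that enforce numerical faithfulness, informational completeness, and stylistic alignment, optionally supported by retrieval-grounded argumentation. User feedback iteratively updates the preference state. Result: Automatic and human evaluations show that the verification-refinement loop substantially outperforms validation-free generation, improving completeness and stylistic alignment in the healthcare and finance domains; human studies confirm strong agreement between preference vectors and perceived style, robustness to generation stochasticity, and positive quality assessments. Conclusion: PONTE models personalization as a closed-loop validation and adaptation process rather than simple prompt engineering, balancing trustworthiness and personalization in XAI and offering a new paradigm for human-in-the-loop explainable AI. Abstract: Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a one-size-fits-all paradigm that neglects user differences in expertise, goals, and cognitive needs. Although Large Language Models can translate technical explanations into natural language, they introduce challenges related to faithfulness and hallucinations. To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives. PONTE models personalization as a closed-loop validation and adaptation process rather than prompt engineering. It combines: (i) a low-dimensional preference model capturing stylistic requirements; (ii) a preference-conditioned generator grounded in structured XAI artifacts; and (iii) verification modules enforcing numerical faithfulness, informational completeness, and stylistic alignment, optionally supported by retrieval-grounded argumentation. User feedback iteratively updates the preference state, enabling quick personalization. Automatic and human evaluations across healthcare and finance domains show that the verification-refinement loop substantially improves completeness and stylistic alignment over validation-free generation. Human studies further confirm strong agreement between intended preference vectors and perceived style, robustness to generation stochasticity, and consistently positive quality assessments.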

[49] Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

Anmol Gulati,Sahil Sen,Waqar Sarguroh,Kevin Paul

Main category: cs.CL

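TL;DR: This paper proposes Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for enterprise spreadsheet understanding that replaces single-pass retrieval with iterative tool calling, improving multi-step reasoning over complex spreadsheets and reaching SOTA performance on several benchmarks.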

Details

Motivation: When applied to enterprise-scale spreadsheets, existing multimodal RAG methods lose context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through full-context injection, so they cannot support reliable multi-step reasoning. Method: The BRTR framework uses an iterative tool-calling loop built on multimodal embeddings (such as NeMo Retriever 1B), combining a planner, a retrieval module, and a multi-step reasoning module to support end-to-end Excel analysis and structured editing; the paper also reports expert evaluation, model comparisons, and ablation experiments. Result: BRTR surpasses prior methods by 25, 7, and 32 percentage points on the FRTR-Bench, SpreadsheetLLM, and FINCH benchmarks respectively; NeMo Retriever 1B is identified as the best multimodal embedding model and GPT-5.2 as the LLM with the best efficiency-accuracy trade-off; all results come with fully auditable tool-call traces. Conclusion: Through its iterative multimodal agentic design, BRTR substantially improves understanding of and reasoning over complex spreadsheets, combining high performance, interpretability, and practicality, and offers a new paradigm for deploying LLMs in office settings. Abstract: Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross-sheet dependencies, and embedded visual artifacts. However, state-of-the-art approaches exclude critical context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full-context injection, preventing reliable multi-step reasoning over complex enterprise workbooks. We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. We evaluate five multimodal embedding models, identifying NVIDIA NeMo Retriever 1B as the top performer for mixed tabular and visual data, and vary nine LLMs. Ablation experiments confirm that the planner, retrieval, and iterative reasoning each contribute substantially, and cost analysis shows GPT-5.2 achieves the best efficiency-accuracy trade-off. Throughout all evaluations, BRTR maintains full auditability through explicit tool-call traces.

[50] Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

Yuchen Zhang,Haralambos Mouratidis,Ravi Shekhar

Main category: cs.CL

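TL;DR: This paper proposes a context-aware multilingual automatic speech recognition (ASR) framework that connects a frozen speech encoder to a decoder-only language model through a lightweight projection module and uses contrastive learning to align speech and context representations. It is validated on 11 languages and 5 English dialects, with an overall performance gain of over 5%.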

Details

Motivation: Most existing ASR systems are limited to monolingual settings and short utterances, and they lack both multilingual support and a principled alignment between speech and contextual representations. Method: A frozen speech encoder is combined with a decoder-only language model through a lightweight projection module that supports structured context prompts (such as dialogue history and biasing words), and a contrastive learning objective aligns speech and context representations in a shared embedding space. Result: On over 1,500 hours of real-world conversational speech covering 11 languages and 5 English dialects, contextual input consistently improves recognition quality; contrastive alignment brings additional gains, for an overall performance gain of over 5%. Conclusion: Contextual modeling and cross-modal alignment both matter for multilingual ASR, and the proposed framework combines multilingual support, accent adaptability, and the modularity of pretrained models. Abstract: Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains when applied to different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
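
The abstract does not say which contrastive objective is used. As a rough illustration of aligning paired speech and context embeddings in a shared space, the sketch below uses an InfoNCE-style loss over a batch; the choice of InfoNCE, the temperature value, and all function names are assumptions made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against the zero vector
    return [x / norm for x in v]


def info_nce(speech, context, temperature=0.07):
    """Mean cross-entropy of matching speech[i] to context[i], with every
    other context embedding in the batch acting as a negative."""
    s = [l2_normalize(v) for v in speech]
    c = [l2_normalize(v) for v in context]
    total = 0.0
    for i, si in enumerate(s):
        logits = [dot(si, cj) / temperature for cj in c]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(s)


# Correctly paired embeddings give a much lower loss than swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

A symmetric variant would average this loss with the context-to-speech direction; whether the paper does so is not stated.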

[51] KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection

Archie Sage,Salvatore Greco

Main category: cs.CL

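TL;DR: This paper describes the KCLarity team's participation in the SemEval 2026 CLARITY shared task. It compares two modelling formulations (predicting the clarity label directly vs. predicting the evasion label and then deriving clarity) and their variants, and evaluates models such as RoBERTa-large and zero-shot GPT-5.2 on the public and hidden test sets.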

Details

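Motivation: To take part in the CLARITY shared task, systematically study modelling strategies for classifying ambiguity and evasion techniques in political discourse, and look for more robust methods that generalize better. Method: Two modelling formulations are proposed: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity from the task's taxonomy hierarchy. Several auxiliary training variants are explored, and decoder-only models are evaluated zero-shot under the evasion-first formulation. Result: The two formulations perform comparably; RoBERTa-large performs best on the public test set, while zero-shot GPT-5.2 generalizes better on the hidden evaluation set. Conclusion: The choice of modelling path affects generalization rather than absolute performance; zero-shot large models show more promise on unseen data, while fine-tuned encoder models are more stable on the known distribution. Abstract: This paper describes the KCLarity team's participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in political discourse. We investigate two modelling formulations: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity through the task taxonomy hierarchy. We further explore several auxiliary training variants and evaluate decoder-only models in a zero-shot setting under the evasion-first formulation. Overall, the two formulations yield comparable performance. Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.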
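
Formulation (ii) amounts to a lookup from a fine-grained evasion label to its coarse clarity parent. The sketch below shows that derivation step only; the label names and the mapping are hypothetical placeholders, since the abstract does not list the task's actual taxonomy.

```python
# Hypothetical two-level taxonomy: each evasion label has exactly one
# clarity parent. The real CLARITY labels may differ.
EVASION_TO_CLARITY = {
    "explicit_answer": "clear_reply",
    "partial_answer": "ambivalent",
    "dodging": "ambivalent",
    "declining_to_answer": "clear_non_reply",
}


def derive_clarity(evasion_label: str) -> str:
    """Map a predicted evasion label to its clarity label via the hierarchy."""
    try:
        return EVASION_TO_CLARITY[evasion_label]
    except KeyError:
        raise ValueError(f"unknown evasion label: {evasion_label!r}") from None


predicted_evasion = ["dodging", "explicit_answer", "declining_to_answer"]
derived = [derive_clarity(label) for label in predicted_evasion]
```

Because the mapping is many-to-one, an evasion classifier can confuse two sibling labels and still yield the correct clarity label, which is one plausible reason the two formulations end up with comparable scores.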

cs.CV

[52] Edges Are All You Need: Robust Gait Recognition via Label-Free Structure

Chao Zhang,Zhuang Zheng,Ruixin Li,Zhanyong Mei

Main category: cs.CV

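TL;DR: This paper proposes SKETCH, a new label-free, dense structural representation for gait recognition, and designs the multi-modal framework SKETCHGAIT, which achieves strong results on the SUSTech1K and CCPG datasets.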

Details

Motivation: Existing gait recognition methods rely mainly on silhouette or human-parsing representations. Silhouettes are sparse and lack internal structural information; parsing adds part-level structure but depends heavily on the accuracy and granularity of the upstream parser, leading to unstable cross-dataset performance that is sometimes worse than silhouettes. Revisiting the representation design space from a structural perspective, the authors find that the paradigm of dense part-level structure without explicit semantic labels has not been fully explored. Method: SKETCH is proposed as a new visual modality: edge detectors extract high-frequency structural cues (such as limb articulations and self-occlusion contours) directly from RGB images without labels. The SKETCHGAIT framework has two independent modality-specific learning streams (a parsing stream and a sketch stream) and a lightweight early-stage fusion branch that models their structural complementarity and semantic decoupling. Result: Rank-1 accuracy reaches 92.9% on SUSTech1K and mean Rank-1 reaches 93.1% on CCPG, validating the effectiveness and generalization of the SKETCH modality and the SKETCHGAIT framework. Conclusion: SKETCH provides a stable, robust, annotation-free dense structural representation; SKETCHGAIT achieves multi-modal cooperation through structural complementarity and semantic decoupling, offering a new idea and a practical framework for gait recognition. Abstract: Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.

[53] Thinking with Spatial Code for Physical-World Video Reasoning

Jieneng Chen,Wenxin Ma,Ruisheng Yuan,Yunzhi Zhang,Jiajun Wu,Alan Yuille

Main category: cs.CV

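TL;DR: This paper proposes the Thinking with Spatial Code framework, which converts RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. A spatial encoder produces spatial code with 3D oriented bounding boxes and semantic labels, and a large language model is fine-tuned with reinforcement learning, yielding substantial performance gains.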

Details

Motivation: Existing vision-language models lack the ability to understand and reason over explicit 3D spatial structure in physical-world visual question answering, which makes geometrically grounded reasoning difficult. Method: A spatial encoder unifies 6D object parsing and tracking backbones with geometric prediction; videos are parsed into spatial code containing 3D oriented bounding boxes and semantic labels; a large language model is then fine-tuned with reinforcement learning using a reward based on a spatial rubric. Result: The model outperforms existing proprietary vision-language models on the VSI-Bench benchmark, setting a new SOTA. Conclusion: Explicitly modeling and reasoning over 3D spatial structure substantially strengthens the geometric understanding and reasoning of large language models in physical-world visual question answering, validating the Thinking with Spatial Code paradigm. Abstract: We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose the spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further fine-tune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at https://github.com/Beckschen/spatialcode.

[54] From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications

Xusheng Luo,Changliu Liu

Main category: cs.CV

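TL;DR: This paper proposes the first coupled robustness verification framework for heatmap-based keypoint detectors. It models joint deviation constraints with a mixed-integer linear program (MILP) to verify the collective behavior of all keypoints, with theoretical guarantees and higher verified rates.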

Details

Motivation: Keypoint detectors are vulnerable to small perturbations, but their formal robustness verification has long gone unexplored because of high-dimensional inputs and continuous coordinate outputs. Method: A coupled robustness verification framework formulates verification as a counterexample search posed as a mixed-integer linear program (MILP) that combines joint deviation constraints (encoded as a polytope) with reachable heatmap sets. Result: Experiments show that the method keeps high verified rates even under strict error thresholds and clearly outperforms decoupled methods that verify each keypoint independently; MILP infeasibility certifies robustness, feasibility yields a counterexample, and the method is proven sound. Conclusion: The coupled verification framework provides the first joint robustness guarantees for heatmap-based keypoint detectors, combining theoretical rigor with practical effectiveness. Abstract: Keypoint detection underpins many vision tasks, including pose estimation, viewpoint recovery, and 3D reconstruction, yet modern neural models remain vulnerable to small input perturbations. Despite its importance, formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. We propose the first coupled robustness verification framework for heatmap-based keypoint detectors that bounds the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements. Unlike prior decoupled, classification-style approaches that verify each keypoint independently and yield conservative guarantees, our method verifies collective behavior. We formulate verification as a falsification problem using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility provides counterexamples, and we prove the method is sound: if it certifies the model as robust, then the keypoint detection model is guaranteed to be robust. Experiments show that our coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.

Mohammad Sadil Khan,Muhammad Usama,Rolandos Alexandros Potamias,Didier Stricker,Muhammad Zeshan Afzal,Jiankang Deng,Ismail Elezi

Main category: cs.CV

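TL;DR: DreamCAD is a multi-modal generative framework that produces editable BRep models directly from point-level supervision, without CAD-specific annotations. Together with CADCap-1M, a newly built large-scale CAD captioning dataset, it substantially improves the quality and user preference of text-, image-, and point-cloud-to-CAD generation.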

Details

Motivation: Existing CAD generation methods are limited by small datasets annotated with design histories or BRep labels, while vast numbers of unannotated 3D meshes go unused, which holds back scalable CAD generation. Method: DreamCAD represents a BRep as a set of parametric surfaces (such as Bézier surfaces) and uses a differentiable tessellation method to generate meshes, enabling large-scale training with point-level supervision; it also builds the CADCap-1M dataset with 1M+ descriptions generated by GPT-5. Result: On the ABC and Objaverse benchmarks, DreamCAD reaches SOTA performance across the text, image, and point-cloud modalities, with higher geometric fidelity and over 75% user preference. Conclusion: DreamCAD achieves high-quality, editable BRep generation without CAD annotations and advances large-scale, multi-modal CAD generation research; the accompanying code and dataset will be released. Abstract: Computer-Aided Design (CAD) relies on structured and editable geometric representations, yet existing generative methods are constrained by small annotated datasets with explicit design histories or boundary representation (BRep) labels. Meanwhile, millions of unannotated 3D meshes remain untapped, limiting progress in scalable CAD generation. To address this, we propose DreamCAD, a multi-modal generative framework that directly produces editable BReps from point-level supervision, without CAD-specific annotations. DreamCAD represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes. This enables large-scale training on 3D datasets while reconstructing connected and editable surfaces. Furthermore, we introduce CADCap-1M, the largest CAD captioning dataset to date, with 1M+ descriptions generated using GPT-5 for advancing text-to-CAD research. DreamCAD achieves state-of-the-art performance on ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and surpassing 75% user preference. Code and dataset will be publicly available.
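
To make the parametric-patch representation concrete, here is a minimal sketch that evaluates a tensor-product Bézier patch and tessellates it into triangles on a regular parameter grid. It covers only the forward evaluation: the paper's tessellation is differentiable and its exact scheme is not described in the abstract, so the uniform grid, the function names, and the control-point layout are assumptions.

```python
from math import comb


def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * (t ** i) * ((1 - t) ** (n - i))


def bezier_patch_point(ctrl, u, v):
    """Evaluate a patch at (u, v); ctrl is an (n+1) x (m+1) grid of
    (x, y, z) control points."""
    n, m = len(ctrl) - 1, len(ctrl[0]) - 1
    point = [0.0, 0.0, 0.0]
    for i in range(n + 1):
        bu = bernstein(n, i, u)
        for j in range(m + 1):
            w = bu * bernstein(m, j, v)
            for k in range(3):
                point[k] += w * ctrl[i][j][k]
    return tuple(point)


def tessellate(ctrl, res=4):
    """Sample a (res+1) x (res+1) vertex grid and split each cell into
    two triangles given as vertex-index triples."""
    verts = [bezier_patch_point(ctrl, a / res, b / res)
             for a in range(res + 1) for b in range(res + 1)]
    tris = []
    for a in range(res):
        for b in range(res):
            i0 = a * (res + 1) + b
            i1, i2, i3 = i0 + 1, i0 + res + 1, i0 + res + 2
            tris += [(i0, i1, i2), (i1, i3, i2)]
    return verts, tris


# A flat bilinear patch over the unit square, used as a sanity check.
flat = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
        [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]]
verts, tris = tessellate(flat, res=2)
center = bezier_patch_point(flat, 0.5, 0.5)
```

Each vertex is a polynomial function of the control points, so writing the same evaluation in an autodiff framework would let a point-level loss on the mesh propagate gradients back to the patch parameters.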

[56] Adversarial Batch Representation Augmentation for Batch Correction in High-Content Cellular Screening

Lei Tong,Xujing Yao,Adam Corrigan,Long Chen,Navin Rathna Kumar,Kerry Hallbrook,Jonathan Orme,Yinhai Wang,Huiyu Zhou

Main category: cs.CV

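TL;DR: This paper proposes ABRA, which frames biological batch-effect mitigation as a domain generalization problem. Through adversarial batch representation augmentation it synthesizes worst-case biological batch perturbations in the representation space and adds a distribution alignment objective to prevent representation collapse, substantially improving siRNA perturbation classification.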

Details

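Motivation: Cell painting images from high-content screening carry biological batch effects that cause covariate shift and reduce the generalization of deep learning models on unseen data; existing batch correction methods either rely on additional prior knowledge or struggle to generalize to unseen biological batches. Method: Biological batch mitigation is framed as a Domain Generalization (DG) problem, and Adversarial Batch Representation Augmentation (ABRA) is proposed: 1) feature statistics are parameterized as structured uncertainties to explicitly model batch-wise statistical fluctuations; 2) a min-max optimization framework actively synthesizes worst-case biological batch perturbations in the representation space, with a strict angular geometric margin preserving fine-grained class separability; 3) a synergistic distribution alignment objective prevents representation collapse during adversarial exploration. Result: Extensive evaluation on the large-scale RxRx1 and RxRx1-WILDS benchmarks shows that ABRA reaches a new SOTA on siRNA perturbation classification. Conclusion: ABRA effectively mitigates biological batch effects and improves cross-batch generalization, offering a robust domain generalization approach for high-content screening image analysis that needs no additional prior knowledge. Abstract: High-Content Screening routinely generates massive volumes of cell painting images for phenotypic profiling. However, technical variations across experimental executions inevitably induce biological batch (bio-batch) effects. These cause covariate shifts and degrade the generalization of deep learning models on unseen data. Existing batch correction methods typically rely on additional prior knowledge (e.g., treatment or cell culture information) or struggle to generalize to unseen bio-batches. In this work, we frame bio-batch mitigation as a Domain Generalization (DG) problem and propose Adversarial Batch Representation Augmentation (ABRA). ABRA explicitly models batch-wise statistical fluctuations by parameterizing feature statistics as structured uncertainties. Through a min-max optimization framework, it actively synthesizes worst-case bio-batch perturbations in the representation space, guided by a strict angular geometric margin to preserve fine-grained class discriminability. To prevent representation collapse during this adversarial exploration, we introduce a synergistic distribution alignment objective. Extensive evaluations on the large-scale RxRx1 and RxRx1-WILDS benchmarks demonstrate that ABRA establishes a new state-of-the-art for siRNA perturbation classification.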

[57] Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection

Trung Tien Dong,Dev Thakkar,Arman Sargolzaei,Xiaomin Lin

Main category: cs.CV

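TL;DR: This paper proposes a lightweight Post Fusion Stabilizer (PFS) that improves the robustness of existing BEV fusion detectors under domain shift and sensor failures without modifying the architecture or retraining, substantially improving detection in scenarios such as camera failure and low light.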

Details

Motivation: Existing BEV fusion detectors degrade markedly under domain shift and sensor failures, and existing robustness methods require architectural changes or retraining, which makes them hard to deploy in systems already in service. Method: The Post Fusion Stabilizer (PFS) is a lightweight module that operates on the intermediate BEV features of an existing detector. It improves robustness by stabilizing feature statistics, suppressing degraded spatial regions, and restoring weakened cues through residual correction, and it is designed as a near-identity transformation to preserve original performance. Result: On nuScenes, PFS reaches SOTA in several failure modes: camera dropout robustness improves by +1.2% mAP and low-light performance by +4.4% mAP, with only 3.3M parameters. Conclusion: PFS is a plug-and-play, lightweight, and efficient robustness enhancement that is compatible with existing systems and substantially improves the reliability of multi-sensor fusion detectors in complex real-world scenarios. Abstract: Camera-LiDAR fusion is widely used in autonomous driving to enable accurate 3D object detection. However, bird's-eye view (BEV) fusion detectors can degrade significantly under domain shift and sensor failures, limiting reliability in real-world deployment. Existing robustness approaches often require modifying the fusion architecture or retraining specialized models, making them difficult to integrate into already deployed systems. We propose a Post Fusion Stabilizer (PFS), a lightweight module that operates on intermediate BEV representations of existing detectors and produces a refined feature map for the original detection head. The design stabilizes feature statistics under domain shift, suppresses spatial regions affected by sensor degradation, and adaptively restores weakened cues through residual correction. Designed as a near-identity transformation, PFS preserves performance while improving robustness under diverse camera and LiDAR corruptions. Evaluations on the nuScenes benchmark demonstrate that PFS achieves state-of-the-art results in several failure modes, notably improving camera dropout robustness by +1.2% and low-light performance by +4.4% mAP while maintaining a lightweight footprint of only 3.3M parameters.

[58] Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

Merve Tapli,Quentin Bouniot,Wolfgang Stammer,Zeynep Akata,Emre Akbas

Main category: cs.CV

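TL;DR: This paper proposes the CBM-Suite framework, which introduces an entropy-based concept suitability metric, resolves the linearity problem with a non-linear layer, narrows the accuracy gap with a distillation loss, and systematically analyzes how vision encoders, VLMs, and concept sets affect CBM performance.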

Details

Motivation: Concept Bottleneck Models (CBMs) have fundamental limitations: concept relevance cannot be evaluated in advance, the linearity problem lets models bypass the concept bottleneck, accuracy lags behind black-box models, and the impact of different visual backbones and VLMs has not been studied systematically. Method: The CBM-Suite framework comprises: 1) an entropy-based concept suitability metric; 2) a non-linear layer inserted between concept activations and the classifier to resolve the linearity problem; 3) a distillation loss guided by a linear teacher probe to improve accuracy; 4) a systematic evaluation of how different vision encoders, VLMs, and concept sets affect accuracy and interpretability. Result: CBM-Suite achieves higher accuracy on several benchmarks and provides reproducible insights and guidance for improving concept-based interpretability. Conclusion: CBM-Suite systematically addresses the key challenges of current CBMs, substantially improving the balance between accuracy and interpretability and advancing the empirical study of concept-based modeling. Abstract: Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the "linearity problem" causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework that systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.

[59] Making Reconstruction FID Predictive of Diffusion Generation FID

Tongda Xu,Mingwei He,Shady Abu-Hussein,Jose Miguel Hernandez-Lobato,Haotian Zhang,Kai Zhao,Chao Zhou,Ya-Qin Zhang,Yan Wang

Main category: cs.CV

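TL;DR: This paper proposes a new evaluation metric, iFID, which computes FID on samples decoded from interpolations between nearest neighbors in the latent space. It correlates far more strongly with diffusion generation FID (gFID), with Pearson and Spearman correlations of about 0.85, and the paper explains the theoretical basis for this.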

Details

Motivation: The reconstruction FID (rFID) of a VAE correlates poorly with the generation FID (gFID) of a diffusion model, and there is no reconstruction-based metric that accurately predicts diffusion generation quality. Method: Interpolated FID (iFID) is proposed: for each data point, find its nearest neighbor in the latent space, linearly interpolate the two latents, decode the result, and compute the FID between the decoded samples and the original dataset. The paper also analyzes how rFID and iFID correspond to different phases of the diffusion process (refinement vs. navigation). Result: iFID is the first metric to correlate strongly with diffusion gFID (Pearson ≈ 0.85, Spearman ≈ 0.85); rFID is shown to reflect quality in the diffusion refinement phase, whereas iFID reflects quality in the navigation phase; a theoretical explanation is given in terms of diffusion generalization and hallucination. Conclusion: iFID is a simple, effective, strongly correlated, and interpretable evaluation metric that offers a new approach to reconstruction-based evaluation and deepens the understanding of quality assessment at different phases of diffusion models. Abstract: It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each element in the dataset, we retrieve its nearest neighbor (NN) in the latent space and interpolate their latent representations. We then decode the interpolated latent and compute the FID between the decoded samples and the original dataset. Additionally, we refine the claim that rFID correlates poorly with gFID, by showing that rFID correlates with sample quality in the diffusion refinement phase, whereas iFID correlates with sample quality in the diffusion navigation phase. Furthermore, we provide an explanation for why iFID correlates well with gFID, and why reconstruction metrics are negatively correlated with gFID, by connecting to results on diffusion generalization and hallucination. Empirically, iFID is the first metric to demonstrate a strong correlation with diffusion gFID, achieving Pearson linear and Spearman rank correlations of approximately 0.85. The source code is available at https://github.com/tongdaxu/Making-rFID-Predictive-of-Diffusion-gFID.
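
The latent-space step of iFID (nearest-neighbor retrieval followed by interpolation) can be sketched as below. Decoding and the FID computation are left out. The midpoint coefficient `alpha=0.5`, the Euclidean distance, and the brute-force search are assumptions for illustration; the abstract does not specify them, and the authors' repository is the reference for the actual procedure.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_index(latents, i):
    """Index of the latent closest to latents[i], excluding itself."""
    candidates = (j for j in range(len(latents)) if j != i)
    return min(candidates, key=lambda j: squared_distance(latents[i], latents[j]))


def interpolate_with_nearest(latents, alpha=0.5):
    """Blend each latent with its nearest neighbor:
    (1 - alpha) * z + alpha * z_nn."""
    out = []
    for i, z in enumerate(latents):
        nn = latents[nearest_neighbor_index(latents, i)]
        out.append([(1 - alpha) * a + alpha * b for a, b in zip(z, nn)])
    return out


# Toy 2-D latents; real VAE latents would be flattened tensors.
latents = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
mixed = interpolate_with_nearest(latents)
```

The interpolated latents would then be passed through the VAE decoder, and FID computed between the decoded images and the original dataset.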

[60] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

Wisdom Ikezogwo,Mehmet Saygin Seyfioglu,Ranjay Krishna,Karim Bouyarmane

Main category: cs.CV

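TL;DR: This paper proposes Implicit Error Counting (IEC) to address reward modeling for reinforcement learning in reference-free tasks that lack a single ideal answer. IEC produces per-aspect rewards through severity-weighted, multi-axis error enumeration and group calibration, and on virtual try-on it clearly outperforms baselines such as the reference-based Rubrics as Rewards (RaR).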

Details

Motivation: Existing RLVR and RaR methods depend on a single ideal answer to generate evaluation criteria, so they do not transfer well to real tasks that admit multiple valid outputs and have no clear reference answer, leaving a methodological gap. Method: Implicit Error Counting (IEC) avoids explicit error enumeration, which is too noisy, and instead uses implicit score emission with in-group calibration: errors are quantified with severity weights along task-relevant axes and converted into calibrated per-aspect rewards. Cascaded Error Counting (CEC) is introduced as an evaluation metric, and the high-mismatch benchmark MDressBench is built for validation. Result: On MDressBench, IEC outperforms RaR across all metrics (CEC scores of 5.31 vs. 5.60 and 5.20 vs. 5.53); on VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics; the CEC metric agrees well with human preferences (60% top-1 accuracy). Conclusion: When ideal answers are unavailable, reward modeling based on error counting is more robust and effective than building rubrics from reference answers, and IEC offers a new approach for reference-free tasks. Abstract: Reinforcement learning with verifiable rewards (RLVR) and Rubrics as Rewards (RaR) have driven strong gains in domains with clear correctness signals and even in subjective domains by synthesizing evaluation criteria from ideal reference answers. But many real-world tasks admit multiple valid outputs and lack the single ideal answer that rubric generation depends on. We identify this reference-free setting as a gap in current post-training methods and propose Implicit Error Counting (IEC) to fill it. Instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. We show that naïve explicit enumeration is too noisy for stable optimization, and that two design choices, implicit score emission and group calibration, are necessary to make error counting a reliable reward. As a case study, we validate IEC on virtual try-on (VTO), a domain that is simultaneously too constrained for holistic scoring and too permissive for rubric-based evaluation: subtle garment errors are unacceptable, yet many output variations are correct. We introduce Cascaded Error Counting (CEC) as an evaluation metric, which tracks human preferences well (60% top-1 vs. 30% others), and curate Mismatch-DressCode (MDressBench), a benchmark with maximal attribute mismatch to stress-test reward designs. On MDressBench, IEC outperforms RaR across all metrics (CEC: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. These results suggest that when ideal answers are unavailable, counting errors provides a stronger signal than constructing rubrics.
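
The abstract does not define group calibration. One plausible reading, sketched here purely as an assumption, is to standardize each aspect's severity-weighted error score across the group of outputs sampled for the same input, so that fewer errors than the group average yields a positive reward. The function name and the z-score form are hypothetical.

```python
from statistics import mean, pstdev


def group_calibrated_rewards(error_scores, eps=1e-8):
    """error_scores[g][a] is the severity-weighted error score of sample g
    on aspect a. Returns rewards[g][a], the negated z-score of that error
    within the group, so lower error gives higher reward."""
    n_aspects = len(error_scores[0])
    rewards = [[0.0] * n_aspects for _ in error_scores]
    for a in range(n_aspects):
        column = [row[a] for row in error_scores]
        mu, sigma = mean(column), pstdev(column)
        for g, value in enumerate(column):
            # eps keeps the division defined when every sample ties.
            rewards[g][a] = -(value - mu) / (sigma + eps)
    return rewards


# Three sampled outputs, two aspects; the second aspect is tied.
scores = [[0.0, 2.0], [2.0, 2.0], [4.0, 2.0]]
rewards = group_calibrated_rewards(scores)
```

With this normalization an aspect on which all samples make the same errors contributes no learning signal, which is the usual behavior of group-relative rewards.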

[61] Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

Jiaqi Li,Shuntian Zheng,Yixian Shen,Jia-Hong Huang,Xiaoman Lu,Minzhe Ni,Yu Guan

Main category: cs.CV

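TL;DR: This paper proposes SemVID, a training-free visual token pruning framework designed for Video Temporal Grounding (VTG). By retaining boundary-sensitive evidence and strengthening cross-frame connectivity, it keeps grounding accuracy high while sharply reducing the number of visual tokens.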

Details

Motivation: Existing training-free visual token pruning methods degrade sharply when applied directly to VTG, because they overlook VTG's dependence on evidence sensitive to event boundaries and on cross-frame reasoning chains. Method: Two VTG-specific pruning principles are identified, Evidence Retention (ER) and Connectivity Strength (CS). Based on them, the SemVID framework allocates token budgets per frame and selects three semantically complementary types of tokens: object tokens, motion tokens, and context tokens. Result: On VTG benchmarks, SemVID retains up to 95.4% mIoU with only 12.5% of the visual tokens, speeds up prefill by up to 5.8x, and consistently outperforms prior methods under the same budgets. Conclusion: SemVID shows that a task-tailored, training-free pruning strategy can maintain VTG performance while improving efficiency, offering a path toward efficient deployment of video-language models. Abstract: Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, making video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
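
The per-frame budget allocation step could look roughly like the sketch below. The abstract only says that budgets balance query relevance and inter-frame variation and avoid over-pruned segments, so the convex combination with weight `lam`, the per-frame `floor`, and the largest-remainder rounding are all assumptions, not SemVID's actual rule.

```python
def allocate_budgets(relevance, variation, total, floor=1, lam=0.5):
    """Split `total` tokens across frames in proportion to a blend of
    query relevance and inter-frame variation, with at least `floor`
    tokens per frame so that no segment is pruned away entirely."""
    n = len(relevance)
    remaining = total - floor * n
    if remaining < 0:
        raise ValueError("total budget is smaller than floor * number of frames")
    scores = [lam * r + (1 - lam) * v for r, v in zip(relevance, variation)]
    if sum(scores) == 0:
        scores = [1.0] * n  # fall back to a uniform split
    norm = sum(scores)
    shares = [remaining * s / norm for s in scores]
    budgets = [floor + int(s) for s in shares]
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = total - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Four frames, 16 tokens in total; frame 2 is both relevant and dynamic.
budgets = allocate_budgets(
    relevance=[0.9, 0.1, 0.5, 0.5],
    variation=[0.1, 0.1, 0.9, 0.5],
    total=16,
)
```

Token selection within each frame (object, motion, and context tokens) would then operate inside these budgets.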

[62] OWL: A Novel Approach to Machine Perception During Motion

Daniel Raviv,Juan D. Yepes

Main category: cs.CV

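TL;DR: This paper proposes a perception-related function called OWL that uses visual motion cues for real-time 3D perception without prior knowledge. It supports scaled 3D mapping and camera heading estimation and is relevant to both theory and robotics applications.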

Details

Motivation: To address the complex challenges of 3D perception during motion without relying on prior knowledge (such as a stationary environment or a camera motion model), and to bridge the gap between theoretical perception models and practical robotics applications. Method: The OWL function is built on two fundamental visual motion cues: local visual looming near the fixation point and rotation of the rigid object relative to the fixation point. It is computed directly from raw image sequences with pixel-based parallel computations and does not explicitly require measuring relative range or translational velocity. Result: In simulation, OWL achieves geometric constancy of 3D objects over time and reconstructs scaled 3D scenes from visual motion cues alone; it supports a real-time dynamic representation of 3D points without environmental assumptions. Conclusion: OWL offers a unified, analytical, time-based approach to 3D perception that suits robotics and autonomous navigation and may also offer a new perspective on natural perception mechanisms and neurobehavioral research. Abstract: We introduce a perception-related function, OWL, designed to address the complex challenges of 3D perception during motion. It derives its values directly from two fundamental visual motion cues, with one set of cue values per point per time instant. During motion, two visual motion cues relative to a fixation point emerge: 1) perceived local visual looming of points near the fixation point, and 2) perceived rotation of the rigid object relative to the fixation point. It also expresses the relation between two well-known physical quantities, the relative instantaneous directional range and directional translation in 3D between the camera and any visible 3D point, without explicitly requiring their measurement or prior knowledge of their individual values. OWL offers a unified, analytical time-based approach that enhances and simplifies key perception capabilities, including scaled 3D mapping and camera heading. Simulations demonstrate that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. By leveraging direct measurements from raw visual motion image sequences, OWL values can be obtained without prior knowledge of stationary environments, moving objects, or camera motion. This approach employs minimalistic, pixel-based, parallel computations, providing an alternative real-time representation for 3D points in relative motion. OWL bridges the gap between theoretical concepts and practical applications in robotics and autonomous navigation and may unlock new possibilities for real-time decision-making and interaction, potentially serving as a building block for next-generation autonomous systems. This paper offers an alternative perspective on machine perception, with implications that may extend to natural perception and contribute to a better understanding of behavioral psychology and neural functionality.

[63] MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

Dannong Xu,Zhongyu Yang,Jun Chen,Yingfang Yuan,Ming Hu,Lei Sun,Luc Van Gool,Danda Pani Paudel,Chun-Mei Feng

Main category: cs.CV

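TL;DR: This paper introduces the MultiHaystack benchmark for evaluating the real capabilities of multimodal large language models (MLLMs) on large-scale cross-modal retrieval and reasoning, revealing a major bottleneck of current models in heterogeneous multimodal retrieval.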

Details

Motivation: 现有基准多局限于小规模、单模态的检索候选集，无法反映真实场景中从大规模异构多模态语料库中检索相关证据并进行推理的关键需求，导致对模型端到端可靠性的高估。 Method: 构建首个支持大规模跨模态检索与推理联合评估的基准MultiHaystack，包含46,000+多模态检索候选（文档、图像、视频）和747个可验证的开放性问题，每个问题均锚定唯一真实证据；系统评测主流MLLMs与检索器（如E5-V）在检索+推理联合任务下的表现。 Result: 即使最强检索器E5-V的Recall@1仅40.8%；GPT-5等SOTA MLLMs在给定证据时推理准确率达80.86%，但在Top-5检索结果下骤降至51.4%，表明多模态检索是当前主要瓶颈。 Conclusion: MultiHaystack有效暴露了现有MLLMs在真实多模态检索场景下的关键缺陷，为推动以检索为中心的多模态系统发展提供了重要测试平台。 Abstract: Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. 
These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
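
Because each question is grounded in a single evidence item, Recall@k reduces to checking whether that item appears in the top k retrieved candidates. The sketch below illustrates the metric on toy data; the function names and the item ids are hypothetical and are not taken from the MultiHaystack code or dataset.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the single gold evidence item is in the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(runs, k):
    """Average Recall@k over a list of (ranked_ids, gold_id) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Toy retrieval runs (hypothetical ids, not MultiHaystack data).
runs = [
    (["doc3", "img7", "vid1"], "doc3"),  # gold item at rank 1
    (["img2", "vid9", "doc5"], "doc5"),  # gold item at rank 3
    (["vid4", "doc8", "img1"], "img6"),  # gold item not retrieved
]
r1 = mean_recall_at_k(runs, 1)
r3 = mean_recall_at_k(runs, 3)
```

With one gold item per question, a low Recall@1 directly caps how often a downstream reasoner can see the right evidence when it is given only the top result.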

[64] Interpretable Perception and Reasoning for Audiovisual Geolocation

Yiyang Su,Xiaoming Liu

Main category: cs.CV

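TL;DR: This paper proposes Audiovisual Geolocation, a framework that uses both audio and visual information for high-precision global geolocation. It builds AVG, a global benchmark of 20,000 video clips, and designs a three-stage method: a perception stage that extracts semantic acoustic atoms, a multimodal reasoning stage that fuses audio and visual features, and a precision prediction stage that predicts on the spherical manifold. The framework significantly outperforms unimodal baselines.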

Details

Motivation: Visual scenes are inherently ambiguous and auditory cues remain largely untapped, so image-based global geolocation is still challenging. Method: A three-stage framework: (1) a Perception stage uses a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantic "acoustic atoms"; (2) a Multimodal Reasoning stage uses a multimodal large language model finetuned with Group Relative Policy Optimization (GRPO) to fuse the acoustic atoms with visual features; (3) a Precision Prediction stage applies Riemannian Flow Matching on the $S^2$ spherical manifold. Result: The framework significantly outperforms unimodal baselines on the AVG benchmark, showing that combining interpretable soundscape perception with multimodal reasoning substantially improves global localization accuracy. Conclusion: Auditory cues are an important orthogonal complement to visual localization; interpretable perception combined with multimodal reasoning reduces geographic ambiguity and enables high-precision global localization. Abstract: While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results suggest that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
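
Flow matching on a manifold is typically built on geodesic paths between a source point and a target point; on the unit sphere $S^2$ the geodesic is a great-circle arc. The sketch below shows only that geometric building block, under the assumption that locations are represented as unit vectors; it is not the paper's model, and the function names are hypothetical.

```python
import math


def latlon_to_unit(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a unit vector on S^2."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))


def geodesic_point(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1 (both unit vectors)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    theta = math.acos(dot)  # geodesic distance on the unit sphere
    if theta < 1e-9:
        return x0  # the two points coincide
    if math.pi - theta < 1e-9:
        raise ValueError("geodesic between antipodal points is not unique")
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(x0, x1))


start = latlon_to_unit(0.0, 0.0)
end = latlon_to_unit(0.0, 90.0)
midpoint = geodesic_point(start, end, 0.5)  # lies on the equator at 45 degrees east
```

Interpolating along the arc keeps every intermediate point on the sphere, which straight-line interpolation in 3D would not.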

[65] Any to Full: Prompting Depth Anything for Depth Completion in One Stage

Zhiyuan Zhou,Ruofeng Liu,Taichi Liu,Weijian Zuo,Shanshan Wang,Zhiqing Hong,Desheng Zhang

Main category: cs.CV

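TL;DR: This paper proposes Any2Full, a one-stage, domain-general, and pattern-agnostic depth completion framework that adapts a pretrained monocular depth estimation (MDE) model through scale prompting, using a Scale-Aware Prompt Encoder to improve robustness and efficiency.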

Details

Motivation: Existing RGBD-fused depth completion methods generalize poorly and are sensitive to depth patterns; two-stage MDE integration strategies are computationally expensive and introduce structured distortions. Method: Any2Full reformulates depth completion as a scale-prompting adaptation of a pretrained MDE model; a Scale-Aware Prompt Encoder distills unified scale prompts from sparse inputs and guides the MDE model toward globally scale-consistent depth maps. Result: Improves on OMNI-DC by 32.2% in average AbsREL and runs 1.4x faster than PriorDA with the same MDE backbone, with markedly better robustness and efficiency. Conclusion: Any2Full establishes a new paradigm for universal depth completion that combines domain generality, pattern agnosticism, and computational efficiency. Abstract: Accurate, dense depth estimation is crucial for robotic perception, but commodity sensors often yield sparse or incomplete measurements due to hardware limitations. Existing RGBD-fused depth completion methods learn priors jointly conditioned on training RGB distribution and specific depth patterns, limiting domain generalization and robustness to various depth patterns. Recent efforts leverage monocular depth estimation (MDE) models to introduce domain-general geometric priors, but current two-stage integration strategies relying on explicit relative-to-metric alignment incur additional computation and introduce structured distortions. To this end, we present Any2Full, a one-stage, domain-general, and pattern-agnostic framework that reformulates completion as a scale-prompting adaptation of a pretrained MDE model. To address varying depth sparsity levels and irregular spatial distributions, we design a Scale-Aware Prompt Encoder. It distills scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors. Extensive experiments demonstrate that Any2Full achieves superior robustness and efficiency. It outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4$\times$ speedup over PriorDA with the same MDE backbone, establishing a new paradigm for universal depth completion. Codes and checkpoints are available at https://github.com/zhiyuandaily/Any2Full.
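
AbsREL (absolute relative error) is the mean of |prediction - ground truth| / ground truth over pixels with valid ground truth. The sketch below computes it on a toy depth list. Reading the reported 32.2% as a relative reduction in AbsREL is an assumption on our part, and `relative_improvement` is a hypothetical helper, not part of the Any2Full code.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over entries with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depth values")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_improvement(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (assumed reading of 'outperforms by X%')."""
    return (baseline_error - new_error) / baseline_error


# Toy depths in metres; the 0.0 entry marks a pixel with no ground truth.
gt = [2.0, 4.0, 0.0, 5.0]
pred = [2.2, 3.6, 1.0, 5.0]
error = abs_rel(pred, gt)  # (0.1 + 0.1 + 0.0) / 3
```

Because the error is divided by the true depth, a 0.2 m miss at 2 m counts as much as a 0.4 m miss at 4 m.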

[66] Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation

Junyu Chen,Md Yousuf Harun,Christopher Kanan

Main category: cs.CV

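TL;DR: This paper proposes an automated pipeline, requiring no human annotation, that converts the ImageNet training set into a high-quality multi-label dataset. It uses self-supervised ViTs for unsupervised object discovery and trains a lightweight classifier; the resulting multi-label annotations significantly improve model performance on ImageNet variants and downstream detection tasks.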

Details

Motivation: The original ImageNet assumes a single label per image, which cannot reflect real scenes where multiple objects co-occur; this causes label noise and limits the richness of the learning signal. Prior work improved only the validation set, leaving the training set without scalable, high-quality multi-label annotations. Method: An automated pipeline performs unsupervised object discovery with self-supervised Vision Transformers, selects regions aligned with the original labels to train a lightweight classifier, and applies that classifier to all image regions to produce consistent multi-label annotations for the full dataset. Result: The generated annotations agree closely with human judgment; top-1 accuracy improves by up to +2.0 on ReaL and +1.5 on ImageNet-V2, and mAP improves by up to +4.2 on COCO and +2.3 on VOC. Conclusion: High-quality multi-label annotations matter for both image classification performance and representation learning, and the proposed method offers a scalable way to build large multi-label datasets without human intervention. Abstract: The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNet-V2) have improved the validation set, there has not yet been a scalable, high-quality multi-label annotation for the training set. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to the traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning.
Project code and the generated multi-label annotations are available at https://github.com/jchen175/MultiLabel-ImageNet.

[67] From Phase Grounding to Intelligent Surgical Narratives

Ethan Peterson,Huixin Zhan

Main category: cs.CV

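TL;DR: This paper proposes a CLIP-based multimodal framework that automatically generates surgical timelines and narrative descriptions from surgical video by aligning video frames with textual gesture descriptions, reducing reliance on manual annotation.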

Details

Motivation: Existing ways of building surgical timelines rely on vague post-operation reports or time-consuming manual video annotation, so an efficient automated alternative is needed. Method: Uses a CLIP multimodal framework: the visual encoder extracts features from surgical video frames, the text encoder embeds gesture description sentences, the two are aligned in a shared embedding space, and the model is fine-tuned to strengthen the match between video gestures and textual tokens. Result: The model predicts the gesture and surgical phase for each video frame, which allows a structured surgical timeline to be built; experiments indicate that the method reduces the burden of manual review and annotation. Conclusion: The method uses pretrained multimodal representations to bridge visual gestures and textual narratives, offering a low-intervention, efficient way to generate timelines for tool-assisted surgery. Abstract: Video surgery timelines are an important part of tool-assisted surgeries, as they allow surgeons to quickly focus on key parts of the procedure. Current methods involve the surgeon filling out a post-operation (OP) report, which is often vague, or manually annotating the surgical videos, which is highly time-consuming. Our proposed method sits between these two extremes: we aim to automatically create a surgical timeline and narrative directly from the surgical video. To achieve this, we employ a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. Specifically, we use the CLIP visual encoder to extract representations from surgical video frames and the text encoder to embed the corresponding gesture sentences into a shared embedding space. We then fine-tune the model to improve the alignment between video gestures and textual tokens. Once trained, the model predicts gestures and phases for video frames, enabling the construction of a structured surgical timeline. This approach leverages pretrained multi-modal representations to bridge visual gestures and textual narratives, reducing the need for manual video review and annotation by surgeons.
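
Once frames and gesture sentences share an embedding space, a frame can be labeled with the gesture whose text embedding is most similar, and consecutive frames with the same label can be merged into timeline segments. The sketch below illustrates both steps with hand-written toy vectors standing in for encoder outputs; the gesture names, the cosine-argmax labeling rule, and the merging rule are all assumptions for illustration, not the authors' pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def label_frame(frame_emb, text_embs):
    """Pick the gesture whose text embedding is closest to the frame embedding."""
    return max(text_embs, key=lambda name: cosine(frame_emb, text_embs[name]))


def frames_to_timeline(labels, fps=1.0):
    """Merge runs of identical per-frame labels into (label, start_s, end_s) segments."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start / fps, i / fps))
            start = i
    return segments


# Toy 2-D embeddings (placeholders for encoder outputs).
text_embs = {"reach": (1.0, 0.0), "grasp": (0.0, 1.0)}
frame_embs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (1.0, 0.0)]
labels = [label_frame(f, text_embs) for f in frame_embs]
timeline = frames_to_timeline(labels, fps=1.0)
```

In practice some temporal smoothing would probably be needed before merging, since a single mislabeled frame splits a segment in two.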

[68] Full Dynamic Range Sky-Modelling For Image Based Lighting

Ian J. Maquignaz

Main category: cs.CV

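TL;DR: This paper proposes Icarus, an all-weather, Full Dynamic Range (FDR) sky model that uses deep learning to accurately model and controllably generate real outdoor environment maps, significantly improving the realism of light transmission, shadows, and tones in image-based lighting (IBL).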

Details

Motivation: Existing deep-learning sky models struggle at high resolution to accurately model the solar region (14EV+, class-imbalanced), which distorts lighting and skews shadows and tones; a more accurate, photometrically consistent way to generate sky environment maps is needed. Method: Proposes Icarus, an all-weather sky model supporting conditional generation: users can specify solar and cloud positions and control atmospheric texture through text, and the model learns the exposure range of physically captured Full Dynamic Range (FDR) outdoor imagery; it uses an end-to-end deep network architecture adapted to physical lighting constraints. Result: In image-based lighting (IBL), Icarus performs on par with or better than physically captured FDR environment maps and parametric models, reaching a new level of lighting directionality (shadows), tonal fidelity, and light-transport accuracy. Conclusion: Icarus provides a high-fidelity, controllable, and practical way to generate sky environment maps for outdoor scene modelling, addressing key weaknesses of current DNN sky models in high dynamic range and class-imbalanced regions. Abstract: Accurate environment maps are a key component of modelling real-world outdoor scenes. They enable captivating visual arts, immersive virtual reality and a wide range of scientific and engineering applications. To alleviate the burden of physical capture, physical simulation and volumetric rendering, sky-models have been proposed as fast, flexible, and cost-saving alternatives. In recent years, sky-models have been extended through deep learning to be more comprehensive and inclusive of cloud formations, but recent work has demonstrated these models fall short in faithfully recreating accurate and photorealistic natural skies. Particularly at higher resolutions, DNN sky-models struggle to accurately model the 14EV+ class-imbalanced solar region, resulting in poor visual quality and scenes illuminated with skewed light transmission, shadows and tones. In this work, we propose Icarus, an all-weather sky-model capable of learning the exposure range of Full Dynamic Range (FDR) physically captured outdoor imagery. Our model allows conditional generation of environment maps with intuitive user-positioning of solar and cloud formations, and extends the current state of the art to enable user-controlled texturing of atmospheric formations. Through our evaluation, we demonstrate Icarus is interchangeable with FDR physically captured outdoor imagery or parametric sky-models, and illuminates scenes with unprecedented accuracy, photorealism, lighting directionality (shadows), and tones in Image Based Lighting (IBL).

[69] Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

Ruidong Chen,Yancheng Bai,Xuanpu Zhang,Jianhao Zeng,Lanjun Wang,Dan Song,Lei Sun,Xiangxiang Chu,Anan Liu

Main category: cs.CV

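TL;DR: This paper proposes LayerBind, a training-free, plug-and-play method for controlling regions and occlusion order, which achieves precise layout control through layer-wise modeling and rearrangement of the early latent structure.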

Details

Motivation: Existing region-layout control methods for text-to-image generation suffer from data bias, degraded image quality, and difficulty handling occlusion order. The authors observe that spatial layout and occlusion relations are established early in denoising, so adjusting the early latent representation is enough to control the final output. Method: LayerBind has two phases: (1) Layer-wise Instance Initialization uses the contextual sharing mechanism in multimodal joint attention to create an independent branch per instance anchored to a shared background, then fuses the branches at a designated early step to establish the preset layout; (2) Layer-wise Semantic Nursing adds a layered attention path that runs alongside the global path and composites the updates under a layer-transparency scheduler, reinforcing regional details and maintaining occlusion order. Result: LayerBind performs well in both qualitative and quantitative experiments, is plug-and-play across Diffusion Transformers, and natively supports editable workflows such as swapping instances or rearranging visible order. Conclusion: LayerBind is an efficient, flexible, training-free layout control scheme that improves region placement and occlusion modeling in text-to-image generation, with strong potential for creative applications. Abstract: Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler.
LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind's effectiveness, highlighting its strong potential for creative applications.

[70] Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval

Donghoon Han,Eunhwan Park,Seunghyeon Seo

Main category: cs.CV

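TL;DR: BM25-V is a visual retrieval method that combines a Sparse Auto-Encoder (SAE) with Okapi BM25. It generates interpretable visual words from ViT patch features and uses IDF weighting for efficient, high-recall, attributable first-stage retrieval, approaching dense-retrieval accuracy on multiple benchmarks.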

Details

Motivation: Dense image retrieval offers poor interpretability, high compute cost, and no attribution; sparse representations can improve both efficiency and interpretability. Method: Trains a Sparse Auto-Encoder (SAE) on ViT patch features to produce visual words; represents each image as a sparse vector of visual-word activations; exploits the Zipfian distribution of visual words across a large gallery by applying BM25 with IDF weighting; builds a sparse inverted index for fast first-stage retrieval and reports attributable per-word contributions. Result: Recall@200 ≥ 0.993 on 7 benchmarks; the two-stage pipeline reranks only 200 candidates with an average accuracy loss of only 0.2%; an SAE pretrained on ImageNet-1K transfers zero-shot to 7 fine-grained tasks; retrieval decisions can be attributed to specific visual words and their IDF weights. Conclusion: BM25-V delivers efficient, high-recall, interpretable, and transferable visual retrieval, keeping near-dense accuracy while substantially lowering compute cost and adding attribution. Abstract: Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 $\geq$ 0.993, enabling a two-stage pipeline that reranks only $K=200$ candidates per query and recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
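
The scoring step can be sketched as ordinary BM25 over an inverted index, with visual-word ids playing the role of terms. The example below assumes each image has already been reduced to a dictionary of visual-word counts (how SAE activations become counts is not specified in the abstract, so that step is a placeholder), uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)), and picks the common defaults k1 = 1.2 and b = 0.75. None of these choices are confirmed as the paper's settings.

```python
import math
from collections import defaultdict


def build_index(gallery):
    """gallery: {image_id: {word_id: count}} -> (postings, lengths, average length)."""
    postings = defaultdict(dict)
    for image_id, words in gallery.items():
        for word_id, count in words.items():
            postings[word_id][image_id] = count
    lengths = {image_id: sum(words.values()) for image_id, words in gallery.items()}
    avg_len = sum(lengths.values()) / len(lengths)
    return postings, lengths, avg_len


def bm25_scores(query_words, postings, lengths, avg_len, k1=1.2, b=0.75):
    """Score every gallery image that shares at least one visual word with the query."""
    n_images = len(lengths)
    scores = defaultdict(float)
    for word_id in query_words:
        docs = postings.get(word_id, {})
        if not docs:
            continue
        # Rare words get a large IDF, ubiquitous words a small one.
        idf = math.log((n_images - len(docs) + 0.5) / (len(docs) + 0.5) + 1.0)
        for image_id, tf in docs.items():
            norm = tf + k1 * (1.0 - b + b * lengths[image_id] / avg_len)
            scores[image_id] += idf * tf * (k1 + 1.0) / norm
    return dict(scores)


# Toy gallery: word 1 is ubiquitous, word 7 is rare (hypothetical ids).
gallery = {"a": {1: 2, 7: 1}, "b": {1: 1, 3: 2}, "c": {1: 1, 5: 1}}
postings, lengths, avg_len = build_index(gallery)
scores = bm25_scores([7, 1], postings, lengths, avg_len)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Each per-word term in the sum is also the word's contribution to the final score, which is what makes a retrieval decision attributable to specific visual words.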

[71] Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction

Ling Xiao,Yuliang Xiu,Yue Chen,Guoming Wang,Toshihiko Yamasaki

Main category: cs.CV

TL;DR: 本文提出了一种频谱诊断框架，用于评估2D-to-3D重建中上采样方法对几何一致性的影响，发现频谱结构一致性比空间细节增强更能决定重建质量。

Details

Motivation: Existing learnable upsampling methods focus on enhancing spatial detail (such as sharper geometry and richer textures), but their effect on 3D awareness, that is, cross-view geometric consistency, has not been studied systematically. Method: Builds a spectral diagnostic framework with six complementary metrics that quantify amplitude redistribution, structural spectral alignment, and directional stability; compares classical interpolation with learnable upsampling on CLIP and DINO backbones. Result: (1) Structural spectral consistency (SSC/CSC) is the strongest predictor of novel view synthesis (NVS) quality; (2) High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance; (3) Angular Energy Consistency (ADC) matters more for geometric accuracy, while SSC/CSC matter slightly more for texture fidelity; (4) learnable upsamplers usually do not beat classical interpolation in reconstruction quality. Conclusion: 2D-to-3D reconstruction quality depends more on preserving spectral structure than on enhancing spatial detail, so spectral consistency should be a core principle when designing upsampling strategies. Abstract: A typical 2D-to-3D pipeline takes multi-view images as input, where a Vision Foundation Model (VFM) extracts features that are spatially upsampled to dense representations for 3D reconstruction. If dense features across views preserve geometric consistency, differentiable rendering can recover an accurate 3D representation, making the feature upsampler a critical component. Recent learnable upsampling methods mainly aim to enhance spatial details, such as sharper geometry or richer textures, yet their impact on 3D awareness remains underexplored. To address this gap, we introduce a spectral diagnostic framework with six complementary metrics that characterize amplitude redistribution, structural spectral alignment, and directional stability. Across classical interpolation and learnable upsampling methods on CLIP and DINO backbones, we observe three key findings. First, structural spectral consistency (SSC/CSC) is the strongest predictor of NVS quality, whereas High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance, indicating that emphasizing high-frequency details alone does not necessarily improve 3D reconstruction. Second, geometry and texture respond to different spectral properties: Angular Energy Consistency (ADC) correlates more strongly with geometry-related metrics, while SSC/CSC influence texture fidelity slightly more than geometric accuracy.
Third, although learnable upsamplers often produce sharper spatial features, they rarely outperform classical interpolation in reconstruction quality, and their effectiveness depends on the reconstruction model. Overall, our results indicate that reconstruction quality is more closely related to preserving spectral structure than to enhancing spatial detail, highlighting spectral consistency as an important principle for designing upsampling strategies in 2D-to-3D pipelines.

[72] EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition

Adam D. Hines,Gokul B. Nair,Nicolás Marticorena,Michael Milford,Tobias Fischer

Main category: cs.CV

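TL;DR: This paper proposes EventGeM, a global-to-local feature fusion method for visual place recognition with event cameras. It combines ViT, MaxViT, and a vision foundation model, achieves state-of-the-art performance across multiple datasets and lighting conditions, and supports real-time deployment and online localization on a real robot.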

Details

Motivation: Event cameras are increasingly important for robot navigation and localization because of their sparse activation and high temporal resolution, but existing event-based place recognition methods still have room to improve in accuracy, robustness, and real-time performance. Method: Uses a pretrained ViT-S/16 to extract global features from event histogram images; detects local keypoints with a pretrained MaxViT and re-ranks using 2D homography with RANSAC; then uses a pretrained vision foundation model for depth estimation and re-ranks again by structural similarity. Result: Reaches state-of-the-art event-based place recognition performance on several benchmark datasets and lighting conditions, runs in real time on a variety of compute platforms, and has been deployed on a real robot platform for online localization. Conclusion: Through multi-stage feature fusion and re-ranking across appearance, geometry, and structure, EventGeM improves the accuracy, robustness, and practicality of event-based visual place recognition and demonstrates its feasibility in real-world settings. Abstract: Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount. In this work, we present EventGeM, a state-of-the-art global-to-local feature fusion pipeline for event-based Visual Place Recognition. We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global patch feature embeddings from event histogram images for initial match predictions. Local feature keypoints are then detected using a pre-trained MaxViT backbone for 2D-homography based re-ranking with RANSAC. For additional re-ranking refinement, we subsequently use a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries. Our work performs state-of-the-art localization when compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions, all whilst being fully capable of running in real-time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera. Project page: https://eventgemvpr.github.io/

[73] Training-free Latent Inter-Frame Pruning with Attention Recovery

Dennis Menn,Yuedong Yang,Bokun Wang,Xiwen Wei,Mustafa Munir,Feng Liang,Radu Marculescu,Chenfeng Xu,Diana Marculescu

Main category: cs.CV

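TL;DR: This paper proposes the LIPAR framework, which detects and skips recomputation of duplicated video latent patches and adds an attention recovery mechanism, increasing video editing throughput by 1.45x while preserving generation quality.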

Details

Motivation: Current video generation models have high computational latency, which makes real-time applications impractical, so compute cost must be reduced. Method: Proposes the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which exploits the temporal redundancy of video latent patches to prune across frames, and designs an Attention Recovery mechanism that approximates the attention values of pruned tokens to avoid visual artifacts. Result: Averages 12.2 FPS on an NVIDIA A6000 versus 8.4 FPS for the baseline, a 1.45x improvement; requires no additional training, does not harm generation quality, and is plug-and-play. Conclusion: LIPAR combines ideas from traditional compression with modern generative pipelines, substantially accelerating video generation without sacrificing quality and offering a practical route to real-time video editing. Abstract: Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects and skips recomputing duplicated latent patches. Additionally, we introduce a novel Attention Recovery mechanism that approximates the attention values of pruned tokens, thereby removing visual artifacts arising from naively applying the pruning method. Empirically, our method increases video editing throughput by $1.45\times$, on average achieving 12.2 FPS on an NVIDIA A6000 compared to the baseline 8.4 FPS. The proposed method does not compromise generation quality and can be seamlessly integrated with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.
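
The pruning idea can be illustrated with a toy cache: a patch whose latent is effectively unchanged from the previous frame reuses the previous output instead of being recomputed. The duplicate test used here (an elementwise tolerance) and the function names are hypothetical; the abstract does not give LIPAR's actual detection rule, and the Attention Recovery step is not modeled at all. The last line only checks the reported speedup arithmetic.

```python
def changed_patch_mask(prev_patches, curr_patches, tol=1e-3):
    """True for each patch that differs from the same patch in the previous frame."""
    return [
        max(abs(a - b) for a, b in zip(prev, curr)) > tol
        for prev, curr in zip(prev_patches, curr_patches)
    ]


def process_frame(prev_patches, prev_outputs, curr_patches, expensive_fn, tol=1e-3):
    """Recompute only changed patches; reuse cached outputs for duplicates."""
    mask = changed_patch_mask(prev_patches, curr_patches, tol)
    outputs = [
        expensive_fn(patch) if changed else cached
        for patch, cached, changed in zip(curr_patches, prev_outputs, mask)
    ]
    return outputs, sum(mask)


calls = []


def toy_denoise(patch):
    """Stand-in for an expensive per-patch computation."""
    calls.append(patch)
    return [2.0 * v for v in patch]


frame0 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frame1 = [[0.1, 0.2], [0.9, 0.4], [0.5, 0.6]]  # only the middle patch changed
out0 = [toy_denoise(p) for p in frame0]
calls.clear()
out1, recomputed = process_frame(frame0, out0, frame1, toy_denoise)

speedup = 12.2 / 8.4  # reported FPS over baseline FPS, about 1.45
```

In a real attention-based model, skipping tokens changes what the remaining tokens attend to, which is presumably why the paper pairs pruning with a recovery mechanism.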

[74] Margin and Consistency Supervision for Calibrated and Robust Vision Models

Salim Khazem

Main category: cs.CV

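TL;DR: This paper proposes MaCS, a regularization framework that jointly imposes a logit-space margin constraint and a local prediction consistency constraint, improving the calibration, robustness, and generalization of deep vision classifiers without extra data or model changes.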

Details

Motivation: Deep vision classifiers are accurate but poorly calibrated and fragile under small distribution shifts. Method: MaCS has two parts: (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor; (ii) a consistency regularizer that minimizes the KL divergence between predictions on the original input and on a mildly perturbed input. The theoretical analysis unifies larger margins and lower local sensitivity through a Lipschitz-type stability proxy and derives a generalization bound and a robustness radius bound. Result: Across several image classification benchmarks and CNN/ViT backbones, MaCS consistently lowers ECE and NLL (better calibration) and improves robustness to common image corruptions while preserving or improving top-1 accuracy, with no extra data, no architectural changes, and negligible inference overhead. Conclusion: MaCS is a simple, general, and efficient drop-in replacement for standard training objectives that improves calibration, robustness, and generalization. Abstract: Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing the classification margin while reducing local sensitivity (formalized via a Lipschitz-type stability proxy) yields improved generalization guarantees and a provable robustness radius bound scaling with the margin-to-sensitivity ratio. Across several image classification benchmarks and several backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data, no architectural changes, and negligible inference overhead, making it an effective drop-in replacement for standard training objectives.
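
The objective described above can be written out for a single example as cross-entropy plus the two regularizers. The sketch below is a plain-Python illustration, not the authors' implementation: the loss weights, the target margin, and the KL direction (clean predictions as the reference distribution) are assumptions, since the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])


def hinge_squared_margin(logits, label, target_margin):
    """Penalize a logit gap to the strongest competitor that is below the target."""
    competitor = max(z for i, z in enumerate(logits) if i != label)
    gap = logits[label] - competitor
    return max(0.0, target_margin - gap) ** 2


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def macs_loss(clean_logits, perturbed_logits, label,
              target_margin=1.0, margin_weight=0.5, consistency_weight=1.0):
    """Cross-entropy + margin penalty + clean/perturbed consistency (weights are assumed)."""
    ce = cross_entropy(clean_logits, label)
    margin = hinge_squared_margin(clean_logits, label, target_margin)
    consistency = kl_divergence(softmax(clean_logits), softmax(perturbed_logits))
    return ce + margin_weight * margin + consistency_weight * consistency


loss = macs_loss([1.2, 1.0, 0.5], [1.0, 1.1, 0.5], label=0)
```

The margin term vanishes once the correct logit leads by at least the target margin, so it only pushes on examples that are correct by a thin gap or wrong.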

[75] Remote Sensing Image Classification Using Deep Ensemble Learning

Niful Islam,Md. Rayhan Ahmed,Nur Mohammad Fahad,Salekul Islam,A. K. M. Muzahidul Islam,Saddam Mukta,Swakkhar Shatabda

Main category: cs.CV

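TL;DR: This paper proposes an ensemble of CNN-ViT fusion models: four independent fusion models are trained and combined at prediction time, which avoids the performance bottleneck caused by redundant features and achieves state-of-the-art accuracy on several remote sensing datasets while improving computational efficiency.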

Details

Motivation: CNNs are good at local feature extraction but struggle to model global context, a gap that ViTs fill through self-attention; however, simply stacking more CNN and ViT components introduces redundant feature representations and creates a performance bottleneck. Method: A multi-model fusion framework trains four independent CNN-ViT fusion models (each with CNN and ViT backbones) and combines their outputs at the final prediction stage through ensembling, avoiding redundancy in intermediate features. Result: Reaches 98.10%, 94.46%, and 95.45% classification accuracy on the UC Merced, RSSCN7, and MSRSI datasets respectively, outperforming existing methods while using training compute more efficiently. Conclusion: Combining CNNs and ViTs need not rely only on feature-level fusion; prediction-level ensembling can give a better balance of accuracy and efficiency, a strategy of general value for remote sensing image classification. Abstract: Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. Convolutional Neural Networks (CNNs), which are widely used for image classification, excel at local feature extraction but struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self-attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs, therefore, leads to better performance than standalone architectures. However, the use of additional CNN and ViT components does not lead to further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combine their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly due to its efficient use of computational resources during training.
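
Prediction-stage ensembling can be as simple as averaging the class probabilities of the independently trained models and taking the argmax. The abstract says only that outputs are combined "through ensembling", so the soft-voting rule below is one plausible reading rather than the authors' stated method, and the probabilities are made up for illustration.

```python
def ensemble_predict(prob_lists):
    """Average per-class probabilities across models and return (label, averaged probs)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg


# Toy outputs from four fusion models on one image, three classes.
model_probs = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.40, 0.45, 0.15],
    [0.50, 0.30, 0.20],
]
label, avg = ensemble_predict(model_probs)
```

Here two of the four models individually prefer class 1, yet the averaged distribution favors class 0, which shows how soft voting weighs confidence rather than counting votes.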

[76] Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation

Haonan Wang,Hanyu Zhou,Haoyue Liu,Tao Gu,Luxin Yan

Main category: cs.CV

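TL;DR: This paper proposes Cog2Gen3D, a 3D cognition-guided diffusion framework that fuses semantic and absolute geometric information into a 3D cognition graph to improve the physical plausibility and structural rationality of 3D generation.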

Details

Motivation: Existing 3D generation methods rely on relative geometric relations, which easily leads to inconsistent absolute scale, and they cannot jointly model semantics and absolute geometry in the physical world. Method: A three-part design: (1) Cognitive Feature Embeddings encode semantic and geometric representations separately and extract logical representations; (2) dual-stream semantic-geometric graphs are built and fused into a 3D cognition graph via common-based cross-attention; (3) the cognition graph conditions a latent diffusion process that generates 3D Gaussian representations. Result: On a validation subset built from Marble World Labs, Cog2Gen3D significantly outperforms existing methods in both semantic fidelity and geometric plausibility. Conclusion: Jointly modeling semantic information and absolute geometry is key to controllable, physically credible 3D generation, and Cog2Gen3D offers a unified cognition-guided framework for it. Abstract: Generative models have achieved success in producing semantically plausible 2D images, but 3D generation remains challenging due to the absence of spatial geometry constraints. Typically, existing methods utilize geometric features as conditions to enhance spatial awareness. However, these methods can only model relative relationships and are prone to scale inconsistency of absolute geometry. Thus, we argue that semantic information and absolute geometry empower 3D cognition, thereby enabling controllable 3D generation for the physical world. In this work, we propose Cog2Gen3D, a 3D cognition-guided diffusion framework for 3D generation. Our model is guided by three key designs: 1) Cognitive Feature Embeddings. We encode different modalities into semantic and geometric representations and further extract logical representations. 2) 3D Latent Cognition Graph. We structure different representations into dual-stream semantic-geometric graphs and fuse them via common-based cross-attention to obtain a 3D cognition graph. 3) Cognition-Guided Latent Diffusion. We leverage the fused 3D cognition graph as the condition to guide the latent diffusion process for 3D Gaussian generation. Under this unified framework, the 3D cognition graph ensures the physical plausibility and structural rationality of 3D generation. Moreover, we construct a validation subset based on the Marble World Labs. Extensive experiments demonstrate that our Cog2Gen3D significantly outperforms existing methods in both semantic fidelity and geometric plausibility.

[77] VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction

Muhua Zhu,Xinhao Jin,Yu Zhang,Yifei Xue,Tie Ji,Yizhen Lao

Main category: cs.CV

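TL;DR: VS3R is a new video stabilization framework that combines feed-forward 3D reconstruction with generative video diffusion, balancing geometric robustness and full-frame consistency and markedly improving stabilization and visual quality under extreme motion.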

Details

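Motivation: Existing 2D methods crop aggressively and 3D methods rely on optimization that is fragile under extreme motion, so a method that offers both geometric robustness and full-frame consistency is needed. Method: The VS3R framework jointly estimates camera parameters, depth, and masks; a Hybrid Stabilized Rendering module fuses semantic and geometric cues; a Dual-Stream Video Diffusion Model combines structural guidance with semantic anchors to restore disoccluded regions and correct artifacts. Result: Achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality. Conclusion: By combining 3D reconstruction with generative diffusion, VS3R addresses the fundamental trade-off between geometric robustness and full-frame consistency in video stabilization. Abstract: Video stabilization aims to mitigate camera shake but faces a fundamental trade-off between geometric robustness and full-frame consistency. While 2D methods suffer from aggressive cropping, 3D techniques are often undermined by fragile optimization pipelines that fail under extreme motions. To bridge this gap, we propose VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion. Our pipeline jointly estimates camera parameters, depth, and masks to ensure all-scenario reliability, and introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency. Finally, a Dual-Stream Video Diffusion Model restores disoccluded regions and rectifies artifacts by synergizing structural guidance with semantic anchors. Collectively, VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.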

[78] TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis

Sijing Li,Zhongwei Qiu,Jiang Liu,Wenqiao Zhang,Tianwei Lin,Yihan Xie,Jianxiang An,Boxiang Yun,Chenglin Yang,Jun Xiao,Guangyu Guo,Jiawen Yao,Wei Liu,Yuan Gao,Ke Yan,Weiwei Cao,Zhilin Zheng,Tony C. W. Mok,Kai Cao,Yu Shi,Jiuyu Zhang,Jian Zhou,Beng Chin Ooi,Yingda Xia,Ling Zhang

Main category: cs.CV

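TL;DR: This paper presents the TumorCoT dataset and the TumorChain model, which use multimodal chain-of-thought reasoning to deliver traceable, low-hallucination tumor analysis from 3D CT imaging through clinical impressions to pathology predictions.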

Details

Motivation: Clinical tumor analysis requires early detection, lesion characterization, and pathology risk assessment; chain-of-thought (CoT) reasoning can improve diagnostic traceability and accuracy and reduce errors. Method: Builds the large-scale TumorCoT dataset (1.5M VQA instructions with chain-of-thought rationales, paired with 3D CT scans) and proposes the TumorChain framework, which couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment, and performs multi-round self-refinement through cross-modal alignment and iterative interleaved causal reasoning. Result: Significantly outperforms strong baselines on lesion detection, impression generation, and pathology classification, and generalizes well on the DeepTumorVQA benchmark. Conclusion: Multimodal chain-of-thought reasoning can help advance reliable, interpretable tumor analysis in clinical practice. Abstract: Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment guide diagnosis and treatment planning. Chain-of-Thought (CoT) reasoning is particularly important in this setting because it enables step-by-step interpretation from imaging findings to clinical impressions and pathology conclusions, improving traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions. We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the trajectory from findings to impression to pathology, enabling evaluation of both answer accuracy and reasoning consistency. We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk. Experiments show consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification, and demonstrate strong generalization on the DeepTumorVQA benchmark.
These results highlight the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. Detailed information about our project can be found on our project homepage at https://github.com/ZJU4HealthCare/TumorChain.

[79] PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

Yukun Qi,Pei Fu,Hang Li,Yuhan Liu,Chao Jiang,Bin Qin,Zhenbo Luo,Jian Luan

Main category: cs.CV

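TL;DR: This paper proposes PatchCue, a patch-based visual cue paradigm that improves the visual reasoning of vision-language models (VLMs) through two-stage training (supervised fine-tuning followed by reinforcement learning with a process reward), and that clearly outperforms pixel-level and point-level cue methods on multiple benchmarks.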

Details

Motivation: Existing reasoning paradigms for vision-language models (such as CoT) rely mainly on textual information and neglect visual cues, while pixel-level cues require precise spatial localization, which makes learning harder. Method: Proposes PatchCue, which partitions the image into patches and provides visual cues at the patch level, trained in two stages: supervised fine-tuning so that the model outputs patch-level cues, then reinforcement learning with a process-supervised cue reward that optimizes intermediate reasoning steps. Result: Across multiple models and tasks, including general visual question answering, complex reasoning, and document understanding, PatchCue consistently improves performance; patch-level cues outperform pixel-level bounding boxes and point-based cues. Conclusion: PatchCue is a more effective visual reasoning paradigm that matches human perceptual habits and makes better use of the patch-tokenized input of modern VLMs, strengthening their multimodal reasoning. Abstract: Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.
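
To make the idea of a patch-level cue concrete, the following minimal sketch converts a pixel bounding box into the indices of the grid patches it overlaps. The function name `box_to_patches`, the square grid, the row-major index order, and the exclusive right/bottom box edges are all assumptions made for illustration; the abstract does not specify PatchCue's actual cue format.

```python
def box_to_patches(box, image_size, grid):
    """Return row-major indices of the grid patches overlapped by a pixel box.

    box: (x0, y0, x1, y1) in integer pixels, with x1 and y1 exclusive (assumed).
    image_size: (width, height) in pixels.
    grid: number of patches per side (hypothetical square grid).
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    # Map pixel coordinates to patch columns and rows with integer arithmetic.
    col_start = x0 * grid // width
    col_end = min(grid - 1, (x1 - 1) * grid // width)
    row_start = y0 * grid // height
    row_end = min(grid - 1, (y1 - 1) * grid // height)
    return [row * grid + col
            for row in range(row_start, row_end + 1)
            for col in range(col_start, col_end + 1)]


# A 224x224 image split into a 14x14 grid has 16-pixel patches.
patches = box_to_patches((30, 40, 70, 60), (224, 224), 14)
```

A cue expressed this way is a short list of discrete indices rather than four continuous coordinates, which is one plausible reading of why patch-level cues need less precise localization.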

[80] Shifting Adaptation from Weight Space to Memory Space: A Memory-Augmented Agent for Medical Image Segmentation

Bowen Chen,Qiaohui Gao,Shaowen Wan,Shanhui Sun,Wei Liu,Xiang Li,Tianming Liu,Lin Zhao

Main category: cs.CV

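TL;DR: This paper proposes MemSeg-Agent, a memory-augmented segmentation agent that shifts adaptation from weight space to memory space, supporting few-shot learning, federated supervised learning, and test-time adaptation while substantially reducing communication overhead in federated learning and improving cross-domain robustness.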

Details

Motivation: Medical image segmentation models trained on a single dataset generalize poorly across institutions, scanners, and patient populations; vision foundation models are promising but need task-specific fine-tuning, which causes heavy communication overhead in federated learning and prevents knowledge from evolving continuously during deployment. Method: Proposes MemSeg-Agent, which conditions a fixed backbone with lightweight static memory, few-shot memory, and test-time working memory, composed dynamically by an agentic controller; in federated settings only the compact memory units are updated, not the model parameters. Result: Strong performance and robustness to domain shift on four public datasets: static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning. Conclusion: MemSeg-Agent introduces a new paradigm for scalable, adaptive medical image segmentation in the era of agentic AI. Abstract: Medical image segmentation is fundamental to clinical workflows, yet models trained on a single dataset often fail to generalize across institutions, scanners, or patient populations. While vision foundation models have shown great promise in addressing this challenge, their deployment typically requires task-specific fine-tuning, which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment. In this work, we propose a memory-augmented segmentation agent (MemSeg-Agent) that shifts adaptation from weight space to memory space, enabling few-shot learning, federated supervised learning, and test-time adaptation within a unified architecture. MemSeg-Agent conditions a fixed backbone with lightweight static, few-shot, and test-time working memories, which are dynamically composed by an agentic controller. In federated settings, we update compact memory units instead of model parameters, substantially reducing communication overhead. Experiments on four public datasets demonstrate strong performance and robustness to domain shift: Static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning. Overall, MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI.

[81] Systematic Evaluation of Novel View Synthesis for Video Place Recognition

Muhammad Zawad Mahmud,Samiha Islam,Damian Lyons

Main category: cs.CV

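TL;DR: This paper systematically evaluates synthetic novel views for Video Place Recognition (VPR). Adding a small number of synthetic views improves recognition; for larger additions, the magnitude of viewpoint change matters less than the number of views added and the type of imagery.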

Details

Motivation: Synthetic novel views (for example, generating aerial views from ground views and vice versa) could improve Video Place Recognition (VPR) for robot navigation and make cross-view matching more robust. Method: Systematically evaluates how synthetic novel views, varying in number and viewpoint difference, affect VPR performance, using five public VPR image datasets and seven typical image similarity methods. Result: Small synthetic additions improve VPR recognition statistics; for larger additions, the gain depends mainly on the number of views added and the type of imagery rather than on the magnitude of viewpoint change. Conclusion: Synthetic novel views are an effective way to improve VPR, but the benefit saturates and depends on the data, so the size of the addition must be weighed against dataset characteristics. Abstract: The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.

[82] CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis

Qiwei Wang,Xianghui Ze,Jingyi Yu,Yujiao Shi

Main category: cs.CV

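TL;DR: This paper proposes CylinderSplat, a feed-forward 3D Gaussian Splatting framework for panoramic imagery. A cylindrical Triplane representation suited to panoramic data and a dual-branch architecture address occlusion and geometric distortion under sparse views, achieving state-of-the-art single-view and multi-view panoramic novel view synthesis.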

Details

Motivation: Existing feed-forward 3D Gaussian Splatting methods perform poorly on panoramic imagery: geometric refinement based on multi-view cost volumes struggles with occlusions under sparse views, and standard volumetric representations such as Cartesian Triplanes do not match the inherent geometry of 360° scenes, causing distortion and aliasing. Method: Proposes the CylinderSplat framework, built around a new cylindrical Triplane representation that better fits panoramic data and the Manhattan-world assumption; a dual-branch architecture uses a pixel-based branch to reconstruct well-observed regions and a volume-based branch that uses the cylindrical Triplane to complete occluded or sparsely viewed regions; a variable number of input panoramas (one to many) is supported. Result: State-of-the-art results on both single-view and multi-view panoramic novel view synthesis, with better reconstruction quality and geometric accuracy than previous methods. Conclusion: The cylindrical Triplane representation and dual-branch design improve panoramic 3DGS under sparse views and occlusion and in geometric fidelity, providing an efficient, robust feed-forward solution for real-time panoramic novel view synthesis. Abstract: Feed-forward 3D Gaussian Splatting (3DGS) has shown great promise for real-time novel view synthesis, but its application to panoramic imagery remains challenging. Existing methods often rely on multi-view cost volumes for geometric refinement, which struggle to resolve occlusions in sparse-view scenarios. Furthermore, standard volumetric representations like Cartesian Triplanes are poor at capturing the inherent geometry of $360^\circ$ scenes, leading to distortion and aliasing. In this work, we introduce CylinderSplat, a feed-forward framework for panoramic 3DGS that addresses these limitations. The core of our method is a new cylindrical Triplane representation, which is better aligned with panoramic data and real-world structures adhering to the Manhattan-world assumption. We use a dual-branch architecture: a pixel-based branch reconstructs well-observed regions, while a volume-based branch leverages the cylindrical Triplane to complete occluded or sparsely-viewed areas. Our framework is designed to flexibly handle a variable number of input views, from single to multiple panoramas. Extensive experiments demonstrate that CylinderSplat achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in both reconstruction quality and geometric accuracy.
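
As a rough illustration of what a cylindrical (rather than Cartesian) triplane lookup involves, the sketch below converts a 3D point to cylindrical coordinates and lists the 2D coordinates at which three feature planes could be sampled. The choice of z as the vertical axis, the plane names, and the plane pairing are assumptions for illustration; the abstract does not describe the authors' actual parameterization.

```python
import math


def to_cylindrical(x, y, z):
    """Convert a Cartesian point to (radius, azimuth, height), assuming z is up."""
    radius = math.hypot(x, y)
    azimuth = math.atan2(y, x)  # angle around the vertical axis, in (-pi, pi]
    return radius, azimuth, z


def triplane_coords(x, y, z):
    """Return the 2D lookup coordinates on three hypothetical cylindrical planes."""
    radius, azimuth, height = to_cylindrical(x, y, z)
    return {
        "azimuth_height": (azimuth, height),  # unrolled cylinder surface
        "radius_height": (radius, height),    # vertical slice through the axis
        "radius_azimuth": (radius, azimuth),  # horizontal disc
    }


coords = triplane_coords(1.0, 1.0, 2.0)
```

In this parameterization a panorama's horizontal field of view maps to a single axis (the azimuth), which is the intuition behind preferring it to a Cartesian grid for 360° scenes.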

[83] PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

Xiang Zhang,Sohyun Yoo,Hongrui Wu,Chuan Li,Jianwen Xie,Zhuowen Tu

Main category: cs.CV

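TL;DR: PixARMesh is a new method that autoregressively generates complete 3D indoor scene meshes from a single RGB image, jointly predicting object layout and geometry without implicit representations or post-hoc optimization.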

Details

Motivation: Existing methods rely on implicit SDF representations and post-hoc layout optimization and struggle to produce coherent, directly usable 3D meshes; an end-to-end method that reconstructs high-quality explicit meshes in a single forward pass is needed. Method: Augments a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, and autoregressively models scene context, object pose, and mesh geometry in a unified token stream. Result: State-of-the-art reconstruction quality on synthetic and real-world datasets, producing lightweight, high-fidelity, artist-ready explicit triangle meshes. Conclusion: PixARMesh shows that autoregressive explicit mesh generation from a single image is feasible and efficient, offering a new paradigm for 3D scene understanding and content creation. Abstract: We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.

[84] InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation

Yuxin Qin,Ke Cao,Haowei Liu,Ao Ma,Fengheng Li,Honghe Zhu,Zheng Zhang,Run Ling,Wei Feng,Xuanhua He,Zhanjie Zhang,Zhen Guo,Haoyi Bian,Jingjing Lv,Junjie Shen,Ching Law

Main category: cs.CV

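TL;DR: This paper proposes InnoAds-Composer, a single-stage e-commerce poster generation framework with joint control over three conditions (subject, text, and style). Condition routing and a text feature enhancement module improve generation quality, and the authors build the first high-quality e-commerce poster dataset containing all three conditions.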

Details

Motivation: Existing diffusion-based e-commerce poster generation methods mostly use multi-stage pipelines, which suffer from poor subject fidelity, inaccurate text, and inconsistent style, and there is no single-stage solution that controls subject, text, and style simultaneously and efficiently. Method: Proposes the single-stage InnoAds-Composer framework: 1) introduces control tokens for three conditions (subject, glyph, and style); 2) routes conditions based on an importance analysis over layers and timesteps, shortening the active token sequence to reduce computation; 3) designs a Text Feature Enhancement Module (TFEM) that fuses glyph-image and glyph-crop features to improve Chinese text rendering accuracy; 4) builds the first high-quality e-commerce poster dataset and benchmark covering subject, text, and style conditions. Result: InnoAds-Composer significantly outperforms existing methods on multiple metrics, improving subject fidelity, text accuracy, and style consistency while keeping inference latency essentially unchanged. Conclusion: Single-stage joint control over three conditions is an effective paradigm for e-commerce poster generation; condition routing and TFEM balance efficiency and quality; and the dataset provides an important foundation for future research. Abstract: E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style. To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control tokens over subject, glyph, and style. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. Besides, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops. To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without obviously increasing inference latency.

[85] Mitigating Bias in Concept Bottleneck Models for Fair and Interpretable Image Classification

Schrasing Tong,Antoine Salaun,Vincent Yuan,Annabel Adeyeri,Lalana Kagal

Main category: cs.CV

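TL;DR: This paper proposes three bias mitigation techniques to improve the fairness of concept bottleneck models (CBMs) in image classification: a top-k concept filter, removal of biased concepts, and adversarial debiasing, which together improve the fairness-performance tradeoff.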

Details

Motivation: Concept bottleneck models (CBMs) aim to improve interpretability and fairness through human-interpretable concepts, but their concepts still leak information unrelated to concept semantics, and early results show only marginal reductions in gender bias. Method: Proposes three bias mitigation techniques: 1) a top-k concept filter to reduce information leakage; 2) identifying and removing biased concepts; 3) adversarial debiasing. Result: The proposed methods achieve better fairness-performance tradeoffs than prior work. Conclusion: Debiased concept bottleneck models are a significant step towards fair and interpretable image classification. Abstract: Ensuring fairness in image classification prevents models from perpetuating and amplifying bias. Concept bottleneck models (CBMs) map images to high-level, human-interpretable concepts before making predictions via a sparse, one-layer classifier. This structure enhances interpretability and, in theory, supports fairness by masking sensitive attribute proxies such as facial features. However, CBM concepts have been known to leak information unrelated to concept semantics, and early results reveal only marginal reductions in gender bias on datasets like ImSitu. We propose three bias mitigation techniques to improve fairness in CBMs: (1) decreasing information leakage using a top-k concept filter, (2) removing biased concepts, and (3) adversarial debiasing. Our results outperform prior work in terms of fairness-performance tradeoffs, indicating that our debiased CBM provides a significant step towards fair and interpretable image classification.
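
A top-k concept filter can be sketched in a few lines: keep the k highest concept activations for an image and zero out the rest before they reach the classifier. This is a generic illustration under the assumption that filtering is done per image on raw activation scores; the function name `top_k_filter` is hypothetical and the paper's exact filtering rule may differ.

```python
def top_k_filter(concept_scores, k):
    """Keep the k largest concept activations and set all others to 0.0."""
    # Indices of the k highest scores (ties are broken by original order).
    ranked = sorted(range(len(concept_scores)),
                    key=lambda i: concept_scores[i],
                    reverse=True)
    keep = set(ranked[:k])
    return [score if i in keep else 0.0
            for i, score in enumerate(concept_scores)]


filtered = top_k_filter([0.1, 0.9, 0.3, 0.7], k=2)
```

The intuition is that weakly activated concepts carry little semantic signal but can still encode proxies for sensitive attributes, so suppressing them limits what the downstream classifier can exploit.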

[86] CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection

Xuecheng Bai,Yuxiang Wang,Chuanzhi Xu,Boyu Hu,Kang Han,Ruijie Pan,Xiaowei Niu,Xiaotian Guan,Liqiang Fu,Pengfei Ye

Main category: cs.CV

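TL;DR: This paper proposes CollabOD, a lightweight collaborative detection framework that improves the stability and robustness of small object detection in UAV imagery through structural detail preservation, cross-path feature alignment, and localization-aware lightweight design.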

Details

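Motivation: Small object detection in UAV imagery is challenged by large scale variation, degraded structural detail, and limited compute; in high-altitude scenes in particular, fine-grained features are further weakened during downsampling and cross-scale fusion, leading to unstable localization and reduced robustness. Method: Proposes the CollabOD framework, comprising a structural detail preservation module, a cross-path feature alignment mechanism, and a localization-aware lightweight design, together with a unified detail-aware detection head; it optimizes conventional UAV perception architectures from the perspectives of image processing, channel structure, and lightweight design. Result: Improves representation stability and regression robustness for small object detection while keeping inference efficient and adding no deployment overhead. Conclusion: CollabOD stays lightweight while mitigating the detection difficulties caused by scale variation and detail loss for small objects in UAV imagery, offering a practical detection solution for resource-constrained platforms. Abstract: Small object detection in unmanned aerial vehicle (UAV) imagery is challenging, mainly due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness. To address this issue, we propose CollabOD, a lightweight collaborative detection framework that explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion. The framework integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies. From the perspectives of image processing, channel structure, and lightweight design, it optimizes the architecture of conventional UAV perception models. The proposed design enhances representation stability while maintaining efficient inference. A unified detail-aware detection head further improves regression robustness without introducing additional deployment overhead. The code is available at: https://github.com/Bai-Xuecheng/CollabOD.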

[87] Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D

Ping Chen,Zezhou Chen,Xingpeng Zhang,Yanlin Qian,Huan Hu,Xiang Liu,Zipeng Wang,Xin Wang,Zhaoxiang Liu,Kai Wang,Shiguo Lian

Main category: cs.CV

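TL;DR: This paper proposes a new paradigm for 2D-to-3D conversion, Artistic Disparity Synthesis, which prioritizes artistic expression over geometric accuracy, designs the dual-path framework Art3D, and evaluates it on professional 3D film data.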

Details

Motivation: Existing 2D-to-3D methods are geometrically accurate but lack artistic expression and cannot reproduce the immersion and emotional impact of professional 3D cinema; the root problem is that they mistake a director's deliberate artistic depth choices (such as zero-plane shifts and local depth sculpting) for noise or ambiguity. Method: Proposes the Art3D framework, whose dual-path architecture decouples global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision; also introduces a preliminary evaluation method for cinematic alignment. Result: Experiments indicate that Art3D shows potential in reproducing key local out-of-screen effects and in aligning with the global depth style of professional 3D films. Conclusion: Artistic disparity synthesis is a viable new direction for 2D-to-3D conversion, and Art3D lays the groundwork for artistically driven conversion tools. Abstract: Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because geometric reconstruction paradigms mistake deliberate artistic intent, such as strategic zero-plane shifts for pop-out effects and local depth sculpting, for data noise or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.

[88] Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image

Zidian Qiu,Ancong Wu

Main category: cs.CV

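TL;DR: This paper proposes Pano3DComposer, an efficient feed-forward framework for generating 3D scenes from a panoramic image. It decouples object generation from layout estimation and introduces an Object-World Transformation Predictor and a Coarse-to-Fine alignment mechanism, improving geometric accuracy and generation efficiency.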

Details

Motivation: Existing methods rely on time-consuming iterative layout optimization or inflexible joint object-layout generation, and are limited to narrow field-of-view perspective images, which makes complete 360-degree environments hard to generate. Method: Proposes the Pano3DComposer framework, which includes a plug-and-play Object-World Transformation Predictor (based on an adapted VGGT architecture that predicts the transformation from the target object crop, multi-view object renderings, and camera parameters) trained with pseudo-geometric supervision; for inputs from unseen domains, a Coarse-to-Fine alignment mechanism iteratively refines geometric consistency using scene-rendering feedback. Result: Superior geometric accuracy on image/text-to-3D tasks on synthetic and real-world datasets, generating a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. Conclusion: Pano3DComposer addresses efficiency, flexibility, and geometric accuracy in panoramic 3D scene generation and offers a new paradigm for wide field-of-view 3D content creation. Abstract: Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360-degree environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback of scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. Project page: https://qiuzidian.github.io/pano3dcomposer-page/.

[89] CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning

Yuxin Xie,Yuming Chen,Yishan Yang,Yi Zhou,Tao Zhou,Zhen Zhao,Jiacheng Liu,Huazhu Fu

Main category: cs.CV

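TL;DR: This paper introduces the ComLesion-14K dataset and the CORE-Seg framework, which performs reasoning-driven segmentation of complex lesions using a semantic-guided prompt adapter and a progressive reinforcement learning training strategy, and significantly outperforms existing methods in Dice score.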

Details

Motivation: General-purpose multimodal large language models lack the visual reasoning needed for medical lesions, while traditional segmentation models lack logical interpretability; a new paradigm that integrates reasoning with segmentation is needed. Method: Builds ComLesion-14K, the first Chain-of-Thought (CoT) benchmark for complex lesion segmentation; proposes CORE-Seg, an end-to-end framework with a Semantic-Guided Prompt Adapter, trained with a progressive strategy from supervised fine-tuning (SFT) to GRPO with an adaptive dual-granularity reward mechanism. Result: Achieves a mean Dice of 37.06% on complex lesion segmentation, 14.89% higher than the second-best baseline, and reduces the failure rate to 18.42%. Conclusion: Tightly integrating reasoning with segmentation can significantly improve both performance and interpretability in complex medical image segmentation, opening a new path for cognitive medical image analysis. Abstract: Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: https://xyxl024.github.io/CORE-Seg.github.io/
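
For readers unfamiliar with the reported metric, the Dice coefficient of a predicted mask P and a ground-truth mask G is 2|P ∩ G| / (|P| + |G|). The sketch below computes it for flattened binary masks; treating two empty masks as a perfect match is a common convention assumed here, not something stated in the abstract.

```python
def dice_score(pred, truth):
    """Dice coefficient for two equally sized, flattened binary masks (0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


score = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

A mean Dice of 37.06% is the average of this per-case score over the benchmark, which indicates how hard the complex-lesion setting is even for the best reported method.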

[90] BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

Feiran Li,Qianqian Xu,Shilong Bao,Zhiyong Yang,Xilin Zhao,Xiaochun Cao,Qingming Huang

Main category: cs.CV

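TL;DR: This paper proposes BlackMirror, a framework for detecting backdoor attacks in text-to-image models under black-box settings. Its two components, MirrorMatch and MirrorVerify, identify and verify semantic deviations; the framework is training-free and general.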

Details

Motivation: Existing methods rely on image-level similarity analysis and struggle with newer backdoor attacks whose generated images are visually diverse; the authors observe that backdoor attacks stably manipulate only some semantic patterns, while the rest of the content stays diverse or benign. Method: BlackMirror has two components: MirrorMatch aligns visual patterns with the instructions to detect semantic deviations, and MirrorVerify evaluates how stable those deviations are across varied prompts to distinguish true backdoor behavior from benign responses. The framework is training-free and plug-and-play. Result: BlackMirror achieves accurate detection across a wide range of backdoor attacks, and experiments support its broad applicability and effectiveness. Conclusion: BlackMirror is a general, training-free black-box backdoor detection framework suited to Model-as-a-Service scenarios, and it improves detection of visually diverse backdoor attacks. Abstract: This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
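
The stability idea behind the verification step can be illustrated with a toy check: given, for each prompt, the set of elements that appeared in the generated image without being requested, flag the elements that recur in most prompts. This is only a simplified stand-in; the function name, the set-of-strings input, and the 0.75 threshold are assumptions, and the abstract does not describe how MirrorVerify actually measures stability.

```python
from collections import Counter


def stable_deviations(deviations_per_prompt, min_fraction=0.75):
    """Return deviations that recur in at least min_fraction of the prompts.

    deviations_per_prompt: one collection of unrequested elements per prompt
    (hypothetical output of an earlier instruction-vs-image matching step).
    """
    num_prompts = len(deviations_per_prompt)
    if num_prompts == 0:
        return []
    # Count each deviation at most once per prompt.
    counts = Counter(item
                     for deviations in deviations_per_prompt
                     for item in set(deviations))
    return sorted(item for item, count in counts.items()
                  if count / num_prompts >= min_fraction)


observed = [{"cat"}, {"cat", "hat"}, {"cat"}, {"cat", "tree"}, set()]
suspicious = stable_deviations(observed)
```

A benign model produces occasional, uncorrelated deviations ("hat", "tree"), whereas a backdoored one keeps injecting the same pattern regardless of the prompt, which is what the recurrence threshold is meant to separate.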

[91] RAC: Rectified Flow Auto Coder

Sen Fang,Yalin Feng,Yanxin Zhang,Dimitris N. Metaxas

Main category: cs.CV

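TL;DR: This paper proposes the Rectified Flow Auto Coder (RAC), inspired by Rectified Flow, as a replacement for the traditional VAE. Multi-step decoding, bidirectional inference, and generative decoding improve reconstruction and generation quality while substantially reducing computational cost.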

Details

Motivation: To close the gap between reconstruction and generation in traditional VAEs while reducing parameter count and computational cost. Method: Proposes the Rectified Flow Auto Coder (RAC), which performs multi-step decoding over flow timesteps along a straight, correctable decoding path; through time reversal the decoder also serves as the encoder, giving bidirectional inference. Result: RAC surpasses state-of-the-art VAEs in both reconstruction and generation, with approximately 70% lower computational cost and nearly 41% fewer parameters. Conclusion: RAC is an efficient, high-quality generative architecture that combines low computational overhead with strong generation capability and offers a new direction for VAE-style models. Abstract: In this paper, we propose a Rectified Flow Auto Coder (RAC) inspired by Rectified Flow to replace the traditional VAE: 1. It achieves multi-step decoding by applying the decoder to flow timesteps. Its decoding path is straight and correctable, enabling step-by-step refinement. 2. The model inherently supports bidirectional inference, where the decoder serves as the encoder through time reversal (hence Coder rather than encoder or decoder), reducing parameter count by nearly 41%. 3. This generative decoding method improves generation quality since the model can correct latent variables along the path, partially addressing the reconstruction-generation gap. Experiments show that RAC surpasses SOTA VAEs in both reconstruction and generation with approximately 70% lower computational cost.
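
The two properties named in the abstract, multi-step decoding along a straight path and encoding by time reversal, can be illustrated with plain Euler integration of a velocity field. In the sketch below the velocity is a toy oracle (the constant difference between a fixed target and a fixed latent) standing in for a trained network; it is an assumption for illustration only and not the RAC model.

```python
def euler_integrate(state, velocity, t_start, t_end, steps):
    """Integrate d(state)/dt = velocity(state, t) from t_start to t_end."""
    dt = (t_end - t_start) / steps  # negative when integrating backwards
    t = t_start
    for _ in range(steps):
        v = velocity(state, t)
        state = [s + dt * v_i for s, v_i in zip(state, v)]
        t += dt
    return state


latent = [0.5, -1.0, 2.0]
target = [1.0, 0.0, -1.0]


def toy_velocity(state, t):
    # Stand-in for a learned network: on a perfectly straight path the
    # velocity is constant and equal to (target - latent).
    return [b - a for a, b in zip(latent, target)]


# "Decoding": move from the latent (t = 0) to the data point (t = 1).
decoded = euler_integrate(latent, toy_velocity, 0.0, 1.0, steps=4)
# "Encoding": run the same field backwards in time, from t = 1 to t = 0.
recovered = euler_integrate(decoded, toy_velocity, 1.0, 0.0, steps=4)
```

Because the path is straight, a handful of Euler steps is enough, and the same function serves both directions; with a learned, imperfect velocity field each step would also be an opportunity to correct drift, which is the refinement the abstract refers to.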

[92] Towards Driver Behavior Understanding: Weakly-Supervised Risk Perception in Driving Scenes

Nakul Agarwal,Yi-Ting Chen,Behzad Dariush

Main category: cs.CV

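TL;DR: This paper introduces the RAID dataset and a weakly supervised risk object identification framework for studying driver risk perception and contextual risk assessment; experiments show gains of 20.6% and 23.1% over the prior state of the art on the RAID and HDDS datasets, respectively.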

Details

Motivation: Zero-collision mobility is a key goal for intelligent vehicle systems and requires understanding driver risk perception, a complex cognitive process shaped by the driver's voluntary response to external stimuli and by how attentive surrounding road users are to the ego-vehicle. Method: Builds RAID, a large-scale dataset of 4,691 annotated video clips, and proposes a weakly supervised risk object identification framework that models the relationship between the driver's intended maneuver and responses to identify potential risk sources; also analyzes the role of pedestrian attention in risk estimation. Result: The method improves on the prior state of the art by 20.6% on RAID and 23.1% on HDDS. Conclusion: RAID is a valuable resource for research on driver risk perception, the weakly supervised framework improves risk identification, and contextual factors such as pedestrian attention are shown to matter for risk assessment. Abstract: Achieving zero-collision mobility remains a key objective for intelligent vehicle systems, which requires understanding driver risk perception, a complex cognitive process shaped by the voluntary response of the driver to external stimuli and the attentiveness of surrounding road users towards the ego-vehicle. To support progress in this area, we introduce RAID (Risk Assessment In Driving scenes), a large-scale dataset specifically curated for research on driver risk perception and contextual risk assessment. RAID comprises 4,691 annotated video clips, covering diverse traffic scenarios with labels for the driver's intended maneuver, road topology, risk situations (e.g., crossing pedestrians), driver responses, and pedestrian attentiveness. Leveraging RAID, we propose a weakly supervised risk object identification framework that models the relationship between the driver's intended maneuver and responses to identify potential risk sources. Additionally, we analyze the role of pedestrian attention in estimating risk and demonstrate the value of the proposed dataset. Experimental evaluations demonstrate that our method achieves 20.6% and 23.1% performance gains over prior state-of-the-art approaches on the RAID and HDDS datasets, respectively.

[93] Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

Hongwei Fang,Jiahang Cai,Xun Wang,Wenwu Yang

Main category: cs.CV

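TL;DR: This paper proposes TAR-ViTPose, a video-based 2D human pose estimation method that introduces joint-centric temporal aggregation (JTA) and global restoring attention (GRA) to fuse temporal information across frames without compromising the ViT's global modeling capability, improving the stability and accuracy of pose estimation.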

Details

Motivation: Existing ViT-based pose estimators process single frames and ignore temporal coherence in video, which makes predictions unstable in challenging scenes with motion blur, occlusion, or defocus. Method: Proposes the TAR-ViTPose framework with two core modules: (1) joint-centric temporal aggregation (JTA), which assigns each joint a learnable query token that selectively aggregates features from the corresponding regions across frames; (2) global restoring attention (GRA), which restores the aggregated temporal features into the token sequence of the current frame, enriching the representation while preserving global context. Result: Improves on the single-frame ViTPose baseline by +2.3 mAP on PoseTrack2017, outperforms existing state-of-the-art video pose estimation methods, and runs at a higher frame rate. Conclusion: TAR-ViTPose adds temporal modeling to static ViT models in a plug-and-play manner and shows that explicitly modeling joint-level temporal consistency is effective and practical for video pose estimation. Abstract: Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. 
Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Project page: https://github.com/zgspose/TARViTPose.
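
The core operation behind a per-joint query that "selectively attends" to tokens from neighboring frames is scaled dot-product attention with a single query. The sketch below shows that operation only; the token layout, the dimensions, and the function name `attend` are assumptions, and it is not the JTA module itself.

```python
import math


def attend(query, keys, values):
    """Single-query scaled dot-product attention.

    query: feature vector of one joint query token.
    keys, values: one vector per candidate token gathered from several frames.
    Returns the weighted sum of values and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the candidate tokens.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * value[i] for w, value in zip(weights, values))
              for i in range(dim)]
    return output, weights


# Three tokens (for example, the same joint region in three frames).
output, weights = attend([1.0, 0.0],
                         keys=[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
                         values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

With identical keys the weights are uniform and the output is the mean of the values; in a trained model the keys differ, so a blurred or occluded frame can receive a small weight while sharper frames dominate.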

[94] FTSplat: Feed-forward Triangle Splatting Network

Xiong Jinlin,Li Can,Shen Jiawei,Qi Zhigang,Sun Lei,Zhao Dongyang

Main category: cs.CV

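TL;DR: This paper proposes a feed-forward framework for triangle primitive generation that predicts continuous triangle surfaces directly from multi-view images, producing simulation-ready 3D models in a single forward pass and combining efficiency with explicit geometry.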

Details

Motivation: NeRF and 3DGS render at high quality but require per-scene optimization, which hinders real-time deployment, while emerging feed-forward Gaussian splatting methods lack the explicit manifold geometry needed for direct simulation. Method: Proposes a feed-forward framework for triangle primitive generation with a pixel-aligned triangle generation module, and adds relative 3D point cloud supervision to make geometric learning more stable and consistent. Result: Experiments show efficient reconstruction together with seamless compatibility with standard graphics and robotic simulators. Conclusion: The method achieves simulation-ready 3D reconstruction without per-scene optimization or post-processing, bridging efficient feed-forward modeling and explicit, simulatable geometry. Abstract: High-fidelity three-dimensional (3D) reconstruction is essential for robotics and simulation. While Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) achieve impressive rendering quality, their reliance on time-consuming per-scene optimization limits real-time deployment. Emerging feed-forward Gaussian splatting methods improve efficiency but often lack explicit, manifold geometry required for direct simulation. To address these limitations, we propose a feed-forward framework for triangle primitive generation that directly predicts continuous triangle surfaces from calibrated multi-view images. Our method produces simulation-ready models in a single forward pass, obviating the need for per-scene optimization or post-processing. We introduce a pixel-aligned triangle generation module and incorporate relative 3D point cloud supervision to enhance geometric learning stability and consistency. Experiments demonstrate that our method achieves efficient reconstruction while maintaining seamless compatibility with standard graphics and robotic simulators.

[95] OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

Kota Shimomura,Masaki Nambata,Atsuya Ishikawa,Ryota Mimura,Takayuki Kawabuchi,Takayoshi Yamashita,Koki Inoue

Main category: cs.CV

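TL;DR: This paper proposes OD-RASE, a framework that combines a road-traffic domain ontology with a large vision-language model (LVLM) to automatically identify accident-prone road structures and generate infrastructure improvement proposals together with images of the improved road, strengthening proactive safety for autonomous driving systems.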

Details

Motivation: Autonomous driving systems remain limited in rare situations and on complex road structures; road safety improvements today are mostly reactive, made after accidents, whereas autonomous driving needs proactive risk mitigation. Method: Formalizes a road-traffic domain ontology; generates improvement proposals with an LVLM and uses ontology-driven data filtering to make them more reliable; automatically annotates pre-accident road images to build a new dataset; and designs the OD-RASE baseline model, which combines an LVLM with a diffusion model to produce improvement proposals and images of the improved road. Result: Experiments show that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans; a new annotated dataset is built, and the generated proposals and images are shown to be plausible. Conclusion: OD-RASE offers a new way to connect the perception limitations of autonomous driving with road infrastructure improvement, contributing to overall traffic safety and to the broader adoption of autonomous driving. Abstract: Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Such road infrastructures are designed for human drivers, and safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce the Baseline approach (OD-RASE model), which leverages LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.

[96] Facial Expression Recognition Using Residual Masking Network

Luan Pham,The Huynh Vu,Tuan Anh Tran

Main category: cs.CV

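TL;DR: This paper proposes the Residual Masking Network, which combines an attention mechanism with a masking idea: a segmentation network refines feature maps to improve CNN performance on facial expression recognition (FER), reaching state-of-the-art accuracy on the FER2013 and VEMO datasets.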

Details

Motivation: To improve automatic facial expression recognition (FER), in particular the model's ability to focus on key facial regions against complex backgrounds. Method: Proposes a new masking method that uses a segmentation network (with a U-Net-like structure) to refine the feature maps of a CNN (specifically a Deep Residual Network), guiding the network to focus on the information most relevant to distinguishing expressions. Result: State-of-the-art accuracy on FER2013 and the private VEMO dataset. Conclusion: Masking combined with segmentation-guided attention improves the discriminative power of deep CNNs on FER, and the proposed Residual Masking Network generalizes well and is practical. Abstract: Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improve FER tasks, this paper focuses on deep architecture with the attention mechanism. We propose a novel masking idea to boost the performance of CNNs on the facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and a U-Net-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets. The source code is available at https://github.com/phamquiluan/ResidualMaskingNetwork.

[97] SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration

Peng Shurui,Xin Lin,Shi Luo,Jincen Ou,Dizhe Zhang,Lu Qi,Truong Nguyen,Chao Ren

Main category: cs.CV

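TL;DR: This paper proposes the SLER-IR framework, which uses spherical layer-wise expert routing, a spherical uniform degradation embedding, and a global-local granularity fusion module to improve unified modeling for multi-degradation image restoration, outperforming existing methods on multi-task benchmarks.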

Details

Motivation: Existing unified image restoration frameworks suffer from feature interference and insufficient expert specialization, making it hard to handle diverse degradations. Method: Propose SLER-IR: 1) a spherical layer-wise expert routing mechanism; 2) a contrastive-learning-based spherical uniform degradation embedding that eliminates geometry bias; 3) a Global-Local Granularity Fusion (GLGF) module that jointly models global semantics and local degradation cues. Result: On three-task and five-task benchmarks, PSNR and SSIM are consistently better than those of state-of-the-art methods. Conclusion: Through geometry-aware degradation modeling and layer-wise expert cooperation, SLER-IR mitigates feature interference and granularity mismatch in unified image restoration, improving generalization and accuracy. Abstract: Image restoration under diverse degradations remains challenging for unified all-in-one frameworks due to feature interference and insufficient expert specialization. We propose SLER-IR, a spherical layer-wise expert routing framework that dynamically activates specialized experts across network layers. To ensure reliable routing, we introduce a Spherical Uniform Degradation Embedding with contrastive learning, which maps degradation representations onto a hypersphere to eliminate geometry bias in linear embedding spaces. In addition, a Global-Local Granularity Fusion (GLGF) module integrates global semantics and local degradation cues to address spatially non-uniform degradations and the train-test granularity gap. Experiments on three-task and five-task benchmarks demonstrate that SLER-IR achieves consistent improvements over state-of-the-art methods in both PSNR and SSIM. Code and models will be publicly released.

[98] Adaptive Radial Projection on Fourier Magnitude Spectrum for Document Image Skew Estimation

Luan Pham,Phu Hao Hoang,Xuan Toan Mai,Tuan Anh Tran

Main category: cs.CV

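TL;DR: This paper proposes a new document skew angle estimation algorithm based on adaptive radial projection over the 2D discrete Fourier magnitude spectrum, and builds a high-quality dataset, DISE-2021, for evaluation; experiments show that the method outperforms existing methods.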

Details

Motivation: Skew estimation is a key task in document processing systems, especially for scanned document images, because its performance directly affects subsequent processing steps; with the growth of digitization, the problem has attracted wide attention. Method: Propose a new method that applies an adaptive radial projection to the 2D discrete Fourier magnitude spectrum to extract the dominant skew angle of a document image; also build the DISE-2021 dataset and provide a comprehensive analysis of multiple improvement aspects of Fourier-based methods. Result: The proposed method is robust and reliable and outperforms all compared methods across multiple metrics. Conclusion: The skew estimation algorithm based on the Fourier transform and adaptive radial projection improves accuracy and stability; the dataset and the open-source code (on GitHub) support follow-up research. Abstract: Skew estimation is one of the vital tasks in document processing systems, especially for scanned document images, because its performance impacts subsequent steps directly. Over the years, a large body of research has focused on this challenging problem with the rise of the digitization age. In this research, we first propose a novel skew estimation method that extracts the dominant skew angle of the given document image by applying an Adaptive Radial Projection on the 2D Discrete Fourier Magnitude spectrum. Second, we introduce a high-quality skew estimation dataset, DISE-2021, to assess the performance of different estimators. Finally, we provide comprehensive analyses that focus on multiple improvement aspects of Fourier-based methods. Our results show that the proposed method is robust, reliable, and outperforms all compared methods. The source code is available at https://github.com/phamquiluan/jdeskew.
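
To make the idea of a radial projection on the Fourier magnitude spectrum concrete, here is a minimal sketch. It is a plain, non-adaptive projection written for illustration only: the function name `dominant_spectral_angle`, the nearest-neighbour sampling, and the fixed angle step are assumptions, and the adaptive weighting of the actual method is not described in the abstract and is not reproduced here.

```python
import numpy as np

def dominant_spectral_angle(image, angle_step=0.5):
    """Return the angle (degrees, in [0, 180)) of the ray through the
    spectrum centre that collects the most Fourier magnitude."""
    h, w = image.shape
    # Remove the mean so the DC term does not dominate, then centre the spectrum.
    mag = np.abs(np.fft.fftshift(np.fft.fft2(image - image.mean())))
    cy, cx = h // 2, w // 2
    radii = np.arange(1, min(cy, cx))  # skip the centre bin, stay in bounds
    best_angle, best_energy = 0.0, -1.0
    for angle in np.arange(0.0, 180.0, angle_step):
        t = np.deg2rad(angle)
        # Nearest-neighbour samples along the ray at this angle.
        xs = np.rint(cx + radii * np.cos(t)).astype(int)
        ys = np.rint(cy + radii * np.sin(t)).astype(int)
        energy = mag[ys, xs].sum()
        if energy > best_energy:
            best_angle, best_energy = float(angle), float(energy)
    return best_angle

# Synthetic stripes whose wave vector is (3, 20) in (x, y) frequency bins.
n = 128
yy, xx = np.mgrid[0:n, 0:n]
stripes = np.cos(2 * np.pi * (3 * xx + 20 * yy) / n)
print(dominant_spectral_angle(stripes))
```

The returned value is the direction of dominant spectral energy; how it maps to a document skew angle (for example, an offset of 90 degrees for horizontal text lines) depends on the chosen convention.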

[99] LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution

Song Fei,Tian Ye,Sixiang Chen,Zhaohu Xing,Jianyu Lai,Lei Zhu

Main category: cs.CV

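TL;DR: This paper proposes LucidNFT, a multi-reward reinforcement learning framework for real-world image super-resolution (Real-ISR). By introducing a degradation-robust semantic evaluator (LucidConsistency), a decoupled advantage normalization strategy, and a large-scale real-degradation dataset (LucidLR), it addresses the difficulty of measuring LR-anchored faithfulness, advantage collapse, and poor preference-signal quality in existing methods, and markedly improves the balance between perceptual quality and faithfulness.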

Details

Motivation: Real-world image super-resolution (Real-ISR) suffers from semantic and structural hallucination: outputs may be sharp yet contradict the evidence in the low-resolution input. Without high-resolution ground truth, LR-anchored faithfulness is hard to assess, and conventional reinforcement learning on this task is limited by poor degradation robustness, advantage collapse, and insufficient coverage of real degradations. Method: Propose the LucidNFT framework: 1) LucidConsistency, a degradation-robust, LR-referenced semantic consistency evaluator; 2) a decoupled advantage normalization strategy that avoids the advantage collapse caused by normalizing fused multi-objective rewards within a rollout group; 3) LucidLR, a large-scale dataset of real degraded images that makes RL fine-tuning more robust. The framework is applied to preference-based RL fine-tuning of flow-matching Real-ISR models. Result: LucidNFT consistently improves several strong flow-matching Real-ISR baselines, achieves a better trade-off between perceptual quality and LR faithfulness, and shows more stable optimization dynamics across diverse real-world scenarios. Conclusion: Through optimizable LR-anchored faithfulness modeling, improved multi-reward advantage estimation, and high-quality real-degradation data, LucidNFT provides a systematic solution for preference-based RL in Real-ISR, improving generative faithfulness and visual quality together. Abstract: Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence (semantic and structural hallucination), while such LR-anchored faithfulness is difficult to assess without HR ground truth. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidates to compare. However, effective alignment in Real-ISR is hindered by (i) the lack of a degradation-robust LR-referenced faithfulness signal, and (ii) a rollout-group optimization bottleneck where naive multi-reward scalarization followed by normalization compresses objective-wise contrasts, causing advantage collapse and weakening the reward-weighted updates in DiffusionNFT-style forward fine-tuning. Moreover, (iii) limited coverage of real degradations restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable; a decoupled advantage normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse; and LucidLR, a large-scale collection of real-world degraded images to support robust RL fine-tuning. Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
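
The difference between scalarize-then-normalize and normalize-then-fuse can be illustrated with a toy rollout group. The sketch below is an interpretation of the abstract, not the paper's exact formulation: the function names, the z-score normalization, and the equal weights are assumptions.

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Standardize along the rollout axis (axis 0)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def naive_advantage(rewards, weights):
    """Fuse the rewards first, then normalize the fused scalar."""
    return zscore(rewards @ weights)

def decoupled_advantage(rewards, weights):
    """Normalize each reward within the group first, then fuse."""
    return zscore(rewards) @ weights

# Four rollouts for one LR input, two rewards on very different scales.
# Column 0 (hypothetical faithfulness score) has tiny spread,
# column 1 (hypothetical perceptual score) has large spread.
rewards = np.array([[0.70, 10.0],
                    [0.71, 50.0],
                    [0.72, 30.0],
                    [0.73, 20.0]])
weights = np.array([0.5, 0.5])

print(naive_advantage(rewards, weights))
print(decoupled_advantage(rewards, weights))
```

In the naive variant the large-scale reward dictates the ranking, so the rollout with the best small-scale reward still receives a negative advantage; normalizing per objective before fusion keeps that contrast visible.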

[100] Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

Jialuo He,Huangxun Chen

Main category: cs.CV

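TL;DR: E-AdaPrune is an energy-driven adaptive visual token pruning method that dynamically allocates the token budget according to the singular value spectrum of the visual features, improving VLM inference efficiency and performance.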

Details

Motivation: Existing visual token compression methods use a fixed budget and ignore differences in image information density, leading to a poor efficiency-accuracy trade-off. Method: Propose the E-AdaPrune framework, which retains a specified proportion of spectral energy based on the singular value spectrum of the visual features and thereby decides how many tokens each image needs; randomized SVD accelerates the computation, and no additional learnable parameters are introduced. Result: Validated on 9 benchmarks and 3 LLaVA-family VLMs, with an average improvement of up to 0.6% and a 5.1% relative gain on the MMVet reasoning task; the additional latency is only 8 ms per image. Conclusion: E-AdaPrune provides parameter-free, efficient, adaptive visual token compression that improves multi-task performance while keeping latency low, offering a new approach to lightweight VLMs. Abstract: Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
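
A spectral-energy budget of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions: "energy" is taken to be the sum of squared singular values, the budget is the smallest rank that retains the requested share, and an exact SVD replaces the randomized one for brevity. The function name `energy_token_budget` and its parameters are hypothetical and do not come from the paper.

```python
import numpy as np

def energy_token_budget(features, energy_ratio=0.95, min_tokens=1):
    """Smallest k such that the top-k singular values of `features`
    (shape: num_tokens x dim) keep `energy_ratio` of the spectral energy."""
    s = np.linalg.svd(features, compute_uv=False)  # sorted in descending order
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:
        return min_tokens  # degenerate all-zero input
    cumulative = np.cumsum(energy) / total
    # First index whose cumulative share reaches the target, as a count.
    k = int(np.searchsorted(cumulative, energy_ratio)) + 1
    return max(min_tokens, min(k, len(s)))

# A redundant (rank-1) feature matrix needs a far smaller budget
# than a matrix of independent random features.
redundant = np.outer(np.ones(16), np.arange(1.0, 9.0))
dense = np.random.default_rng(0).normal(size=(16, 8))
print(energy_token_budget(redundant), energy_token_budget(dense))
```

How the budget is then used to select which tokens survive is a separate step that the abstract does not specify, so it is left out here.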

[101] Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation

Hongli Liu,Yu Wang,Shengjie Zhao

Main category: cs.CV

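TL;DR: This paper proposes the VINE framework, which models structural consistency and foreground discrimination through a spatial-view graph and combines SAM and ResNet features to improve few-shot segmentation under viewpoint changes and complex structures.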

Details

Motivation: Few-shot segmentation (FSS) suffers from structural misalignment and cross-view inconsistency under large appearance or viewpoint variations, so structural robustness and foreground discrimination need to be improved. Method: Propose the VINE framework: build a spatial-view graph to model local geometry and view-invariant semantics; derive a discriminative prior from the support-query feature discrepancy to reweight SAM features and recalibrate backbone activations; fuse the foreground-enhanced SAM features and structurally enriched ResNet features through masked cross-attention, producing class-consistent prototypes that serve as adaptive prompts for the SAM decoder. Result: Effectiveness and robustness are validated on multiple FSS benchmarks, especially in scenarios with viewpoint shifts and complex structures. Conclusion: By jointly modeling structural consistency and foreground discrimination, VINE markedly improves the generalization and accuracy of few-shot segmentation and offers a new way to handle viewpoint changes and structural ambiguity. Abstract: Few-shot segmentation (FSS) has gained significant attention for its ability to generalize to novel classes with limited supervision, yet remains challenged by structural misalignment and cross-view inconsistency under large appearance or viewpoint variations. This paper tackles these challenges by introducing VINE (View-Informed NEtwork), a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes. Specifically, VINE introduces a spatial-view graph on backbone features, where the spatial graph captures local geometric topology and the view graph connects features from different perspectives to propagate view-invariant structural semantics. To further alleviate foreground ambiguity, we derive a discriminative prior from the support-query feature discrepancy to capture category-specific contrast, which reweights SAM features by emphasizing salient regions and recalibrates backbone activations for improved structural focus. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes used as adaptive prompts for the SAM decoder to generate accurate masks. Extensive experiments on multiple FSS benchmarks validate the effectiveness and robustness of VINE, particularly under challenging scenarios with viewpoint shifts and complex structures. The code is available at https://github.com/HongliLiu1/VINE-main.

[102] OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

Si-Yu Lu,Po-Ting Chen,Hui-Che Hsu,Sin-Ye Jhong,Wen-Huang Cheng,Yung-Yao Chen

Main category: cs.CV

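TL;DR: This paper proposes the OVGGT framework, which uses Self-Selective Caching and Dynamic Anchor Protection to achieve efficient 3D geometry reconstruction from video streams of arbitrary length without growing memory or compute cost.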

Details

Motivation: Existing geometric foundation models incur high compute cost from all-to-all attention and struggle with long sequences; causal-attention variants support single-pass streaming inference, but their KV cache keeps growing and quickly exhausts GPU memory, which rules out long-horizon deployment. Method: Propose the training-free OVGGT framework with two core techniques: 1) Self-Selective Caching, which uses FFN residual magnitudes to selectively compress the KV cache while remaining compatible with FlashAttention; 2) Dynamic Anchor Protection, which dynamically shields coordinate-critical tokens from eviction to suppress geometric drift over long trajectories. Result: On indoor, outdoor, and ultra-long sequence benchmarks, OVGGT processes videos of arbitrary length within constant VRAM while achieving state-of-the-art 3D geometric accuracy. Conclusion: OVGGT addresses the growth of memory and compute with sequence length in streaming 3D reconstruction, offering a feasible approach to real-time geometric modeling in long-horizon, resource-constrained settings. Abstract: Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
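
The general shape of a fixed-budget cache with protected entries can be sketched as follows. This is a toy illustration only: the class names, the scalar `score` (standing in for an importance signal such as an FFN residual magnitude), and the evict-the-lowest-score rule are assumptions, not the OVGGT implementation.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    token_id: int
    score: float             # hypothetical importance score
    protected: bool = False  # anchor tokens are never evicted

class BoundedKVCache:
    """Keeps at most `budget` entries, whatever the stream length."""

    def __init__(self, budget):
        self.budget = budget
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)
        while len(self.entries) > self.budget:
            candidates = [e for e in self.entries if not e.protected]
            if not candidates:
                raise RuntimeError("budget is too small for the protected tokens")
            # Evict the least important unprotected entry.
            victim = min(candidates, key=lambda e: e.score)
            self.entries.remove(victim)

cache = BoundedKVCache(budget=3)
cache.add(CacheEntry(token_id=0, score=0.0, protected=True))  # anchor
for token_id, score in enumerate([0.5, 0.9, 0.1, 0.7, 0.3], start=1):
    cache.add(CacheEntry(token_id=token_id, score=score))
print([e.token_id for e in cache.entries])
```

The anchor survives despite having the lowest score, and the cache size never exceeds the budget, which is the property that keeps memory constant as the stream grows.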

[103] Exploring Open-Vocabulary Object Recognition in Images using CLIP

Wei Yu Chen,Ying Dai

Main category: cs.CV

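TL;DR: This paper proposes a training-free, two-stage open-vocabulary object recognition (OVOR) framework that segments first and then recognizes, combining CLIP with CNN/MLP feature alignment and SVD-based dimensionality reduction, and achieves SOTA performance on multiple datasets.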

Details

Motivation: Existing open-vocabulary object recognition methods suffer from high system complexity, high training cost, and limited generalization. Method: Adopt a two-stage strategy of object segmentation followed by recognition; use CLIP to generate image and text embeddings; introduce a CNN/MLP to extract visual features and align them with text embeddings; build a shared representation space with SVD; finally perform recognition by embedding similarity. Result: Experiments on COCO, Pascal VOC, and ADE20K show that the training-free, CLIP-only encoding without SVD achieves the highest average AP, surpassing current SOTA methods; the results also confirm the potential of CNN/MLP encoding for OVOR. Conclusion: The framework markedly reduces dependence on training and annotation cost and improves the flexibility and generalization of open-vocabulary recognition; combining CLIP with a lightweight CNN/MLP is an effective path. Abstract: To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and processed via Singular Value Decomposition (SVD) to construct a shared representation space. Finally, recognition is performed through embedding similarity matching. Experiments on COCO, Pascal VOC, and ADE20K demonstrate that training-free, CLIP-based encoding without SVD achieves the highest average AP, outperforming current state-of-the-art methods. Simultaneously, the results highlight the potential of CNN/MLP-based image encoding for OVOR.
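
The final recognition step, embedding similarity matching, can be illustrated with toy vectors. In the sketch below the embeddings are hand-written placeholders standing in for CLIP outputs, and the function names are hypothetical; cosine similarity with an argmax is assumed as the matching rule.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def match_labels(image_embs, text_embs, labels):
    """Assign each object embedding the label of its most similar text embedding."""
    sims = l2_normalize(image_embs) @ l2_normalize(text_embs).T  # cosine similarity
    best = sims.argmax(axis=1)
    return [labels[i] for i in best], sims

# Placeholder embeddings: 3 category prompts and 2 cropped object regions.
labels = ["cat", "dog", "bus"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_embs = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.2, 2.0]])

predicted, sims = match_labels(image_embs, text_embs, labels)
print(predicted)
```

Because the vocabulary only enters through the text embeddings, changing the label list requires no retraining in this setup, which is the property that makes the approach open-vocabulary.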

[104] Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

Siyuan Yang,Jun Liu,Hao Cheng,Chong Wang,Shijian Lu,Hedvig Kjellstrom,Weisi Lin,Alex C. Kot

Main category: cs.CV

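TL;DR: This paper proposes a new representation, Skeleton-to-Image Encoding (S2I), which converts 3D human skeleton sequences into an image-like format, enabling for the first time the use of large-scale vision-pretrained models for self-supervised skeleton representation learning. The method handles heterogeneous skeleton data in a unified way and shows strong performance and generalization on multiple benchmarks.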

Details

Motivation: Large-scale vision-pretrained models are hard to apply directly to 3D skeleton data because the data formats differ greatly; large-scale skeleton datasets are also scarce, and multi-modal action recognition should avoid introducing additional model branches. Method: Propose Skeleton-to-Image Encoding (S2I), which partitions and arranges joints by body-part semantics and resizes the result to standard image dimensions, turning skeleton sequences into an image-like representation suitable for self-supervised learning with vision-pretrained models. Result: Effectiveness and generalizability are validated on NTU-60, NTU-120, and PKU-MMD, notably under cross-format evaluation settings. Conclusion: S2I provides a unified, image-like skeleton representation that bridges vision-pretrained knowledge and skeleton analysis, addressing skeleton data heterogeneity and model reuse. Abstract: Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.

[105] CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection

Jinyeong Park,Donghwa Kim,Brent ByungHoon Kang,Hyeongboo Baek,Jibum Kim

Main category: cs.CV

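TL;DR: This paper proposes a new framework, curriculum relational quantization-aware training (CR-QAT), to address the degradation of vision-language alignment and the distortion of inter-region relational structures caused by extreme low-bit quantization in open-vocabulary object detection (OVOD). The framework combines stage-by-stage optimization with text-centric relational knowledge distillation and significantly improves performance on the LVIS and COCO zero-shot benchmarks.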

Details

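Motivation: Open-vocabulary object detection (OVOD) models are too large to deploy easily on resource-constrained devices, and naive extreme low-bit (e.g., 4-bit) quantization severely damages fine-grained vision-language alignment and inter-region relational structures. Method: Propose curriculum relational quantization-aware training (CR-QAT), which has two parts: 1) curriculum quantization-aware training (CQAT), which quantizes the model progressively, partition by partition, isolating errors to stabilize optimization; 2) text-centric relational knowledge distillation (TRKD), which builds text-anchored pairwise similarity matrices to transfer the teacher's multi-dimensional relational knowledge to task-relevant modules. Result: On the LVIS and COCO zero-shot benchmarks, CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, with relative AP improvements of up to 38.9% and 40.9%, respectively. Conclusion: CR-QAT mitigates the damage that low-bit quantization does to the alignment and relational modeling that OVOD depends on, providing a feasible and efficient path to lightweight open-vocabulary detection. Abstract: Open-vocabulary object detection (OVOD) enables novel category detection via vision-language alignment, but massive model sizes hinder deployment on resource-constrained devices. While quantization offers practical compression, we reveal that naive extreme low-bit (e.g., 4-bit) quantization severely degrades fine-grained vision-language alignment and distorts inter-region relational structures. To address this, we propose curriculum relational quantization-aware training (CR-QAT), an integrated framework combining stage-by-stage optimization with relational knowledge distillation. Within CR-QAT, curriculum QAT (CQAT) mitigates error accumulation by partitioning the model for progressive quantization, ensuring stable optimization via error isolation. Concurrently, text-centric relational KD (TRKD) is applied to task-relevant modules. By constructing text-anchored pairwise similarity matrices, TRKD comprehensively transfers the teacher's multi-dimensional relational knowledge. Experiments on LVIS and COCO zero-shot benchmarks demonstrate that CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, achieving relative AP improvements of up to 38.9% and 40.9%, respectively.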
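
To show why 4-bit quantization is so much harsher than 8-bit, here is a minimal sketch of naive uniform symmetric "fake" quantization (quantize, then dequantize). It illustrates the baseline problem only, not CR-QAT itself; the function name and the per-tensor max-abs scale are assumptions chosen for simplicity.

```python
import numpy as np

def fake_quantize(x, bits):
    """Round `x` onto a symmetric uniform grid with 2**bits - 1 levels
    and map it back to floating point."""
    qmax = 2 ** (bits - 1) - 1        # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.max(np.abs(x)) / qmax  # per-tensor scale (an assumption)
    if scale == 0.0:
        return x.copy()               # all-zero input needs no quantization
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

weights = np.random.default_rng(0).normal(size=1000)
for bits in (8, 4):
    err = np.abs(fake_quantize(weights, bits) - weights).mean()
    print(bits, "bits: mean absolute error =", err)
```

With only 15 representable values at 4 bits, small differences between weights are rounded away, which is one intuition for why fine-grained alignment is hard to preserve without quantization-aware training.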

[106] Imagine How To Change: Explicit Procedure Modeling for Change Captioning

Jiayang Sun,Zixin Guo,Min Cao,Guibo Zhu,Jorma Laaksonen

Main category: cs.CV

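TL;DR: ProCap is a new framework that shifts change modeling from static image comparison to dynamic procedure modeling, improving change captioning through a two-stage design (keyframe-driven procedure encoding, followed by end-to-end caption generation with learnable procedure queries).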

Details

Motivation: Existing change captioning methods process only static image pairs and ignore the rich temporal dynamics of the change procedure, making it hard to understand how a change occurs. Method: ProCap has two stages. The first extracts keyframes from automatically generated intermediate frames and trains a procedure encoder with a caption-conditioned masked reconstruction task. The second replaces explicit frame inputs with learnable procedure queries and trains the change captioning model end-to-end in an encoder-decoder architecture. Result: Experiments on three datasets validate ProCap's effectiveness, with notable gains in caption quality and temporal coherence. Conclusion: ProCap models change understanding as a dynamic procedure; its implicit procedure representation and learnable queries balance efficiency, robustness, and caption accuracy. Abstract: Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which is key to understanding not only what has changed but also how it occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap features a two-stage design: The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames to make the implicit procedural dynamics explicit and then sampling them to mitigate redundancy. Then the encoder learns to capture the latent dynamics of these keyframes via a caption-conditioned, masked reconstruction task. The second stage integrates this trained encoder within an encoder-decoder model for captioning. Instead of relying on explicit frames from the previous stage -- a process incurring computational overhead and sensitivity to visual noise -- we introduce learnable procedure queries to prompt the encoder for inferring the latent procedure representation, which the decoder then translates into text. The entire model is then trained end-to-end with a captioning loss, ensuring the encoder's output is both temporally coherent and captioning-aligned. Experiments on three datasets demonstrate the effectiveness of ProCap. Code and pre-trained models are available at https://github.com/BlueberryOreo/ProCap.

[107] Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions

Jingtao Ye,Kexin Zhang,Xunchi Ma,Yuehan Li,Guangming Zhu,Peiyi Shen,Linhua Jiang,Xiangdong Zhang,Liang Zhang

Main category: cs.CV

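TL;DR: This paper presents the DynUAV benchmark for evaluating multi-object tracking (MOT) from a dynamic UAV perspective. It contains 42 video sequences and more than 1.7 million annotated bounding boxes, emphasizes challenges such as intense ego-motion, scale and viewpoint changes, and motion blur, and reveals the limitations of existing SOTA trackers in this setting.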

Details

Motivation: Existing UAV-perspective MOT benchmarks lack the intense ego-motion and complex apparent trajectories of real scenes and therefore fail to reflect the challenges of practical applications. Method: Build the new DynUAV benchmark, with 42 highly dynamic video sequences and more than 1.7M high-quality annotations covering vehicles, pedestrians, and industrial machinery; design a comprehensive evaluation protocol to test mainstream trackers under ego-motion-related challenges. Result: SOTA trackers degrade markedly on DynUAV, with a clear bottleneck in jointly handling detection and association; this confirms DynUAV as a rigorous benchmark. Conclusion: DynUAV fills the gap in dynamic UAV MOT benchmarks and provides an important testbed for developing robust tracking algorithms for real-world scenes. Abstract: The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers, and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale and viewpoint changes as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at link.

[108] DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

Hao Yang,Hongbo Zhang,Yanyan Zhao,Bing Qin

Main category: cs.CV

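TL;DR: This paper proposes DeepSight, the first multimodal large language model dedicated to depth perception. By constructing depth image-text pair and instruction datasets, modifying the ViT encoder, and designing a depth question-answering benchmark, it significantly improves 3D scene understanding.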

Details

Motivation: Existing MLLMs struggle to interpret the depth information in visual data accurately, which limits 3D scene understanding. Method: Propose the DeepSight model: 1) construct a depth image-text pair dataset and a depth instruction dataset (depth maps generated with GLPN, instructions generated with GPT-4); 2) modify the ViT encoder in CLIP to better capture the continuous variations of depth; 3) design a question-answering evaluation benchmark based on depth images. Result: Experiments on the self-built depth question-answering benchmark show that DeepSight significantly improves depth perception and downstream task performance. Conclusion: DeepSight is the first MLLM designed specifically for depth understanding and advances multimodal 3D understanding. Abstract: Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach takes advantage of the unique characteristics of depth images, which are single-channel grayscale images whose pixel values directly reflect depth cues, to improve spatial reasoning. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate the performance of our model, we develop a comprehensive depth question-answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.

[109] Towards High-resolution and Disentangled Reference-based Sketch Colorization

Dingkun Yan,Xinrui Wang,Ru Wang,Zhuoru Li,Jinze Yu,Yusuke Iwasawa,Yutaka Matsuo,Jiaxian Guo

Main category: cs.CV

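TL;DR: This paper proposes a dual-branch framework that explicitly models the data distributions of training and inference and combines a Gram regularization loss, an anime-specific Tagger network, and a texture-enhancing plugin module to mitigate the distribution shift in sketch colorization, significantly improving colorization quality, resolution, and controllability.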

Details

Motivation: Prior methods mainly focus on mitigating the artifacts caused by the distribution shift between semantically aligned training data and highly diverse test data, without fundamentally resolving the shift itself. Method: Propose a dual-branch framework in which a semantic-aligned branch models the training distribution and a semantic-misaligned branch models the inference distribution; introduce a Gram regularization loss to enforce cross-domain distribution coherence; use an anime-specific Tagger network to extract fine-grained attributes from reference images and modulate SDXL's conditional encoders; add a plugin module to enhance texture transfer. Result: Outperforms existing methods in quantitative and qualitative evaluations and in user studies, reaching SOTA on quality and controllability metrics; ablation experiments confirm the contribution of each module. Conclusion: By directly minimizing the distribution shift, the method fundamentally improves sketch colorization, with clear gains in quality, resolution, and controllability. Abstract: Sketch colorization is a critical task for automating and assisting in the creation of animations and digital illustrations. Previous research identified the primary difficulty as the distribution shift between semantically aligned training data and highly diverse test data, and focused on mitigating the artifacts caused by the distribution shift instead of fundamentally resolving the problem. In this paper, we present a framework that directly minimizes the distribution shift, thereby achieving superior quality, resolution, and controllability of colorization. We propose a dual-branch framework to explicitly model the data distributions of the training process and inference process with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram Regularization Loss is applied across the feature maps of both branches, effectively enforcing cross-domain distribution coherence and stability. Furthermore, we adopt an anime-specific Tagger Network to extract fine-grained attributes from reference images and modulate SDXL's conditional encoders to ensure precise control, and a plugin module to enhance texture transfer. Quantitative and qualitative comparisons, alongside user studies, confirm that our method effectively overcomes the distribution shift challenge, establishing state-of-the-art performance across both quality and controllability metrics. An ablation study reveals the influence of each component.
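
A Gram-matrix loss between two feature maps can be sketched as follows. The abstract does not define the Gram Regularization Loss, so this uses the common style-transfer formulation (channel-by-channel inner products, mean squared difference) as an assumption; the function names and the normalization by spatial size are illustrative choices.

```python
import numpy as np

def gram_matrix(feat):
    """Channel-by-channel correlation of a (C, H, W) feature map."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def gram_loss(feat_a, feat_b):
    """Mean squared difference between the Gram matrices of two feature maps."""
    diff = gram_matrix(feat_a) - gram_matrix(feat_b)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 5, 6))  # stand-in for one branch's features
flipped = aligned[:, :, ::-1]         # same statistics, different layout

print(gram_loss(aligned, aligned))      # identical maps
print(gram_loss(aligned, flipped))      # spatially rearranged maps
print(gram_loss(aligned, 2 * aligned))  # different feature statistics
```

Because the Gram matrix sums over spatial positions, the loss ignores where features occur and compares only their channel statistics, which is why it suits matching distributions across branches whose spatial content differs.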

[110] Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning

Claire Roman,Philippe Meyer

Main category: cs.CV

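TL;DR: This paper proposes a two-stage framework based on contrastive learning and knowledge distillation: a teacher model is trained on labeled invented alphabets and then extended to historical writing systems through unsupervised distillation, enabling glyph recognition and the discovery of latent historical relationships between writing systems.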

Details

Motivation: Learning similarity metrics for glyphs and writing systems faces a fundamental challenge: graphemes in invented alphabets can be labeled reliably, but the evolutionary relationships between historical scripts are uncertain and contested. Method: Stage one trains an encoder with a contrastive loss on labeled invented alphabets, yielding a teacher model with strong discriminative ability. Stage two uses teacher-student distillation so that the student learns representations of historical scripts without supervision, inheriting the teacher's discriminative structure while discovering latent cross-script similarities. Result: Experiments on diverse writing systems show effective few-shot glyph recognition and meaningful script clustering without ground-truth annotations of evolutionary relationships. Conclusion: The method bridges supervised contrastive learning and unsupervised discovery, drawing hard boundaries between distinct writing systems while revealing soft similarities that reflect potential historical influences. Abstract: Learning similarity metrics for glyphs and writing systems faces a fundamental challenge: while individual graphemes within invented alphabets can be reliably labeled, the historical relationships between different scripts remain uncertain and contested. We propose a two-stage framework that addresses this epistemological constraint. First, we train an encoder with contrastive loss on labeled invented alphabets, establishing a teacher model with robust discriminative features. Second, we extend to historically attested scripts through teacher-student distillation, where the student learns unsupervised representations guided by the teacher's knowledge but free to discover latent cross-script similarities. The asymmetric setup enables the student to learn deformation-invariant embeddings while inheriting discriminative structure from clean examples. Our approach bridges supervised contrastive learning and unsupervised discovery, enabling both hard boundaries between distinct systems and soft similarities reflecting potential historical influences. Experiments on diverse writing systems demonstrate effective few-shot glyph recognition and meaningful script clustering without requiring ground-truth evolutionary relationships.

[111] Technical Report: Automated Optical Inspection of Surgical Instruments

Zunaira Shafqat,Atif Aftab Ahmed Jilani,Qurrat Ul Ain

Main category: cs.CV

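TL;DR: This report examines manufacturing defects in surgical instruments made in Pakistan and proposes automated optical inspection (AOI) with deep learning models such as YOLOv8, ResNet-152, and EfficientNet-b4 to improve quality control.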

Details

Motivation: Even minor defects in surgical instruments can have serious clinical consequences, so accurate, automated defect detection is needed to protect patient safety and manufacturers' finances. Method: Build a new dataset of 4,414 high-resolution images and apply deep learning architectures including YOLOv8, ResNet-152, and EfficientNet-b4 for automated optical inspection (AOI) analysis. Result: Key defects such as cracks, rust, and structural irregularities are effectively identified and classified, improving quality control for surgical instruments made in Pakistan. Conclusion: Combining industry-academia collaboration with advanced AI techniques can substantially raise manufacturing quality standards for surgical instruments and move the Sialkot surgical instrument cluster toward intelligent quality inspection. Abstract: In the dynamic landscape of modern healthcare, maintaining the highest standards in surgical instruments is critical for clinical success. This report explores the diverse realm of surgical instruments and their associated manufacturing defects, emphasizing their pivotal role in ensuring the safety of surgical procedures. With potentially fatal consequences arising from even minor defects, precision in manufacturing is paramount. The report addresses the identification and rectification of critical defects such as cracks, rust, and structural irregularities. Such scrutiny prevents substantial financial losses for manufacturers and, more crucially, safeguards patient lives. The collaboration with industry leaders Daddy D Pro and Dr. Frigz International, renowned trailblazers in the Sialkot surgical cluster, provides invaluable insights into the analysis of defects in Pakistani-made instruments. This partnership signifies a commitment to advancing automated defect detection methodologies, specifically through the integration of deep learning architectures including YOLOv8, ResNet-152, and EfficientNet-b4, thereby elevating quality standards in the manufacturing process. The scope of this report is to identify various surgical instruments manufactured in Pakistan and analyze their associated defects using a newly developed dataset of 4,414 high-resolution images. By focusing on quality assurance through Automated Optical Inspection (AOI) tools, this document serves as a resource for manufacturers, healthcare professionals, and regulatory bodies. The insights gained contribute to the enhancement of instrument standards, ensuring a more reliable healthcare environment through industry expertise and cutting-edge technology.

[112] MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs

Zhi Lei,Chenxi Liu,Hao Miao,Wanghui Qiu,Bin Yang,Chenjuan Guo

Main category: cs.CV

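TL;DR: This paper proposes MM-ISTS, a multimodal framework that incorporates vision-text large models for irregularly sampled time series (ISTS) forecasting; through two-stage encoding, cross-modal generation, adaptive query extraction, and a modality alignment module, it significantly improves forecasting performance and semantic understanding.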

Details

Motivation: Existing ISTS forecasting methods rely only on historical observations and struggle to learn contextual semantics and fine-grained temporal patterns. Method: Propose the MM-ISTS multimodal framework, comprising: 1) a cross-modal vision-text encoding module that automatically generates images and text; 2) a dedicated ISTS encoder (multi-view embedding fusion plus a temporal-variable encoder); 3) an adaptive query-based feature extractor that compresses MLLM tokens; 4) a multimodal alignment module with modality-aware gating. Result: Extensive experiments on real-world datasets validate the method's effectiveness in forecasting accuracy, semantic understanding, and computational efficiency. Conclusion: MM-ISTS bridges the temporal, visual, and textual modalities, providing a more robust, interpretable, and efficient multimodal paradigm for ISTS forecasting. Abstract: Irregularly sampled time series (ISTS) are widespread in real-world scenarios, exhibiting asynchronous observations at uneven time intervals across variables. Existing ISTS forecasting methods often solely utilize historical observations to predict future ones while falling short in learning contextual semantics and fine-grained temporal patterns. To address these problems, we propose MM-ISTS, a multimodal framework augmented by vision-text large language models that bridges temporal, visual, and textual modalities to facilitate ISTS forecasting. MM-ISTS encompasses a novel two-stage encoding mechanism. In particular, a cross-modal vision-text encoding module is proposed to automatically generate informative visual images and textual data, enabling the capture of intricate temporal patterns and comprehensive contextual understanding, in collaboration with multimodal LLMs (MLLMs). In parallel, ISTS encoding extracts complementary yet enriched temporal features from historical ISTS observations, including multi-view embedding fusion and a temporal-variable encoder. Further, we propose an adaptive query-based feature extractor to compress the learned tokens of MLLMs, distilling compact, useful knowledge, which in turn reduces computational costs. In addition, a multimodal alignment module with modality-aware gating is designed to alleviate the modality gap across ISTS, images, and text. Extensive experiments on real data offer insight into the effectiveness of the proposed solutions.

[113] RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation

Cheng Guan,Chunyu Lin,Zhijie Shen,Junsong Zhang,Jiyuan Wang

Main category: cs.CV

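TL;DR: This paper proposes RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation. With a lightweight geometry-aligned guidance module and a Self-Conditioned AdaLN-Zero mechanism, it significantly improves generalization to 360° images while preserving pretrained perspective priors, using only 1% of the training data.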

Details

Motivation: Depth foundation models trained on perspective images generalize poorly to 360° panoramas, and full fine-tuning requires large amounts of panoramic data, so an efficient domain adaptation method is needed. Method: Proposes the RePer-360 framework: a geometry-aligned dual-projection (ERP/CP) guidance module that produces a modulation signal; a Self-Conditioned AdaLN-Zero mechanism that generates pixel-wise scaling factors to narrow the feature distribution gap; and a cubemap-domain consistency loss that improves training stability and cross-projection alignment. Result: RePer-360 surpasses standard fine-tuning while using only 1% of the training data, and achieves an approximately 20% RMSE improvement under the same in-domain training setting. Conclusion: By preserving pretrained perspective priors and performing distortion-aware adaptation to the panoramic domain, RePer-360 delivers efficient and robust monocular panoramic depth estimation. Abstract: Recent depth foundation models trained on perspective imagery achieve strong performance, yet generalize poorly to 360$^\circ$ images due to the substantial geometric discrepancy between perspective and panoramic domains. Moreover, fully fine-tuning these models typically requires large amounts of panoramic data. To address this issue, we propose RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation that adapts depth foundation models while preserving powerful pretrained perspective priors. Specifically, we design a lightweight geometry-aligned guidance module to derive a modulation signal from two complementary projections (i.e., ERP and CP) and use it to guide the model toward the panoramic domain without overwriting its pretrained perspective knowledge. We further introduce a Self-Conditioned AdaLN-Zero mechanism that produces pixel-wise scaling factors to reduce the feature distribution gap between the perspective and panoramic domains. In addition, a cubemap-domain consistency loss further improves training stability and cross-projection alignment. By shifting the focus from complementary-projection fusion to panoramic domain adaptation under preserved pretrained perspective priors, RePer-360 surpasses standard fine-tuning methods while using only 1\% of the training data. Under the same in-domain training setting, it further achieves an approximately 20\% improvement in RMSE. Code will be released upon acceptance.

[114] Demystifying KAN for Vision Tasks: The RepKAN Approach

Minjong Cheon

Main category: cs.CV

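TL;DR: RepKAN is a new remote sensing image classification architecture that combines the structural efficiency of CNNs with the non-linear expressive power of KANs. Its dual-path design enables physically interpretable reasoning, and it outperforms state-of-the-art models on multiple datasets.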

Details

Motivation: Standard CNNs and Transformers often act as uninterpretable black boxes in remote sensing image classification, so models that are both high-performing and physically interpretable are needed. Method: Proposes the RepKAN architecture, whose dual-path design (Spatial Linear and Spectral Non-linear) fuses the structural efficiency of CNNs with the non-linear representational power of KANs and autonomously discovers class-specific spectral fingerprints and physical interaction manifolds. Result: On the EuroSAT and NWPU-RESISC45 datasets, RepKAN outperforms state-of-the-art models while providing explicit, physically consistent interpretable reasoning. Conclusion: RepKAN is a promising candidate backbone for future interpretable visual foundation models. Abstract: Remote sensing image classification is essential for Earth observation, yet standard CNNs and Transformers often function as uninterpretable black boxes. We propose RepKAN, a novel architecture that integrates the structural efficiency of CNNs with the non-linear representational power of KANs. By utilizing a dual-path design -- Spatial Linear and Spectral Non-linear -- RepKAN enables the autonomous discovery of class-specific spectral fingerprints and physical interaction manifolds. Experimental results on the EuroSAT and NWPU-RESISC45 datasets demonstrate that RepKAN provides explicit physically interpretable reasoning while outperforming state-of-the-art models. These findings indicate that RepKAN holds significant potential to serve as the backbone for future interpretable visual foundation models.

[115] EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation

Shiyuan Yang,Ruihuang Li,Jiale Tao,Shuai Shao,Qinglin Lu,Jing Liao

Main category: cs.CV

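TL;DR: This paper proposes EffectMaker, a unified reasoning-generation framework that requires no per-effect fine-tuning. It combines a multimodal large language model with a diffusion transformer to generate customized visual effects from reference videos, and it introduces EffectData, a large-scale synthetic dataset of 130k videos covering 3k effect categories.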

Details

Motivation: Existing AIGC systems struggle with VFX generation because effect-specific data is scarce, supernatural or stylized effects are hard to model, and per-effect fine-tuning is required, all of which limit scalability and generalization. Method: Proposes a semantic-visual dual-path guidance mechanism: a multimodal large language model interprets high-level effect semantics and reasons about how they should adapt to the target subject, while a diffusion transformer uses in-context learning to capture fine-grained visual cues from reference videos. The large-scale synthetic dataset EffectData (130k videos, 3k categories) supports training and generalization. Result: EffectMaker clearly outperforms existing state-of-the-art methods in visual quality and effect consistency, supporting its scalability and flexibility for customized VFX generation. Conclusion: EffectMaker offers a new paradigm for VFX generation that needs no per-effect fine-tuning and provides strong generalization and controllability, moving AIGC closer to practical use in professional video production. Abstract: Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning-generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic-visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, the largest high-quality synthetic dataset containing 130k videos across 3k VFX categories, to improve generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Project page: https://effectmaker.github.io

[116] MOSIV: Multi-Object System Identification from Videos

Chunjiang Liu,Xiaoyuan Wang,Qingran Lin,Albert Xiao,Haoyu Chen,Shizheng Wen,Hao Zhang,Lu Qi,Ming-Hsuan Yang,Laszlo A. Jeni,Min Xu,Yizhou Zhao

Main category: cs.CV

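TL;DR: This paper proposes MOSIV, a framework for multi-object system identification from videos that optimizes continuous per-object material parameters with a differentiable simulator, and demonstrates its advantages on a new synthetic benchmark.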

Details

Motivation: Existing methods target single-object scenes or discrete material classification and are ill-suited to system identification from multi-object videos. Method: Proposes the MOSIV framework, which uses a differentiable simulator guided by video-derived geometric objectives to directly optimize continuous material parameters for each object, and builds a synthetic multi-object benchmark with contact-rich interactions. Result: On the new benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines; the analysis shows that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization. Conclusion: MOSIV provides a strong baseline framework for the new task of multi-object system identification; the code and dataset will be released. Abstract: We introduce the challenging problem of multi-object system identification from videos, for which prior methods are ill-suited due to their focus on single-object scenes or discrete material classification with a fixed set of material prototypes. To address this, we propose MOSIV, a new framework that directly optimizes for continuous, per-object material parameters using a differentiable simulator guided by geometric objectives derived from video. We also present a new synthetic benchmark with contact-rich, multi-object interactions to facilitate evaluation. On this benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines, establishing it as a strong baseline for this new task. Our analysis shows that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in these complex, multi-object settings. The source code and dataset will be released.

[117] StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

Yuanhuiyi Lyu,Kaiyu Lei,Ziqiao Weng,Xu Zheng,Lutao Jiang,Teng Li,Yangfu Li,Ziyuan Huang,Linfeng Zhang,Xuming Hu

Main category: cs.CV

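TL;DR: This paper proposes StruVis, a framework that uses text-based structured visual representations as intermediate reasoning states to strengthen the reasoning of multimodal large language models (MLLMs) in text-to-image generation. It needs no intermediate image generation, balances efficiency with visual understanding, and is compatible with a variety of T2I generators.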

Details

Motivation: Existing reasoning approaches for text-to-image generation face a dilemma: text-only reasoning lacks visual context, whereas text-image interleaved reasoning is computationally expensive and bounded by the generator's representational capacity. Method: Proposes the StruVis framework, which uses text-form structured visual representations (such as layouts and relations) as intermediate reasoning states, guiding the MLLM to "perceive" visual structure within a purely text-based reasoning process and yielding efficient, generator-agnostic reasoning enhancement. Result: StruVis achieves gains of 4.61% on T2I-ReasonBench and 4% on WISE, two reasoning-based T2I benchmarks, supporting its effectiveness and generality. Conclusion: Through structured-vision-guided text reasoning, StruVis overcomes the computational and representational bottlenecks of interleaved text-image reasoning and offers a new paradigm for efficient, reasoning-capable T2I generation. Abstract: Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While this approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To this end, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual representations as intermediate reasoning states, thereby enabling the MLLM to effectively "perceive" visual structure within a purely text-based reasoning process. Powered by this, the reasoning potential for T2I generation of the MLLM is unlocked through structured-vision-guided reasoning. Additionally, as a generator-agnostic reasoning framework, our proposed StruVis can be seamlessly integrated with diverse T2I generators and efficiently enhance their performance in reasoning-based T2I generation. Extensive experiments demonstrate that StruVis achieves significant performance improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.

[118] Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking

Chunjiang Li,Jianbo Ma,Li Shen,Yanru Chen,Liangyin Chen

Main category: cs.CV

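TL;DR: This paper proposes OA-SORT, a training-free, plug-and-play multi-object tracking framework. By introducing an Occlusion-Aware Module (OAM), an Occlusion-Aware Offset (OAO), and a Bias-Aware Momentum (BAM), it mitigates the positional cost confusion caused by partial occlusion, markedly improves tracking performance on several benchmarks, and is readily reusable in other trackers.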

Details

Motivation: In 2D multi-object tracking (MOT), positional cost confusion caused by partial occlusion seriously degrades tracking performance, and existing methods do not model or exploit occlusion status effectively. Method: Proposes the OA-SORT framework with three core components: (1) the Occlusion-Aware Module (OAM), which models occlusion status and uses a Gaussian Map (GM) to suppress background interference; (2) the Occlusion-Aware Offset (OAO), which corrects the detection-track matching cost based on the OAM output; and (3) the Bias-Aware Momentum (BAM), which dynamically adjusts state estimation to suppress filter instability under occlusion. The whole framework is training-free and plugs into existing SORT-style trackers. Result: OA-SORT reaches 63.1% HOTA and 64.2% IDF1 on the DanceTrack test set and also improves results on SportsMOT and MOT17; embedding the framework in four other trackers raises HOTA and IDF1 by 2.08% and 3.05% on average. Conclusion: Occlusion modeling is essential for 2D MOT; OA-SORT improves tracking robustness and accuracy under occlusion in a lightweight, training-free, and reusable way, offering a new direction for general tracker design. Abstract: Multi-object tracking (MOT) involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to positional cost confusion arising from partial occlusion. To address this issue, we present the novel Occlusion-Aware SORT (OA-SORT) framework, a plug-and-play and training-free framework that includes the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). Specifically, OAM analyzes the occlusion status of objects, where a Gaussian Map (GM) is introduced to reduce background influence. In turn, OAO and BAM leverage the OAM-described occlusion status to mitigate cost confusion and suppress estimation instability. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05%, demonstrating the reusability of the occlusion awareness.

[119] Ensemble Learning with Sparse Hypercolumns

Julia Dietlmeier,Vayangi Ganepola,Oluwabukola G. Adegboro,Mayug Maniparambil,Claudia Mazo,Noel E. O'Connor

Main category: cs.CV

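TL;DR: This paper proposes a lightweight image segmentation method based on VGG16 hypercolumns, using stratified subsampling to reduce computational cost and ensemble learning to improve performance. On a brain tumor dataset, Logistic Regression performs best in the low-shot regime (N ≤ 20), and at a 10% subsampling rate the Dice score reaches 0.66, significantly outperforming the UNet baseline.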

Details

Motivation: Hypercolumns, inspired by biological vision, are promising, but their dense high-dimensional features make computational cost grow linearly with training-set size, which limits practical use and calls for efficient sparsification and classification strategies. Method: Applies stratified subsampling to sparsify the multi-scale hypercolumns extracted by VGG16, and studies ensemble learning methods such as stacking and voting on the resulting sparse hypercolumns. Result: On a brain tumor dataset, the average Dice score reaches 0.66 for N=20 at a 10% stratified subsampling rate, a 24.53% improvement over the UNet baseline (p = 3.07e-11); Logistic Regression is the most effective method in the extreme low-shot case. Conclusion: Sparse hypercolumns paired with simple classifiers are more robust and effective for low-shot medical image segmentation; overly complex models such as UNet tend to overfit, while ensemble methods are advantageous only at moderate sample sizes. Abstract: Directly inspired by findings in biological vision, high-dimensional hypercolumns are feature vectors built by concatenating multi-scale activations of convolutional neural networks for a single image pixel location. Together with powerful classifiers, they can be used for image segmentation, i.e., pixel classification. However, in practice, there are only very few works dedicated to the use of hypercolumns. One reason is the computational complexity of processing concatenated dense hypercolumns that grows linearly with the size $N$ of the training set. In this work, we address this challenge by applying stratified subsampling to the VGG16-based hypercolumns. Furthermore, we investigate the performance of ensemble learning on sparse hypercolumns. Our experiments on a brain tumor dataset show that stacking and voting ensembles deliver competitive performance, but in the extreme low-shot case of $N \leq 20$, a simple Logistic Regression classifier is the most effective method. For 10% stratified subsampling rate, our best average Dice score is 0.66 for $N=20$. This is a statistically significant improvement of 24.53% over the standard multi-scale UNet baseline ($p = 3.07 \times 10^{-11}$, Wilcoxon signed-rank test), which is less effective due to overfitting.
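
The two quantities this entry relies on, the Dice score and class-stratified pixel subsampling, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `dice_score` and `stratified_subsample`, the flat 0/1 mask representation, and the "keep at least one pixel per class" rule are all assumptions made for the example.

```python
import random


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as flat 0/1 lists."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (assumed here): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def stratified_subsample(labels, rate, seed=0):
    """Return sorted pixel indices, keeping roughly `rate` of each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    kept = []
    for indices in by_class.values():
        # Hypothetical rule: never drop a class entirely.
        count = max(1, round(rate * len(indices)))
        kept.extend(rng.sample(indices, count))
    return sorted(kept)


# Toy image with 90 background pixels and 10 tumor pixels.
pixel_labels = [0] * 90 + [1] * 10
subset = stratified_subsample(pixel_labels, rate=0.10)
```

Sampling per class rather than uniformly keeps the rare foreground class represented in the sparse hypercolumn set, which matters when tumor pixels are a small fraction of each image.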

[120] FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

Xia Xin,Yuki Endo,Yoshihiro Kanamori

Main category: cs.CV

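TL;DR: This paper proposes a data-centric approach built on FontUse, a large typography-focused annotated dataset of about 70K images. Structured annotations (user-friendly prompts, text-region locations, and recognized text) are generated automatically with OCR, segmentation models, and multimodal large language models, improving how controllably text-to-image models follow font style and use-case conditions. No architectural change is needed; fine-tuning alone markedly improves typography alignment, and a new Long-CLIP-based evaluation metric is introduced.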

Details

Motivation: Existing text-to-image models struggle to control the typography of generated images and often ignore or only weakly follow prompts about font style and use case; the core bottleneck is the lack of high-quality, structured, typography-oriented training data. Method: Builds an end-to-end automated annotation pipeline that combines segmentation models with multimodal large language models (MLLMs) to produce the FontUse dataset (70K images), with user-friendly prompts (e.g., 'serif, wedding invitations'), text-region locations, and OCR-recognized text; fine-tunes existing text-to-image models on this data without changing their architecture; and proposes a Long-CLIP-based typography alignment metric. Result: Experiments across diverse prompts and layouts show that models fine-tuned on FontUse control typography more consistently than mainstream baselines, and the Long-CLIP metric effectively measures how well generated typography aligns with the requested attributes. Conclusion: Data quality and structured supervision are essential for typography control; the FontUse dataset and its annotation pipeline provide a scalable, low-barrier solution for controllable text rendering that requires no model modification, advancing text-to-image models toward fine-grained visual-semantic control. Abstract: Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. 
The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.

[121] Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

Jiadong Pan,Liang Li,Yuxin Peng,Yu-Ming Tang,Shuohuan Wang,Yu Sun,Hua Wu,Qingming Huang,Haifeng Wang

Main category: cs.CV

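TL;DR: This paper proposes GvU, a token-level intrinsic text-image alignment reward mechanism, together with an understanding-based self-supervised reinforcement learning framework. Unified multimodal models (UMMs) use their own understanding ability to guide generation, narrowing the gap between their visual understanding and generation capabilities.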

Details

Motivation: Unified multimodal models (UMMs) excel at visual understanding but generate comparatively weak results on complex text-to-image tasks, largely because the understanding and generation processes are intrinsically decoupled. Method: Proposes GvU, a token-level intrinsic text-image alignment reward mechanism that lets the UMM act as both teacher and student, and builds an understanding-based self-supervised reinforcement learning framework in which the understanding branch evaluates the model's own generations to guide optimization. Result: Experiments show that the method substantially improves the image generation quality of UMMs and in turn strengthens their fine-grained visual understanding, narrowing the capability gap between understanding and generation. Conclusion: Self-supervised reinforcement learning driven by intrinsic rewards can bridge the performance gap between visual understanding and generation in UMMs and improve their overall multimodal capability. Abstract: Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals--without reliance on external supervision. Experimental results show that our method substantially boosts UMMs' generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs' visual understanding and generation.

[122] GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

Xuan Huang,Mochu Xiang,Zhelun Shen,Jinbo Wu,Chenming Wu,Chen Zhao,Kaisiyuan Wang,Hang Zhou,Shanshan Liu,Haocheng Feng,Wei He,Jingdong Wang

Main category: cs.CV

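TL;DR: This paper proposes GenHOI, a lightweight plug-in method that strengthens how pretrained video generation models handle hand-object interaction (HOI) through temporal balancing (Head-Sliding RoPE) and spatial selectivity (a two-level spatial attention gate), markedly improving interaction consistency and object identity preservation in in-the-wild scenes.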

Details

Motivation: Existing HOI reenactment methods generalize poorly to complex real-world scenes; all-in-one video editing models are more robust but still suffer from HOI-specific problems such as inconsistent object appearance. Method: Proposes GenHOI with two components: (1) Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, mitigating the temporal decay of 3D RoPE and yielding long-range object consistency; and (2) a two-level spatial attention gate, which focuses attention on HOI regions and adaptively scales its strength, balancing background realism with interaction fidelity. Result: On unseen in-the-wild scenes, GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods in both qualitative and quantitative evaluations. Conclusion: GenHOI improves HOI modeling in pretrained video generation models in a lightweight, plug-and-play manner, offering a new approach to physically plausible contact and object identity preservation in digital human video synthesis. Abstract: Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/

[123] Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

Canyu Chen,Yuguang Yang,Zhewen Tan,Yizhi Wang,Ruiyi Zhan,Haiyan Liu,Xuanyao Mao,Jason Bao,Xinyue Tang,Linlin Yang,Bingchuan Sun,Yan Wang,Baochang Zhang

Main category: cs.CV

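TL;DR: This paper proposes the Curious-VLA framework, which uses Feasible Trajectory Expansion (FTE) and Adaptive Diversity-Aware Sampling (ADAS), among other techniques, to ease the exploration-exploitation dilemma that driving vision-language-action (VLA) models face across the imitation learning and reinforcement learning stages, substantially improving navigation performance.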

Details

Motivation: Existing driving VLA models are constrained by a Narrow Policy: exploration collapses during imitation learning, which in turn limits the gains of the subsequent reinforcement learning stage, where training saturates early because feedback diversity is insufficient. Method: Proposes the two-stage Curious-VLA framework: the IL stage uses Feasible Trajectory Expansion (FTE) to generate multiple physically valid trajectories together with a step-wise normalized trajectory representation; the RL stage introduces Adaptive Diversity-Aware Sampling (ADAS) and a Spanning Driving Reward (SDR) with focal-style weighting. Result: Curious-VLA achieves SoTA results on the Navsim benchmark (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8. Conclusion: Curious-VLA unlocks the exploratory potential of VLA models and offers a new approach to the exploration-exploitation trade-off in VLA. Abstract: We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. We therefore propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS) that prioritizes high-diversity samples and introduce Spanning Driving Reward (SDR) with a focal-style weighting to amplify the reward's value span for improving sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.

[124] Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

Nikos Theodoridis,Reenu Mohandas,Ganesh Sistu,Anthony Scanlan,Ciarán Eising,Tim Brophy

Main category: cs.CV

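TL;DR: This paper studies why vision-language models (VLMs) fail on simple visual questions in automated driving. By analyzing intermediate activations with purpose-built counterfactual image sets, it finds that some visual concepts (such as object presence) are linearly encoded while spatial concepts (such as orientation) are only implicitly encoded, and it identifies two failure modes: perceptual failure and cognitive failure.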

Details

Motivation: Vision-Language Models (VLMs) are often used in automated driving to handle long-tail scenarios, yet they frequently fail on simple but critical visual questions, and the reasons for these failures remain unclear. Method: Builds counterfactual image sets that differ only in a targeted visual concept, trains linear probes to assess how linearly separable specific visual concepts are in the intermediate activations of four SOTA VLMs, and diagnoses failure types by combining probe results with model outputs. Result: Concepts such as the presence of an object or agent are explicitly and linearly encoded, whereas spatial concepts such as orientation are only implicitly encoded through the spatial structure retained by the vision encoder; even when a concept is linearly encoded, the model may still answer incorrectly because of failed language alignment; and increasing object distance sharply reduces the linear separability of the concept. Conclusion: VLM failures on driving-relevant visual tasks fall into perceptual failure (the visual information is not linearly encoded) and cognitive failure (the information is present but is not aligned correctly with language semantics); this analysis supports targeted improvements in model robustness. Abstract: The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model's activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model's activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. 
Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
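
To make the probing idea concrete, the following Python sketch fits a linear probe (logistic regression trained by plain gradient descent) on toy "activations" and then applies the perceptual/cognitive distinction described above. Everything here is illustrative: the two-dimensional toy vectors stand in for real VLM activations, and the helper `failure_mode` with its 0.9 accuracy threshold is a hypothetical decision rule, not the paper's procedure. A real study would also score the probe on held-out data rather than on the training set.

```python
import math


def train_linear_probe(features, labels, steps=200, lr=0.5):
    """Fit logistic regression (a linear probe) with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-logit))
            error = prob - y
            for i in range(dim):
                grad_w[i] += error * x[i] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of samples whose sign(logit) matches the binary label."""
    hits = 0
    for x, y in zip(features, labels):
        logit = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += int((logit > 0) == bool(y))
    return hits / len(features)


def failure_mode(accuracy, model_correct, threshold=0.9):
    """Hypothetical rule: high probe accuracy plus a wrong answer = cognitive."""
    if model_correct:
        return "none"
    return "cognitive" if accuracy >= threshold else "perceptual"


# Toy activations for counterfactual pairs: concept absent (0) vs present (1).
activations = [(-1.0, 0.2), (-1.2, -0.3), (-0.8, 0.5), (-1.1, -0.1),
               (1.0, 0.1), (0.9, -0.4), (1.2, 0.3), (0.8, -0.2)]
concept_present = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_linear_probe(activations, concept_present)
accuracy = probe_accuracy(w, b, activations, concept_present)
```

A probe that separates the counterfactual pairs well while the model still answers incorrectly points to the cognitive failure mode; a probe that cannot separate them points to the perceptual one.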

[125] TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation

Soumya Mazumdar,Vineet Kumar Rakesh

Main category: cs.CV

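TL;DR: TempoSyncDiff is a reference-conditioned latent diffusion framework that uses teacher-student distillation, identity anchoring, temporal regularization, and viseme-based audio conditioning to substantially reduce inference latency and improve temporal stability while maintaining generation quality, making it suitable for edge deployment.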

Details

Motivation: Existing diffusion models for talking-head generation suffer from high inference latency, temporal instability (such as flicker and identity drift), and poor audio-visual alignment under challenging speech conditions. Method: Proposes the TempoSyncDiff framework: a teacher-student distillation setup in which a lightweight student network denoises in few steps; identity anchoring and temporal regularization to suppress identity drift and frame-to-frame flicker; and viseme-based audio conditioning for coarse lip motion control. Result: Experiments on the LRS3 dataset report denoising-stage component-level metrics (relative to VAE reconstructions) along with measured latency on CPU and edge devices and a deployment feasibility assessment, indicating that the distilled model retains much of the teacher's reconstruction performance at substantially lower latency. Conclusion: Distilled diffusion models are a viable route to low-latency, stable, edge-deployable talking-head generation; this work is an initial step toward practical use of diffusion models in resource-constrained settings. Abstract: Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during synthesis, while viseme-based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising-stage component-level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU-only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion-based talking-head generation under constrained computational settings. GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff

[126] Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting

Semin Bae,Hansol Lim,Jongseong Brad Choi

Main category: cs.CV

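TL;DR: This paper proposes a reuse pipeline for archived omnidirectional RGB-LiDAR logs that provides robust initialization for 3D Gaussian Splatting (3DGS). It combines ERP-to-cubemap conversion, color-stratified downsampling (PRISM), and FPFH+ICP multimodal registration to improve rendering quality in complex scenes.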

Details

Motivation: Building large-scale digital twins currently depends on expensive, purpose-built data collection, while the large volumes of omnidirectional RGB and LiDAR logs produced by deployed platforms are mostly discarded or underused because of transmission limits and the lack of a scalable reuse pipeline. Method: Proposes an omnidirectional RGB-LiDAR reuse pipeline with an ERP-to-cubemap conversion module for deterministic spatial anchoring, the PRISM color-stratified downsampling strategy to handle dense, unorganized LiDAR point clouds, and FPFH-based global registration followed by ICP refinement to bridge the multimodal inputs and produce geometric initialization usable for SfM. Result: The pipeline converts a considerable share of discarded sensor logs into usable SfM geometry; LiDAR-reinforced initialization consistently outperforms vision-only baselines in structurally complex scenes, improving final 3DGS rendering fidelity. Conclusion: This work provides a deterministic workflow for building simulation-grade digital twins from standard archived sensor logs, supporting low-cost digital twin construction at scale. Abstract: The demand for large-scale digital twins is rapidly growing in robotics and autonomous driving. However, constructing these environments with 3D Gaussian Splatting (3DGS) usually requires expensive, purpose-built data collection. Meanwhile, deployed platforms routinely collect extensive omnidirectional RGB and LiDAR logs, but a significant portion of this sensor data is directly discarded or underutilized due to transmission constraints and the lack of a scalable reuse pipeline. In this paper, we present an omnidirectional RGB-LiDAR reuse pipeline that transforms these archived logs into robust initialization assets for 3DGS. Direct conversion of such raw logs introduces practical bottlenecks: inherent non-linear distortion leads to unreliable Structure-from-Motion (SfM) tracking, and dense, unorganized LiDAR clouds cause computational overhead during 3DGS optimization. To overcome these challenges, our pipeline strategically integrates an ERP-to-cubemap conversion module for deterministic spatial anchoring, alongside PRISM, a color-stratified downsampling strategy. By bridging these multi-modal inputs via Fast Point Feature Histograms (FPFH) based global registration and Iterative Closest Point (ICP), our pipeline successfully repurposes a considerable fraction of discarded data into usable SfM geometry. Furthermore, our LiDAR-reinforced initialization consistently enhances the final 3DGS rendering fidelity in structurally complex scenes compared to vision-only baselines. 
Ultimately, this work provides a deterministic workflow for creating simulation-grade digital twins from standard archived sensor logs.
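
The geometry behind an ERP-to-cubemap conversion can be sketched as follows: each viewing direction belongs to exactly one cube face (the axis with the largest absolute component) and to one pixel of the equirectangular (ERP) image via its longitude and latitude. This is a generic illustration, not the paper's module. The axis convention (+z forward, +y up, longitude measured from +z toward +x) and the function names `cube_face` and `erp_pixel` are assumptions; real datasets differ in these conventions.

```python
import math


def cube_face(direction):
    """Return the cube face hit by a non-zero direction vector (x, y, z)."""
    x, y, z = direction
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        return "+x" if x > 0 else "-x"
    if ay >= az:
        return "+y" if y > 0 else "-y"
    return "+z" if z > 0 else "-z"


def erp_pixel(direction, width, height):
    """Map a non-zero direction to continuous ERP pixel coordinates (u, v)."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    longitude = math.atan2(x, z)      # in (-pi, pi], zero at +z (assumed)
    latitude = math.asin(y / norm)    # in [-pi/2, pi/2], positive is up
    u = (longitude / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - latitude / math.pi) * height
    return u, v


# The forward direction lands at the centre of a 2048x1024 panorama.
forward_face = cube_face((0.0, 0.0, 1.0))
forward_pixel = erp_pixel((0.0, 0.0, 1.0), 2048, 1024)
```

Rendering a cube face then amounts to iterating over its pixel grid, turning each pixel into a direction, and sampling the ERP image at `erp_pixel(direction, ...)`; the resulting faces are ordinary perspective images, which avoids the non-linear ERP distortion that the abstract identifies as a problem for SfM tracking.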

[127] Text-Driven Emotionally Continuous Talking Face Generation

Hao Yang,Yanyan Zhao,Tian Zheng,Hongbo Zhang,Bichen Wang,Di Wu,Xing Fu,Xuda Zhi,Yongbo Huang,Hao He

Main category: cs.CV

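TL;DR: This paper introduces EC-TFG, a new talking face generation task that aims to produce realistic videos with continuous, natural emotional changes from text and a dynamic emotion description, and designs the TIE-TFG model to address it.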

Details

Motivation: Existing talking face generation methods can only produce videos with a fixed emotion and lack the natural, continuous emotional variation that people show when communicating. Method: Introduces the new task of Emotionally Continuous Talking Face Generation (EC-TFG) and designs the Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG) model, which uses temporal-intensive emotion fluctuation modeling to produce emotion variation sequences aligned with the text and drive continuous changes in facial expression. Result: Experiments show that the method produces smooth emotion transitions and maintains high visual quality and motion authenticity across diverse emotional states. Conclusion: The EC-TFG task and the TIE-TFG model address the challenge of modeling emotional continuity in talking face videos, markedly improving the emotional expressiveness and naturalness of synthesized videos. Abstract: Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have mastered the creation of naturalistic facial movements, they typically express a fixed target emotion in synthetic videos and lack the ability to exhibit continuously changing and natural expressions like humans do when conveying information. To synthesize realistic videos, we propose a novel task called Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotions as driving data, aiming to generate a video where the person speaks the text while reflecting the emotional changes within the description. Alongside this, we introduce a customized model, i.e., Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which innovatively manages dynamic emotional variations by employing Temporal-Intensive Emotion Fluctuation Modeling, allowing it to provide emotion variation sequences corresponding to the input text to drive continuous facial expression changes in synthesized videos. Extensive evaluations demonstrate our method's exceptional ability to produce smooth emotion transitions and uphold high-quality visuals and motion authenticity across diverse emotional states.

[128] Lyapunov Probes for Hallucination Detection in Large Foundation Models

Bozhi Luan,Gen Li,Yalan Qin,Jifeng Guo,Yun Zhou,Faguo Wu,Hongwei Zheng,Wenjun Wu,Zhaoxin Fan

Main category: cs.CV

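TL;DR: This paper proposes Lyapunov Probes, which treat large language models and multimodal large language models as dynamical systems and apply Lyapunov stability theory to detect hallucinations, using lightweight networks with derivative-based constraints to identify unstable regions at the boundaries of knowledge transitions.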

Details

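Motivation: Most existing hallucination detection methods treat the problem as a classification task and overlook the dynamical properties of knowledge structure in the model's internal representation space; the authors aim to model factual knowledge and hallucination generation from the perspective of dynamical systems stability. Method: Models (M)LLMs as dynamical systems in which stable equilibrium points represent factual knowledge; identifies the boundaries of knowledge-transition regions as hallucination-prone; and designs Lyapunov Probes, lightweight networks with derivative-based stability constraints that achieve monotonic confidence decay through two-stage training and systematic perturbation analysis. Result: Experiments across multiple datasets and models show consistent improvements over existing baselines on hallucination detection. Conclusion: The dynamical systems stability perspective offers a new paradigm for hallucination detection; Lyapunov Probes effectively distinguish stable from unstable regions of the representation space, improving detection robustness and interpretability. Abstract: We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge-transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone regions. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.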

[129] FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification

Xin Xu,Binchang Ma,Zhixi Yu,Wei Liu

Main category: cs.CV

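TL;DR: This paper proposes the FedARKS framework, which uses Robust Knowledge (RK) and Knowledge Selection (KS) mechanisms to address insufficient global features and unequal client contributions in federated domain generalization for person re-identification, improving generalization to unseen domains while preserving privacy.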

Details

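Motivation: Existing federated domain generalization methods for person re-identification rely on global features and simple averaging for aggregation, which makes it hard to capture domain-invariant local details and ignores differences in feature extraction ability across clients, diluting the contributions of high-quality clients. Method: Proposes the FedARKS federated learning framework, comprising a Robust Knowledge (RK) mechanism that strengthens the modeling of local details and a Knowledge Selection (KS) mechanism that aggregates with differentiated weights to emphasize high-quality clients. Result: The method improves generalization to unseen domains on multiple cross-domain person re-identification benchmarks while protecting data privacy. Conclusion: By introducing robust local representations and selective knowledge aggregation, FedARKS alleviates key bottlenecks of domain generalization in the federated setting and offers a new approach to privacy-preserving cross-domain visual learning. Abstract: The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model's generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high-quality clients. To address these issues, we propose a novel federated learning framework, Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS), comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection).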
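
The contrast the abstract draws between uniform parameter averaging and an aggregation that favors high-quality clients can be sketched as follows. This is a generic weighted-averaging illustration, not the KS mechanism itself: how FedARKS scores clients is not described in the abstract, so the `scores` argument and the function name `aggregate` are hypothetical.

```python
def aggregate(client_params, scores=None):
    """Aggregate per-client parameter vectors into one global vector.

    With `scores=None` this is uniform averaging (every client counts
    equally). With `scores` given, each client is weighted by its
    normalized score. The scores are a hypothetical stand-in for a
    client-quality measure.
    """
    n_clients = len(client_params)
    if scores is None:
        weights = [1.0 / n_clients] * n_clients
    else:
        total = sum(scores)
        if total <= 0:
            raise ValueError("scores must sum to a positive value")
        weights = [s / total for s in scores]
    dim = len(client_params[0])
    # Weighted sum over clients, computed independently per parameter.
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]
```

With two clients holding `[0.0, 2.0]` and `[4.0, 6.0]`, uniform averaging returns the midpoint, while scores of `[1, 3]` pull the result three quarters of the way toward the second client.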

[130] Cross-Resolution Distribution Matching for Diffusion Distillation

Feiyang Chen,Hongpeng Pan,Haonan Xu,Xinyu Duan,Yang Yang,Zhefeng Wang

Main category: cs.CV

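TL;DR: This paper proposes Cross-Resolution Distribution Matching Distillation (RMD), a distillation framework that addresses the distribution gaps caused by resolution changes in multi-resolution cascaded generation, enabling high-fidelity image and video generation in very few steps and substantially accelerating inference (e.g., up to 33.4× on SDXL).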

Details

Motivation: Existing diffusion distillation methods are limited by the denoising process, where step reduction has largely saturated; generating at low resolution for part of the timesteps can accelerate inference further, but cross-resolution distribution gaps cause noticeable quality degradation. Method: RMD divides the timestep intervals for each resolution using log signal-to-noise ratio (logSNR) curves and introduces a logSNR-based mapping to compensate for resolution-induced distribution shifts; performs distribution matching along resolution trajectories to narrow the gap between the low-resolution generator and the high-resolution teacher; and adds a predicted-noise re-injection mechanism to stabilize upsampling training and improve synthesis quality. Result: RMD achieves substantial speedups across several backbones (up to 33.4× on SDXL and 25.6× on Wan2.1-14B) while preserving high visual fidelity; quantitative and qualitative results both support its effectiveness. Conclusion: Through cross-resolution distribution matching and logSNR modeling, RMD mitigates the distribution mismatch in multi-resolution cascaded generation and offers a new paradigm for efficient, high-fidelity diffusion inference. Abstract: Diffusion distillation is central to accelerating image and video generation, yet existing methods are fundamentally limited by the denoising process, where step reduction has largely saturated. Partial timestep low-resolution generation can further accelerate inference, but it suffers noticeable quality degradation due to cross-resolution distribution gaps. We propose Cross-Resolution Distribution Matching Distillation (RMD), a novel distillation framework that bridges cross-resolution distribution gaps for high-fidelity, few-step multi-resolution cascaded inference. Specifically, RMD divides the timestep intervals for each resolution using logarithmic signal-to-noise ratio (logSNR) curves, and introduces logSNR-based mapping to compensate for resolution-induced shifts. Distribution matching is conducted along resolution trajectories to reduce the gap between low-resolution generator distributions and the teacher's high-resolution distribution. In addition, a predicted-noise re-injection mechanism is incorporated during upsampling to stabilize training and improve synthesis quality. Quantitative and qualitative results show that RMD preserves high-fidelity generation while accelerating inference across various backbones. Notably, RMD achieves up to 33.4X speedup on SDXL and 25.6X on Wan2.1-14B, while preserving high visual fidelity.

[131] Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

Bohai Gu,Taiyi Wu,Dazhao Du,Jian Liu,Shuai Yang,Xiaotong Zhao,Alan Zhao,Song Guo

Main category: cs.CV

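TL;DR: This paper proposes Place-it-R1, a framework that uses the Chain-of-Thought (CoT) reasoning of Multimodal Large Language Models (MLLMs) to achieve environment-aware, physically consistent video object insertion. Through a Think-then-Place paradigm, MLLM-guided Spatial Direct Preference Optimization (DPO), and closed-loop iterative refinement, it balances physical plausibility with visual fidelity and offers two user-selectable editing modes.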

Details

Motivation: Existing video editing techniques achieve high visual fidelity when inserting objects but neglect physical causality, producing edits that are physically inconsistent with their environment; an editing framework capable of environment-aware perception and physical reasoning is needed. Method: Proposes the end-to-end framework Place-it-R1: (1) an MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions; (2) MLLM-guided Spatial Direct Preference Optimization (DPO) feeds diffusion outputs back to the MLLM for scoring to improve naturalness; (3) a closed-loop iterative refinement mechanism links the MLLM and the diffusion model; (4) two user-controllable modes are provided, a plausibility-oriented mode that permits environment modifications and a fidelity-oriented mode that preserves scene integrity. Result: Across multiple experiments, Place-it-R1 substantially outperforms state-of-the-art methods and commercial models in physical consistency, while letting users trade off physical plausibility against visual fidelity as needed. Conclusion: Place-it-R1 integrates the cognitive reasoning of MLLMs deeply into the video diffusion editing pipeline for the first time, shifting from visual fitting to physical plausibility and offering a new paradigm for physics-aware video editing. Abstract: Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations: First, MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions to explicitly guide the diffusion toward physically plausible insertion. Then, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, enabling visual naturalness. During inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that progressively enhances editing quality. Furthermore, we provide two user-selectable modes: a plausibility-oriented flexible mode that permits environment modifications (e.g., generating support structures) to enhance physical plausibility, and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility-fidelity trade-off. Extensive experiments demonstrate that Place-it-R1 achieves more physically coherent video object insertion than state-of-the-art solutions and commercial models.

[132] Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models

Nicoleta-Nina Basoc,Adrian Cosma,Emilian Radoi

Main category: cs.CV

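TL;DR: This paper studies the perceptual fragility of vision-language models (VLMs) under structured colour distortions (Spatial Colour Mixing), finding that performance drops sharply and that scaling the language model does not reliably help; humans far outperform VLMs under the same distortions; a human-inspired preprocessing step partially recovers performance.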

Details

Motivation: Although VLMs perform well on benchmarks, they show systematic perceptual weaknesses under structured pixel perturbations that humans handle with ease; this paper aims to expose and narrow this human-machine perception gap. Method: Proposes Spatial Colour Mixing, a programmatic colour distortion method that overlays structured patterns in the RGB and Ostwald colour systems; builds an evaluation framework with eight variants; evaluates nine VLMs from three model families on four datasets; runs a human study with 61 participants; designs and validates a human-inspired preprocessing strategy. Result: Accuracy drops sharply for all VLMs as distortion increases; scaling the language model does not reliably improve robustness; humans substantially outperform VLMs on the animal recognition task; a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types. Conclusion: VLMs exhibit a fundamental low-level perceptual fragility, and practical strategies such as perception-aware preprocessing and tool use are needed to improve robustness. Abstract: Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we show that a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types, motivating perception-aware preprocessing and tool-use as practical strategies for improving VLM robustness.
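
To make the idea of "overlaying structured patterns onto natural images" concrete, here is a minimal sketch of one possible RGB distortion: a checkerboard whose alternating cells are blended with a fixed colour. The abstract does not define the eight variants, so the checkerboard pattern, the function name `checkerboard_mix`, and the `cell` and `alpha` parameters are assumptions for illustration only.

```python
def checkerboard_mix(image, colour, cell=2, alpha=0.5):
    """Blend `colour` into alternating cells of an RGB image.

    `image` is a list of rows, each row a list of (r, g, b) tuples with
    values in 0..255. Cells where (row // cell + col // cell) is even are
    blended with `colour` at strength `alpha`; other cells are unchanged.
    (Hypothetical distortion, not one of the paper's named variants.)
    """
    mixed = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            if (y // cell + x // cell) % 2 == 0:
                # Linear blend between the original pixel and the colour.
                pixel = tuple(
                    int(round((1.0 - alpha) * p + alpha * c))
                    for p, c in zip(pixel, colour)
                )
            new_row.append(pixel)
        mixed.append(new_row)
    return mixed
```

Raising `alpha` or shrinking `cell` makes the overlay more intrusive at the pixel level while leaving the scene layout intact, which is the kind of structured change the abstract contrasts with human recognition.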

[133] Longitudinal NSCLC Treatment Progression via Multimodal Generative Models

Massimiliano Mantegna,Elena Mulero Ayllón,Alice Natalina Caragliano,Francesco Di Feola,Claudia Tacconi,Michele Fiore,Edy Ippolito,Carlo Greco,Sara Ramella,Philippe C. Cattin,Paolo Soda,Matteo Tortora,Valerio Guarrasi

Main category: cs.CV

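TL;DR: This paper proposes a Virtual Treatment (VT) framework that models tumor evolution during radiotherapy for non-small cell lung cancer (NSCLC) as a dose-aware, multimodal conditional image-to-image translation task, using diffusion models to generate follow-up CT images that reflect anatomical changes; on data from 222 patients, diffusion models outperform GAN-based methods.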

Details

Motivation: Predicting tumor evolution during radiotherapy is clinically important, but longitudinal changes are driven jointly by anatomy and treatment, which existing methods struggle to model accurately. Method: Proposes the Virtual Treatment (VT) framework, which formulates NSCLC progression as a dose-aware multimodal conditional image-to-image translation problem; takes a baseline CT, clinical variables, and a radiation dose increment as input and synthesizes follow-up CT images; compares GAN and diffusion models in 2D and 2.5D configurations. Result: Diffusion models behave more stably under multimodal, dose-aware conditioning and produce more anatomically plausible tumor evolution trajectories than GAN baselines. Conclusion: The VT framework has the potential to support in-silico treatment monitoring and adaptive radiotherapy research in NSCLC. Abstract: Predicting tumor evolution during radiotherapy is a clinically critical challenge, particularly when longitudinal changes are driven by both anatomy and treatment. In this work, we introduce a Virtual Treatment (VT) framework that formulates non-small cell lung cancer (NSCLC) progression as a dose-aware multimodal conditional image-to-image translation problem. Given a CT scan, baseline clinical variables, and a specified radiation dose increment, VT aims to synthesize plausible follow-up CT images reflecting treatment-induced anatomical changes. We evaluate the proposed framework on a longitudinal dataset of 222 stage III NSCLC patients, comprising 895 CT scans acquired during radiotherapy under irregular clinical schedules. The generative process is conditioned on delivered dose increments together with demographic and tumor-related clinical variables. Representative GAN-based and diffusion-based models are benchmarked across 2D and 2.5D configurations. Quantitative and qualitative results indicate that diffusion-based models benefit more consistently from multimodal, dose-aware conditioning and produce more stable and anatomically plausible tumor evolution trajectories than GAN-based baselines, supporting the potential of VT as a tool for in-silico treatment monitoring and adaptive radiotherapy research in NSCLC.

[134] VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models

Rohit Saxena,Alessandro Suglia,Pasquale Minervini

Main category: cs.CV

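TL;DR: This paper presents the VLM-RobustBench benchmark, which systematically evaluates the robustness of VLMs from four model families under 133 corrupted settings (noise, blur, weather, digital, and geometric perturbations). Models prove highly sensitive to spatial perturbations (e.g., glass blur, elastic transform) yet relatively robust to photometric ones, revealing that they are semantically strong but spatially fragile and motivating new robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.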

Details

Motivation: Vision-language models (VLMs) perform well on standard, high-quality datasets, but their robustness under real-world image distortions (noise, blur, geometric distortion, and so on) is not well understood and calls for systematic evaluation. Method: Builds the VLM-RobustBench benchmark, covering 49 distortion types with three severity levels (low/mid/high) plus binary transforms, for 133 corrupted settings in total; evaluates VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks, MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Result: Visual severity is a poor predictor of performance degradation: low-severity glass_blur reduces MMBench accuracy by about 8 percentage points on average, while resampling and geometric distortions (e.g., upsample, elastic_transform) cause the largest drops, up to 34 percentage points. Conclusion: Current VLMs have strong semantic understanding but are highly fragile in modeling spatial structure; the results motivate new robustness evaluation protocols and training strategies that emphasize resampling and geometric invariances. Abstract: Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.
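
The abstract reports 49 augmentation types and 133 corrupted settings without giving the split between graded and binary transforms. Under the assumption that every type is either graded at exactly three severities or applied once as a binary transform, the split follows from two linear equations; the helper below (a hypothetical name, `split_counts`) works it out. The resulting figures are an inference from that assumption, not numbers stated by the authors.

```python
def split_counts(total_types, total_settings, levels=3):
    """Solve for the number of graded and binary augmentation types.

    Assumes: graded + binary = total_types
             levels * graded + binary = total_settings
    Subtracting gives (levels - 1) * graded = total_settings - total_types.
    """
    numerator = total_settings - total_types
    if numerator % (levels - 1) != 0:
        raise ValueError("counts are inconsistent with the assumed split")
    graded = numerator // (levels - 1)
    binary = total_types - graded
    if graded < 0 or binary < 0:
        raise ValueError("counts are inconsistent with the assumed split")
    return graded, binary
```

Calling `split_counts(49, 133)` solves 2 × graded = 84, which is consistent with 42 graded types (126 settings) and 7 binary transforms.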

[135] Reflective Flow Sampling Enhancement

Zikai Zhou,Muyao Wang,Shitong Shao,Lichen Bai,Haoyi Xiong,Bo Han,Zeke Xie

Main category: cs.CV

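TL;DR: This paper proposes Reflective Flow Sampling (RF-Sampling), an inference-time enhancement framework designed for flow-matching text-to-image models such as FLUX. A formal derivation shows that it implicitly performs gradient ascent on the text-image alignment score; it consistently improves generation quality and prompt alignment on multiple benchmarks and is the first such method to exhibit some test-time scaling ability on FLUX.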

Details

Motivation: Existing inference-time enhancement techniques mainly apply to conventional diffusion models and perform poorly on flow-matching models such as FLUX; a theoretically grounded, training-free enhancement method designed for flow models, especially CFG-distilled variants, is needed. Method: Proposes RF-Sampling: based on a formal derivation, it combines a linear combination of textual representations with flow inversion to explore regions of noise space that are more consistent with the input prompt, implicitly performing gradient ascent on the alignment score; it targets CFG-distilled flow models such as FLUX and requires no additional training. Result: Consistently improves generation quality and text-prompt alignment on multiple benchmarks; it is the first inference enhancement method to show some test-time scaling ability on FLUX. Conclusion: RF-Sampling is a theoretically grounded, plug-and-play inference enhancement framework that fills the gap left by enhancement techniques unsuited to flow models, improving the performance and controllability of FLUX-style models. Abstract: The growing demand for text-to-image generation has led to rapid advances in generative modeling. Recently, text-to-image diffusion models trained with flow matching algorithms, such as FLUX, have achieved remarkable progress and emerged as strong alternatives to conventional diffusion models. At the same time, inference-time enhancement strategies have been shown to improve the generation quality and text-prompt alignment of text-to-image diffusion models. However, these techniques are mainly applicable to conventional diffusion models and usually fail to perform well on flow models. To bridge this gap, we propose Reflective Flow Sampling (RF-Sampling), a theoretically-grounded and training-free inference enhancement framework explicitly designed for flow models, especially for the CFG-distilled variants (i.e., models distilled from CFG guidance techniques), like FLUX. Departing from heuristic interpretations, we provide a formal derivation proving that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. By leveraging a linear combination of textual representations and integrating them with flow inversion, RF-Sampling allows the model to explore noise spaces that are more consistent with the input prompt. Extensive experiments across multiple benchmarks demonstrate that RF-Sampling consistently improves both generation quality and prompt alignment. Moreover, RF-Sampling is also the first inference enhancement method that can exhibit test-time scaling ability to some extent on FLUX.

[136] FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models

Andrew Caunes,Thierry Chateau,Vincent Fremont

Main category: cs.CV

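TL;DR: This paper proposes FreeOcc, a training-free method for semantic and panoptic occupancy prediction that uses pretrained foundation models to recover semantics and geometry from multi-view images, achieving performance on nuScenes that matches or exceeds weakly supervised methods.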

Details

Motivation: Existing camera-only 3D occupancy prediction methods rely on costly dense 3D supervision or require training on target-domain data, which limits generalization to unseen environments. Method: FreeOcc uses a training-free pipeline: a promptable foundation segmentation model extracts per-view panoptic priors, which are mapped to the label taxonomy with prompt-to-taxonomy rules; a reconstruction foundation model recovers metric 3D points; depth- and confidence-aware filtering lifts reliable labels into 3D, followed by temporal fusion and deterministic voxel refinement; instances are recovered by fitting and merging robust current-view 3D box candidates. Result: On Occ3D-nuScenes, FreeOcc reaches 16.9 mIoU and 16.5 RayIoU with no training; used as a pseudo-label generator it reaches 21.1 RayIoU, surpassing the previous best weakly supervised baseline; it also sets new baselines of 3.1 and 3.9 RayPQ for training-free and weakly supervised panoptic occupancy, respectively. Conclusion: FreeOcc shows that foundation-model-driven perception is a viable route to training-free 3D scene understanding. Abstract: Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle's surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
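
The step in which labelled 3D points are "fused over time and voxelized" can be pictured with a small sketch: points are binned into a regular grid and each voxel takes the most frequent label among its points. The abstract does not describe the refinement stack, so majority voting, the function name `voxelize`, and the `(x, y, z, label)` input layout are assumptions used only to illustrate the general idea.

```python
import math
from collections import Counter, defaultdict


def voxelize(points, voxel_size):
    """Assign one semantic label per occupied voxel by majority vote.

    `points` is an iterable of (x, y, z, label) tuples in metric units;
    points from several frames can simply be concatenated beforehand.
    Returns a dict mapping integer voxel indices (i, j, k) to a label.
    (Majority voting is a stand-in; the paper's refinement is unspecified.)
    """
    votes = defaultdict(Counter)
    for x, y, z, label in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        votes[key][label] += 1
    # Keep the most frequent label in each voxel.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}
```

Because the procedure has no learned parameters, it fits the training-free setting; a real pipeline would presumably also weight votes by depth and confidence, as the abstract's filtering step suggests.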

Ruili Li,Jiayi Ding,Ruiyu Li,Yilun Jin,Shiwen Ge,Yuwen Zeng,Xiaoyong Zhang,Eichi Takaya,Jan Vrba,Noriyasu Homma

Main category: cs.CV

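TL;DR: This paper proposes a semi-supervised framework with training-free pseudo-label generation, in which a vision-language model (VLM) prompted with simple appearance descriptions produces structurally consistent pseudo labels and a static teacher and an EMA teacher are optimized jointly, substantially improving breast ultrasound image segmentation at extremely low annotation rates.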

Details

Motivation: Existing semi-supervised learning for breast ultrasound image segmentation yields unstable pseudo labels when annotations are extremely scarce, and vision-language models are hard to transfer directly to the medical domain because suitable domain-specific prompts are lacking. Method: Proposes a framework with training-free pseudo-label generation and refinement: simple appearance descriptions (e.g., "dark oval") enable cross-domain structural transfer; a static teacher captures global structural priors; combined with an EMA teacher, uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning are introduced to sharpen boundary discrimination. Result: On four BUS datasets, the method reaches performance comparable to fully supervised models with only 2.5% labeled data, significantly outperforming existing SSL methods; the paradigm extends to other modalities or diseases, requiring only a global appearance description to obtain reliable pseudo supervision. Conclusion: The work demonstrates that a lightweight, generalizable vision-language pseudo-labeling strategy is effective for medical image segmentation under extremely limited resources, offering a new paradigm for scalable semi-supervised medical image analysis. Abstract: Semi-supervised learning (SSL) has emerged as a promising paradigm for breast ultrasound (BUS) image segmentation, but it often suffers from unstable pseudo labels under extremely limited annotations, leading to inaccurate supervision and degraded performance. Recent vision-language models (VLMs) provide a new opportunity for pseudo-label generation, yet their effectiveness on BUS images remains limited because domain-specific prompts are difficult to transfer. To address this issue, we propose a semi-supervised framework with training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions (e.g., dark oval), our method enables cross-domain structural transfer between natural and medical images, allowing VLMs to generate structurally consistent pseudo labels. These pseudo labels are used to warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, we further introduce uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination. Experiments on four BUS datasets demonstrate that our method achieves performance comparable to fully supervised models even with only 2.5% labeled data, significantly outperforming existing SSL approaches. Moreover, the proposed paradigm is readily extensible: for other imaging modalities or diseases, only a global appearance description is required to obtain reliable pseudo supervision, enabling scalable semi-supervised medical image segmentation under limited annotations.
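
Two ingredients named in the abstract, the exponential moving average (EMA) teacher and entropy-weighted fusion of teacher predictions, can be sketched in a few lines. The EMA update is the standard formulation; the fusion rule, which weights each teacher's foreground probability by one minus its normalized binary entropy, is only one plausible reading and is not taken from the paper. All function names and the `momentum` default are assumptions.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Standard EMA update: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli foreground probability `p`."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_fusion(p_static, p_ema):
    """Fuse two per-pixel foreground probabilities, trusting the less
    uncertain teacher more (hypothetical weighting rule)."""
    w_static = max(0.0, 1.0 - binary_entropy(p_static) / math.log(2.0))
    w_ema = max(0.0, 1.0 - binary_entropy(p_ema) / math.log(2.0))
    total = w_static + w_ema
    if total == 0.0:
        # Both teachers are maximally uncertain: fall back to the mean.
        return 0.5 * (p_static + p_ema)
    return (w_static * p_static + w_ema * p_ema) / total
```

Under this rule a teacher that outputs 0.5 for a pixel contributes almost nothing, so the fused value follows the more confident teacher.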

[138] JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas

Sandeep Inuganti,Hideaki Kanayama,Kanta Shimizu,Mahdi Chamseddine,Soichiro Yokota,Didier Stricker,Jason Rambach

Main category: cs.CV

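TL;DR: JOPP-3D is an open-vocabulary semantic segmentation framework that jointly leverages panoramic images and point clouds for language-driven cross-modal scene understanding, substantially outperforming existing methods on the Stanford-2D-3D-S and ToF-360 datasets.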

Details

Motivation: To address the scarcity of annotated data and the poor generalization of fixed-label models in cross-modal semantic segmentation of 3D point clouds and panoramic images. Method: Converts RGB-D panoramic images into tangential perspective images and 3D point clouds, extracts and aligns foundational vision-language features, and supports natural language queries that generate semantic masks on both modalities. Result: Produces coherent and semantically meaningful cross-modal segmentations on the Stanford-2D-3D-S and ToF-360 datasets, significantly surpassing the SOTA on open- and closed-vocabulary 2D and 3D semantic segmentation. Conclusion: JOPP-3D demonstrates the effectiveness of combining multimodal visual data with language guidance for open-vocabulary segmentation, improving the flexibility and generalization of cross-modal scene understanding. Abstract: Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.

[139] Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward Learning

Yueying Tian,Xudong Han,Meng Zhou,Rodrigo Aviles-Espinosa,Rupert Young,Philip Birch

Main category: cs.CV

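TL;DR: This paper proposes optimizing a pretrained 3D diffusion model with reinforcement learning (PPO) and multi-scale feedback (2D slices plus 3D volumes), substantially improving the quality of generated MRI images and performance on downstream classification tasks.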

Details

Motivation: Standard diffusion training objectives are disconnected from clinical needs, so the clinical relevance of generated medical images needs to be improved. Method: First pretrains a 3D diffusion model on MRI data, then fine-tunes it with PPO using a reward function that combines 2D slice-wise assessments with 3D volumetric analysis. Result: FID improves significantly on the BraTS 2019 and OASIS-1 datasets, and the generated data outperforms non-optimized baselines on tumor and disease classification tasks. Conclusion: Multi-scale RL feedback effectively steers 3D diffusion models toward higher-quality medical images with greater clinical utility. Abstract: Diffusion models have emerged as powerful tools for 3D medical image generation, yet bridging the gap between standard training objectives and clinical relevance remains a challenge. This paper presents a method to enhance 3D diffusion models using Reinforcement Learning (RL) with multi-scale feedback. We first pretrain a 3D diffusion model on MRI volumes to establish a robust generative prior. Subsequently, we fine-tune the model using Proximal Policy Optimization (PPO), guided by a novel reward system that integrates both 2D slice-wise assessments and 3D volumetric analysis. This combination allows the model to simultaneously optimize for local texture details and global structural coherence. We validate our framework on the BraTS 2019 and OASIS-1 datasets. Our results indicate that incorporating RL feedback effectively steers the generation process toward higher quality distributions. Quantitative analysis reveals significant improvements in Fréchet Inception Distance (FID) and, crucially, the synthetic data demonstrates enhanced utility in downstream tumor and disease classification tasks compared to non-optimized baselines.
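
A reward that "integrates both 2D slice-wise assessments and 3D volumetric analysis" could, in its simplest form, be a convex combination of the mean slice score and a single volume score. The abstract does not give the actual reward, so the function below, its name `multi_scale_reward`, and the `slice_weight` parameter are hypothetical.

```python
def multi_scale_reward(slice_scores, volume_score, slice_weight=0.5):
    """Combine per-slice quality scores with one volumetric score.

    `slice_scores` holds one score per 2D slice (local texture quality);
    `volume_score` scores the whole 3D volume (global coherence).
    `slice_weight` in [0, 1] trades the two terms off. (Illustrative
    form only; the paper's reward system is not specified here.)
    """
    if not slice_scores:
        raise ValueError("at least one slice score is required")
    if not 0.0 <= slice_weight <= 1.0:
        raise ValueError("slice_weight must lie in [0, 1]")
    slice_term = sum(slice_scores) / len(slice_scores)
    return slice_weight * slice_term + (1.0 - slice_weight) * volume_score
```

A scalar of this kind is what a policy-gradient method such as PPO would receive for each generated volume; setting `slice_weight` to 0 or 1 recovers a purely volumetric or purely slice-wise reward.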

[140] Making Training-Free Diffusion Segmentors Scale with the Generative Power

Benyuan Meng,Qianqian Xu,Zitai Wang,Xiaochun Cao,Longtao Huang,Qingming Huang

Main category: cs.CV

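TL;DR: This paper proposes a training-free semantic segmentation method based on diffusion models. Auto aggregation and per-pixel rescaling address the mismatch between cross-attention maps and a global semantic representation and the score imbalance among text tokens, so that segmentation benefits more effectively from generative capability.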

Details

Motivation: Existing training-free semantic segmentation methods built on pretrained diffusion models fail to improve as the model's generative capability grows; the root causes are that cross-attention maps lack a unified global representation and that scores are imbalanced among text tokens. Method: Proposes two techniques: auto aggregation, which fuses cross-attention maps from multiple heads and layers into a unified global representation, and per-pixel rescaling, which corrects score bias among text tokens to model semantic correlation more accurately. Result: The method is validated on standard semantic segmentation benchmarks and integrated into a generative technique, improving segmentation performance and showing broad applicability. Conclusion: By closing the gap between cross-attention representations and the need for global semantics, the method lets training-free diffusion segmentors actually benefit from stronger generative models, opening a new direction for applying diffusion models to discriminative tasks. Abstract: As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are termed training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at https://github.com/Darkbblue/goca.

[141] Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots

Mingzhe Li,Mengyin Liu,Zekai Wu,Xincheng Lin,Junsheng Zhang,Ming Yan,Zengye Xie,Changwang Zhang,Chenglu Wen,Lan Xu,Siqi Shen,Cheng Wang

Main category: cs.CV

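TL;DR: This paper proposes the Motion Turing Test, a framework that evaluates whether human observers can tell humanoid robot poses from human poses using only kinematic information, and builds the HHMotion dataset (1,000 motion sequences, 15 action categories, 11 humanoid models, and 10 human subjects), with all sequences converted to SMPL-X to remove the influence of appearance. Thirty annotators spent over 500 hours rating human-likeness on a 0-5 scale. Analysis shows that robots still deviate noticeably from human motion in dynamic actions such as jumping, boxing, and running. The paper also formulates the task of automatically predicting human-likeness scores, finding that current multimodal large models fall short while the authors' simple baseline performs better. The dataset, code, and benchmark will be released publicly.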

Details

Motivation: Inspired by the Turing Test, the work aims to establish an objective, quantitative standard for how natural humanoid robot motion is, addressing the lack of a unified, measurable human-likeness evaluation in current motion generation research. Method: Proposes the Motion Turing Test evaluation framework; builds the HHMotion dataset of 1,000 motion sequences, all represented in SMPL-X; has 30 annotators give subjective human-likeness ratings; formulates an automatic human-likeness prediction task and proposes a simple baseline model for it. Result: Current humanoid robots still deviate markedly from human motion in dynamic actions such as jumping, boxing, and running; on automatic human-likeness prediction, the proposed simple baseline outperforms several recent multimodal large language models. Conclusion: The Motion Turing Test offers a new paradigm for evaluating the naturalness of humanoid robot motion; the HHMotion dataset and benchmark fill a gap in data and evaluation for this area; the effectiveness of the simple model suggests that, for motion quality assessment, specialized architectures may have an advantage over general-purpose large models. Abstract: Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0-5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based methods. The dataset, code, and benchmark will be publicly released to support future research in the community.

[142] SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection

Shuailin Xue,Jun Wan,Lihua Zhang,Wenwen Min

Main category: cs.CV

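TL;DR: This paper proposes SpaCRD, a transfer learning method that deeply fuses histology images with spatial transcriptomics (ST) data to detect cancer tissue regions (CTR) accurately across samples, platforms, and batches.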

Details

Motivation: Traditional morphology-based CTR detection suffers from high false-positive rates because different tissue regions look alike; existing methods struggle to fuse histology images with ST data effectively, especially in cross-sample and cross-platform/batch settings. Method: Proposes SpaCRD, a transfer learning-based, category-regularized, variational reconstruction-guided bidirectional cross-attention fusion network that adaptively captures latent co-expression patterns between histological features and gene expression. Result: On 23 matched histology-ST datasets spanning multiple disease types, platforms, and batches, SpaCRD consistently outperforms eight existing state-of-the-art methods. Conclusion: SpaCRD markedly improves the accuracy and generalization of CTR detection, providing a reliable new tool for tumor microenvironment analysis and treatment response assessment. Abstract: Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data for CTR detection, especially in cross-sample and cross-platform/batch settings. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.

[143] Adaptive Language-Aware Image Reflection Removal Network

Siyan Fang,Yuntao Wang,Jinpu Zhang,Ziwen Li,Yuehuan Wang

Main category: cs.CV

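TL;DR: This paper proposes the Adaptive Language-Aware Network (ALANet), which combines filtering and optimization strategies to mitigate the harm that inaccurate language descriptions cause in image reflection removal and to improve the alignment of language and visual features, yielding better performance on complex reflections.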

Details

Motivation: Existing image reflection removal methods struggle with complex reflections, and machine-generated language descriptions are inaccurate because reflections blur and distort the image, which limits language-guided approaches. Method: Proposes ALANet, which combines a filtering strategy (suppressing language noise while keeping useful information) with an optimization strategy (strengthening language-visual feature alignment) and uses language cues to decouple specific layer content from feature maps; builds the CRLAV dataset to evaluate performance under varying language accuracy. Result: ALANet surpasses current state-of-the-art methods on complex reflection removal. Conclusion: ALANet can robustly use inaccurate language descriptions to guide reflection removal, markedly improving performance on complex reflections and confirming the effectiveness and practicality of language-aware modeling. Abstract: Existing image reflection removal methods struggle to handle complex reflections. Accurate language descriptions can help the model understand the image content to remove complex reflections. However, due to blurred and distorted interferences in reflected images, machine-generated language descriptions of the image content are often inaccurate, which harms the performance of language-guided reflection removal. To address this, we propose the Adaptive Language-Aware Network (ALANet) to remove reflections even with inaccurate language inputs. Specifically, ALANet integrates both filtering and optimization strategies. The filtering strategy reduces the negative effects of language while preserving its benefits, whereas the optimization strategy enhances the alignment between language and visual features. ALANet also utilizes language cues to decouple specific layer content from feature maps, improving its ability to handle complex reflections. To evaluate the model's performance under complex reflections and varying levels of language accuracy, we introduce the Complex Reflection and Language Accuracy Variance (CRLAV) dataset. Experimental results demonstrate that ALANet surpasses state-of-the-art methods for image reflection removal. The code and dataset are available at https://github.com/fashyon/ALANet.

[144] Point-Supervised Skeleton-Based Human Action Segmentation

Hongsong Wang,Yiqin Shen,Pengbo Yan,Jie Gui

Main category: cs.CV

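TL;DR: This paper proposes a point-supervised framework for temporal action segmentation on skeleton data that needs only one labeled frame per action segment; combining multimodal skeleton features with several pseudo-label generation strategies, it matches or even surpasses fully supervised methods on multiple benchmarks.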

Details

Motivation: Existing fully supervised methods depend on costly frame-level annotation and are sensitive to ambiguous action boundaries, so annotation cost must be reduced and robustness improved. Method: Proposes a point-supervised framework that encodes multimodal skeleton information (joint, bone, and motion) with a pretrained unified model; designs a prototype similarity method and combines it with an energy function and constrained K-Medoids clustering to generate pseudo-labels; introduces multimodal pseudo-label integration to guide model training. Result: Establishes new benchmarks on PKU-MMD (X-Sub/X-View), MCFS-22, and MCFS-130; experiments show performance that matches or even exceeds some fully supervised methods while greatly reducing annotation cost. Conclusion: The point-supervised paradigm is practical and efficient for skeleton-based action segmentation, and multimodal feature modeling together with complementary pseudo-label strategies is key to the performance gains. Abstract: Skeleton-based temporal action segmentation is a fundamental yet challenging task, playing a crucial role in enabling intelligent systems to perceive and respond to human activities. While fully-supervised methods achieve satisfactory performance, they require costly frame-level annotations and are sensitive to ambiguous action boundaries. To address these issues, we introduce a point-supervised framework for skeleton-based action segmentation, where only a single frame per action segment is labeled. We leverage multimodal skeleton data, including joint, bone, and motion information, encoded via a pretrained unified model to extract rich feature representations. To generate reliable pseudo-labels, we propose a novel prototype similarity method and integrate it with two existing methods: energy function and constrained K-Medoids clustering. Multimodal pseudo-label integration is proposed to enhance the reliability of the pseudo-label and guide the model training. We establish new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130, and implement baselines for point-supervised skeleton-based human action segmentation. Extensive experiments show that our method achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.
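
To give a feel for how a single labeled frame per segment can be expanded into dense pseudo-labels, the sketch below builds one prototype per class from the labeled frames and assigns every frame to its most similar prototype by cosine similarity. This is a generic nearest-prototype scheme, not the paper's prototype similarity method, whose details the abstract omits; the names `cosine` and `prototype_pseudo_labels` and the input layout are assumptions. It also ignores temporal order, which a real segmentation method would exploit.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def prototype_pseudo_labels(features, point_labels):
    """Label every frame with the class of its most similar prototype.

    `features` is a list of per-frame feature vectors; `point_labels`
    maps the index of each annotated frame to its class. Prototypes are
    the mean feature of the annotated frames of each class.
    """
    grouped = {}
    for index, label in point_labels.items():
        grouped.setdefault(label, []).append(features[index])
    prototypes = {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }
    return [
        max(prototypes, key=lambda label: cosine(frame, prototypes[label]))
        for frame in features
    ]
```

Frames near an annotated point inherit its label as long as their features stay close to that class prototype; ambiguous boundary frames are where complementary cues, such as the clustering and energy-based methods the abstract mentions, would matter.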

[145] VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction

Xiaoyang Yan,Muleilan Pei,Shaojie Shen

Main category: cs.CV

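TL;DR: This paper proposes the VG3S framework, which injects geometric priors from vision foundation models (VFMs) into Gaussian-splatting-based 3D semantic occupancy prediction and substantially improves IoU and mIoU on the nuScenes dataset.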

Details

Motivation: In purely vision-centric paradigms, the geometric cues that 3D Gaussian generation relies on are insufficient to support high-quality occupancy prediction, whereas vision foundation models have strong geometric representation ability that can fill this gap. Method: Visual Geometry Grounded Gaussian Splatting (VG3S) is proposed, with a plug-and-play hierarchical geometric feature adapter that converts the generic tokens of a frozen VFM into geometry-aware features suited to occupancy prediction through feature aggregation, task-specific alignment, and multi-scale restructuring. Result: On the nuScenes occupancy benchmark, VG3S improves IoU by 12.6% and mIoU by 7.5% over the baseline, and it adapts to a range of VFMs, supporting the effectiveness and generality of transferring geometric priors. Conclusion: Injecting the geometric priors of pretrained VFMs into Gaussian occupancy modeling is an effective and general strategy that markedly improves vision-only 3D semantic occupancy prediction. Abstract: 3D semantic occupancy prediction has become a crucial perception task for comprehensive scene understanding in autonomous driving. While recent advances have explored 3D Gaussian splatting for occupancy modeling to substantially reduce computational overhead, the generation of high-quality 3D Gaussians relies heavily on accurate geometric cues, which are often insufficient in purely vision-centric paradigms. To bridge this gap, we advocate for injecting the strong geometric grounding capability from Vision Foundation Models (VFMs) into occupancy prediction. In this regard, we introduce Visual Geometry Grounded Gaussian Splatting (VG3S), a novel framework that empowers Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Specifically, to fully exploit the rich 3D geometric priors from a frozen VFM, we propose a plug-and-play hierarchical geometric feature adapter, which can effectively transform generic VFM tokens via feature aggregation, task-specific alignment, and multi-scale restructuring. Extensive experiments on the nuScenes occupancy benchmark demonstrate that VG3S achieves remarkable improvements of 12.6% in IoU and 7.5% in mIoU over the baseline. Furthermore, we show that VG3S generalizes seamlessly across diverse VFMs, consistently enhancing occupancy prediction accuracy and firmly underscoring the immense value of integrating priors derived from powerful, pre-trained geometry-grounded VFMs.
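
The two metrics quoted in this entry are easy to confuse, so a minimal sketch may help. It assumes the usual occupancy-benchmark convention that IoU is class-agnostic (occupied vs. empty) while mIoU averages per-class IoU over the semantic classes. The label id `EMPTY = 0`, the function names, and the toy voxel lists are illustrative assumptions, not taken from the paper; real benchmarks also accumulate intersections and unions over the whole dataset and may ignore some labels.

```python
from typing import Sequence

EMPTY = 0  # assumed label id for free space (hypothetical convention)


def occupancy_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Class-agnostic IoU: every non-empty voxel counts as 'occupied'."""
    inter = sum(1 for p, g in zip(pred, gt) if p != EMPTY and g != EMPTY)
    union = sum(1 for p, g in zip(pred, gt) if p != EMPTY or g != EMPTY)
    return inter / union if union else 0.0


def semantic_miou(pred: Sequence[int], gt: Sequence[int], num_classes: int) -> float:
    """Mean of per-class IoU over the semantic classes 1..num_classes-1."""
    ious = []
    for c in range(1, num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened voxel grid with two semantic classes (1 and 2).
pred = [0, 1, 1, 2, 2, 0]
gt = [0, 1, 2, 2, 2, 1]
print(occupancy_iou(pred, gt))     # geometry only
print(semantic_miou(pred, gt, 3))  # geometry and semantics
```

On this toy grid the geometric IoU is higher than the mIoU, because a voxel that is correctly predicted as occupied but given the wrong class still counts toward IoU.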

[146] Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

Xiaoxing You,Qiang Huang,Lingyu Li,Xiaojun Chang,Jun Yu

Main category: cs.CV

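TL;DR: This paper proposes CoE, a training-free multimodal summarization framework that uses a Chain-of-Events and a Hierarchical Event Graph (HEG) for explicit cross-modal grounding and temporal reasoning, and clearly outperforms existing methods on multiple datasets.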

Details

Motivation: Existing multimodal summarization methods have three problems: reliance on domain-specific supervision, implicit and weak cross-modal grounding, and no event-level temporal modeling. Method: The CoE framework builds a Hierarchical Event Graph (HEG) that encodes textual semantics into an explicit event hierarchy, uses it to guide visual cue localization and the modeling of event evolution and causal transitions, and applies lightweight style adaptation for domain alignment. Result: On eight diverse datasets, CoE outperforms video chain-of-thought (CoT) baselines by a wide margin, with average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore. Conclusion: CoE is a training-free, structured, and interpretable multimodal summarization paradigm with strong cross-domain generalization. Abstract: Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that **CoE** consistently outperforms state-of-the-art video CoT baselines, achieving average gains of **+3.04 ROUGE**, **+9.51 CIDEr**, and **+1.88 BERTScore**, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.

[147] EntON: Eigenentropy-Optimized Neighborhood Densification in 3D Gaussian Splatting

Miriam Jäger,Boris Jutzi

Main category: cs.CV

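TL;DR: This paper proposes EntON, an Eigenentropy-optimized neighborhood densification strategy for 3D Gaussian Splatting (3DGS) that improves geometric accuracy and rendering quality. The method computes Eigenentropy from the eigenvalues of the covariance matrix over the k nearest neighbors of each Gaussian center to characterize local structural order, and uses it in an alternating optimization to guide adaptive splitting and pruning: it densifies low-Eigenentropy (flat, ordered) regions to capture surface detail and prunes high-Eigenentropy (disordered, spherical) regions. On the DTU and TUM2TWIN datasets, EntON improves geometric accuracy by up to 33% and rendering quality by up to 7%, while reducing the number of Gaussians by up to 50% and training time by up to 23%.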

Details

Motivation: In standard 3DGS, Gaussian centers are poorly aligned with the surface geometry, while surface-focused methods often sacrifice photometric accuracy; a densification strategy is needed that balances geometric accuracy and rendering quality. Method: Eigenentropy-optimized neighborhood densification (EntON) computes Eigenentropy from the eigenvalues of the k-nearest-neighbor covariance matrix to characterize local structural order, and builds an alternating optimization framework that switches between standard densification driven by view-space gradients and Eigenentropy-aware densification (splitting in low-Eigenentropy regions, pruning in high-Eigenentropy regions). Result: On the DTU and TUM2TWIN datasets, geometric accuracy improves by up to 33% and rendering quality by up to 7%, with up to 50% fewer Gaussians and up to 23% shorter training time. Conclusion: EntON strikes a good balance among geometric accuracy, rendering quality, and computational efficiency, avoids unnecessary scene expansion, and is a more robust and efficient densification strategy for 3DGS. Abstract: We present a novel Eigenentropy-optimized neighborhood densification strategy EntON in 3D Gaussian Splatting (3DGS) for geometrically accurate and high-quality rendered 3D reconstruction. While standard 3DGS produces Gaussians whose centers and surfaces are poorly aligned with the underlying object geometry, surface-focused reconstruction methods frequently sacrifice photometric accuracy. In contrast to the conventional densification strategy, which relies on the magnitude of the view-space position gradient, our approach introduces a geometry-aware strategy to guide adaptive splitting and pruning. Specifically, we compute the 3D shape feature Eigenentropy from the eigenvalues of the covariance matrix in the k-nearest neighborhood of each Gaussian center, which quantifies the local structural order. These Eigenentropy values are integrated into an alternating optimization framework: During the optimization process, the algorithm alternates between (i) standard gradient-based densification, which refines regions via view-space gradients, and (ii) Eigenentropy-aware densification, which preferentially densifies Gaussians in low-Eigenentropy (ordered, flat) neighborhoods to better capture fine geometric details on the object surface, and prunes those in high-Eigenentropy (disordered, spherical) regions. We provide quantitative and qualitative evaluations on two benchmark datasets: small-scale DTU dataset and large-scale TUM2TWIN dataset, covering man-made objects and urban scenes. 
Experiments demonstrate that our Eigenentropy-aware alternating densification strategy improves geometric accuracy by up to 33% and rendering quality by up to 7%, while reducing the number of Gaussians by up to 50% and training time by up to 23%. Overall, EntON achieves a favorable balance between geometric accuracy, rendering quality and efficiency by avoiding unnecessary scene expansion.
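
As a rough illustration of the Eigenentropy feature described in the abstract, the sketch below uses the definition common in the point-cloud feature literature: normalize the three covariance eigenvalues so they sum to 1 and take their Shannon entropy. Whether the paper uses exactly this normalization is an assumption, and the function name and example eigenvalues are hypothetical. The eigenvalues are taken as given, so the k-nearest-neighbor search and the covariance decomposition are out of scope here.

```python
import math
from typing import Sequence


def eigenentropy(eigenvalues: Sequence[float], eps: float = 1e-12) -> float:
    """Shannon entropy of the eigenvalues after normalizing them to sum to 1."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    normalized = [max(ev, 0.0) / total for ev in eigenvalues]
    # Terms with (near-)zero weight contribute nothing to the entropy.
    return -sum(e * math.log(e) for e in normalized if e > eps)


# Hypothetical neighborhoods: a thin planar patch vs. an isotropic blob.
flat = eigenentropy([1.0, 1.0, 1e-4])
round_ = eigenentropy([1.0, 1.0, 1.0])
print(flat, round_)
```

Under this definition an isotropic neighborhood reaches the maximum value ln 3, while a planar one stays near ln 2, which matches the abstract's reading of low Eigenentropy as ordered and flat and high Eigenentropy as disordered and spherical.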

[148] Word-Anchored Temporal Forgery Localization

Tianyi Wang,Xi Shao,Harry Cheng,Yinglong Wang,Mohan Kankanhalli

Main category: cs.CV

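TL;DR: This paper proposes WAFL, a new paradigm for temporal forgery localization that recasts the task from continuous temporal regression to discrete word-level binary classification, and improves performance and efficiency with an FFR module and an ACA loss.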

Details

Motivation: Existing temporal forgery localization methods suffer from feature granularity misalignment and high computational cost. Method: Word-anchored temporal forgery localization (WAFL) is proposed, introducing a forensic feature realignment (FFR) module and an artifact-centric asymmetric (ACA) loss. Result: WAFL clearly outperforms existing methods in localization performance with fewer parameters and higher computational efficiency, in both in-dataset and cross-dataset settings. Conclusion: Through discretized modeling, feature-space realignment, and a targeted loss design, WAFL addresses the granularity and efficiency bottlenecks of temporal forgery localization. Abstract: Current temporal forgery localization (TFL) approaches typically rely on temporal boundary regression or continuous frame-level anomaly detection paradigms to derive candidate forgery proposals. However, they suffer not only from feature granularity misalignment but also from costly computation. To address these issues, we propose word-anchored temporal forgery localization (WAFL), a novel paradigm that shifts the TFL task from temporal regression and continuous localization to discrete word-level binary classification. Specifically, we first analyze the essence of temporal forgeries and identify the minimum meaningful forgery units, word tokens, and then align data preprocessing with the natural linguistic boundaries of speech. To adapt powerful pre-trained foundation backbones for feature extraction, we introduce the forensic feature realignment (FFR) module, mapping representations from the pre-trained semantic space to a discriminative forensic manifold. This allows subsequent lightweight linear classifiers to efficiently perform binary classification and accomplish the TFL task. Furthermore, to overcome the extreme class imbalance inherent to forgery detection, we design the artifact-centric asymmetric (ACA) loss, which breaks the standard precision-recall trade-off by dynamically suppressing overwhelming authentic gradients while asymmetrically prioritizing subtle forensic artifacts. Extensive experiments demonstrate that WAFL significantly outperforms state-of-the-art approaches in localization performance under both in- and cross-dataset settings, while requiring substantially fewer learnable parameters and operating at high computational efficiency.
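
To make the shift from temporal regression to word-level classification concrete, here is a small sketch of how word-level binary targets could be derived from annotated forged intervals. The overlap rule (any overlap marks the word as forged), the `Interval` alias, and the timestamps are assumptions made for illustration; the abstract does not specify the labeling rule the authors use.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def word_labels(words: List[Interval], forged: List[Interval]) -> List[int]:
    """Return 1 for each word whose time span overlaps any forged segment."""
    labels = []
    for w_start, w_end in words:
        # Two half-open intervals overlap when each starts before the other ends.
        hit = any(w_start < f_end and f_start < w_end for f_start, f_end in forged)
        labels.append(1 if hit else 0)
    return labels


# Hypothetical word timestamps from a forced aligner, and one forged span.
words = [(0.0, 0.4), (0.4, 0.9), (0.9, 1.5), (1.5, 2.0)]
forged = [(0.5, 1.0)]
labels = word_labels(words, forged)
print(labels)
```

With targets in this form, localization reduces to one binary decision per word, and a predicted forged interval can be read back from the boundaries of the words classified as forged.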

[149] Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention

Haiqing Hao,Zhipeng Sui,Rong Zou,Zijia Dai,Nikola Zubić,Davide Scaramuzza,Wenhui Wang

Main category: cs.CV

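TL;DR: This paper proposes Spatially-Sparse Linear Attention (SSLA) for low-latency object detection with event cameras. A mixture-of-spaces state decomposition and a scatter-compute-gather training procedure introduce state-level sparsity while keeping parallel training efficient. The SSLA-Det model built on SSLA reaches state-of-the-art accuracy among asynchronous methods on Gen1 and N-Caltech101 (0.375 and 0.515 mAP, respectively) while reducing per-event computation by more than 20 times.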

Details

Motivation: Asynchronous event-based neural networks offer low latency but face two bottlenecks: they are hard to train on long sequences, and accuracy gains usually come with more per-event computation and latency. Standard linear attention supports parallel training and recurrent inference, but its global state update yields a poor accuracy-efficiency trade-off and falls short of the fine-grained representations that event-based detection needs. Method: Spatially-Sparse Linear Attention (SSLA) is proposed, whose core components are (1) a mixture-of-spaces state decomposition that provides state-level sparsity and (2) a scatter-compute-gather training procedure that exploits event sparsity while preserving parallel training; on this basis, the end-to-end asynchronous linear attention detector SSLA-Det is built. Result: SSLA-Det reaches 0.375 and 0.515 mAP on Gen1 and N-Caltech101, respectively, the highest accuracy among current asynchronous methods, and reduces per-event computation by more than 20 times relative to the strongest asynchronous baseline. Conclusion: SSLA balances state sparsity and training parallelism in event-driven vision, demonstrates the potential of linear attention for low-latency event-based detection, and offers a new paradigm for efficient asynchronous perception. Abstract: Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, making them attractive for low-latency object detection. Existing asynchronous event-based neural networks realize this low-latency advantage by updating predictions event-by-event, but still suffer from two bottlenecks: recurrent architectures are difficult to train efficiently on long sequences, and improving accuracy often increases per-event computation and latency. Linear attention is appealing in this setting because it supports parallel training and recurrent inference. However, standard linear attention updates a global state for every event, yielding a poor accuracy-efficiency trade-off, which is problematic for object detection, where fine-grained representations and thus states are preferred. The key challenge is therefore to introduce sparse state activation that exploits event sparsity while preserving efficient parallel training. We propose Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism. Built on SSLA, we develop an end-to-end asynchronous linear attention model, SSLA-Det, for event-based object detection. 
On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared to the strongest prior asynchronous baseline, demonstrating the potential of linear attention for low-latency event-based vision.
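
The abstract leans on the fact that linear attention admits both a parallel (training) form and a recurrent (inference) form. The sketch below shows that equivalence for plain, unnormalized causal linear attention with a single global state; it is not SSLA itself, and the feature map, normalization, and the sparse mixture-of-spaces state are deliberately left out. All names and the toy vectors are illustrative.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def parallel_outputs(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """Causal form used for training: o_t = sum_{s<=t} (q_t . k_s) v_s.

    Written as loops for clarity; every o_t is independent of the others,
    so the outer loop could be computed in parallel.
    """
    outs = []
    for t in range(len(q)):
        o = [0.0] * len(v[0])
        for s in range(t + 1):
            w = dot(q[t], k[s])
            o = [oi + w * vi for oi, vi in zip(o, v[s])]
        outs.append(o)
    return outs


def recurrent_outputs(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """Recurrent form used for inference: S_t = S_{t-1} + k_t v_t^T, o_t = S_t^T q_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # one global d_k x d_v state
    outs = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]  # every event touches the whole state
        outs.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outs


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(parallel_outputs(q, k, v))
print(recurrent_outputs(q, k, v))
```

The inner double loop in `recurrent_outputs` is the per-event global state update that the abstract identifies as the source of the poor accuracy-efficiency trade-off, and it is the part that a sparse state activation would aim to shrink.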

[150] TaPD: Temporal-adaptive Progressive Distillation for Observation-Adaptive Trajectory Forecasting in Autonomous Driving

Mingyu Fan,Yi Liu,Hao Zhou,Deheng Qian,Mohammad Haziq Khan,Matthias Raetsch

Main category: cs.CV

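TL;DR: This paper proposes the TaPD framework, which combines an Observation-Adaptive Forecaster (OAF) and a Temporal Backfilling Module (TBM) to handle trajectory prediction under variable-length observation histories. It markedly improves prediction under short observations and can enhance other predictors in a plug-and-play manner.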

Details

Motivation: Most existing trajectory predictors assume a fixed-length observation history, and their performance drops sharply when histories are variable or extremely short in real-world settings (e.g., under occlusion or a limited sensing range). Method: The TaPD framework comprises an Observation-Adaptive Forecaster (OAF) and a Temporal Backfilling Module (TBM). OAF is based on progressive knowledge distillation (PKD), transferring knowledge from long-horizon teacher models to short-horizon student models via hierarchical feature regression, with a cosine-annealed distillation weight that balances forecasting supervision and feature alignment. For extremely short histories, TBM explicitly backfills the missing historical segments, conditioned on scene evolution, to produce context-rich trajectories that strengthen PKD. A decoupled pretrain-reconstruct-finetune protocol preserves real-motion priors. Result: On Argoverse 1 and Argoverse 2, TaPD consistently outperforms strong baselines across all observation lengths, with especially large gains under very short inputs, and improves other predictors such as HiVT in a plug-and-play manner. Conclusion: TaPD is a general, plug-and-play framework for trajectory prediction under variable-length histories that mitigates the performance degradation under short observations and improves robustness and generalization. Abstract: Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed-length histories and suffer substantial performance degradation when observations are variable or extremely short in real-world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon "teachers" to short-horizon "students" via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross-length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD and thereby improve OAF. 
We employ a decoupled pretrain-reconstruct-finetune protocol to preserve real-motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner. Code will be available at https://github.com/zhouhao94/TaPD.
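
The cosine-annealed distillation weighting mentioned in the abstract can be sketched with the standard cosine schedule. The abstract gives neither the endpoints nor the direction of the schedule, so the defaults below (weight decaying from 1.0 to 0.0 over training) and the additive combination in `total_loss` are assumptions for illustration only, as are the function names.

```python
import math


def distill_weight(step: int, total_steps: int,
                   w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Cosine-anneal from w_start (step 0) to w_end (step total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def total_loss(forecast_loss: float, feature_loss: float,
               step: int, total_steps: int) -> float:
    """Forecasting loss plus the annealed feature-alignment (distillation) term."""
    return forecast_loss + distill_weight(step, total_steps) * feature_loss


for step in (0, 250, 500, 750, 1000):
    print(step, round(distill_weight(step, 1000), 4))
```

A schedule of this shape changes slowly at both ends and fastest in the middle, which is one common way to avoid an abrupt hand-over between two training signals.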

[151] Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation

Keshen Zhou,Runnan Chen,Mingming Gong,Tongliang Liu

Main category: cs.CV

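TL;DR: This paper proposes HCF-RES, a multi-modal framework that uses hierarchical visual semantic decomposition and progressive multi-level fusion to improve natural-language-based object localization in 3D scenes, with particularly strong results on fine-grained descriptions and multi-target or zero-target cases.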

Details

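Motivation: Existing methods rely solely on sparse point clouds and lack rich visual semantics, which makes fine-grained, multi-target, or zero-target 3D referring expression segmentation difficult. Method: The HCF-RES framework has two parts: (1) Hierarchical Visual Semantic Decomposition, which uses SAM instance masks to guide CLIP encoding at pixel-level and instance-level granularities and preserves object boundaries during 2D-to-3D projection; and (2) Progressive Multi-level Fusion, which comprises intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided feature refinement. Result: State-of-the-art performance on the ScanRefer and Multi3DRefer datasets. Conclusion: Collaborative multi-modal modeling that fuses 2D visual semantics with 3D geometric information improves the robustness and accuracy of 3D referring expression segmentation. Abstract: Generalised 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes based on natural language, even when descriptions match multiple or zero targets. Existing methods rely solely on sparse point clouds, lacking rich visual semantics for fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations. First, Hierarchical Visual Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities -- pixel-level and instance-level features -- preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.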

[152] NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving

Kai Luo,Xu Wang,Rui Fan,Kailun Yang

Main category: cs.CV

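TL;DR: This paper proposes NOVA, a new open-vocabulary autoregressive paradigm for 3D multi-object tracking based on large language models. It models trajectories as spatio-temporal semantic sequences and uses language priors and commonsense reasoning to generalize to unknown categories, substantially improving novel-category tracking on nuScenes and other datasets.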

Details

Motivation: Existing 3D multi-object tracking methods are limited by closed-set assumptions and by heuristic matching that lacks semantic understanding, so they generalize poorly to unknown target categories. Method: The Next-step Open-Vocabulary Autoregression (NOVA) paradigm models 3D trajectories as structured spatio-temporal semantic sequences and uses the autoregressive ability of large language models (LLMs) for next-step sequence completion, combining motion continuity with language priors and exploiting the hierarchical structure of language space for fine-grained semantic disambiguation and long-range identity consistency. Result: Effectiveness is validated on nuScenes, V2X-Seq-SPD, and KITTI; on nuScenes, AMOTA for Novel categories reaches 22.41%, an absolute improvement of 20.21% over the baseline, achieved with a compact 0.5B-parameter autoregressive model. Conclusion: NOVA brings generative semantic modeling to 3D MOT, moves beyond the traditional distance-based matching paradigm, and strengthens cross-category generalization and semantic robustness in open-world settings. Abstract: Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.

[153] GazeMoE: Perception of Gaze Target with Mixture-of-Experts

Zhuangzhuang Dai,Zhongxi Lu,Vincent G. Zakka,Luis J. Manso,Jose M Alcaraz Calero,Chen Li

Main category: cs.CV

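TL;DR: This paper proposes the GazeMoE framework, which uses a frozen vision foundation model and Mixture-of-Experts (MoE) modules to adaptively fuse multi-modal cues (eyes, head pose, gestures, context) for gaze target estimation. With a class-balancing auxiliary loss and targeted data augmentation, it improves performance on challenging tasks such as in-frame/out-of-frame classification.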

Details

Motivation: Gaze target estimation from visible images still faces challenges in the generalizability of neural architectures and the efficient fusion of multi-modal cues (eyes, head pose, gestures, context); pretrained vision foundation models are promising but need adaptive and efficient decoding mechanisms. Method: The end-to-end GazeMoE framework (1) extracts features with a frozen vision foundation model; (2) selectively activates gaze-target-related multi-modal cues through MoE modules; (3) adds a class-balancing auxiliary loss to mitigate the in-frame/out-of-frame imbalance; and (4) applies strategic data augmentations such as region-specific cropping and photometric transformations to improve robustness. Result: State-of-the-art performance on multiple benchmark datasets, clearly outperforming existing methods, especially on challenging gaze estimation tasks such as in-frame/out-of-frame classification; the code and pretrained models are released. Conclusion: GazeMoE shows that the MoE mechanism is effective for multi-modal gaze target estimation and offers a new paradigm for lightweight, adaptive, and robust visual understanding built on frozen foundation models. Abstract: Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE

[154] ODD-SEC: Onboard Drone Detection with a Spinning Event Camera

Kuan Dai,Hongxin Zhang,Sheng Zhong,Yi Zhou

Main category: cs.CV

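TL;DR: This paper proposes a real-time drone detection system for moving carriers (such as quadrupedal robots or unmanned ground vehicles) that uses a spinning event camera to obtain a 360° horizontal field of view and bearing estimates. It introduces an image-like event representation that needs no motion compensation and pairs it with a lightweight neural network for real-time, high-precision detection (mean angular error below 2°), validated on a Jetson Orin NX.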

Details

Motivation: Most existing event-camera-based drone detection methods assume a static camera and are hard to apply when the camera is mounted on a moving carrier; conventional frame cameras also degrade with fast-moving targets or adverse illumination, so a solution that is both robust and widely applicable is needed. Method: A spinning event camera captures a 360° field of view; a new image-like event representation that requires no motion compensation is proposed; a lightweight neural network performs spatiotemporal feature learning; the whole system is deployed on a Jetson Orin NX for embedded real-time operation. Result: Reliable real-time detection in challenging outdoor conditions with a mean bearing error below 2°; the system runs in real time on a Jetson Orin NX; the code will be open-sourced. Conclusion: The work removes a key limitation on using event cameras on moving carriers and provides a feasible, efficient solution for low-power, robust drone perception on dynamic platforms, moving event-camera technology toward real field deployment. Abstract: The rapid proliferation of drones requires balancing innovation with regulation. To address security and privacy concerns, techniques for drone detection have attracted significant attention. Passive solutions, such as frame camera-based systems, offer versatility and energy efficiency under typical conditions but are fundamentally constrained by their operational principles in scenarios involving fast-moving targets or adverse illumination. Inspired by biological vision, event cameras asynchronously detect per-pixel brightness changes, offering high dynamic range and microsecond-level responsiveness that make them uniquely suited for drone detection in conditions beyond the reach of conventional frame-based cameras. However, the design of most existing event-based solutions assumes a static camera, greatly limiting their applicability to moving carriers (such as quadrupedal robots or unmanned ground vehicles) during field operations. In this paper, we introduce a real-time drone detection system designed for deployment on moving carriers. The system utilizes a spinning event-based camera, providing a 360° horizontal field of view and enabling bearing estimation of detected drones. A key contribution is a novel image-like event representation that operates without motion compensation, coupled with a lightweight neural network architecture for efficient spatiotemporal learning. Implemented on an onboard Jetson Orin NX, the system can operate in real time. 
Outdoor experimental results validate reliable detection with a mean angular error below 2° under challenging conditions, underscoring its suitability for real-world surveillance applications. We will open-source our complete pipeline to support future research.
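
Because the system reports bearings over a full 360° field of view, an angular error metric has to handle wrap-around (a prediction of 359° against a ground truth of 1° is off by 2°, not 358°). The sketch below shows one common way to compute such an error; how the authors actually compute their mean angular error is not stated in the abstract, and the function names and sample bearings are hypothetical.

```python
from typing import Sequence


def angular_error_deg(pred: float, truth: float) -> float:
    """Smallest absolute difference between two bearings, in degrees [0, 180]."""
    # Shift by 180, wrap into [0, 360), shift back: the result lies in [-180, 180).
    return abs((pred - truth + 180.0) % 360.0 - 180.0)


def mean_angular_error_deg(preds: Sequence[float], truths: Sequence[float]) -> float:
    """Average wrap-aware bearing error over paired predictions and ground truths."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, truths)]
    return sum(errors) / len(errors)


preds = [359.0, 1.0, 10.0]
truths = [1.0, 359.0, 12.5]
print(mean_angular_error_deg(preds, truths))
```

This relies on Python's `%` returning a non-negative result for a positive modulus; in languages where the remainder keeps the sign of the dividend, the wrap step needs an extra adjustment.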

[155] HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models

Lincen Bai,Hedi Tabia,Raul Santos-Rodriguez

Main category: cs.CV

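TL;DR: This paper proposes HiPP-Prune, a hierarchical preference-conditioned structured pruning framework for vision-language models (VLMs). By introducing a visual sensitivity signal and multi-objective optimization (task utility, hallucination robustness, compression, and stability), it enables controllable, robust, and efficient pruning.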

Details

Motivation: Conventional VLM pruning tends to aggravate problems such as object hallucination and struggles to preserve both task utility and visual grounding; controllable, robust pruning framed as resource allocation under multiple objectives is needed. Method: The HiPP-Prune framework (1) models pruning as hierarchical resource allocation conditioned on preferences and outputs a global pruning blueprint (overall sparsity plus layer-wise allocation); (2) introduces an attention-flow-based visual sensitivity signal to protect critical layers; and (3) uses plan-level Group Relative Policy Optimization (GRPO) to optimize a multi-objective return (task utility, POPE hallucination robustness, compression, and a synaptic-flow-inspired stability proxy). Result: Experiments on LLaVA with POPE and ScienceQA show that HiPP-Prune discovers diverse non-dominated pruning plans and offers adjustable robustness-utility trade-offs at matched sparsity. Conclusion: HiPP-Prune offers an interpretable, controllable pruning paradigm for VLMs that targets multi-objective failure modes and improves deployment robustness and practicality at high sparsity. Abstract: Pruning vision-language models (VLMs) for efficient deployment is challenging because compression can affect not only task utility but also visual grounding, often amplifying object hallucinations even at the same sparsity level. We present HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as conditional resource allocation under multiple objectives. HiPP-Prune makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into an overall sparsity budget and a layer-wise allocation, enabling queryable trade-offs via a user-specified preference vector. To account for VLM-specific failure modes, our policy state integrates a visual sensitivity signal derived from attention flow between vision tokens and language hidden states, discouraging over-pruning of vision-critical layers that facilitate cross-modal fusion. We optimize pruning plans with plan-level Group Relative Policy Optimization (GRPO) under a multi-objective return that combines task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy to reduce unproductive exploration in high-sparsity regimes. Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness-utility trade-offs under matched sparsity budgets.
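
The plan-level GRPO step can be pictured as follows: several pruning plans are sampled as one group, each plan receives a scalar return, and the returns are standardized within the group to form advantages. The sketch assumes the multi-objective return is a preference-weighted sum of four per-objective scores; that scalarization, the objective names, and all numbers are illustrative assumptions, since the abstract only says the return combines these terms.

```python
import statistics
from typing import Dict, List

OBJECTIVES = ("utility", "robustness", "compression", "stability")


def scalar_return(scores: Dict[str, float], preference: Dict[str, float]) -> float:
    """Preference-weighted sum of per-objective scores (assumed scalarization)."""
    return sum(preference[name] * scores[name] for name in OBJECTIVES)


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize returns within one group of sampled plans (GRPO-style)."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]


# Hypothetical preference vector and scores for three sampled pruning plans.
preference = {"utility": 0.4, "robustness": 0.4, "compression": 0.1, "stability": 0.1}
plans = [
    {"utility": 0.80, "robustness": 0.60, "compression": 0.30, "stability": 0.9},
    {"utility": 0.70, "robustness": 0.75, "compression": 0.50, "stability": 0.8},
    {"utility": 0.50, "robustness": 0.40, "compression": 0.70, "stability": 0.6},
]
returns = [scalar_return(p, preference) for p in plans]
advantages = group_relative_advantages(returns)
print(returns, advantages)
```

Because advantages are relative to the group mean, changing the preference vector changes which plans are reinforced without requiring a separate learned value function.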

[156] Spectral and Trajectory Regularization for Diffusion Transformer Super-Resolution

Jingkai Wang,Yixin Tang,Jue Gong,Jiatong Li,Shu Li,Libo Liu,Jianliang Lan,Yutong Liu,Yulun Zhang

Main category: cs.CV

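TL;DR: This paper proposes StrSR, a new one-step adversarial distillation framework for real-world image super-resolution (Real-ISR) that uses spectral and trajectory regularization to address the trajectory mismatch and periodic artifacts in DiT distillation, substantially improving performance.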

Details

Motivation: Diffusion transformers (DiT) show great potential for real-world image super-resolution (Real-ISR), but their iterative sampling is computationally expensive and calls for one-step distillation; existing one-step distillation methods perform poorly on Real-ISR, suffering from trajectory mismatch and severe grid-like periodic artifacts. Method: StrSR is proposed, with (1) an asymmetric discriminative distillation architecture to bridge the trajectory gap and (2) a frequency distribution matching strategy to suppress the DiT-specific periodic artifacts caused by high-frequency spectral leakage. Result: On Real-ISR, StrSR reaches state-of-the-art (SOTA) performance in both quantitative metrics and visual perceptual quality. Conclusion: StrSR addresses the key challenges of one-step DiT distillation for Real-ISR and offers a new paradigm for efficient, high-quality real-world image super-resolution. Abstract: Diffusion transformer (DiT) architectures show great potential for real-world image super-resolution (Real-ISR). However, their computationally expensive iterative sampling necessitates one-step distillation. Existing one-step distillation methods struggle with Real-ISR on DiT. They suffer from fundamental trajectory mismatch and generate severe grid-like periodic artifacts. To tackle these challenges, we propose StrSR, a novel one-step adversarial distillation framework featuring spectral and trajectory regularization. Specifically, we propose an asymmetric discriminative distillation architecture to bridge the trajectory gap. Additionally, we design a frequency distribution matching strategy to effectively suppress DiT-specific periodic artifacts caused by high-frequency spectral leakage. Extensive experiments demonstrate that StrSR achieves state-of-the-art performance in Real-ISR, across both quantitative metrics and visual perception. The code and models will be released at https://github.com/jkwang28/StrSR.

[157] Can we Trust Unreliable Voxels? Exploring 3D Semantic Occupancy Prediction under Label Noise

Wenxin Li,Kunyu Peng,Di Wen,Junwei Zheng,Jiale Wei,Mengfei Duan,Yuheng Zhang,Rui Fan,Kailun Yang

Main category: cs.CV

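TL;DR: This paper proposes the OccNL benchmark and the DPR-Occ framework to address label noise in 3D semantic occupancy prediction caused by dynamic trailing and structural artifacts, substantially improving robustness under heavy noise.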

Details

Motivation: Real-world 3D voxel annotations are often corrupted by structural artifacts and dynamic trailing effects, raising a fundamental question about the safety of autonomous systems trained on such unreliable supervision. Method: OccNL, the first 3D occupancy prediction benchmark for occupancy-asymmetric and dynamic trailing noise, is constructed; the DPR-Occ framework is proposed, which uses dual-source partial label reasoning, combining temporal model memory with representation-level structural affinity, to dynamically expand and prune candidate label sets. Result: On SemanticKITTI, even at 90% label noise, DPR-Occ improves over existing label noise learning baselines adapted to the 3D task by up to 2.57% mIoU and 13.91% IoU, preventing geometric and semantic collapse. Conclusion: OccNL and DPR-Occ are the first to systematically bridge label noise learning and 3D perception, providing a reliable foundation for safety-critical robotic perception in dynamic environments. Abstract: 3D semantic occupancy prediction is a cornerstone of robotic perception, yet real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects. This raises a critical but underexplored question: can autonomous systems safely rely on such unreliable occupancy supervision? To systematically investigate this issue, we establish OccNL, the first benchmark dedicated to 3D occupancy under occupancy-asymmetric and dynamic trailing noise. Our analysis reveals a fundamental domain gap: state-of-the-art 2D label noise learning strategies collapse catastrophically in sparse 3D voxel spaces, exposing a critical vulnerability in existing paradigms. To address this challenge, we propose DPR-Occ, a principled label noise-robust framework that constructs reliable supervision through dual-source partial label reasoning. By synergizing temporal model memory with representation-level structural affinity, DPR-Occ dynamically expands and prunes candidate label sets to preserve true semantics while suppressing noise propagation. Extensive experiments on SemanticKITTI demonstrate that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Notably, even at 90% label noise, our method achieves significant performance gains (up to 2.57% mIoU and 13.91% IoU) over existing label noise learning baselines adapted to the 3D occupancy prediction task. By bridging label noise learning and 3D perception, OccNL and DPR-Occ provide a reliable foundation for safety-critical robotic perception in dynamic environments. 
The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL.

[158] Attribute Distribution Modeling and Semantic-Visual Alignment for Generative Zero-shot Learning

Haojie Pu,Zhuoming Li,Yongbiao Gao,Yuheng Jia

Main category: cs.CV

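TL;DR: This paper proposes ADiVA, which models attribute distributions and performs semantic-visual alignment to address the class-instance gap and the semantic-visual domain gap in generative zero-shot learning, and clearly outperforms existing methods on multiple benchmark datasets.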

Details

Motivation: Generative zero-shot learning has two intrinsic challenges: class-level attributes cannot capture instance-specific visual appearance (the class-instance gap), and the semantic and visual feature distributions are substantially mismatched (the semantic-visual domain gap). Method: ADiVA is proposed, with two modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that refines semantic representations to better reflect visual structure. Result: On three widely used benchmark datasets, including AWA2 and SUN, ADiVA clearly outperforms state-of-the-art methods (e.g., gains of 4.7% and 6.1% on AWA2 and SUN, respectively), and it can serve as a plugin to improve other generative ZSL methods. Conclusion: By jointly modeling attribute distributions and performing explicit semantic-visual alignment, ADiVA mitigates the class-instance gap and the semantic-visual domain gap in generative zero-shot learning and improves generalization and performance. Abstract: Generative zero-shot learning (ZSL) synthesizes features for unseen classes, leveraging semantic conditions to transfer knowledge from seen classes. However, it also introduces two intrinsic challenges: (1) class-level attributes fail to capture instance-specific visual appearances due to substantial intra-class variability, thus causing the class-instance gap; (2) the substantial mismatch between semantic and visual feature distributions, manifested in inter-class correlations, gives rise to the semantic-visual domain gap. To address these challenges, we propose an Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, jointly modeling attribute distributions and performing explicit semantic-visual alignment. Specifically, our ADiVA consists of two modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that refines semantic representations to better reflect visual structures. Experiments on three widely used benchmark datasets demonstrate that ADiVA significantly outperforms state-of-the-art methods (e.g., achieving gains of 4.7% and 6.1% on AWA2 and SUN, respectively). Moreover, our approach can serve as a plugin to enhance existing generative ZSL methods.

[159] FlowMotion: Training-Free Flow Guidance for Video Motion Transfer

Zhen Wang,Youcan Xu,Jun Xiao,Long Chen

Main category: cs.CV

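TL;DR: This paper proposes FlowMotion, a training-free video motion transfer framework that achieves efficient and flexible motion transfer by directly using the predicted outputs of flow-based text-to-video (T2V) models. Its core components are flow guidance and velocity regularization, which improve efficiency and motion smoothness.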

Details

Motivation: Existing training-free video motion transfer methods build motion guidance from the intermediate outputs of pretrained T2V models, which is computationally heavy and inflexible; the authors observe that early latent predictions already carry rich temporal information that can be used more efficiently. Method: The FlowMotion framework (1) introduces flow guidance, which extracts motion representations from the early latent predictions of a flow-based T2V model and aligns them between source and generated videos; and (2) adds a velocity regularization strategy to stabilize optimization and keep motion continuous. It operates only on model predictions, with no additional training or fine-tuning. Result: FlowMotion is clearly more time- and resource-efficient than existing SOTA methods while maintaining competitive generation quality and motion fidelity. Conclusion: Modeling motion directly from the latent predictions of flow-based T2V models is an efficient, flexible, and feasible route, and FlowMotion offers a new paradigm for training-free video motion transfer. Abstract: Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.

[160] 3D CBCT Artefact Removal Using Perpendicular Score-Based Diffusion Models

Susanne Schaub,Florentin Bieder,Matheus L. Oliveira,Yulan Wang,Dorothea Dagassan-Berndt,Michael M. Bornstein,Philippe C. Cattin

Main category: cs.CV

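TL;DR: This paper proposes a 3D dental implant inpainting method in the projection domain based on perpendicular score-based diffusion models. By combining two 2D score-based diffusion models to model the 3D distribution of the projection series, it reduces implant-induced artefacts in CBCT and improves image quality and diagnostic accuracy.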

Details

Motivation: In dentistry, CBCT is prone to artefacts caused by high-density implants; existing diffusion-based implant inpainting methods process independent 2D projections and ignore the correlations between projections, leading to inconsistent reconstructions. Method: A 3D implant inpainting method based on perpendicular score-based diffusion models: a 2D score-based diffusion model is trained in each of two orthogonal planes, and the two are combined during sampling to model the 3D distribution of the projection series, achieving 3D-consistent inpainting in the projection domain. Result: The method substantially improves CBCT image quality and produces high-quality, artefact-reduced 3D reconstructions, showing promise for clinical imaging. Conclusion: The proposed 3D diffusion modelling approach overcomes the limitations of independent 2D inpainting and offers a more robust and consistent paradigm for CBCT artefact correction. Abstract: Cone-beam computed tomography (CBCT) is a widely used 3D imaging technique in dentistry, offering high-resolution images while minimising radiation exposure for patients. However, CBCT is highly susceptible to artefacts arising from high-density objects such as dental implants, which can compromise image quality and diagnostic accuracy. To reduce artefacts, implant inpainting in the sequence of projections plays a crucial role in many artefact reduction approaches. Recently, diffusion models have achieved state-of-the-art results in image generation and have been widely applied to image inpainting tasks. However, to our knowledge, existing diffusion-based methods for implant inpainting operate on independent 2D projections. This approach neglects the correlations among individual projections, resulting in inconsistencies in the reconstructed images. To address this, we propose a 3D dental implant inpainting approach based on perpendicular score-based diffusion models, each trained in two different planes and operating in the projection domain. The 3D distribution of the projection series is modelled by combining the two 2D score-based diffusion models in the sampling scheme. Our results demonstrate the method's effectiveness in producing high-quality, artefact-reduced 3D CBCT images, making it a promising solution for improving clinical imaging.

[161] DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

Walid Bousselham,Angie Boggust,Hendrik Strobelt,Hilde Kuehne

Main category: cs.CV

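TL;DR: This paper proposes DEX-AR, a method for explaining the decision process of autoregressive vision-language models (VLMs) by generating per-token and sequence-level 2D heatmaps that localize the image regions most influential for the text output.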

Details

Motivation: Traditional explainability methods are poorly suited to modern autoregressive VLMs because of their complex token-by-token generation process and the intricate interactions between visual and textual modalities. Method: DEX-AR computes layer-wise gradients with respect to attention maps during token generation and combines dynamic head filtering (selecting attention heads focused on visual information) with sequence-level filtering (distinguishing visually grounded tokens from purely linguistic ones) to produce multi-granularity heatmap explanations. Result: Evaluated on ImageNet, VQAv2, and PascalVOC, DEX-AR yields consistent improvements on both perturbation-based metrics (normalized perplexity) and segmentation-based metrics. Conclusion: DEX-AR is an effective and scalable explainability method for autoregressive VLMs that accounts for the varying importance of layers and tokens and markedly improves the transparency and trustworthiness of joint vision-language reasoning. Abstract: As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The proposed method interprets autoregressive VLMs (including the varying importance of layers and generated tokens) by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually-grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics, using a novel normalized perplexity measure, and segmentation-based metrics.

[162] Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces

Eitan Shaar,Ariel Shaulov,Yalcin Tur,Gal Chechik,Ravid Shwartz-Ziv

Main category: cs.CV

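TL;DR: This paper proposes LTA (Latent Transfer Attack), a transfer attack that optimizes adversarial perturbations in the latent space of a pretrained Stable Diffusion VAE. Compared with conventional pixel-space attacks, the resulting perturbations are more robust, lower in frequency, and transfer better.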

Details

Motivation: Existing pixel-space adversarial attacks are sensitive to preprocessing, transfer poorly, and produce high-frequency noise; a more robust, structured space for optimizing perturbations is needed. Method: Perturbations are optimized in the latent space of the Stable Diffusion VAE to maximize a surrogate classifier loss, with a soft constraint ensuring that the decoded image satisfies a pixel-level ℓ∞ budget; EOT (randomized resizing, interpolation, and cropping) and periodic latent Gaussian smoothing improve robustness and stability. Result: LTA achieves strong cross-model transfer success on CNN and ViT targets and produces spatially coherent, predominantly low-frequency perturbations that occupy a distinct point in the transfer-quality trade-off. Conclusion: The latent spaces of pretrained generative models are an effective and structured domain for adversarial optimization, linking robustness evaluation with modern generative priors. Abstract: Adversarial attacks are a central tool for probing the robustness of modern vision models, yet most methods optimize perturbations directly in pixel space under $\ell_\infty$ or $\ell_2$ constraints. While effective in white-box settings, pixel-space optimization often produces high-frequency, texture-like noise that is brittle to common preprocessing (e.g., resizing and cropping) and transfers poorly across architectures. We propose $\textbf{LTA}$ ($\textbf{L}$atent $\textbf{T}$ransfer $\textbf{A}$ttack), a transfer-based attack that instead optimizes perturbations in the latent space of a pretrained Stable Diffusion VAE. Given a clean image, we encode it into a latent code and optimize the latent representation to maximize a surrogate classifier loss, while softly enforcing a pixel-space $\ell_\infty$ budget after decoding. To improve robustness to resolution mismatch and standard input pipelines, we incorporate Expectation Over Transformations (EOT) via randomized resizing, interpolation, and cropping, and apply periodic latent Gaussian smoothing to suppress emerging artifacts and stabilize optimization. Across a suite of CNN and vision-transformer targets, LTA achieves strong transfer attack success while producing spatially coherent, predominantly low-frequency perturbations that differ qualitatively from pixel-space baselines and occupy a distinct point in the transfer-quality trade-off. Our results highlight pretrained generative latent spaces as an effective and structured domain for adversarial optimization, bridging robustness evaluation with modern generative priors.

[163] WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection

Peng Chen,Chao Huang

Main category: cs.CV

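TL;DR: This paper proposes a wavelet-enhanced mixture-of-experts prompt learning method for zero-shot anomaly detection (ZSAD). It models global semantics with a variational autoencoder, extracts multi-frequency image features via wavelet decomposition, and adds a semantic-aware mixture-of-experts module, substantially improving the detection of complex and subtle anomalies.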

Details

Motivation: Existing zero-shot anomaly detection methods rely on fixed textual prompts and use only spatial-domain features, so they struggle to capture complex semantics and subtle anomalies. Method: A wavelet-enhanced mixture-of-experts prompt learning method: a variational autoencoder models global semantics and injects them into the prompts; wavelet decomposition extracts multi-frequency image features that dynamically refine the textual embeddings through cross-modal interaction; and a semantic-aware mixture-of-experts module aggregates contextual information. Result: Extensive experiments on 14 industrial and medical datasets confirm the method's effectiveness and show substantially improved zero-shot anomaly detection performance. Conclusion: The method overcomes the limitations of fixed prompts and single-domain features, strengthening generalization to and detection of diverse and subtle anomalies. Abstract: Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
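
The abstract does not say which wavelet family WMoE-CLIP uses. As a rough illustration of what "multi-frequency image features" from a wavelet decomposition look like, the sketch below applies one level of a 2D Haar transform (an assumption, chosen for simplicity) to a small grayscale patch. The function name `haar2d` and the sub-band keys are hypothetical.

```python
def haar2d(img):
    """One level of a 2D Haar transform (averaging variant) on an even-sized grid.

    Illustrative only: the wavelet actually used by the paper is not stated
    in the abstract. Returns a low-frequency approximation band plus three
    detail bands, each half the input resolution.
    """
    h, w = len(img), len(img[0])
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    bands = {"approx": [], "horiz": [], "vert": [], "diag": []}
    for i in range(0, h, 2):
        row = {name: [] for name in bands}
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            row["approx"].append((a + b + c + d) / 4)  # local average
            row["horiz"].append((a - b + c - d) / 4)   # left-right difference
            row["vert"].append((a + b - c - d) / 4)    # top-bottom difference
            row["diag"].append((a - b - c + d) / 4)    # diagonal difference
        for name in bands:
            bands[name].append(row[name])
    return bands
```

On a flat patch all three detail bands are zero, whereas a single deviating pixel produces non-zero detail coefficients. This is the kind of high-frequency cue that purely spatial features may smooth away.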

[164] P-SLCR: Unsupervised Point Cloud Semantic Segmentation via Prototypes Structure Learning and Consistent Reasoning

Lixin Zhan,Jie Jiang,Tianjian Zhou,Yukun Du,Yan Zheng,Xuehu Duan

Main category: cs.CV

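TL;DR: This paper proposes P-SLCR, a prototype library-driven unsupervised point cloud semantic segmentation method that improves performance through structure learning and consistent reasoning. It achieves the best unsupervised results on several datasets and surpasses the fully supervised PointNet baseline on S3DIS Area-5.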

Details

Motivation: Point cloud semantic segmentation currently depends heavily on manual annotation, while unsupervised methods for raw point clouds are still at an early stage and face the challenges of having neither labels nor pre-training. Method: P-SLCR consists of (1) Consistent Structure Learning, which selects high-quality features to establish structural feature learning between consistent points and the library of consistent prototypes, and (2) Semantic Relation Consistent Reasoning, which constructs a prototype inter-relation matrix between the consistent and ambiguous prototype libraries to constrain them and preserve semantic consistency. Result: Best unsupervised performance on S3DIS (Area-5), SemanticKITTI, and ScanNet; mIoU on S3DIS Area-5 reaches 47.1%, exceeding the fully supervised PointNet by 2.5%. Conclusion: P-SLCR shows that prototype libraries and structural consistency modeling are effective for unsupervised point cloud segmentation and offers a new way to reduce reliance on annotation. Abstract: Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre-training. The development of effective strategies is crucial in this context. In this paper, we propose a novel prototype library-driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P-SLCR). First, we propose a Consistent Structure Learning to establish structural feature learning between consistent points and the library of consistent prototypes by selecting high-quality features. Second, we propose a Semantic Relation Consistent Reasoning that constructs a prototype inter-relation matrix between consistent and ambiguous prototype libraries separately. This process ensures the preservation of semantic consistency by imposing constraints on consistent and ambiguous prototype libraries through the prototype inter-relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and ScanNet datasets, achieving the best performance compared to unsupervised methods. Specifically, the mIoU of 47.1% is achieved for Area-5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%.
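
For readers unfamiliar with the metric, mIoU is the per-class intersection-over-union averaged over classes. The sketch below is a minimal per-point version and is not the authors' evaluation code. The function name `mean_iou` is hypothetical, and it assumes predicted cluster ids have already been matched to ground-truth classes, a step that unsupervised methods need but that is not shown here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across the classes present in pred or target.

    pred and target are equal-length sequences of integer class ids,
    one per point. Classes absent from both are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Because every class contributes equally regardless of how many points it covers, errors on rare classes lower mIoU as much as errors on frequent ones.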

[165] WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

Weilun Feng,Guoxin Fan,Haotong Qin,Chuanguang Yang,Mingqiang Wu,Yuqi Li,Xiangqi Li,Zhulin An,Libo Huang,Dingrui Wang,Longlong Liao,Michele Magno,Yongjun Xu

Main category: cs.CV

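TL;DR: This paper proposes WorldCache, a caching framework that uses curvature-guided heterogeneous token prediction and chaotic-prioritized adaptive skipping to accelerate diffusion world models by up to 3.7× while maintaining 98% rollout quality.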

Details

Motivation: Diffusion world models are promising, but iterative denoising is computationally expensive; existing feature-caching methods do not transfer directly because of token heterogeneity and non-uniform temporal dynamics. Method: The WorldCache framework comprises Curvature-guided Heterogeneous Token Prediction (a physics-grounded curvature score plus a Hermite-guided damped predictor) and Chaotic-prioritized Adaptive Skipping (recomputing bottleneck tokens dynamically based on a curvature-normalized drift signal). Result: Up to 3.7× end-to-end speedup on diffusion world models while maintaining 98% rollout quality. Conclusion: WorldCache is designed around problems specific to diffusion world models; it substantially improves inference efficiency while largely preserving quality and suits resource-constrained scenarios. Abstract: Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: \emph{token heterogeneity} from multi-modal coupling and spatial variation, and \emph{non-uniform temporal dynamics} where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose \textbf{WorldCache}, a caching framework tailored to diffusion world models. We introduce \textit{Curvature-guided Heterogeneous Token Prediction}, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor for chaotic tokens with abrupt direction changes. We also design \textit{Chaotic-prioritized Adaptive Skipping}, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to \textbf{3.7$\times$} end-to-end speedups while maintaining \textbf{98\%} rollout quality, demonstrating the vast advantages and practicality of WorldCache in resource-constrained scenarios. Our code is released at https://github.com/FofGofx/WorldCache.

Jiajun Zeng,Shadi Albarqouni

Main category: cs.CV

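TL;DR: K-MaT is a prompt-learning framework that needs no training images from the low-end modality. Through knowledge anchoring and Fused Gromov-Wasserstein optimal transport, it aligns the prompt manifolds of high-end and low-end modalities, substantially improving the zero-shot cross-modal transfer of medical vision-language models.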

Details

Motivation: Large biomedical vision-language models adapted on high-end imaging (e.g., CT) transfer poorly to low-end modalities (e.g., X-ray) and tend to collapse into modality-specific shortcuts. Method: The K-MaT framework factorizes prompts, anchors them to clinical text descriptions, and uses Fused Gromov-Wasserstein optimal transport to align the low-end prompt manifold with the visually grounded high-end space. Result: State-of-the-art results on four cross-modal benchmarks: an average harmonic mean of accuracy of 44.1% (2.1 points above BiomedCoOp's 42.0%) and a macro-F1 of 36.2%. On the breast imaging task it mitigates catastrophic forgetting (CoOp drops to 27.0% accuracy on the low-end modality, whereas K-MaT remains robust). Conclusion: Aligning prompt manifolds via optimal transport is an effective route to zero-shot cross-modal deployment of medical VLMs. Abstract: Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp's 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.
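
The abstract reports an "average harmonic mean of accuracy" without spelling out the formula. Assuming it is the usual harmonic mean of the high-end and low-end modality accuracies (an assumption; the pairing is not stated in the abstract), the sketch below shows why this metric penalizes a model that does well on one modality and collapses on the other. The function name and the example numbers are hypothetical and are not results from the paper.

```python
def harmonic_mean(acc_high, acc_low):
    """Harmonic mean of two accuracies given in percent.

    Assumed here to combine high-end and low-end modality accuracy;
    the abstract does not define the exact pairing.
    """
    if acc_high + acc_low == 0:
        return 0.0
    return 2 * acc_high * acc_low / (acc_high + acc_low)


# Made-up numbers for illustration only: a model that is strong on one
# modality but weak on the other scores well below its arithmetic mean.
balanced = harmonic_mean(50.0, 50.0)
lopsided = harmonic_mean(80.0, 27.0)
```

With these illustrative inputs the arithmetic mean of the lopsided pair is 53.5, while its harmonic mean is roughly 40.4, lower than the balanced model's 50.0.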

[167] Dynamic Chunking Diffusion Transformer

Akash Haridas,Utkarsh Saxena,Parsa Ashrafi Fashi,Mehdi Rezagholizadeh,Vikram Appia,Emad Barsoum

Main category: cs.CV

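TL;DR: This paper proposes the Dynamic Chunking Diffusion Transformer (DC-DiT), which uses a learned encoder-router-decoder structure to adaptively compress images into variable-length token sequences during diffusion, substantially reducing compute while maintaining or improving generation quality.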

Details

Motivation: Conventional Diffusion Transformers (DiT) patchify images into fixed-length token sequences, so compute is spread uniformly over low- and high-information regions, ignoring the spatial heterogeneity of image content and the coarse-to-fine temporal structure of the diffusion process. Method: DC-DiT adds an encoder-router-decoder module to the DiT backbone, trained end-to-end with diffusion, that performs dynamic chunking conditioned on the data and the diffusion timestep: background regions are compressed into fewer tokens, detail-rich regions keep more tokens, and the token count grows as denoising proceeds. Result: On class-conditional ImageNet 256×256 generation, DC-DiT consistently beats parameter-matched and FLOP-matched DiT baselines at 4× and 16× compression (better FID and Inception Score); it can be upcycled efficiently from pretrained DiT checkpoints (up to 8× fewer training steps) and composes with other dynamic computation methods to reduce FLOPs further. Conclusion: DC-DiT shows that data- and timestep-adaptive dynamic token compression is an effective and practical paradigm for diffusion models, with potential extensions to pixel-space, video, and 3D generation. Abstract: Diffusion Transformers process images as fixed-length sequences of tokens produced by a static $\textit{patchify}$ operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. Furthermore, it also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet $256{\times}256$, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across $4{\times}$ and $16{\times}$ compression, showing this is a promising technique with potential further applications to pixel-space, video and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to $8{\times}$ fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.

[168] LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtents

Tianhao Zhao,Youjia Zhang,Hang Long,Jinshen Zhang,Wenbing Li,Yang Yang,Gongbo Zhang,Jozef Hladký,Matthias Nießner,Wei Yang

Main category: cs.CV

TL;DR: LATO is a novel latent representation for 3D mesh generation that preserves topology via a voxel-based VAE and flow matching, enabling efficient, high-fidelity mesh synthesis without isosurface extraction.

Details

Motivation: Existing 3D mesh generation methods (e.g., isosurface-based or autoregressive) suffer from inefficiency, topological inaccuracies, or geometric limitations; there is a need for scalable, topology-aware, and geometry-precise synthesis. Method: LATO introduces a Vertex Displacement Field (VDF) on mesh surfaces, compresses it via a sparse voxel VAE, and uses progressive voxel subdivision and pruning plus a connection head for topology recovery; generative modeling employs two-stage flow matching on structure and topology features. Result: LATO generates meshes with complex geometry and well-formed topology more efficiently than prior isosurface/triangle-based diffusion or autoregressive models. Conclusion: LATO provides a scalable, topology-preserving framework for explicit 3D mesh synthesis, outperforming existing methods in fidelity, topological correctness, and inference efficiency. Abstract: In this paper, we introduce LATO, a novel topology-preserving latent representation that enables scalable, flow matching-based synthesis of explicit 3D meshes. LATO represents a mesh as a Vertex Displacement Field (VDF) anchored on the surface, incorporating a sparse voxel Variational Autoencoder (VAE) to compress this explicit signal into a structured, topology-aware voxel latent. To decapsulate the mesh, the VAE decoder progressively subdivides and prunes latent voxels to instantiate precise vertex locations. In the end, a dedicated connection head queries the voxel latent to predict edge connectivity between vertex pairs directly, allowing mesh topology to be recovered without isosurface extraction or heuristic meshing. For generative modeling, LATO adopts a two-stage flow matching process, first synthesizing the structure voxels and subsequently refining the voxel-wise topology features. Compared to prior isosurface/triangle-based diffusion models and autoregressive generation approaches, LATO generates meshes with complex geometry and well-formed topology while being highly efficient at inference.

[169] Computer vision-based estimation of invertebrate biomass

Mikko Impiö,Philipp M. Rehsen,Jarrett Blair,Cecilie Mielec,Arne J. Beermann,Florian Leese,Toke T. Høye,Jenni Raitoharju

Main category: cs.CV

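TL;DR: This paper proposes two image-based approaches for estimating invertebrate dry mass: fitting a linear model with novel predictors (area and sinking speed), and training several end-to-end deep neural networks (single-view, multi-view, and metadata-aware architectures). Neither requires manual effort beyond imaging. Experiments show the methods work on morphologically complex and diverse specimens; combined with automatic classification they enable group-level dry mass estimation, with a median percentage error of 10-20% for individuals.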

Details

Motivation: Conventional dry weighing is manual, time-consuming, and destructive; non-contact, scalable, image-driven biomass estimation is needed to support large-scale biodiversity monitoring. Method: (1) Using the BIODISCOVER dual-camera system to capture image sequences of specimens sinking in an ethanol column, area and sinking speed are extracted automatically as new predictors for a linear regression model; (2) three families of end-to-end deep neural networks (single-view, multi-view, and metadata-aware) are designed and trained, with different loss functions, data augmentation methods, and architecture choices explored. Result: The methods estimate dry mass reliably on morphologically complex and diverse invertebrate specimens; combined with automatic classification, group-level dry mass estimates are accurate, and the median percentage error for individual dry mass is 10-20%. The study also shows the need to report both percentage and absolute errors. Conclusion: Invertebrate dry mass can be estimated fairly accurately from images alone, offering a feasible route to automated, non-destructive, high-throughput biodiversity monitoring; the choice of evaluation metrics and the exploration of model architectures both matter for task performance. Abstract: The ability to estimate invertebrate biomass using only images could help scale up quantitative biodiversity monitoring efforts. Computer vision-based methods have the potential to omit the manual, time-consuming, and destructive process of dry weighing specimens. We present two approaches for dry mass estimation that do not require additional manual effort apart from imaging the specimens: fitting a linear model with novel predictors, automatically calculated by an imaging device, and training a family of end-to-end deep neural networks for the task, using single-view, multi-view, and metadata-aware architectures. We propose using area and sinking speed as predictors. These can be calculated with BIODISCOVER, which is a dual-camera system that captures image sequences of specimens sinking in an ethanol column. For this study, we collected a large dataset of dry mass measurement and image sequence pairs to train and evaluate models. We show that our methods can estimate specimen dry mass even with complex and visually diverse specimen morphologies. Combined with automatic taxonomic classification, our approach is an accurate method for group-level dry mass estimation, with a median percentage error of 10-20% for individuals. We highlight the importance of choosing appropriate evaluation metrics, and encourage using both percentage errors and absolute errors as metrics, because they measure different properties. We also explore different optimization losses, data augmentation methods, and model architectures for training deep-learning models.
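
The abstract's point that percentage and absolute errors "measure different properties" can be illustrated with a short sketch. The function names are hypothetical and the formulas are the common definitions, assumed here rather than taken from the paper.

```python
from statistics import median


def median_percentage_error(true_mass, pred_mass):
    """Median of |pred - true| / true, in percent (true values must be > 0)."""
    return median(abs(p - t) / t * 100 for t, p in zip(true_mass, pred_mass))


def median_absolute_error(true_mass, pred_mass):
    """Median of |pred - true|, in the same unit as the inputs."""
    return median(abs(p - t) for t, p in zip(true_mass, pred_mass))
```

A percentage error treats a 0.05 mg miss on a 0.1 mg specimen as severe (50%), while an absolute error barely registers it. The reverse holds for a 1 mg miss on a 10 mg specimen (10%, yet the largest absolute error), which is why reporting both gives a fuller picture across specimen sizes.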

[170] OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis

Yuxuan Fan,Jing Hao,Hong Chen,Jiahao Bao,Yihua Shao,Yuci Liang,Kuo Feng Hung,Hao Tang

Main category: cs.CV

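TL;DR: This paper proposes OralGPT-Plus, an agentic vision-language model for panoramic dental radiographs that improves clinical reliability through iterative, symmetry-aware diagnostic reasoning. It builds the 5,000-image DentalProbe dataset and a Reinspection-driven reinforcement learning framework, and releases MMOral-X, the first benchmark for holistic panoramic diagnosis (300 open-ended questions). Experiments show consistent improvements over strong baselines.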

Details

Motivation: Existing vision-language models use a static single-pass inference paradigm that cannot deliver the fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification that panoramic dental imaging requires, which limits their clinical reliability. Method: The paper proposes the agentic VLM OralGPT-Plus; builds the expert-curated DentalProbe dataset (with diagnostic trajectories); designs a Reinspection-driven reinforcement learning framework (with a rubric-based reward and a conditioned diagnostic-driven reward); and establishes MMOral-X, the first benchmark for holistic panoramic diagnosis. Result: OralGPT-Plus consistently outperforms strong baselines on MMOral-X and on established panoramic benchmarks. Conclusion: Agentic modeling and symmetry-aware reasoning make dental image analysis better suited to clinical use and lay a foundation for future research on clinically aligned panoramic radiograph analysis. Abstract: Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.

[171] Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation

Jonas Ernst,Wolfgang Boettcher,Lukas Hoyer,Jan Eric Lenssen,Bernt Schiele

Main category: cs.CV

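TL;DR: Rewis3d uses feed-forward 3D reconstruction as a weak supervisory signal to improve semantic segmentation of 2D images from sparse annotations, with no additional labels or inference overhead.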

Details

Motivation: Dense pixel-level annotation is costly; sparse annotation eases the burden but still leaves a performance gap, so more effective weak supervision is needed. Method: A dual student-teacher architecture uses state-of-the-art feed-forward 3D reconstruction to generate geometric supervision and enforces semantic consistency between 2D images and reconstructed 3D point clouds. Result: State-of-the-art performance under sparse supervision, outperforming existing methods by 2-7% without adding annotation or inference cost. Conclusion: 3D geometric structure can propagate sparse annotations effectively, providing a reliable and efficient auxiliary supervisory signal for weakly supervised semantic segmentation. Abstract: We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.

[172] MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis

Dongqing Xie,Yonghuang Wu

Main category: cs.CV

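TL;DR: MoEMambaMIL is a structure-aware state space model framework for whole-slide image (WSI) analysis. By combining region-nested scanning with mixture-of-experts (MoE) modeling, it exploits the spatial hierarchy of WSIs and achieves the best performance on 9 downstream tasks.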

Details

Motivation: Existing MIL methods treat a WSI as an unordered collection of patches and struggle to model the structural dependencies between global tissue organization and local cellular patterns; SSMs handle long sequences well, but how to build token sequences that match the spatial hierarchy of WSIs remains an open problem. Method: MoEMambaMIL builds region-aware patch sequences from multi-resolution preprocessing that preserve spatial containment across resolutions, and combines static resolution-specific experts with dynamic sparsely routed experts to decouple resolution-aware encoding from region-adaptive contextual modeling. Result: Best performance on all 9 downstream WSI analysis tasks. Conclusion: Combining structured token sequences with an MoE mechanism markedly improves the effectiveness and efficiency of SSMs for the long-sequence, multi-scale, heterogeneous modeling that WSIs demand. Abstract: Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem. We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.

[173] CHMv2: Improvements in Global Canopy Height Mapping using DINOv3

John Brandt,Seungeun Yi,Jamie Tolan,Xinyuan Li,Peter Potapov,Jessica Ertel,Justine Spore,Huy V. Vo,Michaël Ramamonjisoa,Patrick Labatut,Piotr Bojanowski,Camille Couprie

Main category: cs.CV

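TL;DR: This paper presents CHMv2, a global, meter-resolution canopy height map generated from high-resolution optical satellite imagery with a DINOv3-based depth-estimation model. Compared with existing products it is substantially better in accuracy, bias, and preservation of fine-scale structure, and validation against ALS, GEDI, and ICESat-2 data shows consistent performance across forest biomes.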

Details

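Motivation: Accurate canopy height information is essential for forest carbon estimation, monitoring restoration and degradation, and assessing habitat structure, but airborne laser scanning (ALS) data are unevenly available worldwide, so a more broadly applicable, high-accuracy alternative is needed. Method: A depth-estimation model is built on pretrained DINOv3 features, takes high-resolution optical satellite imagery as input, and is supervised by ALS-derived canopy height models; it relies on a large expansion of geographically diverse training data, automated data curation and registration, and a loss function and sampling strategy tailored to canopy height distributions. Result: CHMv2 is substantially more accurate than existing products, in particular reducing bias in tall forests, and better preserves fine-scale structure such as canopy edges and gaps; validation on independent ALS test sets and tens of millions of GEDI and ICESat-2 observations shows consistent performance across major forest biomes. Conclusion: CHMv2 provides a high-resolution, high-fidelity, scalable tool for global forest structure monitoring, compensates for the limited spatial coverage of ALS data, and supports remote-sensing-driven ecology and carbon-cycle research. Abstract: Accurate canopy height information is essential for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure, yet high-fidelity measurements from airborne laser scanning (ALS) remain unevenly available globally. Here we present CHMv2, a global, meter-resolution canopy height map derived from high-resolution optical satellite imagery using a depth-estimation model built on DINOv3 and trained against ALS canopy height models. Compared to existing products, CHMv2 substantially improves accuracy, reduces bias in tall forests, and better preserves fine-scale structure such as canopy edges and gaps. These gains are enabled by a large expansion of geographically diverse training data, automated data curation and registration, and a loss formulation and data sampling strategy tailored to canopy height distributions. We validate CHMv2 against independent ALS test sets and against tens of millions of GEDI and ICESat-2 observations, demonstrating consistent performance across major forest biomes.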

[174] Prompt Group-Aware Training for Robust Text-Guided Nuclei Segmentation

Yonghuang Wu,Zhenyang Liang,Wenwen Zeng,Xuan Xie,Jinhua Yu

Main category: cs.CV

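TL;DR: This paper proposes a group-consistency training framework that addresses prompt sensitivity in text-guided medical image segmentation. Using quality-guided group regularization and a logit-level consistency constraint, it substantially improves the robustness and generalization of nuclei segmentation without changing the model architecture or the inference pipeline.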

Details

Motivation: Foundation models such as SAM3 are highly sensitive to prompt wording in text-guided medical image segmentation; even semantically equivalent prompts can yield inconsistent masks, which undermines reliability in clinical and pathology workflows. Method: Semantically related prompts are organized into "prompt groups" that share the same ground-truth mask. The prompt group-aware training framework combines (i) quality-guided group regularization that uses segmentation loss as an implicit ranking signal and (ii) a logit-level within-group consistency constraint with a stop-gradient strategy. Result: On multi-dataset nuclei segmentation benchmarks, performance variance across prompt quality levels drops markedly; on six zero-shot cross-dataset tasks, Dice improves by an average of 2.16 points. Conclusion: The method requires no change to the model architecture or inference pipeline and improves the robustness and generalization of vision-language segmentation in computational pathology. Abstract: Foundation models such as Segment Anything Model 3 (SAM3) enable flexible text-guided medical image segmentation, yet their predictions remain highly sensitive to prompt formulation. Even semantically equivalent descriptions can yield inconsistent masks, limiting reliability in clinical and pathology workflows. We reformulate prompt sensitivity as a group-wise consistency problem. Semantically related prompts are organized into \emph{prompt groups} sharing the same ground-truth mask, and a prompt group-aware training framework is introduced for robust text-guided nuclei segmentation. The approach combines (i) a quality-guided group regularization that leverages segmentation loss as an implicit ranking signal, and (ii) a logit-level consistency constraint with a stop-gradient strategy to align predictions within each group. The method requires no architectural modification and leaves inference unchanged. Extensive experiments on multi-dataset nuclei benchmarks show consistent gains under textual prompting and markedly reduced performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, our method improves Dice by an average of 2.16 points. These results demonstrate improved robustness and generalization for vision-language segmentation in computational pathology.
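
To make the reported quantities concrete, the sketch below computes the Dice coefficient for binary masks and the spread of Dice scores across the prompts in one group. It illustrates how prompt sensitivity could be measured and is not the authors' training objective. The function names are hypothetical, and masks are flattened to lists of 0/1 for simplicity.

```python
def dice(pred, target):
    """Dice coefficient between two flattened binary masks (lists of 0/1)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2 * inter / total


def dice_spread(masks_per_prompt, target):
    """Gap between the best and worst Dice within one prompt group.

    masks_per_prompt holds one predicted mask per prompt; every prompt in
    the group shares the same ground-truth mask, as in the abstract.
    A spread of 0 would mean the model is insensitive to prompt wording.
    """
    scores = [dice(mask, target) for mask in masks_per_prompt]
    return max(scores) - min(scores)
```

A training scheme that aligns predictions within a group aims to shrink this spread while keeping the average Dice high.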

[175] REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation

Maëlic Neau,Zoe Falomir

Main category: cs.CV

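TL;DR: This paper proposes REACT++, which uses efficient feature extraction and subject-to-object cross-attention in the prototype space to improve relation prediction accuracy in scene graph generation (SGG) without sacrificing object detection performance, while keeping real-time inference speed.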

Details

Motivation: Existing scene graph generation methods usually focus on only one of relation prediction accuracy, object detection accuracy, or latency, and do not optimize the balance among all three. Method: Building on the REACT architecture, REACT++ uses efficient feature extraction and subject-to-object cross-attention within the prototype space to balance inference speed and representational power. Result: REACT++ achieves the highest inference speed among existing SGG models and improves relation prediction accuracy by 10% on average while maintaining object detection performance; it is 20% faster than REACT. Conclusion: REACT++ jointly optimizes real-time speed, relation prediction accuracy, and object detection performance, and is the current state-of-the-art model for real-time scene graph generation. Abstract: Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at https://github.com/Maelic/SGG-Benchmark.
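
The abstract names "subject-to-object cross-attention" without giving details. As a generic illustration only, the sketch below implements plain scaled dot-product cross-attention in which subject features act as queries and object features as keys and values. The learned projections, multi-head structure, and prototype space of REACT++ are omitted, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(subject_queries, object_keys, object_values):
    """Scaled dot-product cross-attention without learned projections.

    Each subject query attends over all object keys and returns a
    weighted average of the corresponding object values.
    """
    dim = len(object_keys[0])
    outputs = []
    for q in subject_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in object_keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, object_values))
                        for j in range(len(object_values[0]))])
    return outputs
```

When all object keys are identical the attention weights are uniform, so each subject simply receives the mean of the object values. Distinct keys let a subject focus on the objects most relevant to it.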

[176] Solving Jigsaw Puzzles in the Wild: Human-Guided Reconstruction of Cultural Heritage Fragments

Omidreza Safaei,Sinem Aslan,Sebastiano Vascon,Luca Palmieri,Marina Khoroshiltseva,Marcello Pelillo

Main category: cs.CV

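TL;DR: This paper proposes a human-in-the-loop (HIL) puzzle-solving framework for reconstructing large sets of real archaeological fragments. It combines an automatic relaxation-labeling solver with two interaction strategies (Iterative Anchoring and Continuous Interactive Refinement) and substantially outperforms both fully automatic and manual baselines on the RePAIR benchmark.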

Details

Motivation: Real archaeological fragments suffer from erosion, missing regions, irregular shapes, and large-scale ambiguity, and traditional jigsaw solvers struggle when the number of fragments reaches the thousands. Method: A human-in-the-loop framework that couples an automatic relaxation-labeling solver with interactive human guidance, using two interaction strategies: Iterative Anchoring (locking verified placements) and Continuous Interactive Refinement (correcting errors and steering the solver as it runs). Result: Experiments on several RePAIR groups show that the hybrid approach substantially outperforms fully automatic and manual baselines in both accuracy and efficiency. Conclusion: The HIL framework offers a practical, scalable, expert-in-the-loop solution for reconstructing large, highly ambiguous sets of real cultural heritage fragments. Abstract: Reassembling real-world archaeological artifacts from fragmented pieces poses significant challenges due to erosion, missing regions, irregular shapes, and large-scale ambiguity. Traditional jigsaw puzzle solvers, often designed for clean synthetic scenarios, struggle under these conditions, especially when the number of fragments grows into the thousands, as in the RePAIR benchmark. In this paper, we propose a human-in-the-loop (HIL) puzzle solving framework designed to address the complexity and scale of real-world cultural heritage reconstruction. Our approach integrates an automatic relaxation-labeling solver with interactive human guidance, allowing users to iteratively lock verified placements, correct errors, and guide the system toward semantically and geometrically coherent assemblies. We introduce two complementary interaction strategies, Iterative Anchoring and Continuous Interactive Refinement, which support scalable reconstruction across varying levels of ambiguity and puzzle size. Experiments on several RePAIR groups demonstrate that our hybrid approach substantially outperforms both fully automatic and manual baselines in accuracy and efficiency, offering a practical solution for large-scale expert-in-the-loop artifact reassembly.

[177] DiffInf: Influence-Guided Diffusion for Supervision Alignment in Facial Attribute Learning

Basudha Pal,Rama Chellappa

Main category: cs.CV

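TL;DR: This paper proposes DiffInf, a self-influence-guided diffusion framework that repairs annotation inconsistencies in facial attribute data and thereby improves classification performance.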

Details

Motivation: Facial attribute classification relies on large annotated datasets, but many attributes (such as age and expression) are inherently continuous and ambiguous. Discretized labels are affected by subjectivity and visual confounders (pose, illumination, expression, demographic variation), which produces inconsistent annotations and supervision errors that harm representation learning and downstream prediction. Method: A baseline classifier is trained first, and sample-wise self-influence scores are computed with a first-order approximation to identify training samples that disproportionately destabilize optimization. Rather than discarding these samples, a latent diffusion autoencoder applies targeted generative correction so that image content better matches the assigned label while preserving identity and realism. To make the guidance differentiable, a lightweight predictor of high-influence membership is trained and used as a surrogate influence regularizer. The edited samples replace the originals, giving an influence-refined dataset of unchanged size. Result: On multi-class facial attribute classification, DiffInf consistently improves generalization over standard noisy-label training, robust optimization baselines, and influence-based filtering. Conclusion: Repairing influential annotation inconsistencies at the image level improves downstream facial attribute classification without sacrificing distributional coverage. Abstract: Facial attribute classification relies on large-scale annotated datasets in which many traits, such as age and expression, are inherently ambiguous and continuous but are discretized into categorical labels. Annotation inconsistencies arise from subjectivity and visual confounders such as pose, illumination, expression, and demographic variation, creating mismatch between images and assigned labels. These inconsistencies introduce supervision errors that impair representation learning and degrade downstream prediction. We introduce DiffInf, a self-influence-guided diffusion framework for mitigating annotation inconsistencies in facial attribute learning. We first train a baseline classifier and compute sample-wise self-influence scores using a practical first-order approximation to identify training instances that disproportionately destabilize optimization. Instead of discarding these influential samples, we apply targeted generative correction via a latent diffusion autoencoder to better align visual content with assigned labels while preserving identity and realism. To enable differentiable guidance during correction, we train a lightweight predictor of high-influence membership and use it as a surrogate influence regularizer. The edited samples replace the originals, yielding an influence-refined dataset of unchanged size. Across multi-class facial attribute classification, DiffInf consistently improves generalization compared with standard noisy-label training, robust optimization baselines, and influence-based filtering. Our results demonstrate that repairing influential annotation inconsistencies at the image level enhances downstream facial attribute classification without sacrificing distributional coverage.
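The abstract does not spell out its "practical first-order approximation" for self-influence. One common choice, assumed here purely for illustration, scores each training sample by the squared norm of its own loss gradient at a fixed checkpoint. The sketch below does this for a logistic-regression classifier; the function name, the toy data, and the choice of model are hypothetical and are not taken from the paper.

```python
import numpy as np

def self_influence_scores(X, y, w):
    """Squared per-sample gradient norm of the logistic loss (an assumed proxy).

    For logistic regression the per-sample gradient is (p - y) * x,
    so its squared norm is (p - y)^2 * ||x||^2.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probability per sample
    residual = p - y                          # prediction error per sample
    return residual**2 * np.sum(X**2, axis=1)

# Toy data: the last sample looks like a positive but is labelled 0.
X = np.array([[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0], [2.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([1.0, 0.0])                      # stand-in for a trained checkpoint

scores = self_influence_scores(X, y, w)
suspect = int(np.argmax(scores))              # index of the highest-scoring sample
```

In this toy setting the sample whose label disagrees with the classifier receives the largest score, which is the kind of instance the abstract proposes to edit rather than discard.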

[178] Locating and Editing Figure-Ground Organization in Vision Transformers

Stefan Arnold,René Gröbner

Main category: cs.CV

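TL;DR: This paper studies how a Vision Transformer (BEiT) trades off local geometric evidence against global organizational priors when resolving figure-ground organization, and asks where the classic Gestalt prior of convexity is realized inside the model. Using a controlled perceptual conflict built from synthetic dart shapes, it finds that BEiT reliably prefers convex completion, uses logit attribution to localize this preference to specific functional units (notably attention head L0H9), and shows that the decision stays ambiguous through early and intermediate layers and resolves abruptly in later layers.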

Details

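Motivation: Vision Transformers must resolve the perceptual ambiguity of figure-ground organization by choosing between local geometric cues and global Gestalt priors such as convexity; the paper aims to locate where, and by what mechanism, the convexity prior is realized inside BEiT. Method: Build controlled perceptual-conflict stimuli from synthetic dart shapes and systematically mask regions that admit a concave or a convex completion equally well; project internal activations into the model's discrete visual codebook space via logit attribution; decompose the direct effect of attention heads to identify key functional units. Result: BEiT reliably favors convex completion under conflict; figure-ground organization remains ambiguous through early and intermediate layers and resolves abruptly in later layers; attention head L0H9 acts as an early seed that introduces a weak bias toward convexity; downscaling this single head shifts the decision distribution toward concave evidence. Conclusion: The convexity prior is not spread uniformly through the network but is carried by identifiable functional units: an early attention head (L0H9) seeds a weak convex bias, and the figure-ground decision is only resolved in later layers. Abstract: Vision Transformers must resolve figure-ground organization by choosing between completions driven by local geometric evidence and those favored by global organizational priors, giving rise to a characteristic perceptual ambiguity. We aim to locate where the canonical Gestalt prior convexity is realized within the internal components of BEiT. Using a controlled perceptual conflict based on synthetic shapes of darts, we systematically mask regions that equally admit either a concave completion or a convex completion. We show that BEiT reliably favors convex completion under this competition. Projecting internal activations into the model's discrete visual codebook space via logit attribution reveals that this preference is governed by identifiable functional units within transformer substructures. Specifically, we find that figure-ground organization is ambiguous through early and intermediate layers and resolves abruptly in later layers. By decomposing the direct effect of attention heads, we identify head L0H9 acting as an early seed, introducing a weak bias toward convexity. Downscaling this single attention head shifts the distributional mass of the perceptual conflict across a continuous decision boundary, allowing concave evidence to guide completion.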

[179] Physical Simulator In-the-Loop Video Generation

Lin Geng Foo,Mark He Huang,Alexandros Lattas,Stylianos Moschoglou,Thabo Beeler,Christian Theobalt

Main category: cs.CV

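TL;DR: This paper proposes PSIVG, a framework that places a physical simulator inside the video diffusion process. It reconstructs the 4D scene and object meshes, generates physically consistent motion trajectories in the simulator, and adds Test-Time Texture Consistency Optimization (TTCO), improving the physical plausibility and texture consistency of generated videos while preserving visual quality and diversity.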

Details

Motivation: Diffusion-based video generators achieve high visual realism but often violate basic physical laws such as gravity, inertia, and collision, producing inconsistent motion and implausible dynamics that limit realism and reliability. Method: Physical Simulator In-the-loop Video Generation (PSIVG): (1) generate a template video with a pre-trained diffusion model; (2) reconstruct the 4D scene and foreground object meshes; (3) initialize them in a physical simulator and generate physically consistent trajectories; (4) use those trajectories to guide the video generator toward spatio-temporally physically coherent motion; (5) apply Test-Time Texture Consistency Optimization (TTCO), which adapts text and feature embeddings based on pixel correspondences from the simulator to improve texture consistency. Result: Experiments show that PSIVG produces videos that adhere better to real-world physics while preserving visual quality and diversity, with TTCO targeting texture consistency during object motion. Conclusion: Explicitly embedding a physical prior in the video diffusion pipeline is a feasible and effective way to improve the physical plausibility of AI-generated video. Abstract: Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: https://vcai.mpi-inf.mpg.de/projects/PSIVG/

[180] Non-invasive Growth Monitoring of Small Freshwater Fish in Home Aquariums via Stereo Vision

Clemens Seibold,Anna Hilsmann,Peter Eisert

Main category: cs.CV

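TL;DR: This paper proposes a non-invasive, refraction-aware stereo vision method that detects fish keypoints with YOLOv11-Pose and combines a refraction-corrected epipolar constraint with 3D triangulation to estimate fish length in aquariums.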

Details

Motivation: Monitoring fish growth is important for health assessment in aquaculture and home aquariums, but image-based measurement is difficult because the fish are small and the air-glass-water interfaces cause strong refractive distortion. Method: A YOLOv11-Pose network detects fish and predicts anatomical keypoints in each stereo image; a refraction-aware epipolar constraint enables robust matching; unreliable detections are removed using a learned quality score; refraction-aware 3D triangulation then recovers 3D keypoints, from which body length is measured. Result: The method is validated on a new stereo dataset of endangered Sulawesi ricefish, and the experiments show that filtering low-quality detections is essential for accurate length estimation. Conclusion: The system offers a simple, practical solution for non-invasive growth monitoring that is easy to deploy in settings such as home aquariums. Abstract: Monitoring fish growth behavior provides relevant information about fish health in aquaculture and home aquariums. Yet, monitoring fish sizes poses different challenges, as fish are small and subject to strong refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative that allows frequent monitoring without disturbing the fish. In this paper, we propose a non-invasive refraction-aware stereo vision method to estimate fish length in aquariums. Our approach uses a YOLOv11-Pose network to detect fish and predict anatomical keypoints on the fish in each stereo image. A refraction-aware epipolar constraint accounting for the air-glass-water interfaces enables robust matching, and unreliable detections are removed using a learned quality score. A subsequent refraction-aware 3D triangulation recovers 3D keypoints, from which fish length is measured. We validate our approach on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions and demonstrate that filtering low-quality detections is essential for accurate length estimation. The proposed system offers a simple and practical solution for non-invasive growth monitoring and can be easily applied in home aquariums.
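To make the "refraction-aware" part concrete, the sketch below bends a camera ray through flat, parallel air-glass-water interfaces using the vector form of Snell's law. It is only an illustration of the underlying optics, not the authors' pipeline: the function name, the refractive indices (1.0, 1.5, 1.33), and the 30-degree example ray are assumptions.

```python
import numpy as np

def refract(direction, normal, n1, n2):
    """Refract a ray at a flat interface (vector form of Snell's law).

    direction: incoming ray direction (any length).
    normal:    interface normal pointing back toward the incoming side.
    n1, n2:    refractive indices before and after the interface.
    Returns the unit refracted direction, or None on total internal reflection.
    """
    d = direction / np.linalg.norm(direction)
    n = normal / np.linalg.norm(normal)
    cos_i = -np.dot(n, d)                     # cosine of the incidence angle
    eta = n1 / n2
    sin2_t = eta**2 * (1.0 - cos_i**2)        # squared sine of the refracted angle
    if sin2_t > 1.0:
        return None                           # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n

# Assumed indices for air, glass, and water; the tank wall is the plane z = const.
N_AIR, N_GLASS, N_WATER = 1.0, 1.5, 1.33
wall_normal = np.array([0.0, 0.0, -1.0])      # points toward the camera

theta = np.deg2rad(30.0)
ray_air = np.array([np.sin(theta), 0.0, np.cos(theta)])
ray_glass = refract(ray_air, wall_normal, N_AIR, N_GLASS)
ray_water = refract(ray_glass, wall_normal, N_GLASS, N_WATER)
```

Because the ray changes direction at each interface, corresponding points in the two views no longer lie on straight epipolar lines, which is why both matching and triangulation have to trace the refracted rays.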

[181] CLoPA: Continual Low Parameter Adaptation of Interactive Segmentation for Medical Image Annotation

Parhom Esmaeili,Chayanin Tangwiriyasakul,Eli Gibson,Sebastien Ourselin,M. Jorge Cardoso

Main category: cs.CV

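TL;DR: This paper proposes CLoPA, which continually tunes a small fraction of nnInteractive's parameters on the annotation cache. It adapts the model online without adding parameters or changing the inference pipeline, and raises zero-shot medical image segmentation performance to expert level.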

Details

Motivation: Existing zero-shot interactive segmentation models such as nnInteractive do not consistently reach expert-level performance across diverse medical imaging tasks, while the task-specific labelled data produced during annotation campaigns offers a natural opportunity for online adaptation. Method: CLoPA (Continual Low Parameter Adaptation) tunes only a small fraction of nnInteractive's parameters on the annotation cache, triggered by lightweight episode scheduling; it adds no new parameters, leaves the inference pipeline unchanged, and runs entirely within the existing annotation workflow. Result: On eight Medical Segmentation Decathlon tasks, CLoPA rapidly raises performance to expert level, with most of the gain realised after a single training episode; the benefit of tuning different parameter groups depends on task characteristics and data regime; for targets with complex geometry (e.g., hepatic vessels), tuning instance normalisation and low-level features saturates. Conclusion: CLoPA is an efficient, plug-in continual adaptation scheme for zero-shot interactive segmentation, but the most challenging targets appear to need deeper feature-representation alignment. Abstract: Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-level performance across diverse medical imaging tasks. Because annotation campaigns produce a growing stream of task-specific labelled data, online adaptation of the segmentation model is a natural complement to zero-shot inference. We propose CLoPA, a continual adaptation strategy that tunes a small fraction of nnInteractive's parameters on the annotation cache, triggered by lightweight episode scheduling. CLoPA requires no new parameters or changes to the inference pipeline, and operates entirely within the existing annotation workflow. Across eight Medical Segmentation Decathlon tasks spanning diverse anatomical targets and imaging characteristics, CLoPA rapidly elevates performance to expert-level, even for tasks where nnInteractive previously failed, with the majority of gains realised after a single training episode. We show that the benefits of tuning different parameter groups depend on task characteristics and data regimes. We also show that for targets with complex geometries (e.g., hepatic vessels), instance normalisation and low-level feature tuning saturate, suggesting a need for deeper feature-representation alignment in the most challenging scenarios.

[182] What if? Emulative Simulation with World Models for Situated Reasoning

Ruiping Liu,Yufan Chen,Yuheng Zhang,Junwei Zheng,Kunyu Peng,Chengzhi Wu,Chenguang Huang,Di Wen,Jiaming Zhang,Kailun Yang,Rainer Stiefelhagen

Main category: cs.CV

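TL;DR: This paper introduces the WanderDream dataset for evaluating a model's ability to perform mental simulation and spatial reasoning when it cannot actively explore the environment.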

Details

Motivation: Active exploration is infeasible in many real-world scenarios because of the physical constraints of robots or safety concerns for visually impaired users, so it is worth studying how to perform mental simulation and spatial reasoning from a limited observation alone. Method: Build the WanderDream dataset, consisting of WanderDream-Gen (15.8K panoramic videos) and WanderDream-QA (158K question-answer pairs), and run experiments with world models and multimodal large language models. Result: The experiments indicate that mental exploration is essential for situated reasoning, that world models perform well on WanderDream-Gen, that imagination substantially helps reasoning on WanderDream-QA, and that the data transfer well to real-world scenarios. Conclusion: WanderDream provides the first large-scale benchmark for situated reasoning without active exploration and advances research on mental simulation and spatial reasoning. Abstract: Situated reasoning often relies on active exploration, yet in many real-world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what-if questions? We introduce WanderDream, the first large-scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream-QA contains 158K question-answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration-based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream-Gen, (3) that imagination substantially facilitates reasoning on WanderDream-QA, and (4) that WanderDream data exhibit remarkable transferability to real-world scenarios. The source code and all data will be released.

[183] CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Yitong Chen,Zuxuan Wu,Xipeng Qiu,Yu-Gang Jiang

Main category: cs.CV

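TL;DR: This paper proposes CaTok, a 1D causal image tokenizer that combines a MeanFlow decoder with REPA-A regularization to support efficient, high-fidelity, causal autoregressive image generation, and reaches state-of-the-art results on ImageNet reconstruction.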

Details

Motivation: Existing visual tokenizers do not meet the causality requirement of autoregressive language models: they either flatten 2D patches into non-causal sequences or impose heuristic orderings, and diffusion autoencoders fall short on causality or introduce imbalance through nested dropout. Method: CaTok, a 1D causal image tokenizer that selects tokens over time intervals and binds them to the MeanFlow objective; REPA-A regularization aligns encoder features with Vision Foundation Models (VFMs); the design supports both fast one-step generation and high-fidelity multi-step sampling. Result: State-of-the-art ImageNet reconstruction with FID 0.75, PSNR 22.53, and SSIM 0.674 using fewer training epochs; the AR model performs comparably to leading approaches. Conclusion: CaTok brings causal modeling to visual tokenization while balancing generation quality, efficiency, and training stability. Abstract: Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.

[184] Pinterest Canvas: Large-Scale Image Generation at Pinterest

Yu Wang,Eric Tzeng,Raymond Shiau,Jie Yang,Dmitry Kislyuk,Charles Rosenberg

Main category: cs.CV

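TL;DR: Pinterest presents Canvas, a system that pre-trains a foundational diffusion model on multimodal data and then rapidly fine-tunes task-specific variants for image editing and enhancement, yielding significant engagement lifts in online A/B tests.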

Details

Motivation: General-purpose image generation models are flexible but hard to control, which makes them unsuitable for use cases with strict product requirements; a precisely controllable, task-oriented system is needed. Method: Pinterest Canvas, a diffusion-based system: a base model is first trained on a diverse multimodal dataset, then variants are rapidly fine-tuned for individual downstream tasks (such as background enhancement and aspect-ratio outpainting); the paper also summarizes best practices for dataset curation, training, and inference. Result: In online A/B experiments, background enhancement and aspect-ratio outpainting bring engagement lifts of 18.0% and 12.5% respectively; human evaluation shows the models outperform third-party models on these tasks; the approach is also extended to multi-image scene synthesis and image-to-video generation. Conclusion: Task-specific fine-tuning is more effective than a single generic model, and the Canvas framework generalizes to a wide range of practical image generation needs. Abstract: While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.

[185] Training Flow Matching: The Role of Weighting and Parameterization

Anne Gagneux,Ségolène Martin,Rémi Gribonval,Mathurin Massias

Main category: cs.CV

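TL;DR: This paper systematically studies the training objectives of denoising-based generative models, focusing on how loss weighting and output parameterization (noise, clean image, or velocity) affect performance, and evaluates these design choices on synthetic and image data in terms of denoising accuracy and generative quality.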

Details

Motivation: To disentangle the factors that matter when training a flow matching model and give practical guidance on design choices, rather than to propose a new method. Method: A systematic numerical study of how loss weighting and output parameterization (noise, clean image, velocity) interact with the intrinsic dimensionality of the data manifold, the model architecture, and the dataset size. Result: On synthetic data with controlled geometry and on image data, the training objectives are compared using PSNR across noise levels (denoising accuracy) and FID (generative quality). Conclusion: The choice of training objective should take data characteristics, model architecture, and dataset size into account; different parameterizations have different strengths in different regimes. Abstract: We study the training objectives of denoising-based generative models, with a particular focus on loss weighting and output parameterization, including noise-, clean image-, and velocity-based formulations. Through a systematic numerical study, we analyze how these training choices interact with the intrinsic dimensionality of the data manifold, model architecture, and dataset size. Our experiments span synthetic datasets with controlled geometry as well as image data, and compare training objectives using quantitative metrics for denoising accuracy (PSNR across noise levels) and generative quality (FID). Rather than proposing a new method, our goal is to disentangle the various factors that matter when training a flow matching model, in order to provide practical insights on design choices.
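The three output parameterizations named in the abstract carry the same information once the noisy input and the time are known. The sketch below shows the conversions under one common convention, assumed here because the abstract does not state it: a linear path x_t = (1 - t) * x + t * eps with velocity v = eps - x. The function name and the toy values are hypothetical.

```python
import numpy as np

def to_velocity(pred, x_t, t, kind):
    """Convert a network output to a velocity under x_t = (1 - t) * x + t * eps.

    kind = "velocity": pred already estimates v = eps - x.
    kind = "clean":    pred estimates x,   so v = (x_t - x) / t.
    kind = "noise":    pred estimates eps, so v = (eps - x_t) / (1 - t).
    """
    if kind == "velocity":
        return pred
    if kind == "clean":
        return (x_t - pred) / t
    if kind == "noise":
        return (pred - x_t) / (1.0 - t)
    raise ValueError(f"unknown parameterization: {kind}")

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # clean sample
eps = rng.standard_normal(4)      # noise sample
t = 0.3
x_t = (1.0 - t) * x + t * eps     # point on the assumed linear path
v = eps - x                       # ground-truth velocity on that path

v_from_clean = to_velocity(x, x_t, t, "clean")
v_from_noise = to_velocity(eps, x_t, t, "noise")
```

Under this convention, an error in the clean-image prediction is scaled by 1/t when expressed as a velocity error, and an error in the noise prediction by 1/(1 - t). A squared loss on one target is therefore a time-dependent reweighting of a squared loss on another, which is one way to see why weighting and parameterization have to be studied together.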

[186] Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

Yakov Pyotr Shkolnikov

Main category: cs.CV

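TL;DR: This paper finds that the visual encoders of vision-language models contain rich geometric information (such as hand joint angles) that the text pathway fails to express. A linear probe extracts it with high accuracy while text output performs poorly, and LoRA fine-tuning largely closes the gap. Encoders trained with different objectives converge functionally on geometric tasks without converging representationally. Autoregressive generation damages geometric fidelity, but the damage comes from the generation process rather than from language alignment, and every model shows a mid-network peak in geometric signal.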

Details

Motivation: To test whether vision-language models implicitly encode continuous geometric structure, and why that information is hard to recover from text output, separating a representational deficit from a pathway-training deficit. Method: Regress hand joint angles from frozen features with a 6,000-parameter linear probe; compare against text decoding; fine-tune with LoRA (r=16) on 2,000 images; evaluate five encoders spanning self-supervised, contrastive, and hybrid training; analyze the contribution of Qwen2.5-VL's components; run layer-wise probes and locate informative attention heads. Result: The linear probe reaches 6.1° MAE versus 20.0° for the best text output; after LoRA the text pathway reaches 6.5°; the five encoders are statistically equivalent in accuracy (R² ≈ 0.55) despite CKA as low as 0.41; Qwen2.5-VL's LLM layers improve probe accuracy over the raw vision encoder; accuracy peaks mid-network in all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. Conclusion: The visual backbone already holds strong geometric representations and the bottleneck lies in the text pathway and the generation process; a frozen model can serve as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation. Abstract: Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees, a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity, that is, functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL's LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
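The CKA figure quoted above measures how similar two sets of features are. The abstract does not say which CKA variant was used; the sketch below assumes the widely used linear form and applies it to random stand-in features, so the function name, array shapes, and data are illustrative only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two feature matrices.

    X: (n_samples, d1) features from one encoder.
    Y: (n_samples, d2) features from another encoder, same samples and order.
    Returns a similarity in [0, 1].
    """
    X = X - X.mean(axis=0, keepdims=True)     # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
feats_a = rng.standard_normal((200, 8))                 # stand-in for encoder A
rotation, _ = np.linalg.qr(rng.standard_normal((8, 8))) # random orthogonal matrix
feats_rotated = 3.0 * feats_a @ rotation                # same geometry, new basis
feats_b = rng.standard_normal((200, 8))                 # unrelated features

cka_same = linear_cka(feats_a, feats_rotated)
cka_unrelated = linear_cka(feats_a, feats_b)
```

Linear CKA is invariant to rotations and isotropic rescaling of either feature space, so it reports low similarity only when the feature spaces genuinely differ. That is what makes the reported pairing of equal probe accuracy and low CKA informative.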

[187] GreenRFM: Toward a resource-efficient radiology foundation model

Yingtai Li,Shuai Ming,Mingyue Zhao,Haoran Lai,Rongsheng Wang,Rui Zhou,Rundong Wang,Yujia Li,Wei Wei,Shaohua Kevin Zhou

Main category: cs.CV

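TL;DR: This paper proposes GreenRFM, a resource-efficient pre-training framework for radiology foundation models. Its MUST supervision design (More distilled, Ubiquitous, Semantic-enforcing, Task-aligning) delivers strong performance at very low compute cost, trains on a single GPU with 24GB or 6GB of memory, and surpasses existing large models.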

Details

Motivation: Existing radiology foundation models rely on brute-force scaling and copy methods designed for natural images, neglecting the precision and robustness that clinical use requires, which leads to brittle and expensive models. Method: The GreenRFM framework, built around MUST supervision (More distilled, Ubiquitous, Semantic-enforcing, Task-aligning), emphasizes the quality of supervisory signals over data quantity; two configurations are offered: a performant model (single 24GB GPU, 24 hours) and a lightweight model (6GB VRAM, 4 hours). Result: Evaluated on over 200,000 images from four institutions and two modalities, GreenRFM achieves superior performance on chest and abdominal CT benchmarks, both public and private; the same supervision principles transfer to musculoskeletal MRI; compute requirements drop sharply, with training possible on a single GPU. Conclusion: GreenRFM shows that well-designed supervision can stand in for indiscriminate scaling, challenging the "scale is all you need" view and making clinical-grade radiology foundation models more accessible. Abstract: The development of radiology foundation models (RFMs) is hindered by a reliance on brute-force scaling. Existing approaches often directly translate methods for natural images, which prioritize scale over precision and hence lead to brittle and expensive models in clinical practice. To address this, we present a resource-efficient pre-training framework, GreenRFM, that achieves state-of-the-art performance. Our framework ensures robust generalization across diverse patient populations and imaging protocols, reducing computational requirements by orders of magnitude while surpassing complex, parameter-heavy models. These capabilities stem from principled supervision design that aims to maximally utilize supervisory signals via More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) supervision, rather than simply piling up the quantity of training data. We offer two GreenRFM configurations: (i) a performant model that establishes a new state-of-the-art using a single 24GB GPU within 24 hours, and (ii) a lightweight model that matches existing benchmarks with 6GB VRAM in 4 hours. We conduct extensive experiments using over 200,000 images from four institutions and of two modalities. GreenRFMs achieve superior performances on chest and abdominal CT datasets, regardless of public or private benchmark, surpassing a range of baseline models. In addition, the results on internal musculoskeletal MRI images show that the same supervision principles transfer between different modalities. Our performance and efficiency challenge the "scale is all you need" dogma and democratize the equitable development of state-of-the-art RFMs for clinicians even on a laptop.

[188] Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

Zhuorui Zhang,Roger Pallarès-López,Praneeth Namburi,Brian W. Anthony

Main category: cs.CV

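TL;DR: This paper proposes Match4Annotate, a lightweight framework for propagating point and mask annotations within and across ultrasound videos. It fits a SIREN implicit neural representation to DINOv3 features and learns a smooth deformation field, producing high-resolution, spatiotemporally continuous correspondences and state-of-the-art inter-video propagation on clinical ultrasound data.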

Details

Motivation: In specialized domains such as medical imaging, per-frame video annotation is slow and costly, and existing label propagation methods are limited in cross-video generalization, spatiotemporal smoothness, and unified support for point and mask annotations. Method: Match4Annotate fits a SIREN-based implicit neural representation to DINOv3 features at test time to build a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching; it supports both intra-video and inter-video propagation of points and masks. Result: On three clinical ultrasound datasets, Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, and remains competitive with specialized trackers for intra-video propagation. Conclusion: Lightweight, test-time-optimized feature matching pipelines can offer an efficient and accessible solution for scalable annotation workflows. Abstract: Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.

Hila Chefer,Patrick Esser,Dominik Lorenz,Dustin Podell,Vikash Raja,Vinh Tong,Antonio Torralba,Robin Rombach

Main category: cs.CV

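TL;DR: This paper proposes Self-Flow, a self-supervised flow matching paradigm whose Dual-Timestep Scheduling mechanism learns semantic representations and generative ability jointly inside the generative framework, without external models or supervision. It achieves better image, video, and audio generation and follows expected scaling laws.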

Details

Motivation: Diffusion and flow models currently depend on external models for strong semantic representations, which brings separate training, misaligned objectives, and unexpected scaling behavior; the authors trace this to a denoising objective that gives little incentive to learn semantic representations. Method: Self-Flow, a self-supervised flow matching paradigm built on Dual-Timestep Scheduling: heterogeneous noise levels are applied across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs and so learn representations alongside generation. Result: The method generalizes across modalities, supports multi-modal training, follows expected scaling laws, and achieves superior image, video, and audio generation. Conclusion: Making representation learning part of the generative objective removes the dependence on external models, and Self-Flow shows that semantic representations and generative ability can be learned together without external supervision. Abstract: Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
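The abstract says only that Dual-Timestep Scheduling "applies heterogeneous noise levels across tokens". The sketch below is one guess at what that could look like, not the authors' implementation. It assumes two timesteps per sample, a random assignment of each token to one of them, and a linear noising path; the function name and all parameters are hypothetical.

```python
import numpy as np

def dual_timestep_corrupt(tokens, t_low, t_high, high_fraction, rng):
    """Corrupt a token sequence with two different noise levels (assumed scheme).

    tokens:        (n_tokens, dim) clean token features.
    t_low, t_high: the two noise levels in [0, 1].
    high_fraction: probability that a token receives t_high.
    Returns the corrupted tokens, the per-token timestep, and the noise used.
    """
    n_tokens = tokens.shape[0]
    use_high = rng.random(n_tokens) < high_fraction
    t = np.where(use_high, t_high, t_low)[:, None]    # per-token noise level
    noise = rng.standard_normal(tokens.shape)
    x_t = (1.0 - t) * tokens + t * noise              # assumed linear path
    return x_t, t[:, 0], noise

rng = np.random.default_rng(0)
clean = rng.standard_normal((16, 4))
noisy, per_token_t, noise = dual_timestep_corrupt(clean, 0.0, 1.0, 0.5, rng)
```

With the extreme settings used here (0.0 and 1.0), some tokens stay clean and the rest become pure noise, so the heavily corrupted tokens can only be recovered from the context of the lightly corrupted ones. That is the information asymmetry the abstract describes.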

[190] Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education

Yuanji Zhang,Yuhao Huang,Haoran Dou,Xiliang Zhu,Chen Ling,Zhong Yang,Lianying Liang,Jiuping Li,Siying Liang,Rui Li,Yan Cao,Yuhan Zhang,Jiewei Lai,Yongsong Zhou,Hongyu Zheng,Xinru Gao,Cheng Yu,Liling Shi,Mengqin Yuan,Honglong Li,Xiaoqiong Huang,Chaoyu Chen,Jialin Zhang,Wenxiong Pan,Alejandro F. Frangi,Guangzhi He,Xin Yang,Yi Xiong,Linliang Yin,Xuedong Deng,Dong Ni

Main category: cs.CV

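TL;DR: This paper presents an AI system trained on 45,139 ultrasound images for prenatal diagnosis of fetal orofacial clefts. Its sensitivity and specificity exceed 93% and 95% respectively, matching senior radiologists; it raises junior radiologists' sensitivity by more than 6% and supports the development of clinical expertise for rare conditions.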

Details

Motivation: Orofacial clefts are among the most common congenital craniofacial abnormalities, but accurate prenatal detection is difficult because experienced specialists are scarce and the condition is relatively rare; earlier and more reliable diagnosis is needed to enable timely intervention and reduce morbidity. Method: Build an AI system trained on 45,139 ultrasound images from 9,215 fetuses across 22 hospitals; use it as a medical copilot to assist radiologists; run a pilot study with 24 radiologists and trainees to assess its effect on expertise development. Result: The system diagnoses fetal orofacial clefts with sensitivity above 93% and specificity above 95%, matching senior radiologists and substantially outperforming junior radiologists; as a copilot it raises junior radiologists' sensitivity by more than 6%; the pilot study indicates it can accelerate the development of clinical expertise for rare conditions. Conclusion: The system serves both diagnosis and training, offering a scalable solution for settings where experienced radiologists are scarce. Abstract: Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcity of experienced specialists and the relative rarity of the condition. Early and reliable diagnosis is essential to enable timely clinical intervention and reduce associated morbidity. Here we show that an artificial intelligence system, trained on over 45,139 ultrasound images from 9,215 fetuses across 22 hospitals, can diagnose fetal orofacial clefts with sensitivity and specificity exceeding 93% and 95% respectively, matching the performance of senior radiologists and substantially outperforming junior radiologists. When used as a medical copilot, the system raises junior radiologists' sensitivity by more than 6%. Beyond direct diagnostic assistance, the system also accelerates the development of clinical expertise. A pilot study involving 24 radiologists and trainees demonstrated that the model can improve the expertise development for rare conditions. This dual-purpose approach offers a scalable solution for improving both diagnostic accuracy and specialist training in settings where experienced radiologists are scarce.
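For readers less familiar with the two headline metrics, the sketch below computes sensitivity and specificity from binary labels using their standard definitions. The function name and the seven toy cases are made up for illustration and have nothing to do with the study's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = cleft present.

    sensitivity = TP / (TP + FN): fraction of true clefts that are detected.
    specificity = TN / (TN + FP): fraction of normal cases correctly cleared.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical toy cases: three positives, four negatives.
labels = [1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(labels, preds)
```

For a rare condition, sensitivity is the figure that matters most to a screening reader, which is presumably why the copilot result is reported as a gain in junior radiologists' sensitivity.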

[191] SCAN: Visual Explanations with Self-Confidence and Analysis Networks

Gwanghee Lee,Sungyoon Jeong,Kyoungson Jhang

Main category: cs.CV

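TL;DR: This paper proposes SCAN, a universal explainable-AI framework based on an autoencoder and the Information Bottleneck principle. It applies to both CNN and Transformer architectures and generates high-resolution Self-Confidence Maps that improve the clarity and fidelity of explanations.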

Details

Motivation: Current visual explanation methods face a trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones, which leads to abstract or fragmented explanations and makes explanatory power hard to compare across model families. Method: SCAN reconstructs features from a model's intermediate layers with an autoencoder-based approach and, guided by the Information Bottleneck principle, generates a high-resolution Self-Confidence Map that identifies information-rich regions. Result: Across diverse architectures and datasets, SCAN performs strongly on quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%, and qualitatively produces clearer, more object-focused explanations. Conclusion: SCAN provides a unified, universal, and faithful explainability framework that improves model transparency and offers a more reliable tool for understanding the decisions of complex neural networks. Abstract: Explainable AI (XAI) has become essential in computer vision to make the decision-making processes of deep learning models transparent. However, current visual explanation (XAI) methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often results in abstract or fragmented explanations and makes it difficult to compare explanatory power across diverse model families, such as CNNs and Transformers. This paper introduces the Self-Confidence and Analysis Networks (SCAN), a novel universal framework that overcomes these limitations for both convolutional neural network and transformer architectures. SCAN utilizes an AutoEncoder-based approach to reconstruct features from a model's intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. Extensive experiments on diverse architectures and datasets demonstrate that SCAN consistently achieves outstanding performance on various quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitatively, it produces significantly clearer, object-focused explanations than existing methods. By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.

[192] AV-Unified: A Unified Framework for Audio-visual Scene Understanding

Guangyao Li,Xin Wang,Wenwu Zhu

Main category: cs.CV

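TL;DR: This paper proposes AV-Unified, a unified framework that enables joint learning across multiple audio-visual understanding tasks through serialized inputs and outputs, a multi-scale spatiotemporal perception network, and a cross-modal spatial guidance module.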

Details

Motivation: Existing audio-visual understanding tasks (such as event localization, parsing, segmentation, and question answering) are mostly studied in isolation, which makes it hard to understand complex audio-visual scenes comprehensively or to exploit relationships between tasks. Method: The inputs and outputs of all tasks are unified as sequences of discrete tokens; a multi-scale spatiotemporal perception network is designed, comprising a multi-scale temporal perception module and a cross-modal guidance-based spatial perception module; task-specific text prompts are introduced to strengthen task awareness. Result: On multiple benchmark datasets, including AVE, LLP, MUSIC-AVQA, VGG-SS, and AVS, AV-Unified performs strongly on temporal, spatial, and joint spatiotemporal tasks. Conclusion: AV-Unified offers a scalable, unified, and efficient paradigm for multi-task joint learning in audio-visual scene understanding, improving the model's overall understanding of complex dynamic scenes. Abstract: When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as event localization, parsing, segmentation and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model's adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.

[193] Spatial Calibration of Diffuse LiDARs

Nikhil Behari,Ramesh Raskar

Main category: cs.CV

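TL;DR: This paper proposes a spatial calibration method for diffuse direct time-of-flight (dToF) LiDARs and RGB cameras. By scanning a retroreflective patch and applying background subtraction, it estimates each LiDAR pixel's effective field of view and spatial response in the RGB image plane, enabling cross-modal alignment and fusion.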

Details

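Motivation: Each pixel of a diffuse dToF LiDAR covers a wide instantaneous field of view, which violates the single-ray assumption that conventional LiDAR-RGB calibration relies on and causes standard calibration methods to fail. Method: Using a scanned retroreflective patch and background subtraction, the method recovers a response map for each LiDAR pixel in the co-located RGB image plane, from which it estimates the pixel's footprint and relative spatial sensitivity. Result: The method is demonstrated on the ams OSRAM TMF8828 diffuse LiDAR and yields an explicit per-pixel LiDAR-to-RGB correspondence that supports cross-modal alignment and fusion. Conclusion: The procedure is simple and addresses the calibration difficulty caused by the wide field of view of diffuse LiDARs, providing a practical calibration scheme for multimodal perception systems built on such sensors. Abstract: Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. We present a simple spatial calibration procedure that estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane. Using a scanned retroreflective patch with background subtraction, we recover per-pixel response maps that provide an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion. We demonstrate the method on the ams OSRAM TMF8828.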
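
As a rough illustration of the calibration idea in [193], the sketch below turns raw scan measurements for one LiDAR pixel into a normalized response map and a thresholded footprint. The data layout (one photon count per patch position on a grid of RGB locations), the function names, and the 0.5 threshold are assumptions made for this example; the abstract does not specify them.

```python
def response_map(with_patch, background):
    """Background-subtracted, peak-normalized response of one LiDAR pixel.

    with_patch[r][c]: hypothetical photon count measured when the
    retroreflective patch is centered at RGB grid location (r, c).
    background[r][c]: hypothetical count for the same pixel without the patch.
    """
    raw = [
        [max(0.0, s - b) for s, b in zip(s_row, b_row)]
        for s_row, b_row in zip(with_patch, background)
    ]
    peak = max(max(row) for row in raw)
    if peak == 0.0:
        return raw  # pixel never responded to the patch
    # Dividing by the peak gives a relative spatial sensitivity in [0, 1].
    return [[v / peak for v in row] for row in raw]


def footprint(sensitivity, threshold=0.5):
    """Grid locations whose relative sensitivity reaches the threshold."""
    return [
        (r, c)
        for r, row in enumerate(sensitivity)
        for c, v in enumerate(row)
        if v >= threshold
    ]


if __name__ == "__main__":
    scan = [[5.0, 9.0], [5.0, 6.0]]
    dark = [[5.0, 5.0], [5.0, 5.0]]
    sens = response_map(scan, dark)
    print(sens, footprint(sens))
```

Normalizing by the peak keeps only relative sensitivity, which matches the abstract's wording; an absolute radiometric calibration would need additional information that is not described there.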

[194] NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion

Taewon Kang,Ming C. Lin

Main category: cs.CV

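TL;DR: This paper proposes a training-free, constraint-projection method that models linguistic negation as a structured feasibility constraint on semantic guidance in diffusion models. It handles several negation phenomena in one framework and is validated on image and video generation.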

Details

Motivation: Diffusion models handle linguistic negation poorly, and existing approaches mostly rely on heuristics or retraining; a formal, unified, training-free framework is lacking. Method: Negation is modeled as a projection onto a convex constraint set along the semantic guidance direction within diffusion dynamics. Classifier-free guidance is reinterpreted as a semantic update direction, the constraint set is built from linguistic structure, and the pretrained model's parameters are left unchanged. Result: The method achieves robust negation compliance, visual fidelity, and structural coherence on image and video generation; the authors introduce a structured negation-centric benchmark suite and validate the method on object absence, multi-negation composition, scope ambiguity, and related phenomena. Conclusion: The work gives diffusion-based generative models a formal, unified, training-free formulation of linguistic negation that goes beyond representation-level evaluation and advances generative models' handling of complex linguistic logic. Abstract: Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
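
The projection step described in [194] can be pictured with a small sketch. It assumes the simplest possible convex constraint set, a single half-space that forbids the guidance update from moving along a "negated concept" direction; the abstract does not say which constraint sets the authors actually derive, and the names `negated_dir`, `margin`, and `guidance_scale` are illustrative only.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto_halfspace(update, negated_dir, margin=0.0):
    """Euclidean projection of `update` onto {u : <u, negated_dir> <= margin}.

    The half-space is an assumed stand-in for the paper's constraint set.
    """
    norm_sq = dot(negated_dir, negated_dir)
    if norm_sq == 0.0:
        return list(update)
    violation = dot(update, negated_dir) - margin
    if violation <= 0.0:
        return list(update)  # already feasible, leave the update untouched
    scale = violation / norm_sq
    return [u - scale * n for u, n in zip(update, negated_dir)]


def guided_prediction(eps_uncond, eps_cond, negated_dir, guidance_scale=7.5):
    """Classifier-free guidance with the update projected before it is applied."""
    # Standard classifier-free guidance direction: scale * (cond - uncond).
    direction = [guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
    safe = project_onto_halfspace(direction, negated_dir)
    return [u + d for u, d in zip(eps_uncond, safe)]


if __name__ == "__main__":
    print(guided_prediction([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], guidance_scale=2.0))
```

Because the model weights are never touched and only the update vector is modified, a scheme of this shape is training-free in the sense the abstract describes.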

[195] SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference

Ashkan Shahbazi,Elaheh Akbari,Kyvia Pereira,Jon S. Heiselman,Annie C. Benson,Garrison L. H. Johnston,Jie Ying Wu,Nabil Simaan,Michael I. Miga,Soheil Kolouri

Main category: cs.CV

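TL;DR: SurgFormer is a multiresolution gated transformer for data-driven soft-tissue simulation on volumetric meshes. It supports both standard deformation prediction and cut-conditioned, topology-altering simulation, with strong accuracy and near-real-time efficiency.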

Details

Motivation: High-fidelity biomechanical solvers are too costly for interactive surgical simulation, and existing learned volumetric surrogates lack unified modeling of deformation under cut conditions such as resection. Method: SurgFormer builds a fixed volumetric mesh hierarchy and applies multibranch blocks that combine local message passing, coarse global self-attention, and pointwise feedforward updates, fused by learned per-node, per-channel gates that adaptively integrate local and long-range information. A learned cut embedding encodes resection information as an additional input. Two surgical simulation datasets with XFEM-based supervision are introduced (cholecystectomy and appendectomy). Result: Compared with diverse baselines, SurgFormer achieves a favorable balance of accuracy and efficiency and, to the authors' knowledge, is the first to study XFEM-supervised cut-conditioned deformation within the same volumetric pipeline as standard deformation prediction. Conclusion: SurgFormer is a practical, scalable, and unified volumetric surrogate for both standard deformation and topology-altering cases in interactive surgical simulation. Abstract: We introduce SurgFormer, a multiresolution gated transformer for data-driven soft tissue simulation on volumetric meshes. High-fidelity biomechanical solvers are often too costly for interactive use, so we train SurgFormer on solver-generated data to predict nodewise displacement fields at near-real-time rates. SurgFormer builds a fixed mesh hierarchy and applies repeated multibranch blocks that combine local message passing, coarse global self-attention, and pointwise feedforward updates, fused by learned per-node, per-channel gates to adaptively integrate local and long-range information while remaining scalable on large meshes. For cut-conditioned simulation, resection information is encoded as a learned cut embedding and provided as an additional input, enabling a unified model for both standard deformation prediction and topology-altering cases. We also introduce two surgical simulation datasets generated under a unified protocol with XFEM-based supervision: a cholecystectomy resection dataset and an appendectomy manipulation and resection dataset with cut and uncut cases. To our knowledge, this is the first learned volumetric surrogate setting to study XFEM-supervised cut-conditioned deformation within the same volumetric pipeline as standard deformation prediction. Across diverse baselines, SurgFormer achieves strong accuracy with favorable efficiency, making it a practical backbone for both tasks. Code, data, and project page: https://mint-vu.github.io/SurgFormer/

[196] Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

Yuhan Zhou,Mehri Sattari,Haihua Chen,Kewei Sha

Main category: cs.CV

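TL;DR: This paper studies redundancy in multisource multimodal data for autonomous vehicles (AVs) as a data quality (DQ) issue. It models and measures label redundancy in image data and between image and LiDAR data on the nuScenes and Argoverse 2 datasets, and shows that selectively removing redundant labels can improve YOLOv8 object detection (mAP50 gains of up to 0.04 on nuScenes). Redundancy is thus a measurable and actionable DQ factor, and the authors argue for a data-centric approach to AV evaluation.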

Details

Motivation: AV research has focused heavily on algorithm design and has neglected how data quality, in particular redundancy in multisource multimodal data, affects perception performance. Because sensor limitations and environmental variation make redundancy common in practice, a systematic analysis of its impact is needed. Method: Using the nuScenes and Argoverse 2 datasets, the authors model and measure label redundancy in multi-camera image data (cameras with shared fields of view) and in multimodal image-LiDAR data, then evaluate a selective redundant-label removal strategy on the YOLOv8 object detection task. Result: In nuScenes, removing redundant image labels in overlap regions raises mAP50 on three representative overlap regions from 0.66 to 0.70, from 0.64 to 0.67, and from 0.53 to 0.55. In AV2, mAP50 stays near the 0.64 baseline after 4.1%-8.6% of labels are removed. Substantial redundancy also exists between image and LiDAR data. Conclusion: Data redundancy is a measurable and actionable data quality factor in AV perception that directly affects model performance; the work motivates a shift from algorithm-centric to data-centric evaluation and improvement of AV datasets. Abstract: Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects the YOLOv8 object detection task. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view improves detection. In nuScenes, mAP$_{50}$ improves from $0.66$ to $0.70$, from $0.64$ to $0.67$, and from $0.53$ to $0.55$ on three representative overlap regions, while detection on other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, $4.1$-$8.6\%$ of labels are removed, and mAP$_{50}$ stays near the $0.64$ baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV performance. This work highlights the role of redundancy as a data quality factor in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets.
Code, data, and implementation details are publicly available at: https://github.com/yhZHOU515/RedundancyAD

[197] EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

Fangrui Zhu,Yunfeng Xi,Jianmo Ni,Mu Cai,Boqing Gong,Long Zhao,Chen Qu,Ian Miao,Yi Li,Cheng Zhong,Huaizu Jiang,Shwetak Patel

Main category: cs.CV

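TL;DR: This paper proposes EgoReasoner, a framework for several complex reasoning tasks in egocentric 4D video understanding. Using task-adaptive thinking templates and task-aware reward functions, it aligns the reasoning structure with each task's cognitive demands across a supervised stage and a reinforcement learning stage, yielding significant performance gains.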

Details

Motivation: Generic approaches (such as Chain-of-Thought and uniform reinforcement learning) struggle with the different cognitive operations (spatial anchoring, temporal tracking, duration reasoning) required by distinct egocentric 4D reasoning tasks (such as fixture interaction counting and viewpoint-relative localization), resulting in insufficient or unstable performance. Method: EgoReasoner is a two-stage framework. The first stage uses Task-Adaptive Thinking Templates for supervised fine-tuning on structured chain-of-thought traces. The second stage applies task-aware reward functions (verifying entity grounding, temporal alignment, and logical consistency) in reinforcement fine-tuning with GRPO. Result: A 3B-parameter model trained on only 16K samples reaches 37.5% average accuracy on the HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by more than 10 percentage points. Conclusion: Reasoning scaffolds and reward designs driven by task structure are important for egocentric 4D reasoning; EgoReasoner offers a scalable and interpretable approach to video understanding across multiple cognitive dimensions. Abstract: Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
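
To make the second stage of [197] more concrete, here is a minimal sketch of how a task-aware reward could feed GRPO's group-relative advantage. The advantage formula (reward minus group mean, divided by group standard deviation) is the usual GRPO normalization; the three boolean checks, their equal weights, and the function names are assumptions for illustration, since the abstract only names the properties being verified.

```python
from statistics import mean, pstdev


def task_aware_reward(entity_ok, temporal_ok, logic_ok, weights=(1.0, 1.0, 1.0)):
    """Weighted fraction of passed checks, in [0, 1].

    The checks stand in for entity grounding, temporal alignment, and
    logical consistency; how each is verified is not given in the abstract.
    """
    checks = (entity_ok, temporal_ok, logic_ok)
    return sum(w * float(c) for w, c in zip(weights, checks)) / sum(weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical responses sampled for the same question.
    rewards = [
        task_aware_reward(True, True, True),
        task_aware_reward(True, False, True),
        task_aware_reward(False, False, True),
        task_aware_reward(True, True, False),
    ]
    print(group_relative_advantages(rewards))
```

Because advantages are computed within a group of responses to the same prompt, no separate value network is needed, which is one reason GRPO is often chosen for fine-tuning smaller models.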

[198] Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Boqiang Zhang,Lei Ke,Ruihan Yang,Qi Gao,Tianyuan Qu,Rossell Chen,Dong Yu,Leoweiliang

Main category: cs.CV

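TL;DR: This paper proposes Penguin-VL, a lightweight vision-language model whose vision encoder is initialized from a text-only LLM rather than from conventional contrastive pretraining such as CLIP. At compact 2B/8B scales, it matches or surpasses leading VLMs on mathematical reasoning, document understanding, visual knowledge, and multi-perspective video understanding, suggesting that high-quality visual representation matters more than model scaling.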

Details

Motivation: Existing VLMs rely on vision encoders obtained through massive contrastive pretraining (e.g., CLIP), whose coarse, category-level invariances suppress fine-grained visual cues and hurt dense captioning and complex reasoning; large models are also hard to deploy on edge devices. Method: Penguin-VL initializes its vision encoder (Penguin-Encoder) from a text-only LLM, dropping CLIP/SigLIP-style contrastive pretraining to improve visual fidelity and data efficiency, and is evaluated systematically on image and video tasks. Result: Penguin-VL is comparable to Qwen3-VL in mathematical reasoning and surpasses it in document understanding, visual knowledge, and multi-perspective video understanding. With a lightweight architecture, Penguin-Encoder consistently outperforms contrastive-pretrained encoders and better preserves fine-grained spatial and temporal information. Conclusion: How the vision encoder is initialized matters more than model scale; initializing the vision encoder from a text LLM is key to efficient, high-fidelity VLMs and opens a path to edge deployment. Abstract: Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance.
Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL

[199] SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Alejandra Perez,Anita Rau,Lee White,Busisiwe Mlambo,Chinedu Nwoye,Muhammad Abdullah Jamal,Omid Mohareri

Main category: cs.CV

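TL;DR: This paper introduces SUREON, a dataset of surgical-reasoning question-answer pairs extracted automatically from surgical teaching videos, together with two models, SureonVLM and SureonVLM-R1, that substantially outperform general-domain large models on surgical reasoning tasks.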

Details

Motivation: Current surgical AI lacks deeper reasoning abilities such as understanding operative intent, assessing risk, and anticipating next steps, and annotating such reasoning data is extremely costly. Expert narration in surgical teaching videos naturally contains this reasoning but has not been exploited systematically. Method: The authors build SUREON, a large-scale video QA dataset covering 12 categories of surgical reasoning questions; design a multi-agent pipeline that automatically extracts structured QA pairs from 134.7K surgical video clips; and propose SureonVLM (a vision-language model adapted by supervised fine-tuning) and SureonVLM-R1 (a reasoning model trained with Group Relative Policy Optimization). Result: SUREON contains 206.8K QA pairs and an expert-validated benchmark of 354 examples. SureonVLM and SureonVLM-R1 exceed 84% accuracy on the SUREON benchmark and outperform larger general-domain models on standard surgical perception tasks. SureonVLM-R1 shows explicit reasoning behavior, such as inferring operative intent from visual context. Conclusion: The natural-language narration in surgical teaching videos can effectively improve AI surgical reasoning; the SUREON dataset and models offer a new approach to building surgical AI with clinical understanding. Abstract: Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8K QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization.
Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.

[200] SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

Vishal Thengane,Zhaochong An,Tianjin Huang,Son Lam Phung,Abdesselam Bouzerdoum,Lu Yin,Na Zhao,Xiatian Zhu

Main category: cs.CV

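TL;DR: This paper proposes SCOPE, a framework that enriches novel-class prototypes with pseudo-instances extracted from unlabelled background regions of base-training scenes, improving incremental few-shot 3D point cloud segmentation without retraining the backbone or adding parameters.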

Details

Motivation: Existing incremental few-shot segmentation methods for 3D point clouds suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and they overlook a key cue: novel categories often appear as unlabelled background in base-training scenes. Method: SCOPE is a plug-and-play, background-guided prototype enrichment framework. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with a few labelled samples, relevant background prototypes are retrieved and fused with the few-shot prototypes to form enriched representations. Result: On ScanNet and S3DIS, SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61% and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Conclusion: SCOPE uses background information to mitigate catastrophic forgetting and strengthen the discriminative power of few-shot prototypes, offering an efficient and lightweight approach to incremental few-shot 3D point cloud segmentation. Abstract: Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base-training scenes. We introduce SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play background-guided prototype enrichment framework that integrates with any prototype-based 3D segmentation method. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available at https://github.com/Surrey-UP-Lab/SCOPE.
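
The retrieve-and-fuse step in [200] can be sketched as follows. This is a parameter-free toy version under stated assumptions: cosine similarity for retrieval, a fixed `top_k`, and a simple convex combination with weight `alpha`. None of these choices are confirmed by the abstract, which only says that relevant background prototypes are retrieved and fused with few-shot prototypes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def enrich_prototype(few_shot_proto, background_pool, top_k=2, alpha=0.5):
    """Fuse a few-shot prototype with its most similar background prototypes.

    top_k and alpha are hypothetical knobs, not values from the paper.
    """
    ranked = sorted(
        background_pool,
        key=lambda p: cosine(few_shot_proto, p),
        reverse=True,
    )
    retrieved = ranked[:top_k]
    if not retrieved:
        return list(few_shot_proto)  # nothing to fuse with
    dim = len(few_shot_proto)
    # Average the retrieved background prototypes into one context vector.
    context = [sum(p[i] for p in retrieved) / len(retrieved) for i in range(dim)]
    return [alpha * f + (1.0 - alpha) * c for f, c in zip(few_shot_proto, context)]


if __name__ == "__main__":
    pool = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.1]]
    print(enrich_prototype([1.0, 0.0], pool))
```

Nothing here is learned, so the sketch is at least consistent with the abstract's claim that enrichment needs no backbone retraining and no extra parameters.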

[201] BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

Thomas Monninger,Shaoyuan Xie,Qi Alfred Chen,Sihao Ding

Main category: cs.CV

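TL;DR: This paper proposes BEVLM, a framework that connects a spatially consistent, semantically distilled Bird's-Eye View (BEV) representation with large language models (LLMs) to improve 3D spatial reasoning and semantic understanding of cross-view driving scenes. Experiments report a 46% gain in reasoning accuracy and a 29% gain in closed-loop end-to-end driving performance in safety-critical scenarios.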

Details

Motivation: Existing methods feed multi-view, multi-frame images to LLMs independently, which causes redundant computation and spatial inconsistency and hinders accurate 3D spatial reasoning; conventional BEV representations have geometric structure but lack semantic richness. Method: BEVLM jointly models geometry-supervised BEV representations and LLMs: BEV features serve as unified inputs to the LLM to support spatially consistent reasoning, and knowledge distillation transfers the LLM's semantic capability into the BEV representation. Result: Accuracy on cross-view driving scene reasoning improves by 46%, and closed-loop end-to-end driving performance in safety-critical scenarios improves by 29%. Conclusion: BEVLM bridges the spatial structure of BEV and the semantic reasoning of LLMs, offering a more robust, geometrically consistent, and semantically rich perception-decision approach for LLM-driven autonomous driving. Abstract: The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.

[202] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Lijiang Li,Zuwei Long,Yunhang Shen,Heting Gao,Haoyu Cao,Xing Sun,Caifeng Shan,Ran He,Chaoyou Fu

Main category: cs.CV

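TL;DR: This paper proposes Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images.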

Details

Motivation: Existing MLLMs mostly use a conventional autoregressive architecture, while discrete diffusion models have shown potential in vision and other domains, so their feasibility as a new backbone for multimodal systems is worth exploring. Method: Omni-Diffusion uses a unified mask-based discrete diffusion model to directly model the joint distribution over discrete multimodal tokens, supporting understanding and generation for arbitrary modality combinations. Result: On multiple multimodal benchmarks, Omni-Diffusion outperforms or matches existing systems that handle two or more modalities. Conclusion: Discrete diffusion models are a promising backbone for multimodal foundation models and can unify multimodal understanding and generation. Abstract: While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.

[203] Multimodal Large Language Models as Image Classifiers

Nikita Kisel,Illia Volkov,Klara Janouskova,Jiri Matas

Main category: cs.CV

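TL;DR: This paper shows that the measured classification performance of multimodal large language models (MLLMs) depends heavily on the evaluation protocol and annotation quality. After fixing the evaluation protocols, reannotating a subset of ImageNet-1k (ReGT), and quantifying the effect of key design choices, the authors find that MLLM performance has been substantially underestimated and that the gap to supervised models narrows considerably. MLLMs can also assist human annotators and make dataset construction more efficient.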

Details

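Motivation: Prior studies reach conflicting conclusions about MLLM classification performance. The authors attribute this to flawed evaluation protocols and poor ground-truth quality, which call for systematic diagnosis and correction. Method: The authors identify and fix three issues in common evaluation protocols (out-of-list outputs being discarded, weak distractors inflating scores, and poor output mapping in the open-world setting); quantify the effect of design choices such as batch size, image ordering, and text encoder selection; build ReGT, a multilabel reannotation of 625 ImageNet-1k classes; and run a human-in-the-loop annotation case study. Result: With corrected labels from ReGT, MLLM classification accuracy improves by up to 10.8%, and the gap between MLLMs and supervised models narrows substantially. Models less reliant on supervised training signals are the most sensitive to annotation quality. On difficult cases, human annotators confirmed or integrated MLLM predictions about 50% of the time. Conclusion: Much of the reported MLLM underperformance on classification stems from flawed evaluation and noisy labels rather than a genuine model deficiency. Sound evaluation protocols and high-quality annotation are prerequisites for measuring MLLM ability fairly, and MLLMs have practical potential for assisting large-scale data annotation. Abstract: The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocol rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.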
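
One of the protocol issues named in [203], discarding model outputs that fall outside the provided class list, is easy to see in a toy calculation. The sketch below compares two scoring conventions on the same predictions; the data and function name are made up for illustration, and the abstract does not state which convention the authors adopt as the fix.

```python
def accuracy(predictions, labels, class_list, discard_out_of_list):
    """Top-1 accuracy under two conventions for out-of-list predictions.

    discard_out_of_list=True drops such samples from the denominator;
    False keeps them and counts them as errors.
    """
    allowed = set(class_list)
    correct = 0
    counted = 0
    for pred, label in zip(predictions, labels):
        if pred not in allowed and discard_out_of_list:
            continue  # sample silently vanishes from the evaluation
        counted += 1
        correct += int(pred == label)
    return correct / counted if counted else 0.0


if __name__ == "__main__":
    classes = ["cat", "dog"]
    preds = ["cat", "dog", "tabby", "dog"]  # "tabby" is outside the class list
    truth = ["cat", "cat", "cat", "dog"]
    print(accuracy(preds, truth, classes, discard_out_of_list=True))
    print(accuracy(preds, truth, classes, discard_out_of_list=False))
```

On this toy data the two conventions give 2/3 and 1/2 for identical model outputs, which is the kind of protocol-dependent discrepancy the paper argues explains conflicting results in prior work.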

Table of Contents

cs.CL

[1] Verify as You Go: An LLM-Powered Browser Extension for Fake News Detection

[2] Attention Meets Reachability: Structural Equivalence and Efficiency in Grammar-Constrained LLM Decoding

[3] NOTAI.AI: Explainable Detection of Machine-Generated Text via Curvature and Feature Attribution

[4] Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs

[5] The Fragility Of Moral Judgment In Large Language Models

[6] FreeTxt-Vi: A Benchmarked Vietnamese-English Toolkit for Segmentation, Sentiment, and Summarisation

[7] Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis

[8] Cultural Perspectives and Expectations for Generative AI: A Global Survey Approach

[9] Structured Multidimensional Representation Learning for Large Language Models

[10] Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní

[11] CodeScout: Contextual Problem Statement Enhancement for Software Agents

[12] NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

[13] PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models

[14] Tutor Move Taxonomy: A Theory-Aligned Framework for Analyzing Instructional Moves in Tutoring

[15] RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

[16] HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models

[17] ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

[18] ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

[19] Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

[20] VerChol -- Grammar-First Tokenization for Agglutinative Languages

[21] Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

[22] Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions

[23] InfoGatherer: Principled Information Seeking via Evidence Retrieval and Strategic Questioning

[24] Learning Next Action Predictors from Human-Computer Interaction

[25] Addressing the Ecological Fallacy in Larger LMs with Human Context

[26] Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling

[27] Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models

[28] Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL

[29] MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

[30] ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

[31] Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring

[32] Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality

[33] Making Implicit Premises Explicit in Logical Understanding of Enthymemes

[34] Diffusion Language Models Are Natively Length-Aware

[35] A Causal Graph Approach to Oppositional Narrative Analysis

[36] CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

[37] MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

[38] Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models

[39] LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

[40] FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

[41] SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

[42] Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion

[43] The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks

[44] Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI

[45] Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task

[46] From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

[47] Abductive Reasoning with Syllogistic Forms in Large Language Models

[48] PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations

[49] Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

[50] Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

[51] KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection

cs.CV

[52] Edges Are All You Need: Robust Gait Recognition via Label-Free Structure

[53] Thinking with Spatial Code for Physical-World Video Reasoning

[54] From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications

[55] DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces

[56] Adversarial Batch Representation Augmentation for Batch Correction in High-Content Cellular Screening

[57] Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection

[58] Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

[59] Making Reconstruction FID Predictive of Diffusion Generation FID

[60] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

[61] Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

[62] OWL: A Novel Approach to Machine Perception During Motion

[63] MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

[64] Interpretable Perception and Reasoning for Audiovisual Geolocation

[65] Any to Full: Prompting Depth Anything for Depth Completion in One Stage

[66] Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation

[67] From Phase Grounding to Intelligent Surgical Narratives

[68] Full Dynamic Range Sky-Modelling For Image Based Lighting

[69] Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

[70] Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval

[71] Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction

[72] EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition

[73] Training-free Latent Inter-Frame Pruning with Attention Recovery

[74] Margin and Consistency Supervision for Calibrated and Robust Vision Models

[75] Remote Sensing Image Classification Using Deep Ensemble Learning

[76] Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation

[77] VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction