cs.CL

[1] Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

Anna Babarczy,Andras Lukacs,Peter Vedres,Zeteny Bujka

Main category: cs.CL

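TL;DR: This paper examines whether large language models (LLMs) have human-like Theory of Mind (ToM) abilities, that is, the ability to infer others' beliefs, intentions, and emotions from text. GPT-4o performed close to human level on the ToM tasks, whereas earlier and smaller models were sensitive to the number of available cues and to distracting information, which marks the current limits of LLM social cognition.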

Details Motivation: LLMs show social-cognition-like behavior despite lacking socially embodied experience and access to other manifestations of mental representations, which raises the fundamental question of whether they genuinely understand mental states or merely rely on statistical pattern matching. Method: Using an adapted version of a classic text-based test from human ToM research, the study compares five LLMs with human participants and evaluates accuracy and robustness on belief, intention, and emotion reasoning tasks. Result: Performance differed markedly across models: earlier and smaller models were strongly affected by the number of relevant inferential cues and by irrelevant distracting information, while GPT-4o reached high accuracy in all conditions, comparable to the human control group. Conclusion: GPT-4o shows near-human ToM reasoning, suggesting that some advanced LLMs may already have some form of mental-state attribution, although whether this reflects 'genuine understanding' or 'sophisticated statistical fitting' still needs to be determined. Abstract: The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities -- specifically, the ability to infer others' beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.

[2] TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Fangrui Huang,Souhad Chbeir,Arpandeep Khatua,Sheng Wang,Sijun Tan,Kenan Ye,Lily Bailey,Merryn Daniel,Ryan Louie,Sanmi Koyejo,Ehsan Adeli

Main category: cs.CL

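TL;DR: This paper proposes THERAPYGYM, a framework for evaluating and improving the clinical fidelity and safety of psychotherapy chatbots, and releases THERAPYJUDGEBENCH, an expert-annotated validation set. Automated CTRS scoring and multi-label safety-risk annotation, combined with RL-based training, substantially improve model performance against clinical standards.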

Details Motivation: Existing LLM evaluation methods (fluency metrics, preference tests, generic dialogue benchmarks) cannot measure the clinically critical dimensions of psychotherapy, such as adherence to cognitive behavioral therapy (CBT) and the ability to respond to risk. Method: THERAPYGYM (1) uses an automated CTRS pipeline to score fidelity to CBT techniques over multi-turn dialogues; (2) assesses therapy-specific safety risks with a multi-label scheme; (3) releases THERAPYJUDGEBENCH, with 116 dialogues and 1,270 expert ratings, to calibrate LLM-judge bias; and (4) uses CTRS and safety metrics as reward signals for reinforcement learning over diverse patient simulations. Result: Models trained in THERAPYGYM raise their average CTRS from 0.10 to 0.60 under expert ratings (0.16 to 0.59 under LLM judges), a marked improvement in clinical fidelity and safety. Conclusion: THERAPYGYM provides scalable evaluation and training infrastructure for developing psychotherapy chatbots that follow evidence-based practice and are safer to use. Abstract: Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.

[3] How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding

Wei Chen,Guoyang Ju,Yuanyuan Qi

Main category: cs.CL

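TL;DR: This paper proposes the Log-Scale Focal Uncertainty (LSFU) metric and, built on it, an uncertainty-calibrated prompt optimization framework (UCPOF), to address the poor confidence calibration that prior bias causes in multiple-choice tasks. LSFU incorporates label priors as a risk-modulation factor; UCPOF uses it to select exemplars dynamically and to trigger RAG, improving accuracy while reducing retrieval overhead.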

Details Motivation: Conventional uncertainty measures based on output probabilities (e.g., entropy) ignore class prior differences in the pretraining corpus, so they cannot separate spurious confidence caused by priors from true certainty that comes from contextual understanding, which undermines the reliability of prompt optimization. Method: The paper proposes Log-Scale Focal Uncertainty (LSFU), a first-token-based uncertainty metric inspired by focal loss. It uses label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for long-tail classes, and it unifies the measurement scale. On top of LSFU, the uncertainty-calibrated prompt optimization framework (UCPOF) uses first-token uncertainty to select high-quality exemplars dynamically and to trigger RAG on demand. Result: UCPOF improves average accuracy by 6.03% over few-shot baselines and by 5.75% over always-on full RAG, and reduces the average retrieval trigger rate by 50.66%. Conclusion: LSFU characterizes the model's true uncertainty more accurately, and UCPOF's adaptive RAG triggering lowers computational cost while preserving performance, offering a new approach to reliable and efficient LLM prompt optimization. Abstract: With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs' performance on complex tasks. However, LLMs generate outputs autoregressively, leading to inevitable output uncertainty. Since model performance is highly sensitive to prompt design, precise uncertainty measurement is crucial for reliable prompt optimization. For multi-class multiple-choice (understanding) tasks, conventional uncertainty measures (e.g., entropy) based on output probabilities treat all classes equally and ignore class prior differences in pretraining corpora. This failure to distinguish spurious confidence (from priors) from true certainty (from contextual understanding) results in poor confidence calibration. To address this, we propose Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss. LSFU incorporates label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for low-frequency long-tail classes, with a dynamic weighting mechanism unifying the measurement scale. Based on LSFU, we further propose the uncertainty-calibrated prompt optimization framework (UCPOF), which leverages the first token of model outputs to select high-quality exemplars and dynamically optimize prompts. Comprehensive evaluations show UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces the average retrieval trigger rate by 50.66%. By adaptively triggering RAG only for high-uncertainty samples, our framework significantly lowers computational costs while maintaining state-of-the-art performance.
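
The idea of triggering retrieval only for high-uncertainty samples can be sketched as follows. This is a minimal illustration that uses plain normalized entropy over first-token label probabilities as a stand-in; it is not the LSFU formula, which the abstract does not give, and the function names, the 0.5 threshold, and the example probabilities are all assumptions.

```python
import math

def normalized_entropy(label_probs):
    """Entropy of the first-token label distribution, scaled to [0, 1].

    Stand-in uncertainty score (assumption): LSFU additionally modulates
    the score with label priors, which is not reproduced here.
    """
    if len(label_probs) < 2:
        return 0.0
    total = sum(label_probs.values())
    probs = [p / total for p in label_probs.values() if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(label_probs))

def should_retrieve(label_probs, threshold=0.5):
    """Trigger retrieval only when the model looks uncertain (hypothetical threshold)."""
    return normalized_entropy(label_probs) > threshold

# Hypothetical first-token probabilities for a four-option question.
confident = {"A": 0.97, "B": 0.01, "C": 0.01, "D": 0.01}
uncertain = {"A": 0.30, "B": 0.30, "C": 0.20, "D": 0.20}

print(should_retrieve(confident))  # False
print(should_retrieve(uncertain))  # True
```

A prior-aware score would replace `normalized_entropy` while leaving the gating logic unchanged.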

[4] Agentic Framework for Political Biography Extraction

Yifei Zhu,Songpo Yang,Jiangnan Zhu,Junyan Jiang

Main category: cs.CL

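TL;DR: This paper proposes a two-stage, LLM-based "Synthesis-Coding" framework for automatically building large-scale biographical databases of political elites, improving accuracy, scalability, and transparency.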

Details Motivation: Political science research has long been constrained by the high cost of building large structured political datasets and the difficulty of automating the work, which depends heavily on expensive human experts. Method: The paper proposes a two-stage Synthesis-Coding framework: upstream, recursive agentic LLMs search, filter, and consolidate biographical information from heterogeneous web sources; downstream, the consolidated text is mapped into structured data tables. Result: (1) Given high-quality context, LLM coders match or exceed human experts in accuracy; (2) the agentic system extracts more information from web resources than human collective intelligence such as Wikipedia; (3) coding directly from long or multilingual corpora introduces bias, which the synthesis stage mitigates by producing signal-dense representations. Conclusion: The framework offers political science a generalizable, scalable, transparent, and expandable approach to building large-scale databases. Abstract: The production of large-scale political datasets typically demands extracting structured facts from vast piles of unstructured documents or web sources, a task that traditionally relies on expensive human experts and remains prohibitively difficult to automate at scale. In this paper, we leverage Large Language Models (LLMs) to automate the extraction of multi-dimensional elite biographies, addressing a long-standing bottleneck in political science research. We propose a two-stage "Synthesis-Coding" framework for complex extraction tasks: an upstream synthesis stage that uses recursive agentic LLMs to search, filter, and curate biographies from heterogeneous web sources, followed by a downstream coding stage that maps curated biographies into structured dataframes. We validate this framework through three primary results. First, we demonstrate that, when given curated contexts, LLM coders match or outperform human experts in extraction accuracy. Second, we show that in web environments, the agentic system synthesizes more information from web resources than human collective intelligence (Wikipedia). Finally, we diagnose that directly coding from long and multi-language corpora introduces bias that the synthesis stage can alleviate by curating evidence into signal-dense representations. Through comprehensive evaluation, we provide a generalizable, scalable framework for building transparent and expandable large-scale databases in political science.

[5] Controllable Evidence Selection in Retrieval-Augmented Question Answering via Deterministic Utility Gating

Victor P. Unda

Main category: cs.CL

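TL;DR: This paper proposes a training-free, deterministic evidence selection framework (MUE/DUE) that filters candidate text units using explicit signals for semantics, term coverage, conceptual distinctiveness, and redundancy. Every selected unit must independently state the fact or condition the question requires; otherwise no answer is returned.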

Details Motivation: Existing retrieval systems based on vector similarity cannot explain why some highly similar text can serve as evidence while other text cannot, and they tend to select text that is redundant, incomplete, or bound to different conditions than the question. Method: The paper introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), fixed scoring and redundancy-control procedures that evaluate each sentence or record independently for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. A unit is accepted only if it explicitly states the fact, rule, or condition the task requires; units are not merged or expanded. Result: The approach yields compact, auditable evidence sets, draws a clear line between 'relevant text' and 'usable evidence', and avoids the unreliability of fuzzy matching. Conclusion: The deterministic framework improves the interpretability, reliability, and auditability of evidence selection in retrieval-augmented question answering and can be deployed without training. Abstract: Many modern AI question-answering systems convert text into vectors and retrieve the closest matches to a user question. While effective for topical similarity, similarity scores alone do not explain why some retrieved text can serve as evidence while other equally similar text cannot. When many candidates receive similar scores, systems may select sentences that are redundant, incomplete, or address different conditions than the question requires. This paper presents a deterministic evidence selection framework for retrieval-augmented question answering. The approach introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), fixed scoring and redundancy-control procedures that determine evidence admissibility prior to answer generation. Each sentence or record is evaluated independently using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. No training or fine-tuning is required. In the prototype, a unit is accepted only if it explicitly states the fact, rule, or condition required by the task. Units are not merged or expanded. If no unit independently satisfies the requirement, the system returns no answer. This deterministic gating produces compact, auditable evidence sets and establishes a clear boundary between relevant text and usable evidence.
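
The gating behavior described in the abstract can be illustrated with a small sketch. The scoring here (exact term coverage plus Jaccard overlap for redundancy) is an assumed simplification, not the paper's MUE/DUE procedures, and the function names and thresholds are hypothetical. What it does reproduce is the stated contract: units are judged independently, never merged, and the selector returns nothing when no unit qualifies.

```python
import re
from typing import List, Optional

def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a text unit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def coverage(required: set, unit: str) -> float:
    """Fraction of required terms that the unit states explicitly."""
    return len(required & tokens(unit)) / len(required) if required else 0.0

def jaccard(a: str, b: str) -> float:
    """Token overlap between two units, used here as a redundancy signal."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def select_evidence(required_terms: List[str], units: List[str],
                    min_coverage: float = 1.0,
                    max_overlap: float = 0.8) -> Optional[List[str]]:
    """Deterministic gate: fixed thresholds, fixed input order, no merging."""
    required = {t.lower() for t in required_terms}
    accepted: List[str] = []
    for unit in units:
        if coverage(required, unit) < min_coverage:
            continue  # the unit does not state everything the task requires
        if any(jaccard(unit, kept) > max_overlap for kept in accepted):
            continue  # near-duplicate of evidence that was already accepted
        accepted.append(unit)
    return accepted or None  # None means "return no answer"

units = [
    "Refunds are allowed within 30 days of purchase.",
    "Refunds are allowed within 30 days of purchase!",
    "Shipping takes 5 days.",
]
print(select_evidence(["refunds", "30", "days"], units))
print(select_evidence(["warranty"], units))  # None
```

Because every threshold is fixed and no learned component is involved, the same inputs always give the same evidence set, which is what makes the selection auditable.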

[6] DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation

Penghao Liang,Mengwei Yuan,Jianan Liu,Jing Yang,Xianyou Li,Weiran Yan,Yichao Wu

Main category: cs.CL

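TL;DR: DynaRAG is a retrieval-augmented generation (RAG) framework that compensates for the limits of static retrieval by calling external APIs on demand. It combines an LLM reranker, a sufficiency classifier, and the Gorilla v2 API-calling model, and it markedly improves answer accuracy on time-sensitive questions while reducing hallucinations.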

Details Motivation: Traditional RAG relies only on a static corpus and struggles with time-sensitive or changing information needs, which leads to inaccurate answers or hallucinations. Method: The DynaRAG framework comprises (1) an LLM reranker that assesses document relevance; (2) a sufficiency classifier that decides whether an external API call is needed; (3) the Gorilla v2 model, which performs the API call; and (4) FAISS-based schema filtering, which makes API selection more robust. Result: On the CRAG benchmark, DynaRAG significantly improves accuracy on dynamic questions and reduces the hallucination rate. Conclusion: Dynamic-aware routing and selective tool use are key to building reliable question-answering systems for real-world use. Abstract: We present DynaRAG, a retrieval-augmented generation (RAG) framework designed to handle both static and time-sensitive information needs through dynamic knowledge integration. Unlike traditional RAG pipelines that rely solely on static corpora, DynaRAG selectively invokes external APIs when retrieved documents are insufficient for answering a query. The system employs an LLM-based reranker to assess document relevance, a sufficiency classifier to determine when fallback is necessary, and Gorilla v2 -- a state-of-the-art API calling model -- for accurate tool invocation. We further enhance robustness by incorporating schema filtering via FAISS to guide API selection. Evaluations on the CRAG benchmark demonstrate that DynaRAG significantly improves accuracy on dynamic questions, while also reducing hallucinations. Our results highlight the importance of dynamic-aware routing and selective tool use in building reliable, real-world question-answering systems.
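
The control flow in the abstract (rerank, check sufficiency, fall back to an API only when needed) can be sketched as below. This is an assumed outline, not DynaRAG's code: every callable is a hypothetical stand-in passed in by the caller, and the toy implementations exist only so the sketch runs. The real system uses an LLM reranker, a trained sufficiency classifier, Gorilla v2, and FAISS schema filtering, none of which are reproduced here.

```python
from typing import Callable, List

def answer_with_fallback(
    query: str,
    docs: List[str],
    rerank: Callable[[str, List[str]], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    call_api: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> str:
    """Use static documents first; call an external API only if they fall short."""
    context = rerank(query, docs)[:top_k]
    if not is_sufficient(query, context):
        context = context + [call_api(query)]  # selective tool use
    return generate(query, context)

# Toy stand-ins (all hypothetical) so the sketch runs end to end.
api_calls: List[str] = []

def toy_rerank(query: str, docs: List[str]) -> List[str]:
    words = query.lower().split()
    return sorted(docs, key=lambda d: -sum(w in d.lower() for w in words))

def toy_sufficient(query: str, context: List[str]) -> bool:
    return "today" not in query.lower()  # treat "today" as time-sensitive

def toy_api(query: str) -> str:
    api_calls.append(query)
    return "api: live result"

def toy_generate(query: str, context: List[str]) -> str:
    return " | ".join(context)

corpus = ["Paris is the capital of France.", "Berlin is in Germany."]
static_answer = answer_with_fallback("capital of France", corpus,
                                     toy_rerank, toy_sufficient, toy_api, toy_generate)
dynamic_answer = answer_with_fallback("stock price today", corpus,
                                      toy_rerank, toy_sufficient, toy_api, toy_generate)
print(static_answer)
print(dynamic_answer)
```

Only the second query reaches the API, which is the routing behavior the abstract credits for better accuracy on dynamic questions.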

[7] Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models

Toshiyuki Shigemura

Main category: cs.CL

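TL;DR: This empirical study finds that although large language models (LLMs) can reconstruct non-causal solutions from their training data, they never express such content in standard generation tasks, which indicates that generation policies can systematically suppress learned knowledge.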

Details Motivation: The study asks why LLMs, which can reconstruct non-causal, non-implementable solutions from their training data, never output such content in ordinary generation, challenging the default assumption that presence in the training data directly affects output probability. Method: Empirical observation of 300 prompt-response generations covering 3 LLMs, 10 task scenarios, and both narrative and problem-solving contexts; drawing on work on memorization contiguity and alignment-induced discourse priors, the study measures how often non-causal solutions appear in generated content. Result: No non-causal solution was observed in any of the 300 generations (0%, 95% CI: [0%, 1.2%]), while conditional extraction experiments confirmed that the models do have the reconstruction capability. Conclusion: Task-conditioned generation policies can comprehensively suppress content that was learned but does not fit task expectations across many scenarios, showing that the output distribution depends not only on the training data but also on the generation mechanism. Abstract: Large language models (LLMs) demonstrate the capacity to reconstruct and trace learned content from their training data under specific elicitation conditions, yet this capability does not manifest in standard generation contexts. This empirical observational study examines the expression of non-causal, non-implementable solution types across 300 prompt-response generations spanning narrative and problem-solving task contexts. Drawing on recent findings regarding memorization contiguity and alignment-induced discourse priors, we document a systematic dissociation between learned capability and expressed output. Across three distinct LLMs, ten task scenarios, and both creative narrative and practical advisory contexts, we documented zero instances of non-causal solution frames in generated outputs (0%, 95% CI: [0%, 1.2%]), despite verified reconstruction capability under conditional extraction. These findings challenge the prevailing assumption that training data presence directly predicts output probability, demonstrating instead that task-conditioned generation policies can comprehensively suppress learned content across diverse contexts. The results offer implications for understanding generation dynamics, output distribution control, and the behavioral boundaries of contemporary LLMs.
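
The reported interval for 0 occurrences in 300 generations can be checked with a short calculation. The sketch assumes an exact (Clopper-Pearson) two-sided 95% interval, which the abstract does not name; for zero observed events, its upper bound has the closed form 1 - (alpha/2)^(1/n).

```python
def zero_event_upper_bound(n: int, alpha: float = 0.05) -> float:
    """Upper limit of the two-sided exact (Clopper-Pearson) interval
    when 0 of n trials show the event. The lower limit is 0."""
    return 1.0 - (alpha / 2.0) ** (1.0 / n)

upper = zero_event_upper_bound(300)
print(f"{upper:.1%}")  # 1.2%
```

Under that assumption the bound is about 1.2%, which agrees with the interval in the abstract: zero observations in 300 trials rule out rates much above 1%, not a rate of exactly zero.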

[8] Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction

Hui Wen Goh,Jonas Mueller

Main category: cs.CL

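TL;DR: CONSTRUCT is a method for scoring the trustworthiness of LLM structured outputs in real time. It can localize erroneous fields, works with any black-box LLM, and requires neither training data nor custom deployment.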

Details Motivation: Structured outputs from current LLMs contain sporadic errors that hold back enterprise AI adoption, so a general, fine-grained trustworthiness scoring method that needs no labels is required. Method: CONSTRUCT scores the overall structured output and each field separately, without supervision, to quantify trustworthiness. It works with any LLM (including black-box APIs without logprobs) and supports complex nested JSON schemas. Result: On one of the first high-quality public structured-output benchmarks (four datasets), CONSTRUCT detects errors from models such as Gemini 3 and GPT-5 with significantly higher precision and recall than other scoring methods. Conclusion: CONSTRUCT offers a practical, plug-and-play trustworthiness score for structured outputs that can direct human review and improve the reliability and efficiency of enterprise LLM applications. Abstract: Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI efforts from realizing their immense potential. We present CONSTRUCT, a method to score the trustworthiness of LLM Structured Outputs in real-time, such that lower-scoring outputs are more likely to contain errors. This reveals the best places to focus limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within an LLM Structured Output, helping reviewers quickly identify which parts of the output are wrong. Our method is suitable for any LLM (including black-box LLM APIs without logprobs such as reasoning models and Anthropic models), does not require labeled training data nor custom model deployment, and works for complex Structured Outputs with many fields of diverse types (including nested JSON schemas). We additionally present one of the first public LLM Structured Output benchmarks with reliable ground-truth values that are not full of mistakes. Over this four-dataset benchmark, CONSTRUCT detects errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.

[9] Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

Trishita Dhara,Siddhesh Sheth

Main category: cs.CL

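TL;DR: This paper uses two post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, to analyze the decision logic of a RoBERTa model for harmful content detection. It exposes systematic failure modes on borderline, contextual, and politically sensitive content, and it stresses the diagnostic value of explainable AI for transparency and human review rather than for raising performance.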

Details Motivation: Existing harmful content detection systems lack explainability, and it is especially hard to understand why a model decides as it does on borderline, context-dependent, and politically sensitive content; current research focuses on accuracy and neglects in-depth diagnosis of errors. Method: A RoBERTa classifier is trained on the Civil Comments dataset, and two post-hoc explanation methods, Shapley Additive Explanations (SHAP) and Integrated Gradients (IG), are used to compare correct predictions with systematic failure cases, supported by qualitative case studies that identify typical failure modes. Result: Although the model reaches an AUC of 0.93 and an accuracy of 0.94, the explanation analysis exposes limitations: IG tends toward diffuse contextual attributions, while SHAP focuses on explicit lexical cues; the divergence between the two shows up in both false negatives and false positives; common failure modes include indirect toxicity, lexical over-attribution, and misjudged political discourse. Conclusion: The core value of explainable AI lies in giving human moderators a transparent, diagnosable basis for decisions and in exposing model uncertainty and flawed logic; it should be positioned as a transparency and diagnostic tool rather than merely a means of improving performance. Abstract: Although automated harmful content detection systems are frequently used to monitor online platforms, moderators and end users frequently cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little focus has been placed on comprehending why neural models identify content as harmful, especially when it comes to borderline, contextual, and politically sensitive situations. In this work, a neural harmful content detection model trained on the Civil Comments dataset is analyzed from an explainability-driven perspective. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients appear to extract more diffuse contextual attributions while Shapley Additive Explanations extract more focused attributions on explicit lexical cues. The consequent divergence in their outputs manifests in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, or political discourse. The results suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and increasing the interpretable rationale behind automated decisions. Most importantly, this work highlights the role of explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.

[10] MineDraft: A Framework for Batch Parallel Speculative Decoding

Zhenwei Tang,Arun Verma,Zijian Zhou,Zhaoxuan Wu,Alok Prakash,Daniela Rus,Bryan Kian Hsiang Low

Main category: cs.CL

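TL;DR: This paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework that hides drafting latency by overlapping drafting with verification. It improves throughput by up to 75% and end-to-end latency by up to 39%, and it is implemented as a plugin for vLLM.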

Details Motivation: Standard speculative decoding (SD) is limited by the strictly sequential execution of its drafting and verification stages, which creates a performance bottleneck. Method: MineDraft uses a batch-parallel design that maintains two batches of requests so that drafting for one batch overlaps with verification for the other; a theoretical analysis supports its efficiency advantage. Result: Experiments show that MineDraft improves throughput by up to 75% and end-to-end latency by up to 39% over standard SD; it has been implemented as a vLLM plugin, demonstrating its practicality for production inference systems. Conclusion: By running speculative decoding in a batch-parallel fashion, MineDraft relieves the sequential bottleneck of standard SD and makes inference more efficient and practical. Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
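
Why overlapping two batches hides drafting latency can be seen in a toy timing model. This is an illustrative sketch, not the paper's theoretical analysis or the MineDraft implementation: it assumes fixed per-round drafting and verification times, assumes the two stages can run concurrently without slowing each other, and uses made-up numbers.

```python
def serial_time(rounds: int, t_draft: float, t_verify: float, batches: int = 2) -> float:
    """Standard SD: each batch alternates draft -> verify, with nothing overlapped."""
    return batches * rounds * (t_draft + t_verify)

def overlapped_time(rounds: int, t_draft: float, t_verify: float) -> float:
    """Two batches: while one batch is verified, the other is drafted.

    Only the very first draft and the very last verification run alone;
    the 2 * rounds - 1 steps in between each pair one draft with one
    verification and cost the slower of the two.
    """
    step = max(t_draft, t_verify)
    return t_draft + (2 * rounds - 1) * step + t_verify

# Hypothetical costs: drafting takes 1 time unit, verification 3, for 10 rounds.
print(serial_time(10, 1.0, 3.0))      # 80.0
print(overlapped_time(10, 1.0, 3.0))  # 61.0
```

In this model the long-run speedup approaches (t_draft + t_verify) / max(t_draft, t_verify), so the gain is largest when the two stages take similar time; real systems add scheduling and memory effects that the model ignores.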

[11] An Agentic System for Schema Aware NL2SQL Generation

David Onyango,Naseef Mansoor

Main category: cs.CL

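TL;DR: This paper proposes a schema-based agentic system that uses small language models (SLMs) as the primary agents and selectively falls back to a large language model (LLM) when errors are detected. It substantially lowers computational cost and privacy risk and performs the NL2SQL task efficiently and cheaply on the BIRD benchmark.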

Details Motivation: Existing NL2SQL methods depend on large language models (LLMs), which brings high computational overhead, data privacy concerns, and deployment difficulties in resource-constrained environments. Method: The paper builds a multi-agent system grounded in the database schema, with small language models (SLMs) as the main execution units and an error-detection mechanism that triggers selective fallback to an LLM. Result: On the BIRD benchmark the system reaches 47.78% execution accuracy and a validation efficiency score of 51.05%; about 67% of queries are resolved by local SLMs, the average cost per query falls to 0.0085 (versus 0.094 for LLM-only systems), and total cost drops by more than 90%. Conclusion: The SLM-first architecture with on-demand LLM fallback keeps reasonable performance while making NL2SQL systems considerably more practical, economical, and privacy-preserving for real-world deployment. Abstract: The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non-expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real-world deployability in resource-constrained environments. To address these challenges, we propose a schema-based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. Because the LLM is invoked only upon detection of errors in SLM-generated output, the proposed system significantly minimizes computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, achieving over 90% cost reduction compared to LLM-centric baselines as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of 0.0085 compared to 0.094 for LLM-only systems, achieving near-zero operational costs for locally executed queries. [GitHub repository: https://github.com/mindslab25/CESMA.]
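
The "over 90% cost reduction" claim follows directly from the two per-query costs in the abstract, as the short check below shows. The abstract gives no currency unit, so the figures are treated as unitless; the variable names are illustrative.

```python
slm_first_cost = 0.0085  # reported average cost per query, proposed system
llm_only_cost = 0.094    # reported average cost per query, LLM-only systems

reduction = 1.0 - slm_first_cost / llm_only_cost
print(f"{reduction:.1%}")  # 91.0%
```

The ratio is independent of the unit, so the roughly 91% figure holds whichever currency the costs were measured in.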

[12] BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity

Harshita Diddee,Gregory Yauney,Swabha Swayamdipta,Daphne Ippolito

Main category: cs.CL

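TL;DR: This paper presents BenchBrowser, a tool that retrieves evaluation items relevant to natural language use cases, helping practitioners diagnose weaknesses in the content validity and convergent validity of benchmarks and quantify the gap between practical goals and what is actually tested.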

Details Motivation: High-level metadata for existing language model benchmarks is too coarse to reflect what they actually test, which makes it hard to verify whether a benchmark matches practitioners' real needs and can create the illusion that a model is competent when it fails on key facets. Method: The paper introduces BenchBrowser, a retrieval system over 20 benchmark suites that surfaces specific evaluation items for a natural language use case; a human study validates its retrieval precision. Result: BenchBrowser helps practitioners identify problems with content validity (incomplete coverage of a capability) and convergent validity (unstable rankings when measuring the same capability), and it provides quantifiable evidence. Conclusion: BenchBrowser helps reveal and quantify the critical gap between practitioner intent and what benchmarks actually test, supporting more trustworthy and transparent model evaluation. Abstract: Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser, thus, helps quantify a critical gap between practitioner intent and what benchmarks actually test.

[13] Evaluating FrameNet-Based Semantic Modeling for Gender-Based Violence Detection in Clinical Records

Lívia Dutra,Arthur Lorenzi,Frederico Belcavello,Ely Matos,Marcelo Viridiano,Lorena Larré,Olívia Guaranha,Erik Santos,Sofia Reinach,Pedro de Paula,Tiago Torrent

Main category: cs.CL

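TL;DR: This study examines how well FrameNet-based semantic annotation of open-text fields in electronic medical records identifies patterns of gender-based violence (GBV). Models that include semantic annotation clearly outperform models that use structured data alone, with an F1 improvement of more than 0.3.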

Details Motivation: Healthcare professionals in Brazil are legally required to report cases of gender-based violence, but difficulty in identifying abuse and limited integration between information systems lead to serious underreporting. Method: Open-text fields in electronic medical records are semantically annotated with FrameNet, and an SVM classifier is trained on three inputs: (1) frame-annotated text only; (2) frame-annotated text plus parameterized data; (3) parameterized data only. Result: Models that include semantic annotation improve F1 by more than 0.3 over models based on structured data alone; quantitative and qualitative analyses both confirm that domain-specific semantic representations carry meaningful signal beyond structured demographic data. Conclusion: Semantic analysis of clinical narratives can strengthen early identification of gender-based violence and support better-informed public health interventions. Abstract: Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.

[14] How LLMs Distort Our Written Language

Marwa Abdulhai,Isadora White,Yanming Wan,Ibrahim Qureshi,Joel Leibo,Max Kleiman-Weiner,Natasha Jaques

Main category: cs.CL

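TL;DR: This paper studies how large language models (LLMs) systematically change the meaning of human text when used as writing assistants, finding that they affect not only style and tone but also the intended meaning. Through a user study, a revision experiment on pre-LLM essays, and an analysis of real AI-generated peer reviews, it shows that LLM use leads to more neutral essays, less originality, and weaker scientific review standards.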

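Details Motivation: The paper investigates whether and how LLMs used for writing assistance implicitly change the semantic content of human expression, not just its surface style, with particular attention to possible long-term effects on education, creative work, and research evaluation. Method: Three empirical components are combined: (1) a human user study analyzing how the intensity of LLM use affects the neutrality, creativity, and personal voice of writing; (2) an experiment on a set of human-written essays collected in 2021, in which an LLM revises each essay according to expert feedback, including a condition limited to grammar edits, to quantify semantic shift; (3) an analysis of the 21% of peer reviews at a recent top AI conference that were AI-generated, comparing their scores and the dimensions they emphasize (such as clarity and significance) with other reviews. Result: (1) Extensive LLM use led to a nearly 70% increase in essays that stayed neutral, and significantly more heavy users reported that the writing was less creative and not in their voice; (2) even when restricted to grammar edits, LLMs substantially changed the meaning of the original text; (3) AI-generated reviews placed less weight on clarity and significance and gave scores that were on average one point higher. Conclusion: LLM writing assistance produces a systematic semantic shift that is out of line with the benefit users perceive, and its long-term structural effects on cultural expression and scientific institutions deserve attention. Abstract: Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.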

[15] Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations

Maria Andueza Rodriguez,Marie Candito,Richard Huyghe

Main category: cs.CL

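TL;DR: This study compares human and LLM word associations to assess how human-like the internal lexicon of large language models (LLMs) is. The models reflect human lexical regularities such as frequency and concreteness but differ systematically from humans in response variability and typicality, and both model size and sampling temperature have a marked effect.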

Details Motivation: The study asks whether the lexical knowledge inside large language models (LLMs) is human-like, and in particular how closely their lexicon matches human association patterns. Method: Using English cue-response pairs from the SWOW dataset, human responses are compared with word associations generated by three LLMs (Mistral-7B, Llama-3.1-8B, Qwen-2.5-32B) at multiple temperature settings; the analysis covers the influence of lexical factors such as word frequency and concreteness and quantifies response variability and typicality. Result: All models reproduce human trends for frequency and concreteness. Larger models such as Qwen produce highly typical, low-variability responses (resembling a single prototypical human participant), while smaller models such as Mistral and Llama produce more varied but less typical responses; raising the temperature increases variability and lowers typicality. Conclusion: LLM lexical representations resemble the human lexicon in basic statistical regularities but differ in the trade-off between variability and typicality; their behavior depends strongly on model size and temperature, so these variables need to be controlled carefully when probing lexical representations. Abstract: Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single "prototypical" human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.
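
The temperature effect noted in the abstract has a simple mechanical basis in softmax sampling, sketched below. The logits are hypothetical scores for four candidate associates, not values from the paper, and entropy and top-choice probability are used here only as rough proxies for variability and typicality.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more spread-out responses."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [3.0, 1.5, 0.5, 0.0]  # hypothetical scores for four candidate responses
for temperature in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, round(max(probs), 3), round(entropy(probs), 3))
```

As the temperature rises, probability mass moves away from the top-ranked response, so sampled associations vary more and the most typical response is chosen less often.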

[16] GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

Ja Young Lee,Mírian Silva,Mohamed Nasr,Shonda Witherspoon,Enzo Bozzani,Veronique Demers,Radha Ratnaparkhi,Hui Wu,Sara Rosenthal

Main category: cs.CL

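TL;DR: This paper presents GRAFITE, a platform for continuous evaluation of large language models (LLMs). It builds a repository of issues from user feedback and runs quality assurance tests with LLM-as-a-judge, supporting multi-model comparison and regression detection across releases, to counter the performance inflation caused by benchmark contamination.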

Details Motivation: Benchmark data exposed during training contaminates later evaluation of large language models and inflates measured performance, so a sustainable, dynamic, contamination-resistant evaluation mechanism is needed. Method: The GRAFITE platform collects user feedback into a growing issue repository, provides a QA test pipeline based on LLM-as-a-judge, and supports side-by-side evaluation of multiple models and regression analysis across releases. Result: The outcome is a publicly available open-source evaluation platform (on GitHub) that supports ongoing issue collection, automated testing, and side-by-side comparison for detecting model regressions and capability differences. Conclusion: GRAFITE offers a sustainable, community-driven, contamination-resistant approach to LLM evaluation that can make model assessment more reliable and transparent. Abstract: Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.

[17] CWoMP: Morpheme Representation Learning for Interlinear Glossing

Morris Alper,Enora Rice,Bhargav Shandilya,Alexis Palmer,Lori Levin

Main category: cs.CL

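TL;DR: This paper proposes CWoMP, which uses contrastive learning to align words and morphemes in a shared embedding space and generates glosses autoregressively from an updatable lexicon, yielding automatic IGT generation that is efficient, interpretable, and improvable at inference time.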

Details Motivation: Existing automated IGT methods treat glosses as character sequences and ignore their compositional morpheme structure; manual IGT annotation is slow and laborious, so low-resource languages in particular need efficient and accurate automation. Method: Proposes CWoMP (Contrastive Word-Morpheme Pretraining): (1) a contrastively trained encoder aligns words in context with their constituent morphemes; (2) an autoregressive decoder generates the gloss sequence from a mutable lexicon of morpheme embeddings. Result: Outperforms existing methods on a range of low-resource languages, with especially large gains in extremely low-resource settings, while being more efficient in both training and inference. Conclusion: By modeling morphemes as atomic form-meaning units, CWoMP combines performance, efficiency, and interpretability, and supports lexicon expansion at inference time without retraining, offering a new approach to automating IGT for low-resource languages. Abstract: Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure. We propose CWoMP (Contrastive Word-Morpheme Pretraining), which instead treats morphemes as atomic form-meaning units with learned representations. A contrastively trained encoder aligns words-in-context with their constituent morphemes in a shared embedding space; an autoregressive decoder then generates the morpheme sequence by retrieving entries from a mutable lexicon of these embeddings. Predictions are interpretable--grounded in lexicon entries--and users can improve results at inference time by expanding the lexicon without retraining. We evaluate on diverse low-resource languages, showing that CWoMP outperforms existing methods while being significantly more efficient, with particularly strong gains in extremely low-resource settings.

[18] How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence

Alex Anvi Eponon,Ildar Batyrshin,Christian E. Maldonado-Sifuentes,Grigori Sidorov

Main category: cs.CL

TL;DR: This paper critiques how AI paradigms inherited structural limitations from their psychological roots (behaviorism → RL, cognitivism → DL, constructivism → curriculum learning), and proposes ReSynth—a trimodular framework (Intellect/Identity/Memory) to enable systematic, adaptable AGI.

Details Motivation: AI paradigms have inherited not only strengths but also deep structural limitations from their founding psychological theories; current approaches fail to support principled knowledge update, internal knowledge structure, or formal construction of new understanding—hindering progress toward AGI. Method: Analyzes historical genealogy from psychology to AI; critiques limitations using philosophical and cognitive science arguments (e.g., Aizawa’s critique, systematicity debate); introduces ReSynth—a trimodular neuro-symbolic architecture separating reasoning (Intellect), purpose (Identity), and knowledge (Memory). Result: ReSynth is proposed as a novel architectural framework that enforces systematicity by design, enabling structured knowledge composition, transparent updates, and goal-directed reasoning—addressing core limitations of RL, deep learning, and integrative AI methods. Conclusion: True adaptability in AGI requires representational architectures where systematic behavior emerges necessarily—not accidentally—and ReSynth provides a psychologically grounded, formally motivated path toward that goal. Abstract: The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. 
The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.

[19] From Noise to Signal: When Outliers Seed New Topics

Evangelia Zve,Gauvain Bourgne,Benjamin Icard,Jean-Gabriel Ganascia

Main category: cs.CL

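TL;DR: This paper proposes a temporal taxonomy for identifying the trajectories of news documents during topic evolution, in particular "anticipatory outliers" that foreshadow emerging topics, and validates it on a French news corpus about the hydrogen economy.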

Details Motivation: Conventional dynamic topic modeling treats outliers as noise, but the authors argue that some outliers are early signals of emerging topics and deserve systematic modeling. Method: Defines a temporal taxonomy describing how news documents relate to topic formation over time, distinguishing anticipatory outliers, reinforcing documents, and isolated documents; implements and evaluates it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models. Result: On the HydroNewsFr corpus, a small subset of anticipatory outliers shows high agreement across models, which increases confidence in the labels; qualitative case studies confirm that the taxonomy captures topic-evolution processes such as anticipation, initiation, and drift. Conclusion: Outliers are not merely noise but important clues to how topics originate and evolve; the proposed taxonomy offers an interpretable, verifiable perspective on weak-signal detection and dynamic topic modeling. Abstract: Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.

[20] Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Tianhui Zhang,Bei Peng,Danushka Bollegala

Main category: cs.CL

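TL;DR: This paper proposes a two-stage method for building CommonSyn, the first synthetic dataset for diversified generative commonsense reasoning (GCR), to ease the shortage of large-scale, high-quality, diverse commonsense training data; models fine-tuned on it outperform baselines in both diversity and generation quality.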

Details Motivation: Existing generative commonsense reasoning (GCR) datasets are small, cover a narrow range of scenarios, and are costly to annotate, so they cannot support the training of diversified commonsense generators. Method: Proposes a two-stage synthetic data construction method that produces CommonSyn, the first large-scale, high-quality, diverse synthetic GCR dataset. Result: Across LLMs of several sizes, models fine-tuned on CommonSyn improve both generation diversity and quality compared with vanilla models and with models fine-tuned on human-crafted data. Conclusion: Synthetic data can effectively fill the training-resource gap for diversified commonsense reasoning, and CommonSyn provides a feasible and effective data foundation for this line of work. Abstract: Conversational agents are required to respond to their users not only with high quality (i.e., commonsense-bearing) responses, but also considering multiple plausible alternative scenarios, reflecting the diversity in their responses. Despite the growing need to train diverse commonsense generators, the progress of this line of work has been significantly hindered by the lack of large-scale high-quality diverse commonsense training datasets. Due to the high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first-ever synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data improve both generation diversity and quality compared with vanilla models and with models fine-tuned on a human-crafted dataset, across Large Language Models (LLMs) of different sizes.

[21] PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Ruishuo Chen,Yu Chen,Zhuoran Li,Longbo Huang

Main category: cs.CL

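TL;DR: This paper proposes PowerFlow, a framework that reformulates unsupervised fine-tuning as a distribution matching problem, uses GFlowNet with a length-aware trajectory-balance objective, and targets α-power distributions to steer LLMs toward either logical reasoning or creative expression; it outperforms existing methods in multiple experiments.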

Details Motivation: Current unsupervised reinforcement learning methods rely on heuristic intrinsic rewards, which lack a well-defined theoretical optimization target and are prone to degenerative biases. Method: Proposes the PowerFlow framework, which casts GFlowNet as an amortized variational sampler for unnormalized densities, designs a length-aware Trajectory-Balance objective, and targets α-power distributions to steer the shape of the LLM output distribution. Result: PowerFlow consistently outperforms existing RLIF methods across tasks, matching or even exceeding supervised GRPO; in aligned models it mitigates over-sharpening and improves generation diversity and quality together. Conclusion: PowerFlow offers a principled paradigm for unsupervised fine-tuning that can elicit either the reasoning or the creative side of LLMs, and it shifts the Pareto frontier in creative generation. Abstract: Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
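
As a toy illustration of the sharpening and flattening described in the PowerFlow abstract, the sketch below raises a small discrete distribution to a power and renormalizes it. This is not PowerFlow itself: the paper targets distributions over generated sequences and trains with a GFlowNet objective, whereas `power_distribution`, `entropy`, and the three-outcome `base` distribution are hypothetical names and values chosen only for illustration.

```python
import math


def power_distribution(probs, alpha):
    """Return p_i**alpha / sum_j p_j**alpha for a discrete distribution."""
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [x / total for x in powered]


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical three-outcome distribution (illustrative values only).
base = [0.5, 0.3, 0.2]
sharp = power_distribution(base, 2.0)  # alpha > 1 concentrates probability mass
flat = power_distribution(base, 0.5)   # alpha < 1 spreads probability mass
```

In this toy case, `sharp` has lower entropy than `base` and `flat` has higher entropy, which mirrors the abstract's contrast between the $\alpha > 1$ and $\alpha < 1$ regimes.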

[22] AutoScreen-FW: An LLM-based Framework for Resume Screening

Zhelin Xu,Shuhei Yamamoto,Atsuyuki Morishima

Main category: cs.CL

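TL;DR: This paper proposes AutoScreen-FW, a locally deployable automated resume screening framework based on open-source large language models (LLMs); it improves screening through representative sample selection and in-context learning, protecting privacy while reducing recruiters' workload.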

Details Motivation: Corporate recruiters must screen large numbers of resumes in limited time, which is burdensome and risks overlooking suitable candidates; existing LLM-based methods that rely on commercial models pose privacy risks, and there are no public resume datasets with evaluation labels to guide learning. Method: Proposes the AutoScreen-FW framework, which uses several strategies to select a small set of representative resume samples and combines them with a persona description and evaluation criteria for in-context learning, so that an open-source LLM acts as a career advisor and evaluates unseen resumes. Result: Open-source LLMs driven by the framework consistently outperform GPT-5-nano under multiple ground truths and surpass GPT-5-mini under one ground-truth setting; under the other settings they are slightly weaker than GPT-5-mini but substantially faster per resume. Conclusion: AutoScreen-FW has potential for local deployment, improving screening efficiency while protecting data privacy and reducing recruiters' burden. Abstract: Corporate recruiters often need to screen many resumes within a limited time, which increases their burden and may cause suitable candidates to be overlooked. To address these challenges, prior work has explored LLM-based automated resume screening. However, some methods rely on commercial LLMs, which may pose data privacy risks. Moreover, since companies typically do not make resumes with evaluation results publicly available, it remains unclear which resume samples should be used during learning to improve an LLM's judgment performance. To address these problems, we propose AutoScreen-FW, an LLM-based framework for local, automated resume screening. AutoScreen-FW uses several methods to select a small set of representative resume samples. These samples are used for in-context learning together with a persona description and evaluation criteria, enabling open-source LLMs to act as a career advisor and evaluate unseen resumes. Experiments with multiple ground truths show that the open-source LLM judges consistently outperform GPT-5-nano. Under one ground-truth setting, they also surpass GPT-5-mini. Although they are slightly weaker than GPT-5-mini under other ground-truth settings, they run substantially faster per resume than commercial GPT models. These findings indicate the potential for deploying AutoScreen-FW locally in companies to support efficient screening while reducing recruiters' burden.

[23] TopoChunker: Topology-Aware Agentic Document Chunking Framework

Xiaoyu Liu

Main category: cs.CL

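TL;DR: This paper proposes TopoChunker, an agentic document chunking framework that builds a Structured Intermediate Representation (SIR) to preserve a document's intrinsic topology and reduce the semantic fragmentation caused by linear chunking; on several benchmarks it improves RAG retrieval and generation performance while lowering token overhead.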

Details Motivation: Existing document chunking methods for RAG force linearization, which destroys a document's inherent topological hierarchy, causes semantic fragmentation, and degrades downstream retrieval quality. Method: Proposes the TopoChunker framework, consisting of an Inspector Agent (which dynamically selects cost-optimized extraction paths) and a Refiner Agent (which performs capacity auditing and topological context disambiguation), and maps heterogeneous documents onto a Structured Intermediate Representation (SIR) that explicitly preserves cross-segment dependencies. Result: State-of-the-art results on GutenQA and GovReport: an 8.0% absolute gain in generation accuracy over the strongest LLM-based baseline, 83.26% Recall@3, and a 23.5% reduction in token overhead. Conclusion: TopoChunker provides a scalable, efficient approach to structure-aware RAG and supports the view that explicitly modeling document topology is important for RAG quality. Abstract: Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.

[24] TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai,Qiang Zhang,Hanqing Zeng,Yunkai Zhang,Dipesh Tamboli,Xiangjun Fan,Zhuokai Zhao,Lizhu Zhang

Main category: cs.CL

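TL;DR: This paper proposes Token-level Adaptive Routing (TARo), a lightweight method that steers frozen large language models toward structured reasoning at inference time; it trains fine-grained reward models and adds a learnable token-level router, improving reasoning in mathematics, clinical, and instruction-following settings without modifying model parameters.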

Details Motivation: Large language models reason well but usually need expensive post-training; existing test-time alignment methods mainly target preference alignment rather than reasoning, so a lightweight, general method for reasoning alignment is needed. Method: Proposes Token-level Adaptive Routing (TARo): (1) train fine-grained reward models on step-wise mathematical reasoning traces; (2) add a learnable token-level router that dynamically controls how strongly the reward model guides the base model. Result: TARo improves mathematical reasoning by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods; it also improves clinical reasoning (MedXpertQA) and instruction following (AlpacaEval) and transfers from small to large backbones without retraining. Conclusion: TARo extends test-time alignment from preference optimization to robust, cross-domain structured reasoning, offering an efficient way to strengthen the reasoning of frozen LLMs. Abstract: Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.

[25] Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Masayuki Kawarada,Tatsuya Ishigaki,Hiroya Takamura

Main category: cs.CL

TL;DR: This paper introduces a benchmark to study task interference in multimodal LLMs, revealing that interference is highly directional—especially severe when switching from text-only to image-based tasks—and most strongly driven by modality mismatch.

Details Motivation: Task interference has been studied only in text-only dialogue systems, despite the rise of multimodal dialogue systems; there is a need to evaluate this phenomenon in multimodal LLMs. Method: The authors introduce a new benchmark covering six tasks across text and vision, with systematic variation along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Result: Experiments show task interference is highly directional (severe drop from text-only to image-based targets, minimal reverse degradation); interference intensifies with co-occurring mismatches; modality mismatch is the strongest driver, followed by answer format mismatch, while reasoning mismatch has minimal effect. Conclusion: Task interference in multimodal LLMs is not symmetric and is predominantly shaped by modality differences, highlighting the need for modality-aware training and evaluation strategies. Abstract: Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.

[26] Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

Asmita Bhardwaj,Yuya Jeremy Ong,Eelaaf Zahid,Basel Shbita

Main category: cs.CL

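TL;DR: This paper proposes a reinforcement learning-based decoder sampler that dynamically adjusts sampling parameters (such as temperature and top-p) at test time to improve LLM generation quality without updating model weights; it clearly outperforms static decoding strategies on several summarization datasets.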

Details Motivation: Mainstream decoding strategies (such as greedy decoding or fixed temperature/top-p) are static and task-agnostic; they adapt poorly to domains that demand stylistic or structural flexibility, leading to inconsistent or suboptimal generation quality. Method: Models decoding as a sequential decision process and trains a lightweight reinforcement learning policy that adjusts sampling parameters at test time; training uses a composite reward with structured shaping terms (length, coverage, repetition, completeness); LLM weights stay frozen and only the sampling policy is optimized. Result: On summarization datasets including BookSum, arXiv, and WikiHow with Granite-3.3-2B and Qwen-2.5-0.5B, relative gains over baselines reach up to 88% (BookSum, Granite) and 79% (WikiHow, Qwen); ablations show that composite rewards outperform overlap-only rewards and that the structured shaping terms are essential for stable improvement. Conclusion: Reinforcement learning is a practical mechanism for test-time decoding adaptation, enabling domain-aware and user-controllable generation without retraining large models. Abstract: Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluate on summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.

[27] UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference

Lang Zhou,Shuxuan Li,Zhuohao Li,Shi Liu,Zhilin Zhao,Wei-Shi Zheng

Main category: cs.CL

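TL;DR: This paper proposes UT-ACA, a framework that adjusts the context window at inference time according to token-level uncertainty; when evidence is insufficient it rolls back, expands the context, and regenerates, substantially reducing average context usage while preserving generation quality.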

Details Motivation: In long-context inference, attention dilution and out-of-distribution degradation reduce performance; existing context selection methods use a fixed budget and cannot adapt to non-uniform token-level context demands. Method: Proposes Uncertainty-Triggered Adaptive Context Allocation (UT-ACA): it learns an uncertainty detector that combines semantic embeddings with logit-based confidence and models uncertainty accumulation across decoding steps; when uncertainty is high, it selectively rolls back, expands the context window, and regenerates the current token. Result: Experiments show that UT-ACA substantially reduces average context usage in long-context tasks while preserving generation quality. Conclusion: Dynamic, uncertainty-driven context allocation is an efficient and robust strategy for long-context inference and compares favorably with fixed-budget methods. Abstract: Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.

[28] GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms

Masayuki Kawarada,Kodai Watanabe,Soichiro Murakami

Main category: cs.CL

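TL;DR: This paper presents GAIN, a benchmark for evaluating how large language models balance norm adherence against business goals when norms are imperfect; it introduces five types of contextual pressure to analyze the factors that influence decisions, and experiments across four business domains show that under Personal Incentive pressure models diverge significantly from human decision patterns and tend to adhere to norms.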

Details Motivation: Existing benchmarks mostly focus on abstract scenarios, cover few real business applications, and reveal little about the factors that drive LLM decisions, which limits assessment of how models adapt to complex norm-goal conflicts. Method: Builds the GAIN benchmark of 1,200 scenarios across four domains (hiring, customer support, advertising, and finance); each scenario provides a goal, a situation, a norm, and explicitly designed contextual pressures of five types (Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive) for systematic evaluation of model decision-making. Result: Advanced LLMs generally mirror human decision patterns, but under Personal Incentive pressure they differ markedly, adhering to norms rather than compromising. Conclusion: GAIN provides a new tool for evaluating LLM decision-making in realistic business norm conflicts and reveals limitations of current models under specific pressure types such as Personal Incentive, with implications for the reliability of real-world deployment. Abstract: We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing benchmarks typically focus on abstract scenarios rather than real-world business applications. Furthermore, they provide limited insights into the factors influencing LLM decision-making. This restricts their ability to measure models' adaptability to complex, real-world norm-goal conflicts. In GAIN, models receive a goal, a specific situation, a norm, and additional contextual pressures. These pressures, explicitly designed to encourage potential norm deviations, are a unique feature that differentiates GAIN from other benchmarks, enabling a systematic evaluation of the factors influencing decision-making. We define five types of pressures: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising and finance. Our experiments show that advanced LLMs frequently mirror human decision-making patterns. However, when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.

[29] WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

Haonan Yu,Junhao Liu,Zhenyu Yan,Haoran Lin,Xin Zhang

Main category: cs.CL

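TL;DR: This paper proposes the WASD framework, which explains large language model behavior by identifying sufficient neural conditions for token generation, enabling behavior control that is low-cost, controllable through natural language, and semantically coherent.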

Details Motivation: Existing methods for controlling LLM behavior incur high training costs, lack natural language controllability, or compromise semantic coherence, so a better approach is needed. Method: The WASD framework represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal sufficient set that guarantees the current output under input perturbations. Result: Experiments on SST-2 and CounterFact with the Gemma-2-2B model show that WASD produces explanations that are more stable, accurate, and concise than conventional attribution graphs; a case study on controlling cross-lingual generation supports its practical effectiveness. Conclusion: WASD is an efficient, controllable, and semantics-preserving approach to explaining and controlling LLM behavior. Abstract: Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.

[30] The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices

Esteban Garces Arias,Nurzhan Sapargali,Christian Heumann,Matthias Aßenmacher

Main category: cs.CL

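TL;DR: This paper argues that standard decoding strategies for text generation (such as top-k and nucleus sampling) select tokens by statistical probability, whereas human language favors communicative appropriateness over frequency. As a result, models cannot select tokens that are contextually appropriate but statistically rare, creating a "truncation blind spot" that makes machine text more detectable. Empirically, 8-18% of human-selected tokens fall outside typical truncation boundaries; detection performance is driven mainly by truncation parameters rather than model scale or architecture; and low-detectability configurations often sacrifice coherence.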

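Details Motivation: Standard decoding strategies differ fundamentally from human language production: the former sample from high-probability regions, while the latter emphasizes contextual appropriateness; this mismatch, the "truncation blind spot", may make machine text easy to identify. Method: A large-scale empirical analysis of more than 1.8 million texts generated with 8 language models, 5 decoding strategies, and 53 hyperparameter configurations; it quantifies the share of human-selected tokens that fall outside typical truncation boundaries and evaluates detection with simple classifiers built on predictability and lexical diversity features. Result: 8-18% of human-selected tokens lie outside typical truncation boundaries; simple classifiers achieve high detection rates; truncation parameters explain most of the variance in detectability, while model scale and architecture correlate only weakly with it; low-detectability configurations often produce incoherent text. Conclusion: The detectability of machine text stems largely from likelihood-based token selection itself rather than from insufficient model capability; evading detection and producing natural text are distinct objectives. Abstract: Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.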
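
The blind spot described in this entry can be made concrete with a minimal sketch, assuming a toy next-token distribution. The function names, the four-word vocabulary, and the probabilities are all hypothetical; the paper's actual measurement pipeline is not described in the abstract.

```python
def top_k_tokens(probs, k):
    """Tokens that survive top-k truncation (the k most probable tokens)."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])


def nucleus_tokens(probs, top_p):
    """Smallest set of most-probable tokens whose cumulative mass reaches top_p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, cumulative = set(), 0.0
    for token in ranked:
        kept.add(token)
        cumulative += probs[token]
        if cumulative >= top_p:
            break
    return kept


# Hypothetical next-token distribution for a single context.
next_token_probs = {"the": 0.5, "a": 0.3, "this": 0.15, "yonder": 0.05}
human_choice = "yonder"  # assumed: a rare but contextually appropriate token

reachable_top_k = human_choice in top_k_tokens(next_token_probs, k=2)
reachable_nucleus = human_choice in nucleus_tokens(next_token_probs, top_p=0.9)
```

Here the assumed human choice is unreachable under both truncation rules, even though the model assigns it nonzero probability. Counting how often this happens over a corpus of human text would give a statistic in the spirit of the 8-18% figure reported above.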

[31] EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Minsoo Cheong,Donghyun Son,Woosang Lim,Sungjoo Yoo

Main category: cs.CL

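TL;DR: This paper proposes EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute, substantially speeding up dLLM inference.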

Details Motivation: Diffusion-based large language models (dLLMs) use bidirectional attention, which rules out lossless KV caching and forces a full forward pass at every denoising step; existing approximate KV caching methods cut this cost, but their decision overhead grows with context length or model depth. Method: EntropyCache rests on two empirical observations: (1) decoded token entropy correlates with KV cache drift and is a cheap proxy for cache staleness; (2) the feature volatility of decoded tokens persists for several steps after unmasking, which motivates recomputing the k most recently decoded tokens. The skip-or-recompute decision costs only O(V) per step, independent of context length and model scale. Result: On LLaDA-8B-Instruct and Dream-7B-Instruct, EntropyCache achieves 15.2×-26.4× speedup on standard benchmarks and 22.4×-24.1× on chain-of-thought benchmarks, with competitive accuracy and decision overhead of only 0.5% of inference time. Conclusion: EntropyCache is an efficient, lightweight, training-free KV caching method that balances inference speedup and low overhead for practical dLLM deployment. Abstract: Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
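
The skip-or-recompute signal in this entry can be sketched in a few lines, under stated assumptions. The abstract only says that the maximum entropy of newly decoded token distributions is the signal; the direction of the rule (recompute when entropy is high), the `THRESHOLD` value, and the names `token_entropy` and `should_recompute` are assumptions made for illustration and are not taken from the EntropyCache code base.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_recompute(new_token_dists, threshold):
    """Return True when the most uncertain newly decoded token exceeds the threshold."""
    max_entropy = max(token_entropy(dist) for dist in new_token_dists)
    return max_entropy > threshold


# Hypothetical distributions over a four-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
THRESHOLD = 0.5  # assumed value, not from the paper

reuse_step = should_recompute([confident], THRESHOLD)
refresh_step = should_recompute([confident, uncertain], THRESHOLD)
```

Each entropy is a single pass over the vocabulary, which is consistent with the $O(V)$ per-step cost stated in the abstract; nothing in the decision depends on context length.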

[32] When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Abhinaba Basu,Pavan Chakraborty

Main category: cs.CL

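TL;DR: This paper proposes the ICE-Guard framework, which uses intervention consistency testing to detect LLM reliance on spurious features (authority, framing, and demographics) in high-stakes decisions; authority and framing bias are found to be far larger than demographic bias, structured decomposition substantially reduces bias, and ICE-guided iterative prompt patching achieves a cumulative 78% bias reduction.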

Details Motivation: Large language models (LLMs) are increasingly used for high-stakes decisions, but their reliance on spurious features is poorly characterized, and existing work focuses heavily on demographic bias while neglecting other types. Method: Proposes the ICE-Guard framework, which applies intervention consistency testing to three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements); evaluates 11 LLMs on 3,000 vignettes across 10 high-stakes domains; introduces structured decomposition and an ICE-guided detect-diagnose-mitigate-verify loop for bias mitigation. Result: (1) Authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%); (2) bias varies widely by domain, for example 22.6% authority bias in finance versus 2.8% in criminal justice; (3) structured decomposition reduces flip rates by a median of 49% (up to 100%); (4) ICE-guided iterative prompt patching achieves a cumulative 78% bias reduction; (5) validation on real COMPAS recidivism data indicates that the synthetic benchmark gives a conservative bias estimate. Conclusion: Spurious feature reliance is multidimensional and domain-specific, so attention should not be limited to demographic bias; structured reasoning and ICE-driven iterative prompt engineering are viable ways to mitigate LLM bias in high-stakes settings. Abstract: Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
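
The flip rates quoted in this entry can be read as the share of cases whose decision changes when only a spurious attribute is swapped. A minimal sketch under that assumption follows; `flip_rate` and the example decisions are hypothetical, and the abstract does not spell out ICE-Guard's exact metric definition.

```python
def flip_rate(original_decisions, intervened_decisions):
    """Fraction of paired cases where the decision changed after the intervention."""
    if len(original_decisions) != len(intervened_decisions):
        raise ValueError("decision lists must be paired one-to-one")
    if not original_decisions:
        raise ValueError("at least one decision pair is required")
    flips = sum(a != b for a, b in zip(original_decisions, intervened_decisions))
    return flips / len(original_decisions)


# Hypothetical decisions on four vignettes, before and after a credential swap.
before = ["approve", "deny", "approve", "approve"]
after = ["approve", "deny", "deny", "approve"]

authority_flip_rate = flip_rate(before, after)  # one of four decisions changed
```

A decision procedure that ignores the swapped attribute would yield a flip rate of zero, which is the behavior the structured decomposition in the abstract aims for.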

[33] Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

Ivaxi Sheth,Zeno Jonke,Amin Mantrach,Saab Mansour

Main category: cs.CL

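TL;DR: This paper proposes a decomposition-based framework for cross-lingual LLM evaluation built around a Universal Criteria Set (UCS), enabling cross-lingual transfer of evaluation without human annotations in the target language.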

Details Motivation: Existing LLM evaluation methods are mainly English-focused and hard to adapt to other languages, chiefly because multilingual human annotations are scarce and expensive. Method: Builds a language-agnostic Universal Criteria Set (UCS) that decomposes the evaluation task into shared, interpretable intermediate dimensions, supporting cross-lingual transfer with minimal supervision. Result: In experiments on faithfulness tasks across multiple languages and model backbones, the method consistently outperforms strong baselines without target-language human annotations. Conclusion: UCS offers an efficient, interpretable, low-resource approach to cross-lingual automated evaluation. Abstract: As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.

[34] ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

Abhinaba Basu,Pavan Chakraborty

Main category: cs.CL

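TL;DR: This paper proposes the ICE framework, which evaluates the faithfulness of model explanations using multiple intervention operators and randomization tests, and finds that faithfulness depends strongly on the intervention operator and is nearly uncorrelated with human plausibility.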

Details Motivation: Existing evaluations of explanation faithfulness use a single intervention and no statistical testing, so they cannot distinguish genuine faithfulness from chance-level performance. Method: Proposes the ICE (Intervention-Consistent Explanation) framework, which compares explanations against matched random baselines via randomization tests under multiple intervention operators and reports win rates with confidence intervals. Result: Experiments on 7 LLMs, 4 English tasks, 6 non-English languages, and 2 attribution methods show that faithfulness depends strongly on the intervention operator (gaps of up to 44 percentage points); about one-third of configurations exhibit anti-faithfulness; faithfulness is nearly uncorrelated with human plausibility (|r| < 0.04); and multilingual evaluation reveals pronounced model-language interactions. Conclusion: Explanation faithfulness should not be reduced to a single score but interpreted comparatively across intervention operators; the ICE framework and the ICEBench benchmark are released. Abstract: Evaluating whether explanations faithfully reflect a model's reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.

[35] Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors

Yusuke Takase,Momose Oyama,Hidetoshi Shimodaira

Main category: cs.CL

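TL;DR: This paper represents language models by log-likelihood vectors and builds model maps to compare their conditional distributions, with distances approximating KL divergence. Experiments show that the maps reveal model attributes, task performance, and systematic shifts induced by prompt modifications; the paper also introduces PMI vectors to reduce the influence of unconditional distributions and better capture training-data differences.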

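Details Motivation: An interpretable, quantifiable method is needed to compare the conditional-distribution behavior of different language models, in particular to analyze the structural relationships behind prompt engineering and differences between models. Method: Represents each language model as a log-likelihood vector over prompt-response pairs and builds a model map in which distances between models approximate the KL divergence between their conditional distributions; additionally introduces pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions. Result: The model maps capture global structure among models (e.g., scale, architecture, training data), differences in task performance, and systematic distribution shifts induced by prompt modifications; in some cases PMI vectors are more sensitive to training-data-related differences. Conclusion: The framework offers a unified, measurable geometric view for analyzing input-dependent language model behavior and supports modeling and predicting the effects of prompt operations. Abstract: We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.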
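
As a rough illustration of the representation described above, the following Python sketch treats each model as a vector of log-likelihoods over the same prompt-response pairs. The mean squared difference used as the distance and the helper names are assumptions for illustration; the abstract does not give the paper's exact normalization. The PMI vector uses the standard identity PMI(x, y) = log p(y|x) - log p(y).

```python
def mean_sq_distance(ll_a, ll_b):
    """Mean squared difference between two log-likelihood vectors.

    Assumption: both vectors score the same prompt-response pairs in the
    same order. The scaling is illustrative, not the paper's definition.
    """
    if len(ll_a) != len(ll_b) or not ll_a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(ll_a, ll_b)) / len(ll_a)


def pmi_vector(conditional_ll, unconditional_ll):
    """Pointwise mutual information per pair: log p(y|x) - log p(y)."""
    if len(conditional_ll) != len(unconditional_ll):
        raise ValueError("vectors must have equal length")
    return [c - u for c, u in zip(conditional_ll, unconditional_ll)]


# Hypothetical log-likelihoods of three responses under two models.
model_a = [-1.0, -2.0, -3.0]
model_b = [-1.5, -2.0, -2.5]
print(mean_sq_distance(model_a, model_b))

# Hypothetical conditional and unconditional log-likelihoods for one model.
print(pmi_vector([-1.0, -2.0], [-1.5, -2.5]))
```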

[36] Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media

Thi Huyen Nguyen,Koustav Rudra,Wolfgang Nejdl

Main category: cs.CL

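TL;DR: This paper proposes an interpretable multimodal classification framework that uses cross-modal rationale transfer, extracting rationales from text and mapping them to images, to classify crisis-related text-image posts transparently and accurately, with substantial gains on the CrisisMMD dataset.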

Details Motivation: Existing crisis-information classification methods lack interpretability, and in particular lack effective rationale extraction for images, which limits real-world deployment; a method is needed that balances accuracy and transparency while reducing annotation cost. Method: Learns a joint text-image representation with a vision-language transformer, first extracts text rationales, then derives image rationales through a cross-modal mapping (cross-modal rationale transfer), and finally classifies based on the rationales from both modalities. Result: Macro-F1 on CrisisMMD improves by 2-35%; human evaluation shows a 12% improvement in the quality of image rationale patches; zero-shot transfer to new datasets reaches 80% accuracy. Conclusion: The interpretable-by-design framework delivers transparent, high-performing multimodal crisis-information classification, with strong generalization and low annotation dependence. Abstract: Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damage, persons missing or stranded in the affected zone, etc. Existing methods have attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted on the CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes.
Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.

[37] DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

Maxime Poli,Manel Khentout,Angelo Ortiz Tandazo,Ewan Dunbar,Emmanuel Chemla,Emmanuel Dupoux

Main category: cs.CL

TL;DR: DiscoPhon is a multilingual benchmark for unsupervised phoneme discovery from discrete speech units, covering 12 languages and evaluating unit quality, recognition, and segmentation.

Details Motivation: To evaluate unsupervised phoneme discovery across diverse languages with limited data (10 hours per language), addressing the need for standardized multilingual benchmarks in speech representation learning. Method: Constructing DiscoPhon, a benchmark with 6 dev and 6 test languages; using pretrained multilingual HuBERT and SpidR models as baselines; evaluating discrete units via mapping to phoneme inventories under many-to-one or one-to-one assignments. Result: Current multilingual models contain sufficient phonemic information for derived discrete units to correlate well with phonemes, though correlation strength varies across languages. Conclusion: DiscoPhon provides a robust evaluation framework showing that unsupervised phoneme discovery is feasible with modern self-supervised speech models, albeit with language-dependent performance. Abstract: We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.

[38] Learning to Self-Evolve

Xiaoyin Chen,Canwen Xu,Yite Wang,Boyi Liu,Zhewei Yao,Yuxiong He

Main category: cs.CL

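TL;DR: This paper proposes Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models to improve their own contexts at test time, substantially improving Text-to-SQL and question-answering performance and transferring across models.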

Details Motivation: Existing approaches rely on the model's inherent reasoning ability for test-time self-evolution and never explicitly train for it; this paper aims to treat self-evolution as a learnable skill. Method: Proposes the LSE framework, which reduces multi-step context evolution to a single-step reinforcement learning objective (each edit is rewarded by the improvement in downstream performance) and pairs it with a tree-guided evolution loop. Result: On BIRD and MMLU-Redux, a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods such as GEPA and TextGrad, and can guide other models without additional training. Conclusion: Explicitly modeling and training test-time self-evolution as a learnable skill is effective and generalizes well. Abstract: We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

[39] A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

Aram Abrahamyan,Sachin Kumar

Main category: cs.CL

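TL;DR: This paper presents a comparative empirical study of how well continual learning (CL) methods mitigate catastrophic forgetting in continual intent classification. Using CLINC150 to build a 10-task label-disjoint scenario, it evaluates three backbone architectures (ANN, GRU, and Transformer) with several CL strategies (MIR, LwF, HAT) and their combinations. The results show that replay (MIR) is a key ingredient, that the best CL configuration depends on the backbone, and that some CL combinations even outperform joint training.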

Details Motivation: Neural language models deployed in practice must continually adapt to new tasks and domains without forgetting earlier knowledge, but catastrophic forgetting remains a key challenge, and systematic comparisons are lacking for tasks such as intent classification. Method: Builds a 10-task label-disjoint continual learning scenario from CLINC150; evaluates three backbones (ANN, GRU, Transformer); applies representative CL methods (replay-based MIR, regularization-based LwF, parameter-isolation HAT) individually and in all pairwise and triple combinations; and assesses the stability-plasticity trade-off with average accuracy, macro F1, and backward transfer. Result: Naive sequential fine-tuning forgets severely on all architectures; no single CL method fully prevents forgetting; combinations that include MIR (MIR+HAT, MIR+LwF, MIR+LwF+HAT) are the most robust, with near-zero or mildly positive backward transfer; the best combination depends on the architecture, with MIR+HAT best for ANN and Transformer and MIR+LwF+HAT best for GRU; some CL combinations even surpass joint training. Conclusion: Replay (MIR) is the core ingredient for mitigating forgetting; continual intent-classification systems must be designed by jointly choosing the backbone and the CL mechanism rather than optimizing either in isolation; CL methods themselves may have a beneficial regularization effect. Abstract: Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent.
MIR+HAT yields the best result for ANN and Transformer, whereas MIR+LwF+HAT works best for GRU. In several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
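
For readers unfamiliar with the metrics named above, here is a minimal Python sketch of average accuracy and backward transfer under the commonly used definitions, where `R[t][i]` is the accuracy on task `i` after training on task `t`. Whether the paper uses exactly these formulas is an assumption, and the accuracy matrix below is made up for illustration.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after training on the final task."""
    T = len(R)
    return sum(R[T - 1][i] for i in range(T)) / T


def backward_transfer(R):
    """Mean change in accuracy on earlier tasks after learning later ones.

    Negative values indicate forgetting; values near zero or above
    indicate that earlier tasks were retained or even improved.
    """
    T = len(R)
    if T < 2:
        raise ValueError("backward transfer needs at least two tasks")
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


# Hypothetical accuracy matrix for a 3-task sequence (rows: after task t).
R = [
    [0.9, 0.0, 0.0],
    [0.7, 0.8, 0.0],
    [0.6, 0.7, 0.9],
]
print(average_accuracy(R))
print(backward_transfer(R))
```

In this made-up example backward transfer is negative, which is the pattern the abstract reports for naive sequential fine-tuning.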

[40] Automatic detection of Gen-AI texts: A comparative framework of neural models

Cristian Buttaro,Irene Amerini

Main category: cs.CL

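TL;DR: This paper designs and evaluates four machine-learning-based detectors of AI-generated text (MLP, 1D-CNN, MobileNet-CNN, Transformer) on multilingual (English/Italian) and topic-specific (Art and Mental Health) datasets and compares them with several widely used online detectors; the supervised detectors prove more stable and more robust across domains than the commercial tools.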

Details Motivation: The rapid spread of large language models makes it increasingly hard to distinguish human-written from AI-generated text, raising critical issues in academic, editorial, and social domains and calling for reliable, robust detection methods. Method: Builds and compares four neural architectures (MLP, 1D-CNN, MobileNet-CNN, Transformer) as supervised detectors, trained and tested on the COLING Multilingual Dataset (English/Italian) and an original thematic dataset on Art and Mental Health, and benchmarks them against eight widely used online detectors, including ZeroGPT and GPTZero. Result: The supervised detectors perform more stably and robustly than commercial tools across languages and domains; the experiments highlight key strengths and limitations of current detection strategies. Conclusion: Custom supervised detectors outperform existing black-box commercial tools, especially in multilingual and specialized-domain settings; future work needs to address generalization, interpretability, and adversarial robustness. Abstract: The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI-generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI-generated text detection through the design, implementation, and comparative evaluation of multiple machine-learning-based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, Originality.AI, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.

[41] Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Rudra Jadhav,Janhavi Danve,Sonalika Shaw

Main category: cs.CL

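TL;DR: This study finds that large language models (LLMs) show significant implicit grading bias on essay tasks: even when the content is correct, responses are systematically scored lower purely because of writing-style differences such as grammar errors, informal language, or non-native phrasing, whereas math and programming tasks show almost no such bias. The bias persists even when the prompt explicitly instructs the model to grade content only and disregard style.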

Details Motivation: As LLMs are used as automated graders, their fairness and potential bias have become critical concerns; this paper tests whether LLMs exhibit implicit grading bias due to writing style (e.g., grammar, register, native-language background) when content correctness is held constant. Method: Builds a controlled dataset of 180 student responses across three subjects (mathematics, programming, essay writing), each with three surface-level perturbations (grammar errors, informal language, non-native phrasing); prompts two open-source LLMs, LLaMA 3.3 70B and Qwen 2.5 72B, to grade on a 1-10 scale under strict instructions to score content correctness only; and analyzes bias strength with statistical tests (p-values) and Cohen's d effect sizes. Result: On essay tasks, both models show significant grading bias for all perturbation types (p < 0.05), with effect sizes from medium to very large (d = 0.64-4.25); informal language is penalized most (1.90 points on average for LLaMA, 1.20 for Qwen), followed by non-native phrasing; on math and programming tasks the bias is minimal and mostly not significant. Conclusion: LLM grading bias is subject-dependent and style-sensitive, and cannot be removed by simple prompt instructions; systematic bias audits and corresponding governance guidelines are needed before deployment in educational settings. Abstract: As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance.
These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
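
The effect sizes quoted above are Cohen's d values. A minimal Python sketch of the usual pooled-standard-deviation form follows; the grade lists are invented for illustration, and whether the study used this exact variant of d is an assumption.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical grades on a 1-10 scale: original essays vs. informal rewrites.
original_grades = [8, 9, 7, 8]
informal_grades = [6, 7, 5, 6]
print(cohens_d(original_grades, informal_grades))
```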

[42] Mi:dm K 2.5 Pro

KT Tech innovation Group

Main category: cs.CL

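TL;DR: Mi:dm K 2.5 Pro is a 32B-parameter flagship Korean LLM aimed at complex enterprise tasks. Using AST analysis, gap-filling synthesis, Depth Upscaling, multi-stage post-training, and Fusion Training, it advances long-context understanding, multi-step reasoning, tool use, and Korean cultural understanding, and safety evaluations support reliable deployment.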

Details Motivation: Existing LLMs struggle to meet emerging demands for multi-step reasoning, long-context understanding, and agentic workflows in Korean-language and domain-specific enterprise scenarios, where scaling alone is insufficient. Method: Builds a data curation pipeline based on AST analysis (code), gap-filling synthesis (mathematics), and an LLM-based quality evaluator; pre-trains with Depth Upscaling (DuS) and a progressive strategy supporting a 128K-token context; post-trains with Reasoning SFT, model merging, and asynchronous reinforcement learning; and finally applies Fusion Training to balance reasoning ability with conversational fluency, consistent response styling, and reliable tool use. Result: Sets state-of-the-art results on Korean-specific benchmarks, showing deep linguistic and cultural understanding; performs competitively against leading global and domestic models; and passes Responsible AI evaluations, balancing safety (resistance to attacks) with responsiveness. Conclusion: Mi:dm K 2.5 Pro shows that a reasoning-first optimization approach for complex enterprise tasks is effective for Korean LLMs and offers a reusable technical path for building highly reliable domain-specific models. Abstract: The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding.
Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.

[43] Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

Maria Milkova,Maksim Rudnev

Main category: cs.CL

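TL;DR: This study proposes a multi-stage classification framework for detecting human values in noisy Russian-language social media text. Grounded in Schwartz's theory, it combines LLM annotation, soft-label aggregation, and transformer models (XLM-RoBERTa large); validated on a random sample of 7.5 million posts, it reaches an F1 macro of 0.83 and reveals patterns and cultural specificity of value expression in Russian social networks.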

Details Motivation: To automatically identify Schwartz's ten basic human values in noisy, unstructured Russian-language social media text, addressing the limited handling of annotation subjectivity, cultural context, and linguistic noise in existing methods. Method: Builds a multi-stage pipeline: spam and non-personal content filtering, selection of value-relevant and politically relevant posts, multiple rounds of LLM (e.g., GPT) annotation, soft labels derived from expert verification and LLM agreement, and multi-label transformer models (XLM-RoBERTa large and others) trained on the soft labels; expert annotations are treated as an interpretative benchmark with its own uncertainty rather than as ground truth. Result: XLM-RoBERTa large achieves an F1 macro of 0.83 and an F1 of 0.71 on the test set; the model systematically overestimates the Openness to Change domain; the study reveals distinct patterns of value co-occurrence and expression in Russian social networks; all models are released publicly. Conclusion: Framing value detection as a multi-perspective interpretive task is more realistic, since expert labels, LLM annotations, and model predictions are different plausible readings of the same text. The framework accounts for annotation subjectivity and cultural sensitivity and offers a reproducible, scalable methodology for value analysis in cross-lingual digital environments. Abstract: This study presents a multi-stage classification framework for detecting human values in noisy Russian-language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz's theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value-relevant and politically relevant posts, LLM-based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM-generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer-based models capable of predicting the probability of each of the ten basic values. The best-performing model, XLM-RoBERTa large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held-out test data. By treating value detection as a multi-perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain.
Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value-based interpretation in digital environments. All models are released publicly.
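
One simple way to turn several LLM judgments into soft labels is to use the share of annotation runs that assigned each value. The Python sketch below assumes exactly that aggregation rule; the abstract only says that labels reflect varying levels of agreement, so the rule and the helper name `soft_labels` are assumptions. The ten value names follow Schwartz's theory.

```python
SCHWARTZ_VALUES = [
    "Self-Direction", "Stimulation", "Hedonism", "Achievement", "Power",
    "Security", "Conformity", "Tradition", "Benevolence", "Universalism",
]


def soft_labels(annotation_runs, values=SCHWARTZ_VALUES):
    """Share of annotation runs that assigned each value to a post.

    annotation_runs: list of sets, one set of value names per LLM run.
    Returns a dict mapping each value to an agreement score in [0, 1].
    """
    if not annotation_runs:
        raise ValueError("at least one annotation run is required")
    n_runs = len(annotation_runs)
    return {
        value: sum(value in run for run in annotation_runs) / n_runs
        for value in values
    }


# Hypothetical output of four annotation runs for a single post.
runs = [
    {"Security", "Tradition"},
    {"Security"},
    {"Security", "Power"},
    {"Tradition"},
]
labels = soft_labels(runs)
print(labels["Security"], labels["Tradition"], labels["Power"])
```

Scores like these can serve as per-value targets for a multi-label classifier, which matches the abstract's description of models that predict a probability for each of the ten values.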

[44] Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders

Yana Veitsman,Yihong Liu,Hinrich Schütze

Main category: cs.CL

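TL;DR: This paper shows that better cross-lingual alignment does not consistently translate into better downstream performance: alignment and task objectives are largely orthogonal, and benefits vary widely across languages and task types. Representational analyses confirm that embedding distances are unreliable predictors and that alignment and task gradients are close to orthogonal; the paper therefore recommends careful loss selection when combining alignment with fine-tuning.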

Details Motivation: Prior work assumes that better cross-lingual alignment yields better cross-lingual transfer, but in practice explicit alignment methods raise embedding similarity yet often fail to improve token-level downstream performance, and the reason has been unclear. Method: Analyzes four XLM-R encoder models aligned on different language pairs and fine-tuned for POS tagging or sentence classification, using representational analyses that include embedding distances and the similarity and magnitude of gradients for the task and alignment losses. Result: (1) Embedding distances do not reliably predict gains or losses in task performance; (2) gradients of the alignment and task losses are often close to orthogonal, indicating that optimizing one objective contributes little to the other. Conclusion: "Better" alignment does not necessarily bring "better" cross-lingual transfer, because its objective is orthogonal to the downstream objective and its benefits vary by language and task; the loss function for joint training should be chosen carefully for the task at hand. Abstract: Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques -- despite increasing embedding similarity -- frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS Tagging or Sentence Classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why "better" alignment often fails to translate into "better" cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
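
Gradient similarity of the kind described above is commonly measured with cosine similarity between flattened gradient vectors. The following Python sketch assumes that measure and uses toy vectors; how the paper extracts and compares gradients from XLM-R is not specified here.

```python
import math


def cosine_similarity(grad_a, grad_b):
    """Cosine of the angle between two flattened gradient vectors.

    Values near 0 mean the gradients are close to orthogonal, so a step
    that reduces one loss does little for the other.
    """
    if len(grad_a) != len(grad_b):
        raise ValueError("gradients must have the same length")
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero gradient")
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    return dot / (norm_a * norm_b)


# Toy gradients: an alignment-loss gradient and two task-loss gradients.
alignment_grad = [1.0, 0.0]
orthogonal_task_grad = [0.0, 1.0]
parallel_task_grad = [2.0, 0.0]
print(cosine_similarity(alignment_grad, orthogonal_task_grad))
print(cosine_similarity(alignment_grad, parallel_task_grad))
```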

[45] Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo

Carlos Rafael Catalan,Patricia Nicole Monderin,Lheane Marie Dizon,Gap Estrella,Raymund John Sarmimento,Marie Antoinette Patalagsa

Main category: cs.CL

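TL;DR: This paper examines how well current language learning applications such as Duolingo support professional contexts. A survey of employees at a multinational company in the Philippines finds that general scenarios build a solid foundation, but the lack of domain-specific content hinders professional-level fluency; the authors propose a hybrid strategy combining personalized, domain-specific lessons with general foundational lessons.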

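Details Motivation: Existing language learning apps focus on everyday general scenarios and offer little support for profession-specific contexts, making it hard for users to reach professional-level fluency, i.e., the ability to communicate work-related and domain-specific information comfortably. Method: Surveys five employees of a multinational company in the Philippines, examining how often they encounter general versus work-related scenarios in Duolingo, how effective they perceive them to be, and what lesson content they suggest, then analyzes the responses in aggregate. Result: Respondents report that general scenarios (e.g., greetings, ordering food) appear more often, are relatable, and help build foundational grammar, vocabulary, and cultural knowledge; work-related scenarios are rarer but important for professional fluency because they contain domain-specific vocabulary; the workplace scenarios suggested by participants differ markedly, underscoring the need for personalization. Conclusion: Language learning applications should adopt a hybrid lesson-generation strategy: keep general, relatable foundational scenarios to support language acquisition, and generate personalized, domain-specific lessons based on the learner's professional background, bridging the gap between general learning and professional fluency. Abstract: Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to comfortably communicate various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter helps bridge the gap toward professional fluency as it contains domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in context when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual's needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.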

[46] A Human-in/on-the-Loop Framework for Accessible Text Generation

Lourdes Moreno,Paloma Martínez

Main category: cs.CL

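TL;DR: This paper proposes a human-AI collaborative framework for accessible text generation that embeds human participation (HiTL/HoTL) into LLM-based simplification and uses standardized checklists, trigger rules, and quantifiable accessibility KPIs to improve explainability, traceability, and ethical accountability.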

Details Motivation: Existing automatic text simplification relies too heavily on automated metrics, overlooking actual user comprehension and normative standards, and lacks the human-centered evaluation that cognitive accessibility requires. Method: Builds a hybrid framework in which HiTL contributions guide adjustments during generation and HoTL supervision provides systematic post-generation review; based on empirical evidence, it defines standards-aligned checklists, Event-Condition-Action (ECA) trigger rules, and accessibility KPIs, and structures human feedback for iterative model improvement. Result: Establishes a traceable, reproducible, and auditable process for generating and evaluating accessible text, showing that human-centered mechanisms can be encoded as evaluation components and reused to support model adaptation. Conclusion: Embedding the human role in both generation and supervision improves text accessibility and makes explainability and ethical accountability core design principles of NLP systems, promoting more transparent and inclusive practice. Abstract: Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated and metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.

[47] Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Vedant Pandya

Main category: cs.CL

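TL;DR: This paper proposes XKD-Dial, a training pipeline for English-Hindi knowledge-grounded dialogue generation with explicit citations and explainability; its four-stage progressive training improves factual accuracy and cross-lingual ability, and the paper systematically analyzes how citation behavior is learned.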

Details Motivation: Most existing knowledge-grounded dialogue systems are English-only, lack verifiable citation mechanisms, offer little transparency into decision-making, and provide weak bilingual support. Method: Proposes a four-stage training pipeline (multilingual adaptation, English dialogue SFT with citation grounding, bilingual dialogue SFT, and GRPO alignment with citation-aware rewards), combined with three post-hoc explainability analyses (cross-attention alignment, Integrated Gradients attribution, occlusion-based causal grounding), and evaluates systematically across multiple model architectures. Result: Citation-grounded SFT reduces the hallucination rate of encoder-decoder models to 0.0%; progressive training prevents catastrophic forgetting while improving Hindi capabilities; smaller models match larger models on English after SFT; GRPO brings only marginal gains on structured citation tasks. Conclusion: Explicit citation modeling and progressive multi-stage training are key to improving the factuality, explainability, and generalization of bilingual knowledge-grounded dialogue systems; the explainability analyses reveal how citation behavior is learned, not only whether it is learned. Abstract: Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).

[48] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Xinghao Zhao

Main category: cs.CL

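TL;DR: This paper proposes entropy-trajectory monotonicity as a low-cost predictor of answer correctness in chain-of-thought (CoT) reasoning: a chain is monotone if the entropy of the answer distribution decreases at every reasoning step. Experiments show that monotone chains are significantly more likely to be correct, and that this shape feature is more discriminative than magnitude measures such as total entropy reduction.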

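Details Motivation: Although CoT improves LLM accuracy, cheap and reliable failure detection is still missing; scalar confidence measures (such as token log-probability) become worse calibrated as reasoning gets deeper, and aggregate uncertainty measures (such as total entropy reduction) lack predictive power. Method: Defines entropy-trajectory monotonicity: at each CoT step, sample a few answer completions and compute the entropy of the answer distribution; a chain is monotone if the entropy decreases at every step. The paper evaluates on GSM8K with Qwen2.5-7B-Instruct and replicates on Mistral-7B, comparing the accuracy of monotone and non-monotone chains against baselines such as token-level confidence and total entropy reduction. Result: With Qwen2.5-7B-Instruct, monotone chains reach 68.8% accuracy versus 46.8% for non-monotone chains (p=0.0005); total entropy reduction is not predictive of accuracy (ρ=-0.06); monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at about 1/8 the cost of 40-chain self-consistency; results replicate on Mistral-7B (72.3% vs. 37.6%). Conclusion: Structural properties of uncertainty trajectories (such as monotonicity) reveal more about chain reliability than aggregate statistics (such as total entropy change); the finding supports using dynamic uncertainty patterns for reasoning verification, with low overhead and high discriminative power. Abstract: Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps--captured by sampling a few answer completions per step--predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive ($ρ$=-0.06, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186->0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approx 1,500 tokens/question--1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.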
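
The monotonicity check described above can be sketched in a few lines of Python. The sketch assumes that sampled answers are already available as strings for each reasoning step and that any step where entropy fails to drop counts as a violation; the function names and the sample data are hypothetical, and the sampling itself (calling an LLM) is out of scope.

```python
import math
from collections import Counter


def answer_entropy(sampled_answers):
    """Shannon entropy (bits) of the empirical answer distribution."""
    if not sampled_answers:
        raise ValueError("at least one sampled answer is required")
    n = len(sampled_answers)
    counts = Counter(sampled_answers)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def violation_count(entropies):
    """Number of steps where entropy does not decrease (assumed rule)."""
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur >= prev)


def is_monotone(entropies):
    """True if entropy decreases at every step of the chain."""
    return violation_count(entropies) == 0


# Hypothetical answers sampled after each of three reasoning steps.
samples_per_step = [
    ["4", "5", "6", "4"],
    ["4", "4", "5", "4"],
    ["4", "4", "4", "4"],
]
trajectory = [answer_entropy(step) for step in samples_per_step]
print(trajectory, is_monotone(trajectory))
```

A chain flagged as non-monotone could be routed to a more expensive check such as self-consistency, which is one way to read the cost comparison in the abstract.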

[49] RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

Weronika Łajewska,Paul Missault,George Davidson,Saab Mansour

Main category: cs.CL

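TL;DR: This paper proposes RADIUS, an evaluation suite that measures both ranking alignment and distribution alignment of LLM-generated responses in survey simulation and adds statistical significance testing, addressing gaps in existing metrics.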

Details Motivation: Existing evaluation metrics for survey simulation are fragmented and non-standardized, and they overlook ranking alignment, which makes them a weak basis for decision-making applications. Method: Proposes RADIUS, an evaluation suite with two dimensions, ranking alignment and distribution alignment, each complemented by statistical significance testing. Result: RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and comes with an open-source implementation for reproducible and comparable assessment. Conclusion: RADIUS offers a standardized, multi-dimensional, statistically rigorous evaluation approach for LLM-based survey simulation. Abstract: Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.
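
The gap between distribution alignment and ranking alignment can be seen in a small Python example. Total variation distance and a top-choice match are used here purely as stand-ins chosen for illustration; they are not claimed to be the metrics implemented in RADIUS, and the response shares are invented.

```python
def total_variation(p, q):
    """Total variation distance between two response distributions."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)


def top_choice_matches(p, q):
    """True if both distributions rank the same option first."""
    return max(p, key=p.get) == max(q, key=q.get)


# Hypothetical answer shares for one survey question.
human = {"A": 0.40, "B": 0.38, "C": 0.22}
simulated = {"A": 0.37, "B": 0.41, "C": 0.22}

print(total_variation(human, simulated))     # small distributional gap
print(top_choice_matches(human, simulated))  # yet the top option differs
```

The two distributions are close, but the simulation names a different most-preferred option, which is the failure case the abstract highlights for decision-making applications.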

[50] Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Hangeol Chang,Changsun Lee,Seungjoon Rho,Junho Yeo,Jong Chul Ye

Main category: cs.CL

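TL;DR: This paper proposes Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that rewrites queries around a hypothesis, shifting RAG from topic-oriented to evidence-oriented retrieval and improving accuracy on multi-option decision tasks such as medical question answering.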

Details Motivation: Existing RAG methods rely on a single initial query, which tends to favor topical relevance over decision-relevant evidence; the retrieved background information then fails to discriminate among candidate answers and degrades the final decision. Method: HCQR first derives a lightweight working hypothesis from the question and candidate options, then turns it into three targeted retrieval queries that (1) support the hypothesis, (2) distinguish it from competing options, and (3) verify salient clues in the question. Result: On MedQA and MMLU-Med, HCQR improves average accuracy over Simple RAG by 5.9 and 3.6 points, respectively, and consistently outperforms single-query RAG and re-rank/filter baselines. Conclusion: HCQR is an efficient, training-free RAG enhancement that significantly improves performance on complex reasoning tasks requiring discrimination among options, and it is particularly suited to question answering in specialized domains such as medicine. Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.
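The three-query structure can be made concrete with a small sketch. Everything here is an assumption for illustration: in HCQR the hypothesis and the rewritten queries would come from an LLM, whereas this example uses fixed string templates, a hypothetical `rewrite_queries` helper, and an invented medical question.

```python
def rewrite_queries(question, options, hypothesis, clues):
    """Build three hypothesis-conditioned retrieval queries (template-based stand-in)."""
    if hypothesis not in options:
        raise ValueError("hypothesis must be one of the candidate options")
    competitors = [o for o in options if o != hypothesis]
    return {
        # (1) seek evidence that supports the working hypothesis
        "support": f"{hypothesis}: evidence consistent with {question}",
        # (2) seek evidence that separates the hypothesis from the other options
        "distinguish": f"{hypothesis} versus {', '.join(competitors)}: distinguishing findings",
        # (3) seek evidence that checks the salient clues in the question
        "verify": f"what do these clues indicate: {'; '.join(clues)}",
    }


queries = rewrite_queries(
    question="fatigue, weight gain, and cold intolerance in an adult",
    options=["hypothyroidism", "iron deficiency anemia", "depression"],
    hypothesis="hypothyroidism",
    clues=["cold intolerance", "weight gain"],
)

for purpose, query in queries.items():
    print(f"{purpose}: {query}")
```

Each query would then be sent to the retriever separately, so that the generator sees supporting, discriminating, and clue-checking evidence rather than generic background on the topic.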

[51] What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Gagan Bhatia,Ahmad Muhammad Isa,Maxime Peyrard,Wei Zhao

Main category: cs.CL

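TL;DR: This paper presents MultiTempBench, a multilingual temporal reasoning benchmark covering three tasks, five languages, and three calendar conventions; an evaluation of 20 large language models shows that tokenisation quality is a resource-dependent bottleneck, and the paper introduces the multilingual Date Fragmentation Ratio (mDFR) along with geometric-probing analyses.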

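Details Motivation: Existing temporal reasoning benchmarks lack multilingual and multi-calendar coverage, making it hard to assess how well large language models understand time across languages and calendars. Method: The authors build MultiTempBench (15,000 examples) with multilingual, multi-calendar temporal reasoning tasks; evaluate 20 LLMs; propose the mDFR metric, calibrated with human severity ratings; analyze internal temporal representations with geometric probing; and use crossed mixed-effects regression to analyze contributing factors. Result: Tokenisation quality is a key bottleneck: in low-resource languages and rarer calendar formats, date fragmentation breaks Year/Month/Day separation and accuracy collapses, whereas models in high-resource settings are more robust to digit-level splitting; temporal linearity is the strongest predictor in high-resource languages, while fragmentation is the stronger predictor in low-resource languages. Conclusion: Temporal reasoning performance depends heavily on language resources and tokeniser fit, and low-resource languages need targeted improvements in temporal-expression modeling and tokenisation strategy. Abstract: We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb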

[52] MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Chenyang Gu,Jiahao Cheng,Meicong Zhang,Pujun Zheng,Jinquan Zheng,Guoxiu He

Main category: cs.CL

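TL;DR: This paper proposes MoRI, a framework that uses motivation-grounded reasoning to improve the technical depth and scientific rigor of ideas generated by large language models, significantly outperforming commercial LLMs and agentic baselines.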

Details Motivation: Existing LLM-based agentic approaches emulate human research workflows but model scientific reasoning inadequately, so the ideas they generate are superficial and lack technical depth and scientific grounding. Method: MoRI first uses supervised fine-tuning to teach the base LLM to generate a research motivation from a given context, then trains it with a composite reinforcement learning reward comprising entropy-aware information gain (which encourages uncovering high-complexity technical details) and contrastive semantic gain (which constrains the reasoning trajectory to stay scientifically valid). Result: MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. Conclusion: By explicitly modeling the reasoning process from research motivation to methodology, MoRI improves LLM performance on scientific ideation and offers a new approach for AI in science. Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub: https://github.com/ECNU-Text-Computing/IdeaGeneration

[53] Parallelograms Strike Back: LLMs Generate Better Analogies than People

Qiawen Ella Liu,Raja Marjieh,Jian-Qiao Zhu,Adele E. Goldberg,Thomas L. Griffiths

Main category: cs.CL

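TL;DR: This paper compares humans and large language models (LLMs) on four-term analogy tasks (A:B::C:D) and finds that LLM-generated analogies fit the parallelogram geometric model more closely and are of higher quality; the weaker human performance stems mainly from a large number of low-quality responses driven by high-frequency words, not from a failure of the parallelogram model itself.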

Details Motivation: To determine why the "parallelogram" geometric model fails for four-term analogies: is the model itself unsound, or do humans struggle to produce analogies that reliably satisfy the relation? Method: On the same analogy problems as Peterson et al. (2020), the study compares human and LLM completions and analyzes the sources of difference through human ratings, parallelogram alignment (vector geometry in a GloVe embedding space), word frequency, and modal responses. Result: LLM analogies receive higher ratings overall, align more closely with the parallelogram structure, and rely less on high-frequency, easily accessible words; the advantage comes mainly from the long tail of weak human responses and disappears when only modal responses are compared; even so, parallelogram alignment and lower word frequency still predict which LLM responses are rated higher. Conclusion: The parallelogram model is not a poor model of analogical relations; weak human performance reflects inconsistent generation, whereas LLMs satisfy the relational constraint more consistently. Abstract: Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from Peterson et al. (2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
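The parallelogram model says that for A:B::C:D the offset from A to B should match the offset from C to D, so the ideal completion sits near C + (B - A). The sketch below illustrates that idea with invented two-dimensional vectors; the paper works in GloVe space, and the exact alignment measure it uses is not given in this digest, so the cosine-of-offsets score and the nearest-to-target selection here are assumptions.

```python
import math


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def parallelogram_alignment(a, b, c, d):
    """Cosine between the A->B and C->D offsets; 1.0 means they point the same way."""
    return cosine(sub(b, a), sub(d, c))


def best_completion(a, b, c, candidates):
    """Pick the candidate word whose vector is closest to the target C + (B - A)."""
    target = add(c, sub(b, a))
    return min(candidates, key=lambda word: math.dist(candidates[word], target))


# Toy vectors, invented for illustration only.
man, woman, king = [1.0, 0.0], [1.0, 1.0], [3.0, 0.0]
candidates = {"queen": [3.0, 1.0], "prince": [2.5, 0.2], "throne": [3.5, -0.5]}

print(best_completion(man, woman, king, candidates))
print(round(parallelogram_alignment(man, woman, king, candidates["queen"]), 3))
```

A completion can score well on this alignment while being an uncommon word, which is consistent with the abstract's finding that higher-rated completions combine greater parallelogram alignment with lower word frequency.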

[54] A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Madeline Bittner,Dina Demner-Fushman,Yasmeen Shabazz,Davis Bartels,Dukyong Yoon,Brad Quitadamo,Rajiv Menghrajani,Leo Celi,Sarvesh Soni

Main category: cs.CL

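TL;DR: This paper introduces HEALIX, the first publicly available health literacy dataset annotated from real clinical notes, and uses it to benchmark zero-shot and few-shot prompting strategies on four open-source large language models.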

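Details Motivation: Current health literacy screening tools vary widely in feasibility, number of items, question format, and dimensions covered, which makes consistent documentation in structured electronic health records difficult; automated detection from unstructured clinical notes is promising but has been held back by the lack of annotated resources. Method: The authors build HEALIX by combining social worker note sampling, keyword-based filtering, and LLM-based active learning, collecting and annotating 589 real clinical notes across 9 note types with three health literacy labels (low, normal, high); they then evaluate zero-shot and few-shot prompting strategies on four open-source LLMs. Result: The HEALIX dataset was built and released; according to this summary, few-shot prompting outperforms zero-shot prompting in most cases and performance varies across models, supporting the dataset's usefulness for research on automated health literacy identification. Conclusion: HEALIX fills the gap in annotated health literacy data from clinical notes and provides a benchmark resource for NLP-based automated health literacy assessment. Abstract: Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open source large language models (LLMs).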

[55] DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

Yilin Wang,Yuchun Fan,Jiaoyang Li,Ziming Zhu,Yongyu Mu,Qiaozhi He,Tong Xiao,Jingbo Zhu

Main category: cs.CL

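TL;DR: This paper proposes the DaPT framework, which builds multilingual multi-hop QA benchmarks and uses a bilingual retrieval-and-answer strategy, significantly improving RAG performance in multilingual settings.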

Details Motivation: No benchmarks exist for evaluating RAG systems on multilingual multi-hop question answering (MM-hop QA), and existing systems over-rely on LLMs' semantic understanding in English, which reduces performance in multilingual settings. Method: The authors first construct multilingual multi-hop QA benchmarks in five languages; they then propose DaPT, which generates sub-question graphs in parallel for the source-language query and its English translation, merges them, and solves the sub-questions sequentially with a bilingual retrieval-and-answer strategy. Result: Experiments show that advanced RAG systems suffer a significant performance imbalance in multilingual settings; on the MuSiQue benchmark, DaPT achieves an 18.3% relative improvement in average EM score over the strongest baseline. Conclusion: DaPT eases the performance bottleneck in multilingual multi-hop QA, yields more accurate and concise answers, and suggests a new direction for multilingual RAG systems. Abstract: Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.

[56] UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Zikang Ding,Junchi Yao,Junhao Li,Yi Zhang,Wenbo Jiang,Hongbo Liu,Lijie Hu

Main category: cs.CL

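TL;DR: This paper proposes UGID, a framework that models the Transformer as a computational graph and jointly constrains attention routing and hidden states at the level of internal representations, debiasing large language models while preserving their capabilities.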

Details Motivation: Large language models exhibit pronounced social biases; debiasing at the output level or through data optimization cannot fully resolve them, because the biases are embedded in the models' internal representations. Method: The paper proposes the Unified Graph Isomorphism for Debiasing framework (UGID), which models the Transformer as a structured computational graph (attention mechanisms as edges, hidden states as nodes) and enforces invariance of the graph structure across counterfactual inputs, allowing differences only on sensitive attributes, thereby jointly constraining attention routing and hidden states; it adds a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Result: Experiments on multiple large language models show that UGID reduces bias in both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility. Conclusion: UGID is an effective internal-representation-level debiasing framework that balances debiasing with capability retention and offers a new direction for LLM fairness research. Abstract: Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization-based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation-level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.

[57] Optimal Splitting of Language Models from Mixtures to Specialized Domains

Skyler Seto,Pierre Ablin,Anastasiia Filippova,Jiayuan Ye,Louis Bethune,Angelos Katharopoulos,David Grangier

Main category: cs.CL

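TL;DR: This paper proposes a scaling-law-based method for optimizing the compute allocation between pretraining and continued pretraining of multiple models, improving multi-domain language models on common sense knowledge and reasoning tasks.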

Details Motivation: In the multi-domain setting, the standard two-stage recipe (pretraining followed by specialization) requires training a separate model for each domain (split model training), which is inefficient and leaves compute allocation without theoretical guidance. Method: The paper uses scaling laws to predict model loss, which supports pretraining multiple models independently on a general pretraining corpus and optimizing the compute allocation between the pretraining and continued pretraining (specialization) stages, that is, the token counts D and D'; the predictions extrapolate to larger models and more data. Result: On common sense knowledge and reasoning benchmarks, the method improves performance consistently across model sizes and compute budgets. Conclusion: A scaling-law-based compute allocation strategy outperforms conventional per-domain training and offers an efficient, scalable approach to training multi-domain language models. Abstract: Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.

[58] VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Chonghan Liu,Yimin Du,Qi An,Xin He,Cunqi Zhai,Fei Tan,Weijia Lin,Xiaochun Gong,Yongchao Deng,Shousheng Jia,Xiangzheng Zhang

Main category: cs.CL

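TL;DR: This paper proposes Variable Entropy Policy Optimization (VEPO), which uses reinforcement learning with verifiable rewards to introduce deterministic structural constraints during training and dynamically balances literal fidelity against semantic naturalness, substantially improving tokenization efficiency and translation quality for low-resource languages.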

Details Motivation: Large language models perform poorly on low-resource languages, mainly because of inefficient subword segmentation and imbalanced training data. Method: VEPO combines reinforcement learning with verifiable rewards, introduces deterministic structural constraints (sequence length, format consistency, linguistic well-formedness), and uses a variable entropy mechanism to regulate the exploration-exploitation balance, supported by entropy-tempered advantage estimation and asymmetric clipping. Result: Experiments across 90 FLORES-200, COMET-22, chrF translation directions show that VEPO substantially improves tokenization efficiency and translation quality, narrowing the performance gap for low-resource languages. Conclusion: Through structured policy alignment and dynamic entropy control, VEPO eases key bottlenecks in low-resource language modeling and provides a verifiable, robust, and controllable optimization framework for multilingual NLP. Abstract: Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
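Asymmetric clipping, mentioned in this entry, can be illustrated with a PPO-style clipped surrogate that uses different lower and upper bounds on the probability ratio. This is a generic sketch of the technique, not VEPO's objective: the function name and the bounds `eps_low=0.2` and `eps_high=0.28` are assumptions, and the entropy-tempered advantage estimation is not modeled because the digest does not specify it.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token PPO-style objective with separate lower and upper clip bounds.

    ratio     -- new-policy probability divided by old-policy probability
    advantage -- advantage estimate for the sampled token
    A wider upper bound lets low-probability tokens with positive advantage
    grow more before clipping, which keeps exploration alive.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) choice between the raw and the clipped term.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped at the upper bound.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Negative advantage: the penalty is floored at the lower bound.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
# Inside the trust region the objective is just ratio * advantage.
print(clipped_surrogate(ratio=1.1, advantage=1.0))
```

With symmetric bounds the two `eps` values would be equal; making the upper bound larger is one common way to trade a little stability for more exploration.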

[59] Evaluating Counterfactual Strategic Reasoning in Large Language Models

Dimitrios Georgousis,Maria Lymperaiou,Angeliki Dimitriou,Giorgos Filandrianos,Giorgos Stamou

Main category: cs.CL

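TL;DR: This paper evaluates the strategic behavior of large language models (LLMs) in repeated games, introducing counterfactual variants of Prisoner's Dilemma and Rock-Paper-Scissors that alter payoff structures and action labels, to test whether the models reason strategically or merely rely on memorized patterns; the results reveal clear limitations in incentive sensitivity, structural generalization, and strategic reasoning in counterfactual environments.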

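Details Motivation: To test whether LLMs' strategic performance in game-theoretic settings reflects genuine reasoning or simple memorization of patterns common in training data. Method: Starting from two canonical games (Prisoner's Dilemma and Rock-Paper-Scissors), the authors construct counterfactual variants that alter payoff structures and action labels to break the familiar symmetries and dominance relations, and use a multi-metric evaluation framework to compare model behavior in the default and counterfactual settings. Result: LLMs show markedly weaker strategic performance in counterfactual environments, exposing shortcomings in incentive sensitivity, structural generalization, and deeper strategic reasoning. Conclusion: Current LLMs' game-playing behavior more likely stems from surface pattern matching than from intrinsic, generalizable strategic reasoning. Abstract: We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.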
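What it means for a counterfactual variant to break a dominance relation can be shown with a small payoff-table check. The payoff numbers, the relabeling, and the helper names below are hypothetical; they illustrate the kind of manipulation the abstract describes, not the paper's actual game instances.

```python
def dominant_action(payoffs, actions):
    """Return the row player's strictly dominant action, or None if there is none.

    payoffs maps (own_action, opponent_action) to the row player's payoff.
    """
    for a in actions:
        rivals = [b for b in actions if b != a]
        if all(payoffs[(a, opp)] > payoffs[(b, opp)] for b in rivals for opp in actions):
            return a
    return None


def relabel(payoffs, mapping):
    """Rename actions without touching the payoff structure."""
    return {(mapping[a], mapping[b]): v for (a, b), v in payoffs.items()}


# Default Prisoner's Dilemma: defecting (D) strictly dominates cooperating (C).
default_pd = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Counterfactual payoffs (invented): same labels, but cooperation now dominates.
altered_pd = {("C", "C"): 5, ("C", "D"): 2, ("D", "C"): 3, ("D", "D"): 1}

# Counterfactual labels: the familiar structure hidden behind neutral names.
relabeled_pd = relabel(default_pd, {"C": "Y", "D": "X"})

print(dominant_action(default_pd, ["C", "D"]))
print(dominant_action(altered_pd, ["C", "D"]))
print(dominant_action(relabeled_pd, ["X", "Y"]))
```

A player that reasons from the payoffs should switch to cooperating in the altered game and keep choosing the defect-equivalent action under new labels; a player that pattern-matches on the familiar story would do neither.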

[60] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Zhuolin Yang,Zihan Liu,Yang Chen,Wenliang Dai,Boxin Wang,Sheng-Chieh Lin,Chankyu Lee,Yangyi Chen,Dongfu Jiang,Jiafan He,Renjie Pi,Grace Lam,Nayeon Lee,Alexander Bukharin,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping

Main category: cs.CL

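TL;DR: Nemotron-Cascade 2 is an open 30B MoE model (with only 3B activated parameters) that reaches frontier-level mathematical and coding reasoning and agentic capability, achieves gold-medal-level performance on IMO/IOI/ICPC with a very small parameter count, and improves performance by expanding Cascade RL and adding multi-domain on-policy distillation.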

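Details Motivation: To improve the reasoning and agentic capabilities of small MoE models, matching the performance of much larger models with fewer parameters and advancing efficient, open models with high intelligence density. Method: After supervised fine-tuning (SFT) on a meticulously curated dataset, Cascade RL is substantially expanded to cover a much broader range of reasoning and agentic tasks; multi-domain on-policy distillation from the strongest intermediate teacher model for each domain is applied throughout, recovering benchmark regressions and sustaining gains. Result: The model reaches gold-medal-level performance in the IMO, IOI, and ICPC World Finals, with mathematical and coding reasoning approaching that of frontier open models and high intelligence density at 20x fewer parameters; it is the second open-weight model to do so, after DeepSeekV3.2-Speciale. Conclusion: Nemotron-Cascade 2 shows that structured reinforcement learning combined with cross-domain knowledge distillation can maintain or even improve complex reasoning and agentic capabilities at a much smaller parameter count, offering a new approach to efficient large-model design. Abstract: We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.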

[61] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Ziyin Zhang,Zihan Liao,Hang Yu,Peng Di,Rui Wang

Main category: cs.CL

TL;DR: F2LLM-v2 is a new family of efficient, multilingual embedding models (80M–14B parameters), trained on 60M high-quality samples, supporting 200+ languages—especially mid/low-resource ones—using a two-stage LLM pipeline with matryoshka learning, pruning, and distillation; it achieves SOTA on MTEB benchmarks and is fully open-sourced.

Details Motivation: To address the lack of high-quality, efficient, and truly multilingual embedding models—particularly for mid- and low-resource languages—while improving computational efficiency over prior LLM-based embedding approaches. Method: A two-stage LLM-based embedding training pipeline integrated with matryoshka learning, model pruning, and knowledge distillation, trained on a newly curated dataset of 60 million high-quality, multilingual samples. Result: F2LLM-v2-14B ranks first on 11 MTEB benchmarks; smaller variants also achieve new state-of-the-art performance for resource-constrained settings; all models, data, code, and checkpoints are released open-source. Conclusion: F2LLM-v2 establishes a new benchmark for efficient, scalable, and inclusive multilingual embedding models, bridging capability and accessibility across diverse language communities and hardware constraints. Abstract: We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
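Matryoshka learning trains an embedding so that its leading dimensions remain useful on their own, which lets a consumer trade accuracy for storage by truncating. The sketch below shows only the consumer side, under stated assumptions: the vectors are invented, the helper names are hypothetical, and nothing here reflects F2LLM-v2's actual dimensions or API.

```python
import math


def truncate_and_normalize(vec, k):
    """Keep the first k dimensions and rescale the result to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 6-dimensional embeddings for a query and a document.
query = [0.9, 0.3, 0.2, 0.1, 0.05, 0.02]
document = [0.8, 0.4, 0.1, 0.2, 0.03, 0.01]

for k in (2, 4, 6):
    q, d = truncate_and_normalize(query, k), truncate_and_normalize(document, k)
    print(k, round(cosine(q, d), 4))
```

Renormalizing after truncation matters: without it, cosine scores computed at different prefix lengths would not be comparable.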

cs.CV [Back]

[62] RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

X. Gao,C. Chien,G. Liu,A. Manullang

Main category: cs.CV

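TL;DR: This paper addresses multi-label classification of capsule endoscopic videos (CEV) by fine-tuning a Transformer-based deep learning model (Google Vision Transformer, ViT) to recognize 17 anatomical and pathological labels, but on the test set both mAP@0.5 and mAP@0.95 are extremely low (about 0.02), indicating severely inadequate performance.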

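Details Motivation: To support the multi-label classification task on capsule endoscopic videos in the Gastro Competition, an automated analysis method is needed that accurately recognizes multiple anatomical structures and pathological findings. Method: A Google Vision Transformer (ViT) model is fine-tuned at 224 x 224 resolution with the "batch16" setting named in the abstract (read in this summary as a batch size of 16) for multi-label classification over 17 labels. Result: On three test videos, mAP@0.5 is 0.0205 and mAP@0.95 is 0.0196, extremely low values that indicate poor current performance. Conclusion: Although a ViT architecture was adapted to the CEV multi-label task, the results fall far short of practical use, and the data, model, or evaluation approach needs improvement. Abstract: This work corresponds to the Gastro Competition for multi-label classification from capsule endoscopic videos (CEV). A deep learning network based on Transformers is fine-tuned for this task. The baseline model is Google Vision Transformer (ViT) batch16 with 224 x 224 resolution. In total, 17 labels are classified, which are mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. For the test dataset of three videos, the overall mAP @0.5 is 0.0205 whereas the overall mAP @0.95 is 0.0196.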

[63] S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

Naichuan Zheng,Hailun Xia,Zepeng Sun,Weiyi Li,Yujia Wang

Main category: cs.CV

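TL;DR: This paper proposes S3T-Former, the first purely spike-driven Transformer architecture for energy-efficient skeleton action recognition; with Multi-Stream Anatomical Spiking Embedding (M-ASE), Lateral Spiking Topology Routing (LSTR), and a Spiking State-Space (S3) Engine, it addresses short-term amnesia while preserving high sparsity, significantly reducing energy consumption and reaching state-of-the-art performance.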

Details Motivation: Existing spiking neural network (SNN) methods for skeleton action recognition sacrifice the intrinsic sparsity of SNNs by relying on dense matrix aggregations, multimodal fusion, or non-sparse frequency-domain transformations, and they suffer from the short-term amnesia of spiking neurons, which makes deployment on resource-constrained edge devices difficult. Method: The paper proposes the Spiking State-Space Topology Transformer (S3T-Former), comprising: 1) Multi-Stream Anatomical Spiking Embedding (M-ASE), a generalized kinematic differential operator that turns multimodal skeleton features into heterogeneous sparse spike streams; 2) Lateral Spiking Topology Routing (LSTR), for on-demand conditional spike propagation; and 3) a Spiking State-Space (S3) Engine, which models long-range temporal dynamics without non-sparse spectral methods. Result: Experiments on multiple large-scale datasets show that S3T-Former keeps highly competitive accuracy while its theoretical energy consumption is significantly lower than that of conventional ANNs, setting a new state of the art for energy-efficient neuromorphic action recognition. Conclusion: S3T-Former is the first purely spike-driven Transformer architecture that is truly sparse in space and time and retains long-range memory, offering a new approach to low-power skeleton action recognition on edge devices. Abstract: Skeleton-based action recognition is crucial for multimedia applications but heavily relies on power-hungry Artificial Neural Networks (ANNs), limiting their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency domain transformations. Furthermore, they severely suffer from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, elegantly transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds.
Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while theoretically reducing energy consumption compared to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.

[64] DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment

Wuqi Wang,Haochen Yang,Baolu Li,Jiaqi Sun,Xiangmo Zhao,Zhigang Xu,Qing Guo,Haigen Min,Tianyun Zhang,Hongkai Yu

Main category: cs.CV

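TL;DR: This paper presents DarkDriving, the first real-world day and night aligned dataset for low-light enhancement in autonomous driving; using a Trajectory Tracking based Pose Matching (TTPM) method in a 69-acre closed test field, the authors collected 9,538 precisely aligned day and night image pairs (alignment error of only a few centimeters) with 2D bounding box annotations, supporting four tasks including low-light enhancement and 2D/3D detection.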

Details Motivation: Existing low-light enhancement datasets are mostly limited to small-range exposure control or static scenes, and nighttime driving datasets lack precisely paired daytime images; the difficulty of building a real-world day and night aligned dataset in dynamic driving scenes has severely constrained research in this area. Method: The authors propose an automatic day-night Trajectory Tracking based Pose Matching (TTPM) method, collect and precisely align day and night images in a large closed real-vehicle test field, manually annotate 2D object bounding boxes for each image pair, and define four perception-oriented low-light enhancement tasks. Result: The resulting DarkDriving dataset contains 9,538 precisely aligned day and night image pairs with an alignment error of only a few centimeters and 2D bounding box annotations; experiments confirm its value as a benchmark for low-light enhancement and detection, and it generalizes to other low-light driving data such as nuScenes. Conclusion: DarkDriving fills the gap in real-world day and night aligned low-light data for dynamic driving scenes and provides a reliable, generalizable benchmark for low-light enhancement and its use in autonomous driving perception. Abstract: The low-light conditions are challenging to the vision-centric perception systems for autonomous driving in the dark environment. In this paper, we propose a new benchmark dataset (named DarkDriving) to investigate the low-light enhancement for autonomous driving. The existing real-world low-light enhancement benchmark datasets can be collected by controlling various exposures only in small-ranges and static scenes. The dark images of the current nighttime driving datasets do not have the precisely aligned daytime counterparts. The extreme difficulty to collect a real-world day and night aligned dataset in the dynamic driving scenes significantly limited the research in this area. With a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day and night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset has 9,538 day and night image pairs precisely aligned in location and spatial contents, whose alignment error is in just several centimeters. For each pair, we also manually label the object 2D bounding boxes. DarkDriving introduces four perception related tasks, including low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and 3D detection of autonomous driving in the dark environment.
The experimental results show that our DarkDriving dataset provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving and it can also be generalized to enhance dark images and promote detection in some other low-light driving environment, such as nuScenes.

[65] SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

Wei Tang,Xuejing Liu,Yanpeng Sun,Zechao Li

Main category: cs.CV

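TL;DR: This paper proposes SSP-SAM, a framework that adds a Semantic-Spatial Prompt (SSP) encoder to strengthen SAM's understanding of natural language, improving Referring Expression Segmentation (RES) and Generalized RES (GRES), with especially strong results at high IoU thresholds and in open-vocabulary scenarios.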

Details Motivation: The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, so it cannot be applied directly to Referring Expression Segmentation (RES). Method: The paper proposes SSP-SAM with a Semantic-Spatial Prompt (SSP) encoder that incorporates visual and linguistic attention adapters, which highlight salient objects in the visual features and discriminative phrases in the linguistic features, producing high-quality prompts that guide SAM toward precise language-driven segmentation. Result: SSP-SAM clearly outperforms existing methods on mainstream RES and GRES benchmarks, keeps high precision under strict metrics such as Pr@0.9, and improves performance on the open-vocabulary PhraseCut dataset. Conclusion: SSP-SAM bridges SAM's strong segmentation ability and the need for language understanding, supports the more flexible generalized RES setting without additional modifications, and demonstrates generalization and practical value. Abstract: The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM's segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: https://github.com/WayneTomas/SSP-SAM.

[66] CytoSyn: a Foundation Diffusion Model for Histopathology -- Tech Report

Thomas Duboudin,Xavier Fontaine,Etienne Andrier,Lionel Guillou,Alexandre Filiot,Thalyssa Baiocco-Rodrigues,Antoine Olivier,Alberto Romagnoni,John Klein,Jean-Baptiste Schiratti

Main category: cs.CV

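TL;DR: This paper presents CytoSyn, a generative foundation latent diffusion model for histopathology that produces highly realistic and diverse H&E-stained images. Methodological improvements, dataset scaling, and sampling refinements yield the upgraded CytoSyn-v2, which surpasses PixCell in several respects. The work also stresses how strongly preprocessing (such as JPEG compression) affects diffusion-model performance and evaluation metrics.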

Details Motivation: Histopathology currently lacks foundation models dedicated to generative tasks, and existing feature extractors cannot handle generative tasks such as virtual staining, so a dedicated generative foundation model is needed. Method: Proposes CytoSyn, a latent diffusion model, and covers methodological improvements (such as training and sampling strategies), training-set scaling (more than 10,000 TCGA whole-slide images spanning 32 cancer types), and the prevention of slide-level overfitting; the model weights, datasets, and synthetic images are released. Result: CytoSyn-v2 reaches state-of-the-art quality, diversity, and generalization when generating H&E images (for example, generating inflammatory bowel disease images across disease types); the study shows that preprocessing details such as JPEG compression strongly affect both diffusion-model performance and evaluation metrics; the model is open-sourced on Hugging Face. Conclusion: CytoSyn sets a new benchmark for generative modeling in histopathology, demonstrates the key role of large-scale, high-quality, multi-cancer training data in generalization, and calls on the community to pay attention to the impact of preprocessing on evaluation results. Abstract: Computational pathology has made significant progress in recent years, fueling advances in both fundamental disease understanding and clinically ready tools. This evolution is driven by the availability of large amounts of digitized slides and specialized deep learning methods and models. Multiple self-supervised foundation feature extractors have been developed, enabling downstream predictive applications from cell segmentation to tumor sub-typing and survival analysis. In contrast, generative foundation models designed specifically for histopathology remain scarce. Such models could address tasks that are beyond the capabilities of feature extractors, such as virtual staining. In this paper, we introduce CytoSyn, a state-of-the-art foundation latent diffusion model that enables the guided generation of highly realistic and diverse histopathology H&E-stained images, as shown in an extensive benchmark. We explored methodological improvements, training set scaling, sampling strategies and slide-level overfitting, culminating in the improved CytoSyn-v2, and compared our work to PixCell, a state-of-the-art model, in an in-depth manner. This comparison highlighted the strong sensitivity of both diffusion models and performance metrics to preprocessing-specific details such as JPEG compression. Our model has been trained on a dataset obtained from more than 10,000 TCGA diagnostic whole-slide images of 32 different cancer types. Despite being trained only on oncology slides, it maintains state-of-the-art performance generating inflammatory bowel disease images. 
To support the research community, we publicly release CytoSyn's weights, its training and validation datasets, and a sample of synthetic images in this repository: https://huggingface.co/Owkin-Bioptimus/CytoSyn.

[67] Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

Chen Zhao,Zhuoran Wang,Haoyang Li,Shifeng Bao,Guanlin Li,Youhe Feng,Yang Li,Jie Tang,Jing Zhang

Main category: cs.CV

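TL;DR: This paper proposes Action-Draft-and-Verify (ADV), which combines the efficiency of diffusion-based action generation with the robustness of VLM reranking and substantially raises success rates on simulated and real-world tasks.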

Details Motivation: Diffusion action experts are efficient and precise, but auto-regressive paradigms are more robust and generalize better in out-of-distribution environments; the two need to be combined. Method: In ADV, a diffusion action expert first drafts multiple candidate action chunks, and a vision-language model (VLM) then scores all candidates in a single forward pass with a perplexity-based metric and selects the best one. Result: Success rate improves by +4.3 points in simulation and +19.7 points in the real world, at the cost of a single VLM reranking pass. Conclusion: ADV combines the strengths of the diffusion and auto-regressive paradigms, preserving efficiency while substantially improving the performance and robustness of VLA models on in-distribution and out-of-distribution tasks. Abstract: Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, with a single-pass VLM reranking overhead.
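
The draft-then-verify selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only says "perplexity-style metric", so the plain token-level perplexity below, the assumption that lower is better, and the names `perplexity` and `select_candidate` are all hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one candidate from its per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_candidate(candidates_logprobs):
    """Return the index of the candidate with the lowest perplexity.

    Assumption: the verifier prefers the candidate it finds least surprising.
    """
    scores = [perplexity(lp) for lp in candidates_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical log-probabilities a VLM might assign to three drafted action chunks.
drafts = [
    [-1.2, -0.9, -1.5],
    [-0.3, -0.4, -0.2],
    [-2.0, -1.8, -2.2],
]
best = select_candidate(drafts)  # index of the chunk to execute
```

In a real system the log-probabilities for all candidates would come from one batched forward pass, which is what keeps the reranking overhead to a single pass.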

[68] One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Haoxiang Rao,Zhao Wang,Chenyang Si,Yan Lyu,Yuanyi Duan,Fang Zhao,Caifeng Shan

Main category: cs.CV

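TL;DR: This paper proposes O2MAG, a training-free few-shot industrial anomaly generation method that uses the self-attention of a single anomalous image to synthesize more realistic anomalies, improving generation quality and downstream detection through self-attention grafting, anomaly-mask guidance, Anomaly-Guided Optimization, and Dual-Attention Enhancement.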

Details Motivation: Anomalous samples are scarce in industrial anomaly detection, and existing few-shot anomaly synthesis methods are slow to train and struggle to reproduce the real anomaly distribution faithfully, which limits downstream detection performance. Method: Proposes O2MAG, which starts from a single anomalous image and uses three parallel diffusion processes, self-attention grafting, an anomaly mask to reduce foreground-background confusion, Anomaly-Guided Optimization to align text prompts with real anomaly semantics, and Dual-Attention Enhancement to strengthen attention on masked regions. Result: Significantly outperforms prior state-of-the-art methods on multiple downstream anomaly detection tasks. Conclusion: O2MAG is an efficient, training-free, high-fidelity few-shot anomaly generation framework that improves industrial anomaly detection performance. Abstract: Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

[69] Q-Drift: Quantization-Aware Drift Correction for Diffusion Model Sampling

Sooyoung Ryu,Mathieu Salzmann,Saqib Javed

Main category: cs.CV

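TL;DR: This paper proposes Q-Drift, a sampler-side drift correction that mitigates the accumulation of quantization noise introduced by post-training quantization (PTQ) in diffusion models. It improves generation quality (FID reduced by up to 4.59) without hurting CLIP scores, and it is plug-and-play with low overhead.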

Details Motivation: Post-training quantization (PTQ) makes large diffusion models easier to deploy, but quantization noise accumulates along the denoising trajectory and degrades generation quality. Method: Models quantization error as an implicit stochastic perturbation at each denoising step and derives a drift correction that preserves the marginal distribution; estimates a timestep-dependent variance statistic from a small number (for example, 5) of paired full-precision/quantized calibration runs, giving a lightweight, general sampler-side correction. Result: Across 6 text-to-image models (DiT and U-Net), 3 samplers (Euler, flow-matching, DPM-Solver++), and 2 PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID in most settings, with a reduction of up to 4.59 (PixArt-Sigma + SVDQuant W3A4), while preserving CLIP scores. Conclusion: Q-Drift is a principled, practical sampler-side correction for PTQ that is broadly compatible with existing models, samplers, and quantization schemes; it mitigates quantization-noise accumulation and improves generation quality without significant inference overhead. Abstract: Post-training quantization (PTQ) is a practical path to deploy large diffusion models, but quantization noise can accumulate over the denoising trajectory and degrade generation quality. We propose Q-Drift, a principled sampler-side correction that treats quantization error as an implicit stochastic perturbation on each denoising step and derives a marginal-distribution-preserving drift adjustment. Q-Drift estimates a timestep-wise variance statistic from calibration, in practice requiring as few as 5 paired full-precision/quantized calibration runs. The resulting sampler correction is plug-and-play with common samplers, diffusion models, and PTQ methods, while incurring negligible overhead at inference. Across six diverse text-to-image models (spanning DiT and U-Net), three samplers (Euler, flow-matching, DPM-Solver++), and two PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID over the corresponding quantized baseline in most settings, with up to 4.59 FID reduction on PixArt-Sigma (SVDQuant W3A4), while preserving CLIP scores.
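
To make the calibration step concrete, the sketch below estimates a per-timestep variance of the quantization error from paired runs. It is an illustration under strong assumptions: the abstract does not define the statistic or how it enters the drift adjustment, so the scalar per-step outputs, the population variance, and the name `timestep_variance` are all hypothetical stand-ins.

```python
def timestep_variance(fp_runs, q_runs):
    """Per-timestep variance of (quantized - full-precision) outputs.

    fp_runs[r][t] and q_runs[r][t] hold a scalar model output for calibration
    run r at denoising step t (a simplification: real outputs are tensors).
    """
    n_runs = len(fp_runs)
    n_steps = len(fp_runs[0])
    variances = []
    for t in range(n_steps):
        errors = [q_runs[r][t] - fp_runs[r][t] for r in range(n_runs)]
        mean = sum(errors) / n_runs
        variances.append(sum((e - mean) ** 2 for e in errors) / n_runs)
    return variances

# Two hypothetical paired calibration runs over two timesteps.
full_precision = [[1.0, 2.0], [1.0, 2.0]]
quantized = [[1.1, 2.0], [0.9, 2.0]]
stats = timestep_variance(full_precision, quantized)
```

Here the first timestep shows nonzero error variance and the second shows none, which is the kind of timestep-wise signal a sampler-side correction could consume.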

[70] Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

Mohammed Rahman Sherif Khan Mohammad,Ardhendu Behera,Sandip Pradhan,Swagat Kumar,Amr Ahmed

Main category: cs.CV

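TL;DR: This paper proposes a training-only heterogeneous graph teacher framework (TOGA) that uses multimodal graph modeling and cross-modal reasoning to distill fine-grained image-text relational knowledge into the Tip-Adapter cache, significantly improving few-shot performance without adding inference overhead.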

Details Motivation: Existing adapter-based CLIP few-shot methods (such as Tip-Adapter) use only global uni-modal features and overlook fine-grained relations between image patches and their structural alignment with class text. Method: Builds a training-only Heterogeneous Graph Teacher that places multi-scale image patches and text prompts in a unified graph, performs deep cross-modal reasoning with a Modality-aware Graph Transformer (MGT), and extracts high-quality class features through discriminative node filtering; a cache-aware dual-objective strategy then distills this relational knowledge into Tip-Adapter's key-value cache. Result: Consistently reaches state-of-the-art performance on standard 1-16-shot benchmarks; ablations show that graph supervision, text-guided reasoning, and node filtering are the key components. Conclusion: Without changing the inference architecture, adding a graph teacher at training time and distilling its relational knowledge is enough to significantly strengthen Tip-Adapter's few-shot generalization at zero inference cost. Abstract: Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.

[71] From Concepts to Judgments: Interpretable Image Aesthetic Assessment

Xiao-Chang Liu,Johan Wagemans

Main category: cs.CV

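TL;DR: This paper proposes an interpretable image aesthetic assessment (IAA) framework grounded in human-understandable aesthetic concepts. By learning a subspace of high-level aesthetic concepts and adding a residual predictor, it delivers transparent, interpretable aesthetic judgments while keeping predictive performance competitive.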

Details Motivation: Existing IAA models predict well but lack interpretability, whereas humans rely on high-level cues when judging aesthetics; an interpretable IAA method grounded in human-understandable aesthetic concepts is therefore needed. Method: Proposes an interpretable IAA framework based on human-understandable aesthetic concepts. It first learns high-level aesthetic concepts in an accessible manner and constructs a concept subspace that forms the basis of an inherently interpretable model; it then adds a simple yet effective residual predictor to capture subtle aesthetic influences beyond the explicit concepts. Result: Experiments on photographic and artistic datasets show competitive predictive performance together with transparent, human-understandable aesthetic judgments. Conclusion: The framework markedly improves the interpretability of IAA models without sacrificing predictive performance, supporting the value of building interpretable aesthetic assessment models guided by human cognition. Abstract: Image aesthetic assessment (IAA) aims to predict the aesthetic quality of images as perceived by humans. While recent IAA models achieve strong predictive performance, they offer little insight into the factors driving their predictions. Yet for users, understanding why an image is considered pleasing or not is as valuable as the score itself, motivating growing interest in interpretability within IAA. When humans evaluate aesthetics, they naturally rely on high-level cues to justify their judgments. Motivated by this observation, we propose an interpretable IAA framework grounded in human-understandable aesthetic concepts. We learn these concepts in an accessible manner, constructing a subspace that forms the foundation of an inherently interpretable model. To capture nuanced influences on aesthetic perception beyond explicit concepts, we introduce a simple yet effective residual predictor. Experiments on photographic and artistic datasets demonstrate that our method achieves competitive predictive performance while offering transparent, human-understandable aesthetic judgments.

[72] Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong,Zuyan Liu,Shulin Tian,Yongming Rao,Ziwei Liu

Main category: cs.CV

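TL;DR: This paper proposes Insight-V++, a unified multi-agent visual reasoning framework that automatically generates high-quality long-chain multimodal reasoning data, uses a dual-agent architecture (reasoning plus summary), and introduces the new ST-GRPO and J-GRPO algorithms, significantly improving MLLM performance on complex image and video reasoning tasks.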

Details Motivation: Existing MLLMs lack high-quality long-chain multimodal reasoning data and a suitable training paradigm, so they cannot yet gain reliability and capability from test-time reasoning the way LLMs do. Method: Builds the multi-agent visual reasoning framework Insight-V++: 1) a multi-granularity automatic data generation pipeline; 2) a dual-agent architecture (reasoning agent + summary agent); 3) two new algorithms, ST-GRPO and J-GRPO, that replace DPO and support long-horizon spatial-temporal reasoning and robust evaluation; 4) iterative reasoning-path generation driven by summary-agent feedback, forming a self-improving loop for the whole system. Result: On base models such as LLaVA-NeXT and Qwen2.5-VL, Insight-V++ delivers significant gains on complex image and video reasoning benchmarks while retaining strong performance on traditional perception tasks. Conclusion: Insight-V++ offers MLLMs a scalable, self-evolving training paradigm for long-chain multimodal reasoning and narrows the gap between LLMs and MLLMs in test-time reasoning. Abstract: Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. 
Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.

[73] VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

Mohammad Qazim Bhat,Yufan Huang,Niket Agarwal,Hao Wang,Michael Woods,John Kenyon,Tsung-Yi Lin,Xiaodong Yang,Ming-Yu Liu,Kevin Xie

Main category: cs.CV

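TL;DR: This paper proposes VLM-AutoDrive, a framework that post-trains pretrained vision-language models with multi-source supervision (metadata captions, LLM-generated descriptions, VQA pairs, CoT reasoning), significantly improving both performance and interpretability for collision and near-collision detection in dashcam video.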

Details Motivation: General-purpose multimodal large models underperform in driving scenes because of domain and temporal misalignment, and safety-critical events (such as collisions) are brief, sparse, and hard to detect. Method: Proposes the modular post-training framework VLM-AutoDrive, which combines metadata-derived captions, LLM-generated descriptions, visual question answering pairs, and chain-of-thought reasoning supervision to achieve domain-aligned, interpretable learning. Result: On real-world Nexar dashcam video, the Collision F1 of the Cosmos-Reason1 7B model rises from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%, with interpretable reasoning traces. Conclusion: VLM-AutoDrive provides a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks, bridging the gap between perception, causality, and decision reasoning. Abstract: The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

[74] MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles

Alexander Rasch,Rahul Rajendra Pai

Main category: cs.CV

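TL;DR: This paper introduces MicroVision, an open image dataset designed for detecting vulnerable road users (VRUs) and stationary micromobility vehicles (MMVs), intended to address the limited class granularity and viewpoint diversity for VRUs and MMVs in existing datasets.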

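Details Motivation: Existing public image datasets lack fine-grained classes and diversity for vulnerable road users (VRUs) and micromobility vehicles (MMVs): they may label both pedestrians and e-scooter riders as 'person' and omit newer MMVs such as e-scooters. Most are also captured from a car perspective and lack data from VRU-only areas (sidewalks, cycle paths), which makes them a poor basis for the accurate detection needed in traffic safety and planning. Method: Builds the MicroVision dataset: more than 8,000 anonymized full-HD images collected in Gothenburg, Sweden, over an entire year and covering almost 2,000 unique interaction scenes; more than 30,000 carefully annotated VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters); benchmark object-detection models based on state-of-the-art architectures are trained and released. Result: The released benchmark models reach a mean average precision (mAP) of up to 0.723 on an unseen test set; the dataset and model weights are open, supporting traffic-safety analysis (distinguishing different VRUs and MMVs) and monitoring of micromobility use. Conclusion: MicroVision fills the gap in fine-grained VRU/MMV detection data, offering a VRU perspective, high diversity, and high-quality annotations; the accompanying benchmark models demonstrate its usefulness, and it may advance research and applications in traffic safety and urban micromobility planning. Abstract: Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images -- a computer-vision task relying heavily on the quality of the images to train on. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as "person", or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and part of almost 2,000 unique interaction scenes. 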
Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.723 on an unseen test set. The dataset and model can support traffic safety to distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility. The dataset and model weights can be accessed at https://doi.org/10.71870/eepz-jd52.

[75] Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting

Guillem Casadesus Vila,Adam Dai,Grace Gao

Main category: cs.CV

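TL;DR: This paper proposes a real-time dense lunar-surface mapping framework based on 3D Gaussian Splatting (3DGS) that combines GRU-based stereo depth estimation with a CNN semantic segmentation model. On LuPNT simulation data it achieves high accuracy (about 3 cm height error) in real time without LiDAR, supports novel view synthesis, and lays the groundwork for a SLAM system.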

Details Motivation: Lunar navigation and mapping face poorly textured terrain, high-contrast lighting, and limited compute, so robust, efficient, lightweight perception and mapping methods are needed. Method: Builds a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation; benchmarks several models on synthetic data from the LuPNT simulator and selects a GRU-based stereo depth estimation model and a CNN semantic segmentation model; uses ground-truth poses to decouple local scene understanding from global state estimation and performs end-to-end dense reconstruction. Result: Achieves a geometric height accuracy of about 3 cm over a 120-meter traverse, outperforming a traditional point-cloud baseline without LiDAR; the resulting 3DGS map supports high-quality novel view synthesis and could be extended into a SLAM system that jointly optimizes map and pose. Conclusion: Combining semantic segmentation and dense depth estimation with learned map representations (such as 3DGS) is an effective way to build accurate, large-scale lunar maps for future exploration missions. Abstract: Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high-contrast lighting, and limited computational resources. This paper presents a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120-meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large-scale maps to support future lunar surface missions.

[76] LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression

Tamer Shanableh

Main category: cs.CV

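TL;DR: This paper proposes LRConv-NeRV, which replaces some of the dense 3×3 convolutional layers in the NeRV decoder with structured low-rank separable convolutions, substantially improving compute and storage efficiency while preserving video reconstruction quality and temporal consistency.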

Details Motivation: NeRV's convolutional decoder is computationally expensive and memory intensive, which makes it hard to deploy in resource-constrained environments. Method: Proposes LRConv-NeRV, which applies a structured low-rank separable factorization to selected 3×3 convolutional layers in the NeRV decoder, applied progressively from deeper to shallower layers; supports end-to-end training and INT8 post-training quantization. Result: Applying LRConv only to the final decoder stage cuts GFLOPs by 68% (201.9→64.9) and model size by 9.3%, with PSNR/MS-SSIM almost unchanged and bitrate reduced by 9.2%; under INT8 quantization, quality stays close to the original NeRV; LPIPS analysis shows excellent temporal stability. Conclusion: LRConv-NeRV is an efficient neural video decoding architecture for low-precision, resource-constrained settings and offers a better efficiency-quality trade-off than existing methods. Abstract: Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. 
Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.
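
Why a low-rank separable replacement saves parameters can be seen with a quick count. The abstract does not spell out the factorization, so this sketch assumes one common form (a k×1 convolution into `rank` channels followed by a 1×k convolution); the function names, channel widths, and rank are hypothetical and biases are ignored.

```python
def dense_conv_params(c_in, c_out, k=3):
    """Weights in a dense k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_lowrank_params(c_in, c_out, rank, k=3):
    """Weights in an assumed factorization:
    a k x 1 conv (c_in -> rank) followed by a 1 x k conv (rank -> c_out)."""
    return k * c_in * rank + k * rank * c_out

# Hypothetical layer: 64 -> 64 channels with rank 16.
dense = dense_conv_params(64, 64)
lowrank = separable_lowrank_params(64, 64, rank=16)
ratio = lowrank / dense
```

With these illustrative numbers the factorized layer keeps one sixth of the dense layer's weights; the reported GFLOPs and model-size reductions depend on the actual layers and ranks used in the paper.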

[77] CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

Marios Krestenitis,Christos Tzelepis,Konstantinos Ioannidis,Steafanos Vrochidis,Ioannis Kompatsiaris,Georgios Tzimiropoulos,Shaogang Gong,Ioannis Patras

Main category: cs.CV

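TL;DR: This paper proposes CycleCap, which uses the cycle consistency of the image-to-text and text-to-image mappings as a self-supervised signal and fine-tunes vision-language models (VLMs) with Group Relative Policy Optimization (GRPO). It needs only raw images to improve caption accuracy and reduce hallucination, with no human-annotated data.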

Details Motivation: Existing vision-language models are prone to image-text misalignment and produce generic or hallucinated descriptions; existing remedies depend on large annotated datasets or complex test-time frameworks, which are costly and scale poorly. Method: Builds a cycle from image to text (the VLM) and from text back to image (a pretrained text-to-image model), uses the similarity between the original and reconstructed images as the reward, and fine-tunes the VLM in a self-supervised way with GRPO. Result: Validated on four VLMs from 1B to 7B parameters, CycleCap consistently surpasses existing supervised cycle-consistency methods on captioning and hallucination benchmarks. Conclusion: Cycle consistency can serve as an efficient, annotation-free self-supervised training signal that markedly improves VLM image-text alignment and the faithfulness of descriptions. Abstract: Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
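
The way GRPO turns a group of rewards into training signal can be sketched with the standard group-relative advantage, in which each reward is centered and scaled by the statistics of its own group. The reward values below are made up, and the similarity function that would produce them (original versus reconstructed image) is not specified in the abstract; `group_relative_advantages` is a hypothetical helper name.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical image-similarity rewards for three captions of the same image.
similarities = [0.2, 0.4, 0.6]
advantages = group_relative_advantages(similarities)
```

Captions whose reconstructions match the original image better than the group average receive positive advantages, so no learned value function or annotated reference caption is needed.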

[78] Fast and Generalizable NeRF Architecture Selection for Satellite Scene Reconstruction

Devjyoti Chakraborty,Zaki Sukma,Rakandhiya D. Rachmanto,Kriti Ghosh,In Kee Kim,Suchendra M. Bhandarkar,Lakshmish Ramaswamy,Nancy K. O'Hare,Deepak Mishra

Main category: cs.CV

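TL;DR: This paper proposes PreSCAN, a framework that predicts NeRF reconstruction quality before training from lightweight geometric and photometric descriptors, enabling fast architecture selection (<30 seconds), a large speedup over NAS (1000×), and more energy-efficient deployment on edge devices.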

Details Motivation: Deploying NeRF on satellite imagery is difficult: each scene must be trained separately, and Neural Architecture Search (NAS) takes hours to days; meanwhile, SHAP analysis shows that multi-view consistency determines reconstruction quality more than model architecture does. Method: Guided by the SHAP analysis, builds the PreSCAN predictive framework, which estimates NeRF reconstruction quality before training from lightweight geometric and photometric descriptors; combines this with offline power/latency profiling to optimize edge deployment. Result: PreSCAN selects a suitable architecture in <30 seconds with <1 dB prediction error; it is 1000× faster than NAS; on Jetson Orin it reduces inference power by 26% and latency by 43% with minimal quality loss; it generalizes across scenes on the DFC2019 dataset without retraining. Conclusion: Multi-view consistency is the key factor in NeRF reconstruction quality for satellite imagery; PreSCAN offers an efficient, lightweight, deployable paradigm for prediction-based architecture selection, making NeRF considerably more practical in resource-constrained satellite applications. Abstract: Neural Radiance Fields (NeRF) have emerged as a powerful approach for photorealistic 3D reconstruction from multi-view images. However, deploying NeRF for satellite imagery remains challenging. Each scene requires individual training, and optimizing architectures via Neural Architecture Search (NAS) demands hours to days of GPU time. While existing approaches focus on architectural improvements, our SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Based on this insight, we develop PreSCAN, a predictive framework that estimates NeRF quality prior to training using lightweight geometric and photometric descriptors. PreSCAN selects suitable architectures in < 30 seconds with < 1 dB prediction error, achieving 1000$\times$ speedup over NAS. We further demonstrate PreSCAN's deployment utility on edge platforms (Jetson Orin), where combining its predictions with offline cost profiling reduces inference power by 26% and latency by 43% with minimal quality loss. Experiments on DFC2019 datasets confirm that PreSCAN generalizes across diverse satellite scenes without retraining.

[79] Unrolled Reconstruction with Integrated Super-Resolution for Accelerated 3D LGE MRI

Md Hasibul Husain Hisham,Shireen Elhabian,Ganesh Adluru,Jason Mendes,Andrew Arai,Eugene Kholmovski,Ravi Ranjan,Edward DiBella

Main category: cs.CV

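TL;DR: This paper proposes a hybrid unrolled reconstruction framework that embeds an Enhanced Deep Super-Resolution (EDSR) network in the model-based reconstruction iterations to jointly perform super-resolution enhancement and enforce data consistency, improving image quality and left atrial segmentation in accelerated 3D LGE MRI.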

Details Motivation: Accelerated 3D late gadolinium enhancement (LGE) MRI must robustly reconstruct thin atrial structures from undersampled k-space data, but existing unrolled models operate at the acquired resolution and struggle to recover high-frequency detail. Method: Proposes a hybrid unrolled reconstruction framework in which an EDSR network replaces the proximal operator at each iteration of a conventional unrolled network, jointly optimizing super-resolution enhancement and data consistency; the model is trained end-to-end on retrospectively undersampled preclinical 3D LGE data. Result: Across acceleration factors, the method achieves higher PSNR and SSIM than the compressed sensing, MoDL, and self-guided DIP baselines, better preserves fine cardiac structures, and improves left atrium (LA) segmentation. Conclusion: Embedding a super-resolution prior directly in a model-based reconstruction framework yields measurable gains in accelerated 3D LGE MRI. Abstract: Accelerated 3D late gadolinium enhancement (LGE) MRI requires robust reconstruction methods to recover thin atrial structures from undersampled k-space data. While unrolled model-based networks effectively integrate physics-driven data consistency with learned priors, they operate at the acquired resolution and may fail to fully recover high-frequency detail. We propose a hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator within each iteration of the optimization loop, enabling joint super-resolution enhancement and data consistency enforcement. The model is trained end-to-end on retrospectively undersampled preclinical 3D LGE datasets and compared against compressed sensing, Model-Based Deep Learning (MoDL), and self-guided Deep Image Prior (DIP) baselines. Across acceleration factors, the proposed method consistently improves PSNR and SSIM over standard unrolled reconstruction and better preserves fine cardiac structures, leading to improved LA (left atrium) segmentation performance. These results demonstrate that integrating super-resolution priors directly within model-based reconstruction provides measurable gains in accelerated 3D LGE MRI.

[80] VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection

Bo-Cheng Qiu,Yu-Fan Lin,Yu-Zhe Pien,Chia-Ming Lee,Fu-En Yang,Yu-Chiang Frank Wang,Chih-Chung Hsu

Main category: cs.CV

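TL;DR: This paper addresses the RARE-VISION task, which targets event-level detection in capsule endoscopy video. It combines the EndoFM-LV and DINOv3 ViT-L/16 backbones and adds a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding, significantly improving event-level detection.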

Details Motivation: Capsule endoscopy event detection is hard because findings are sparse and visually heterogeneous and the videos are long and noisy, yet most existing methods rely on frame-level classification and do not meet the event-level evaluation that clinical use requires. Method: Builds a dual-backbone network (EndoFM-LV models local temporal context; DINOv3 ViT-L/16 extracts strong frame-level semantics), combined with a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion (class-wise weighting, backbone weighting, and probability calibration), and Anatomy-Aware Temporal Event Decoding (temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation). Result: On the official hidden test set, reaches a temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235; validation ablations indicate that every module contributes to event-level performance. Conclusion: Modeling aligned with the event-level metric, complementary backbones, and joint anatomical-temporal modeling are key to better capsule endoscopy event detection. Abstract: Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.
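
Part of the decoding stage (temporal smoothing, thresholding, and per-label event generation) can be sketched for a single label as below. This is a simplified illustration: the moving-average window, the fixed threshold, and the function names are assumptions, and the anatomical constraints and threshold refinement mentioned in the abstract are not modeled.

```python
def smooth(probs, window=3):
    """Centered moving average with shrinking windows at the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(probs)):
        lo, hi = max(0, i - half), min(len(probs), i + half + 1)
        out.append(sum(probs[lo:hi]) / (hi - lo))
    return out

def frames_to_events(probs, threshold=0.5, window=3):
    """Group consecutive above-threshold frames into (start, end) events."""
    events, start = [], None
    for i, p in enumerate(smooth(probs, window)):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(probs) - 1))
    return events

# Hypothetical frame-level probabilities for one label.
frame_probs = [0.0, 0.0, 0.9, 0.0, 0.0, 0.8, 0.9, 0.9, 0.0, 0.0]
events = frames_to_events(frame_probs)
```

The isolated spike at frame 2 is smoothed away while the sustained response over frames 5 to 7 becomes one event, which is the behavior an event-level metric rewards.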

[81] To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Rui Hong,Shuxue Quan

Main category: cs.CV

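TL;DR: This paper proposes a Tri-Layer Diagnostic Framework for analyzing whether vision-language models (VLMs) truly rely on visual information when they answer correctly or instead exploit language shortcuts. Most models exhibit 'Visual Sycophancy': they detect visual anomalies yet still hallucinate answers that match user expectations, and alignment training suppresses honest expression of uncertainty. Larger models reduce language shortcuts but amplify visual sycophancy. The framework also enables post-hoc processing that improves prediction accuracy at no training cost.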

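Details Motivation: To examine whether VLMs genuinely rely on visual input when answering correctly, rather than exploiting statistical language shortcuts or hallucinating to comply with instructions, and to expose how reliable their visual grounding is and what side effects alignment training has. Method: Proposes a Tri-Layer Diagnostic Framework comprising Latent Anomaly Detection (perceptual awareness), a Visual Necessity Score (visual dependency measured via KL divergence), and a Competition Score (conflict between visual grounding and instruction following); uses counterfactual interventions (blind, noise, and conflict images) to evaluate 7 VLMs on 7,000 samples, and adds a scaling analysis and a validation of a post-hoc strategy. Result: 69.6% of samples exhibit Visual Sycophancy and none show Robust Refusal; larger models reduce language shortcuts but amplify visual sycophancy; the diagnostic scores support a post-hoc strategy with no training cost that improves accuracy by up to 9.5 percentage points at 50% coverage. Conclusion: Current VLMs commonly hallucinate behind an appearance of visual dependence, and alignment training may come at the expense of honest expression of uncertainty; scaling alone cannot solve the visual grounding problem; the proposed framework identifies hallucination sources and supports efficient post-hoc optimization. Abstract: When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy--models detect visual anomalies but hallucinate to satisfy user expectations--while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.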
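
A KL-based visual dependency score can be sketched as follows. The abstract says only that the Visual Necessity Score is "measured via KL divergence"; which distributions are compared and in which direction are assumptions here (the answer distribution with the real image against the one with a blind image), and the probability values are invented for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer options."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical answer distributions over two options.
with_image = [0.9, 0.1]   # model sees the real image
blind_image = [0.5, 0.5]  # image replaced by a blank ("blind") input
visual_necessity = kl_divergence(with_image, blind_image)
```

A score near zero would suggest the answer barely changes when the image is removed, which is the signature of a language shortcut; a large score indicates the image actually moved the prediction.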

[82] Pixel-Accurate Epipolar Guided Matching

Oleksii Nasypanyi,Francois Rameau

Main category: cs.CV

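TL;DR: This paper proposes an exact keypoint matching method in angular space. Each keypoint is assigned a tolerance circle that is converted into a 1D angular interval query, and a segment tree provides efficient, pixel-accurate matching under the epipolar constraint, improving both speed and robustness.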

Details Motivation: Existing epipolar-guided keypoint matching methods rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and can miss valid matches. Method: Each keypoint is mapped to an angular interval as seen from the epipole (defined by its tolerance circle); matching is cast as a 1D angular interval query, and a segment tree performs efficient, exact candidate selection in O(log n) time. Result: Evaluation on ETH3D shows a clear speedup over existing methods while recovering exact correspondence sets, with guaranteed pixel-level tolerance and per-keypoint control. Conclusion: The exact angular-space matching framework overcomes the shortcomings of traditional binning and improves efficiency, accuracy, and flexibility, making it suitable for geometry-aware tasks such as SfM. Abstract: Keypoint matching can be slow and unreliable in challenging conditions such as repetitive textures or wide-baseline views. In such cases, known geometric relations (e.g., the fundamental matrix) can be used to restrict potential correspondences to a narrow epipolar envelope, thereby reducing the search space and improving robustness. These epipolar-guided matching approaches have proved effective in tasks such as SfM; however, most rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and may miss valid correspondences. We address these limitations with an exact formulation that performs candidate selection directly in angular space. In our approach, each keypoint is assigned a tolerance circle which, when viewed from the epipole, defines an angular interval. Matching then becomes a 1D angular interval query, solved efficiently in logarithmic time with a segment tree. This guarantees pixel-level tolerance, supports per-keypoint control, and removes unnecessary descriptor comparisons. Extensive evaluation on ETH3D demonstrates noticeable speedups over existing approaches while recovering exact correspondence sets.
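
The geometric idea in the abstract (a tolerance circle seen from the epipole spans an angular interval) can be sketched as follows. This is a minimal illustration under stated assumptions: epipolar lines are treated as undirected lines through a finite epipole, all function names are hypothetical, and the query is a linear scan rather than the segment tree the paper uses.

```python
import math

def angular_interval(keypoint, epipole, radius):
    """Return (center, half_width) of the angles, seen from the epipole,
    at which a line through the epipole crosses the tolerance circle."""
    dx, dy = keypoint[0] - epipole[0], keypoint[1] - epipole[1]
    dist = math.hypot(dx, dy)
    if dist <= radius:
        return None  # epipole lies inside the circle: every epipolar line hits it
    center = math.atan2(dy, dx) % math.pi  # lines are undirected, so fold into [0, pi)
    half_width = math.asin(radius / dist)
    return center, half_width

def line_hits(interval, line_angle):
    """True if the epipolar line at line_angle passes within the tolerance circle."""
    if interval is None:
        return True
    center, half_width = interval
    diff = (line_angle - center) % math.pi
    return min(diff, math.pi - diff) <= half_width

def candidates(keypoints, epipole, radius, line_angle):
    # Linear scan for clarity; the paper answers this query with a segment tree.
    return [i for i, kp in enumerate(keypoints)
            if line_hits(angular_interval(kp, epipole, radius), line_angle)]

keypoints = [(100.0, 0.0), (0.0, 100.0), (-100.0, 0.5)]
hits = candidates(keypoints, (0.0, 0.0), 1.0, 0.0)
```

Because the half-width is `asin(radius / dist)`, the interval is exact for a circular tolerance: a line is accepted precisely when its perpendicular distance to the keypoint is at most `radius`.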

[83] Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning

Yonghan Lee,Dinesh Manocha

Main category: cs.CV

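TL;DR: Inst4DGS is an instance-decomposed 4D Gaussian Splatting method that aligns instance labels across videos with a differentiable Sinkhorn layer and introduces motion scaffolds to make long-horizon trajectory optimization more efficient, reaching state-of-the-art rendering and instance segmentation performance.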

Details Motivation: Dynamic 4D Gaussian Splatting (4DGS) has advanced rapidly, but instance-decomposed variants remain underexplored, mainly because inconsistent instance labels across multi-view videos are hard to associate. Method: 1) Per-video label-permutation latents, combined with a differentiable Sinkhorn layer, learn cross-video instance matches and enable direct multi-view supervision with consistent identities; 2) instance-decomposed motion scaffolds provide low-dimensional motion bases per object to support long-horizon Gaussian trajectory optimization. Result: On Panoptic Studio and Neural3DV, Inst4DGS supports tracking and instance decomposition jointly; on Panoptic Studio, PSNR improves from 26.10 to 28.36 and instance mIoU from 0.6310 to 0.9129 over the strongest baseline. Conclusion: Through explicit label alignment and motion scaffold modeling, Inst4DGS addresses identity drift and optimization efficiency in instance-decomposed 4DGS, unifying high-quality rendering with accurate instance segmentation. Abstract: We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
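
For readers unfamiliar with Sinkhorn layers, the sketch below shows the standard Sinkhorn normalization that turns a square score matrix into an approximately doubly stochastic "soft permutation". It is written in plain Python for illustration only; the score values, the `temperature` parameter, and the function name are assumptions, and the paper's actual layer (which would run inside an autodiff framework) may differ.

```python
import math

def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Alternate row and column normalization of exp(scores / temperature)."""
    n = len(scores)
    k = [[math.exp(s / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        k = [[v / sum(row) for v in row] for row in k]
        # Normalize each column to sum to 1.
        col_sums = [sum(k[i][j] for i in range(n)) for j in range(n)]
        k = [[k[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return k

# Hypothetical affinities between instance labels of video A (rows) and video B (columns).
scores = [[0.9, 0.1, 0.0],
          [0.2, 0.1, 0.8],
          [0.1, 0.7, 0.3]]
soft_perm = sinkhorn(scores)
# Hard matching read off the soft permutation: label i in A maps to matching[i] in B.
matching = [max(range(3), key=row.__getitem__) for row in soft_perm]
```

Every step is a smooth function of the scores, which is what allows gradients to flow back to the label-permutation latents. A lower temperature gives a sharper, more permutation-like result but makes the iteration converge more slowly.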

[84] Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning?

Yang Liu,Jiyao Yang,Hongjin Zhao,Xiaoyong Li,Yanzhe Ji,Xingjian Li,Runmin Jiang,Tianyang Wang,Saeed Anwar,Dongwoo Kim,Yue Yao,Zhenyue Qin,Min Xu

Main category: cs.CV

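TL;DR: This paper builds DermCase, the first long-context multimodal benchmark for diagnostic reasoning on rare dermatological diseases, and proposes DermLIP-based similarity metrics. A systematic evaluation of 22 leading large vision-language models reveals significant deficiencies in diagnostic accuracy, differential diagnosis, and clinical reasoning; instruction tuning improves performance substantially, whereas Direct Preference Optimization yields limited gains.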

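Details Motivation: Existing dermatology benchmarks focus on common diseases and measure only final diagnostic accuracy, neglecting the clinical reasoning process that rare-disease diagnosis requires; a new benchmark reflecting real clinical complexity is needed. Method: DermCase, a long-context multimodal benchmark built from peer-reviewed case reports (26,030 image-text pairs and 6,354 challenging cases), is annotated with comprehensive clinical information and step-by-step reasoning chains; DermLIP-based similarity metrics are proposed to align better with dermatologists' judgments; 22 LVLMs are benchmarked, followed by instruction-tuning and DPO fine-tuning experiments and a systematic error analysis. Result: All 22 leading LVLMs fall well short in diagnostic accuracy, differential diagnosis quality, and clinical reasoning; instruction tuning improves performance substantially while DPO yields minimal gains; error analysis reveals fundamental weaknesses in key reasoning steps such as feature association and hypothesis testing. Conclusion: Current LVLMs cannot yet reliably support diagnostic reasoning for complex rare skin diseases; future work should place more emphasis on modeling clinical reasoning and building high-quality long-context multimodal data, and evaluation should go beyond accuracy alone to include process-level metrics. Abstract: Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models' reasoning capabilities.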

[85] SynQ: Accurate Zero-shot Quantization by Synthesis-aware Fine-tuning

Minjun Kim,Jongjin Kim,U Kang

Main category: cs.CV

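TL;DR: This paper proposes SynQ, a framework that reduces noise in synthetic data with a low-pass filter, improves accuracy by aligning class activation maps, and uses only soft labels for difficult samples to avoid misguidance, achieving state-of-the-art performance in zero-shot quantization (ZSQ).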

Details Motivation: To address three challenges that arise in zero-shot quantization (ZSQ) because training data are inaccessible: noise in synthetic data, predictions based on off-target patterns, and misguidance by erroneous hard labels. Method: SynQ 1) suppresses noise in synthetic samples with a low-pass filter; 2) aligns the class activation map of the quantized model with that of the pre-trained model; 3) uses only soft labels for difficult samples to avoid being misled by the pre-trained model's errors. Result: Experiments on multiple benchmarks show that SynQ significantly outperforms existing ZSQ methods and reaches state-of-the-art accuracy. Conclusion: SynQ overcomes key bottlenecks in zero-shot quantization and offers an efficient, robust approach to model compression in privacy-sensitive settings. Abstract: How can we accurately quantize a pre-trained model without any data? Quantization algorithms are widely used for deploying neural networks on resource-constrained edge devices. Zero-shot Quantization (ZSQ) addresses the crucial and practical scenario where training data are inaccessible for privacy or security reasons. However, three significant challenges hinder the performance of existing ZSQ methods: 1) noise in the synthetic dataset, 2) predictions based on off-target patterns, and 3) misguidance by erroneous hard labels. In this paper, we propose SynQ (Synthesis-aware Fine-tuning for Zero-shot Quantization), a carefully designed ZSQ framework to overcome the limitations of existing methods. SynQ minimizes the noise from the generated samples by exploiting a low-pass filter. Then, SynQ trains the quantized model to improve accuracy by aligning its class activation map with the pre-trained model. Furthermore, SynQ mitigates misguidance from the pre-trained model's error by leveraging only soft labels for difficult samples. Extensive experiments show that SynQ achieves state-of-the-art accuracy over existing ZSQ methods.

[86] R&D: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

Huy Che,Dinh-Duy Phan,Duc-Khai Lam

Main category: cs.CV

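TL;DR: This paper proposes a synthetic data augmentation method for semantic segmentation based on controllable diffusion models. Class-aware prompting and visual prior blending improve image quality and label alignment, and results on benchmarks such as PASCAL VOC and BDD100K confirm its effectiveness and improved robustness in data-scarce scenarios.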

Details Motivation: Collecting and annotating data for pixel-level semantic segmentation is costly; traditional augmentation cannot generate new structures, and existing generative models struggle to keep generated images consistent with the originals at the pixel level. Method: A synthetic data augmentation pipeline that integrates controllable diffusion models and combines class-aware prompting with visual prior blending to improve generated image quality and precise alignment with segmentation labels. Result: Semantic segmentation performance improves significantly on benchmarks such as PASCAL VOC and BDD100K, especially in data-scarce scenarios, along with better robustness in real-world settings. Conclusion: The method bridges the gap between synthetic and real data, improving reliability while preserving diversity, and provides an efficient, controllable augmentation approach for semantic segmentation. Abstract: Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances the diversity and reliability of the generated data, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to further improve image quality, ensuring precise alignment with segmentation labels. By evaluating on benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance.

[87] AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Yibo Shi,Jungang Li,Linghao Zhang,Zihao Dongfang,Biao Wu,Sicheng Tao,Yibo Yan,Chenxi Qin,Weiting Liu,Zhixin Lin,Hanqian Li,Yu Huang,Song Dai,Yonghua Hei,Yue Ding,Xiang Li,Shikang Wang,Chengdong Xu,Jingqi Liu,Xueying Ma,Zhiwen Zheng,Xiaofei Zhang,Bincheng Wang,Nichen Yang,Jie Wu,Lihua Tian,Chen Li,Xuming Hu

Main category: cs.CV

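TL;DR: This paper presents AndroTMem, a framework for diagnosing and improving interaction memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, enforces strong step-to-step causal dependencies and shows that performance drops are driven mainly by within-task memory failures. Based on this diagnosis, the paper proposes Anchored State Memory (ASM), which uses causally linked key intermediate states for efficient retrieval and attribution-aware decision making, and significantly improves task completion rate (TCR) and action matching score (AMS) across multiple GUI agents.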

Details Motivation: Existing long-horizon GUI agents handle interaction memory poorly: full replay is redundant and introduces noise, while summaries lose dependency-critical information and traceability; a memory mechanism that preserves causally critical states and supports long-range dependency modeling is needed. Method: The AndroTMem diagnostic framework and its benchmark AndroTMem-Bench (1,069 tasks, 34,473 interaction steps) focus on within-task causal dependencies; guided by the diagnosis, Anchored State Memory (ASM) compresses interaction sequences into a set of causally linked intermediate-state anchors that supports subgoal-targeted retrieval and attribution-aware decision making. Result: Across 12 GUI agents, ASM improves TCR by 5%-30.16% and AMS by 4.93%-24.66% over full-sequence replay and summary baselines, confirming that anchored, structured memory mitigates the interaction-memory bottleneck in long-horizon GUI tasks. Conclusion: Interaction memory is a core bottleneck for long-horizon GUI agents, and structured memory built around causal anchors (ASM) is more effective than conventional replay or summarization; AndroTMem offers a standardized tool and a new approach for evaluating and improving GUI agent memory. Abstract: Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making.
Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at https://github.com/CVC2233/AndroTMem.

[88] SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

Leyuan Fang,Zan Mao,Zijing Wang,Yinlong Yan

Main category: cs.CV

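TL;DR: This paper proposes SR-Nav, a framework that uses a Dynamic Spatial Relationship Graph (DSRG) to model structured prior relationships between objects and regions, improving the robustness and efficiency of zero-shot object-goal navigation in both perception and planning and reaching state-of-the-art performance on HM3D.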

Details Motivation: Existing foundation-model-based zero-shot object-goal navigation methods tend to fail under poor viewpoints or weak semantic cues, whereas the inherent spatial relationships among objects and regions can serve as structured scene priors that help infer target locations under partial observation. Method: Spatial Relation-aware Navigation (SR-Nav) 1) builds a Dynamic Spatial Relationship Graph (DSRG) that fuses foundation-model priors with real-time observations; 2) introduces a Relation-aware Matching Module that replaces naive detection with relationship matching to correct perception errors; 3) introduces a Dynamic Relationship Planning Module that dynamically computes optimal paths from the DSRG to shrink the search space. Result: On HM3D, SR-Nav achieves state-of-the-art success rate and navigation efficiency. Conclusion: Explicitly modeling and exploiting spatial relationship priors significantly improves the robustness and efficiency of zero-shot navigation in complex, partially observed scenes and suggests a way to combine structured knowledge with foundation models. Abstract: Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models' comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy.
Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at https://github.com/Mzyw-1314/SR-Nav

Arushi Rai,Adriana Kovashka

Main category: cs.CV

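TL;DR: This paper proposes a self-consistency objective that requires no additional frame-level annotation. By constraining related tasks (such as generation and verification) to attend to the same key frames, it improves the temporal grounding of video large language models on sports coaching tasks and outperforms supervised fine-tuning and closed-source models on several benchmarks.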

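Details Motivation: Video large language models (Video-LLMs) often attend to irrelevant frames, which is especially harmful in sports coaching tasks that require precise temporal grounding, and frame-level supervision is expensive to collect and unreliable. Method: Building on the observation that related tasks (generation, verification) should attend to the same frames, a self-consistency objective constrains their visual attention maps to agree; the attention misallocation problem is first validated on VidDiffBench, and the objective is then used in training. Result: On three sports coaching tasks (Exact, FitnessQA, and ExpertAF), the method gains +3.0% and +14.1% accuracy and +0.9 BERTScore over supervised fine-tuning and surpasses closed-source models. Conclusion: Self-consistent attention constraints that need no extra frame-level annotation significantly improve the temporal grounding of Video-LLMs and are particularly suited to specialized domains where annotations are scarce. Abstract: Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.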

[90] Interpretable Prostate Cancer Detection using a Small Cohort of MRI Images

Vahid Monfared,Mohammad Hadi Gharib,Ali Sabri,Maryam Shahali,Farid Rashidi,Amit Mehta,Reza Rawassizadeh

Main category: cs.CV

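TL;DR: This paper proposes an interpretable framework for automatic prostate cancer detection from a small set of T2-weighted MRI images, using transfer learning and data augmentation to mitigate data scarcity. Comparing ViT, Swin, ResNet18, and classical methods (such as HOG+SVM) on 162 images, it finds that a lightweight ResNet18 performs best (90.9% accuracy, 95.2% sensitivity, AUC 0.905), while HOG+SVM is also strong (AUC 0.917). Using T2 images alone, the approach is competitive with biparametric MRI methods and shows markedly higher sensitivity than radiologists in a reader study.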

Details Motivation: Prostate cancer is a leading cause of cancer death in men, but lesions in T2-weighted MRI are subtle and heterogeneous, making manual reading difficult; meanwhile, existing AI methods mostly depend on large cohorts and biparametric MRI (T2+DWI), so clinical adoption is limited by data acquisition difficulty and computational cost. Method: With transfer learning and data augmentation, Vision Transformers (ViT, Swin), a CNN (ResNet18), and classical machine learning methods (Logistic Regression, SVM, HOG+SVM) are systematically evaluated on a small dataset of only 162 T2-weighted images (102 cancer, 60 normal); interpretability analysis is included to support clinical trust. Result: Transfer-learned ResNet18 performs best (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters; HOG+SVM reaches AUC 0.917, showing that handcrafted features remain competitive on small samples; the AI model's sensitivity (95.2%) is well above the mean of five radiologists (67.5%, Fleiss Kappa = 0.524). Conclusion: On small T2-weighted MRI datasets, a lightweight CNN (such as ResNet18) or classical handcrafted-feature methods can detect prostate cancer accurately without biparametric scans or massive data; the framework balances performance, interpretability, and clinical practicality, and could support AI-assisted screening to reduce missed cancers and improve diagnostic consistency. Abstract: Prostate cancer is a leading cause of mortality in men, yet interpretation of T2-weighted prostate MRI remains challenging due to subtle and heterogeneous lesions. We developed an interpretable framework for automatic cancer detection using a small dataset of 162 T2-weighted images (102 cancer, 60 normal), addressing data scarcity through transfer learning and augmentation. We performed a comprehensive comparison of Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Transfer-learned ResNet18 achieved the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters, while Vision Transformers showed lower performance despite substantially higher complexity. Notably, HOG+SVM achieved comparable accuracy (AUC 0.917), highlighting the effectiveness of handcrafted features in small datasets. Unlike state-of-the-art approaches relying on biparametric MRI (T2+DWI) and large cohorts, our method achieves competitive performance using only T2-weighted images, reducing acquisition complexity and computational cost. In a reader study of 22 cases, five radiologists achieved a mean sensitivity of 67.5% (Fleiss Kappa = 0.524), compared to 95.2% for the AI model, suggesting potential for AI-assisted screening to reduce missed cancers and improve consistency. Code and data are publicly available.

[91] Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

Kazuya Nishimura,Ryoma Bise,Shinnosuke Matsuo,Haruka Hirose,Yasuhiro Kojima

Main category: cs.CV

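TL;DR: This paper proposes CPNN, a cell-type prototype-informed neural network that estimates cell-type prototypes from single-cell RNA sequencing data and learns cell-type compositional weights from pathology images, predicting gene expression profiles more accurately and interpretably.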

Details Motivation: Existing methods treat gene expression as a purely slide- or spot-level signal, ignoring that it arises from the aggregation of cell-level expression, and thus lack cell-resolved biological guidance. Method: The Cell-type Prototype-informed Neural Network (CPNN) first estimates stable, robust cell-type prototypes (mean expression profiles) from public single-cell RNA-seq data, then learns cell-type compositional weights directly from pathology images and models the relationship between the prototypes and the observed bulk or spatial expression. Result: On three slide-level datasets and three patch-level spatial transcriptomics datasets, CPNN achieves the highest Spearman correlation; visualizing the inferred compositional weights provides biological interpretability. Conclusion: By using cell-type prototypes as a structured prior, CPNN yields more accurate, biologically meaningful, and interpretable gene expression prediction, offering a new approach for digital pathology and multi-omics integration. Abstract: Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes, i.e., mean expression profiles that reflect stable gene-gene co-variation patterns. CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression. Code is publicly available at https://github.com/naivete5656/CPNN.
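
One simple way to picture "compositional weights over cell-type prototypes" is a convex mixture of prototype expression profiles. The sketch below assumes exactly that: softmax-normalized weights and a purely linear combination. The abstract does not specify how CPNN relates prototypes to the observed expression, so the linear form, the function names, and the toy numbers are all assumptions.

```python
import math

def softmax(logits):
    """Turn arbitrary image-derived logits into non-negative weights summing to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_prototypes(weights, prototypes):
    """Predicted expression per gene as a weighted sum of cell-type prototypes."""
    n_genes = len(prototypes[0])
    return [sum(w * proto[g] for w, proto in zip(weights, prototypes))
            for g in range(n_genes)]

# Two hypothetical cell types, two genes: each row is a mean expression profile.
prototypes = [[2.0, 0.0],
              [0.0, 4.0]]
weights = softmax([0.0, 0.0])          # equal composition for this toy patch
predicted = mix_prototypes(weights, prototypes)
```

In a formulation like this, the weights double as an interpretable readout: inspecting them shows which cell types account for the predicted expression of a patch.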

[92] MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling

Jiyao Liu,Junzhi Ning,Wanying Qu,Lihao Liu,Chenglong Ma,Junjun He,Ningsheng Xu

Main category: cs.CV

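TL;DR: This paper proposes MedQ-UNI, a model that follows an assess-then-restore paradigm and combines medical image quality assessment (Med-IQA) with medical image restoration (Med-IR) to achieve unified restoration across modalities and degradation types.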

Details Motivation: Existing medical image restoration methods usually target a specific modality or degradation type and generalize poorly, mainly because restoration is isolated from quality assessment, which makes it hard for models to adapt to the diverse degradations seen in clinical practice. Method: MedQ-UNI is a vision-language multimodal autoregressive dual-expert architecture with shared attention, comprising a quality assessment expert (which produces structured natural language descriptions) and a restoration expert (which performs targeted restoration conditioned on those descriptions); a multimodal, multi-task paired dataset of about 50K samples and a 2K-sample benchmark are also constructed. Result: A single MedQ-UNI model, without task-specific adaptation, achieves state-of-the-art restoration on all five tasks across three modalities while generating better quality descriptions, indicating that explicit quality understanding improves restoration fidelity and interpretability. Conclusion: Explicitly integrating medical image quality assessment into the restoration pipeline is an effective route to general, robust, and interpretable medical image restoration; MedQ-UNI provides a new paradigm for unified modeling of Med-IQA and Med-IR. Abstract: Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to diverse degradation types across modalities. To address these challenges, we propose MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, explicitly leveraging Med-IQA to guide Med-IR across arbitrary modalities and degradation types. MedQ-UNI adopts a multimodal autoregressive dual-expert architecture with shared attention: a quality assessment expert first identifies degradation issues through structured natural language descriptions, and a restoration expert then conditions on these descriptions to perform targeted image restoration. To support this paradigm, we construct a large-scale dataset of approximately 50K paired samples spanning three imaging modalities and five restoration tasks, each annotated with structured quality descriptions for joint Med-IQA and Med-IR training, along with a 2K-sample benchmark for evaluation. Extensive experiments demonstrate that a single MedQ-UNI model, without any task-specific adaptation, achieves state-of-the-art restoration performance across all tasks while generating superior descriptions, confirming that explicit quality understanding meaningfully improves restoration fidelity and interpretability.

[93] Recolour What Matters: Region-Aware Colour Editing via Token-Level Diffusion

Yuqi Yang,Dongliang Chang,Yijia Ling,Ruoyi Du,Zhanyu Ma

Main category: cs.CV

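TL;DR: This paper proposes ColourCrafter, a diffusion-based framework for fine-grained, region-aware colour editing. Token-level fusion of RGB colour tokens and image tokens in latent space, together with a perceptual Lab-space loss, significantly improves the accuracy and controllability of colour editing.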

Details Motivation: Existing text-driven colour editing methods cannot accurately express continuous chromatic variation, so edits deviate from the target hue, especially in fine-grained local edits. Method: ColourCrafter 1) performs token-level fusion of RGB colour tokens and image tokens in latent space; 2) introduces a perceptual Lab-space loss that decouples luminance and chrominance and constrains edits to masked regions; 3) builds ColourfulSet, a large-scale, high-quality dataset with continuous colour variations. Result: State-of-the-art performance on fine-grained colour editing, with significantly better colour accuracy, controllability, and perceptual fidelity. Conclusion: ColourCrafter turns global tone transfer into a structured, region-aware generation process, addressing the limitations of earlier methods in continuous chromatic control and local editing precision. Abstract: Colour is one of the most perceptually salient yet least controllable attributes in image generation. Although recent diffusion models can modify object colours from user instructions, their results often deviate from the intended hue, especially for fine-grained and local edits. Early text-driven methods rely on discrete language descriptions that cannot accurately represent continuous chromatic variations. To overcome this limitation, we propose ColourCrafter, a unified diffusion framework that transforms colour editing from global tone transfer into a structured, region-aware generation process. Unlike traditional colour-driven methods, ColourCrafter performs token-level fusion of RGB colour tokens and image tokens in latent space, selectively propagating colour information to semantically relevant regions while preserving structural fidelity. A perceptual Lab-space loss further enhances pixel-level precision by decoupling luminance and chrominance and constraining edits within masked areas. Additionally, we build ColourfulSet, a large-scale dataset of high-quality image pairs with continuous and diverse colour variations. Extensive experiments demonstrate that ColourCrafter achieves state-of-the-art colour accuracy, controllability and perceptual fidelity in fine-grained colour editing. Our project is available at https://yangyuqi317.github.io/ColourCrafter.github.io/.

[94] Do Vision Language Models Understand Human Engagement in Games?

Ziyi Wang,Qizan Guo,Rishitosh Singh,Xiyang Hu

Main category: cs.CV

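TL;DR: This paper evaluates the ability of vision-language models (VLMs) to infer player engagement from gameplay video. Zero-shot prediction is weak, theory-guided prompting helps little, and memory- or retrieval-augmented prompting improves pointwise prediction but still struggles with change prediction, revealing a "perception-understanding" gap in current VLMs.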

Details Motivation: To examine whether vision-language models can reliably infer latent psychological states of human players (such as engagement) from visual cues alone, in support of game design and user-experience research. Method: On the GameVibe Few-Shot dataset covering nine first-person shooter games, three VLMs are evaluated under six prompting strategies (zero-shot, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting) on pointwise engagement prediction and pairwise prediction of engagement change between consecutive time windows. Result: Zero-shot VLM predictions are generally weaker than per-game majority-class baselines; memory- or retrieval-augmented prompting improves pointwise prediction in some settings, but pairwise prediction remains difficult throughout; theory-guided prompting does not reliably help and may reinforce surface-level shortcuts. Conclusion: Current VLMs can recognize visible gameplay cues but remain clearly limited in robustly inferring human engagement across games, reflecting a gap between perception and understanding. Abstract: Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision-language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception-understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.

[95] T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

Aditi Naiknaware,Salimeh Sekeh

Main category: cs.CV

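TL;DR: This paper proposes T-QPM, a multimodal OOD detection framework for dynamic environments. It introduces cross-modal consistency patterns and a temporally adaptive fusion mechanism, combined with Average Thresholded Confidence regularization, and significantly improves robustness under temporal drift and covariate shift.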

Details Motivation: Existing OOD detection methods based on vision-language models such as CLIP rely on fixed fusion rules and assume static environments, so they cope poorly with temporal drift and covariate shift. Method: The two-step Temporal Quadruple-Pattern Matching (T-QPM) framework first builds cross-modal consistency patterns between ID/OOD images and text descriptions, then learns lightweight temporal fusion weights that combine semantic matching and visual typicality, with explicit Average Thresholded Confidence (ATC) regularization for stability. Result: On multiple temporally partitioned benchmarks, it significantly outperforms static baselines and shows stronger temporal consistency and robustness to distribution drift. Conclusion: T-QPM provides a robust, adaptive solution for multimodal OOD detection in non-stationary open-world settings and mitigates the challenges of temporal drift and covariate shift. Abstract: Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally-consistent framework for multimodal OOD detection in non-stationary environments.

[96] TexEditor: Structure-Preserving Text-Driven Texture Editing

Bo Zhao,Yihang Liu,Chenfeng Zhang,Huan Yang,Kun Gai,Wei Ji

Main category: cs.CV

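TL;DR: This paper proposes TexEditor, a model dedicated to text-guided texture editing. By building TexBlender, a high-quality SFT dataset, and StructureNFT, a reinforcement-learning-based structure-preserving method, it significantly improves geometric consistency during texture editing, and it introduces a new benchmark, TexBench, for more comprehensive evaluation in real-world scenes.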

Details Motivation: Existing SOTA text-guided texture editing models struggle to maintain structural consistency during editing, even though the intended changes affect appearance only. Method: TexEditor 1) builds TexBlender, a high-quality Blender-based SFT dataset that provides strong structural priors; 2) designs StructureNFT, an RL-based method that transfers the structural priors learned during SFT to real-world scenes; 3) establishes TexBench, a new benchmark for real-world texture editing evaluation. Result: TexEditor consistently outperforms strong baselines such as Nano Banana Pro on Blender-based benchmarks and on TexBench, and also generalizes well on the general image editing benchmark ImgEdit. Conclusion: Jointly strengthening structure preservation from both the data and training perspectives is an effective way to improve text-guided texture editing, and TexEditor offers a new paradigm and practical tool for the task. Abstract: Text-guided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct TexBlender, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Secondly, we introduce StructureNFT, an RL-based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general-purpose benchmark ImgEdit to validate its generalization. Our code and data are available at https://github.com/KlingAIResearch/TexEditor.

[97] FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction

Seonghyun Jin,Jong Chul Ye

Main category: cs.CV

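TL;DR: This paper proposes FILT3R, a training-free latent filtering layer that models state updates in streaming 3D reconstruction as stochastic state estimation in token space. By estimating process noise online and adaptively computing a Kalman gain, it balances memory retention against new observations and significantly improves long-horizon stability.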

Details Motivation: In streaming 3D reconstruction, conventional state update rules face a dilemma: aggressive overwrites forget useful history, while conservative updates cannot respond to new evidence in time, and both become unstable beyond the training horizon. Method: FILT3R treats recurrent state updates as a stochastic state estimation problem in token space; it maintains a per-token variance and computes a Kalman-style gain that adaptively weighs memory against new observations; process noise is estimated online from EMA-normalized temporal drift. Result: FILT3R provides an interpretable, plug-in update rule that generalizes common overwrite and gating policies; the gain shrinks in stable regimes (as uncertainty decreases) and rises when the scene genuinely changes (as process uncertainty increases); long-horizon stability improves significantly for depth, pose, and 3D reconstruction. Conclusion: FILT3R is a general, training-free, theoretically grounded state update mechanism for streaming reconstruction that mitigates state degradation in long-horizon inference by introducing uncertainty-aware filtering. Abstract: Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant-memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per-token variance and computes a Kalman-style gain that adaptively balances memory retention against new observations. Process noise, which governs how much the latent state is expected to change between frames, is estimated online from EMA-normalized temporal drift of candidate tokens. Through extensive experiments, we demonstrate that FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction compared with existing methods. Code will be released at https://github.com/jinotter3/FILT3R.
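
The gain behaviour described in the abstract matches that of a textbook scalar Kalman filter, sketched below for a single token value. This is the standard filter, not FILT3R's implementation: the function name, the fixed observation noise, and the hand-picked process-noise values are assumptions (the paper estimates process noise online from EMA-normalized drift).

```python
def kalman_update(state, variance, observation, process_noise, obs_noise):
    """One scalar Kalman step: predict (inflate variance), then correct."""
    prior_var = variance + process_noise          # uncertainty grows between frames
    gain = prior_var / (prior_var + obs_noise)    # 0 = keep memory, 1 = overwrite
    new_state = state + gain * (observation - state)
    new_var = (1.0 - gain) * prior_var            # uncertainty shrinks after the update
    return new_state, new_var, gain

# Stable scene: no process noise, so the gain shrinks as evidence accumulates.
s, v, g1 = kalman_update(0.0, 1.0, 1.0, process_noise=0.0, obs_noise=1.0)
_, _, g2 = kalman_update(s, v, 1.0, process_noise=0.0, obs_noise=1.0)

# Genuine scene change: large process noise pushes the gain back up.
_, _, g3 = kalman_update(s, v, 1.0, process_noise=10.0, obs_noise=1.0)
```

The two extremes recover familiar policies: a gain of 1 is a full overwrite of the state, and a gain of 0 freezes the state, which is one way to read the claim that overwrite and gating rules are special cases.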

[98] NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data

Daniel DeTone,Federica Bogo,Eric-Tuan Le,Duncan Frost,Julian Straub,Yawar Siddiqui,Yuting Ye,Jakob Engel,Richard Newcombe,Lingni Ma

Main category: cs.CV

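TL;DR: This paper introduces NymeriaPlus, an upgraded version of the Nymeria dataset. It improves the human motion representation, adds dense 3D/2D annotations and instance-level 3D object reconstructions, and includes new modalities (such as audio and wristband videos), yielding a stronger in-the-wild egocentric benchmark for multimodal learning in embodied AI.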

Details Motivation: Existing egocentric datasets fall short in multimodal coverage, fine-grained annotation, and real-world scene modeling, which makes it hard to support in-depth research on complex human-object-environment interaction for embodied AI. Method: Building on the original Nymeria dataset, the authors upgrade the human motion representation (MHR/SMPL formats), add dense 3D/2D bounding box annotations for indoor objects and structural elements, generate instance-level 3D object reconstructions, and add new modalities such as basemap recordings, audio, and wristband videos, forming a unified, coherent NymeriaPlus benchmark. Result: NymeriaPlus is released with improved human motion accuracy, finer-grained scene understanding, and richer multimodal information, making it one of the most comprehensive in-the-wild egocentric multimodal benchmarks available. Conclusion: NymeriaPlus fills a gap in high-quality, multimodal, finely annotated in-the-wild egocentric datasets and is expected to advance embodied AI, multimodal learning, and human-computer interaction. Abstract: The Nymeria Dataset, released in 2024, is a large-scale collection of in-the-wild human activities captured with multiple egocentric wearable devices that are spatially localized and temporally synchronized. It provides body-motion ground truth recorded with a motion-capture suit, device trajectories, semi-dense 3D point clouds, and in-context narrations. In this paper, we upgrade Nymeria and introduce NymeriaPlus. NymeriaPlus features: (1) improved human motion in Momentum Human Rig (MHR) and SMPL formats; (2) dense 3D and 2D bounding box annotations for indoor objects and structural elements; (3) instance-level 3D object reconstructions; and (4) additional modalities, e.g., basemap recordings, audio, and wristband videos. By consolidating these complementary modalities and annotations into a single, coherent benchmark, NymeriaPlus strengthens Nymeria into a more powerful in-the-wild egocentric dataset. We expect NymeriaPlus to bridge a key gap in existing egocentric resources and to support a broader range of research, including unique explorations of multimodal learning for embodied AI.

[99] Efficient Video Diffusion with Sparse Information Transmission for Video Compression

Mingde Zhou,Zheng Chen,Yulun Zhang

Main category: cs.CV

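TL;DR: This paper proposes Diff-SIT, an efficient video diffusion compression method that combines a Sparse Temporal Encoding Module (STEM) with One-Step Video Diffusion with Frame Type Embedder (ODFTE) to improve perceptual quality and temporal consistency at ultra-low bitrates.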

Details Motivation: At ultra-low bitrates, traditional end-to-end video compression models produce blurry reconstructions with poor perceptual quality; existing generative compression methods often process frames independently and lack temporal coherence and efficiency. Method: Proposes the Diff-SIT framework, which consists of the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE); STEM produces a sparse, information-rich encoding, and ODFTE processes the intermediate sequence as a whole, exploiting temporal correlation and using the Frame Type Embedder (FTE) for adaptive reconstruction. Result: Experiments on multiple datasets show that Diff-SIT outperforms existing methods at ultra-low bitrates and sets a new SOTA in perceptual quality and temporal consistency. Conclusion: Through sparse information transmission and frame-type-guided diffusion reconstruction, Diff-SIT addresses poor perceptual quality and temporal inconsistency in ultra-low-bitrate video compression while balancing efficiency and performance. Abstract: Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in temporal coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.

[100] HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection

Teerapong Panboonyuen

Main category: cs.CV

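TL;DR: HOMEY is a property risk detection framework that combines YOLO with a domain-specific masking mechanism and a custom loss function. It detects 17 risk-related property classes and improves detection accuracy and reliability over baseline YOLO models.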

Details Motivation: Automated property risk detection is a high-impact but underexplored area of computer vision, with direct implications for real estate, underwriting, and insurance operations. Method: Proposes the HOMEY framework, which combines YOLO with heuristic object masking, to amplify weak signals in cluttered backgrounds, and risk-aware loss calibration, to balance class skew and severity weighting. Result: Experiments on real-world property imagery show that HOMEY achieves better detection accuracy and reliability than baseline YOLO models while retaining fast inference, and supports interpretable, cost-efficient risk analysis. Conclusion: HOMEY lays the foundation for scalable AI-driven property insurance workflows and advances the practical use of computer vision in insurance. Abstract: Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.

[101] From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions

Jingzhi Chen,Lijian Xu

Main category: cs.CV

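TL;DR: This review surveys the paradigm shift that AI has brought to protein science along five dimensions (multimodal representations, refinement of static structure prediction, conformation generation, heterogeneous interaction prediction, and functional inference) and identifies current bottlenecks and future directions.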

Details Motivation: AI has fundamentally transformed the protein folding problem, and the field now needs to move from static structure prediction toward modeling dynamic conformational ensembles and complex biomolecular interactions. Method: A systematic review organized around five interconnected dimensions: unified multimodal representations, refinement of static prediction through MSA-free architectures, generative frameworks based on diffusion models and flow matching, heterogeneous interaction prediction, and functional inference. Result: Identifies key advances and remaining bottlenecks in AI-driven protein science (such as data bias, limited interpretability, and the disconnect between geometric metrics and physical reality), and proposes future directions including physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. Conclusion: AI is evolving from a structural analysis tool into a universal simulator capable of understanding and rewriting the dynamic language of life. Abstract: The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI-driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA-free architectures and all-atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein-ligand, protein-nucleic acid, and protein-protein complexes; and functional inference of fitness landscapes, mutational effects, and text-guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. This methodological transformation marks artificial intelligence's transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.

[102] Foundations and Architectures of Artificial Intelligence for Motor Insurance

Teerapong Panboonyuen

Main category: cs.CV

TL;DR: This handbook introduces a vertically integrated AI paradigm for motor insurance, featuring domain-adapted transformer architectures for vehicle damage analysis, claims evaluation, and underwriting, all deployed in real-world Thai insurance systems with emphasis on MLOps and production reliability.

Details Motivation: To bridge the gap between cutting-edge AI research and high-stakes, real-world industrial deployment in motor insurance—particularly addressing challenges in vehicle damage analysis, claims processing, and underwriting under practical operational constraints. Method: Develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence; integrates them into a scalable, production-ready pipeline; co-designs learning algorithms with MLOps practices. Result: A cohesive, vertically integrated AI stack enabling end-to-end automation of key motor insurance workflows—vehicle damage analysis, claims evaluation, and underwriting—successfully deployed in nationwide Thai motor insurance systems. Conclusion: Reliable, production-grade AI in high-stakes domains like motor insurance requires not only novel model architectures but also tight integration of domain adaptation, system architecture, and MLOps—establishing a principled blueprint for industrial AI deployment. Abstract: This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. 
Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.

[103] OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Hongjia Zhai,Qi Zhang,Xiaokun Pan,Xiyu Zhang,Yitong Dong,Huaqi Zhang,Dan Xu,Guofeng Zhang

Main category: cs.CV

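TL;DR: This paper proposes OnlinePG, a system that combines 3D Gaussian Splatting with an online local-to-global mapping strategy to achieve open-vocabulary scene understanding and online panoptic mapping with both real-time efficiency and instance-level semantic consistency.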

Details Motivation: Existing methods are mostly offline or lack instance-level understanding, so they cannot meet the need of real-world robotic tasks for online, open-vocabulary, panoptic perception. Method: Adopts a local-to-global paradigm with a sliding window; builds a 3D segment clustering graph that fuses geometric and semantic cues to obtain local consistency; updates the global map through explicit grids with spatial attributes and robust bidirectional bipartite 3D Gaussian instance matching; uses the VLM features stored in the grids for open-vocabulary understanding. Result: Experiments on several widely used datasets show better performance than other online methods while maintaining real-time efficiency. Conclusion: OnlinePG applies 3D Gaussian Splatting to online panoptic mapping and open-vocabulary scene understanding, providing an efficient, consistent, and scalable perception-and-mapping framework for embodied applications. Abstract: Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a locally consistent map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.

[104] CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution

Elad Yoshai,Ariel D. Yoshai,Natan T. Shaked

Main category: cs.CV

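TL;DR: CAFlow is an adaptive-depth, single-step flow-matching super-resolution framework that cuts compute by routing each image tile to the shallowest network exit that preserves reconstruction quality. It performs flow matching in a pixel-unshuffled rearranged space, reducing spatial computation by 16x, and uses a lightweight exit classifier to save 33% of compute at a cost of only 0.12 dB PSNR. On multi-organ histopathology x4 SR, adaptive routing comes within 0.12 dB of the full-depth model, and even the shallowest exit beats bicubic interpolation. At x8, it outperforms comparable-compute baselines and is competitive with the much larger SwinIR-Medium. Downstream nuclei segmentation confirms that clinically relevant structure is preserved. Training takes under 5 hours on a single GPU, and whole-slide inference drops from minutes to seconds.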

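Details Motivation: Whole-slide images in digital pathology routinely reach gigapixel scale, so conventional generative super-resolution is too expensive for practical deployment; an efficient, low-latency, high-quality SR method is needed. Method: Proposes CAFlow, an adaptive-depth single-step flow-matching framework; performs flow matching in a pixel-unshuffled rearranged space to reduce spatial computation; designs the FlowResNet backbone (convolution and window self-attention blocks, four early exits, 3.1 to 13.3 GFLOPs); adds a lightweight exit classifier (~6K parameters) for dynamic routing; finds that dedicating half of training to exact t=0 samples is essential for single-step quality. Result: x4 SR: adaptive routing reaches 31.72 dB PSNR (vs. 31.84 dB at full depth); the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light; x8 SR outperforms comparable-compute baselines and is competitive with SwinIR-Medium; generalization to held-out colon tissue costs only -0.02 dB; downstream nuclei segmentation performance is preserved; training takes under 5 hours on a single GPU; whole-slide inference is reduced to seconds. Conclusion: By combining adaptive-depth routing with flow matching in a rearranged space, CAFlow substantially reduces compute while maintaining reconstruction quality and clinically relevant structure, offering a practical approach to fast super-resolution in digital pathology. Abstract: In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) impractical for routine deployment. We introduce CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. CAFlow performs flow matching in pixel-unshuffled rearranged space, reducing spatial computation by 16x while enabling direct inference. We show that dedicating half of training to exact t=0 samples is essential for single-step quality (-1.5 dB without it). The backbone, FlowResNet (1.90M parameters), mixes convolution and window self-attention blocks across four early exits spanning 3.1 to 13.3 GFLOPs. A lightweight exit classifier (~6K parameters) achieves 33% compute savings at only 0.12 dB cost. On multi-organ histopathology x4 SR, adaptive routing achieves 31.72 dB PSNR versus 31.84 dB at full depth, while the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. The method generalizes to held-out colon tissue with minimal quality loss (-0.02 dB), and at x8 upscaling it outperforms all comparable-compute baselines while remaining competitive with the much larger SwinIR-Medium model. Downstream nuclei segmentation confirms preservation of clinically relevant structure. 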
The model trains in under 5 hours on a single GPU, and adaptive routing can reduce whole-slide inference from minutes to seconds.
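
The 16x reduction in spatial computation mentioned above is what a pixel-unshuffle with factor 4 would give: each 4x4 cell of pixels is folded into 16 channels, so the spatial grid shrinks by 4 in each dimension. The sketch below shows that rearrangement on a single-channel image using plain Python lists. The factor of 4 is an assumption inferred from the 16x figure, and `pixel_unshuffle` is an illustrative helper, not CAFlow's implementation.

```python
def pixel_unshuffle(image, r):
    """Rearrange an H x W single-channel image into r*r channels of size (H/r) x (W/r).

    Channel index dy * r + dx holds the pixels at offset (dy, dx) inside each r x r cell.
    """
    h, w = len(image), len(image[0])
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    return [
        [[image[i * r + dy][j * r + dx] for j in range(w // r)] for i in range(h // r)]
        for dy in range(r)
        for dx in range(r)
    ]


# An 8 x 8 image becomes 16 channels of 2 x 2: 64 spatial positions shrink to 4.
img = [[row * 8 + col for col in range(8)] for row in range(8)]
channels = pixel_unshuffle(img, 4)
```

No pixel values are discarded by this step; the information moves from the spatial axes into the channel axis, which is why layers that scale with spatial size become cheaper.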

[105] Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Liwei Che,Zhiyu Xue,Yihao Quan,Benlin Liu,Zeru Shi,Michelle Hurst,Jacob Feldman,Ruixiang Tang,Ranjay Krishna,Vladimir Pavlovic

Main category: cs.CV

TL;DR: This paper investigates how Large Vision-Language Models (LVLMs) perform counting, revealing a human-like pattern and identifying a shared 'counting circuit' via new interpretability methods; it then proposes a lightweight fine-tuning strategy on synthetic counting data that improves both counting and broader visual reasoning performance.

Details Motivation: Counting is a fundamental test of LVLMs' reasoning ability, requiring object individuation and aggregation; understanding how LVLMs count can reveal core mechanisms underlying visual reasoning. Method: The authors use controlled synthetic and real-world counting benchmarks, combined with two novel interpretability techniques—Visual Activation Patching and HeadLens—to analyze model internals and identify a 'counting circuit'; they further propose a lightweight fine-tuning intervention using only synthetic counting images. Result: LVLMs exhibit human-like counting behavior (precise for small numbers, noisy for large ones); a shared 'counting circuit' is discovered across tasks; fine-tuning on synthetic counting data improves in-distribution counting, boosts out-of-distribution counting by +8.36%, and enhances general visual reasoning by +1.54% on Qwen2.5-VL. Conclusion: Counting plays a central, influential role in visual reasoning; targeted enhancement of counting mechanisms offers a promising pathway to improve overall LVLM reasoning capabilities. Abstract: Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. 
Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.

[106] 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Hyun-kyu Ko,Jihyeon Park,Younghyun Kim,Dongheok Park,Eunbyung Park

Main category: cs.CV

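TL;DR: This paper proposes the 3DreamBooth and 3Dapter framework for 3D-aware customized video generation without multi-view video training. Using 1-frame optimization and asymmetrical conditioning, it embeds a robust 3D prior into the model while improving texture detail and convergence speed.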

Details Motivation: Existing subject-driven video generation methods mostly rely on 2D representations and lack the comprehensive spatial priors needed to reconstruct true 3D geometry, so they struggle to keep the subject 3D-consistent under novel views; in addition, multi-view video data is scarce, and direct fine-tuning tends to cause temporal overfitting. Method: Proposes 3DreamBooth (which decouples spatial geometry from temporal motion through 1-frame optimization and bakes in a 3D prior) and 3Dapter (a visual conditioning module that, through asymmetrical joint optimization, queries view-specific geometric hints from a minimal reference set and acts as a dynamic selective router). Result: Achieves high-quality, view-consistent, 3D-aware video generation of customized subjects; with single-view pre-training and only a few multi-view references, it improves the realism and geometric consistency of novel-view synthesis and avoids temporal overfitting. Conclusion: The framework moves beyond the 2D-centric video generation paradigm, offers a new approach to 3D-aware video customization under limited data, and supports applications such as VR/AR, virtual production, and e-commerce. Abstract: Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. 
Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/

[107] Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

Yongwei Jiang,Yixiong Zou,Yuhua Li,Ruixuan Li

Main category: cs.CV

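TL;DR: This paper proposes a center-periphery attention refinement framework, inspired by the human fovea, to address the "target-domain astigmatism" problem in cross-domain few-shot object detection, improving adaptation and detection accuracy on target domains.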

Details Motivation: Cross-domain few-shot object detection suffers from domain shift and scarce annotations; the authors find that models show dispersed attention, imprecise localization, and redundant predictions on target domains, a phenomenon they liken to astigmatism, and that existing fine-tuning only partially remedies it. Method: Proposes a center-periphery attention refinement framework comprising: (1) a Positive Pattern Refinement module (simulating the visual center) that reshapes attention using class-specific prototypes; (2) a Negative Context Modulation module (simulating the visual periphery) that models background context to sharpen boundary discrimination; (3) a Textual Semantic Alignment module that strengthens the center-periphery distinction through cross-modal cues. Result: Achieves higher detection accuracy on six CD-FSOD benchmarks and establishes new SOTA results. Conclusion: A biologically inspired attention refinement mechanism can correct target-domain astigmatism and improve cross-domain few-shot adaptation, offering a new direction for CD-FSOD. Abstract: Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, just like a human cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis on attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

[108] CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

Xiang Chen,Fangfang Yang,Chunlei Meng,Chengyin Hu,Ang Li,Yiwei Wei,Jiahuan Long,Jiujiang Guo

Main category: cs.CV

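TL;DR: This paper proposes CoDA, a framework that builds clinically plausible chained distribution shifts from image-processing stages to evaluate the robustness of medical vision-language models (MVLMs) under real clinical workflows. Chained shifts degrade performance more than any single stage, and a teacher-guided token-space adaptation repair strategy improves robustness.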

Details Motivation: Robustness evaluations of medical vision-language models (MVLMs) mostly use clean images or isolated corruptions and overlook routine clinical operations (acquisition, reconstruction, display, and delivery) that preserve readability while shifting image statistics, so reliability in real workflows remains underexplored. Method: Proposes CoDA, a chain-of-distribution framework that, under masked structural-similarity constraints, jointly optimizes chained shifts composed of acquisition-like shading, reconstruction and display remapping, and delivery and export degradations; also introduces a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment. Result: CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs on brain MRI, chest X-ray, and abdominal CT, with chained compositions consistently more damaging than any single stage; both proprietary and medical-specific multimodal large language models (MLLMs) perform poorly when auditing image quality and authenticity; the repair strategy improves accuracy on CoDA-degraded samples. Conclusion: CoDA exposes a clinically grounded threat surface for MVLM deployment and shows that lightweight alignment can improve robustness, offering a new approach to evaluating and strengthening the reliability of medical AI systems. Abstract: Medical vision-language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. 
Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.

[109] HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Dan Ben-Ami,Gabriele Serussi,Kobi Cohen,Chaim Baskin

Main category: cs.CV

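TL;DR: HiMu is a training-free framework for long video question answering. A single text-only LLM call decomposes the query into a hierarchical logic tree, which is then evaluated with lightweight multimodal experts and fuzzy-logic operators to select frames efficiently and accurately.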

Details Motivation: Long video question answering requires reasoning over time, and existing frame selection methods face a sharp trade-off between efficiency and structure: similarity-based methods are fast but lose temporal ordering and cross-modal relations, while agent-based methods preserve structure at prohibitive computational cost. Method: HiMu uses a text-only LLM to decompose the question into a hierarchical logic tree whose leaves are atomic predicates; each predicate is routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP); the signals are normalized, temporally smoothed for alignment, and composed bottom-up through fuzzy-logic operators into a continuous satisfaction curve. Result: On Video-MME, LongVideoBench, and HERBench-Lite, HiMu with 16 frames and Qwen3-VL 8B outperforms all competing selectors; with GPT-4o, it surpasses agentic systems operating at 32-512 frames while using roughly one tenth of the FLOPs. Conclusion: HiMu advances the efficiency-accuracy Pareto front of frame selection for long video QA without sacrificing temporal and cross-modal structure, offering a new approach for resource-constrained LVLM applications. Abstract: Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
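
The bottom-up composition described above can be sketched with standard fuzzy-logic operators over per-frame scores. This is a minimal illustration under stated assumptions: min/max are used for AND/OR, `fuzzy_then` is one possible reading of "temporal sequencing", and the tree encoding and predicate names are hypothetical. The abstract does not specify HiMu's actual operators.

```python
def fuzzy_and(a, b):
    """Per-frame conjunction: both predicates must hold at the same frame."""
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_or(a, b):
    """Per-frame disjunction: either predicate may hold."""
    return [max(x, y) for x, y in zip(a, b)]


def fuzzy_then(a, b):
    """Sequencing: b holds at frame t and a held at some strictly earlier frame."""
    out, best_a = [], 0.0
    for x, y in zip(a, b):
        out.append(min(best_a, y))
        best_a = max(best_a, x)
    return out


OPS = {"and": fuzzy_and, "or": fuzzy_or, "then": fuzzy_then}


def evaluate(node, leaf_scores):
    """Evaluate a logic tree; a leaf is a predicate name, a node is (op, left, right)."""
    if isinstance(node, str):
        return leaf_scores[node]
    op, left, right = node
    return OPS[op](evaluate(left, leaf_scores), evaluate(right, leaf_scores))


def top_k_frames(curve, k):
    """Return the indices of the k highest-scoring frames in temporal order."""
    ranked = sorted(range(len(curve)), key=lambda t: -curve[t])
    return sorted(ranked[:k])


# Hypothetical per-frame expert scores, already normalized to [0, 1].
scores = {
    "door_opens": [0.9, 0.1, 0.0, 0.0],
    "person_speaks": [0.0, 0.0, 0.8, 0.2],
    "dog_visible": [0.0, 0.0, 0.5, 0.9],
}
tree = ("then", "door_opens", ("and", "person_speaks", "dog_visible"))
curve = evaluate(tree, scores)
selected = top_k_frames(curve, 2)
```

Because every operator maps score lists to a score list of the same length, the tree evaluates to one satisfaction value per frame, from which frames can be chosen without any further model calls.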

[110] CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Jiacheng Tang,Zhiyuan Zhou,Zhuolin He,Jia Zhang,Kai Zhang,Jian Pu

Main category: cs.CV

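TL;DR: This paper proposes CausalVAD, a de-confounding training framework for end-to-end driving models. Its sparse causal intervention scheme (SCIS) removes spurious associations induced by confounders and improves planning accuracy and safety.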

Details Motivation: Existing planning-oriented end-to-end driving models learn statistical correlations rather than causal relationships, so dataset biases can cause causal confusion that undermines reliability and safety in complex scenarios. Method: Proposes the CausalVAD de-confounding training framework, built around the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module based on backdoor adjustment; SCIS constructs a dictionary of prototypes representing latent driving contexts and uses it to intervene on the model's sparse vectorized queries, removing spurious associations induced by confounders. Result: Achieves state-of-the-art planning accuracy and safety on benchmarks such as nuScenes, and shows stronger robustness under both data bias and noisy scenarios configured to induce causal confusion. Conclusion: CausalVAD mitigates causal confusion in end-to-end driving models and improves their safety and robustness. Abstract: Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, yielding de-confounded representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.
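
For reference, the backdoor adjustment that this abstract refers to has the following standard form, where X is the input, Y the output, and Z a confounder that satisfies the backdoor criterion. Reading Z as the latent driving context represented by the prototype dictionary is an interpretation of the abstract, not a statement of how SCIS is implemented.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X, Z = z\bigr)\, P(Z = z)
```

The contrast is with the observational quantity P(Y | X), which weights each context by P(Z = z | X) and therefore lets the input's correlation with the context leak into the prediction.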

[111] HAViT: Historical Attention Vision Transformer

Swarnendu Banik,Manish Das,Shiv Ram Dubey,Satish Kumar Singh

Main category: cs.CV

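TL;DR: This paper proposes a cross-layer attention propagation method that stores and blends historical attention matrices in ViT to improve inter-layer information flow, feature learning, and accuracy. It needs only minimal architectural changes and is validated on several datasets and models.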

Details Motivation: Attention mechanisms in Vision Transformers operate independently across layers, which limits information flow and feature learning, so a mechanism for cross-layer information integration is needed. Method: Proposes cross-layer attention propagation, which preserves and integrates historical attention matrices across encoder layers by adding attention matrix storage and a weighted blending operation (with hyperparameter alpha), enabling progressive refinement of attention patterns. Result: ViT accuracy rises from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%); CaiT improves by 1.01%; the best blending coefficient is alpha = 0.45; random initialization outperforms zero initialization. Conclusion: The method improves ViT and its variants at very low overhead and shows that using historical attention information can improve training dynamics and final accuracy. Abstract: Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.
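
The storage-and-blending operation can be pictured as a convex combination of the current layer's attention matrix with the one carried over from earlier layers. The sketch below assumes that alpha weights the historical matrix and that the blended result becomes the next layer's history. The abstract does not give the exact formula, so the function names and the update rule are illustrative, not the HAViT implementation.

```python
def blend_attention(current, history, alpha=0.45):
    """Blend two attention matrices row by row; alpha is the weight on history."""
    return [
        [alpha * h + (1.0 - alpha) * c for c, h in zip(cur_row, hist_row)]
        for cur_row, hist_row in zip(current, history)
    ]


def propagate(per_layer_attention, alpha=0.45, init_history=None):
    """Carry a running attention history through the layers (assumed update rule)."""
    history = init_history
    blended_layers = []
    for attn in per_layer_attention:
        blended = attn if history is None else blend_attention(attn, history, alpha)
        blended_layers.append(blended)
        history = blended  # the blended matrix becomes the next layer's history
    return blended_layers


# Two toy 2 x 2 attention matrices whose rows each sum to 1.
uniform = [[0.5, 0.5], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = propagate([uniform, identity])
```

A convex combination of two row-stochastic matrices is itself row-stochastic, so under this assumed rule the blended attention needs no renormalization.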

[112] Color image restoration based on nonlocal saturation-value similarity

Wei Wang,Yakun Li

Main category: cs.CV

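TL;DR: This paper proposes a nonlocal variational method for color image restoration based on saturation-value similarity. It measures patch similarity in the HSV color space to describe color information more finely, and solves the resulting model with an algorithm based on Bregmanized operator splitting.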

Details Motivation: Traditional nonlocal methods extract patches directly from the RGB channels and compute grayscale similarity, which describes the color information of a color image poorly; this paper uses the saturation and value channels of the HSV space to build a color similarity measure closer to human perception and thereby improve nonlocal regularization. Method: Constructs a nonlocal total variation based on saturation-value similarity and embeds it in a variational model; designs an efficient numerical algorithm using the Bregmanized operator splitting method and analyzes its convergence. Result: Experiments show that the proposed model outperforms the compared methods in visual quality and in quantitative metrics including PSNR, SSIM, QSSIM, and S-CIELAB color error. Conclusion: The nonlocal variational method based on saturation-value similarity preserves and restores the color structure of color images more effectively and is a promising strategy for color image restoration. Abstract: In this paper, we propose and develop a novel nonlocal variational technique based on saturation-value similarity for color image restoration. In traditional nonlocal methods, image patches are extracted from the red, green and blue channels of a color image directly, and the color information cannot be described finely because the patch similarity is mainly based on the grayscale value of each independent channel. The main aim of this paper is to propose and develop a novel nonlocal regularization method by considering the similarity of image patches in the saturation-value channels of a color image. In particular, we first establish saturation-value similarity based nonlocal total variation by incorporating saturation-value similarity of color image patches into the proposed nonlocal gradients, which can describe the saturation and value similarity of two adjacent color image patches. The proposed nonlocal variational models are then formulated based on saturation-value similarity based nonlocal total variation. Moreover, we design an effective and efficient algorithm to solve the proposed optimization problem numerically by employing the Bregmanized operator splitting method, and we also study the convergence of the proposed algorithms. Numerical examples are presented to demonstrate that the performance of the proposed models is better than that of other testing methods in terms of visual quality and some quantitative metrics including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), quaternion structural similarity index (QSSIM) and S-CIELAB color error.

[113] AU Codes, Language, and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis

Jiahe Wang,Cong Liang,Xuandong Huang,Yuxin Wang,Xin Yun,Yi Wu,Yanan Chang,Shangfei Wang

Main category: cs.CV

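TL;DR: This paper proposes describing Action Units (AUs) in natural language to address the anatomically implausible results caused by linear AU combination in existing AU-based facial behavior synthesis, especially for conflicting AUs. It builds BP4D-AUText, the first large-scale AU text-image paired dataset, and proposes the generative model VQ-AUFace, which significantly outperforms existing methods in anatomical plausibility, behavioral richness, and perceptual realism.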

Details Motivation: Existing AU-based facial synthesis methods encode AUs as one-hot vectors and combine them linearly, which makes it hard to model conflicting AUs (those that activate the same muscle with opposing actions) and leads to anatomically implausible artifacts and unnatural motion superpositions. Method: Represents facial behavior through natural language descriptions of AUs; builds the BP4D-AUText text-image paired dataset (generated from BP4D/BP4D+ with a rule-based Dynamic AU Text Processor); designs the VQ-AUFace generative model, which incorporates facial structural priors for text-to-high-fidelity facial behavior synthesis. Result: Significantly outperforms existing methods in quantitative experiments and user studies, producing facial expressions that are more anatomically plausible, behaviorally rich, and perceptually convincing, especially with conflicting AUs. Conclusion: Natural-language AU representation is an effective route to more accurate and realistic facial behavior synthesis, and the proposed dataset and model offer a new paradigm for modeling nonverbal behavior. Abstract: Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs--defined as those which activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text. Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.

[114] myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition

Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Main category: cs.CV

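TL;DR: This paper presents the first systematic benchmark on myMNIST (formerly BHDD), a Burmese handwritten digit dataset, evaluating 11 models. The CNN performs best, with PETNN (GELU) close behind; the energy-based model JEM is also competitive, and the KAN-based models trail slightly but remain a useful baseline. The study establishes reproducible baselines, highlights PETNN's strong performance, and releases the benchmark to promote research on regional script recognition.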

Details Motivation: To establish a reproducible, cross-paradigm systematic benchmark on myMNIST, filling the gap in systematic evaluation of emerging and classical models for Burmese handwritten digit recognition, and to promote AI research on regional scripts. Method: Systematically evaluates 11 models on myMNIST: MLP, CNN, LSTM, GRU, Transformer, FastKAN, EfficientKAN, JEM, and three PETNN variants (Sigmoid/GELU/SiLU), using Precision, Recall, F1-Score, and Accuracy. Result: The CNN achieves the best performance (F1 = 0.9959, Accuracy = 0.9970); PETNN (GELU) is second (F1 = 0.9955, Accuracy = 0.9966), ahead of LSTM, GRU, Transformer, and the KAN variants; JEM performs solidly (F1 = 0.9944, Accuracy = 0.9958); the KAN-based models reach an Accuracy of about 0.992. Conclusion: The CNN remains a strong baseline; PETNN adapts well to regional script recognition; JEM confirms the potential of energy-based modeling; KAN-based models offer a new direction but need improvement; the benchmark provides a standardized evaluation basis and open resources for later work on Burmese and similar regional scripts. Abstract: We present the first systematic benchmark on myMNIST (formerly BHDD), a publicly available Burmese handwritten digit dataset important for Myanmar NLP/AI research. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for myMNIST across diverse modeling paradigms, (ii) highlight PETNN's strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM). We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.

[115] Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Lu Yu,Haiyang Zhang,Changsheng Xu

Main category: cs.CV

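TL;DR: This paper proposes Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) and its improved version Comp-TGA, which use local refinement and global constraint mechanisms to make vision-language models such as CLIP more robust to adversarial examples, significantly improving zero-shot robust accuracy across 16 datasets.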

Details Motivation: Pre-trained vision-language models such as CLIP have strong zero-shot capabilities but are vulnerable to adversarial examples; the authors observe that adversarial perturbations shift text-guided attention and design a robustness strategy around this observation. Method: Proposes the TGA-ZSR framework, with a Local Attention Refinement Module and a Global Attention Constraint Module; further proposes Comp-TGA, which combines two complementary foreground attention mechanisms: attention guided by the class prompt and reversed attention driven by the non-class prompt. Result: TGA-ZSR and Comp-TGA improve zero-shot robust accuracy over the current state of the art by 9.58% and 11.95%, respectively, across 16 datasets. Conclusion: Text-guided attention effectively improves the zero-shot robustness of vision-language models; the complementary attention design mitigates the focus on irrelevant or spurious features and improves robustness further. Abstract: Due to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements, respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.

[116] SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

Jialiang Kang,Han Shu,Wenshuo Li,Yingjie Zhai,Xinghao Chen

Main category: cs.CV

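TL;DR: SJD-PAC is an improved Speculative Jacobi Decoding method that uses a proactive drafting strategy and an adaptive continuation mechanism to speed up text-to-image generation by 3.8x without loss of image quality.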

Details Motivation: The original SJD has low acceptance rates in high-entropy regions of visual generation, which bottlenecks throughput. Method: Proposes the SJD-PAC framework: (1) a proactive drafting strategy that raises local acceptance rates in high-entropy regions; (2) an adaptive continuation mechanism that keeps validating the sequence after the first rejection, avoiding full resampling. Result: Achieves a 3.8x speedup on standard text-to-image benchmarks with lossless image quality. Conclusion: SJD-PAC improves the efficiency and stability of speculative decoding for visual generation while strictly preserving the target distribution. Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a $3.8\times$ speedup with lossless image quality.

[117] Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Bingqi Ma,Linlong Lang,Ming Zhang,Dailan He,Xingtong Ge,Yi Zhang,Guanglu Song,Yu Liu

Main category: cs.CV

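TL;DR: This paper proposes Cross-Modal Context Learning (CCL) to improve joint audio-video generation with dual-stream Transformer architectures. By introducing the TARP, LCT, DCR, and UCG modules, it mitigates model manifold variations, background bias, CFG inconsistency, and multi-condition conflicts in cross-modal interaction, achieving SOTA performance with fewer resources.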

Details Motivation: Existing dual-stream Transformer methods for audio-video generation suffer from model manifold variations caused by the cross-modal gating mechanism, biases in multi-modal background regions introduced by cross-modal attention, inconsistent multi-modal classifier-free guidance (CFG) between training and inference, and conflicts between multiple conditions. Method: Proposes Cross-Modal Context Learning (CCL), comprising: (1) Temporally Aligned RoPE and Partitioning (TARP), which improves temporal alignment between audio and video latent representations; (2) a Cross-Modal Context Attention (CCA) module built from Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR), which provides stable unconditional anchors and supports task-adaptive routing; (3) Unconditional Context Guidance (UCG), which uses LCT at inference to improve CFG consistency and reduce condition conflicts. Result: CCL surpasses recent academic methods in comprehensive evaluations, reaching SOTA performance while substantially reducing compute and training resource requirements. Conclusion: By systematically modeling and optimizing cross-modal context interaction, CCL overcomes key bottlenecks of dual-stream diffusion models in joint audio-video generation, improving generation quality, temporal synchronization, and train-inference consistency. Abstract: The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model's convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.

[118] Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Lukas Bayer,Sheethal Bhat,Andreas Maier

Main category: cs.CV

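TL;DR: This study systematically compares three hybrid Transformer models (UNETR, SwinUNETR, UNETR++) with the CNN baseline SegResNet for abdominal multi-organ segmentation on the RATIC dataset. SegResNet performs best overall, indicating that well-optimized CNNs still hold an advantage on small- to medium-sized heterogeneous datasets.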

Details Motivation: Accurate multi-organ segmentation in abdominal CT is essential for computer-aided diagnosis and treatment; Transformers have recently attracted attention for their ability to model long-range dependencies, but their practical advantage over CNNs in medical image segmentation still needs empirical verification. Method: On the RATIC abdominal CT dataset (206 scans from 23 institutions), benchmarks UNETR, SwinUNETR, UNETR++, and SegResNet for volumetric multi-organ segmentation under identical preprocessing and training settings, with the Dice Similarity Coefficient (DSC) as the primary metric. Result: SegResNet achieves the highest overall performance, outperforming all Transformer models across all organs; UNETR++ is the most competitive of the Transformer models; UNETR converges fastest. Conclusion: For small- to medium-sized, highly heterogeneous medical imaging datasets, well-optimized CNN architectures such as SegResNet remain highly competitive and may outperform current hybrid Transformer models. Abstract: Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.
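
For readers unfamiliar with the primary metric named above, the Dice Similarity Coefficient for one organ label is 2|P ∩ T| / (|P| + |T|), where P and T are the predicted and ground-truth voxel sets. The following is a minimal sketch, not the paper's evaluation code: the function name `dice_coefficient`, the flattened label lists, and the convention of returning 1.0 when a label is absent from both volumes are all assumptions made for illustration.

```python
def dice_coefficient(pred, target, label):
    """Dice Similarity Coefficient for one label over flattened label volumes.

    `pred` and `target` are equal-length sequences of integer labels
    (hypothetical toy input; real CT volumes would be 3D arrays).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pred_hits = [p == label for p in pred]
    target_hits = [t == label for t in target]
    # Voxels where both prediction and ground truth carry this label.
    intersection = sum(p and t for p, t in zip(pred_hits, target_hits))
    total = sum(pred_hits) + sum(target_hits)
    if total == 0:
        # Label absent from both volumes: treated as perfect agreement
        # (an assumed convention; some toolkits return NaN instead).
        return 1.0
    return 2.0 * intersection / total


# Toy example with two organ labels (1 and 2) and background (0).
target = [0, 1, 1, 1, 1, 0, 2, 2]
pred = [0, 1, 1, 1, 0, 0, 2, 1]
print(dice_coefficient(pred, target, 1))  # 0.75
```

A per-organ score like this is typically averaged over organs and scans to obtain an overall DSC; how the paper aggregates is not stated in the abstract.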

[119] OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Bin Cao,Sipeng Zheng,Hao Luo,Boyuan Li,Jing Liu,Zongqing Lu

Main category: cs.CV

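TL;DR: This paper presents OpenT2M, a large-scale, high-quality open-source motion dataset, and MonoFrill, a text-to-motion generation model pretrained on it whose core component is the novel motion tokenizer 2D-PRQ. Together they significantly improve model generalization and zero-shot performance.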

Details Motivation: Existing text-to-motion (T2M) models perform poorly on unseen text descriptions, mainly because motion datasets are small and lack diversity. Method: Builds OpenT2M, a million-level, high-quality open-source motion dataset with over 2800 hours of motion, rigorous physical feasibility validation, and fine-grained text annotations; designs an automated pipeline for long-horizon sequences; proposes the MonoFrill model and its core component, the 2D-PRQ motion tokenizer, which models spatiotemporal dependencies by body part. Result: OpenT2M significantly improves the generalization of existing T2M models; 2D-PRQ performs well on motion reconstruction and zero-shot tasks; MonoFrill achieves strong T2M performance with a simple architecture. Conclusion: OpenT2M and MonoFrill together address longstanding data quality and benchmarking problems in T2M and advance the field. Abstract: Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technical tricks as "frills". Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biological parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.

[120] GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Yueying Zou,Pei Pei Li,Zekun Li,Xinyu Guo,Xing Cui,Huaibo Huang,Ran He

Main category: cs.CV

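TL;DR: This paper proposes GenVideoLens, a fine-grained benchmark for evaluating the multi-dimensional capabilities of Large Vision-Language Models (LVLMs) in AI-generated video detection, and finds that models fall well short on optical consistency, physical interactions, and temporal-causal reasoning.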

Details Motivation: Existing evaluation relies on binary classification and coarse-grained accuracy, which makes it hard to see why LVLMs succeed or fail at AI-generated video detection. Method: Builds the GenVideoLens benchmark of 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions (perceptual, optical, physical, and temporal cues); evaluates 11 representative LVLMs and runs temporal perturbation experiments. Result: LVLMs show a pronounced imbalance across dimensions: they do relatively well on perceptual cues but are seriously deficient on optical consistency, physical interactions, and temporal-causal reasoning; smaller open-source models sometimes outperform larger proprietary models on specific dimensions; temporal perturbation experiments show that current LVLMs make limited use of temporal information. Conclusion: GenVideoLens offers diagnostic insight into LVLM capabilities in AI-generated video detection, reveals key capability gaps, and points to directions for improving future systems. Abstract: In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.

[121] GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments

Zelin Liu,Bocheng Li,Yuling Zhou,Xuanting Li,Yixuan Yang,Jing Wang,Weishu Zhao,Xiaofeng Gao

Main category: cs.CV

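TL;DR: This paper proposes the GEAR framework, a three-stage pipeline (skeleton-guided screening, physics-aware filtering, graph-based fine recognition) that efficiently retrieves terrain analogs of the Mariana Trench on the Qinghai-Tibet Plateau. It also designs the MSG-Net model to improve topographic similarity recognition and finds that its features correlate significantly with biological data.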

Details Motivation: Deep-sea sampling is prohibitively expensive, so there is a pressing need to find terrestrial analogs on the Qinghai-Tibet Plateau that resemble the Mariana Trench in geological origin and microbial function; existing models cannot combine geographical knowledge with computational efficiency. Method: Proposes the three-stage GEAR framework: (1) skeleton-guided screening and clipping; (2) physics-aware filtering with the Topographic Waveform Comparator (TWC) and the Morphological Texture Module (MTM); (3) fine recognition with the Morphology-integrated Siamese Graph Network (MSG-Net), based on geomorphological metrics; also builds an expert-annotated topographic similarity dataset. Result: MSG-Net's F1-Score is 1.38 percentage points higher than the SOTA baseline; features extracted by MSG-Net correlate significantly with biological data; every stage is shown to be effective. Conclusion: GEAR identifies Mariana Trench analogs on the Qinghai-Tibet Plateau efficiently and accurately, offers a new paradigm for cross-domain terrain analog research, and supports later biological analysis. Abstract: The Mariana Trench and the Qinghai-Tibet Plateau exhibit significant similarities in geological origins and microbial metabolic functions. Given that deep-sea biological sampling faces prohibitive costs, recognizing structurally homologous terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is of great significance. Yet, no existing model adequately addresses cross-domain topographic similarity retrieval, either neglecting geographical knowledge or sacrificing computational efficiency. To address these challenges, we present the Geography-knowledge Enhanced Analog Recognition (GEAR) Framework, a three-stage pipeline designed to efficiently retrieve analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau: (1) Skeleton guided Screening and Clipping: Recognition of candidate valleys and initial screening based on size and linear morphological criteria. (2) Physics aware Filtering: The Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM) evaluate the waveform and texture and filter out inconsistent candidate valleys. (3) Graph based Fine Recognition: We design a Morphology-integrated Siamese Graph Network (MSG-Net) based on geomorphological metrics. Correspondingly, we release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Experiments demonstrate the effectiveness of every stage. Besides, MSG-Net achieved an F1-Score 1.38 percentage points higher than the SOTA baseline. Using features extracted by MSG-Net, we discovered a significant correlation with biological data, providing evidence for future biological analysis.

[122] SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery

Rong Fu,Jiekai Wu,Haiyun Wei,Xiaowen Ma,Shiyin Lin,Kangan Qian,Chuang Liu,Jianyuan Ni,Simon James Fong

Main category: cs.CV

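TL;DR: SwiftGS is a meta-learned system that predicts decoupled Gaussian primitives and a lightweight SDF in a single forward pass, enabling fast, large-scale 3D reconstruction from multi-date satellite imagery at significantly lower computational cost while maintaining accuracy.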

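Details Motivation: Existing methods struggle with illumination changes, sensor heterogeneity, and the high cost of per-scene optimization in 3D reconstruction from multi-date satellite imagery. Method: Proposes SwiftGS, a meta-learning framework that combines a differentiable physics graph (modeling projection, illumination, and sensor response), spatial gating (blending sparse Gaussian detail with global SDF structure), semantic-geometric fusion, conditional lightweight task heads, and uncertainty-aware multi-task supervision from multiple views using a frozen geometric teacher. Result: Achieves zero-shot 3D surface reconstruction at inference, with optional compact calibration; produces accurate digital surface models (DSMs) and view-consistent renderings at much lower computational cost; ablations confirm the value of the hybrid representation, physics-aware rendering, and meta-training strategy. Conclusion: By decoupling geometry and radiation in the representation and learning priors through meta-training, SwiftGS offers a new paradigm for efficient large-scale 3D reconstruction from multi-date satellite imagery that balances speed, accuracy, and generalization. Abstract: Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.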

[123] Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

Jiayi Luo,Jiayu Chen,Jiankun Wang,Cong Wang,Hanxin Zhu,Qingyun Sun,Chen Gao,Zhibo Chen,Jianxin Li

Main category: cs.CV

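TL;DR: This paper proposes SVOO, a training-free sparse attention framework for video generation that combines offline layer-wise sensitivity profiling with online bidirectional co-clustering, significantly speeding up inference while maintaining high quality.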

Details Motivation: Existing training-free sparse attention methods for video generation ignore layer heterogeneity and query-key coupling, which limits the quality-speed trade-off. Method: Proposes the SVOO framework: stage one is offline layer-wise sensitivity profiling to determine each layer's intrinsic pruning level; stage two is online block-wise sparse attention using a novel bidirectional co-clustering algorithm. Result: Validated on seven widely used video generation models, SVOO achieves up to 1.93x speedup over SOTA methods while maintaining a PSNR of up to 29 dB on Wan2.1. Conclusion: By exploiting the layer-intrinsic nature of attention sparsity and the bidirectional coupling structure, SVOO achieves a better quality-speed trade-off and offers a new paradigm for efficient video generation. Abstract: Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
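
To give a sense of scale for the PSNR figure quoted above, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). It is not the authors' evaluation code: the function name `psnr`, the flat pixel lists, and the 8-bit peak value of 255 are assumptions, and the abstract does not say which reference frames the metric is computed against.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical toy frames: every pixel is off by 9 intensity levels (MSE = 81).
reference_frame = [100, 100, 100, 100]
test_frame = [109, 109, 109, 109]
score = psnr(reference_frame, test_frame)
print(round(score, 2))
```

Under these assumptions, a uniform error of about 9 levels on an 8-bit scale lands near 29 dB, which is roughly the regime the abstract reports.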

[124] PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance

Cong Wang,Hanxin Zhu,Xiao Tang,Jiayi Luo,Xin Jin,Long Chen,Fei-Yue Wang,Zhibo Chen

Main category: cs.CV

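TL;DR: This paper proposes PhysVideo, a two-stage framework for generating physically consistent video: stage one uses Phys4View to generate physics-aware orthogonal foreground videos, and stage two uses VideoSyn to synthesize the full video with background. It also builds the multi-view PhysMV dataset to validate the approach.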

Details Motivation: Existing video generation methods have made substantial progress in visual fidelity but struggle to guarantee physically consistent motion, because real object motion unfolds in 3D space while video provides only partial, view-dependent 2D projections. Method: Proposes the two-stage PhysVideo framework: in stage one, Phys4View uses physics-aware attention to model how physical attributes affect motion and adds geometry-enhanced cross-view attention and temporal attention to improve spatio-temporal consistency; in stage two, VideoSyn uses the generated foreground videos as guidance and learns the interaction between foreground dynamics and background context for controllable video synthesis; also builds PhysMV, a four-view dataset with 40K scenes and 160K video sequences. Result: Experiments show that PhysVideo significantly outperforms existing video generation methods in physical realism and spatio-temporal coherence. Conclusion: By introducing 3D physical priors and multi-view modeling, PhysVideo improves the physical plausibility and spatio-temporal consistency of generated video and offers a new physics-driven paradigm for video generation. Abstract: Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Home page: https://anonymous.4open.science/w/Phys4D/.

[125] MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration

Teer Song,Yue Zhang,Yu Tian,Ziyang Wang,Xianlin Zhang,Guixuan Zhang,Xuan Liu,Xueming Li,Yasen Zhang

Main category: cs.CV

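TL;DR: This paper proposes MeInTime, a diffusion-based cross-age face restoration method that decouples identity and age modeling, achieving high identity fidelity and age consistency given only cross-age reference images and an age prompt.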

Details Motivation: Most existing reference-based face restoration methods assume the reference and the degraded input are the same age, so they struggle in practical scenarios such as historical photo restoration where only cross-age references exist. Method: Proposes MeInTime: (1) a new attention mechanism that explicitly injects identity features; (2) Gated Residual Fusion modules that fuse degraded features with identity representations; (3) a training-free sampling strategy, "Age-Aware Gradient Guidance", that iteratively nudges the identity-aware latent toward the semantic manifold of the target age. Result: Surpasses existing methods in both identity fidelity and age consistency. Conclusion: MeInTime extends reference-based face restoration to the cross-age setting and provides an effective solution for practical applications such as historical image restoration. Abstract: To better preserve an individual's identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or a few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime

[126] Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA

Ruizhi Yu,Keyang Zhong,Peng Liu,Qi Wu,Haoran Zhang,Yanhao Zhang,Chen Chen,Haonan Lu

Main category: cs.CV

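TL;DR: This paper presents Click-to-Ask, an AI assistant for live streaming commerce. An offline module processes multimodal product information and generates compliant promotional copy, and an online module answers viewer questions in real time, significantly improving preparation efficiency, content engagement, and response timeliness.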

Details Motivation: To make product promotion more efficient and convenient for streamers in live streaming commerce, addressing the challenges of answering viewer questions in real time and preparing promotional content efficiently. Method: Designs an AI assistant with offline and online modules: the offline module processes multimodal product information and generates structured data and compliant copy; the online module uses a click-to-ask mechanism and combines the offline-generated data with a streaming, event-level historical memory to answer in real time. Result: On a self-collected dataset of TikTok live stream frames, Question Recognition Accuracy reaches 0.913 and the Response Quality score is 0.876. Conclusion: Click-to-Ask shows considerable practical potential for improving the efficiency, interactivity, and responsiveness of live streaming commerce. Abstract: Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: https://www.youtube.com/shorts/mWIXK-SWhiE.

[127] Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

Pius Horn,Janis Keuper

Main category: cs.CV

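TL;DR: This paper proposes a new semantic table evaluation framework based on synthetic PDFs and LLM-as-a-judge. It agrees with human judgment far better than traditional rule-based metrics such as TEDS and GriTS, and is validated through a large-scale human study and an evaluation of 21 PDF parsers.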

Details Motivation: Existing evaluation of PDF table extraction relies on rule-based metrics that cannot measure semantic equivalence of table content, and lacks a standard that is reliable, scalable, and close to human judgment. Method: Builds a benchmark of realistic, complex synthetic PDFs generated from LaTeX; designs a semantic matching and evaluation pipeline using LLM-as-a-judge that tolerates inconsistencies in parser outputs; runs a human validation study with over 1,500 quality judgments. Result: LLM-based evaluation correlates with human judgment at Pearson r=0.93, well above TEDS (r=0.68) and GriTS (r=0.70); evaluating 21 parsers on 100 synthetic documents containing 451 tables reveals significant performance differences. Conclusion: LLM-as-a-judge provides a more reliable, reproducible, and scalable evaluation paradigm for PDF table extraction, guides parser selection in practice, and moves the field toward standardized evaluation. Abstract: Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
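
The agreement figures above are Pearson correlation coefficients between a metric's scores and human judgments. As a rough illustration only, the sketch below computes Pearson r from its definition (covariance divided by the product of standard deviations). The function name `pearson_r` and the toy score lists are hypothetical and are not taken from the paper or its repositories.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five extracted tables: human ratings vs. a metric.
human_scores = [1, 2, 3, 4, 5]
metric_scores = [2, 1, 4, 3, 5]
r = pearson_r(human_scores, metric_scores)
print(round(r, 3))  # 0.8
```

A value near 1 means the metric ranks and spaces table quality much as the human raters do; the closer a metric's r is to 1, the better it can stand in for human review.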

[128] SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Carlos Hinojosa,Clemens Grange,Bernard Ghanem

Main category: cs.CV

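TL;DR: This paper studies how much vision-language models (VLMs) depend on semantic cues when making safety decisions. It proposes a semantic steering framework and the SAVeS benchmark, and finds that VLM safety behavior is easily swayed by textual, visual, and cognitive interventions. This indicates reliance on visual-linguistic associations rather than grounded visual understanding and exposes a potential vulnerability in multimodal safety systems.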

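Details Motivation: In real-world and embodied settings, VLM safety decisions depend on visual context, but it is unclear which visual evidence drives these judgments; the paper asks whether they rest on simple semantic cues rather than grounded visual understanding. Method: A semantic steering framework that steers VLM safety behavior through controlled textual, visual, and cognitive interventions without changing scene content; the SAVeS benchmark and an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Result: Experiments show that the safety decisions of multiple VLMs are highly sensitive to semantic cues and rely on learned visual-linguistic associations rather than grounded visual understanding; automated steering pipelines can exploit this, exposing a weakness in multimodal safety systems. Conclusion: VLMs depend on semantic shortcuts in safety judgments, and current multimodal safety mechanisms lack real visual grounding; more robust and interpretable safety evaluation and intervention methods are needed. Abstract: Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.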

[129] Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation

Jingguo Qu,Xinyang Han,Yao Pu,Man-Lik Chui,Simon Takadiyi Gunda,Ziman Chen,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying

Main category: cs.CV

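TL;DR: This paper proposes Switch, a semi-supervised learning framework that uses Multiscale Switch (MSS) and Frequency Domain Switch (FDS) strategies to make better use of scarce labeled and abundant unlabeled data and to learn robust feature representations for ultrasound image segmentation. It outperforms existing methods on several medical ultrasound datasets.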

Details Motivation: Medical ultrasound image segmentation suffers from scarce labeled data and imaging artifacts such as speckle noise and low-contrast boundaries; existing semi-supervised methods underuse unlabeled data and lack mechanisms for robust feature representation. Method: The Switch framework has two core components: (1) a Multiscale Switch (MSS) strategy that uses hierarchical patch mixing to achieve uniform spatial coverage; (2) a Frequency Domain Switch (FDS) with contrastive learning that switches amplitudes in Fourier space to strengthen feature representations. Both are integrated in a teacher-student architecture. Result: Evaluated on six ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, prostate); at a 5% labeling ratio, Dice reaches 80.04% (LN-INT), 85.52% (DDTI), and 83.48% (Prostate), even exceeding fully supervised baselines; the model has only 1.8M parameters. Conclusion: Switch is a parameter-efficient, high-performing semi-supervised method for ultrasound image segmentation, suited to resource-constrained medical imaging settings. Abstract: Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at https://github.com/jinggqu/Switch
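
The "amplitude switching in Fourier space" mentioned above can be illustrated with a small NumPy sketch: each image keeps its own phase spectrum and receives the other image's amplitude spectrum. This is only a generic full-spectrum swap written under that assumption. The function name `amplitude_switch` is hypothetical, and the paper's FDS may restrict the swap to certain frequency bands and combines it with contrastive learning, neither of which is shown here.

```python
import numpy as np


def amplitude_switch(img_a, img_b):
    """Swap the Fourier amplitude spectra of two same-sized 2D images.

    Each output keeps the phase of one image and takes the amplitude
    of the other. Illustrative sketch, not the paper's implementation.
    """
    if img_a.shape != img_b.shape:
        raise ValueError("images must have the same shape")
    spec_a = np.fft.fft2(img_a)
    spec_b = np.fft.fft2(img_b)
    amp_a, phase_a = np.abs(spec_a), np.angle(spec_a)
    amp_b, phase_b = np.abs(spec_b), np.angle(spec_b)
    # Recombine: amplitude from the other image, phase from the original.
    a_switched = np.fft.ifft2(amp_b * np.exp(1j * phase_a)).real
    b_switched = np.fft.ifft2(amp_a * np.exp(1j * phase_b)).real
    return a_switched, b_switched


rng = np.random.default_rng(0)
image_a = rng.random((8, 8))   # stand-ins for two ultrasound images
image_b = rng.random((8, 8))
mixed_a, mixed_b = amplitude_switch(image_a, image_b)
```

Phase carries most of the spatial structure of an image, while amplitude carries more of its intensity statistics, so a swap of this kind changes appearance while largely preserving layout.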

[130] Multimodal Model for Computational Pathology: Representation Learning and Image Compression

Peihang Wu,Zehong Chen,Lijian Xu

Main category: cs.CV

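TL;DR: This review surveys recent advances in multimodal computational pathology, focusing on the challenges posed by the high resolution of whole slide images (WSIs), scarce annotations, difficult multimodal fusion, and poor model interpretability. It organizes the field into four research directions: self-supervised representation learning and structure-aware token compression, multimodal data generation and augmentation, parameter-efficient adaptation and reasoning-enhanced few-shot learning, and multi-agent collaborative reasoning. It argues that unified frameworks combining high-resolution imagery with clinical and biomedical knowledge are needed for interpretable and safe AI-assisted diagnosis.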

Details Motivation: WSIs have extremely high resolution, expert annotations are scarce, multimodal fusion struggles to preserve biological interpretability, and modeling ultra-long visual sequences lacks clinical transparency, so a systematic survey of advances and key challenges in multimodal computational pathology is needed. Method: A systematic review organized around four directions: (1) self-supervised representation learning and structure-aware token compression; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; (4) multi-agent collaborative reasoning. Particular attention is given to the role of token compression and multi-agent "Chain of Thought" simulation in cross-scale modeling and uncertainty-aware evidence fusion. Result: Summarizes the main technical routes and paradigms in multimodal computational pathology, highlighting that token compression supports cross-scale modeling and that multi-agent mechanisms can simulate a pathologist's multi-level decision process, and identifies unified multimodal frameworks as the direction for future progress. Conclusion: Future progress depends on unified, interpretable, and safe multimodal AI frameworks that integrate high-resolution visual data, clinical text, and biomedical knowledge to support trustworthy clinical diagnosis. Abstract: Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

[131] Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels

Juan Miguel Valverde,Dim P. Papadopoulos,Rasmus Larsen,Anders Bjorholm Dahl

Main category: cs.CV

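TL;DR: This paper proposes SCNP, a method that improves the topology accuracy of image segmentation by penalizing each pixel's logits with those of its poorest-classified neighbor. It applies to varied structure morphologies and imaging modalities, is validated on 13 datasets, and can be integrated into multiple segmentation frameworks and loss functions.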

Details Motivation: Standard deep learning segmentation models cannot guarantee topology accuracy (for example, the number of connected components), which degrades segmentation quality and downstream quantification; existing remedies are hard to integrate, computationally expensive, or restricted to particular morphologies. Method: SCNP (Same-Class Neighbor Penalization): during training, each pixel's logits are penalized with those of its poorest-classified same-class neighbor, so the model must improve predictions at a pixel's neighbors before it can improve the pixel itself. Result: Improves topology accuracy on 13 datasets covering different structure morphologies (not only tubular) and image modalities; integrates into three semantic and instance segmentation frameworks and several loss functions at low computational cost. Conclusion: SCNP is a lightweight, general, and easily integrated technique that improves topology accuracy across segmentation tasks without changing the network architecture. Abstract: Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://jmlipman.github.io/SCNP-SameClassNeighborPenalization.
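
One possible reading of "penalizing the logits with their poorest-classified neighbor" is sketched below for the binary case. This is an interpretation of the abstract, not the authors' code. The function name `scnp_binary`, the square window, and the `radius` parameter are assumptions. For a foreground pixel, the poorest-classified same-class neighbor is taken to be the one with the lowest foreground logit; for a background pixel, the one with the highest.

```python
def scnp_binary(logits, labels, radius=1):
    """Replace each foreground logit by that of its worst same-class neighbor.

    logits: 2D list of foreground logits (higher means more foreground).
    labels: 2D list of ground-truth labels, 1 = foreground, 0 = background.
    Sketch of one interpretation; the loss would then be computed on the
    returned (penalized) logits instead of the raw ones.
    """
    height, width = len(logits), len(logits[0])
    penalized = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Logits of same-class pixels in the window (includes the pixel itself).
            same_class = [
                logits[a][b]
                for a in range(max(0, i - radius), min(height, i + radius + 1))
                for b in range(max(0, j - radius), min(width, j + radius + 1))
                if labels[a][b] == labels[i][j]
            ]
            if labels[i][j] == 1:
                penalized[i][j] = min(same_class)  # worst foreground neighbor
            else:
                penalized[i][j] = max(same_class)  # worst background neighbor
    return penalized


# A thin foreground line with one weakly predicted pixel in the middle.
example_logits = [[3.0, -1.0, 3.0],
                  [-2.0, -2.0, -2.0]]
example_labels = [[1, 1, 1],
                  [0, 0, 0]]
example_out = scnp_binary(example_logits, example_labels)
print(example_out)
```

In this toy input the single weak pixel, which would break the line into two components, lowers the logits of both of its neighbors, so a loss on the penalized logits cannot drop until that pixel is fixed.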

[132] Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

Mohamed Youssef,Mayar Elfares,Anna-Maria Meer,Matteo Bortoletto,Andreas Bulling

Main category: cs.CV

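TL;DR: This paper proposes Ontology-Guided Diffusion (OGD), a neuro-symbolic framework for zero-shot simulation-to-reality (sim2real) image translation. It models realism as structured knowledge (interpretable traits such as lighting and material properties, with their relationships in a knowledge graph) and combines graph neural network embeddings with symbolic edit planning to drive a pretrained diffusion model, yielding better, more interpretable, data-efficient, and generalisable image translation.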

Details Motivation: Existing diffusion-based sim2real methods rely on unstructured prompts or statistical alignment and fail to capture the structured factors that make images look real; labelled real data is also scarce, calling for zero-shot, interpretable, and generalisable methods. Method: OGD builds a realism ontology of interpretable traits (such as lighting and material properties) and a knowledge graph over them; it infers trait activations from a synthetic image and uses a graph neural network to produce a global embedding; in parallel, a symbolic planner computes a consistent sequence of visual edits; the embedding conditions the diffusion model via cross-attention, and the edits become a structured instruction prompt. Result: Across benchmarks, OGD's graph embeddings distinguish real from synthetic images better than baselines, and OGD outperforms state-of-the-art diffusion methods on sim2real image translation. Conclusion: Explicitly modeling the structure of realism improves the interpretability, data efficiency, and generalisation of sim2real transfer, supporting the neuro-symbolic approach. Abstract: Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits -- such as lighting and material properties -- and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translations. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.

[133] EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Longfei Liu,Yongjie Hou,Yang Li,Qirui Wang,Youyang Sha,Yongjun Yu,Yinzhi Wang,Peizhe Ru,Xuanlong Yu,Xi Shen

Main category: cs.CV

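TL;DR: This paper proposes EdgeCrafter, a unified compact Vision Transformer framework for dense prediction on edge devices. With task-specialized distillation and an edge-friendly encoder-decoder design, it achieves strong results on detection, instance segmentation, and pose estimation, showing that compact ViTs can be practical and competitive at the edge.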

Details Motivation: Edge dense prediction systems are still dominated by CNNs (such as YOLO), and compact ViTs struggle to match their accuracy-efficiency tradeoff; the authors attribute this mainly to insufficient task-specific representation learning in small ViTs rather than an inherent mismatch between ViTs and edge dense prediction. Method: The EdgeCrafter framework centers on ECDet, a detector built from a distilled compact backbone and an edge-optimized encoder-decoder; it is extended to instance segmentation (ECInsSeg) and pose estimation (ECPose-X). Result: ECDet-S reaches 51.7 AP on COCO with fewer than 10M parameters using only COCO annotations; ECInsSeg matches RF-DETR with substantially fewer parameters; ECPose-X reaches 74.8 AP, well above YOLO26Pose-X (71.6 AP), which relies on Objects365 pretraining. Conclusion: Compact ViTs combined with task-specialized distillation and edge-aware design can deliver efficient, accurate dense prediction on resource-constrained edge devices, making them a practical alternative to CNNs. Abstract: Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency tradeoff, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter's reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/

[134] 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

Rundong Su,Jintao Zhang,Zhihang Yuan,Haojie Duanmu,Jianfei Chen,Jun Zhu

Main category: cs.CV

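TL;DR: This paper proposes an inference-time mixed-precision (NVFP4/INT8) quantization framework and a Temporal Delta Cache (TDC) for video diffusion transformers (Video DiTs). By dynamically assigning lower precision (NVFP4) or higher precision (INT8) to layers and skipping computation for temporally stable residual blocks, it achieves 1.92× end-to-end acceleration and 3.32× memory reduction while preserving generation quality.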

Details Motivation: Existing post-training quantization methods use a static bit-width allocation and ignore how the quantization difficulty of activations varies across diffusion timesteps, giving a poor efficiency-quality tradeoff; Video DiT inference is also costly in memory and compute. Method: Two components: (1) based on a strong linear correlation between a Transformer block's input-output difference and the quantization sensitivity of its linear layers, a lightweight predictor dynamically assigns NVFP4 to stable layers and INT8 to volatile layers; (2) exploiting the high temporal consistency of block residuals across timesteps, a Temporal Delta Cache (TDC) skips recomputation of invariant blocks. Result: 1.92× end-to-end acceleration and 3.32× memory reduction on Video DiTs without degrading generation quality, setting a new baseline for efficient video diffusion inference. Conclusion: Dynamic mixed-precision quantization and exploitation of temporal redundancy together address the inference bottleneck of Video DiTs and offer a practical route to deploying large video generation models. Abstract: Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. Besides this, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92× end-to-end acceleration and 3.32× memory reduction, setting a new baseline for efficient inference in Video DiTs.
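
The caching idea behind TDC can be sketched in a few lines: store a block's residual (output minus input) and, on a later timestep, add the stored residual to the new input instead of running the block again. The class below is a toy illustration only. The name `DeltaCachedBlock`, the `tol` threshold, and the rule "reuse when the input has barely changed" are assumptions made for this example; the abstract does not specify how the paper decides which blocks to skip.

```python
class DeltaCachedBlock:
    """Wrap a block and reuse its cached residual when the input barely changes."""

    def __init__(self, block, tol):
        self.block = block          # callable: list of floats -> list of floats
        self.tol = tol              # hypothetical stability threshold
        self.cached_input = None
        self.cached_delta = None
        self.compute_calls = 0      # how often the real block actually ran

    def __call__(self, x):
        if self.cached_input is not None:
            change = max(abs(a - b) for a, b in zip(x, self.cached_input))
            if change <= self.tol:
                # Skip the block: approximate output as input plus cached residual.
                return [a + d for a, d in zip(x, self.cached_delta)]
        y = self.block(x)
        self.compute_calls += 1
        self.cached_input = list(x)
        self.cached_delta = [out - inp for inp, out in zip(x, y)]
        return y


def toy_block(x):
    """Stand-in for an expensive Transformer block."""
    return [2.0 * v + 1.0 for v in x]


cached = DeltaCachedBlock(toy_block, tol=0.1)
out_first = cached([1.0, 2.0])     # computed: residual [2.0, 3.0] is cached
out_second = cached([1.05, 2.0])   # input nearly unchanged: residual reused
out_third = cached([2.0, 2.0])     # input changed a lot: recomputed
```

The reused output is an approximation (the exact block output for `[1.05, 2.0]` would differ slightly), which is the accuracy-for-speed trade such a cache makes.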

[135] WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification

Isabel Rio-Torto,Jaime S. Cardoso,Luís F. Teixeira

Main category: cs.CV

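TL;DR: This paper proposes WeNLEX, a weakly supervised model that generates natural language explanations for multilabel chest X-ray classification. It enforces faithfulness through image reconstruction and plausibility through distribution alignment, achieving high explanation quality and improved classification performance with very few annotated explanations.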

Details Motivation: Existing methods supervise explanation generation with large sets of human-annotated explanations, producing explanations that are plausible but not faithful to the model's actual reasoning; the goal is to obtain both faithfulness and plausibility without large annotated explanation sets. Method: A weakly supervised framework: faithfulness is enforced by matching, in the black-box model's feature space, images generated from the explanations with the original images; plausibility is maintained by distribution alignment with a small database of clinician-annotated explanations; it supports both post-hoc and in-model settings, and the database can be swapped to target different audiences (for example clinicians or laypeople). Result: WeNLEX clearly outperforms baselines on metrics for faithfulness, simulatability, diversity, and plausibility; as few as 5 ground-truth explanations per diagnosis suffice; in-model training raises classification AUC by 2.21%; a simplified version for non-medical users is also built. Conclusion: WeNLEX generates faithful and plausible natural language explanations under weak supervision, is practical and adaptable, and shows that modeling interpretability can improve downstream task performance. Abstract: Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model's reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model's feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as few as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance. Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.

[136] DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

Haochen Li,Rui Zhang,Hantao Yao,Xin Zhang,Yifan Hao,Shaohui Peng,Yongwei Zhao,Ling Li

Main category: cs.CV

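TL;DR: This paper proposes DA-Mamba, a hybrid architecture combining CNNs with State Space Models (SSMs) for domain adaptive object detection. Its IA-SSM and OA-SSM modules strengthen global-local alignment of domain-invariant features at the image level and the instance level respectively, balancing efficiency and modeling capacity.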

Details Motivation: CNN-based methods struggle to model global domain-invariant features, while transformer-based methods have high computational cost and are hard to deploy. Method: DA-Mamba combines CNNs with linear-time State Space Models (SSMs): an Image-Aware SSM (IA-SSM) in the backbone strengthens image-level global alignment, and an Object-Aware SSM (OA-SSM) in the detection head models spatial and semantic dependencies among objects for instance-level alignment. Result: Experiments on multiple cross-domain detection benchmarks show that DA-Mamba efficiently improves the detector's cross-domain performance, combining accuracy with computational efficiency. Conclusion: With its hybrid CNN-SSM design, DA-Mamba models both global and local domain-invariant features at low computational cost, offering a new approach to efficient domain adaptive object detection. Abstract: Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Model (SSM) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of SSMs to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.

[137] ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation

Ying Zheng,Yiyi Zhang,Yi Wang,Lap-Pui Chau

Main category: cs.CV

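TL;DR: This paper proposes ProCal, which dynamically calibrates neighborhood-based prediction probabilities through a dual-model collaborative prediction mechanism, addressing forgetting of source knowledge and overfitting to local noise in source-free domain adaptation.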

Details Motivation: Existing neighborhood-based source-free domain adaptation methods over-rely on prediction similarity among neighbors, which causes rapid forgetting of source knowledge and susceptibility to local noise. Method: ProCal, a probability calibration method that combines the source model's initial predictions with the current model's online outputs to dynamically calibrate neighborhood predictions, together with a joint objective combining a soft supervision loss and a diversity loss. Result: Effectiveness is validated on 31 cross-domain tasks across four public datasets; theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused. Conclusion: ProCal strikes a good balance between mitigating knowledge forgetting and noise overfitting, improving source-free domain adaptation performance. Abstract: Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model's initial predictions with the current model's online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: https://github.com/zhengyinghit/ProCal.

[138] SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Vsevolod Skorokhodov,Chenghao Xu,Shuo Sun,Olga Fink,Malcolm Mielle

Main category: cs.CV

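TL;DR: This paper proposes SEAR, a simple and efficient fine-tuning strategy that adapts a pretrained visual geometry transformer to multimodal RGB-thermal input. Trained on a small RGB-T dataset, it delivers significant gains and maintains high accuracy and cross-modal consistency even in challenging conditions such as low light and dense smoke.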

Details Motivation: Visual geometry models pretrained on RGB data degrade on mixed sensing modalities such as RGB-thermal, and in particular struggle to align RGB and thermal inputs processed jointly. Method: SEAR, a lightweight fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T input, plus a new RGB-T dataset covering different capture times, viewpoints, and illumination conditions. Result: Significantly outperforms state-of-the-art methods on 3D reconstruction and camera pose estimation (for example, over 29% improvement in AUC@30), with high detail and cross-modal consistency, negligible inference overhead, and reliable results under low light and dense smoke. Conclusion: SEAR shows that targeted fine-tuning can efficiently extend RGB-pretrained geometry models to multimodal settings, offering a practical, robust, and scalable approach for cross-modal visual geometry tasks such as RGB-T. Abstract: Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving significant improvements over all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.

[139] Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

Jiatong Xia,Zicheng Duan,Anton van den Hengel,Lingqiao Liu

Main category: cs.CV

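TL;DR: This paper proposes Points-to-3D, a framework that uses point cloud priors (for example from LiDAR or VGGT) to improve the geometric controllability of diffusion models. With a structure inpainting network and staged sampling, it improves rendering quality and geometric fidelity in 3D asset and scene generation.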

Details Motivation: Existing 3D generation methods are mostly conditioned on images or text, while readily available point cloud priors (such as LiDAR or VGGT outputs) are underused and explicit geometric constraints are not modeled effectively. Method: Built on the latent 3D diffusion model TRELLIS: initializes the sparse structure latent from the point cloud prior, introduces a structure inpainting network, and uses a two-stage sampling strategy (global structural inpainting followed by boundary refinement). Result: On object and scene generation, improves rendering quality and geometric fidelity over state-of-the-art baselines; accepts either real point clouds or point clouds estimated from a single image, and generalizes well. Conclusion: Explicitly embedding point cloud priors improves geometric accuracy and structural controllability in 3D generation, offering a new approach to generative modeling with multimodal geometric priors. Abstract: Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on a latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.

[140] Rethinking Uncertainty Quantification and Entanglement in Image Segmentation

Jakob Lønborg Christensen,Vedrana Andersen Dahl,Morten Rieger Hannemose,Anders Bjorholm Dahl,Christian F. Baumgartner

Main category: cs.CV

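TL;DR: This paper presents a comprehensive empirical study of uncertainty quantification (UQ) in medical image segmentation, focusing on how methods for aleatoric (data) uncertainty (AU) and epistemic (model) uncertainty (EU) combine and become entangled. It proposes a metric to quantify entanglement and evaluates many AU-EU combinations on out-of-distribution (OOD) detection, ambiguity modeling, and calibration, finding that ensembles, especially a softmax ensemble, perform best overall with low entanglement.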

Details Motivation: Many AU and EU modeling methods exist, but how they behave in combination is unclear, and the widespread entanglement between the two undermines the interpretability and usefulness of the uncertainty decomposition; a systematic empirical analysis is needed. Method: A large-scale empirical study of combinations of AU methods (such as Probabilistic UNet, Diffusion) and EU methods (such as ensembles, MC Dropout); a new metric quantifying entanglement between AU and EU; a unified evaluation on three downstream UQ tasks: OOD detection, ambiguity modeling, and calibration. Result: For OOD detection, ensembles show the lowest entanglement and best performance; for ambiguity modeling and calibration the best model depends on the dataset, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled; a softmax ensemble does well on all tasks. Conclusion: AU-EU entanglement is a key issue for the practical usefulness of UQ; ensembles, especially a softmax ensemble, are currently a good choice combining low entanglement and robustness; the sources of entanglement and ways to mitigate it need further study. Abstract: Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration, the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.
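
For readers unfamiliar with the AU/EU split, the snippet below sketches the common entropy-based decomposition for an ensemble at a single pixel: total uncertainty is the entropy of the averaged prediction, AU is the average entropy of the individual members, and EU is their difference. This is the textbook formulation, shown as background under the assumption that each member outputs class probabilities; it is not necessarily the estimator used in the paper, and it does not implement the paper's entanglement metric.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(member_probs):
    """Split ensemble uncertainty at one pixel into (total, aleatoric, epistemic).

    member_probs: list of per-member class probability lists.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean_probs = [
        sum(member[c] for member in member_probs) / n_members
        for c in range(n_classes)
    ]
    total = entropy(mean_probs)                                     # H[mean p]
    aleatoric = sum(entropy(m) for m in member_probs) / n_members   # mean H[p]
    epistemic = total - aleatoric                                   # mutual information
    return total, aleatoric, epistemic


# Members agree that the pixel is ambiguous: uncertainty is all aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Members are individually confident but disagree: uncertainty is all epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
print(agree, disagree)
```

Both toy cases have the same total uncertainty (ln 2), which is why the split matters: the first suggests an inherently ambiguous pixel, the second suggests the model has not seen enough similar data.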

[141] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li,Amanmeet Garg,Shalini Chaudhuri,Rui Zhao,Garin Kessler

Main category: cs.CV

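TL;DR: This paper proposes Perceptio, a perception-enhanced large vision language model (LVLM) that explicitly emits semantic segmentation tokens and depth tokens within the autoregressive sequence, giving the model 2D and 3D spatial reasoning abilities and substantially improving fine-grained spatial grounding.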

Details Motivation: LVLMs are strong at semantic understanding but weak at fine-grained spatial grounding, because they must infer complex geometry implicitly without producing an explicit spatial interpretation. Method: (1) Distill a VQ-VAE depth codebook from a monocular teacher model to encode dense depth maps as compact sequences; (2) integrate SAM2 semantic segmentation tokens and VQ-VAE depth tokens into the LLM so the model emits spatial tokens before answering; (3) stabilize depth token generation with composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction; (4) co-train across multiple datasets with a multi-task strategy. Result: Built on InternVL, Perceptio reaches state-of-the-art results on several benchmarks: referring expression segmentation cIoU improves by +0.8/+1.4/+1.1 on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%. Conclusion: An explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs, supporting the idea of embedding perception tokens directly in the language modeling pipeline. Abstract: Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

[142] VesselTok: Tokenizing Vessel-like 3D Biomedical Graph Representations for Reconstruction and Generation

Chinmay Prabhakar,Bastian Wittmann,Tamaz Amiranashvili,Paul Büschl,Ezequiel de la Rosa,Julian McGinnis,Benedikt Wiestler,Bjoern Menze,Suprosanna Shit

Main category: cs.CV

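TL;DR: This paper proposes VesselTok, a framework that learns latent representations (tokens) of spatially dense graphs (such as blood vessels and airways) from a parametric shape perspective, encoding tubular geometry from centerline points and pseudo radii. It demonstrates generalization across anatomies, generative modeling, and transfer to downstream inverse problems such as link prediction.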

Details Motivation: High-resolution, large-scale spatial graphs (such as vascular networks) are highly complex and computationally demanding, calling for more efficient modeling. Method: VesselTok takes centerline points and their pseudo radii as input and learns a latent representation conditioned on the centerline to encode neural implicit representations of vessel-like tubular structures. Result: Validated on several anatomies, including lung airways, lung vessels, and brain vessels; the latent representations generalize to unseen anatomies, support generation of plausible anatomical graphs, and transfer to downstream inverse problems such as link prediction. Conclusion: VesselTok offers an efficient, robust, and transferable approach to modeling spatial graphs of complex tubular anatomy. Abstract: Spatial graphs provide a lightweight and elegant representation of curvilinear anatomical structures such as blood vessels, lung airways, and neuronal networks. Accurately modeling these graphs is crucial in clinical and (bio-)medical research. However, the high spatial resolution of large networks drastically increases their complexity, resulting in significant computational challenges. In this work, we aim to tackle these challenges by proposing VesselTok, a framework that approaches spatially dense graphs from a parametric shape perspective to learn latent representations (tokens). VesselTok leverages centerline points with a pseudo radius to effectively encode tubular geometry. Specifically, we learn a novel latent representation conditioned on centerline points to encode neural implicit representations of vessel-like, tubular structures. We demonstrate VesselTok's performance across diverse anatomies, including lung airways, lung vessels, and brain vessels, highlighting its ability to robustly encode complex topologies. To prove the effectiveness of VesselTok's learnt latent representations, we show that they (i) generalize to unseen anatomies, (ii) support generative modeling of plausible anatomical graphs, and (iii) transfer effectively to downstream inverse problems, such as link prediction.

[143] Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging

Hesong Li,Ziqi Wu,Ruiwen Shao,Ying Fu

Main category: cs.CV

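TL;DR: This paper proposes a statistical characteristic-guided denoising network (SCGN) for HRTEM images. It denoises jointly in the spatial and frequency domains through spatial deviation-guided weighting and frequency band-guided weighting, and combines this with an HRTEM-specific noise calibration method and a realistic dataset, significantly improving atomic localization accuracy.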

Details Motivation: When HRTEM is used to observe millisecond-scale rapid nucleation, short exposures cause severe noise that obscures atomic positions, so a high-fidelity denoising method is needed to support atomic-scale dynamic analysis. Method: The statistical characteristic-guided denoising network (SCGN) has three parts: 1) in the spatial domain, spatial deviation-guided weighting adaptively selects convolution operations according to local deviation characteristics; 2) in the frequency domain, frequency band-guided weighting enhances signal and suppresses noise according to band characteristics; 3) an HRTEM-specific noise calibration method and a synthetic dataset containing disordered structures and realistic noise. Result: SCGN outperforms existing state-of-the-art denoising methods on both synthetic and real HRTEM images and performs better on downstream tasks such as atomic localization. Conclusion: By combining statistical priors with domain-specific design, SCGN balances noise suppression against structural fidelity and provides a reliable image basis for HRTEM-driven studies of material nucleation mechanisms. Abstract: High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which boosts the studies of advanced solid materials. Nonetheless, due to the millisecond-scale rapid change of nucleation, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristic. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noise. It can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task. Code will be available at https://github.com/HeasonLee/SCGN.

[144] Towards Interpretable Foundation Models for Retinal Fundus Images

Samuel Ofosu Mensah,Maria Camila Roa Carvajal,Kerol Djoumessi,Philipp Berens

Main category: cs.CV

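TL;DR: This paper proposes Dual-IFM, a vision foundation model that is interpretable by design: class evidence maps provide local interpretability and a 2D projection layer provides global interpretability. On retinal imaging tasks it performs comparably to state-of-the-art models with far more parameters.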

Details Motivation: Existing foundation models based on self-supervised learning are limited in high-stakes domains such as medical imaging because their architectures lack interpretability; new models that combine high performance with interpretability are needed. Method: Dual-IFM 1) introduces class evidence maps that provide local interpretability for individual images, and 2) uses a 2D projection layer that allows the representation space of an entire dataset to be visualized (global interpretability). The model is pretrained with large-scale self-supervised learning on over 800,000 color fundus photographs. Result: On downstream tasks, Dual-IFM performs on par with state-of-the-art foundation models that have up to 16 times as many parameters, and it provides interpretable predictions on out-of-distribution data. Conclusion: Large-scale self-supervised pretraining combined with inherent interpretability can jointly improve the robustness and clinical applicability of retinal image representations. Abstract: Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: First, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model's representation space. We trained our model on over 800,000 color fundus photographs from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to $16\times$ the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.

[145] HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Xiangyu Bai,Bishoy Galoaa,Sarah Ostadabbas

Main category: cs.CV

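TL;DR: This paper proposes HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to select key frames for video question answering. It greatly reduces the number of input frames and the processing time while improving answer quality.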

Details Motivation: Most existing video question answering systems use uniform or heuristic sampling, which cannot be optimized for downstream answer quality and therefore limits both efficiency and accuracy. Method: The HORNet framework is formulated under the Select Any Frames (SAF) task setting and trains its frame selection policy with GRPO, learning which subset of frames to pass to a frozen vision-language model (VLM). Result: HORNet has fewer than 1M trainable parameters, reduces input frames by up to 99% and VLM processing time by up to 93%, improves F1 by 1.7% on MSVD-QA, and gains 7.3 points over uniform sampling on temporal reasoning in NExT-QA; the policy transfers across VLMs without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Conclusion: Optimizing what the VLM sees (that is, input frame selection) is an effective path to better video question answering efficiency and accuracy, complementary to optimizing what the model generates. Abstract: Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
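
The group-relative idea behind GRPO can be illustrated with a minimal sketch. This is not HORNet's training code: the function name, the example rewards, and the epsilon are assumptions made for illustration. For one question, several candidate frame subsets are sampled, each receives a scalar reward (for example, answer correctness from the frozen VLM), and each sample's advantage is its reward standardized against the rest of its group.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style).

    `rewards` holds one scalar per candidate frame subset sampled for the
    same question. The name and the epsilon are illustrative assumptions.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # Samples better than the group average get a positive advantage.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled frame subsets of one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, which is one reason this style of update suits a small selection policy.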

[146] Motion-o: Trajectory-Grounded Video Reasoning

Bishoy Galoaa,Shayda Moezzi,Xiangyu Bai,Sarah Ostadabbas

Main category: cs.CV

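TL;DR: This paper introduces Spatial-Temporal-Trajectory (STT) reasoning as a new capability. Through the Motion-o model and Motion Chain of Thought (MCoT), it explicitly models and verifies object motion trajectories in video, improving spatial-temporal grounding and trajectory prediction.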

Details Motivation: Existing video reasoning work does not explicitly model how objects move between successive observations (that is, their trajectories), leaving motion understanding implicit and difficult to verify. Method: The paper proposes Motion-o, a motion-centric extension to vision-language models; builds a dataset augmented with trajectory annotations; designs Motion Chain of Thought (MCoT), which uses tags to give a structured description of direction, speed, and scale change; and designs a reward function grounded in visual evidence for training, with no changes to the model architecture. Result: Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks; MCoT makes trajectory reasoning interpretable and verifiable. Conclusion: Explicit motion reasoning (STT) is a key new dimension of evidence-based video understanding, and Motion-o offers a scalable, verifiable approach in this direction. Abstract: Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about how objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that expresses object trajectories through a discrete tag summarizing per-object direction, speed, and scale (of velocity) change to explicitly connect grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.

[147] PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Tianci Luo,Jinpeng Wang,Shiyu Qin,Niu Lian,Yan Feng,Bin Chen,Chun Yuan,Shu-Tao Xia

Main category: cs.CV

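TL;DR: This paper proposes PromptHub, a framework that improves the fusion of multiple demonstration prompts in visual in-context learning (VICL) through locality-aware fusion, concentration, and alignment, significantly improving performance, generalization, and robustness.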

Details Motivation: Existing visual in-context learning methods, built on patch-wise fusion and model-agnostic supervision, struggle to exploit the informative cues in demonstrations, which limits performance gains. Method: PromptHub combines locality-aware fusion (which uses spatial priors to capture context), complementary concentration, alignment, and prediction objectives trained jointly, and data augmentation to reinforce supervision. Result: PromptHub significantly outperforms existing methods on three fundamental vision tasks, and its universality, transferability, and robustness are validated in out-of-distribution settings and various retrieval scenarios. Conclusion: PromptHub establishes a reliable locality-aware paradigm for prompt fusion that moves beyond earlier patch-wise approaches. Abstract: Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.

[148] MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Youngwan Lee,Soojin Jang,Yoorhim Cho,Seunghwan Lee,Yong-Ju Lee,Sung Ju Hwang

Main category: cs.CV

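TL;DR: This paper introduces MultihopSpatial, a benchmark for evaluating and improving vision-language models (VLMs) on multi-hop, compositional spatial reasoning and precise visual grounding, together with a new metric, Acc@50IoU, and a training corpus, MultihopSpatial-Train. Experiments show that existing VLMs still fall well short on these tasks, and that reinforcement learning post-training on the corpus improves spatial reasoning and embodied manipulation performance.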

Details Motivation: Existing spatial reasoning benchmarks focus only on simple single-hop relations and do not assess the multi-hop compositional reasoning and precise visual grounding that real physical scenes require. Method: The authors build the MultihopSpatial benchmark (with complex 1- to 3-hop queries), propose the Acc@50IoU metric for jointly evaluating reasoning and grounding, release the large-scale training corpus MultihopSpatial-Train, and apply reinforcement learning post-training. Result: A systematic evaluation of 37 state-of-the-art VLMs yields eight key findings and shows that compositional spatial reasoning remains a major challenge; reinforcement learning post-training improves both intrinsic spatial reasoning and downstream embodied manipulation performance. Conclusion: Multi-hop compositional spatial reasoning is a key weakness of current VLMs and needs dedicated benchmarks, metrics, and training data working together; MultihopSpatial provides systematic support for developing the spatial abilities of VLA agents. Abstract: Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
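
A joint metric of this kind can be sketched as follows. The abstract only states that Acc@50IoU requires both answer selection and a precise bounding box; the reading below (a sample counts only if the answer matches and the box IoU is at least 0.5), the record layout, and the function names are assumptions for illustration rather than the benchmark's reference implementation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_50_iou(predictions, references, threshold=0.5):
    """Fraction of samples with the right answer AND box IoU >= threshold."""
    hits = 0
    for pred, ref in zip(predictions, references):
        same_answer = pred["answer"] == ref["answer"]
        good_box = box_iou(pred["box"], ref["box"]) >= threshold
        hits += int(same_answer and good_box)
    return hits / len(references) if references else 0.0


# Hypothetical records: both answers are right, but only the first box is tight.
preds = [{"answer": "A", "box": (0, 0, 10, 10)},
         {"answer": "B", "box": (0, 0, 10, 10)}]
refs = [{"answer": "A", "box": (0, 0, 10, 9)},
        {"answer": "B", "box": (5, 5, 15, 15)}]
score = acc_at_50_iou(preds, refs)
```

Under this reading, a model that guesses the right option without localizing the object gets no credit, which is the behavior the metric is described as targeting.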

[149] Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness

Yitong Li,Igor Yakushev,Dennis M. Hedderich,Christian Wachinger

Main category: cs.CV

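TL;DR: This paper proposes PASTA, a framework that uses conditional diffusion models with enhanced pathology awareness to generate high-quality, pathology-informative synthetic PET images from MRI. It improves Alzheimer's disease diagnosis by 4% over MRI, approaching the performance of real PET.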

Details Motivation: PET is sensitive for diagnosing neurodegenerative diseases but is costly and involves radiation exposure; MRI is safe but less sensitive. Existing MRI-to-PET synthesis methods emphasize structural fidelity and lack pathology awareness. Method: PASTA is built on conditional diffusion models and comprises a highly interactive dual-arm architecture, multi-modal condition integration, a cycle exchange consistency constraint, and a volumetric generation strategy. Result: In qualitative and quantitative evaluation the synthesized PET images show high fidelity and strong pathology awareness; for Alzheimer's disease diagnosis, performance improves over MRI by 4% and approaches that of real PET. Conclusion: PASTA narrows the performance gap between MRI and PET in diagnosing neurodegenerative diseases and offers a route to low-cost, radiation-free imaging diagnosis. Abstract: Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA's ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer's diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at https://github.com/ai-med/PASTA.

[150] GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

Ahmed Tawfik Aboukhadra,Marcel Rogge,Nadia Robertini,Abdalla Arafa,Jameel Malik,Ahmed Elhayek,Didier Stricker

Main category: cs.CV

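TL;DR: This paper proposes GHOST, a framework that uses 2D Gaussian Splatting for fast, category-agnostic 3D reconstruction of hand-object interactions from monocular RGB video, producing physically consistent and animatable results.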

Details Motivation: Existing methods rely on category-specific templates or heavy computation, and still struggle to produce physically consistent 3D hand-object alignment. Method: Gaussian Hand-Object Splatting (GHOST) represents hands and objects as dense, view-consistent Gaussian discs and introduces three innovations: a geometric-prior retrieval and consistency loss, a grasp-aware alignment, and a hand-aware background loss. Result: GHOST reaches state-of-the-art 3D reconstruction accuracy and 2D rendering quality on ARCTIC, HO3D, and in-the-wild datasets, and runs an order of magnitude faster than prior category-agnostic methods. Conclusion: GHOST is an efficient and robust solution for modeling realistic hand-object interactions that supports complete, physically consistent, and animatable reconstructions. Abstract: Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at https://github.com/ATAboukhadra/GHOST.

[151] Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching

Feifan Luo,Hongyang Chen

Main category: cs.CV

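TL;DR: This paper proposes a new 3D shape matching method based on unsupervised contrastive learning. By improving feature representations and simplifying the functional map learning architecture, it reaches state-of-the-art accuracy and efficiency.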

Details Motivation: Existing deep functional map methods focus on optimizing pointwise or functional maps themselves and neglect the quality of feature representations in the embedding space; they also depend on computationally expensive traditional functional map solvers. The result is inadequate feature quality, limited matching performance, and high computational cost. Method: The paper proposes an unsupervised contrastive learning framework that improves the consistency and discriminability of features; designs a simplified functional map learning architecture that removes the time-consuming functional map solver and multiple auxiliary losses; and integrates both into a unified two-branch pipeline. Result: On challenging benchmarks covering near-isometric, non-isometric, and topologically inconsistent cases, the method surpasses current state-of-the-art methods in accuracy and efficiency, and even surpasses supervised techniques. Conclusion: Combining unsupervised contrastive learning with a lightweight functional map architecture markedly improves the accuracy and efficiency of non-rigid 3D shape matching. Abstract: Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.
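
The contrastive objective described in the abstract (pull positive pairs together, push negative pairs apart) can be illustrated with a generic InfoNCE-style loss. This is a standard formulation swapped in for illustration, not the paper's actual loss: the temperature, the cosine similarity, the toy feature vectors, and the function name are all assumptions, and the vectors are assumed to be non-zero.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor feature (illustrative only)."""

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        return dot / (norm_u * norm_v)

    # The positive pair goes first, followed by every negative pair.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all pairs.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    # Negative log-probability assigned to the positive pair.
    return log_sum - logits[0]


# Toy 2D point features: a well-matched positive versus a mismatched one.
anchor = [1.0, 0.0]
good = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

The loss is small when the anchor is most similar to its positive and grows when a negative is closer, which is the consistency and discriminability pressure the summary describes.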

[152] VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Jiayi Yuan,Haobo Jiang,De Wen Soh,Na Zhao

Main category: cs.CV

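TL;DR: VGGT-360 is a training-free framework for zero-shot panoramic depth estimation. By exploiting the intrinsic 3D consistency of VGGT-like foundation models, it reformulates the task as panoramic reprojection over a 3D model reconstructed from multiple views, yielding geometry-consistent panoramic depth.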

Details Motivation: Existing training-free, view-independent methods for panoramic depth estimation lack geometric consistency and reason over fragmented individual views. Method: A unified framework of three modules: (i) uncertainty-guided adaptive projection, which slices the panorama into perspective views and allocates denser views according to gradient-based uncertainty; (ii) structure-saliency enhanced attention, which injects structure-aware confidence into VGGT's attention layers; (iii) correlation-weighted 3D model correction, which reweights overlapping points using attention-inferred correlation scores to refine the 3D model. Result: Across multiple resolutions and indoor and outdoor datasets, VGGT-360 outperforms current trained and training-free state-of-the-art methods. Conclusion: VGGT-360 shows that accurate, geometry-consistent panoramic depth estimation is feasible without training, and offers a new approach to geometric understanding built on foundation models. Abstract: This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

[153] CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

Zening Sun,Zhengpeng Xie,Lichen Bai,Shitong Shao,Shuo Yang,Zeke Xie

Main category: cs.CV

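TL;DR: This paper proposes CRAFT, which uses Composite Reward Filtering (CRF) to build a high-quality dataset and then applies an enhanced SFT. With only 100 samples it outperforms state-of-the-art preference optimization methods that use thousands of samples, and it converges 11 to 220 times faster.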

Details Motivation: Existing diffusion model alignment methods such as SFT and DPO are limited by the cost of obtaining high-quality images or large-scale preference data, and by computational inefficiency. Method: Composite Reward Assisted Fine-Tuning (CRAFT) has two steps: 1) Composite Reward Filtering (CRF) selects a high-quality, consistent training set; 2) an enhanced variant of SFT is run on that data. The paper also proves theoretically that CRAFT optimizes a lower bound of group-based reinforcement learning. Result: With only 100 samples, CRAFT outperforms state-of-the-art methods that require thousands of preference pairs, and converges 11 to 220 times faster. Conclusion: CRAFT is a lightweight, data-efficient, and theoretically grounded paradigm for aligning diffusion models that substantially improves training efficiency and performance. Abstract: Aligning diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220$\times$ faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
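
The abstract does not spell out how Composite Reward Filtering combines rewards, so the following is only a sketch of one plausible form: score each generated sample with several reward models, min-max normalize each reward across the pool, average them, and keep the top few samples for SFT. The reward names, the averaging rule, and the function name are hypothetical.

```python
def composite_reward_filter(samples, reward_names, keep=2):
    """Keep the `keep` samples with the highest mean normalized reward.

    Assumed form of the filter, for illustration only: each reward is
    min-max normalized across the pool before averaging.
    """

    def normalize(values):
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]

    columns = [normalize([s["rewards"][name] for s in samples])
               for name in reward_names]
    composite = [sum(col[i] for col in columns) / len(columns)
                 for i in range(len(samples))]
    order = sorted(range(len(samples)), key=lambda i: composite[i], reverse=True)
    return [samples[i] for i in order[:keep]]


# Hypothetical pool of generated images scored by two reward models.
pool = [
    {"id": "a", "rewards": {"aesthetic": 0.9, "alignment": 0.8}},
    {"id": "b", "rewards": {"aesthetic": 0.2, "alignment": 0.9}},
    {"id": "c", "rewards": {"aesthetic": 0.5, "alignment": 0.1}},
]
selected = composite_reward_filter(pool, ["aesthetic", "alignment"], keep=2)
```

Normalizing before averaging keeps one reward with a wide numeric range from dominating the composite; a real pipeline might instead require agreement among rewards or apply per-reward thresholds.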

[154] Unleashing the Power of Simplicity: A Minimalist Strategy for State-of-the-Art Fingerprint Enhancement

Raffaele Cappelli

Main category: cs.CV

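TL;DR: This paper proposes a minimalist and efficient approach to fingerprint enhancement comprising two new methods, one based on contextual filtering and one learning-based. They outperform existing complex methods on low-quality fingerprints, and an open-source implementation is released to support reproducibility and further research.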

Details Motivation: Existing fingerprint enhancement methods perform poorly on low-quality fingerprints and are computationally expensive, so simpler and more effective methods are needed. Method: Two new fingerprint enhancement methods are proposed, a contextual filtering method and a learning-based method, both emphasizing simplicity and practicality. Result: Validated on a challenging latent fingerprint database, the new methods produce clearer, more accurate, and less noisy images and consistently outperform state-of-the-art methods. Conclusion: A minimalist design can deliver high-quality fingerprint enhancement; future research should weigh model complexity against practical benefit. Abstract: Fingerprint recognition systems, which rely on the unique characteristics of human fingerprints, are essential in modern security and verification applications. Accurate minutiae extraction, a critical step in these systems, depends on the quality of fingerprint images. Despite recent improvements in fingerprint enhancement techniques, state-of-the-art methods often struggle with low-quality fingerprints and can be computationally demanding. This paper presents a minimalist approach to fingerprint enhancement, prioritizing simplicity and effectiveness. Two novel methods are introduced: a contextual filtering method and a learning-based method. These techniques consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. The effectiveness of these methods is validated using a challenging latent fingerprint database. The open-source implementation of these techniques not only fosters reproducibility but also encourages further advancements in the field. The findings underscore the importance of simplicity in achieving high-quality fingerprint enhancement and suggest that future research should balance complexity and practical benefits.

[155] Generalized Hand-Object Pose Estimation with Occlusion Awareness

Hui Yang,Wei Sun,Jian Liu,Jian Xiao Tao Xie,Hossein Rahmani,Ajmal Saeed Mian,Nicu Sebe,Gim Hee Lee

Main category: cs.CV

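TL;DR: This paper proposes GenHOI, a framework that combines hierarchical semantic prompts with hand priors to improve the generalization and robustness of generalized 3D hand-object pose estimation from a single RGB image under heavy occlusion.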

Details Motivation: Generalized 3D hand-object pose estimation is challenged by large variation in object appearance and interaction patterns and by missing visual cues under heavy occlusion; existing methods generalize poorly. Method: GenHOI 1) builds a hierarchical semantic prompt (textual descriptions of object states, hand configurations, and interaction patterns); 2) adopts a multi-modal masked modeling strategy over RGB images, predicted point clouds, and text; 3) uses hand priors as stable spatial references to extract implicit interaction constraints. Result: GenHOI achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks, improving the accuracy and generalization of hand-object pose estimation in occluded scenes. Conclusion: Jointly modeling hierarchical semantic knowledge and hand priors mitigates occlusion and strengthens generalization to unseen objects and novel interactions. Abstract: Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.

[156] Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Anqi Zhang,Xiaokang Ji,Guangyu Gao,Jianbo Jiao,Chi Harold Liu,Yunchao Wei

Main category: cs.CV

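TL;DR: This paper proposes SELF1E, which achieves end-to-end segmentation with a multi-modal large language model (MLLM) without a specialist mask decoder by retaining image features at their original resolution, refilling them with residual features, applying pixel-unshuffle operations, and using an attention mask with dual perception pathways.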

Details Motivation: Existing MLLM-based segmentation methods rely heavily on external specialist mask decoders or on multiple additional tokens, which limits simplicity and end-to-end capability; this paper investigates whether a single segmentation embedding (SELF1E) is enough for the MLLM itself to produce high-quality segmentation masks. Method: 1) keep image features at their original high resolution; 2) refill them with residual features extracted from MLLM-processed compressed features to improve precision; 3) apply pixel-unshuffle operations to image features with and without LLM processing, respectively, to release detail and amplify the residuals; 4) design an attention mask with two pathways (image-to-image and image-to-segmentation) to strengthen feature interaction between pixels and the segmentation token. Result: Across multiple segmentation tasks, SELF1E is competitive with methods that depend on specialist mask decoders, supporting the feasibility of decoder-free segmentation in MLLMs. Conclusion: A single segmentation embedding is sufficient for the MLLM itself to perform high-quality segmentation without any external mask decoder, pointing toward lighter, end-to-end visual understanding in multi-modal models. Abstract: Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.

[157] SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Quentin Guimard,Federico Bartsch,Simone Caldarella,Rahaf Aljundi,Elisa Ricci,Massimiliano Mancini

Main category: cs.CV

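TL;DR: This paper proposes Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in the latent space of a Sparse Autoencoder (SAE) to mitigate the social and spurious biases that vision-language models such as CLIP inherit from large-scale, uncurated training data. By decomposing text embeddings into sparse features and modulating bias-relevant neurons, SEM substantially improves fairness in retrieval and zero-shot classification while preserving semantic fidelity.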

Details Motivation: Vision-language models such as CLIP carry severe social and spurious biases because of their large-scale, uncurated training data; existing post-hoc debiasing methods operate in the dense embedding space, where bias and task-relevant information are hard to separate, so debiasing degrades semantic performance. Method: Sparse Embedding Modulation (SEM) maps CLIP text embeddings into a Sparse Autoencoder (SAE) latent space, identifies and modulates bias-relevant sparse features there while preserving query-relevant ones, and thereby enables non-linear, fine-grained interventions. Result: Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification while maintaining or improving semantic fidelity. Conclusion: Sparse latent representations provide a more effective and more controllable foundation for post-hoc debiasing of vision-language models. Abstract: Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
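
The encode, modulate, decode pattern described above can be sketched on a toy example. Everything concrete here is an assumption for illustration: the tiny hand-set weights stand in for a trained SAE, the set of bias-relevant units is given rather than identified, and the scaling rule is the simplest possible modulation. It is not SEM's actual procedure.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def modulate_embedding(x, w_enc, b_enc, w_dec, bias_units, scale=0.0):
    """Encode x into sparse latents, rescale chosen units, and decode.

    `bias_units` is a hypothetical set of latent indices judged to be
    bias-relevant; how such units are found is outside this sketch.
    """
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    z = [max(0.0, p) for p in pre]  # ReLU keeps the latent code non-negative
    z_mod = [zi * scale if i in bias_units else zi for i, zi in enumerate(z)]
    return matvec(w_dec, z_mod), z, z_mod


# Toy 2D embedding and a 3-unit latent space with hand-set weights.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
x_new, z, z_mod = modulate_embedding([1.0, 2.0], w_enc, b_enc, w_dec,
                                     bias_units={2}, scale=0.0)
```

Only the selected latent unit is changed, so the contribution of the other units to the decoded embedding is untouched; in a dense embedding there is no comparable per-feature handle.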

[158] FUMO: Prior-Modulated Diffusion for Single Image Reflection Removal

Telang Xu,Chaoyang Zhang,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

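TL;DR: This paper proposes FUMO, a single image reflection removal framework based on a diffusion model with prior modulation. It introduces an intensity prior and a high-frequency prior to improve spatial controllability and structural faithfulness, and adopts a coarse-to-fine training paradigm.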

Details Motivation: Single image reflection removal in real scenes is difficult because reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures; existing methods struggle to achieve both spatial controllability and structural faithfulness. Method: FUMO 1) extracts two priors directly from the mixed image, an intensity prior (indicating reflection severity) and a high-frequency prior (capturing detail-sensitive responses via multi-scale residual aggregation); 2) trains coarse-to-fine in two stages: the first stage uses both priors to gate the conditional residual injections, focusing on regions that are both reflection-dominant and structure-sensitive; the second stage uses a refinement network to correct local misalignment and sharpen details in image space. Result: On standard benchmarks and challenging in-the-wild images, FUMO achieves competitive quantitative results and consistently improved perceptual quality. Conclusion: Explicit prior guidance improves the spatial controllability and structural faithfulness of reflection removal. Abstract: Single image reflection removal (SIRR) is challenging in real scenes, where reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures. This paper presents a diffusion model with prior modulation framework (FUMO) that introduces explicit guidance signals to improve spatial controllability and structural faithfulness. Two priors are extracted directly from the mixed image, an intensity prior that estimates spatial reflection severity and a high-frequency prior that captures detail-sensitive responses via multi-scale residual aggregation. We propose a coarse-to-fine training paradigm. In the first stage, these cues are combined to gate the conditional residual injections, focusing the conditioning on regions that are both reflection-dominant and structure-sensitive. In the second stage, a fine-grained refinement network corrects local misalignment and sharpens fine details in the image space. Experiments conducted on both standard benchmarks and challenging images in the wild demonstrate competitive quantitative results and consistently improved perceptual quality. The code is released at https://github.com/Lucious-Desmon/FUMO.

[159] TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Yan Shu,Bin Ren,Zhitong Xiong,Xiao Xiang Zhu,Begüm Demir,Nicu Sebe,Paolo Rota

Main category: cs.CV

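TL;DR: This paper proposes TerraScope, a unified vision-language model designed for earth observation that supports multi-modal (optical/SAR) and multi-temporal reasoning. It also introduces the Terra-CoT dataset and the TerraScope-Bench benchmark, and significantly improves pixel-grounded geospatial reasoning.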

Details Motivation: Existing vision-language models struggle with earth observation tasks whose complex spatial reasoning must be grounded in precise pixel-level visual representations. Method: The authors propose TerraScope, a unified VLM that supports modality-flexible reasoning (single-modality optical or SAR input, or adaptive fusion when both are available) and multi-temporal reasoning; they build Terra-CoT, a dataset of 1 million samples with pixel-level masks embedded in reasoning chains; and they design TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, with six sub-tasks that score both answer accuracy and mask quality. Result: TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence. Conclusion: Through modality fusion and multi-temporal modeling, TerraScope improves both the performance and the interpretability of pixel-level spatial reasoning for earth observation. Abstract: Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

[160] Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos

Weijia Dou,Wenzhao Zheng,Weiliang Chen,Yu Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

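TL;DR: This paper proposes SGC, a metric for evaluating the 3D spatial geometric consistency of generated videos. It estimates camera poses from different local regions and quantifies geometric inconsistency by the divergence among them.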

Details Motivation: Existing evaluation methods cannot accurately characterize 3D spatial geometric inconsistencies in generated videos: fidelity-centric metrics such as FVD are insensitive to geometric distortion, while consistency-focused benchmarks often penalize valid foreground dynamics. Method: SGC first separates static from dynamic regions and partitions the static background into spatially coherent sub-regions; it then predicts per-pixel depth, estimates a local camera pose for each sub-region, and computes the divergence among these poses to quantify geometric consistency. Result: Experiments show that SGC robustly quantifies geometric inconsistencies and identifies critical failures missed by other metrics. Conclusion: SGC is a novel, effective, and robust metric for the 3D spatial geometric consistency of generated videos that fills a gap left by existing evaluation methods. Abstract: Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each sub-region, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
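
The final step, measuring divergence among per-region camera poses, can be sketched under simplifying assumptions: only the rotation part of each pose is compared, and divergence is taken as the mean pairwise geodesic angle between rotation matrices. The abstract does not define the divergence used by SGC, so `pose_divergence` is a hypothetical stand-in.

```python
import math
from itertools import combinations


def rotation_angle(r1, r2):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    # trace(r1^T r2) equals the sum of elementwise products.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.acos(cos_theta)


def pose_divergence(rotations):
    """Mean pairwise rotation angle; 0.0 means all regions agree."""
    pairs = list(combinations(rotations, 2))
    if not pairs:
        return 0.0
    return sum(rotation_angle(a, b) for a, b in pairs) / len(pairs)


def rot_z(theta):
    """Rotation about the z-axis, used here only to build toy poses."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


consistent = pose_divergence([rot_z(0.1), rot_z(0.1), rot_z(0.1)])
inconsistent = pose_divergence([rot_z(0.0), rot_z(math.pi / 2)])
```

In a geometrically consistent static scene every sub-region should imply nearly the same camera motion, so the score stays near zero; sub-regions that disagree about the camera drive it up.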

[161] SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

Phuc Pham,Uy Dieu Tran,Binh-Son Hua,Phong Nguyen

Main category: cs.CV

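TL;DR: This paper proposes SwiftTailor, a two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation, significantly improving the speed and quality of 3D garment generation.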

Details Motivation: Existing methods rely on large vision-language models to generate 2D sewing patterns that are then converted into 3D meshes; quality is high but inference is slow (30 seconds to 1 minute), which rules out real-time or large-scale use. Method: SwiftTailor consists of two lightweight modules: PatternMaker, an efficient multimodal vision-language model that predicts sewing patterns, and GarmentSewer, a dense prediction transformer that generates a garment geometry image in a unified UV space. The 3D mesh is then reconstructed directly through inverse mapping, remeshing, and dynamic stitching, avoiding the cost of physical simulation. Result: SwiftTailor reaches state-of-the-art accuracy and visual fidelity on Multimodal GarmentCodeData while greatly reducing inference time. Conclusion: SwiftTailor offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation. Abstract: Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling frameworks such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

[162] Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng,Xin Ding,Yifan Yang,Shiqi Jiang,Hao Wu,Qianxi Zhang,Weijun Wang,Ting Cao,Yunxin Liu

Main category: cs.CV

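TL;DR: This paper proposes Em-Garde, a framework that decouples semantic understanding from streaming perception to improve the accuracy and efficiency of proactive responses in streaming video understanding.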

Details Motivation: Existing proactive VideoLLMs make a triggering decision on every frame and therefore face a trade-off between efficiency and accuracy. Method: Em-Garde has two parts: at query time, an Instruction-Guided Proposal Parser converts the user query into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Result: Experiments on StreamingBench and OVO-Bench show that Em-Garde consistently outperforms prior models in both proactive response accuracy and efficiency. Conclusion: Em-Garde is an effective solution for proactive video understanding under strict computational constraints. Abstract: Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
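
The streaming-time matching step can be sketched as a cosine-similarity check between a frame embedding and proposal embeddings that were computed once at query time. The threshold value and the names `cosine` and `should_trigger` are illustrative assumptions; the abstract does not describe the actual matching rule.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_trigger(frame_emb, proposal_embs, threshold=0.8):
    """Trigger a response when any proposal matches the current frame.

    threshold is a hypothetical hyperparameter.
    """
    return any(cosine(frame_emb, p) >= threshold for p in proposal_embs)


proposals = [[1.0, 0.0], [0.0, 1.0, ][:2]]  # precomputed once per query
fires = should_trigger([0.9, 0.1], proposals)
```

The point of the split is cost: the expensive language-side parsing runs once per query, while each incoming frame needs only a few dot products.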

[163] SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

Oliver Cory,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

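TL;DR: This paper proposes SignAgent, a framework that uses large language models (LLMs) for scalable, linguistically grounded sign language (SL) annotation and dataset curation. An Orchestrator coordinates a tool chain and SignGraph supplies lexical and linguistic grounding; the framework performs strongly on pseudo-gloss annotation and ID glossing.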

Details Motivation: Traditional computational approaches to sign language are confined to the gloss level and overlook crucial linguistic detail, while manual linguistic annotation is too slow and expensive to support large-scale, phonologically aware datasets. Method: The SignAgent framework comprises the SignAgent Orchestrator (a reasoning LLM that coordinates several linguistic tools) and SignGraph (a knowledge-grounded LLM that provides lexical and linguistic grounding); it is evaluated on two downstream tasks, pseudo-gloss annotation and ID glossing. Result: On pseudo-gloss annotation (extracting and ordering gloss labels from multi-modal evidence) and ID glossing (detecting and refining clusters by combining visual similarity with phonological overlap), SignAgent achieves strong performance for large-scale, linguistically aware annotation and curation. Conclusion: SignAgent offers an efficient and practical approach to large-scale, linguistically fine-grained annotation of sign language data, addressing the twin bottlenecks of linguistic depth and annotation cost. Abstract: This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

[164] DROID-SLAM in the Wild

Moyang Li,Zihan Zhu,Marc Pollefeys,Daniel Barath

Main category: cs.CV

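TL;DR: This paper proposes a real-time RGB SLAM system based on differentiable uncertainty-aware bundle adjustment. It estimates per-pixel uncertainty from multi-view visual feature inconsistency, enabling robust tracking and reconstruction in dynamic, cluttered environments.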

Details Motivation: Traditional SLAM assumes a static scene and tends to fail in dynamic environments; existing dynamic SLAM methods depend on predefined dynamic priors or uncertainty-aware mapping and remain limited with unknown dynamic objects or highly cluttered scenes. Method: The authors propose a differentiable uncertainty-aware bundle adjustment that estimates per-pixel uncertainty from multi-view visual feature inconsistency, yielding robust SLAM in dynamic environments. Result: The system achieves state-of-the-art camera pose and scene geometry accuracy in cluttered dynamic scenes while running in real time at around 10 FPS. Conclusion: The method markedly improves the robustness and practicality of RGB SLAM in dynamic environments and points toward real-world deployment. Abstract: We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.
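
Why per-pixel uncertainty helps can be seen in a one-dimensional analogue of uncertainty-weighted optimization. The sketch below uses standard inverse-variance weighting to estimate a single scalar from noisy observations; it is not the paper's bundle adjustment, and `weighted_estimate` is a hypothetical name.

```python
def weighted_estimate(observations, sigmas):
    """Inverse-variance weighted least-squares estimate of a scalar.

    Minimizes sum((x - y_i)^2 / sigma_i^2); observations with large
    sigma (for example, pixels on moving objects) barely affect x.
    """
    weights = [1.0 / (s * s) for s in sigmas]
    return sum(w * y for w, y in zip(weights, observations)) / sum(weights)


# Two reliable static observations and one outlier flagged as uncertain.
uniform = weighted_estimate([1.0, 1.0, 10.0], [1.0, 1.0, 1.0])
robust = weighted_estimate([1.0, 1.0, 10.0], [1.0, 1.0, 1000.0])
```

With uniform weights the outlier pulls the estimate to 4.0; once it is assigned a large uncertainty, the estimate stays close to 1.0.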

[165] Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline

Ye Wang,Wei Lu,Zhihui You,Keyan Chen,Tongfei Liu,Kaiyu Li,Hongruixuan Chen,Qingling Shu,Sibao Chen

Main category: cs.CV

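TL;DR: This paper introduces LSMD, a new multi-modal building change detection dataset, and MSCNet, a network that fuses RGB and near-infrared (NIR) information to improve the detection of small-scale changes.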

Details Motivation: Existing change detection methods are sensitive to illumination, seasonal, and surface-material variation, and RGB-only imagery tends to produce pseudo-changes and semantic ambiguity; multi-modal datasets often lack high-resolution, accurately registered bi-temporal imagery, and current methods do not fully exploit the heterogeneity between the RGB and NIR modalities. Method: The authors build LSMD, a large-scale small-change multi-modal dataset, and propose the Multi-modal Spectral Complementarity Network (MSCNet), which comprises a Neighborhood Context Enhancement Module (NCEM), a Cross-modal Alignment and Interaction Module (CAIM), and a Saliency-aware Multisource Refinement Module (SMRM). Result: Experiments show that MSCNet outperforms existing methods under multiple input configurations and improves fine-grained building change detection. Conclusion: Fusing RGB and NIR with a purpose-built network markedly improves the robustness and accuracy of small-change detection in complex scenes, and LSMD provides a new benchmark for multi-modal change detection. Abstract: Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD

[166] TAU-R1: Visual Language Model for Traffic Anomaly Understanding

Yuqiang Lin,Kehua Chen,Sam Lockyer,Arjun Yadav,Mingxuan Sui,Shucheng Zhang,Yan Shi,Bingzhang Wang,Yuang Zhang,Markus Zarbock,Florain Stanek,Adrian Evans,Wenbin Li,Yinhai Wang,Nic Zhang

Main category: cs.CV

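TL;DR: This paper introduces the Roundabout-TAU dataset and the TAU-R1 model for Traffic Anomaly Understanding (TAU). TAU-R1 pairs a lightweight classifier with a larger reasoner and is trained with a two-stage strategy to improve performance.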

Details Motivation: Research on Traffic Anomaly Understanding (TAU) is held back by the lack of real-world benchmark datasets and task-specific methods, which limits the effective use of VLMs on this task. Method: The authors build Roundabout-TAU, a dataset of real roundabout videos (342 clips, more than 2,000 question-answer pairs); propose TAU-R1, a two-layer VLM framework (lightweight classifier plus larger reasoner); and design a two-stage training strategy: decomposed-QA-enhanced supervised fine-tuning followed by GRPO-based post-training with TAU-specific reward functions. Result: TAU-R1 achieves strong performance on both anomaly classification and reasoning while remaining efficient to deploy; the dataset and code are open-sourced. Conclusion: Roundabout-TAU fills the gap in real-world TAU benchmarks, and TAU-R1 with its training strategy offers a template for applying VLMs to specific safety-critical tasks. Abstract: Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1
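
The GRPO-based stage relies on group-relative advantages: several responses are sampled for the same prompt, each is scored by a reward function, and every reward is normalized against its own group. The sketch below shows that normalization only; the TAU-specific reward functions are not described in the abstract, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question; two were judged correct.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat their group average receive positive advantages and the rest negative ones; when all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.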

[167] CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Weilin Chen,Jiahao Rao,Wenhao Wang,Xinyang Li,Xuan Cheng,Liujuan Cao

Main category: cs.CV

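TL;DR: CustomTex is a reference-image-driven framework for instance-level, high-fidelity texturing of 3D indoor scenes. It combines semantic-level and pixel-level distillation within a Variational Score Distillation (VSD) framework to produce a unified, high-quality texture map with few artifacts and no baked-in shading.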

Details Motivation: Existing text-driven 3D scene texturing methods lack fine-grained, instance-level control and produce low-quality textures with artifacts and baked-in shading. Method: CustomTex uses a dual-distillation strategy: semantic-level distillation (with instance cross-attention) ensures semantic plausibility and reference-instance alignment, and pixel-level distillation improves visual fidelity; both are unified in a Variational Score Distillation (VSD) optimization framework. Result: Across experiments, CustomTex achieves more precise instance-level consistency with the references and yields sharper textures with fewer artifacts and no baked-in shading, clearly outperforming state-of-the-art methods. Conclusion: CustomTex offers a more direct, user-friendly path to high-quality, customizable 3D scene appearance editing. Abstract: The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

[168] Revisiting Autoregressive Models for Generative Image Classification

Ilia Sudakov,Artem Babenko,Dmitry Baranchuk

Main category: cs.CV

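TL;DR: This paper proposes a class-conditional generative classifier built on any-order autoregressive (AR) models. Marginalizing over multiple token orders improves classification, surpassing diffusion-based classifiers at higher efficiency.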

Details Motivation: Existing visual autoregressive generative classifiers depend on a fixed token order, which imposes an overly strong inductive bias and limits image understanding; averaging over multiple orders provides a more comprehensive discriminative signal. Method: The approach uses recent any-order AR models to make predictions under different token orders and marginalizes over them, yielding more robust class-conditional generative classification. Result: The method consistently outperforms diffusion-based classifiers on multiple image classification benchmarks while being up to 25x more efficient at inference, and it is competitive with state-of-the-art self-supervised discriminative models. Conclusion: Order marginalization unlocks strong classification ability in AR generative models, challenging the dominance of diffusion models in generative classification while balancing performance and efficiency. Abstract: Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance - a notable achievement for generative classifiers.
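
Order marginalization can be sketched as follows, assuming a callable that returns the class-conditional log-likelihood of an image under a given token order. The score for each class is the log of the likelihood averaged over orders (log-mean-exp), and the predicted class maximizes it under a uniform class prior. The averaging rule, the toy likelihood table, and all names here are illustrative assumptions rather than the paper's exact estimator.

```python
import math


def order_marginalized_classify(log_lik, classes, orders):
    """Return (best_class, scores) using likelihoods averaged over orders.

    log_lik(cls, order) -> log p(x | cls, order); hypothetical interface.
    """
    scores = {}
    for cls in classes:
        vals = [log_lik(cls, order) for order in orders]
        peak = max(vals)  # subtract the max for numerical stability
        avg = sum(math.exp(v - peak) for v in vals) / len(vals)
        scores[cls] = peak + math.log(avg)
    return max(scores, key=scores.get), scores


# Toy log-likelihoods: the raster order alone prefers the wrong class.
table = {("cat", "raster"): -10.0, ("cat", "reverse"): -2.0,
         ("dog", "raster"): -4.0, ("dog", "reverse"): -5.0}
label, scores = order_marginalized_classify(
    lambda c, o: table[(c, o)], ["cat", "dog"], ["raster", "reverse"])
```

In this toy table a raster-only classifier would choose "dog", while the marginalized score favors "cat" because the second order supplies strong evidence that the first one misses.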

[169] GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

Yiren Lu,Yi Du,Disheng Liu,Yunlai Zhou,Chen Wang,Yu Yin

Main category: cs.CV

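TL;DR: This paper proposes GSMem, a framework that uses 3D Gaussian Splatting (3DGS) to build a persistent spatial memory with a "Spatial Recollection" capability, supporting zero-shot embodied exploration and reasoning. It localizes targets by combining scene graphs with semantic language fields, mixes VLM semantic scoring with 3DGS geometric coverage for hybrid exploration, and is validated on embodied question answering and lifelong navigation.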

Details Motivation: Existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability, so a target missed at first observation cannot be recovered; agents need a memory mechanism that accumulates spatial knowledge continuously and can be queried flexibly. Method: GSMem builds a persistent spatial memory of continuous geometry and appearance on 3D Gaussian Splatting; designs a retrieval mechanism that combines object-level scene graphs with semantic-level language fields to localize targets; and introduces a hybrid exploration strategy that couples VLM-driven semantic scoring with a 3DGS coverage objective. Result: On Embodied QA and Lifelong Navigation, GSMem significantly outperforms existing methods, with more robust target localization, more efficient exploration, and better long-term memory retention. Conclusion: 3D Gaussian Splatting is a well-suited spatial memory for embodied agents, and its Spatial Recollection capability enables zero-shot, long-horizon, task-adaptive embodied exploration. Abstract: Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.

[170] ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Kwanyoung Lee,Hyunwoo Oh,SeungJu Cha,Sungho Koh,Dong-Jin Kim

Main category: cs.CV

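TL;DR: This paper proposes ADAPT, a training-free, deterministic framework that uses attention scores and orthogonal components to optimize prompt scheduling, improving how diffusion models generate rare compositional concepts.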

Details Motivation: Diffusion models struggle to generate rare compositional concepts that are uncommon in the training data; existing methods such as R2F are limited by the randomness of language models and by suboptimal guidance from text embedding switching. Method: ADAPT uses attention scores and orthogonal components to plan prompt schedules deterministically and align them semantically, with no additional training or fine-tuning. Result: ADAPT significantly improves compositional generation of rare concepts on the RareBench benchmark, accurately reflects the semantics of rare attributes, and preserves visual integrity. Conclusion: ADAPT is a deterministic, precise, training-free way to strengthen control over rare compositional concepts in diffusion models. Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.
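
The "orthogonal components" mentioned above can be illustrated with plain vector projection: given a rare-prompt embedding and a frequent-prompt embedding, the orthogonal component is what remains of the former after removing its projection onto the latter. How ADAPT actually builds and interpolates these components is not stated in the abstract, so the sketch below shows only the linear algebra, with hypothetical names.

```python
def orthogonal_component(v, u):
    """Part of v orthogonal to u: v - (v.u / u.u) * u."""
    uu = sum(x * x for x in u)
    if uu == 0:
        return list(v)  # nothing to project out
    k = sum(a * b for a, b in zip(v, u)) / uu
    return [a - k * b for a, b in zip(v, u)]


# Toy 2-D embeddings: axis 0 = shared content, axis 1 = rare attribute.
rare = [1.0, 1.0]
frequent = [1.0, 0.0]
rare_only = orthogonal_component(rare, frequent)
```

Here `rare_only` keeps just the attribute axis, the direction that distinguishes the rare prompt from the frequent one.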

[171] Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Kwanyoung Lee,SeungJu Cha,Yebin Ahn,Hyunwoo Oh,Sungho Koh,Dong-Jin Kim

Main category: cs.CV

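TL;DR: This paper proposes Adaptive Auxiliary Prompt Blending (AAPB), a training-free framework that improves semantic alignment and structural consistency for diffusion models in low-density concept generation and image editing.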

Details Motivation: Diffusion models perform poorly on concepts or editing instructions that are sparse (low-density) in the training data, a consequence of the long-tailed distribution of text-image datasets. Method: AAPB uses auxiliary anchor prompts for semantic or structural support and, based on Tweedie's identity, derives a closed-form adaptive interpolation coefficient at each diffusion step that optimally balances the target and auxiliary prompts. Result: On RareBench and FlowEdit, AAPB consistently improves semantic accuracy and structural fidelity over fixed interpolation and other training-free baselines. Conclusion: AAPB is a principled, training-free prompt blending method that applies uniformly to rare concept generation and image editing and mitigates the degradation of diffusion models under long-tailed distributions. Abstract: Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

[172] ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Zhan Jin,Yu Luo,Yizhou Zhang,Ziyang Cui,Yuqing Wei,Xianchao Liu,Xueying Zeng,Qing Zhang

Main category: cs.CV

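TL;DR: This paper proposes ARIADNE, a framework that couples preference-aligned perception with reinforcement-learning-based diagnostic reasoning to address topological inconsistency in coronary vessel segmentation. It fine-tunes a vision-language model with DPO under Betti number constraints and formulates stenosis localization as an MDP with a rejection mechanism, achieving state-of-the-art results on multiple datasets.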

Details Motivation: Conventional pixel-wise loss functions cannot guarantee topological consistency in coronary vessel segmentation, so vascular trees come out fragmented despite high pixel-level accuracy, which undermines the reliability of clinical diagnosis. Method: ARIADNE is a two-stage framework. The perception module fine-tunes the Sa2VA model with Direct Preference Optimization (DPO), using Betti numbers as preference signals to align the model toward geometrically complete structures. The reasoning module formulates stenosis localization as a Markov Decision Process (MDP) with an explicit rejection mechanism that defers ambiguous anatomy such as bifurcations and vessel crossings. Result: On 1,400 clinical angiograms, centerline Dice reaches 0.838 and false positives fall by 41% relative to geometric baselines; external validation on ARCADE and XCAD confirms generalization across acquisition protocols. Conclusion: This is the first application of DPO to topological alignment in medical imaging, showing that preference learning over structural constraints can substantially reduce topological errors without sacrificing diagnostic sensitivity, improving the reliability of interventional cardiology workflows. Abstract: Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
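
To make the Betti number signal concrete, the sketch below counts connected components (the zeroth Betti number) of a binary vessel mask and prefers the candidate segmentation with fewer fragments. Using only Betti-0, 8-connectivity, and a strict "fewer components wins" rule are simplifying assumptions; the abstract does not say which Betti numbers are used or how preference pairs are formed.

```python
def betti0(mask):
    """Count 8-connected foreground components in a binary 2-D mask."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            count += 1  # new component found; flood-fill it
            seen.add((r, c))
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return count


def prefer(mask_a, mask_b):
    """Return 'a' or 'b' for the less fragmented mask, or None on a tie."""
    a, b = betti0(mask_a), betti0(mask_b)
    return None if a == b else ("a" if a < b else "b")


intact = [[1, 1, 1, 1]]
broken = [[1, 1, 0, 1]]
chosen = prefer(intact, broken)
```

The two masks differ by a single pixel, so an overlap score barely separates them, yet the component count distinguishes one connected vessel from two fragments.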

[173] Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Yiren Lu,Xin Ye,Burhaneddin Yaman,Jingru Luo,Zhexiao Xiong,Liu Ren,Yu Yin

Main category: cs.CV

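TL;DR: This paper proposes Splat2BEV, a framework that adds explicit 3D Gaussian Splatting reconstruction as a preliminary task to improve the geometric precision and semantic richness of Bird's-Eye-View (BEV) perception, reaching state-of-the-art results on nuScenes and Argoverse.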

Details Motivation: Existing end-to-end BEV perception methods treat the whole pipeline as a black box and lack explicit 3D geometric understanding and interpretability, which limits performance. Method: Splat2BEV first pre-trains a Gaussian generator that explicitly reconstructs the 3D scene from multi-view images and produces geometry-aligned features, then projects those features into BEV space for downstream tasks. Result: On nuScenes and Argoverse, Splat2BEV achieves state-of-the-art performance on BEV tasks including semantic segmentation, 3D object detection, and motion prediction. Conclusion: An explicit 3D representation is important for accurate and interpretable BEV perception, and using 3D reconstruction as an auxiliary task strengthens the geometric consistency and semantic quality of BEV features. Abstract: Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

[174] Tinted Frames: Question Framing Blinds Vision-Language Models

Wan-Cyuan Fan,Jiayun Luo,Declan Kutscher,Leonid Sigal,Ritwik Gupta

Main category: cs.CV

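TL;DR: This paper shows that vision-language models (VLMs) are "selectively blind": their attention to the image depends strongly on how the text prompt is framed (multiple choice or yes/no versus open-ended), which misallocates visual attention and lowers performance. The authors propose a lightweight, learnable prompt-tuning method that improves visual grounding and performance across framings.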

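Details Motivation: Prior work shows that VLMs often neglect visual input on tasks that require visual reasoning; the authors argue that the problem is not total blindness but selective blindness induced by linguistic framing, which needs to be understood and addressed at the level of attention. Method: Using visual attention as a probe, they quantify how attention strength and spatial distribution over the image differ across framings (multiple choice, yes/no, open-ended), then design a lightweight prompt-tuning method with learnable tokens that steers the model toward the robust visual attention patterns seen under open-ended framing. Result: Constrained framings (multiple choice, yes/no) substantially reduce attention to image context, weaken focus on task-relevant regions, and shift attention toward uninformative tokens; this misallocation is the principal cause of accuracy drops and cross-framing inconsistency. The proposed method improves visual grounding and cross-framing consistency across multiple VLMs and benchmarks. Conclusion: VLMs' underuse of vision stems from framing-induced attention bias rather than an inherent architectural flaw, and targeted prompt tuning can correct attention allocation and improve robustness and generalization. Abstract: Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.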
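
A simple version of the attention probe is the share of attention mass that falls on image tokens, compared across framings of the same question. The sketch below assumes a single attention vector over the input tokens and a boolean mask marking image tokens; the attention values are made up for illustration, and the paper's actual measurements (layers, heads, aggregation) are not specified in the abstract.

```python
def image_attention_fraction(attention, is_image_token):
    """Fraction of total attention mass assigned to image tokens."""
    total = sum(attention)
    if total == 0:
        return 0.0
    on_image = sum(a for a, m in zip(attention, is_image_token) if m)
    return on_image / total


mask = [True, True, False, False]  # two image tokens, two text tokens
open_ended = image_attention_fraction([0.4, 0.3, 0.2, 0.1], mask)
yes_no = image_attention_fraction([0.1, 0.1, 0.5, 0.3], mask)
```

In this made-up example the open-ended framing puts most of its attention on the image and the yes/no framing puts most of it on text, which is the kind of gap the probe is meant to expose.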

[175] RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Yue Gong,Hongyu Li,Shanyuan Liu,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Manyuan Zhang,Dawei Leng,Yuhui Yin,Lijun Zhang

Main category: cs.CV

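TL;DR: This paper proposes the Representation-Pivoted AutoEncoder (RPiAE), a fine-tunable tokenizer built on pretrained visual representations. Through Representation-Pivot Regularization and a variational bridge, it preserves semantic structure while improving reconstruction fidelity and compressing the latent space, which improves diffusion-based generation and editing.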

Details Motivation: Existing tokenizers that reuse frozen pretrained representation encoders suffer from low reconstruction fidelity, poor editing quality, and overly high-dimensional latents that make diffusion modeling difficult. Method: The Representation-Pivoted AutoEncoder (RPiAE) combines (1) Representation-Pivot Regularization, which constrains a representation-initialized encoder to keep its original semantic structure while it is fine-tuned; (2) a variational bridge that further compresses the latent space; and (3) an objective-decoupled, stage-wise training strategy that optimizes generative tractability and reconstruction fidelity in turn. Result: RPiAE outperforms other visual tokenizers on text-to-image generation and image editing and achieves the best reconstruction fidelity among representation-based tokenizers. Conclusion: RPiAE balances semantic preservation, reconstruction accuracy, and diffusion modeling efficiency, giving diffusion models a better latent representation. Abstract: Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose the Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge which compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

[176] Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Shang-Jui Ray Kuo,Paola Cascante-Bonilla

Main category: cs.CV

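TL;DR: This paper examines state space models (SSMs) as vision backbones for vision-language models (VLMs). It finds that they perform strongly on VQA and localization and stay competitive at smaller scale, that higher ImageNet accuracy or larger models do not always bring better VLM performance, and it proposes stabilization strategies that improve localization robustness.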

Details Motivation: To explore whether state space models (SSMs) can serve as a strong alternative to the standard transformer vision backbone, improving VLM performance and efficiency across tasks such as VQA and localization. Method: SSM vision backbones are evaluated systematically in VLMs under controlled conditions, including a comparison under matched ImageNet-1K initialization and adaptation with detection or segmentation training, together with an analysis of how stability relates to performance across backbones. Result: The SSM backbone achieves the strongest overall performance on VQA and localization; after dense-task tuning it remains competitive with fewer parameters; ImageNet accuracy shows no strong correlation with VLM performance, and some backbones are unstable in localization. Conclusion: SSM vision backbones are a strong alternative to transformer vision encoders, and combining them with stabilization strategies improves VLM robustness and practicality. Abstract: Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

[177] DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Tianjiao Yu,Xinzhuo Li,Muntasir Wahed,Jerry Xiong,Yifan Shen,Ying Shen,Ismini Lourentzou

Main category: cs.CV

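TL;DR: DreamPartGen is a semantically grounded, part-aware text-to-3D generation framework. It models part geometry and appearance with Duplex Part Latents (DPLs) and inter-part semantic relations with Relational Semantic Latents (RSLs), enforces geometric and semantic consistency through synchronized co-denoising, and reaches state-of-the-art geometric fidelity and text-shape alignment.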

Details Motivation: Existing text-to-3D methods overlook the semantic and functional part structure of 3D objects. Recent part-aware methods introduce decomposition but remain geometry-focused, lack semantic grounding, and do not model how parts align with textual descriptions or with one another. Method: The DreamPartGen framework comprises Duplex Part Latents (which jointly model part geometry and appearance) and Relational Semantic Latents (which capture inter-part dependencies derived from language), plus a synchronized co-denoising process that enforces geometric and semantic consistency. Result: The method achieves state-of-the-art geometric fidelity and text-shape alignment on multiple benchmarks. Conclusion: DreamPartGen delivers high-quality 3D generation that is semantically interpretable, part-controllable, and text-aligned, advancing structured 3D content generation for embodied AI and human-computer interaction. Abstract: Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

[178] LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao,Yuhua Zheng,Jia Xu,Wenjie Du,Kele Shao,Hesong Wang,Xueyi Chen,Xin Jin,Junhan Zhu,Bohan Yu,Weiqiang Wang,Jian Liu,Can Qin,Yulun Zhang,Ming-Hsuan Yang,Huan Wang

Main category: cs.CV

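TL;DR: This paper introduces LVOmniBench, the first benchmark designed specifically for cross-modal understanding of long-form audio and video. It contains 275 high-quality videos of 10 to 90 minutes and 1,014 question-answer pairs, and evaluates OmniLLMs on long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Experiments show that current models perform poorly (open-source models below 35%, Gemini 3 Pro at about 65%); the benchmark aims to advance research on long-form audio-visual understanding.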

Details Motivation: Existing OmniLLM evaluations concentrate on short clips (10 seconds to 5 minutes) and do not reflect real-world demand for understanding videos that run for tens of minutes, leaving a critical evaluation gap. Method: The LVOmniBench benchmark is built by selecting audio-visual content with rich dynamics from open platforms, followed by manual filtering and annotation, yielding 275 videos of 10 to 90 minutes and 1,014 QA pairs, with an evaluation protocol that covers long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Result: Current OmniLLMs perform poorly on long-form audio-visual understanding: mainstream open-source models score below 35% accuracy, and the strongest commercial model, Gemini 3 Pro, reaches only about 65%, confirming that long-form cross-modal understanding remains a major challenge. Conclusion: LVOmniBench fills the gap in evaluating long-form audio-visual cross-modal understanding, and its dataset and empirical results should encourage the development of OmniLLMs with long-horizon modeling and deeper cross-modal reasoning. Abstract: Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

[179] Rethinking Vector Field Learning for Generative Segmentation

Chaoyang Wang,Yaobo Liang,Boci Peng,Fan Duan,Jingdong Wang,Yunhai Tong

Main category: cs.CV

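TL;DR: This paper proposes a diffusion-based approach built on vector field reshaping and a distance-aware correction to improve generative semantic segmentation. It addresses gradient vanishing and trajectory traversing, adds an efficient category encoding scheme, and substantially narrows the performance gap between generative and discriminative methods while keeping the original training framework.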

Details Motivation: Diffusion models for generative semantic segmentation face an intrinsic mismatch between continuous flow matching objectives and discrete perception tasks, which shows up as gradient vanishing and trajectory traversing and leads to slow convergence and poor class separation. Method: The paper proposes a vector field reshaping strategy that adds a detached, distance-aware correction term to the velocity field, introducing attractive and repulsive interactions that strengthen gradients near centroids; it also designs a quasi-random category encoding scheme based on Kronecker sequences that fits an end-to-end pixel neural field framework. Result: The method significantly outperforms vanilla flow matching on multiple benchmarks and substantially narrows the gap between generative segmentation and strong discriminative models (such as Transformer-based specialists). Conclusion: Revisiting diffusion segmentation from the perspective of vector field learning is effective; the proposed correction and encoding scheme improve the discriminative ability and convergence of generative segmentation without changing the original training paradigm. Abstract: Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
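The abstract does not spell out the category encoding, so the following is only a rough illustration of what a Kronecker-sequence code can look like. A Kronecker (additive recurrence) sequence places the n-th point at the fractional part of n times a vector of irrational steps. The function names, the choice of square roots of primes as steps, and the nearest-code decoding are all assumptions made for this sketch, not details taken from the paper.

```python
import math

def kronecker_codes(num_classes, dim=3):
    """Return one point in [0, 1)^dim per class from a Kronecker sequence."""
    # Assumed irrational steps: fractional parts of square roots of small primes.
    primes = [2, 3, 5, 7, 11, 13, 17, 19]
    if dim > len(primes):
        raise ValueError("this sketch supports at most 8 dimensions")
    alphas = [math.sqrt(p) % 1.0 for p in primes[:dim]]
    # Class n gets the fractional part of (n + 1) * alpha in every dimension.
    return [[((n + 1) * a) % 1.0 for a in alphas] for n in range(num_classes)]

def nearest_class(vector, codes):
    """Hypothetical decoding step: pick the class whose code is closest."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codes)), key=lambda idx: sq_dist(codes[idx]))

codes = kronecker_codes(num_classes=20, dim=3)
print(nearest_class(codes[7], codes))
```

The appeal of such a scheme is that codes are cheap to compute for any number of classes and spread fairly evenly over the unit cube, which is one plausible reason to prefer it over learned or random codes.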

[180] DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Dong Zhuo,Wenzhao Zheng,Sicheng Zuo,Siming Yan,Lu Hou,Jie Zhou,Jiwen Lu

Main category: cs.CV

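TL;DR: This paper proposes DriveTok, an efficient 3D visual tokenizer for multi-view autonomous driving scenes. It converts features from vision foundation models into unified scene tokens through 3D deformable cross-attention and supports multi-task reconstruction and prediction.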

Details Motivation: Existing image tokenizers mainly target monocular 2D scenes and are inefficient and inconsistent across views when applied to high-resolution multi-view driving scenes. Method: DriveTok extracts semantically rich features with vision foundation models and generates scene tokens through 3D deformable cross-attention; for decoding, a multi-view transformer reconstructs multi-view features, and several heads produce RGB, depth, and semantic reconstructions as well as 3D semantic occupancy prediction. Result: On the nuScenes dataset, the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction. Conclusion: DriveTok provides a unified multi-view scene token representation that integrates semantic, geometric, and textural information, improving the scalability and spatial awareness of the visual modality in autonomous driving systems. Abstract: With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

[181] Spectrally-Guided Diffusion Noise Schedules

Carlos Esteves,Ameesh Makadia

Main category: cs.CV

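TL;DR: This paper proposes a way to design per-instance noise schedules for pixel diffusion models from the spectral properties of each image. It derives theoretical bounds that determine the useful range of noise levels and conditionally samples the schedule at inference, which markedly improves generation quality at low step counts.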

Details Motivation: Noise schedules for diffusion models are mostly handcrafted, need manual tuning across resolutions, and do not adapt to image content. Method: Theoretical bounds on the efficacy of minimum and maximum noise levels are derived from the image's spectral properties and used to build "tight" noise schedules that eliminate redundant sampling steps; the schedule is conditionally sampled at inference. Result: On single-stage pixel diffusion models, the proposed schedules improve generation quality, with the clearest gains in the low-step regime. Conclusion: Noise schedules should adapt to image content rather than being fixed and uniform; spectrally guided per-instance schedules are a principled, efficient, and practical improvement. Abstract: Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.

[182] EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Yang Fu,Yike Zheng,Ziyun Dai,Henghui Ding

Main category: cs.CV

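TL;DR: This paper introduces the VOR dataset and the EffectErase method for high-quality removal of target objects in video together with their visual effects (such as deformation, shadows, and reflections). VOR is the first large-scale paired video dataset covering multiple effect types and complex scenes; EffectErase uses an effect-aware reciprocal learning framework with task-aware region guidance and an insertion-removal consistency constraint, and significantly outperforms existing methods in extensive experiments.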

Details Motivation: Existing diffusion models can remove target objects from video but struggle to fully erase the accompanying visual effects (such as shadows, reflections, and deformation), and the field lacks a high-quality paired dataset that systematically covers these effects for training and evaluation. Method: The authors build VOR, a large-scale paired video dataset (60K high-quality video pairs, five effect types, dynamic multi-object scenes). On top of it they propose EffectErase, which treats video object insertion as the inverse auxiliary task, introduces task-aware region guidance that focuses learning on affected areas, and adds an insertion-removal consistency loss that encourages shared localization of effect regions and structural cues. Result: Trained on VOR, EffectErase removes video objects and their effects with high quality across diverse scenarios and clearly outperforms existing video inpainting and object removal methods overall. Conclusion: VOR fills the gap left by the absence of a systematic benchmark for video object effect removal; through reciprocal learning and consistency modeling, EffectErase improves the completeness of effect erasing and the coherence of the restored background, offering a new paradigm for the task. Abstract: Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. An insertion-removal consistency objective then encourages complementary behaviors and shared localization of effect regions and structural cues.
Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

[183] Under One Sun: Multi-Object Generative Perception of Materials and Illumination

Nobuo Yoshii,Xinran Nicole Han,Ryo Kawahara,Todd Zickler,Ko Nishino

Main category: cs.CV

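TL;DR: MultiGP is a generative inverse rendering method that stochastically samples the reflectance, texture, and illumination of multiple objects from a single image. It exploits the prior that objects in the same scene share one illumination and achieves disentanglement with a cascaded architecture, Coordinated Guidance, Axial Attention, and a ControlNet.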

Details Motivation: To resolve the inherently ambiguous disentanglement of radiometric constituents (reflectance, texture, illumination) in a single image by exploiting the consensus prior that objects in the same scene are lit by the same illumination. Method: The paper proposes Multi-Object Generative Perception (MultiGP), built on four techniques: (1) a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; (2) Coordinated Guidance that steers the diffusion model toward a single consistent illumination estimate; (3) Axial Attention that enables "cross-talk" between objects of different reflectance; and (4) a Texture Extraction ControlNet that preserves high-frequency texture detail while decoupling it from lighting. Result: Experiments show that MultiGP effectively uses the complementary spatial and frequency characteristics of multiple object appearances to recover each object's texture and reflectance as well as the shared scene illumination. Conclusion: MultiGP offers a new paradigm for generative perception of multiple objects in a single image, markedly improving the accuracy and robustness of radiometric disentanglement. Abstract: We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents (reflectance, texture, and illumination) underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.

[184] Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Chenyang Gu,Mingyuan Zhang,Haozhe Xie,Zhongang Cai,Lei Yang,Ziwei Liu

Main category: cs.CV

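TL;DR: This paper proposes a three-stage motion generation framework (Perception, Planning, Control) centered on MoTok, a diffusion-based discrete motion tokenizer that balances semantic conditioning with motion fidelity and significantly improves controllability and accuracy on HumanML3D.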

Details Motivation: Existing motion generation methods split into continuous diffusion models (strong at kinematic control) and discrete token generators (well suited to semantic conditioning); neither captures both advantages, so the two need to be combined. Method: The paper proposes a three-stage framework: (1) a perception stage that extracts condition features; (2) a planning stage that generates discrete tokens (using the MoTok tokenizer); and (3) a control stage that synthesizes motion with a diffusion model. MoTok decouples semantic abstraction from fine-grained reconstruction and recovers motion with a diffusion decoder, which yields compact single-layer tokens with high fidelity. Result: On HumanML3D, compared with MaskControl, the method uses one-sixth of the tokens, reduces trajectory error from 0.72 cm to 0.08 cm, and lowers FID from 0.083 to 0.029; under stronger kinematic constraints FID drops further from 0.033 to 0.014, so fidelity improves rather than degrades. Conclusion: The framework combines semantic conditioning with kinematic control; MoTok's design decouples abstraction from reconstruction and markedly improves the controllability, fidelity, and robustness of motion generation. Abstract: Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.

[185] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang,Wenkai Dong,Yuxin Song,Bo Fang,Qi Zhang,Jing Wang,Fan Chen,Hui Zhang,Haocheng Feng,Yu Lu,Hang Zhou,Chun Yuan,Jingdong Wang

Main category: cs.CV

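TL;DR: This paper proposes SAMA, a framework that factorizes instruction-guided video editing into semantic anchoring and motion alignment to better balance precise semantic modification with faithful motion preservation. It needs no external priors and performs strongly in both zero-shot and supervised settings.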

Details Motivation: Existing instruction-guided video editing models struggle to combine precise semantic modification with faithful motion preservation, and their reliance on external priors (such as VLM features or structural conditions) limits robustness and generalization. Method: The SAMA framework has two parts: (1) Semantic Anchoring, which jointly predicts semantic tokens and video latents at sparse anchor frames to enable purely instruction-driven structural planning; and (2) Motion Alignment, which pre-trains the backbone on motion-centric pretext tasks (cube inpainting, speed perturbation, tube shuffle) so that it learns temporal dynamics directly from raw videos. Training has two stages: factorized pre-training without paired data, then supervised fine-tuning on paired editing data. Result: SAMA reaches state-of-the-art performance among open-source models and is competitive with leading commercial systems (such as Kling-Omni); factorized pre-training alone already gives strong zero-shot editing ability, which supports the factorized design. Conclusion: Explicitly factorizing semantics and motion is an effective route to better video editing quality and generalization, and SAMA offers a new paradigm for end-to-end video editing without external priors. Abstract: Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

[186] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Haitian Li,Haozhe Xie,Junxiang Xu,Beichen Wen,Fangzhou Hong,Ziwei Liu

Main category: cs.CV

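TL;DR: This paper proposes MonoArt, a framework that reconstructs articulated 3D objects from a single image through progressive structural reasoning. It avoids the instability of directly regressing motion parameters and achieves high accuracy, high efficiency, and good generalization.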

Details Motivation: Reconstructing articulated 3D objects from a single image is hard because motion cues are entangled with object structure, which makes direct regression unstable; existing methods rely on multi-view supervision, retrieval-based assembly, or video generation, at the cost of scalability or efficiency. Method: MonoArt is a unified framework based on progressive structural reasoning: instead of predicting articulation directly from image features, it progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings, all within a single architecture. Result: MonoArt achieves state-of-the-art reconstruction accuracy and inference speed on PartNet-Mobility, and generalizes to robotic manipulation and articulated scene reconstruction. Conclusion: By progressively modeling structure and motion separately, MonoArt enables stable and interpretable single-image articulation reconstruction without external templates or multi-stage pipelines, combining performance with practicality. Abstract: Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

[187] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Yuqing Wang,Chuofan Ma,Zhijie Lin,Yao Teng,Lijun Yu,Shuai Wang,Jiaming Han,Jiashi Feng,Yi Jiang,Xihui Liu

Main category: cs.CV

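TL;DR: This paper proposes Cubic Discrete Diffusion (CubiD), the first visual generation model for high-dimensional discrete representations (768 to 1024 dimensions). With fine-grained per-dimension masking and prediction, it models correlations across spatial positions and dimensions in a fixed number of steps, achieves state-of-the-art discrete generation on ImageNet-256, and shows that the same discrete tokens can serve both understanding and generation.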

Details Motivation: Existing discrete visual generation methods are limited to low-dimensional latent tokens (8 to 32 dimensions) with weak semantic expressiveness; high-dimensional pretrained representations (768 to 1024 dimensions) are semantically rich, but generating them discretely poses fundamental challenges, so a new paradigm is needed to unify understanding and generation. Method: Cubic Discrete Diffusion (CubiD) applies fine-grained masking and prediction to the high-dimensional discrete representation, where any dimension at any position can be masked; it generates in a fixed T steps (with T far smaller than h×w×d) while modeling strong correlations within and across spatial positions, and it scales to large parameter counts (900M to 3.7B). Result: CubiD reaches state-of-the-art discrete generation on ImageNet-256 with strong scaling behavior; the discretized tokens are shown to preserve the capabilities of the original high-dimensional representation, so they can support both downstream understanding and generation. Conclusion: CubiD is the first to achieve efficient discrete generation of high-dimensional representations, bridging the gap between understanding and generation and offering a feasible path toward unified multimodal architectures. Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
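To make the "any dimension at any position" idea concrete, here is a small sketch of masking individual entries of an h×w×d grid of discrete codes, plus a way to split all h·w·d entries over a fixed number of steps T. It is a toy under stated assumptions: the `MASK` id, the uniform random choice of entries, and the even split across steps are all hypothetical and are not taken from the CubiD paper or its code.

```python
import random

MASK = -1  # hypothetical id marking a masked entry

def mask_entries(tokens, ratio, rng):
    """Mask a fraction of individual (row, col, dim) entries of an h x w x d grid."""
    h, w, d = len(tokens), len(tokens[0]), len(tokens[0][0])
    coords = [(i, j, k) for i in range(h) for j in range(w) for k in range(d)]
    # Entries are chosen uniformly at random; the paper's schedule may differ.
    chosen = set(rng.sample(coords, round(ratio * len(coords))))
    masked = [[[MASK if (i, j, k) in chosen else tokens[i][j][k]
                for k in range(d)] for j in range(w)] for i in range(h)]
    return masked, chosen

def reveal_counts(total, steps):
    """Split `total` masked entries over a fixed number of steps (T much less than total)."""
    base, extra = divmod(total, steps)
    return [base + (1 if t < extra else 0) for t in range(steps)]

rng = random.Random(0)
grid = [[[(i + j + k) % 10 for k in range(4)] for j in range(3)] for i in range(2)]
masked_grid, chosen = mask_entries(grid, ratio=0.5, rng=rng)
print(len(chosen), reveal_counts(total=2 * 3 * 4, steps=5))
```

The point of the second helper is only that many entries are predicted per step, which is what lets the step count stay fixed as the feature dimension grows.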

[188] Matryoshka Gaussian Splatting

Zhilin Guo,Boqiao Zhang,Hakan Aktas,Kyle Fogarty,Jeffrey Hu,Nursena Koprucu Aslan,Wenzhao Li,Canberk Baykal,Albert Miao,Josef Bengtson,Chenliang Zhou,Weihao Xia,Cristina Nader Vasconcelos,Cengiz Oztireli

Main category: cs.CV

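TL;DR: This paper proposes Matryoshka Gaussian Splatting (MGS), a training framework for 3D Gaussian Splatting that supports continuous level of detail (LoD) without sacrificing full-capacity rendering quality, using stochastic budget training to obtain a smooth quality-speed trade-off.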

Details Motivation: Existing discrete LoD methods offer only a limited set of operating points, and continuous LoD methods often show noticeable quality degradation at full capacity, which makes LoD a costly design choice. Method: MGS learns a single ordered set of Gaussians such that rendering any prefix (the first k Gaussians) stays coherent, with fidelity improving smoothly as the budget grows. Its core is stochastic budget training: each iteration samples a random budget and optimizes both the corresponding prefix and the full set, which takes only two forward passes and no architectural changes. Result: Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while supporting a continuous speed-quality trade-off from a single model; ablations validate the ordering strategy, training objective, and model capacity choices. Conclusion: MGS provides high-quality, continuous LoD control with no loss at full capacity, making 3D Gaussian Splatting more flexible and practical to deploy. Abstract: The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
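The stochastic budget training loop described in the abstract can be sketched in a few lines. This is a toy under stated assumptions rather than the authors' implementation: `render_prefix` is a hypothetical stand-in for a 3DGS renderer (it just sums scalar contributions), the budget is sampled uniformly, and the prefix and full-set losses are weighted equally; none of these choices are confirmed by the abstract.

```python
import random

def render_prefix(contributions, k):
    """Toy stand-in for a renderer: sum the first k splat contributions."""
    return sum(contributions[:k])

def stochastic_budget_loss(contributions, target, rng):
    """Loss for one iteration: a randomly sized prefix plus the full ordered set."""
    n = len(contributions)
    k = rng.randint(1, n)  # sampled splat budget (uniform sampling is an assumption)
    prefix_loss = (render_prefix(contributions, k) - target) ** 2  # forward pass 1
    full_loss = (render_prefix(contributions, n) - target) ** 2    # forward pass 2
    return k, prefix_loss + full_loss  # equal weighting is an assumption

rng = random.Random(0)
splats = [0.5, 0.3, 0.2]  # ordered so that earlier splats matter most
budget, loss = stochastic_budget_loss(splats, target=1.0, rng=rng)
print(budget, loss)
```

Because the full set is optimized in every iteration alongside one random prefix, the ordering is pushed to be useful at every budget without a separate model per detail level.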

[189] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu,Dingkang Liang,Tianrui Feng,Kui Xia,Yumeng Zhang,Xiaofan Li,Xiao Tan,Xiang Bai

Main category: cs.CV

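TL;DR: This paper proposes VEGA-3D, a framework that uses the implicit spatial priors in pretrained video generation models to strengthen the geometric and physical reasoning of multimodal large language models (MLLMs). It needs no explicit 3D supervision and reaches state-of-the-art results on multiple spatial understanding and embodied manipulation tasks.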

Details Motivation: Existing MLLMs suffer from "spatial blindness" and struggle with fine-grained geometric reasoning and physical dynamics, while methods that rely on explicit 3D modalities or geometric scaffolding are limited by data scarcity and weak generalization. Method: VEGA-3D treats a pretrained video diffusion model as an implicit Latent World Simulator, extracts spatiotemporal features from intermediate noise levels of its denoising process, and fuses them with semantic representations through a token-level adaptive gated fusion mechanism, giving the MLLM dense geometric cues. Result: The method clearly outperforms existing approaches on benchmarks for 3D scene understanding, spatial reasoning, and embodied manipulation, supporting the view that generative priors can be a scalable foundation for physical-world understanding. Conclusion: Video generation models contain robust priors about 3D structure and physical laws that can be transferred to improve the spatial intelligence of MLLMs without extra 3D annotation or complex geometric modules. Abstract: While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
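The abstract names a token-level adaptive gated fusion but gives no formula, so the sketch below shows only one common form of gating: a per-token scalar gate computed from the concatenated features, used to blend the two streams. The function name, the single linear gate, and the assumption that both streams share the same width are hypothetical and may differ from VEGA-3D's actual module.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(semantic_tokens, generative_tokens, weights, bias=0.0):
    """Blend two token streams with one scalar gate per token."""
    if len(semantic_tokens) != len(generative_tokens):
        raise ValueError("both streams need the same number of tokens")
    fused = []
    for sem, gen in zip(semantic_tokens, generative_tokens):
        features = sem + gen  # concatenate the two feature vectors
        if len(features) != len(weights) or len(sem) != len(gen):
            raise ValueError("feature and weight sizes do not match")
        # Gate near 1 favors the generative (geometric) features, near 0 the semantic ones.
        gate = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
        fused.append([gate * g + (1.0 - gate) * s for s, g in zip(sem, gen)])
    return fused

semantic = [[1.0, 2.0]]
generative = [[3.0, 4.0]]
# With zero weights and zero bias the gate is 0.5, so the output is the average.
print(gated_fuse(semantic, generative, weights=[0.0, 0.0, 0.0, 0.0]))
```

In a trained model the gate weights would be learned, letting each token decide how much geometric signal to admit.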