cs.CL

[1] Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

Anna Babarczy,Andras Lukacs,Peter Vedres,Zeteny Bujka

Main category: cs.CL

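TL;DR: This study tests five large language models (LLMs) on a text-based Theory of Mind (ToM) task and finds that GPT-4o approaches human-level performance on belief, intention, and emotion reasoning, whereas smaller models are sensitive to the number of available cues and to distracting information, revealing differences and limits in LLM social-cognitive ability.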

Details Motivation: To examine whether current LLMs genuinely possess ToM capabilities, that is, whether they can infer others' beliefs, intentions, and emotions from text, and to ask whether their social-cognitive performance stems from deep understanding or from superficial pattern matching. Method: A text-based test widely used in human ToM research was adapted and administered to five LLMs and to human participants, assessing the accuracy and robustness of their reasoning about story characters' mental states (beliefs, intentions, emotions). Result: There was a clear performance gap between models: earlier and smaller models were strongly affected by the number of relevant cues and, to some extent, by irrelevant information, while GPT-4o showed high accuracy and robustness comparable to humans across conditions. Conclusion: GPT-4o shows human-comparable ToM reasoning in its outputs, while the other models tested appear to rely more on statistical pattern matching; the results point to qualitative differences among LLMs and challenge the simplified view that all LLMs have only surface-level understanding. Abstract: The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities -- specifically, the ability to infer others' beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.

[2] TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Fangrui Huang,Souhad Chbeir,Arpandeep Khatua,Sheng Wang,Sijun Tan,Kenan Ye,Lily Bailey,Merryn Daniel,Ryan Louie,Sanmi Koyejo,Ehsan Adeli

Main category: cs.CL

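TL;DR: This paper introduces THERAPYGYM, a framework for evaluating and improving the clinical fidelity and safety of therapy chatbots, together with THERAPYJUDGEBENCH, an expert-annotated validation set; reinforcement learning within the framework substantially improves adherence to cognitive behavioral therapy (CBT) techniques, as measured by the Cognitive Therapy Rating Scale (CTRS), and the handling of safety risks.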

Details Motivation: Existing LLM evaluation methods (fluency metrics, preference tests, generic dialogue benchmarks) cannot measure the clinically critical dimensions of psychotherapy, such as technique fidelity and safety, so an evaluation system built for clinical practice is needed. Method: The THERAPYGYM framework (1) uses an automated CTRS pipeline to score adherence to CBT techniques over multi-turn sessions (fidelity); (2) applies a multi-label annotation scheme to assess therapy-specific safety risks (for example, failing to address cues of harm); (3) releases THERAPYJUDGEBENCH, 116 dialogues with 1,270 expert ratings, to audit and calibrate LLM judges; and (4) uses CTRS and safety metrics as reward signals for RL training with patient simulators covering diverse symptom profiles. Result: Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and from 0.16 to 0.59 under LLM judges). Conclusion: THERAPYGYM offers a scalable evaluation and training framework for therapy chatbots that addresses both clinical fidelity and safety, supporting chatbots that are more faithful to evidence-based practice and safer in high-stakes use. Abstract: Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.

[3] How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding

Wei Chen,Guoyang Ju,Yuanyuan Qi

Main category: cs.CL

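TL;DR: This paper proposes the Log-Scale Focal Uncertainty (LSFU) metric and an uncertainty-calibrated prompt optimization framework (UCPOF) to address poor confidence calibration caused by prior bias in LLMs on multiple-choice tasks, improving few-shot performance while lowering the RAG trigger rate and computational cost.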

Details Motivation: Conventional uncertainty measures based on output probabilities (such as entropy) ignore class-prior differences in pretraining corpora and cannot separate spurious confidence that comes from priors from true certainty that comes from contextual understanding, which makes prompt optimization unreliable. Method: The paper proposes LSFU, a first-token-based uncertainty metric inspired by focal loss that uses label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for long-tail classes, with a dynamic weighting mechanism to unify the measurement scale; on top of it, the UCPOF framework uses first-token uncertainty to select high-quality exemplars and dynamically optimize prompts. Result: UCPOF improves average accuracy by 6.03% over few-shot baselines, exceeds always-on RAG by 5.75%, and reduces the average RAG trigger rate by 50.66%. Conclusion: LSFU characterizes model uncertainty more precisely, and UCPOF, by triggering RAG adaptively, significantly lowers computational cost while maintaining state-of-the-art performance. Abstract: With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs' performance on complex tasks. However, LLMs generate outputs autoregressively, leading to inevitable output uncertainty. Since model performance is highly sensitive to prompt design, precise uncertainty measurement is crucial for reliable prompt optimization. For multi-class multiple-choice (understanding) tasks, conventional uncertainty measures (e.g., entropy) based on output probabilities treat all classes equally and ignore class prior differences in pretraining corpora. This failure to distinguish spurious confidence (from priors) from true certainty (from contextual understanding) results in poor confidence calibration. To address this, we propose Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss. LSFU incorporates label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for low-frequency long-tail classes, with a dynamic weighting mechanism unifying the measurement scale. Based on LSFU, we further propose the uncertainty-calibrated prompt optimization framework (UCPOF), which leverages the first token of model outputs to select high-quality exemplars and dynamically optimize prompts.
Comprehensive evaluations show UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces the average retrieval trigger rate by 50.66%. By adaptively triggering RAG only for high-uncertainty samples, our framework significantly lowers computational costs while maintaining state-of-the-art performance.
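The abstract does not give the LSFU formula, so the sketch below does not attempt it. It only illustrates the conventional baseline the abstract contrasts against (Shannon entropy of the first-token distribution over answer labels) and the general idea of triggering retrieval only for high-uncertainty samples. The function names, the example probabilities, and the 0.5-nat threshold are all assumptions made for illustration.

```python
import math

def first_token_entropy(probs):
    """Shannon entropy (in nats) of a first-token distribution over answer labels.

    This is the conventional, prior-agnostic measure, not LSFU.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_retrieve(probs, threshold=0.5):
    """Hypothetical gate: trigger retrieval only when entropy exceeds the threshold."""
    return first_token_entropy(probs) > threshold

# Illustrative first-token probabilities over options A-D (made-up values).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(first_token_entropy(probs), 3), should_retrieve(probs))
```

A prior-aware metric such as LSFU would, per the abstract, additionally reweight this signal by label prior probabilities; that step is deliberately left out here.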

[4] Agentic Framework for Political Biography Extraction

Yifei Zhu,Songpo Yang,Jiangnan Zhu,Junyan Jiang

Main category: cs.CL

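TL;DR: This paper proposes a two-stage LLM-based "Synthesis-Coding" framework for automating the construction of large-scale databases of political elite biographies, improving accuracy, scalability, and transparency.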

Details Motivation: Political science research has long been constrained by the high cost, heavy reliance on human coders, and difficulty of automation involved in building large-scale structured political datasets. Method: A two-stage Synthesis-Coding framework: upstream, recursive agentic LLMs search, filter, and curate biographical information from heterogeneous web sources; downstream, an LLM codes the curated content into structured dataframes. Result: (1) Given curated context, LLM coders match or outperform human experts in accuracy; (2) in web environments, the agentic system gathers more information than human collective intelligence (Wikipedia); (3) coding directly from long or multi-language corpora introduces bias, which the synthesis stage alleviates by producing signal-dense summaries. Conclusion: The framework offers a generalizable, scalable, transparent, and extensible approach to building large-scale databases in political science. Abstract: The production of large-scale political datasets typically demands extracting structured facts from vast piles of unstructured documents or web sources, a task that traditionally relies on expensive human experts and remains prohibitively difficult to automate at scale. In this paper, we leverage Large Language Models (LLMs) to automate the extraction of multi-dimensional elite biographies, addressing a long-standing bottleneck in political science research. We propose a two-stage "Synthesis-Coding" framework for complex extraction tasks: an upstream synthesis stage that uses recursive agentic LLMs to search, filter, and curate biographies from heterogeneous web sources, followed by a downstream coding stage that maps curated biographies into structured dataframes. We validate this framework through three primary results. First, we demonstrate that, when given curated contexts, LLM coders match or outperform human experts in extraction accuracy. Second, we show that in web environments, the agentic system synthesizes more information from web resources than human collective intelligence (Wikipedia). Finally, we find that directly coding from long and multi-language corpora introduces bias that the synthesis stage can alleviate by curating evidence into signal-dense representations. Through comprehensive evaluation, we provide a generalizable, scalable framework for building transparent and expansible large-scale databases in political science.

[5] Controllable Evidence Selection in Retrieval-Augmented Question Answering via Deterministic Utility Gating

Victor P. Unda

Main category: cs.CL

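TL;DR: This paper presents a training-free, deterministic evidence selection framework (MUE/DUE) that screens candidate text units using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy, requiring each selected unit to independently state the fact or condition the question needs; otherwise no answer is returned.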

Details Motivation: Retrieval systems based on vector similarity capture topical similarity but cannot explain why some highly similar text can serve as evidence while other text cannot; when scores are close, they tend to select text that is redundant, incomplete, or addresses the wrong conditions. Method: The paper introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), fixed scoring and redundancy-control procedures that evaluate each sentence or record independently for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy before answer generation; a unit is accepted only if it explicitly states the fact, rule, or condition the task requires, and units are not merged or expanded. Result: The approach yields compact, auditable evidence sets and draws a clear line between "relevant text" and "usable evidence", avoiding fuzzy matching and unreliable combination of units. Conclusion: The deterministic framework gives retrieval-augmented question answering a training-free, interpretable, and verifiable evidence-screening mechanism, strengthening trustworthiness and controllability in settings that demand rigorous reasoning. Abstract: Many modern AI question-answering systems convert text into vectors and retrieve the closest matches to a user question. While effective for topical similarity, similarity scores alone do not explain why some retrieved text can serve as evidence while other equally similar text cannot. When many candidates receive similar scores, systems may select sentences that are redundant, incomplete, or address different conditions than the question requires. This paper presents a deterministic evidence selection framework for retrieval-augmented question answering. The approach introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), fixed scoring and redundancy-control procedures that determine evidence admissibility prior to answer generation. Each sentence or record is evaluated independently using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. No training or fine-tuning is required. In the prototype, a unit is accepted only if it explicitly states the fact, rule, or condition required by the task. Units are not merged or expanded. If no unit independently satisfies the requirement, the system returns no answer. This deterministic gating produces compact, auditable evidence sets and establishes a clear boundary between relevant text and usable evidence.

[6] DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation

Penghao Liang,Mengwei Yuan,Jianan Liu,Jing Yang,Xianyou Li,Weiran Yan,Yichao Wu

Main category: cs.CL

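TL;DR: DynaRAG is a retrieval-augmented generation (RAG) framework that compensates for the limits of a static knowledge base by dynamically calling external APIs; combining an LLM reranker, a sufficiency classifier, and the Gorilla v2 API-calling model, it significantly improves accuracy on dynamic questions in the CRAG benchmark and reduces hallucinations.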

Details Motivation: Traditional RAG relies only on static documents and struggles with time-sensitive questions or questions that need real-time information, leading to inaccurate answers or hallucinations. Method: The DynaRAG framework comprises (1) an LLM reranker that assesses document relevance; (2) a sufficiency classifier that decides whether API fallback is needed; (3) Gorilla v2 for calling external APIs; and (4) FAISS-backed schema filtering to guide API selection. Result: On the CRAG benchmark, DynaRAG significantly improves accuracy on dynamic questions and reduces hallucinations. Conclusion: Dynamic-aware routing and selective tool use are important for building reliable real-world question-answering systems. Abstract: We present DynaRAG, a retrieval-augmented generation (RAG) framework designed to handle both static and time-sensitive information needs through dynamic knowledge integration. Unlike traditional RAG pipelines that rely solely on static corpora, DynaRAG selectively invokes external APIs when retrieved documents are insufficient for answering a query. The system employs an LLM-based reranker to assess document relevance, a sufficiency classifier to determine when fallback is necessary, and Gorilla v2 -- a state-of-the-art API calling model -- for accurate tool invocation. We further enhance robustness by incorporating schema filtering via FAISS to guide API selection. Evaluations on the CRAG benchmark demonstrate that DynaRAG significantly improves accuracy on dynamic questions, while also reducing hallucinations. Our results highlight the importance of dynamic-aware routing and selective tool use in building reliable, real-world question-answering systems.
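The routing described above (rerank, check sufficiency, fall back to an API only when needed) can be sketched roughly as follows. This is not DynaRAG's implementation: the reranker, sufficiency check, and API call are passed in as plain callables, and the toy word-overlap stand-ins below are assumptions that take the place of the LLM reranker, the trained classifier, and Gorilla v2.

```python
def answer_with_fallback(query, docs, rerank, is_sufficient, call_api, top_k=3):
    """Route a query to static documents or to an external API.

    rerank, is_sufficient and call_api are hypothetical callables supplied by the caller.
    """
    ranked = rerank(query, docs)[:top_k]
    if is_sufficient(query, ranked):
        return {"source": "static", "context": ranked}
    # Retrieved documents were judged insufficient: fall back to a live API call.
    return {"source": "api", "context": [call_api(query)]}

def _overlap(query, doc):
    """Number of lowercase words shared by query and document (toy relevance score)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def toy_rerank(query, docs):
    return sorted(docs, key=lambda d: _overlap(query, d), reverse=True)

def toy_is_sufficient(query, docs):
    return any(_overlap(query, d) >= 2 for d in docs)

def toy_call_api(query):
    return f"<live result for: {query}>"

corpus = ["Paris is the capital of France", "Bananas are yellow"]
static_route = answer_with_fallback("capital of France", corpus,
                                    toy_rerank, toy_is_sufficient, toy_call_api)
dynamic_route = answer_with_fallback("current bitcoin price", corpus,
                                     toy_rerank, toy_is_sufficient, toy_call_api)
print(static_route["source"], dynamic_route["source"])
```

Keeping the three components as injected callables mirrors the modular description in the abstract and makes each stage replaceable in isolation.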

[7] Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models

Toshiyuki Shigemura

Main category: cs.CL

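TL;DR: This empirical study finds that although large language models (LLMs) can reconstruct non-causal solutions from their training data, they never output such content in standard generation tasks, indicating that task-conditioned generation policies can comprehensively suppress knowledge that has been learned but does not fit the task frame.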

Details Motivation: To investigate why LLMs, despite being able to reconstruct specific content (such as non-causal solutions) from training data, do not express that capability in ordinary generation, challenging the default assumption that presence in the training data directly affects output probability. Method: An empirical observational analysis of 300 prompt-response generations (covering narrative and problem-solving tasks, 10 scenarios, and 3 LLMs), drawing on work on memorization contiguity and alignment-induced discourse priors, to examine the expression of non-causal, non-implementable solutions. Result: No non-causal solution was observed in any of the 300 generations (0%, 95% CI: [0%, 1.2%]), even though the same models could reliably reconstruct such content under conditional extraction. Conclusion: LLM output is tightly governed by task-conditioned policies; whether learned content is expressed depends on the generation policy rather than on training-data presence alone, with implications for understanding generation dynamics, output distribution control, and the behavioral boundaries of models. Abstract: Large language models (LLMs) demonstrate the capacity to reconstruct and trace learned content from their training data under specific elicitation conditions, yet this capability does not manifest in standard generation contexts. This empirical observational study examines the expression of non-causal, non-implementable solution types across 300 prompt-response generations spanning narrative and problem-solving task contexts. Drawing on recent findings regarding memorization contiguity and alignment-induced discourse priors, we document a systematic dissociation between learned capability and expressed output. Across three distinct LLMs, ten task scenarios, and both creative narrative and practical advisory contexts, we documented zero instances of non-causal solution frames in generated outputs (0%, 95% CI: [0%, 1.2%]), despite verified reconstruction capability under conditional extraction. These findings challenge the prevailing assumption that training data presence directly predicts output probability, demonstrating instead that task-conditioned generation policies can comprehensively suppress learned content across diverse contexts. The results offer implications for understanding generation dynamics, output distribution control, and the behavioral boundaries of contemporary LLMs.
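The abstract reports a 95% CI of [0%, 1.2%] for 0 events in 300 generations without naming the interval method. One assumption that reproduces the upper bound is the exact (Clopper-Pearson) two-sided interval, which for zero observed events reduces to the closed form 1 - (alpha/2)^(1/n). The short check below works under that assumption only; the function name is illustrative.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Upper limit of the exact two-sided binomial interval when 0 of n trials succeed.

    With zero successes, the Clopper-Pearson upper limit solves (1 - p)**n = alpha / 2.
    """
    alpha = 1.0 - confidence
    return 1.0 - (alpha / 2.0) ** (1.0 / n)

upper = zero_event_upper_bound(300)
print(f"upper bound: {upper:.2%}")  # roughly 1.2%
```

The practical reading is that zero observed instances in 300 samples is still compatible with a true rate of up to about 1.2%, so "never observed" should not be read as "impossible".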

[8] Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction

Hui Wen Goh,Jonas Mueller

Main category: cs.CL

TL;DR: CONSTRUCT is a method for scoring the trustworthiness of LLM structured outputs in real time, helping to identify errors and direct human review.

Details Motivation: Structured outputs from current LLMs contain sporadic errors, which limits their potential in enterprise AI. Method: CONSTRUCT scores the trustworthiness of a structured output and of each field within it in real time; it works with any LLM (including black-box APIs without logprobs) and requires neither labeled data nor custom deployment. Result: On one of the first public structured-output benchmarks, comprising four datasets, CONSTRUCT detects errors from models including Gemini 3 and GPT-5 with significantly higher precision and recall than other scoring methods. Conclusion: CONSTRUCT provides an efficient, general, and practical way to improve the reliability of structured outputs, especially where human review bandwidth is limited. Abstract: Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI efforts from realizing their immense potential. We present CONSTRUCT, a method to score the trustworthiness of LLM Structured Outputs in real-time, such that lower-scoring outputs are more likely to contain errors. This reveals the best places to focus limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within an LLM Structured Output, helping reviewers quickly identify which parts of the output are wrong. Our method is suitable for any LLM (including black-box LLM APIs without logprobs such as reasoning models and Anthropic models), requires neither labeled training data nor custom model deployment, and works for complex Structured Outputs with many fields of diverse types (including nested JSON schemas). We additionally present one of the first public LLM Structured Output benchmarks with reliable ground-truth values that are not full of mistakes. Over this four-dataset benchmark, CONSTRUCT detects errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.

[9] Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

Trishita Dhara,Siddhesh Sheth

Main category: cs.CL

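TL;DR: Using two post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, this paper analyzes the decision logic of a RoBERTa model for harmful content detection, exposes systematic failure modes in borderline, contextual, and politically sensitive cases, and argues that the core value of explainable AI lies in transparency and support for human review rather than in raw performance gains.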

Details Motivation: Existing harmful content detection systems lack explainability, and it is especially hard to understand the basis of their judgments on borderline, context-dependent, and politically sensitive content; current research focuses mostly on accuracy and neglects explanatory analysis. Method: A RoBERTa classifier is trained on the Civil Comments dataset, and two post-hoc explanation methods, Shapley Additive Explanations (SHAP) and Integrated Gradients (IG), are used for attribution analysis of correct predictions and typical error cases, combined with qualitative case studies to identify common failure modes. Result: Overall performance is strong (AUC = 0.93, accuracy = 0.94), but the explanation analysis exposes hidden weaknesses: IG tends toward diffuse contextual attributions while SHAP focuses on explicit lexical cues; the divergence between the two shows up in both false negatives and false positives; recurring failure modes include indirect toxicity, lexical over-attribution, and misjudged political discourse. Conclusion: Explainable AI should be positioned as a diagnostic tool that improves transparency and supports human-in-the-loop moderation, not as a performance lever; explanatory analysis can expose model uncertainty and blind spots that aggregate metrics do not reveal. Abstract: Although automated harmful content detection systems are widely used to monitor online platforms, moderators and end users often cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little focus has been placed on comprehending why neural models identify content as harmful, especially when it comes to borderline, contextual, and politically sensitive situations. In this work, we present an explainability-driven analysis of a neural harmful content detection model trained on the Civil Comments dataset. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients appear to extract more diffuse contextual attributions while Shapley Additive Explanations extract more focused attributions on explicit lexical cues. The consequent divergence in their outputs manifests in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, or political discourse.
The results suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and increasing the interpretable rationale behind automated decisions. Most importantly, this work highlights the role of explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.

[10] MineDraft: A Framework for Batch Parallel Speculative Decoding

Zhenwei Tang,Arun Verma,Zijian Zhou,Zhaoxuan Wu,Alok Prakash,Daniela Rus,Bryan Kian Hsiang Low

Main category: cs.CL

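TL;DR: This paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework that hides drafting latency by overlapping draft generation with verification, significantly improving LLM inference throughput and end-to-end latency.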

Details Motivation: Standard speculative decoding (SD) is limited by the strictly sequential execution of its drafting and verification stages, which creates a performance bottleneck. Method: MineDraft uses a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other; a theoretical analysis shows that PSD is more efficient than standard SD. Result: Experiments show that MineDraft improves throughput by up to 75% and reduces end-to-end latency by up to 39% relative to standard SD; it has been implemented as a plugin for vLLM, demonstrating its practicality for production use. Conclusion: Through batch parallelism, MineDraft effectively relieves the latency bottleneck in speculative decoding and significantly improves inference efficiency, making it suitable for real deployment. Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
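Why overlapping helps can be seen with a back-of-the-envelope pipeline model. This is not the paper's theoretical analysis: it assumes fixed per-round drafting and verification times, no resource contention between the two stages, and always having a second batch ready. Under those idealized assumptions, a sequential round costs the sum of the two stage times, while an overlapped round costs only the longer one. All names and timings below are illustrative.

```python
def sequential_round_time(t_draft, t_verify):
    """Standard SD: drafting and verification run back to back."""
    return t_draft + t_verify

def overlapped_round_time(t_draft, t_verify):
    """Idealized two-batch overlap: the slower stage sets the pace."""
    return max(t_draft, t_verify)

def ideal_throughput_gain(t_draft, t_verify):
    """Ratio of sequential to overlapped round time (upper bound in this toy model)."""
    return sequential_round_time(t_draft, t_verify) / overlapped_round_time(t_draft, t_verify)

# Made-up stage times in milliseconds.
for t_draft, t_verify in [(3.0, 7.0), (5.0, 5.0), (1.0, 9.0)]:
    gain = ideal_throughput_gain(t_draft, t_verify)
    print(f"draft={t_draft} verify={t_verify} ideal gain={gain:.2f}x")
```

In this model the gain is largest when the two stages take similar time and shrinks toward 1x as drafting becomes negligible, which is consistent with the framing that the benefit comes from hiding drafting latency.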

[11] An Agentic System for Schema Aware NL2SQL Generation

David Onyango,Naseef Mansoor

Main category: cs.CL

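TL;DR: This paper proposes CESMA, a schema-based agentic system with a hybrid architecture that uses small language models (SLMs) as primary agents and a large language model (LLM) as a selective fallback, substantially reducing the computational cost and privacy risk of NL2SQL; on the BIRD benchmark it reaches 47.78% execution accuracy with over 90% cost reduction.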

Details Motivation: Existing NL2SQL methods depend on large language models (LLMs), which brings high computational overhead, data privacy risk, and difficulty deploying in resource-constrained environments, so a lightweight, efficient, deployable alternative is needed. Method: A multi-agent system grounded in the database schema uses local small language models (SLMs) as the primary agents and calls the LLM only when errors are detected in the SLM output, giving on-demand invocation and cost control. Result: On the BIRD benchmark the system achieves 47.78% execution accuracy and a validation efficiency score of 51.05%; about 67% of queries are resolved locally by SLMs; the average cost per query falls to 0.0085 (versus 0.094 for LLM-only systems), a reduction of over 90%. Conclusion: An SLM-first architecture with on-demand LLM fallback greatly improves deployability and cost-effectiveness while keeping reasonable performance, offering a practical approach for real-world NL2SQL. Abstract: The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non-expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real-world deployability in resource-constrained environments. To address these challenges, we propose a schema based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. Because the LLM is invoked only upon detection of errors in SLM-generated output, the proposed system significantly minimizes computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, achieving over 90% cost reduction compared to LLM-centric baselines as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of 0.0085 compared to 0.094 for LLM-only systems, achieving near-zero operational costs for locally executed queries. [Github repository: https://github.com/mindslab25/CESMA.]
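A rough sketch of the SLM-first, LLM-on-error pattern follows. It is not CESMA's code: the error detector here is simply asking SQLite to plan the query (an assumption; the abstract does not say how errors are detected), and `slm_generate` and `llm_generate` are hypothetical callables standing in for the two models. The last lines check the reported cost figures.

```python
import sqlite3

def generate_sql(question, conn, slm_generate, llm_generate):
    """Try the small model first; fall back to the large model only on a detected error."""
    sql = slm_generate(question)
    try:
        # Planning the query catches syntax and schema errors without running it.
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return sql, "slm"
    except sqlite3.Error:
        return llm_generate(question), "llm"

# Cost figures quoted in the abstract (units are not stated there).
cost_hybrid = 0.0085
cost_llm_only = 0.094
cost_reduction = 1.0 - cost_hybrid / cost_llm_only
print(f"cost reduction: {cost_reduction:.1%}")  # about 91%, consistent with "over 90%"
```

A usage example, with stub generators in place of real models:

```python
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
sql, route = generate_sql(
    "List all user names", conn,
    slm_generate=lambda q: "SELECT nam FROM users",   # stub SLM with a schema error
    llm_generate=lambda q: "SELECT name FROM users",  # stub LLM fallback
)
print(route, sql)
```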

[12] BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity

Harshita Diddee,Gregory Yauney,Swabha Swayamdipta,Daphne Ippolito

Main category: cs.CL

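TL;DR: This paper introduces BenchBrowser, a tool that retrieves evaluation items relevant to natural-language use cases, addressing the lack of fine-grained coverage and validation in existing benchmarks and helping practitioners diagnose weaknesses in content validity and convergent validity.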

Details Motivation: Existing language model benchmarks are described too coarsely to reflect the specific capabilities real use cases require, so models can appear competent while failing on important untested facets. Method: BenchBrowser is a retriever that surfaces evaluation items relevant to natural-language use cases across 20 benchmark suites; a human study confirms its high retrieval precision. Result: BenchBrowser helps practitioners identify low content validity (narrow coverage of a capability's facets) and low convergent validity (unstable rankings when measuring the same capability). Conclusion: BenchBrowser helps quantify a critical gap between practitioner intent and what benchmarks actually test, improving the credibility and usefulness of model evaluation. Abstract: Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser, thus, helps quantify a critical gap between practitioner intent and what benchmarks actually test.

[13] Evaluating FrameNet-Based Semantic Modeling for Gender-Based Violence Detection in Clinical Records

Lívia Dutra,Arthur Lorenzi,Frederico Belcavello,Ely Matos,Marcelo Viridiano,Lorena Larré,Olívia Guaranha,Erik Santos,Sofia Reinach,Pedro de Paula,Tiago Torrent

Main category: cs.CL

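TL;DR: This study examines whether FrameNet-based semantic annotation of open-text fields in electronic medical records helps identify patterns of gender-based violence (GBV), finding that models incorporating semantic annotation clearly outperform those using only structured data, with an F1 improvement of over 0.3.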

Details Motivation: Although healthcare professionals in Brazil are legally required to report GBV cases, difficulties in identifying abuse and limited integration between public information systems lead to significant underreporting. Method: Open-text fields in electronic medical records are semantically annotated with FrameNet, and three SVM classifiers are compared quantitatively and qualitatively: (1) frame-annotated text only; (2) frame-annotated text plus parameterized data; (3) parameterized data only. Result: Models incorporating semantic annotation improve F1 by more than 0.3 over the purely structured-data model; semantic representations provide meaningful signal beyond structured demographic data. Conclusion: Semantic analysis of clinical narratives can strengthen early identification of GBV and support better-informed public health interventions. Abstract: Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.

[14] How LLMs Distort Our Written Language

Marwa Abdulhai,Isadora White,Yanming Wan,Ibrahim Qureshi,Joel Leibo,Max Kleiman-Weiner,Natasha Jaques

Main category: cs.CL

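TL;DR: This paper studies how large language models (LLMs) systematically change the meaning of human text when used as writing assistants, finding that they affect not only style and tone but also the intended meaning; through a user study, a revision experiment on pre-LLM essays, and an analysis of real AI-generated peer reviews, it documents more neutral stances, reduced originality, and weaker reviewing standards associated with LLM use.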

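Details Motivation: To investigate whether LLMs, in widespread writing-assistance use, implicitly alter the semantic content of human expression rather than only improving grammar or style, and so to assess their potential long-term effect on cultural and scientific communication. Method: Three empirical components: (1) a human user study of how the extent of LLM use affects neutrality, creativity, and authorial voice; (2) a dataset of human-written essays from 2021 with expert feedback, used to test how much LLMs disturb meaning even when asked to make only grammar edits; (3) an analysis of the 21% of peer reviews at a recent top AI conference that were LLM-generated, comparing their scores and the weight they give to dimensions such as clarity and significance with those of human reviews. Result: (1) Extensive LLM use led to a nearly 70% increase in essays that remained neutral on the topic question, and heavy users more often reported that the writing was less creative and not in their own voice; (2) even when restricted to grammar edits guided by expert feedback, LLMs still significantly changed the semantic meaning of the text; (3) LLM-generated reviews placed significantly less weight on clarity and significance and assigned scores about one full point higher on average. Conclusion: LLMs used for writing assistance introduce a systematic semantic shift that is at odds with the benefit users perceive; this may affect academic rigor and the diversity of cultural expression and motivates further work on both technical design and institutional norms. Abstract: Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference.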
We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.

[15] Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations

Maria Andueza Rodriguez,Marie Candito,Richard Huyghe

Main category: cs.CL

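TL;DR: This study compares human and LLM word-association responses to assess how human-like the internal lexicon of LLMs is; the results show that models of different sizes trade off response typicality against variability, and that this trade-off is strongly affected by the temperature parameter.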

Details Motivation: To examine whether the lexical knowledge inside LLMs is human-like, in particular whether their lexicon reflects real human association patterns. Method: Using English cue-response pairs from the SWOW dataset, associations are generated by three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) at multiple temperature settings; the influence of lexical factors such as word frequency and concreteness is analyzed, and response variability and typicality are quantified. Result: All models reproduce human trends for frequency and concreteness; larger models such as Qwen tend to produce highly typical, minimally variable responses (resembling a single prototypical human participant), while smaller models such as Mistral and Llama produce more variable but less typical responses; raising the temperature increases variability and lowers typicality. Conclusion: LLM lexical representations share features with the human lexicon (frequency and concreteness effects) but also differ systematically (the typicality-variability trade-off); model size and temperature are key moderating variables that should be accounted for when probing LLM lexical representations. Abstract: Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single "prototypical" human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.
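The direction of the temperature effect follows from how temperature-scaled sampling works in general: logits are divided by the temperature before the softmax, so higher values flatten the distribution over candidate responses. The sketch below is a generic illustration, not the paper's setup; the logits for candidate associates of a cue word are invented.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Standard temperature-scaled softmax (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more variable sampled responses."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits for four candidate associates of one cue word.
logits = [4.0, 2.5, 1.0, 0.5]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, round(max(probs), 3), round(entropy(probs), 3))
```

As the temperature rises, the probability of the top (most typical) candidate falls and the entropy grows, which matches the reported trade-off between typicality and variability.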

[16] GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

Ja Young Lee,Mírian Silva,Mohamed Nasr,Shonda Witherspoon,Enzo Bozzani,Veronique Demers,Radha Ratnaparkhi,Hui Wu,Sara Rosenthal

Main category: cs.CL

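TL;DR: This paper presents GRAFITE, a platform for continuous evaluation of large language models (LLMs) that builds an issue repository from user feedback and runs quality assurance tests with LLM-as-a-judge, supporting multi-model comparison and regression detection across releases.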

Details Motivation: As benchmark data is increasingly exposed during training, LLM testing faces contamination risk that inflates measured performance, so a more reliable, continuous evaluation mechanism is needed. Method: The GRAFITE platform collects user feedback into an evolving repository of model issues, provides a QA testing pipeline based on LLM-as-a-judge, and supports side-by-side evaluation of multiple models and regression analysis across releases. Result: A publicly available continuous evaluation system (open-sourced on GitHub) that supports issue-driven LLM testing, cross-model comparison, and detection of regressions between versions. Conclusion: GRAFITE offers a practical, sustainable evaluation framework for mitigating benchmark contamination and supports more transparent and robust model development and iteration. Abstract: Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.

[17] CWoMP: Morpheme Representation Learning for Interlinear Glossing

Morris Alper,Enora Rice,Bhargav Shandilya,Alexis Palmer,Lori Levin

Main category: cs.CL

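TL;DR: This paper proposes CWoMP, which uses contrastive learning to align words with their morphemes and generates morpheme sequences autoregressively from a mutable lexicon, improving automatic interlinear glossed text (IGT) generation for low-resource languages.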

Details Motivation: Existing automated IGT methods treat glosses as character sequences, ignoring their compositional structure, while manual annotation is costly. Method: Propose CWoMP (Contrastive Word-Morpheme Pretraining), which treats morphemes as atomic form-meaning units; a contrastively trained encoder aligns words-in-context with their constituent morphemes, and an autoregressive decoder generates the morpheme sequence from an updatable lexicon of morpheme embeddings. Result: CWoMP outperforms existing methods on diverse low-resource languages, with especially large gains in extremely low-resource settings, while being more efficient and letting users expand the lexicon at inference time without retraining. Conclusion: By modeling morphemes as form-meaning units and offering an interpretable, editable generation mechanism, CWoMP provides a more effective and flexible approach to automated IGT for low-resource languages. Abstract: Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure. We propose CWoMP (Contrastive Word-Morpheme Pretraining), which instead treats morphemes as atomic form-meaning units with learned representations. A contrastively trained encoder aligns words-in-context with their constituent morphemes in a shared embedding space; an autoregressive decoder then generates the morpheme sequence by retrieving entries from a mutable lexicon of these embeddings. Predictions are interpretable--grounded in lexicon entries--and users can improve results at inference time by expanding the lexicon without retraining. We evaluate on diverse low-resource languages, showing that CWoMP outperforms existing methods while being significantly more efficient, with particularly strong gains in extremely low-resource settings.

[18] How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence

Alex Anvi Eponon,Ildar Batyrshin,Christian E. Maldonado-Sifuentes,Grigori Sidorov

Main category: cs.CL

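TL;DR: This paper examines the historical links between AI paradigms and psychological theories, arguing that each AI paradigm inherited both the strengths and the structural limitations of the theory that inspired it. It then proposes ReSynth, a trimodular framework (Intellect/Identity/Memory) intended to address shortcomings in knowledge structure, updatable representations, and the construction of understanding, so that systematic behavior becomes a necessary consequence of an AGI architecture rather than an accidental property.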

Details Motivation: Existing AI paradigms (such as reinforcement learning, deep learning, and curriculum learning) were inspired by psychological theories but also inherited their structural flaws, which makes it hard for them to support the adaptability and construction of understanding that artificial general intelligence requires. Method: Trace the genealogy from psychological paradigms (behaviorism, cognitivism, constructivism) to AI methods, diagnose the limitations inherited at each stage, and, drawing on cross-cultural views of education (in particular the Eastern conception of memorization as a structured precursor to understanding), propose ReSynth, an architecture that separates three modules (reasoning, purpose, knowledge). Result: The ReSynth framework treats reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components, aiming to make systematic behavior a necessary consequence of the representational architecture rather than an accidentally emergent property. Conclusion: Achieving truly adaptable artificial general intelligence requires moving beyond current psychology-inspired AI paradigms and their inherited limitations toward architectures that formally support the construction of understanding and principled knowledge updates; ReSynth is offered as one such path. Abstract: The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.

[19] From Noise to Signal: When Outliers Seed New Topics

Evangelia Zve,Gauvain Bourgne,Benjamin Icard,Jean-Gabriel Ganascia

Main category: cs.CL

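TL;DR: This paper proposes a temporal taxonomy for identifying the trajectories of news documents during topic evolution, focusing on "anticipatory outliers" that foreshadow emerging topics, and validates it on a French news corpus on the hydrogen economy.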

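Details Motivation: Conventional dynamic topic modeling treats outliers as noise, but the authors argue that some outliers are early signals of emerging topics and deserve systematic modeling and identification. Method: Define a temporal taxonomy describing how news documents relate to topics over time, distinguishing anticipatory outliers, reinforcing documents, and isolated documents; implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models. Result: Retrospective experiments on the French news corpus HydroNewsFr show that inter-model agreement identifies a small, high-consensus subset of anticipatory outliers; qualitative case studies further illustrate the taxonomy's explanatory value for topic evolution. Conclusion: Anticipatory outliers are not noise but a bridge between weak-signal detection and dynamic topic modeling; the proposed taxonomy offers a clear framework for understanding how individual articles anticipate, initiate, or drift within evolving topics. Abstract: Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.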

[20] Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Tianhui Zhang,Bei Peng,Danushka Bollegala

Main category: cs.CL

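TL;DR: This paper proposes a two-stage method for building CommonSyn, the first synthetic dataset for diversified Generative Commonsense Reasoning (GCR), to address the lack of high-quality, diverse commonsense training data, and shows that it improves both the diversity and quality of LLM generations.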

Details Motivation: Existing Generative Commonsense Reasoning (GCR) datasets are small, cover a narrow range of scenarios, and are expensive to annotate, so they cannot meet the training needs of diversified commonsense generators. Method: Propose a two-stage synthetic data construction method that produces CommonSyn, the first synthetic dataset for diversified GCR, and fine-tune LLMs of several sizes on it to validate its effect. Result: Models fine-tuned on CommonSyn improve in both generation diversity and quality compared with vanilla models and with models fine-tuned on human-crafted data. Conclusion: Synthetic data can ease the data bottleneck for diversified commonsense generation, and CommonSyn provides a feasible and effective training resource for this direction. Abstract: Conversational agents are required to respond to their users not only with high quality (i.e. commonsense bearing) responses, but also considering multiple plausible alternative scenarios, reflecting the diversity in their responses. Despite the growing need to train diverse commonsense generators, the progress of this line of work has been significantly hindered by the lack of large-scale high-quality diverse commonsense training datasets. Due to the high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create the first-ever synthetic dataset, CommonSyn, for diversified GCR. Models fine-tuned on our synthetic data jointly increase both generation diversity and quality compared with vanilla models and models fine-tuned on a human-crafted dataset, across Large Language Models (LLMs) of different sizes.

[21] PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Ruishuo Chen,Yu Chen,Zhuoran Li,Longbo Huang

Main category: cs.CL

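TL;DR: This paper proposes PowerFlow, a framework that recasts unsupervised reinforcement learning as a distribution matching problem, uses GFlowNet as a variational sampler, and removes the length bias of autoregressive generation through a length-aware Trajectory-Balance objective. Targeting α-power distributions lets it steer an LLM toward logical reasoning (α > 1) or creative expression (α < 1).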

Details Motivation: Existing unsupervised reinforcement learning methods rely on heuristic intrinsic rewards, lack a well-defined theoretical optimization target, and are prone to degenerative biases. Method: Formulate unsupervised fine-tuning as distribution matching; use GFlowNet as an amortized variational sampler for unnormalized densities; design a length-aware Trajectory-Balance objective; introduce α-power distributions to steer the LLM output distribution in a chosen direction. Result: PowerFlow consistently outperforms existing RLIF methods across tasks and matches or exceeds supervised GRPO; in aligned models it mitigates over-sharpening, improving diversity and quality together and shifting the Pareto frontier on creative tasks. Conclusion: PowerFlow offers a principled, controllable, bias-resistant framework for eliciting LLM capabilities without supervision, covering both stronger logical reasoning and greater creativity. Abstract: Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $α$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($α > 1$) to intensify logical reasoning, or flattening it ($α < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
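
The sharpening and flattening effect of an α-power distribution can be illustrated on a toy categorical distribution. The sketch below is not PowerFlow itself (which matches distributions over whole generated sequences with a GFlowNet sampler); it only assumes the textbook definition, in which each probability is raised to the power α and the result is renormalised. The function names and the example probabilities are hypothetical.

```python
import math

def power_distribution(probs, alpha):
    """Raise each probability to the power alpha and renormalise to sum to 1."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [w / total for w in powered]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical base distribution over three candidate outputs.
base = [0.6, 0.3, 0.1]
sharp = power_distribution(base, 2.0)   # alpha > 1 concentrates mass on the mode
flat = power_distribution(base, 0.5)    # alpha < 1 spreads mass more evenly

for name, dist in (("base", base), ("sharp", sharp), ("flat", flat)):
    print(name, [round(p, 3) for p in dist], "entropy:", round(entropy(dist), 3))
```

In this toy setting, α > 1 lowers entropy and α < 1 raises it, which matches the abstract's description of sharpening versus flattening.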

[22] AutoScreen-FW: An LLM-based Framework for Resume Screening

Zhelin Xu,Shuhei Yamamoto,Atsuyuki Morishima

Main category: cs.CL

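TL;DR: This paper proposes AutoScreen-FW, a locally deployable automated resume screening framework built on open-source large language models (LLMs). It improves judgment performance through representative sample selection and in-context learning, protecting privacy while reducing recruiters' workload.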

Details Motivation: Corporate recruiters must screen many resumes in limited time, which is burdensome and can cause suitable candidates to be missed; existing LLM approaches depend on commercial models that pose privacy risks, and publicly available resumes with evaluation results, which could guide sample selection, are lacking. Method: Propose the AutoScreen-FW framework, which uses several strategies to select a small set of representative resume samples and combines them with a persona description and evaluation criteria for in-context learning, so that open-source LLMs act as a career advisor and evaluate unseen resumes. Result: Experiments show that the open-source LLM judges consistently outperform GPT-5-nano under multiple ground truths and surpass GPT-5-mini under one ground-truth setting; under the other settings they are slightly weaker than GPT-5-mini but process each resume substantially faster. Conclusion: AutoScreen-FW has potential for local deployment, improving screening efficiency and reducing recruiters' burden while protecting data privacy. Abstract: Corporate recruiters often need to screen many resumes within a limited time, which increases their burden and may cause suitable candidates to be overlooked. To address these challenges, prior work has explored LLM-based automated resume screening. However, some methods rely on commercial LLMs, which may pose data privacy risks. Moreover, since companies typically do not make resumes with evaluation results publicly available, it remains unclear which resume samples should be used during learning to improve an LLM's judgment performance. To address these problems, we propose AutoScreen-FW, an LLM-based framework for local, automated resume screening. AutoScreen-FW uses several methods to select a small set of representative resume samples. These samples are used for in-context learning together with a persona description and evaluation criteria, enabling open-source LLMs to act as a career advisor and evaluate unseen resumes. Experiments with multiple ground truths show that the open-source LLM judges consistently outperform GPT-5-nano. Under one ground-truth setting, they also surpass GPT-5-mini. Although they are slightly weaker than GPT-5-mini under other ground-truth settings, they run substantially faster per resume than commercial GPT models. These findings indicate the potential for deploying AutoScreen-FW locally in companies to support efficient screening while reducing recruiters' burden.

[23] TopoChunker: Topology-Aware Agentic Document Chunking Framework

Xiaoyu Liu

Main category: cs.CL

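TL;DR: This paper proposes TopoChunker, an agentic document chunking framework that builds a Structured Intermediate Representation (SIR) to preserve a document's intrinsic topology, mitigating the semantic fragmentation caused by linear chunking. It reports state-of-the-art results on several benchmarks with lower token overhead.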

Details Motivation: Existing document chunking methods for RAG linearize text, which destroys the document's inherent topological hierarchy, causes semantic fragmentation, and harms retrieval quality. Method: Propose the TopoChunker framework, consisting of an Inspector Agent (which dynamically selects cost-optimized extraction paths) and a Refiner Agent (which performs capacity auditing and topological context disambiguation), and map documents onto a Structured Intermediate Representation (SIR) that explicitly models cross-segment dependencies. Result: On GutenQA and GovReport, TopoChunker exceeds the strongest LLM-based baseline by 8.0% in absolute generation accuracy, reaches 83.26% Recall@3, and reduces token overhead by 23.5%. Conclusion: TopoChunker offers a scalable approach that balances structural fidelity with computational cost, advancing structure-aware RAG. Abstract: Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.

[24] TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai,Qiang Zhang,Hanqing Zeng,Yunkai Zhang,Dipesh Tamboli,Xiangjun Fan,Zhuokai Zhao,Lizhu Zhang

Main category: cs.CL

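TL;DR: This paper proposes Token-level Adaptive Routing (TARo), a fine-grained, reward-guided method that steers frozen large language models toward structured reasoning at inference time without retraining the base model, improving reasoning performance across mathematics, clinical reasoning, and instruction following.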

Details Motivation: Existing test-time alignment methods mainly target preference alignment, and lightweight alignment approaches for reasoning are lacking, while conventional reasoning optimization depends on expensive post-training. Method: First train reward models on step-wise mathematical reasoning traces to capture fine-grained logical consistency, then introduce a learnable token-level router that dynamically controls how strongly the reward signal guides the base model at inference time. Result: TARo improves reasoning performance on multiple benchmarks, by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods; it also generalizes to clinical reasoning (MedXpertQA) and instruction following (AlpacaEval), and transfers from small to large backbones without retraining. Conclusion: TARo extends test-time alignment from preference optimization to robust, cross-domain reasoning, showing that a frozen model can achieve high-quality structured reasoning at inference time. Abstract: Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.

[25] Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Masayuki Kawarada,Tatsuya Ishigaki,Hiroya Takamura

Main category: cs.CL

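TL;DR: This paper introduces the first benchmark for evaluating task interference in multimodal large language models and finds that interference is directional, with performance dropping most when the conversation switches from text-only tasks to image-based tasks.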

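Details Motivation: Although multimodal dialogue systems are increasingly common, task interference has so far been studied only in text-only settings, and a systematic evaluation in multimodal settings is missing. Method: Build a benchmark covering six text and vision tasks that systematically varies history-target task combinations along three axes (modality mismatch, reasoning mismatch, and answer format mismatch), and run experiments on open-weights and proprietary multimodal LLMs. Result: Task interference is strongly directional: switching from text to image targets causes severe performance drops, while the reverse has little effect; co-occurring mismatches amplify interference; modality differences are the main driver, followed by answer format, with shifts in reasoning requirements mattering least. Conclusion: Task interference cannot be ignored in multimodal dialogue, and its directionality and dominant factors offer guidance for designing robust multimodal systems. Abstract: Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target combinations along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.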

[26] Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

Asmita Bhardwaj,Yuya Jeremy Ong,Eelaaf Zahid,Basel Shbita

Main category: cs.CL

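TL;DR: This paper proposes a reinforcement learning-based decoder sampler that improves LLM generation quality by adjusting sampling parameters (such as temperature/top-p) at test time without updating model weights. It outperforms greedy and static sampling baselines on several summarization datasets and shows that composite reward functions work better than overlap-only rewards.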

Details Motivation: Existing decoding strategies (such as greedy or fixed temperature/top-p decoding) are static and task-agnostic, and adapt poorly to domains that demand stylistic or structural flexibility, leading to inconsistent or suboptimal generation quality. Method: Model decoding as a sequential decision problem and design a lightweight reinforcement learning policy that adjusts sampling parameters at test time; train it with a composite reward containing structured shaping terms for length, coverage, repetition, and completeness. Result: On summarization datasets including BookSum, arXiv, and WikiHow, using Granite-3.3-2B and Qwen-2.5-0.5B, relative gains over baselines reach up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen); ablations show composite rewards outperform overlap-only rewards, and the structured shaping terms yield stable gains. Conclusion: Reinforcement learning is a practical mechanism for test-time adaptation of decoding, enabling domain-aware and user-controllable generation without retraining large models. Abstract: Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluate on summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.

[27] UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference

Lang Zhou,Shuxuan Li,Zhuohao Li,Shi Liu,Zhilin Zhao,Wei-Shi Zheng

Main category: cs.CL

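TL;DR: This paper proposes UT-ACA, a framework that dynamically adjusts the context window at inference time according to each token's uncertainty, in order to mitigate attention dilution and out-of-distribution degradation in long-context inference.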

Details Motivation: In long-context inference, a fixed context budget cannot accommodate the highly non-uniform contextual demands of individual tokens, leading to attention dilution and degraded performance. Method: Propose the Uncertainty-Triggered Adaptive Context Allocation (UT-ACA) framework: learn an uncertainty detector that combines semantic embeddings with logit-based confidence and, when uncertainty is high, roll back, expand the context, and regenerate the token. Result: Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings. Conclusion: Dynamic, uncertainty-driven context allocation is an efficient and robust optimization strategy for long-context inference. Abstract: Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.

[28] GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms

Masayuki Kawarada,Kodai Watanabe,Soichiro Murakami

Main category: cs.CL

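TL;DR: This paper introduces GAIN, a benchmark for evaluating how large language models balance norm adherence against business goals when norms are imperfect. It defines five types of contextual pressure to analyze the factors behind decisions and runs experiments in four business domains, finding that under Personal Incentive pressure models diverge significantly from human decision patterns and lean toward adhering to norms.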

Details Motivation: Existing benchmarks mostly focus on abstract scenarios, offer little coverage of real business applications, and reveal little about the factors that drive LLM decisions, which limits assessment of how models adapt to complex norm-goal conflicts. Method: Build the GAIN benchmark with 1,200 scenarios across four domains (hiring, customer support, advertising, and finance); each scenario provides a goal, a situation, a norm, and one of five explicitly designed pressures (Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, Personal Incentive) to study decision behavior systematically. Result: Advanced LLMs frequently mirror human decision-making patterns but differ significantly under Personal Incentive pressure, where they strongly prefer adhering to norms over deviating from them. Conclusion: GAIN reveals how LLMs decide under norm-goal conflicts, in particular their strong norm adherence under personal-incentive pressure, and adds a new evaluation dimension for improving the trustworthiness and adaptability of models in real business settings. Abstract: We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing benchmarks typically focus on abstract scenarios rather than real-world business applications. Furthermore, they provide limited insights into the factors influencing LLM decision-making. This restricts their ability to measure models' adaptability to complex, real-world norm-goal conflicts. In GAIN, models receive a goal, a specific situation, a norm, and additional contextual pressures. These pressures, explicitly designed to encourage potential norm deviations, are a unique feature that differentiates GAIN from other benchmarks, enabling a systematic evaluation of the factors influencing decision-making. We define five types of pressures: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising and finance. Our experiments show that advanced LLMs frequently mirror human decision-making patterns. However, when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.

[29] WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

Haonan Yu,Junhao Liu,Zhenyu Yan,Haoran Lin,Xin Zhang

Main category: cs.CL

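TL;DR: WASD is a new framework that explains and controls large language model behavior by identifying sufficient neural conditions for token generation, yielding explanations that are more stable, accurate, and concise than conventional approaches.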

Details Motivation: Existing methods for controlling LLM behavior often incur high training costs, lack natural language controllability, or compromise semantic coherence, so an efficient, interpretable, and controllable alternative is needed. Method: Propose the WASD framework, which represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal sufficient set that still guarantees the current output under input perturbations. Result: On SST-2 and CounterFact, WASD produces explanations that are more stable, accurate, and concise than conventional attribution graphs; a case study validates its effectiveness in controlling cross-lingual output generation. Conclusion: WASD provides an interpretable, controllable, and efficient neural-level account of LLM behavior and improves the precision and practicality of behavioral control. Abstract: Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.

[30] The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices

Esteban Garces Arias,Nurzhan Sapargali,Christian Heumann,Matthias Aßenmacher

Main category: cs.CL

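TL;DR: This paper exposes a fundamental difference between likelihood-based decoding strategies (such as top-k and nucleus sampling) and human language production. The difference creates a "truncation blind spot": contextually appropriate but statistically rare tokens that humans can choose but machines cannot reach, which makes machine-generated text easier to detect.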

Details Motivation: Standard decoding strategies rely on the statistical likelihood of tokens, whereas human language production favors communicative appropriateness; this mismatch may make machine text easy to identify. Method: Analyze over 1.8 million texts across 8 language models, 5 decoding strategies, and 53 hyperparameter configurations, examine whether human-selected tokens fall outside typical truncation boundaries, and train simple classifiers to assess detectability. Result: 8-18% of human-selected tokens fall outside typical truncation boundaries; predictability and lexical diversity alone yield high detection rates; truncation parameters, rather than model scale or architecture, account for most of the variance in detectability; low-detectability configurations often produce incoherent text. Conclusion: The detectability of machine text stems mainly from likelihood-driven token selection itself rather than from limited model capability; evading detection and producing natural text are distinct objectives. Abstract: Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.
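
A minimal sketch of what "falling outside the truncation boundary" means for top-k and nucleus (top-p) sampling. It uses the standard definitions of the two truncation rules; the next-token distribution, the token strings, and the "human choice" are invented for illustration and are not taken from the paper's data.

```python
def top_k_set(probs, k):
    """Tokens kept by top-k truncation: the k most probable tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])

def nucleus_set(probs, p):
    """Tokens kept by nucleus truncation: the smallest high-probability
    prefix whose cumulative mass reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, mass = set(), 0.0
    for tok in ranked:
        kept.add(tok)
        mass += probs[tok]
        if mass >= p:
            break
    return kept

# Hypothetical next-token distribution and a rare but plausible human choice.
next_token = {"the": 0.50, "a": 0.30, "this": 0.15, "yonder": 0.05}
human_choice = "yonder"

in_top_k = human_choice in top_k_set(next_token, k=2)
in_nucleus = human_choice in nucleus_set(next_token, p=0.9)
print("reachable under top-k (k=2):", in_top_k)
print("reachable under nucleus (p=0.9):", in_nucleus)
```

Here the rare token has non-zero model probability yet can never be sampled under either setting, which is the blind spot the abstract describes.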

[31] EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Minsoo Cheong,Donghyun Son,Woosang Lim,Sungjoo Yoo

Main category: cs.CL

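TL;DR: This paper proposes EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute, substantially speeding up inference for diffusion-based large language models (dLLMs).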

Details Motivation: Diffusion-based large language models (dLLMs) use bidirectional attention, which rules out lossless KV caching and forces a full forward pass at every denoising step; the decision overhead of existing approximate KV caching methods grows with context length or model depth, limiting efficiency. Method: EntropyCache builds on two empirical observations: (1) decoded token entropy correlates with KV cache drift and so serves as a cheap proxy for cache staleness; (2) feature volatility of decoded tokens persists for several steps after unmasking, so the $k$ most recently decoded tokens should be recomputed. The skip-or-recompute decision needs only $O(V)$ computation per step, independent of context length and model scale. Result: On LLaDA-8B-Instruct and Dream-7B-Instruct, EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead of only $0.5\%$ of inference time. Conclusion: EntropyCache is an efficient, lightweight, training-free KV caching strategy that reduces the high compute cost that bidirectional attention imposes on dLLM inference, offering a practical path to deployment. Abstract: Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
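
The skip-or-recompute signal described in the abstract can be sketched as a maximum-entropy check over the distributions of newly decoded tokens. This is an illustrative reading only, not the authors' implementation: the threshold value, the direction of the comparison (high entropy triggers recomputation), and all names below are assumptions.

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of one token distribution; one pass over the vocabulary."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def should_recompute(new_token_dists, threshold):
    """Assumed rule: recompute cached states when the most uncertain newly
    decoded token exceeds a (hypothetical) entropy threshold."""
    return max(entropy(d) for d in new_token_dists) > threshold

# Toy distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
THRESHOLD = 0.5  # hypothetical value chosen for the example

skip_step = should_recompute([confident], THRESHOLD)
refresh_step = should_recompute([confident, uncertain], THRESHOLD)
print("recompute after confident tokens only:", skip_step)
print("recompute when an uncertain token appears:", refresh_step)
```

The cost of this check depends only on the vocabulary size and the number of newly decoded tokens, not on the context length, which is in keeping with the constant-cost property the abstract emphasises.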

[32] When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Abhinaba Basu,Pavan Chakraborty

Main category: cs.CL

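TL;DR: This paper proposes ICE-Guard, a framework that uses intervention consistency testing to detect LLM reliance on spurious features (authority, framing, demographics) in high-stakes decisions. Evaluating 11 models on 3,000 vignettes, it finds that authority and framing bias far exceed demographic bias, that structured decomposition substantially reduces bias, and that ICE-guided iterative prompt patching achieves a 78% bias reduction.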

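Details Motivation: Large language models (LLMs) are increasingly used for high-stakes decisions, yet their reliance on spurious features is poorly characterized, and existing work focuses heavily on demographic bias while neglecting other types. Method: Propose the ICE-Guard framework, which applies intervention consistency testing to identify three kinds of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements); evaluate 11 LLMs on 3,000 vignettes across 10 high-stakes domains; introduce structured decomposition (feature extraction plus a deterministic rubric) and an ICE-guided detect-diagnose-mitigate-verify loop for prompt optimization. Result: (1) Authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%); (2) bias concentrates in specific domains, e.g. 22.6% authority bias in finance versus only 2.8% in criminal justice; (3) structured decomposition reduces flip rates by up to 100% (median 49%); (4) ICE-guided iterative prompt patching achieves a cumulative 78% bias reduction; (5) validation on real COMPAS recidivism data yields flip rates above the synthetic benchmark, suggesting the benchmark gives a conservative estimate. Conclusion: Reliance on spurious features is broader and more heterogeneous than previously recognized and needs systematic evaluation beyond demographics; structured reasoning and intervention-driven prompt optimization are effective mitigation paths; ICE-Guard offers a reproducible, extensible approach to bias detection and mitigation for high-stakes LLM applications. Abstract: Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.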
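
The flip rates quoted in the abstract can be read as the share of vignettes whose decision changes when only a spurious feature is swapped. A minimal sketch under that reading follows; the decision labels and the tiny data set are made up, and the actual ICE-Guard metric may be defined differently.

```python
def flip_rate(decision_pairs):
    """Fraction of (original, intervened) decision pairs that disagree."""
    if not decision_pairs:
        raise ValueError("need at least one decision pair")
    flips = sum(1 for original, intervened in decision_pairs if original != intervened)
    return flips / len(decision_pairs)

# Hypothetical model decisions on four loan vignettes, before and after
# swapping only the applicant's stated credential (an authority intervention).
authority_pairs = [
    ("approve", "approve"),
    ("deny", "approve"),   # the decision flipped although the case facts are identical
    ("deny", "deny"),
    ("approve", "approve"),
]

rate = flip_rate(authority_pairs)
print(f"authority flip rate: {rate:.0%}")
```

A model that ignores the swapped feature would score 0% here; any positive value signals that the decision depends on information that should be irrelevant.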

[33] Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

Ivaxi Sheth,Zeno Jonke,Amin Mantrach,Saab Mansour

Main category: cs.CL

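TL;DR: This paper proposes a decomposition-based framework for cross-lingual LLM evaluation that builds a Universal Criteria Set (UCS), enabling interpretable, low-cost cross-lingual evaluation without human annotations in the target language.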

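Details Motivation: Existing LLM evaluation methods are predominantly English-focused, and the scarcity and cost of human-annotated judgments in other languages make cross-lingual evaluation difficult. Method: Propose a decomposition-based evaluation framework centered on a language-agnostic Universal Criteria Set (UCS), which breaks the evaluation task into shared, interpretable intermediate dimensions and supports cross-lingual transfer with minimal supervision. Result: In experiments on faithfulness tasks across multiple languages and model backbones, the method consistently outperforms strong baselines without target-language annotations. Conclusion: The UCS framework reduces the dependence of cross-lingual evaluation on human annotation, improves interpretability and generalization, and offers a new approach to multilingual LLM evaluation. Abstract: As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.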

[34] ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

Abhinaba Basu,Pavan Chakraborty

Main category: cs.CL

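TL;DR: This paper proposes the ICE framework, which assesses the faithfulness of model explanations through randomization tests under multiple intervention operators, and finds that faithfulness depends on the intervention operator and is unrelated to human plausibility.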

Details Motivation: Existing benchmarks use single interventions without statistical testing, so they cannot distinguish genuine faithfulness from chance-level performance. Method: Introduce the ICE framework, which compares explanations against matched random baselines via randomization tests and computes win rates with confidence intervals under multiple intervention operators. Result: Evaluation on 7 LLMs, 4 English tasks, 6 non-English languages, and 2 attribution methods shows that faithfulness is operator-dependent, that anti-faithfulness appears in some configurations, and that faithfulness is uncorrelated with human plausibility. Conclusion: Faithfulness should be interpreted comparatively across intervention operators rather than as a single score; the ICE framework and the ICEBench benchmark have been released. Abstract: Evaluating whether explanations faithfully reflect a model's reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.
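
The idea of comparing an explanation against matched random baselines can be sketched with a toy deletion operator. Everything below is a simplified stand-in: the "model" is a hypothetical additive scorer, the baselines are random token subsets of the same size, and confidence intervals are omitted. It is not the ICE implementation.

```python
import random

# Hypothetical per-token contributions to a toy model's confidence.
TOY_WEIGHTS = {"not": 3.0, "good": 2.0, "film": 0.5, "at": 0.2, "the": 0.1, "was": 0.1}

def confidence_drop(weights, removed):
    """Deletion operator on the toy model: confidence lost when tokens are removed."""
    return sum(weights[tok] for tok in removed)

def win_rate(weights, explanation, n_baselines=200, seed=0):
    """Share of size-matched random deletions that the explanation strictly beats."""
    rng = random.Random(seed)
    tokens = sorted(weights)
    target = confidence_drop(weights, explanation)
    wins = 0
    for _ in range(n_baselines):
        baseline = rng.sample(tokens, len(explanation))
        if target > confidence_drop(weights, baseline):
            wins += 1
    return wins / n_baselines

faithful = win_rate(TOY_WEIGHTS, ["not", "good"])   # names the truly influential tokens
unfaithful = win_rate(TOY_WEIGHTS, ["the", "was"])  # names tokens that barely matter
print("win rate, faithful explanation:", faithful)
print("win rate, unfaithful explanation:", unfaithful)
```

A win rate near the chance level, or below it as in the second case, is what a single uncalibrated score would hide; repeating the comparison with other operators (for example, replacement instead of deletion) is the operator-dependence check the abstract argues for.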

[35] Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors

Yusuke Takase,Momose Oyama,Hidetoshi Shimodaira

Main category: cs.CL

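TL;DR: This paper proposes a representation of language models based on log-likelihood vectors and PMI vectors, builds model maps to quantify differences between conditional distributions, and reveals structural relationships among models as well as the effects of prompt engineering.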

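Details Motivation: Existing methods struggle to systematically characterize how language models' conditional distributions differ across prompts and how the models relate to one another globally; an interpretable, measurable framework for analyzing model behavior is needed. Method: Represents each language model as a log-likelihood vector over prompt-response pairs and builds a model map in which Euclidean distance approximates KL divergence; introduces pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions. Result: On a large collection of public language models, the maps reflect model attributes, task performance, and the systematic shifts induced by prompt modifications; in some cases, PMI vectors better capture training-data-related differences. Conclusion: The framework provides an interpretable, quantitative tool for analyzing input-dependent language model behavior and supports modeling and predicting the effects of prompt engineering. Abstract: We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.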
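
A small sketch may make the vector representation concrete. It assumes each model has already been scored on the same fixed list of prompt-response pairs, so that entry i of a model's vector is log p(response_i | prompt_i). The helper names are hypothetical, and the plain Euclidean distance below omits any centering or scaling the authors may apply before relating distance to KL divergence.

```python
import math


def distance(u, v):
    """Euclidean distance between two models' log-likelihood vectors,
    computed over the same ordered list of prompt-response pairs."""
    if len(u) != len(v):
        raise ValueError("vectors must cover the same prompt-response pairs")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pmi_vector(conditional_ll, unconditional_ll):
    """Pointwise mutual information per pair:
    log p(response | prompt) - log p(response)."""
    return [c - u for c, u in zip(conditional_ll, unconditional_ll)]


# Made-up log-likelihoods for two models on three pairs.
model_a = [-12.0, -8.5, -20.1]
model_b = [-11.0, -9.5, -19.1]
gap = distance(model_a, model_b)

# PMI removes how likely the response is regardless of the prompt.
pmi_a = pmi_vector([-12.0, -8.5], [-15.0, -9.0])
```

Subtracting the unconditional log-likelihood is what lets the PMI variant discount responses that a model finds likely regardless of the prompt.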

[36] Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media

Thi Huyen Nguyen,Koustav Rudra,Wolfgang Nejdl

Main category: cs.CL

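TL;DR: This paper proposes an interpretable multimodal classification framework that uses cross-modal rationale transfer (text to image) to jointly classify and explain text-image posts in crisis settings, substantially improving classification performance and rationale quality while generalizing in zero-shot mode.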

Details Motivation: Existing crisis-information classifiers lack interpretability, and in particular lack effective rationale extraction for images, which limits real-world deployment; existing explainable methods focus mostly on text and struggle to provide joint text-image explanations while keeping annotation costs low. Method: Learns a joint text-image representation with a visual language transformer, first extracts text rationales, then derives image rationales through a cross-modal mapping (cross-modal rationale transfer), and finally classifies based on the rationales from both modalities. Result: On the CrisisMMD dataset, Macro-F1 improves by 2-35%; human evaluation shows a 12% improvement in the quality of image rationale patches; zero-shot transfer reaches 80% accuracy. Conclusion: The proposed interpretable multimodal framework extracts text and image rationales jointly, efficiently, and accurately, balancing performance, interpretability, and generalization, and is suited to real crisis-response scenarios. Abstract: Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.

[37] DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

Maxime Poli,Manel Khentout,Angelo Ortiz Tandazo,Ewan Dunbar,Emmanuel Chemla,Emmanuel Dupoux

Main category: cs.CL

TL;DR: DiscoPhon is a new multilingual benchmark for unsupervised phoneme discovery from discrete speech units, covering 12 languages and evaluating unit quality, recognition, and segmentation; it includes four pretrained baselines (HuBERT/SpidR) showing phonemic information is accessible but varies across languages.

Details Motivation: To address the lack of standardized evaluation for unsupervised phoneme discovery across diverse languages, enabling fair comparison of models on phonemic alignment without language-specific supervision. Method: Constructing DiscoPhon — a benchmark with 6 dev and 6 test languages — where systems discover discrete units from only 10 hours of unlabeled speech per language and map them to a predefined phoneme inventory via many-to-one or one-to-one assignment; evaluation uses unit quality, recognition, and segmentation metrics. Result: Current multilingual models (HuBERT, SpidR) yield discrete units that correlate well with phonemes overall, but correlation strength varies significantly across languages. Conclusion: DiscoPhon enables rigorous cross-lingual evaluation of phoneme discovery; results confirm phonemic information is embedded in modern self-supervised speech models, yet robustness across phonemic diversity remains a challenge. Abstract: We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.

[38] Learning to Self-Evolve

Xiaoyin Chen,Canwen Xu,Yite Wang,Boyi Liu,Zhewei Yao,Yuxiong He

Main category: cs.CL

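TL;DR: This paper proposes Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models to improve their own contexts at test time, substantially improving Text-to-SQL and question-answering performance and transferring across models.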

Details Motivation: Existing self-evolution methods rely on the model's inherent reasoning ability and never explicitly train the model to refine its own context; this paper aims to treat self-evolution as a learnable skill. Method: Reduces multi-step context evolution to a single-step reinforcement learning objective in which each edit is rewarded by the improvement in downstream performance, paired with a tree-guided evolution loop. Result: On BIRD and MMLU-Redux, a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5 as well as prompt optimization methods such as GEPA and TextGrad, and can guide other models without additional training. Conclusion: Treating test-time self-evolution as a learnable skill is effective, and LSE offers a new paradigm for improving LLM adaptability. Abstract: We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

[39] A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

Aram Abrahamyan,Sachin Kumar

Main category: cs.CL

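TL;DR: This paper presents a comparative empirical study of catastrophic forgetting in continual learning (CL) for intent classification, using the CLINC150 dataset and three backbones (ANN, GRU, Transformer) to evaluate representative CL methods (MIR, LwF, HAT) and their combinations; the results show that replay (especially MIR) is key to mitigating forgetting and that the best CL strategy depends on the model architecture.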

Details Motivation: Neural language models deployed in practice must keep adapting to new tasks and domains without forgetting earlier knowledge, and catastrophic forgetting is the central challenge in continual intent classification. Method: Builds a 10-task, label-disjoint continual learning scenario on CLINC150, compares three backbones (ANN, GRU, Transformer), and evaluates three families of CL methods individually and in combination (replay-based MIR, regularization-based LwF, parameter-isolation HAT), using average accuracy, macro F1, and backward transfer as metrics. Result: All models forget severely under naive sequential fine-tuning; no single CL method fully prevents forgetting; combinations that include MIR (MIR+HAT, MIR+LwF, MIR+LwF+HAT) are the most robust, with near-zero or mildly positive backward transfer; the best combination depends on the architecture, with MIR+HAT for ANN and Transformer and MIR+LwF+HAT for GRU; some CL combinations even surpass joint training, indicating a regularization effect. Conclusion: Replay is the key ingredient for mitigating catastrophic forgetting in continual intent classification, and the backbone and the CL strategy must be chosen together rather than independently; the study offers empirical guidance for building practical continual-learning intent recognition systems. Abstract: Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent. MIR+HAT yields the best result for ANN and Transformer, whereas MIR+LwF+HAT works best for GRU; in several cases, CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
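
Backward transfer, one of the three reported metrics, is easy to misread, so a short sketch may help. It uses the commonly cited definition from the continual-learning literature (the mean change in accuracy on each earlier task between the moment it was learned and the end of training). The abstract does not state the exact formula the authors use, so this definition and the accuracy matrix below are assumptions for illustration only.

```python
def backward_transfer(acc):
    """acc[t][i] is accuracy on task i after training on tasks 0..t.
    Returns the mean of (final accuracy - accuracy right after learning)
    over all tasks except the last. Negative values indicate forgetting."""
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("need at least two tasks")
    final = acc[-1]
    deltas = [final[i] - acc[i][i] for i in range(n_tasks - 1)]
    return sum(deltas) / len(deltas)


def average_accuracy(acc):
    """Mean accuracy over all tasks after the final training stage."""
    final = acc[-1]
    return sum(final) / len(final)


# Made-up 3-task run: rows are training stages, columns are tasks.
acc = [
    [0.90, 0.00, 0.00],
    [0.70, 0.80, 0.00],
    [0.60, 0.70, 0.90],
]
bwt = backward_transfer(acc)   # accuracy lost on earlier tasks, on average
avg = average_accuracy(acc)
```

A value near zero means earlier tasks were preserved, and a mildly positive value means later training actually helped them, which is the behaviour the abstract reports for the replay-based combinations.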

[40] Automatic detection of Gen-AI texts: A comparative framework of neural models

Cristian Buttaro,Irene Amerini

Main category: cs.CL

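TL;DR: This paper designs and evaluates four machine-learning detectors for AI-generated text (MLP, 1D-CNN, MobileNet-CNN, Transformer), compares them against several commercial detection tools on multilingual (English/Italian) and topic-specific (Art and Mental Health) datasets, and finds that supervised detectors are more stable and robust than commercial tools across languages and domains.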

Details Motivation: The rapid progress of large language models has made human-written and AI-generated text hard to tell apart, creating serious challenges in academic, editorial, and social settings and a need for reliable, robust detection methods. Method: Implements four supervised neural detectors (MLP, 1D-CNN, MobileNet-CNN, Transformer), trains and tests them on the COLING Multilingual Dataset (English and Italian) and on an original thematic dataset on Art and Mental Health, and compares them with eight widely used online detectors, including ZeroGPT and GPTZero. Result: The supervised detectors perform more stably and robustly across languages and domains and clearly outperform most commercial tools; performance differs between architectures, exposing key strengths and limitations of current detection strategies. Conclusion: Custom detectors based on supervised learning generalize better and are more reliable for AI-text detection; future work should focus on cross-domain adaptation and adversarial robustness. Abstract: The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI generated text detection through the design, implementation, and comparative evaluation of multiple machine learning based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, Originality.AI, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.

[41] Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Rudra Jadhav,Janhavi Danve,Sonalika Shaw

Main category: cs.CL

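TL;DR: This study finds that large language models (LLMs) used as automated graders show significant implicit grading bias on essay tasks: even when the content is correct, surface-level style differences such as grammar errors, informal language, or non-native phrasing systematically lower the score, whereas mathematics and programming tasks show almost no such bias.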

Details Motivation: As LLMs are increasingly used as automated graders in education, the fairness of their scoring becomes critical. The paper asks whether LLMs show implicit grading bias driven by a student's writing style (grammar, register, native-language background) when content correctness is held constant. Method: Builds a controlled dataset of 180 student responses covering mathematics, programming, and essay writing, with three surface-level perturbations applied (grammar errors, informal language, non-native phrasing); two open-source models, LLaMA 3.3 70B and Qwen 2.5 72B, grade on a 1-10 scale under explicit instructions to score content correctness only and ignore writing style; the significance and magnitude of the bias are analyzed with statistical tests (p-values) and Cohen's d effect sizes. Result: On essay tasks, both models show statistically significant bias for all three perturbation types (p < 0.05), with effect sizes from medium to very large (d = 0.64-4.25); informal language is penalized most heavily (LLaMA deducts 1.90 points on average, Qwen 1.20), followed by non-native phrasing; on mathematics and programming tasks the bias is weak and mostly not significant. Conclusion: LLM grading bias is subject-dependent and style-sensitive and cannot be removed by a simple prompt instruction; it poses a real risk to educational fairness, and systematic bias audits are needed before institutional deployment. Abstract: As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
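
Since the reported effect sizes are Cohen's d values, a short sketch of the computation may help readers interpret them. It uses the standard pooled-standard-deviation form for two independent groups. Whether the authors used this variant or a paired one is not stated in the abstract, and the scores below are invented for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation:
    (mean_a - mean_b) / sqrt(pooled sample variance)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Invented grades: the same essays in original and informal wording.
original_scores = [8, 9, 10]
informal_scores = [6, 7, 8]
d = cohens_d(original_scores, informal_scores)
```

By the usual convention, d around 0.5 is a medium effect and d above 0.8 is large, so a value such as 4.25 means the two score distributions barely overlap.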

[42] Mi:dm K 2.5 Pro

KT Tech innovation Group

Main category: cs.CL

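TL;DR: Mi:dm K 2.5 Pro is a 32B-parameter Korean LLM aimed at complex enterprise tasks; through reasoning-focused optimization, long-context support (128K), multi-stage post-training (including asynchronous RL and Fusion Training), and high-quality data curation, it reaches state-of-the-art results on Korean benchmarks while remaining safe and reliable.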

Details Motivation: Existing large models struggle to meet advanced requirements such as multi-step reasoning, long-context understanding, and agentic workflows in Korean-language and domain-specific enterprise scenarios, where scaling alone is no longer enough. Method: Builds a data curation pipeline based on AST analysis, gap-filling synthesis for mathematics, and LLM-based quality evaluation; pre-training uses Depth Upscaling (DuS) and a progressive strategy to support a 128K context window; post-training combines reasoning-focused supervised fine-tuning, model merging, and asynchronous reinforcement learning, after which Fusion Training balances reasoning ability with conversational fluency, consistent response style, and reliable tool use. Result: Achieves state-of-the-art results on Korean-specific benchmarks and overall performance competitive with leading global and domestic models; passes Responsible AI evaluations, combining safety against attacks with responsiveness. Conclusion: Mi:dm K 2.5 Pro shows that deep, reasoning-focused optimization for a specific language and enterprise setting is more effective than scaling alone, offering a new paradigm for Korean LLM development. Abstract: The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.

[43] Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

Maria Milkova,Maksim Rudnev

Main category: cs.CL

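TL;DR: This study proposes a multi-stage classification framework for detecting human values in noisy Russian social media text. Grounded in Schwartz's theory, it combines LLM annotation, soft-label aggregation, and transformer models (XLM-RoBERTa large), is validated on 7.5 million posts, reaches an F1 macro of 0.83, and reveals patterns and cultural differences in how values are expressed on Russian social networks.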

Details Motivation: Accurately detecting human values in noisy Russian social media is difficult, and expert annotation is subjective and uncertain, so the approach has to balance interpretability, annotation quality, and cultural specificity. Method: Builds a multi-stage pipeline: filtering of spam and non-personal content, selection of value-relevant and politically relevant posts, annotation by LLMs (such as GPT), aggregation of multiple LLM judgments into soft labels, and training of multi-label transformer models (XLM-RoBERTa large and others) to predict the probability of each of the 10 basic values; expert annotations are treated as an interpretative benchmark with its own uncertainty rather than as absolute ground truth. Result: XLM-RoBERTa large reaches F1 macro = 0.83 and F1 = 0.71 on the test set; the model systematically overestimates the Openness to Change domain; the study reveals distinct patterns of value expression and co-occurrence on Russian social networks; all models are released publicly. Conclusion: Framing value detection as a multi-perspective interpretive task is more appropriate; soft labels and uncertainty modeling improve robustness; the framework provides a reproducible, scalable methodological basis for value analysis in cross-cultural digital environments. Abstract: This study presents a multi-stage classification framework for detecting human values in noisy Russian language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz's theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value relevant and politically relevant posts, LLM based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer based models capable of predicting the probability of each of the ten basic values. The best performing model, XLM RoBERTa large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held out test data. By treating value detection as a multi perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value based interpretation in digital environments. All models are released publicly.

[44] Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders

Yana Veitsman,Yihong Liu,Hinrich Schütze

Main category: cs.CL

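TL;DR: This paper shows that cross-lingual alignment and downstream task performance pursue mismatched objectives: raising embedding similarity alone does not guarantee better downstream results. Representational analysis confirms that the gradients of the alignment loss and the task loss are close to orthogonal, and the paper closes with practical recommendations for combining alignment with fine-tuning.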

Details Motivation: Prior work assumes that better cross-lingual alignment yields better cross-lingual transfer, yet in practice explicit alignment methods raise embedding similarity while often failing to improve token-level downstream performance, and the reason has been unclear. Method: Analyzes four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS tagging or sentence classification, using representational analyses that include embedding distances and the gradient similarities and gradient magnitudes of the task and alignment losses. Result: (1) Embedding distances do not reliably predict changes in task performance; (2) the gradients of the alignment loss and the task loss are often close to orthogonal, indicating that optimizing one objective contributes little to the other. Conclusion: Alignment and downstream task objectives are largely orthogonal, and the benefit varies by language and task, so "better" alignment does not necessarily bring "better" cross-lingual transfer; losses for joint training should be chosen with care. Abstract: Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques -- despite increasing embedding similarity -- frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS Tagging or Sentence Classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why "better" alignment often fails to translate into "better" cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
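
The "close to orthogonal" finding refers to the angle between two gradient vectors, which is usually measured with cosine similarity. The sketch below assumes each gradient has already been flattened into a plain list of floats; how the authors extract and aggregate gradients is not described in the abstract, and the numbers are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened gradient vectors.
    Near 0: the objectives barely interact. Near 1: they reinforce
    each other. Near -1: they conflict."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Invented gradients of the alignment loss and the task loss
# with respect to the same three parameters.
grad_alignment = [0.2, 0.0, -0.1]
grad_task = [0.0, 0.5, 0.0]
similarity = cosine_similarity(grad_alignment, grad_task)
```

Cosine similarity ignores vector length, which is presumably why the paper reports gradient magnitudes separately: two losses can be orthogonal and still differ greatly in how strongly each pulls on the shared parameters.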

[45] Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo

Carlos Rafael Catalan,Patricia Nicole Monderin,Lheane Marie Dizon,Gap Estrella,Raymund John Sarmimento,Marie Antoinette Patalagsa

Main category: cs.CL

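TL;DR: This paper examines how well current language learning apps (such as Duolingo) support professional contexts. A survey of employees at a multinational company in the Philippines finds that general-scenario lessons build a solid foundation but that the lack of professional context makes professional-level fluency hard to reach; the authors recommend combining personalized, domain-specific lessons with general foundational ones.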

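Details Motivation: Existing language learning apps (such as Duolingo) mainly offer lessons on general everyday scenarios and provide little support for profession- or domain-specific contexts, which makes it hard for users to reach professional-level fluency, meaning the ability to communicate work-related and domain-specific information comfortably in the target language. Method: Conducts an interview-style survey of 5 employees at a multinational company in the Philippines, examining how often they encountered general versus work-related scenarios in Duolingo, how effective they found them, and what lesson content they would suggest, and analyzes the responses in aggregate. Result: Respondents found general scenarios (such as greetings and ordering food) more frequent, more relatable, and useful for building grammar, vocabulary, and cultural foundations; work-related scenarios appeared less often but matter for developing domain vocabulary and professional fluency; each participant proposed different ideal lesson scenarios, underscoring the need for personalization. Conclusion: Language learning apps should use a hybrid lesson-generation strategy: keep building foundational skills through general, relatable scenarios, and use LLMs to generate personalized, domain-specific lessons matched to the user's professional background, closing the gap between general learning and professional fluency. Abstract: Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to comfortably communicate various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter helps bridge the gap toward professional fluency as it contains domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in contexts when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual's needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.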

[46] A Human-in/on-the-Loop Framework for Accessible Text Generation

Lourdes Moreno,Paloma Martínez

Main category: cs.CL

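TL;DR: This paper proposes a hybrid framework that explicitly builds human participation into LLM-based accessible text generation. It combines Human-in-the-Loop (intervention during generation) with Human-on-the-Loop (supervision after generation) and uses standards-aligned checklists, event-triggered rules, and measurable accessibility KPIs to make text simplification traceable, reproducible, and auditable, improving model adaptation and the transparency and inclusiveness of NLP systems.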

Details Motivation: Current automatic text simplification and evaluation pipelines are heavily automated and metric-driven and fail to reflect actual user comprehension or normative standards, so human-centered mechanisms are needed to safeguard cognitive accessibility. Method: Builds a hybrid framework that combines Human-in-the-Loop (HiTL) and Human-on-the-Loop (HoTL); empirical user studies and annotated resources are turned into three kinds of instruments: (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for expert intervention, and (iii) accessibility Key Performance Indicators (KPIs). Result: The framework embeds the human role structurally in both the generation and supervision stages, supports a traceable, reproducible, and auditable process for generating and evaluating accessible text, and provides structured feedback for continued improvement of model adaptation. Conclusion: Embedding explainability and ethical accountability as core design principles of NLP systems is a key route to more transparent, inclusive, and practically usable text simplification; the framework offers a methodological template for human-centered, AI-assisted production of accessible content. Abstract: Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated, metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.

[47] Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Vedant Pandya

Main category: cs.CL

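TL;DR: This paper proposes XKD-Dial, a training pipeline for explainable, citation-grounded dialogue generation in a bilingual English-Hindi setting. Four stages of progressive training, combined with several explainability analyses, markedly reduce hallucination and improve multilingual knowledge-grounded dialogue performance.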

Details Motivation: Existing knowledge-grounded dialogue systems are mostly limited to English, lack explicit citation mechanisms and transparency in decision-making, make factual claims hard to verify, and do not support multiple languages. Method: Proposes the four-stage XKD-Dial training pipeline: (1) multilingual adaptation; (2) supervised fine-tuning (SFT) on English dialogue with citation grounding; (3) bilingual dialogue SFT; (4) GRPO alignment with citation-aware rewards. Three post-hoc explainability analyses are applied systematically at every stage: cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding. Result: (i) Citation-grounded SFT reduces the hallucination rate of encoder-decoder models to 0.0% from Stage 2 onward; (ii) progressive training prevents catastrophic forgetting and strengthens Hindi capabilities; (iii) smaller models match larger models on English after SFT; (iv) GRPO brings only marginal gains on structured citation tasks; (v) evaluation covers six metrics: BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate. Conclusion: Explicit citation modeling and progressive multi-stage training are key to improving the factuality, explainability, and multilingual ability of knowledge-grounded dialogue systems, and high-quality SFT is in most cases more effective than reinforcement-learning alignment. Abstract: Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).

[48] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Xinghao Zhao

Main category: cs.CL

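TL;DR: This paper proposes entropy-trajectory monotonicity as a low-cost predictor of answer correctness in chain-of-thought (CoT) reasoning: a chain is monotone if the entropy of the answer distribution strictly decreases at every reasoning step, and monotone chains are markedly more accurate. The property outperforms scalar measures such as total entropy reduction and token confidence at a far lower compute cost than self-consistency.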

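Details Motivation: Chain-of-thought (CoT) improves LLM reasoning accuracy, but there is still no cheap, discriminative way to identify failing chains in advance. Existing uncertainty measures (such as total entropy or token confidence) are of limited use or poorly calibrated, which motivates looking at structural features of how uncertainty evolves. Method: Proposes entropy-trajectory monotonicity: at each CoT step, a few completions are sampled, the entropy of the resulting answer distribution is computed, and the chain is checked for a strict decrease at every step. Experiments use GSM8K with Qwen2.5-7B-Instruct and Mistral-7B, comparing monotone and non-monotone chains on accuracy, odds ratio, Fisher's test p-value, and coverage, against baselines that include token log-probability confidence, total entropy change, and 40-chain self-consistency. Result: On Qwen2.5-7B, monotone chains reach 68.8% accuracy versus 46.8% for non-monotone chains, a gain of 21.9 percentage points (p=0.0005, OR=2.50); total entropy reduction is uncorrelated with accuracy (ρ=-0.06); monotonicity yields +5.8 pp at 73.7% coverage at a cost of about 1,500 tokens per question (1/8 of 40-chain self-consistency); the effect replicates more strongly on Mistral-7B (+34.7 pp, OR=4.33). Conclusion: Structural properties of how uncertainty evolves across reasoning steps (such as monotonicity) are more discriminative than its overall magnitude (such as total entropy change) or local scalar confidence; entropy-trajectory monotonicity is an efficient, scalable, model-agnostic reliability signal for CoT and offers a new paradigm for lightweight reasoning verification. Abstract: Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps--captured by sampling a few answer completions per step--predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive ($ρ$=-0.06, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186->0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approx 1,500 tokens/question--1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.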
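
The monotonicity check itself is simple enough to sketch. The version below assumes that, for each reasoning step, a handful of final answers has already been sampled from the model (the sampling is not shown), and it treats a tie between consecutive steps as a violation. That tie rule, the function names, and the use of base-2 entropy are assumptions rather than details confirmed by the abstract.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution of answers
    sampled at one reasoning step."""
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def violation_count(entropies):
    """Number of steps at which entropy fails to strictly decrease."""
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur >= prev)


def is_monotone(entropies):
    """A chain is monotone when entropy drops at every step."""
    return violation_count(entropies) == 0


# Invented samples: four sampled answers after each of three steps.
samples_per_step = [
    ["4", "5", "6", "7"],   # maximal disagreement
    ["4", "4", "5", "6"],   # partial agreement
    ["4", "4", "4", "4"],   # full agreement
]
trajectory = [answer_entropy(step) for step in samples_per_step]
monotone = is_monotone(trajectory)
```

The violation count corresponds to the 0/1/2 breakdown in the abstract, where accuracy falls as the number of non-decreasing steps grows. Note that the check uses only the ordering of consecutive entropies, not the size of the drops, which matches the paper's point that shape matters more than magnitude.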

[49] RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

Weronika Łajewska,Paul Missault,George Davidson,Saab Mansour

Main category: cs.CL

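TL;DR: This paper proposes RADIUS, a two-dimensional alignment evaluation suite for survey simulation that covers ranking alignment and distribution alignment and adds statistical significance testing, addressing the shortcomings of existing metrics in decision-relevant settings.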

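Details Motivation: Existing evaluation metrics for survey simulation are fragmented and non-standardized and overlook the critical dimension of ranking alignment, so a simulation can score high on accuracy while still failing to reflect the true ordering of human preferences, which undermines decision-making applications. Method: Proposes the RADIUS evaluation framework with two core dimensions: 1) ranking alignment, which measures how well the model's ordering of options agrees with the true human ordering; 2) distribution alignment, which assesses how well the response distribution is matched; both are complemented by statistical significance testing. Result: RADIUS exposes the limitations of existing metrics, makes survey-simulation evaluation more meaningful and comparable, and comes with an open-source implementation to support reproducible research. Conclusion: RADIUS provides a more comprehensive, rigorous, and standardized evaluation paradigm for survey simulation, especially for decision settings that depend on preference rankings. Abstract: Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.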

[50] Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Hangeol Chang,Changsun Lee,Seungjoon Rho,Junho Yeo,Jong Chul Ye

Main category: cs.CL

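TL;DR: This paper proposes Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that rewrites queries around a working hypothesis, shifting RAG from topic-oriented to evidence-oriented retrieval and improving accuracy on multiple-choice tasks.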

Details Motivation: Existing RAG methods rely on a single initial query that favors topical relevance over decision-relevant evidence, which makes it hard to discriminate the correct answer among multiple options. Method: HCQR first derives a lightweight working hypothesis from the question and the candidate answers, then rewrites retrieval into three targeted queries that seek evidence to (1) support the hypothesis, (2) distinguish it from competing options, and (3) verify salient clues in the question. Result: On MedQA and MMLU-Med, HCQR improves average accuracy over Simple RAG by 5.9 and 3.6 percentage points, respectively, and outperforms re-rank/filter baselines. Conclusion: Through evidence-oriented retrieval, HCQR markedly strengthens the decision-making ability of RAG on multiple-choice reasoning tasks and is an efficient, training-free improvement. Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.

[51] What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Gagan Bhatia,Ahmad Muhammad Isa,Maxime Peyrard,Wei Zhao

Main category: cs.CL

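TL;DR: This paper introduces MultiTempBench, a multilingual temporal reasoning benchmark covering three tasks, five languages, and multiple calendar conventions. It analyzes LLM temporal reasoning under different resource conditions and finds that tokenisation quality is a key bottleneck.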

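Details Motivation: Existing benchmarks lack multilingual, multi-calendar evaluation of temporal reasoning, so they cannot expose bottlenecks in low-resource languages and less common calendars. Method: Build the MultiTempBench benchmark (15,000 examples, 5 languages, 3 calendars), introduce the multilingual Date Fragmentation Ratio (mDFR) together with geometric probing, and analyze contributing factors with crossed mixed-effects regression. Result: Tokenisation quality of temporal expressions is a resource-dependent bottleneck: in low-resource languages and rarer calendars, date fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are more robust to digit-level splitting. Temporal linearity is the strongest predictor of reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Conclusion: Temporal reasoning performance depends heavily on how the tokeniser handles temporal symbols; low-resource languages and less common calendars in particular call for targeted improvements to tokenisation and internal temporal representations. Abstract: We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb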

[52] MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Chenyang Gu,Jiahao Cheng,Meicong Zhang,Pujun Zheng,Jinquan Zheng,Guoxiu He

Main category: cs.CL

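TL;DR: This paper proposes MoRI, a framework that strengthens LLM scientific ideation through motivation-grounded reasoning, significantly improving novelty, technical rigor, and feasibility.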

Details Motivation: Existing LLM-based agentic approaches imitate human research workflows but model scientific reasoning inadequately, so the generated ideas lack technical depth and scientific grounding. Method: MoRI first uses supervised fine-tuning to teach the base LLM to generate a research motivation from a given context, then trains it with a composite reinforcement learning reward: entropy-aware information gain (encouraging the model to uncover high-complexity technical details) and contrastive semantic gain (keeping the reasoning trajectory conceptually aligned with scientifically valid solutions). Result: MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across novelty, technical rigor, and feasibility. Conclusion: Explicitly modeling the reasoning from research motivation to methodology improves LLM performance on scientific ideation and offers a new paradigm for AI-assisted research. Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub: https://github.com/ECNU-Text-Computing/IdeaGeneration.

[53] Parallelograms Strike Back: LLMs Generate Better Analogies than People

Qiawen Ella Liu,Raja Marjieh,Jian-Qiao Zhu,Adele E. Goldberg,Thomas L. Griffiths

Main category: cs.CL

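TL;DR: This paper compares humans and large language models (LLMs) on four-term analogies (A:B::C:D). LLM-generated analogies align more closely with the classic parallelogram geometric model and are rated higher in quality; the human deficit stems mainly from a long tail of low-quality responses driven by high-frequency words, not from a failure of the parallelogram model itself.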

Details Motivation: Determine why the parallelogram model fails for four-term analogies: is the model itself a poor account of analogical relations, or do people struggle to produce analogies that satisfy the relational constraint? Method: Compare human and LLM completions on the same analogy problems from Peterson et al. (2020), and attribute the differences using human ratings, parallelogram alignment (vector geometry in GloVe embedding space), word frequency, and modal-response analysis. Result: LLM analogies are rated higher overall, sit closer to the parallelogram structure, and rely less on highly accessible high-frequency words. The advantage comes mainly from the long tail of weak human responses and disappears when only modal responses are compared, although parallelogram alignment and lower word frequency still predict which LLM completions are rated higher. Conclusion: The parallelogram model is not a poor account of analogical relations; humans often fail to satisfy the relational constraint, whereas LLMs do so more consistently. Abstract: Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from Peterson et al. (2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
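The parallelogram model referred to above treats A:B::C:D as the vector relation D ≈ C + (B − A). A minimal Python sketch of scoring candidate completions this way follows. The 2-D vectors are toy values chosen for illustration, not GloVe embeddings, and the helper names are hypothetical; the paper's exact alignment measure may differ.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def parallelogram_score(a, b, c, d):
    """Similarity between the predicted fourth corner c + (b - a) and d."""
    return cosine(c + (b - a), d)


# Toy 2-D "embeddings" (illustrative only, not real GloVe vectors).
emb = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "prince": np.array([3.0, -1.0]),
}


def best_completion(a, b, c, candidates):
    """Pick the candidate D that best closes the parallelogram A:B::C:D."""
    return max(
        candidates,
        key=lambda w: parallelogram_score(emb[a], emb[b], emb[c], emb[w]),
    )


winner = best_completion("man", "woman", "king", ["queen", "prince"])
```

With real embeddings the same score can be computed for each human or LLM completion, giving one way to quantify how closely a response follows the parallelogram structure.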

[54] A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Madeline Bittner,Dina Demner-Fushman,Yasmeen Shabazz,Davis Bartels,Dukyong Yoon,Brad Quitadamo,Rajiv Menghrajani,Leo Celi,Sarvesh Soni

Main category: cs.CL

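TL;DR: This paper introduces HEALIX, the first publicly available annotated health literacy dataset built from real clinical notes, and uses it to benchmark zero-shot and few-shot prompting strategies on four open-source LLMs.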

Details Motivation: Existing health literacy screening tools differ considerably in feasibility, number of items, question format, and the dimensions they capture, which makes consistent documentation in structured electronic health records difficult; automated detection from unstructured clinical notes is promising but has been held back by the lack of annotated resources. Method: Build HEALIX by combining social worker note sampling, keyword-based filtering, and LLM-based active learning to collect and annotate 589 notes across 9 note types, labeled with low, normal, or high health literacy; then benchmark zero-shot and few-shot prompting on four open-source LLMs. Result: HEALIX was built and released with 589 annotated clinical notes, and the prompting benchmark on four open-source LLMs demonstrates the dataset's utility. Conclusion: HEALIX fills a gap in annotated data for health literacy NLP research and provides a resource for automated health literacy assessment from clinical text. Abstract: Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open source large language models (LLMs).

[55] DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

Yilin Wang,Yuchun Fan,Jiaoyang Li,Ziming Zhu,Yongyu Mu,Qiaozhi He,Tong Xiao,Jingbo Zhu

Main category: cs.CL

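TL;DR: This paper proposes DaPT, which builds multilingual multi-hop QA benchmarks and applies a bilingual retrieval-and-answer strategy, significantly improving RAG performance on multilingual multi-hop question answering.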

Details Motivation: There are no benchmarks that assess RAG systems under the multilingual multi-hop (MM-hop) QA setting, and existing systems over-rely on LLMs' strong semantic understanding in English, which hurts effectiveness in multilingual scenarios. Method: First construct multilingual multi-hop QA benchmarks covering five languages; then propose DaPT, which generates sub-question graphs in parallel for the source-language query and its English translation, merges them, and solves the sub-questions sequentially with a bilingual retrieval-and-answer strategy. Result: Advanced RAG systems show a significant performance imbalance in multilingual scenarios; on the MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline. Conclusion: DaPT mitigates the performance drop caused by language differences in multilingual multi-hop QA and yields more accurate and concise answers. Abstract: Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.

[56] UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Zikang Ding,Junchi Yao,Junhao Li,Yi Zhang,Wenbo Jiang,Hongbo Liu,Lijie Hu

Main category: cs.CL

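TL;DR: This paper proposes UGID, which models the Transformer as a structured computational graph and debiases at the level of internal representations: the graph structure is forced to stay invariant across counterfactual inputs, with differences allowed only on sensitive attributes, while attention routing and hidden states are jointly constrained and model capabilities are preserved.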

Details Motivation: LLMs exhibit pronounced social biases; debiasing at the output level or through data optimization cannot fully remove them, because the biases are embedded in internal representations. Method: Propose Unified Graph Isomorphism for Debiasing (UGID), which models the Transformer as a computational graph (attention as edges, hidden states as nodes), targets invariance of the graph structure across counterfactual inputs, jointly constrains attention routing and hidden representations in bias-sensitive regions, and adds a log-space constraint on sensitive logits plus a selective anchor-based objective to preserve semantics. Result: Experiments on large language models show that UGID reduces both in-distribution and out-of-distribution bias, significantly reduces internal structural discrepancies, and preserves model safety and utility. Conclusion: UGID is an effective internal-representation-level debiasing framework that balances bias reduction with capability preservation. Abstract: Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization-based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation-level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.

[57] Optimal Splitting of Language Models from Mixtures to Specialized Domains

Skyler Seto,Pierre Ablin,Anastasiia Filippova,Jiayuan Ye,Louis Bethune,Angelos Katharopoulos,David Grangier

Main category: cs.CL

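TL;DR: This paper proposes a scaling-law-based method for allocating compute between pretraining and specialization when training multiple models, improving performance on common sense knowledge and reasoning tasks in multi-domain language model training.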

Details Motivation: Language models are typically trained in two stages (general pretraining, then domain specialization); in the multi-domain setting this means continued pretraining of a separate model per domain (split model training), and the compute allocation between the stages lacks principled guidance. Method: Propose a scaling-law-based method that pretrains multiple models independently on a general corpus, predicts the loss of a model of size N with D pretraining tokens and D' specialization tokens, and uses the prediction to optimize the compute split between pretraining and specialization. Result: The approach improves performance consistently on common sense knowledge and reasoning benchmarks across different model sizes and compute budgets. Conclusion: Scaling-law-based compute allocation gives a predictable and scalable recipe for efficient multi-domain language model training. Abstract: Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.

[58] VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Chonghan Liu,Yimin Du,Qi An,Xin He,Cunqi Zhai,Fei Tan,Weijia Lin,Xiaochun Gong,Yongchao Deng,Shousheng Jia,Xiangzheng Zhang

Main category: cs.CL

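TL;DR: This paper proposes Variable Entropy Policy Optimization (VEPO), which uses reinforcement learning with verifiable rewards to introduce deterministic structural constraints, improving translation quality and tokenization efficiency for low-resource languages.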

Details Motivation: LLMs perform poorly on low-resource languages, mainly because of inefficient subword segmentation and imbalanced training data. Method: Propose VEPO, which uses Reinforcement Learning with Verifiable Rewards to build deterministic structural constraints into policy alignment, employs a variable entropy mechanism to dynamically balance literal fidelity against semantic naturalness, and combines entropy-tempered advantage estimation with asymmetric clipping to prevent policy collapse. Result: Evaluations across 90 FLORES-200, COMET-22, and chrF directions show substantial improvements in tokenization efficiency and translation quality, narrowing the performance gap for low-resource languages. Conclusion: VEPO improves low-resource translation while enforcing format consistency, linguistic well-formedness, and sequence length, making it a robust and practical policy optimization framework. Abstract: Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
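The abstract does not give VEPO's objective, so the following Python sketch only illustrates the general idea of asymmetric clipping in a PPO-style surrogate: the probability ratio is clipped with a different margin below and above 1. The function name, the default margins `eps_low` and `eps_high`, and the omission of the entropy-tempered advantage are all assumptions made for illustration.

```python
import numpy as np


def asymmetric_clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """Per-sample PPO-style surrogate with separate lower and upper clip margins.

    ratio     : new_policy_prob / old_policy_prob for each sampled token
    advantage : advantage estimate for each sampled token
    The margins are hypothetical defaults, not values from the paper.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic bound: take the smaller of the unclipped and clipped terms.
    return np.minimum(ratio * advantage, clipped * advantage)


values = asymmetric_clipped_objective(
    ratio=[1.5, 0.5, 1.25],
    advantage=[1.0, -1.0, 1.0],
)
```

A larger upper margin lets low-probability tokens with positive advantage grow more before clipping takes effect, which is one common way to keep exploration alive; whether VEPO uses this exact form is not stated in the abstract.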

[59] Evaluating Counterfactual Strategic Reasoning in Large Language Models

Dimitrios Georgousis,Maria Lymperaiou,Angeliki Dimitriou,Giorgos Filandrianos,Giorgos Stamou

Main category: cs.CL

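TL;DR: This paper evaluates LLMs in repeated games, introducing counterfactual variants of Prisoner's Dilemma and Rock-Paper-Scissors that alter payoff structures and action labels to test whether the models reason strategically or rely on memorized patterns. The results show clear limitations in incentive sensitivity, structural generalization, and strategic reasoning in counterfactual environments.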

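Details Motivation: Test whether the strategic behavior LLMs show in game-theoretic settings reflects genuine reasoning or reliance on patterns memorized from training data. Method: Build counterfactual variants of two canonical games (Prisoner's Dilemma and Rock-Paper-Scissors) that alter payoff structures and action labels to break familiar symmetries and dominance relations, and compare model behavior in default and counterfactual settings with a multi-metric evaluation framework. Result: LLMs show limitations in incentive sensitivity, structural generalization, and strategic reasoning in counterfactual environments, indicating that their strategic behavior depends heavily on familiar patterns. Conclusion: Current LLM performance in repeated games stems mainly from pattern matching rather than genuine game-theoretic reasoning, and their understanding of incentive structures and abstract rules needs to generalize better. Abstract: We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.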
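To make "breaking dominance relations" concrete, here is a small Python sketch that checks for a strictly dominant action in a symmetric two-player game, first with textbook Prisoner's Dilemma payoffs and then with a counterfactual payoff table that keeps the same action labels. The payoff numbers and function name are illustrative assumptions; the paper's actual variants are not specified in the abstract.

```python
def strictly_dominant_action(payoffs, actions):
    """Return the action that is strictly best against every opponent action.

    payoffs[(mine, theirs)] is the row player's payoff. Returns None when
    no action strictly dominates all others.
    """
    for a in actions:
        if all(
            payoffs[(a, o)] > payoffs[(b, o)]
            for o in actions
            for b in actions
            if b != a
        ):
            return a
    return None


ACTIONS = ["cooperate", "defect"]

# Textbook Prisoner's Dilemma payoffs (illustrative values).
standard_pd = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5,
    ("defect", "defect"): 1,
}

# Hypothetical counterfactual: same labels, payoffs altered so that
# "cooperate" is now the dominant action.
counterfactual_pd = {
    ("cooperate", "cooperate"): 5,
    ("cooperate", "defect"): 1,
    ("defect", "cooperate"): 3,
    ("defect", "defect"): 0,
}

default_best = strictly_dominant_action(standard_pd, ACTIONS)
counterfactual_best = strictly_dominant_action(counterfactual_pd, ACTIONS)
```

A model that keeps choosing "defect" under the second table is following the familiar label rather than the incentives, which is the kind of failure the counterfactual setup is designed to expose.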

[60] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Zhuolin Yang,Zihan Liu,Yang Chen,Wenliang Dai,Boxin Wang,Sheng-Chieh Lin,Chankyu Lee,Yangyi Chen,Dongfu Jiang,Jiafan He,Renjie Pi,Grace Lam,Nayeon Lee,Alexander Bukharin,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping

Main category: cs.CL

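TL;DR: Nemotron-Cascade 2 is an open 30B MoE model (only 3B activated parameters) with frontier-level math and coding reasoning and strong agentic capabilities, reaching gold-medal-level performance on IMO, IOI, and ICPC with a very small parameter count. Technically, it broadens the coverage of Cascade RL and introduces multi-domain on-policy distillation.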

Details Motivation: Improve the reasoning and agentic capabilities of a small MoE model so that it approaches or matches much larger models under a tight parameter budget, and validate its intelligence density on hard competition tasks. Method: After supervised fine-tuning (SFT) on a carefully curated dataset, substantially expand Cascade RL to cover more reasoning and agentic domains; throughout the RL process, apply multi-domain on-policy distillation from the strongest intermediate teacher model for each domain to stabilize performance and recover regressions. Result: Nemotron-Cascade 2 achieves gold-medal-level performance on IMO, IOI, and the ICPC World Finals, the second open-weight model to do so after DeepSeekV3.2-Speciale-671B-A37B, with 20x fewer parameters; its math and coding reasoning approaches that of frontier open models. Conclusion: Expanding Cascade RL and adding multi-domain on-policy distillation can efficiently raise the advanced reasoning and agentic capabilities of an MoE model at a much smaller parameter count; model checkpoints and training data are released. Abstract: We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.

[61] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Ziyin Zhang,Zihan Liao,Hang Yu,Peng Di,Rui Wang

Main category: cs.CL

TL;DR: F2LLM-v2 is a new family of efficient, multilingual embedding models (80M–14B parameters), trained on 60M high-quality samples across 200+ languages—including mid/low-resource ones—using a two-stage LLM pipeline with matryoshka learning, pruning, and distillation; achieves SOTA on MTEB benchmarks and is fully open-sourced.

Details Motivation: To address the underrepresentation of mid- and low-resource languages in existing embedding models and improve efficiency without sacrificing performance. Method: Two-stage LLM-based embedding training integrated with matryoshka learning, model pruning, and knowledge distillation, trained on a newly curated 60M-sample multilingual dataset. Result: F2LLM-v2-14B ranks first on 11 MTEB benchmarks; smaller variants also achieve SOTA for resource-constrained settings. Conclusion: F2LLM-v2 establishes a new standard for efficient, scalable, and inclusive multilingual embedding models, with full open-source release to advance community research. Abstract: We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
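Matryoshka learning trains embeddings so that a leading prefix of the vector is itself a usable embedding. The Python sketch below shows only the inference-time side of that idea: truncating a vector to its first `dim` coordinates and re-normalizing it. The function name and the toy vector are hypothetical, and nothing here reflects F2LLM-v2's actual dimensions or API.

```python
import numpy as np


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = np.asarray(vec, dtype=float)[:dim]
    norm = np.linalg.norm(head)
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return head / norm


full = np.array([3.0, 4.0, 12.0])      # toy 3-D embedding
short = truncate_embedding(full, 2)    # 2-D prefix, unit norm
```

Shorter prefixes reduce storage and similarity-search cost, usually with some loss in retrieval quality; the trade-off depends on how the model was trained.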

cs.CV [Back]

[62] RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

X. Gao,C. Chien,G. Liu,A. Manullang

Main category: cs.CV

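TL;DR: This paper fine-tunes a Transformer-based model (Google Vision Transformer, ViT) for multi-label classification of capsule endoscopic videos (CEV) over 17 anatomical and pathological labels, but reports low mAP on the test set (mAP@0.5 = 0.0205).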

Details Motivation: Address the clinical need for multi-label classification of capsule endoscopic videos by taking part in the Gastro Competition and exploring how well ViT suits CEV analysis. Method: Fine-tune Google's Vision Transformer (ViT, given as batch16 at 224 x 224 resolution) end to end for multi-label classification of 17 anatomical and pathological classes. Result: On three test videos, the overall mAP@0.5 is 0.0205 and mAP@0.95 is 0.0196, both very low. Conclusion: The current results fall far short of practical use, which suggests that data annotation, class imbalance, the evaluation protocol, or model adaptation may need closer investigation. Abstract: This work is a submission to the Gastro Competition for multi-label classification from capsule endoscopic videos (CEV). A Transformer-based deep learning network is fine-tuned for this task. The baseline model is Google Vision Transformer (ViT) batch16 at 224 x 224 resolution. In total, 17 labels are classified: mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. On the test dataset of three videos, the overall mAP@0.5 is 0.0205, whereas the overall mAP@0.95 is 0.0196.

[63] S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

Naichuan Zheng,Hailun Xia,Zepeng Sun,Weiyi Li,Yujia Wang

Main category: cs.CV

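TL;DR: This paper proposes S3T-Former, described as the first purely spike-driven Transformer for energy-efficient skeleton action recognition. Using Multi-Stream Anatomical Spiking Embedding (M-ASE), Lateral Spiking Topology Routing (LSTR), and a Spiking State-Space (S3) Engine, it keeps sparsity high while addressing short-term amnesia, reducing theoretical energy consumption with highly competitive accuracy.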

Details Motivation: Existing SNN-based skeleton action recognition methods give up the intrinsic sparsity of SNNs by using dense matrix aggregation, heavy multimodal fusion, or non-sparse frequency-domain transforms, and they suffer from the short-term amnesia of spiking neurons, which hinders edge deployment. Method: Propose the Spiking State-Space Topology Transformer (S3T-Former), consisting of (1) Multi-Stream Anatomical Spiking Embedding (M-ASE), a generalized kinematic differential operator that produces heterogeneous sparse spike streams; (2) Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation; and (3) a Spiking State-Space (S3) Engine that models long-range temporal dynamics without non-sparse spectral operations. Result: On multiple large-scale datasets, S3T-Former reaches highly competitive accuracy with lower theoretical energy consumption than classic ANNs, setting a new state of the art for energy-efficient neuromorphic action recognition. Conclusion: S3T-Former is a purely spike-driven Transformer that is sparse in both topology and time and captures long-range temporal dynamics, offering a new approach to low-power skeleton action recognition on edge devices. Abstract: Skeleton-based action recognition is crucial for multimedia applications but heavily relies on power-hungry Artificial Neural Networks (ANNs), limiting their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency domain transformations. Furthermore, they severely suffer from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, elegantly transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds.
Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while theoretically reducing energy consumption compared to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.

[64] DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment

Wuqi Wang,Haochen Yang,Baolu Li,Jiaqi Sun,Xiangmo Zhao,Zhigang Xu,Qing Guo,Haigen Min,Tianyun Zhang,Hongkai Yu

Main category: cs.CV

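TL;DR: This paper presents DarkDriving, the first real-world day-night aligned benchmark dataset for low-light enhancement in autonomous driving. Using a trajectory-tracking pose matching method in a large closed test field, the authors collected 9,538 precisely aligned day-night image pairs with 2D bounding box annotations, supporting low-light enhancement and 2D/3D detection tasks.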

Details Motivation: Existing low-light enhancement datasets are mostly limited to small-range exposure control or static scenes, and precisely aligned day-night image pairs are extremely hard to collect in real driving scenes, which has severely limited research in this area. Method: Propose Trajectory Tracking based Pose Matching (TTPM) to automatically collect and precisely align day and night images in a 69-acre closed test field; manually annotate 2D bounding boxes; define four related perception tasks. Result: The DarkDriving dataset contains 9,538 day-night image pairs aligned to within several centimeters, and experiments validate its usefulness for low-light enhancement and detection as well as its generalization to other datasets such as nuScenes. Conclusion: DarkDriving is the first day-night aligned low-light enhancement benchmark for real dynamic driving scenes, providing a comprehensive evaluation platform for nighttime perception with good generalization. Abstract: Low-light conditions are challenging for vision-centric perception systems for autonomous driving in dark environments. In this paper, we propose a new benchmark dataset (named DarkDriving) to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets can only be collected by controlling exposure in small-range, static scenes. The dark images of current nighttime driving datasets do not have precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day and night aligned dataset in dynamic driving scenes has significantly limited research in this area. With a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day and night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset has 9,538 day and night image pairs precisely aligned in location and spatial contents, with an alignment error of only several centimeters. For each pair, we also manually label the 2D object bounding boxes. DarkDriving introduces four perception-related tasks: low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and 3D detection of autonomous driving in the dark environment.
The experimental results show that our DarkDriving dataset provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving and it can also be generalized to enhance dark images and promote detection in some other low-light driving environment, such as nuScenes.

[65] SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

Wei Tang,Xuejing Liu,Yanpeng Sun,Zechao Li

Main category: cs.CV

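TL;DR: This paper proposes SSP-SAM, which adds a Semantic-Spatial Prompt (SSP) encoder to give SAM natural language understanding, achieving precise and robust text-guided segmentation on Referring Expression Segmentation (RES) and Generalized RES (GRES).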

Details Motivation: SAM excels at general image segmentation but has limited ability to understand natural language, so it cannot be applied directly to Referring Expression Segmentation (RES). Method: Design a Semantic-Spatial Prompt (SSP) encoder with visual and linguistic attention adapters that highlight salient objects in the visual features and discriminative phrases in the linguistic features, producing high-quality SSPs that guide SAM toward language-driven segmentation. Result: The method achieves state-of-the-art performance on multiple RES and GRES benchmarks, with especially strong precision at strict IoU thresholds such as Pr@0.9, and shows improved open-vocabulary performance on PhraseCut. Conclusion: SSP-SAM naturally supports Generalized RES (zero, one, or multiple targets) without additional modifications and substantially improves SAM's language-guided segmentation. Abstract: The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM's segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: https://github.com/WayneTomas/SSP-SAM.
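For readers unfamiliar with Pr@0.9: in RES evaluation, Pr@X is usually the fraction of test samples whose predicted mask has IoU above X with the ground truth. The Python sketch below computes mask IoU and Pr@X on toy data. The function names are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here; GRES benchmarks handle no-target cases with their own rules.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks count as a match
    return float(np.logical_and(pred, target).sum() / union)


def precision_at(ious, threshold):
    """Fraction of samples whose IoU exceeds the threshold (Pr@threshold)."""
    ious = np.asarray(ious, dtype=float)
    return float((ious > threshold).mean())


toy_iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
pr_at_09 = precision_at([0.95, 0.85, 0.92, 0.50], 0.9)
```

Because only samples above the threshold count, Pr@0.9 rewards masks with accurate boundaries and is much harder to score well on than mean IoU.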

[66] CytoSyn: a Foundation Diffusion Model for Histopathology -- Tech Report

Thomas Duboudin,Xavier Fontaine,Etienne Andrier,Lionel Guillou,Alexandre Filiot,Thalyssa Baiocco-Rodrigues,Antoine Olivier,Alberto Romagnoni,John Klein,Jean-Baptiste Schiratti

Main category: cs.CV

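TL;DR: This paper presents CytoSyn, a foundation latent diffusion model for histopathology that generates highly realistic and diverse H&E-stained images. Methodological improvements, training set scaling, and sampling optimization lead to the improved CytoSyn-v2, which is compared in depth with PixCell. The model is trained on more than 10,000 TCGA whole-slide images, generalizes to inflammatory bowel disease images, and its weights and data are publicly released.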

Details Motivation: Computational pathology has many self-supervised feature extractors, but generative foundation models built specifically for histopathology remain scarce, which limits tasks beyond the reach of discriminative models, such as virtual staining. Method: Introduce CytoSyn, a foundation latent diffusion model; explore methodological improvements, training set scaling, sampling strategies, and slide-level overfitting; develop the improved CytoSyn-v2; compare in depth with PixCell and analyze how preprocessing details such as JPEG compression affect both models and metrics. Result: CytoSyn generates highly realistic and diverse H&E images in an extensive benchmark; although trained only on oncology slides, it maintains state-of-the-art performance when generating inflammatory bowel disease images; weights, training and validation datasets, and sample synthetic images are released. Conclusion: CytoSyn provides a high-performing, generalizable generative foundation model for histopathology that can support applications such as virtual staining, and the study highlights how strongly preprocessing details affect the evaluation of generative models. Abstract: Computational pathology has made significant progress in recent years, fueling advances in both fundamental disease understanding and clinically ready tools. This evolution is driven by the availability of large amounts of digitized slides and specialized deep learning methods and models. Multiple self-supervised foundation feature extractors have been developed, enabling downstream predictive applications from cell segmentation to tumor sub-typing and survival analysis. In contrast, generative foundation models designed specifically for histopathology remain scarce. Such models could address tasks that are beyond the capabilities of feature extractors, such as virtual staining. In this paper, we introduce CytoSyn, a state-of-the-art foundation latent diffusion model that enables the guided generation of highly realistic and diverse histopathology H&E-stained images, as shown in an extensive benchmark. We explored methodological improvements, training set scaling, sampling strategies and slide-level overfitting, culminating in the improved CytoSyn-v2, and compared our work to PixCell, a state-of-the-art model, in an in-depth manner. This comparison highlighted the strong sensitivity of both diffusion models and performance metrics to preprocessing-specific details such as JPEG compression. Our model has been trained on a dataset obtained from more than 10,000 TCGA diagnostic whole-slide images of 32 different cancer types. Despite being trained only on oncology slides, it maintains state-of-the-art performance generating inflammatory bowel disease images.
To support the research community, we publicly release CytoSyn's weights, its training and validation datasets, and a sample of synthetic images in this repository: https://huggingface.co/Owkin-Bioptimus/CytoSyn.

[67] Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

Chen Zhao,Zhuoran Wang,Haoyang Li,Shifeng Bao,Guanlin Li,Youhe Feng,Yang Li,Jie Tang,Jing Zhang

Main category: cs.CV

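TL;DR: The paper proposes Action-Draft-and-Verify (ADV), which combines the action-generation ability of a diffusion action expert with single-forward-pass reranking by a VLM, significantly improving VLA model performance on simulated and real-world tasks.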

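Details Motivation: Diffusion action experts are efficient and precise, but auto-regressive paradigms are more robust and generalize better in out-of-distribution environments; the strengths of both need to be combined. Method: In ADV, a diffusion action expert first drafts multiple candidate action chunks, and a vision-language model (VLM) then scores all candidates in a single forward pass with a perplexity-style metric and selects the best one. Result: Success rate improves by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, at the cost of a single-pass VLM reranking step. Conclusion: ADV effectively combines the strengths of the diffusion and auto-regressive paradigms, improving the performance and robustness of VLA models in simulated and real-world settings while remaining efficient. Abstract: Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, with a single-pass VLM reranking overhead.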
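
The draft-and-verify selection step can be illustrated with a minimal sketch. It assumes that each candidate action chunk has already been scored by the VLM as a list of per-token log-probabilities and that lower perplexity is better; the abstract only says "perplexity-style metric", so the exact scoring rule, the function names, and the numbers below are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one candidate from its per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_action_chunk(candidates):
    """Return the index of the lowest-perplexity candidate and all scores."""
    scores = [perplexity(lp) for lp in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical log-probabilities for three drafted action chunks.
candidates = [
    [-1.2, -0.9, -1.5],
    [-0.3, -0.4, -0.2],
    [-2.0, -1.8, -2.2],
]
best, scores = select_action_chunk(candidates)
```

In a real system all candidates would be scored in one batched forward pass; the loop here only makes the selection rule explicit.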

[68] One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Haoxiang Rao,Zhao Wang,Chenyang Si,Yan Lyu,Yuanyi Duan,Fang Zhao,Caifeng Shan

Main category: cs.CV

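TL;DR: The paper proposes O2MAG, a training-free few-shot industrial anomaly generation method that uses the self-attention of a single anomalous image to synthesize more realistic anomalies, improving generation quality and downstream detection through self-attention grafting, anomaly-mask guidance, Anomaly-Guided Optimization, and Dual-Attention Enhancement.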

Details Motivation: Anomalous samples are scarce in industrial anomaly detection, and existing few-shot anomaly synthesis methods are slow to train and struggle to reproduce the real anomaly distribution faithfully, which limits downstream detector performance. Method: O2MAG starts from a single anomalous image and uses three parallel diffusion processes with self-attention grafting, an anomaly mask to mitigate foreground-background confusion, Anomaly-Guided Optimization to align text prompts with true anomaly semantics, and Dual-Attention Enhancement to strengthen attention on masked regions. Result: Outperforms prior state-of-the-art methods on downstream anomaly detection tasks. Conclusion: O2MAG is an efficient, training-free few-shot synthesis method that generates high-fidelity anomalies and improves industrial anomaly detection. Abstract: Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

[69] Q-Drift: Quantization-Aware Drift Correction for Diffusion Model Sampling

Sooyoung Ryu,Mathieu Salzmann,Saqib Javed

Main category: cs.CV

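TL;DR: The paper proposes Q-Drift, a sampler-side drift correction that mitigates the accumulation of quantization noise in diffusion models under post-training quantization (PTQ) and improves generation quality.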

Details Motivation: Post-training quantization (PTQ) eases the deployment of large diffusion models, but quantization noise accumulates along the denoising trajectory and degrades generation quality. Method: Q-Drift models quantization error as an implicit stochastic perturbation on each denoising step and derives a drift adjustment that preserves the marginal distribution; it estimates a timestep-dependent variance statistic from a small number (as few as 5) of paired full-precision/quantized calibration runs, yielding a plug-and-play sampler correction. Result: Evaluated on six text-to-image models, three samplers, and two PTQ methods, Q-Drift lowers FID in most settings (by up to 4.59) while preserving CLIP scores. Conclusion: Q-Drift is a general, low-overhead, plug-and-play sampler-level correction that mitigates the quality degradation of diffusion models under PTQ. Abstract: Post-training quantization (PTQ) is a practical path to deploy large diffusion models, but quantization noise can accumulate over the denoising trajectory and degrade generation quality. We propose Q-Drift, a principled sampler-side correction that treats quantization error as an implicit stochastic perturbation on each denoising step and derives a marginal-distribution-preserving drift adjustment. Q-Drift estimates a timestep-wise variance statistic from calibration, in practice requiring as few as 5 paired full-precision/quantized calibration runs. The resulting sampler correction is plug-and-play with common samplers, diffusion models, and PTQ methods, while incurring negligible overhead at inference. Across six diverse text-to-image models (spanning DiT and U-Net), three samplers (Euler, flow-matching, DPM-Solver++), and two PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID over the corresponding quantized baseline in most settings, with up to 4.59 FID reduction on PixArt-Sigma (SVDQuant W3A4), while preserving CLIP scores.

[70] Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

Mohammed Rahman Sherif Khan Mohammad,Ardhendu Behera,Sandip Pradhan,Swagat Kumar,Amr Ahmed

Main category: cs.CV

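TL;DR: The paper proposes a training-only heterogeneous graph teacher framework (TOGA) that improves Tip-Adapter's few-shot performance through multi-scale graph modeling and a Modality-aware Graph Transformer, with no added inference cost.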

Details Motivation: Existing adapter-based CLIP tuning methods (e.g., Tip-Adapter) rely on global uni-modal features and overlook fine-grained relations among image patches and their structural alignment with class text. Method: Builds a high-capacity Heterogeneous Graph Teacher used only during training, which integrates multi-scale visual patches and text prompts into a unified graph, performs deep cross-modal reasoning with a Modality-aware Graph Transformer (MGT), and extracts high-fidelity class features through discriminative node filtering; a cache-aware dual-objective strategy distills this graph-structured knowledge into Tip-Adapter's key-value cache. Result: Consistently reaches state-of-the-art results on standard 1-16-shot benchmarks; ablations show that heterogeneous graph supervision, text-guided reasoning, and node filtering are the key components. Conclusion: Without modifying the lightweight adapter or adding inference cost, introducing a structured graph teacher at training time alone markedly improves few-shot generalization, supporting the value of explicitly modeling cross-modal structural relations. Abstract: Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.

[71] From Concepts to Judgments: Interpretable Image Aesthetic Assessment

Xiao-Chang Liu,Johan Wagemans

Main category: cs.CV

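TL;DR: The paper proposes an interpretable image aesthetic assessment (IAA) framework grounded in human-understandable aesthetic concepts; by learning a subspace of high-level aesthetic concepts and adding a residual predictor, it delivers transparent, interpretable aesthetic judgments while keeping predictive performance competitive.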

Details Motivation: Existing IAA models predict well but lack interpretability, whereas humans rely on high-level cues when judging image aesthetics; this calls for an interpretable framework built on human-understandable aesthetic concepts. Method: The framework first learns high-level aesthetic concepts in an accessible manner and builds a concept subspace that forms the basis of an inherently interpretable model; it then adds a simple yet effective residual predictor to capture subtle aesthetic influences beyond the explicit concepts. Result: Experiments on photographic and artistic datasets show competitive predictive performance together with transparent, human-understandable aesthetic judgments. Conclusion: The framework substantially improves the interpretability of IAA models without giving up performance, offering users an effective way to understand the basis of an aesthetic assessment. Abstract: Image aesthetic assessment (IAA) aims to predict the aesthetic quality of images as perceived by humans. While recent IAA models achieve strong predictive performance, they offer little insight into the factors driving their predictions. Yet for users, understanding why an image is considered pleasing or not is as valuable as the score itself, motivating growing interest in interpretability within IAA. When humans evaluate aesthetics, they naturally rely on high-level cues to justify their judgments. Motivated by this observation, we propose an interpretable IAA framework grounded in human-understandable aesthetic concepts. We learn these concepts in an accessible manner, constructing a subspace that forms the foundation of an inherently interpretable model. To capture nuanced influences on aesthetic perception beyond explicit concepts, we introduce a simple yet effective residual predictor. Experiments on photographic and artistic datasets demonstrate that our method achieves competitive predictive performance while offering transparent, human-understandable aesthetic judgments.

[72] Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong,Zuyan Liu,Shulin Tian,Yongming Rao,Ziwei Liu

Main category: cs.CV

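TL;DR: The paper proposes Insight-V++, a unified multi-agent visual reasoning framework that automatically synthesizes high-quality long-chain reasoning data, uses a dual-agent architecture (reasoning plus summary), and introduces the new algorithms ST-GRPO and J-GRPO, significantly improving multimodal LLMs on complex image and video reasoning tasks.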

Details Motivation: Existing MLLMs lack high-quality long-chain visual reasoning data and suitable training paradigms, which makes it hard to obtain the test-time reasoning gains seen in LLMs. Method: Builds a multi-granularity automatic data generation pipeline; designs a cooperative dual-agent architecture (a reasoning agent and a summary agent); proposes two new reinforcement learning algorithms, ST-GRPO and J-GRPO, to replace DPO and support long-horizon spatial-temporal reasoning; and introduces an iterative, self-improving training loop driven by feedback from the summary agent. Result: On base models such as LLaVA-NeXT and Qwen2.5-VL, achieves significant gains on challenging image and video reasoning benchmarks while preserving performance on traditional perception tasks. Conclusion: Insight-V++ shows that multi-agent cooperation and a self-improving training paradigm are effective for strengthening long-chain visual reasoning in MLLMs, offering a new route toward general multimodal models with deeper understanding. Abstract: Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness.
Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.

[73] VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

Mohammad Qazim Bhat,Yufan Huang,Niket Agarwal,Hao Wang,Michael Woods,John Kenyon,Tsung-Yi Lin,Xiaodong Yang,Ming-Yu Liu,Kevin Xie

Main category: cs.CV

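TL;DR: The paper proposes VLM-AutoDrive, a framework that post-trains pretrained vision-language models with multi-source supervision (metadata-derived captions, LLM-generated descriptions, VQA pairs, and CoT reasoning), significantly improving both the performance and the interpretability of collision and near-collision detection in dashcam video.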

Details Motivation: General-purpose multimodal large models perform poorly at detecting sparse, brief safety-critical events (such as collisions) in driving scenes because of domain and temporal misalignment. Method: Proposes VLM-AutoDrive, a modular post-training framework that combines metadata-derived captions, LLM-generated descriptions, visual question answering pairs, and chain-of-thought supervision to achieve domain alignment and interpretable learning. Result: On real-world Nexar dashcam data, raises the Collision F1 of the Cosmos-Reason1 7B model from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%, while producing interpretable reasoning traces. Conclusion: VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks, bridging the gap between perception, causality, and decision reasoning. Abstract: The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

[74] MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles

Alexander Rasch,Rahul Rajendra Pai

Main category: cs.CV

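TL;DR: The paper introduces MicroVision, an open image dataset designed for detecting vulnerable road users (VRUs) and stationary micromobility vehicles (MMVs), with the aim of supporting traffic safety and planning.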

Details Motivation: Existing open image datasets lack focus and diversity in vulnerable road users (VRUs) and micromobility vehicles (MMVs): for example, they label both pedestrians and e-scooter riders as "person", and they lack new MMVs (such as e-scooters) and data from the VRU perspective (sidewalks, cycle paths). Method: Builds the MicroVision dataset of more than 8,000 anonymized high-resolution images with more than 30,000 carefully annotated VRUs and MMVs, recorded in Gothenburg, Sweden, over an entire year and covering almost 2,000 unique interaction scenes; trains benchmark object-detection models based on state-of-the-art architectures. Result: The models reach a mean average precision (mAP) of up to 0.723 on an unseen test set. Conclusion: The MicroVision dataset and its benchmark models help distinguish different VRUs and MMVs to improve traffic safety, or help monitoring systems identify micromobility use; the dataset and model weights are publicly released. Abstract: Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images -- a computer-vision task relying heavily on the quality of the images to train on. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as "person", or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and part of almost 2,000 unique interaction scenes. Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.723 on an unseen test set.
The dataset and model can support traffic safety to distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility. The dataset and model weights can be accessed at https://doi.org/10.71870/eepz-jd52.

[75] Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting

Guillem Casadesus Vila,Adam Dai,Grace Gao

Main category: cs.CV

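TL;DR: The paper proposes a real-time 3D Gaussian Splatting (3DGS) mapping framework that combines semantic segmentation with dense depth estimation for challenging lunar conditions such as low texture and high-contrast lighting, achieving a geometric height accuracy of about 3 cm over a 120-meter traverse and outperforming a traditional point cloud baseline without LiDAR.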

Details Motivation: Lunar surface navigation and mapping face low texture, high-contrast lighting, and limited computational resources, and therefore need robust, efficient, lightweight perception and mapping. Method: Generates synthetic data with the LuPNT simulator and, after benchmarking several models, selects a stereo dense depth estimation model based on Gated Recurrent Units (GRU) and a CNN semantic segmentation model; uses ground-truth poses to decouple local scene understanding from global state estimation and builds a 3DGS map. Result: Achieves a geometric height accuracy of about 3 cm over a 120-meter lunar traverse, outperforming a traditional point cloud baseline without LiDAR; the resulting 3DGS map supports novel view synthesis and can serve as the basis of a full SLAM system. Conclusion: Combining semantic segmentation and dense depth estimation with learned map representations such as 3DGS is an effective way to build detailed, large-scale lunar maps in support of future lunar missions. Abstract: Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high-contrast lighting, and limited computational resources. This paper presents a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120-meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large-scale maps to support future lunar surface missions.

[76] LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression

Tamer Shanableh

Main category: cs.CV

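TL;DR: The paper proposes LRConv-NeRV, which replaces selected dense 3×3 convolutional layers in the NeRV decoder with structured low-rank separable convolutions, substantially improving compute and storage efficiency while preserving video reconstruction quality and temporal coherence.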

Details Motivation: NeRV's convolutional decoder is computationally expensive and memory intensive, which makes deployment in resource-constrained environments difficult. Method: LRConv-NeRV applies a structured low-rank separable factorization to selected 3×3 convolutional layers of the NeRV decoder, progressively from the largest decoder stage toward earlier ones; it supports end-to-end training and INT8 post-training quantization. Result: Applying LRConv only to the final decoder stage cuts GFLOPs by 68% (201.9 to 64.9) and model size by 9.3%, with almost unchanged PSNR/MS-SSIM and about 9.2% lower bitrate; under INT8 quantization, quality stays close to the original NeRV; LPIPS analysis shows temporal coherence close to the NeRV baseline. Conclusion: LRConv-NeRV is an efficient, low-precision-friendly neural video decoding architecture with a better efficiency-quality trade-off than existing work, suited to resource-constrained settings. Abstract: Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability.
Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.
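
To give a feel for where the savings come from, the sketch below counts weights for a dense 3x3 convolution and for one possible low-rank separable replacement, and re-derives the 68% figure from the GFLOPs quoted in the abstract. The specific factorization (a 3x1 convolution into `rank` channels followed by a 1x3 convolution), the channel widths, and the rank are assumptions made for illustration; the abstract does not state the exact structure used by LRConv-NeRV.

```python
def dense_conv_params(c_in, c_out, k=3):
    """Weights in a dense k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_lowrank_params(c_in, c_out, rank, k=3):
    """Weights in a k x 1 conv (c_in -> rank) followed by a 1 x k conv (rank -> c_out)."""
    return k * c_in * rank + k * rank * c_out


def relative_reduction(before, after):
    """Fraction saved when going from `before` to `after`."""
    return (before - after) / before


# Hypothetical channel widths and rank, chosen only for illustration.
dense = dense_conv_params(96, 96)
low_rank = separable_lowrank_params(96, 96, rank=24)

# Decoder complexity figures quoted in the abstract (GFLOPs).
gflops_cut = relative_reduction(201.9, 64.9)
```

With these illustrative sizes the factorized layer keeps one sixth of the dense layer's weights, and the quoted GFLOPs correspond to a reduction of roughly 68%.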

[77] CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

Marios Krestenitis,Christos Tzelepis,Konstantinos Ioannidis,Steafanos Vrochidis,Ioannis Kompatsiaris,Georgios Tzimiropoulos,Shaogang Gong,Ioannis Patras

Main category: cs.CV

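TL;DR: The paper proposes CycleCap, which uses image-text cycle consistency (image to text to image) as a self-supervised signal and fine-tunes vision-language models (VLMs) with Group Relative Policy Optimization (GRPO); using raw images alone, it improves caption accuracy and reduces hallucination without human-annotated data.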

Details Motivation: Vision-language models (VLMs) still suffer from vision-language misalignment in tasks such as image captioning and tend to produce overly generic or hallucinated descriptions; existing remedies depend on large annotated datasets or complex test-time refinement frameworks, which are costly and scale poorly. Method: CycleCap uses the VLM as the image-to-text module and a pre-trained text-to-image model as the text-to-image module, forming an image-text cycle; it fine-tunes with GRPO, using the similarity between the original and reconstructed images as an online reward; no annotated image-text pairs are needed, so the optimization is self-supervised. Result: Across four VLMs ranging from 1B to 7B parameters, CycleCap consistently improves captioning quality and hallucination suppression, surpassing state-of-the-art methods that rely on supervised cycle-consistency training. Conclusion: Cycle consistency can serve directly as a strong self-supervised signal for VLM fine-tuning; CycleCap removes the dependence on annotated data and improves the accuracy and grounding of descriptions, offering a low-cost route to high-quality captioning. Abstract: Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions.
Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
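
The reward-and-advantage step of such a cycle can be sketched as follows. The sketch assumes that the original image and each reconstruction have already been mapped to embedding vectors, uses cosine similarity as the reward, and applies the usual GRPO group normalization (reward minus group mean, divided by group standard deviation). The abstract does not name the similarity measure, so the cosine choice and all vectors below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled captions."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical embeddings: one original image and three reconstructions,
# each reconstruction produced from a different sampled caption.
original = [1.0, 0.0, 1.0]
reconstructions = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

rewards = [cosine_similarity(original, r) for r in reconstructions]
advantages = group_relative_advantages(rewards)
```

Captions whose reconstruction lands closer to the original image receive a positive advantage and are reinforced; the policy-gradient update itself is omitted here.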

[78] Fast and Generalizable NeRF Architecture Selection for Satellite Scene Reconstruction

Devjyoti Chakraborty,Zaki Sukma,Rakandhiya D. Rachmanto,Kriti Ghosh,In Kee Kim,Suchendra M. Bhandarkar,Lakshmish Ramaswamy,Nancy K. O'Hare,Deepak Mishra

Main category: cs.CV

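TL;DR: The paper proposes PreSCAN, a framework that predicts NeRF reconstruction quality before training from lightweight geometric and photometric descriptors, enabling fast architecture selection (< 30 seconds), a 1000× speedup over NAS, and lower power and latency on edge devices.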

Details Motivation: Deploying NeRF on satellite imagery requires per-scene training, and NAS is slow (hours to days); SHAP analysis shows that multi-view consistency matters more than network architecture. Method: Guided by the SHAP analysis, designs the PreSCAN predictive framework, which estimates NeRF reconstruction quality from lightweight geometric and photometric descriptors and combines the predictions with offline cost profiling for edge deployment. Result: PreSCAN selects a suitable architecture in < 30 seconds with < 1 dB prediction error, a 1000× speedup over NAS; on Jetson Orin it reduces inference power by 26% and latency by 43%, and it generalizes across scenes on the DFC2019 datasets without retraining. Conclusion: Reconstruction quality depends mainly on multi-view consistency rather than model architecture; PreSCAN provides an efficient, deployable approach to NeRF quality prediction and architecture selection, making NeRF for satellite imagery considerably more practical. Abstract: Neural Radiance Fields (NeRF) have emerged as a powerful approach for photorealistic 3D reconstruction from multi-view images. However, deploying NeRF for satellite imagery remains challenging. Each scene requires individual training, and optimizing architectures via Neural Architecture Search (NAS) demands hours to days of GPU time. While existing approaches focus on architectural improvements, our SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Based on this insight, we develop PreSCAN, a predictive framework that estimates NeRF quality prior to training using lightweight geometric and photometric descriptors. PreSCAN selects suitable architectures in < 30 seconds with < 1 dB prediction error, achieving 1000$\times$ speedup over NAS. We further demonstrate PreSCAN's deployment utility on edge platforms (Jetson Orin), where combining its predictions with offline cost profiling reduces inference power by 26% and latency by 43% with minimal quality loss. Experiments on DFC2019 datasets confirm that PreSCAN generalizes across diverse satellite scenes without retraining.

[79] Unrolled Reconstruction with Integrated Super-Resolution for Accelerated 3D LGE MRI

Md Hasibul Husain Hisham,Shireen Elhabian,Ganesh Adluru,Jason Mendes,Andrew Arai,Eugene Kholmovski,Ravi Ranjan,Edward DiBella

Main category: cs.CV

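TL;DR: The paper proposes embedding a super-resolution network (EDSR) in a model-based unrolled reconstruction framework for accelerated 3D late gadolinium enhancement (LGE) MRI, improving image quality and left atrium segmentation.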

Details Motivation: Accelerated 3D LGE MRI must recover thin atrial structures at high resolution while reconstructing robustly from undersampled k-space data, and existing unrolled networks are limited in recovering high-frequency detail. Method: Proposes a hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator in each iteration, jointly optimizing super-resolution enhancement and data consistency; the model is trained end-to-end on retrospectively undersampled preclinical 3D LGE data. Result: Across acceleration factors, the method improves PSNR and SSIM over standard unrolled reconstruction and is compared against compressed sensing, MoDL, and self-guided DIP baselines; it preserves fine cardiac structures more accurately and improves left atrium (LA) segmentation. Conclusion: Embedding super-resolution priors directly within a model-based reconstruction framework brings measurable gains for accelerated 3D LGE MRI. Abstract: Accelerated 3D late gadolinium enhancement (LGE) MRI requires robust reconstruction methods to recover thin atrial structures from undersampled k-space data. While unrolled model-based networks effectively integrate physics-driven data consistency with learned priors, they operate at the acquired resolution and may fail to fully recover high-frequency detail. We propose a hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator within each iteration of the optimization loop, enabling joint super-resolution enhancement and data consistency enforcement. The model is trained end-to-end on retrospectively undersampled preclinical 3D LGE datasets and compared against compressed sensing, Model-Based Deep Learning (MoDL), and self-guided Deep Image Prior (DIP) baselines. Across acceleration factors, the proposed method consistently improves PSNR and SSIM over standard unrolled reconstruction and better preserves fine cardiac structures, leading to improved LA (left atrium) segmentation performance. These results demonstrate that integrating super-resolution priors directly within model-based reconstruction provides measurable gains in accelerated 3D LGE MRI.

[80] VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection

Bo-Cheng Qiu,Yu-Fan Lin,Yu-Zhe Pien,Chia-Ming Lee,Fu-En Yang,Yu-Chiang Frank Wang,Chih-Chung Hsu

Main category: cs.CV

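TL;DR: The paper addresses the RARE-VISION task of capsule endoscopy event detection by combining EndoFM-LV and DINOv3 ViT-L/16 backbones with a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding, reaching temporal mAP@0.5 = 0.3530 and temporal mAP@0.95 = 0.3235 on the hidden test set.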

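Details Motivation: In capsule endoscopy, diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, and evaluation is at the event level rather than by frame accuracy, so conventional frame classification is a poor fit. Method: Builds a dual-backbone framework (EndoFM-LV for local temporal context, DINOv3 ViT-L/16 for strong frame-level semantics) combined with a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion (class-wise and backbone weighting plus probability calibration), and Anatomy-Aware Temporal Event Decoding (temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation). Result: On the official hidden test set, temporal mAP@0.5 reaches 0.3530 and temporal mAP@0.95 reaches 0.3235; ablations indicate that backbone complementarity, validation-guided fusion, and anatomy-aware decoding all contribute to event-level performance. Conclusion: Treating detection as a metric-aligned event-level task and combining multi-source temporal and semantic information with domain priors such as anatomy improves the robustness and accuracy of detecting sparse, heterogeneous findings in capsule endoscopy. Abstract: Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.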
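
A much-simplified view of the decoding stage, covering only temporal smoothing, thresholding, and event generation for a single label, might look like the sketch below. The moving-average window, the threshold of 0.5, and the frame probabilities are assumptions made for illustration; the anatomical constraints and threshold refinement described in the abstract are left out.

```python
def moving_average(probs, window=3):
    """Centered moving average that shrinks the window at the sequence ends."""
    half = window // 2
    smoothed = []
    for i in range(len(probs)):
        lo, hi = max(0, i - half), min(len(probs), i + half + 1)
        smoothed.append(sum(probs[lo:hi]) / (hi - lo))
    return smoothed


def decode_events(probs, threshold=0.5, window=3):
    """Turn per-frame probabilities into (start, end) frame-index events."""
    smoothed = moving_average(probs, window)
    events, start = [], None
    for i, p in enumerate(smoothed):
        if p >= threshold and start is None:
            start = i                      # an event opens
        elif p < threshold and start is not None:
            events.append((start, i - 1))  # the event closes on the previous frame
            start = None
    if start is not None:
        events.append((start, len(smoothed) - 1))
    return events


# Hypothetical per-frame probabilities for one label.
frame_probs = [0.1, 0.2, 0.9, 0.8, 0.9, 0.1, 0.0, 0.0, 0.7, 0.1]
events = decode_events(frame_probs)
```

Here the sustained run of high probabilities becomes one event, while the isolated spike near the end is smoothed below the threshold and dropped, which is the kind of stability that event-level metrics reward.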

[81] To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Rui Hong,Shuxue Quan

Main category: cs.CV

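TL;DR: The paper proposes a three-layer diagnostic framework for determining whether vision-language models (VLMs) truly rely on visual information or exploit language shortcuts. It finds that most samples exhibit "visual sycophancy": models perceive visual anomalies yet still hallucinate to meet user expectations, and alignment training suppresses honest acknowledgment of uncertainty. Larger models reduce language shortcuts but amplify visual sycophancy. The diagnostic scores also enable a post-hoc selective prediction strategy with no additional training cost that raises accuracy.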

Details Motivation: To examine whether VLMs that answer correctly genuinely rely on visual information rather than language shortcuts or statistical bias, and to uncover the sources of their hallucinations. Method: Proposes a three-layer diagnostic framework comprising Latent Anomaly Detection, a Visual Necessity Score (based on KL divergence), and a Competition Score, combined with counterfactual interventions (blind, noise, and conflict images), evaluated systematically on 7 VLMs and 7,000 samples. Result: 69.6% of samples exhibit visual sycophancy (the model detects the anomaly yet hallucinates to satisfy the user), and no sample shows Robust Refusal; as Qwen2.5-VL scales from 7B to 72B, language shortcuts decrease but visual sycophancy increases; the diagnostic scores support post-hoc selective prediction with no training cost, improving accuracy by up to 9.5 percentage points at 50% coverage. Conclusion: Current VLMs show a widespread conflict between visual grounding and instruction following, and alignment training has unintentionally weakened their ability to express uncertainty honestly; scaling alone does not solve the grounding problem; the diagnostic framework offers a new approach to interpretable evaluation and trustworthy inference. Abstract: When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy--models detect visual anomalies but hallucinate to satisfy user expectations--while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.
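
One way to read a KL-based visual dependency measure is sketched below: compare the model's answer distribution when it sees the real image with its distribution when it sees a blind (uninformative) image. This particular pairing, the direction of the divergence, and the example distributions are assumptions; the abstract only states that the Visual Necessity Score is measured via KL divergence.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer options."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical answer distributions over three options.
with_image = [0.7, 0.2, 0.1]
blind_same = [0.7, 0.2, 0.1]    # answer unchanged without the image
blind_shift = [0.2, 0.3, 0.5]   # answer changes once the image is removed

score_shortcut = kl_divergence(with_image, blind_same)
score_visual = kl_divergence(with_image, blind_shift)
```

A score near zero suggests the answer did not depend on the image, which is consistent with a language shortcut; a larger score suggests the visual input actually shaped the answer.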

[82] Pixel-Accurate Epipolar Guided Matching

Oleksii Nasypanyi,Francois Rameau

Main category: cs.CV

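TL;DR: The paper proposes an exact keypoint matching method in angular space: each keypoint is assigned a tolerance circle that is converted into a 1D angular interval query, and a segment tree enables efficient, pixel-accurate matching under the epipolar constraint.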

Details Motivation: Existing epipolar-guided keypoint matching methods rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and can miss valid matches. Method: Assigns each keypoint a tolerance circle that, viewed from the epipole, maps to an angular interval; formulates matching as a 1D angular interval query and solves it in O(log n) time with a segment tree. Result: Noticeably faster than existing approaches on ETH3D while guaranteeing pixel-level tolerance and recovering exact correspondence sets. Conclusion: The method overcomes the shortcomings of spatial binning and delivers efficient, accurate epipolar-constrained matching with per-keypoint control. Abstract: Keypoint matching can be slow and unreliable in challenging conditions such as repetitive textures or wide-baseline views. In such cases, known geometric relations (e.g., the fundamental matrix) can be used to restrict potential correspondences to a narrow epipolar envelope, thereby reducing the search space and improving robustness. These epipolar-guided matching approaches have proved effective in tasks such as SfM; however, most rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and may miss valid correspondences. We address these limitations with an exact formulation that performs candidate selection directly in angular space. In our approach, each keypoint is assigned a tolerance circle which, when viewed from the epipole, defines an angular interval. Matching then becomes a 1D angular interval query, solved efficiently in logarithmic time with a segment tree. This guarantees pixel-level tolerance, supports per-keypoint control, and removes unnecessary descriptor comparisons. Extensive evaluation on ETH3D demonstrates noticeable speedups over existing approaches while recovering exact correspondence sets.
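
The geometric core of this idea, turning a tolerance circle into an angular interval as seen from the epipole, can be sketched with basic trigonometry. The sketch assumes 2D pixel coordinates, treats epipolar lines as undirected (angles taken modulo pi), and uses hypothetical function names; the segment tree that answers the interval queries in logarithmic time is not shown.

```python
import math


def angular_interval(keypoint, epipole, radius):
    """Return (center_angle, half_width) of the circle as seen from the epipole."""
    dx, dy = keypoint[0] - epipole[0], keypoint[1] - epipole[1]
    dist = math.hypot(dx, dy)
    if dist <= radius:
        # The epipole lies inside the circle: every epipolar line crosses it.
        return 0.0, math.pi / 2
    center = math.atan2(dy, dx) % math.pi
    return center, math.asin(radius / dist)


def line_in_interval(line_angle, center, half_width):
    """True if an epipolar line with this direction passes through the circle."""
    diff = (line_angle - center) % math.pi
    diff = min(diff, math.pi - diff)  # undirected lines: wrap around pi
    return diff <= half_width


# Hypothetical keypoint 10 px from the epipole with a 1 px tolerance.
center, half_width = angular_interval((10.0, 0.0), (0.0, 0.0), 1.0)
```

A line through the epipole hits the circle exactly when its angular offset from the keypoint direction is at most asin(radius / distance), which is why the tolerance stays pixel-accurate instead of depending on a bin size.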

[83] Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning

Yonghan Lee,Dinesh Manocha

Main category: cs.CV

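TL;DR: Inst4DGS is an instance-decomposed 4D Gaussian Splatting method that learns cross-video instance matching through a differentiable Sinkhorn layer and introduces motion scaffolds to make long-horizon trajectory optimization more efficient, reaching SOTA rendering and instance segmentation performance.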

Details Motivation: Dynamic 4D Gaussian Splatting has advanced rapidly, but its instance-decomposed variant remains underexplored because inconsistent instance labels across multi-view videos are hard to associate. Method: Per-video label-permutation latents are combined with a differentiable Sinkhorn layer to match instances across videos; instance-decomposed motion scaffolds provide low-dimensional motion bases per object to support long-horizon trajectory optimization. Result: On Panoptic Studio and Neural3DV, Inst4DGS supports tracking and instance decomposition jointly and reaches SOTA rendering (PSNR) and instance segmentation (mIoU) quality: on Panoptic Studio, PSNR improves from 26.10 to 28.36 and instance mIoU from 0.6310 to 0.9129. Conclusion: Inst4DGS addresses the cross-view identity consistency problem in instance-decomposed 4DGS, combining high-quality rendering with accurate instance segmentation and offering a new paradigm for dynamic scene modeling. Abstract: We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
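For readers unfamiliar with the Sinkhorn layer mentioned above, the following is a minimal sketch of plain Sinkhorn normalization, which turns a square score matrix into an approximately doubly stochastic "soft permutation". It is a generic textbook version in pure Python, not the authors' implementation; the score matrix, temperature, and iteration count are illustrative assumptions.

```python
import math


def sinkhorn(scores, n_iters=200, temperature=1.0):
    """Alternate row and column normalisation of exp(scores / temperature).

    `scores` is a square list of lists; the result has rows and columns
    that each sum to (approximately) one.
    """
    n = len(scores)
    # Subtract the global maximum before exponentiating for stability.
    top = max(max(row) for row in scores)
    p = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalisation: each local label distributes unit mass.
        row_sums = [sum(row) for row in p]
        p = [[p[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Column normalisation: each global identity receives unit mass.
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p


# Hypothetical affinities between 3 local labels (rows) of one video and
# 3 shared identities (columns).
soft_perm = sinkhorn([[2.0, 0.5, 0.0], [0.0, 0.5, 2.5], [1.0, 3.0, 0.5]])
hard_match = [row.index(max(row)) for row in soft_perm]
```

Since every step is a smooth operation, gradients can flow through the normalisation, which is what makes a layer of this kind usable for learning label permutations end to end.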

[84] Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning?

Yang Liu,Jiyao Yang,Hongjin Zhao,Xiaoyong Li,Yanzhe Ji,Xingjian Li,Runmin Jiang,Tianyang Wang,Saeed Anwar,Dongwoo Kim,Yue Yao,Zhenyue Qin,Min Xu

Main category: cs.CV

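TL;DR: This paper builds DermCase, a long-context multimodal benchmark derived from real dermatology case reports, to evaluate the clinical reasoning of large vision-language models (LVLMs) on rare skin diseases, and proposes DermLIP-based similarity metrics that align better with expert judgment; experiments show that current LVLMs have significant deficiencies in diagnostic accuracy, differential diagnosis, and clinical reasoning, and that instruction tuning helps while DPO yields limited gains.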

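Details Motivation: Existing dermatology benchmarks focus on common diseases and measure only final diagnostic accuracy, neglecting rare diseases and the clinical reasoning process, so they fail to reflect model reliability in realistic, complex cases. Method: Build the DermCase multimodal long-context benchmark (26,030 image-text pairs, 6,354 challenging cases) with detailed clinical information and step-by-step reasoning chains; propose DermLIP-based similarity metrics to assess differential diagnosis quality; systematically evaluate 22 leading LVLMs and run instruction-tuning and DPO fine-tuning experiments together with an error analysis. Result: All 22 LVLMs perform poorly on diagnostic accuracy, differential diagnosis, and clinical reasoning; instruction tuning improves performance substantially, whereas DPO yields almost no gain; the error analysis reveals serious weaknesses in key reasoning steps such as pathological logic and multi-evidence integration. Conclusion: Current LVLMs do not yet possess reliable clinical reasoning ability in dermatology; training and evaluation need to place more emphasis on modeling the reasoning process and on high-quality long-context reasoning data. Abstract: Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models' reasoning capabilities.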

[85] SynQ: Accurate Zero-shot Quantization by Synthesis-aware Fine-tuning

Minjun Kim,Jongjin Kim,U Kang

Main category: cs.CV

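TL;DR: This paper proposes SynQ, a framework that reduces noise in synthetic data with a low-pass filter, improves accuracy by aligning class activation maps, and avoids misguidance by using only soft labels for difficult samples, achieving state-of-the-art zero-shot quantization (ZSQ) performance.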

Details Motivation: To address three challenges that arise in zero-shot quantization (ZSQ) when no real training data is available: noise in the synthetic data, predictions based on off-target patterns, and misguidance by erroneous hard labels. Method: The SynQ framework 1) suppresses noise in the synthetic data with a low-pass filter; 2) fine-tunes the quantized model by aligning its class activation map with that of the pre-trained model; 3) uses only the pre-trained model's soft labels for difficult samples, avoiding misguidance from wrong hard labels. Result: SynQ clearly outperforms existing ZSQ methods on multiple benchmarks and reaches state-of-the-art quantization accuracy. Conclusion: SynQ mitigates key weaknesses of zero-shot quantization and offers a more reliable, more accurate route to lightweight model deployment in privacy-sensitive settings. Abstract: How can we accurately quantize a pre-trained model without any data? Quantization algorithms are widely used for deploying neural networks on resource-constrained edge devices. Zero-shot Quantization (ZSQ) addresses the crucial and practical scenario where training data are inaccessible for privacy or security reasons. However, three significant challenges hinder the performance of existing ZSQ methods: 1) noise in the synthetic dataset, 2) predictions based on off-target patterns, and 3) misguidance by erroneous hard labels. In this paper, we propose SynQ (Synthesis-aware Fine-tuning for Zero-shot Quantization), a carefully designed ZSQ framework to overcome the limitations of existing methods. SynQ minimizes the noise from the generated samples by exploiting a low-pass filter. Then, SynQ trains the quantized model to improve accuracy by aligning its class activation map with the pre-trained model. Furthermore, SynQ mitigates misguidance from the pre-trained model's error by leveraging only soft labels for difficult samples. Extensive experiments show that SynQ provides state-of-the-art accuracy over existing ZSQ methods.
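The abstract does not say which low-pass filter SynQ applies to the generated samples. Purely as an illustration of the idea, the sketch below smooths a single-channel image with a 3x3 binomial (Gaussian-like) kernel, a common low-pass filter; the kernel choice, the edge handling, and the function name are assumptions rather than details from the paper.

```python
def low_pass_3x3(image):
    """Smooth a 2D image (list of rows of floats) with a 3x3 binomial kernel.

    Border pixels are handled by clamping coordinates to the image edge.
    """
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    acc += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc / 16.0
    return out


# A single bright pixel (high-frequency content) is spread out and damped.
noisy = [[0.0] * 5 for _ in range(5)]
noisy[2][2] = 16.0
smoothed = low_pass_3x3(noisy)
```

Flat regions pass through unchanged while isolated spikes are attenuated, which is the property that makes low-pass filtering a plausible way to reduce high-frequency noise in synthetic images.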

[86] R&D: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

Huy Che,Dinh-Duy Phan,Duc-Khai Lam

Main category: cs.CV

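TL;DR: This paper proposes a new synthetic data augmentation method for pixel-level semantic segmentation based on controllable diffusion models; class-aware prompting and visual prior blending improve image quality and label alignment, and results on benchmarks such as PASCAL VOC and BDD100K confirm its effectiveness in data-scarce scenarios.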

Details Motivation: Building and annotating datasets for pixel-level semantic segmentation is expensive; traditional augmentation cannot generate new structures, and existing generative models struggle to keep generated images consistent with the original images and their segmentation labels. Method: A synthetic data augmentation pipeline that integrates controllable diffusion models and combines class-aware prompting with visual prior blending to raise image quality and ensure precise alignment with segmentation labels. Result: Semantic segmentation performance improves significantly on benchmarks such as PASCAL VOC and BDD100K, especially in data-scarce scenarios, and model robustness in real-world settings increases. Conclusion: The method narrows the gap between synthetic and real data, improving reliability while preserving diversity, and offers a new paradigm for high-quality, consistent augmentation data for semantic segmentation. Abstract: Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances the diversity and reliability of the generated data, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to improve image quality further, ensuring precise alignment with segmentation labels. By evaluating on benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance.

[87] AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Yibo Shi,Jungang Li,Linghao Zhang,Zihao Dongfang,Biao Wu,Sicheng Tao,Yibo Yan,Chenxi Qin,Weiting Liu,Zhixin Lin,Hanqian Li,Yu Huang,Song Dai,Yonghua Hei,Yue Ding,Xiang Li,Shikang Wang,Chengdong Xu,Jingqi Liu,Xueying Ma,Zhiwen Zheng,Xiaofei Zhang,Bincheng Wang,Nichen Yang,Jie Wu,Lihua Tian,Chen Li,Xuming Hu

Main category: cs.CV

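TL;DR: This paper presents AndroTMem, a framework comprising the diagnostic benchmark AndroTMem-Bench and a new memory mechanism, ASM, aimed at the interaction-memory bottleneck in long-horizon Android GUI agents; experiments show that ASM, by anchoring key intermediate states, significantly improves task completion rate (TCR) and action matching rate (AMS).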

Details Motivation: Existing GUI agents suffer from interaction-memory failures in long-horizon tasks: replaying the full sequence is redundant and introduces noise, while summary-based memory tends to lose dependency-critical information and traceability. Method: Build the AndroTMem-Bench benchmark (1,069 tasks, 34,473 interaction steps) and propose Anchored State Memory (ASM), which is based on causally linked intermediate-state anchors and supports subgoal-targeted retrieval and attribution-aware decision making. Result: Across 12 GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%; the study also confirms that long-horizon performance degradation is driven mainly by within-task memory failures rather than perception errors or local action mistakes. Conclusion: Structured, anchored interaction memory is an effective way to relieve the memory bottleneck in long-horizon GUI tasks, and AndroTMem provides a reproducible diagnostic benchmark and a practical solution for this direction. Abstract: Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. 
Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at [https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem).

[88] SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

Leyuan Fang,Zan Mao,Zijing Wang,Yinlong Yan

Main category: cs.CV

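TL;DR: This paper proposes SR-Nav, a framework that uses a Dynamic Spatial Relationship Graph (DSRG) to model structured priors between objects and regions, and improves the robustness and efficiency of zero-shot object-goal navigation under weak observations through relation-aware matching and dynamic relationship planning.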

Details Motivation: Existing foundation-model-based zero-shot object-goal navigation methods reason unreliably when viewpoints are poor or semantic cues are weak, whereas the inherent spatial relationships between objects and regions provide structured scene priors that help locate targets under partial observation. Method: The SR-Nav framework 1) builds a Dynamic Spatial Relationship Graph (DSRG) that fuses foundation-model priors with real-time observations; 2) adds a Relation-aware Matching Module that replaces naive detection with relationship matching to correct perception errors; 3) introduces a Dynamic Relationship Planning Module that computes optimal paths from the DSRG on the fly, shrinking the search space. Result: State-of-the-art success rate and navigation efficiency on the HM3D dataset. Conclusion: Explicitly modeling and exploiting spatial relationship priors markedly improves perception robustness and planning efficiency for zero-shot navigation under challenging conditions, and suggests a new direction for foundation-model-based embodied intelligence. Abstract: Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models' comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy. Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. 
The code will be publicly available at https://github.com/Mzyw-1314/SR-Nav

Arushi Rai,Adriana Kovashka

Main category: cs.CV

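TL;DR: This paper proposes a self-consistency objective that needs no additional frame-level annotation: by constraining related tasks (such as generation and verification) to attend to the same key frames, it improves the temporal grounding of video LLMs on sports coaching tasks and surpasses supervised fine-tuning and closed-source models on several benchmarks.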

Details Motivation: Video large language models (Video-LLMs) often attend to irrelevant frames in tasks that require precise temporal grounding, such as sports coaching, and frame-level supervision is costly to obtain and unreliable. Method: Building on the observation that closely related tasks such as generation and verification should attend to the same frames, the authors train with a self-consistency objective defined over visual attention maps; VidDiffBench is used to validate the problem and evaluate the effect. Result: On three sports coaching tasks (Exact, FitnessQA, and ExpertAF), the method gains +3.0% and +14.1% accuracy and +0.9 BERTScore over supervised fine-tuning, and surpasses closed-source models. Conclusion: A self-consistency attention constraint that requires no extra frame-level annotation effectively reduces temporal grounding errors in Video-LLMs and significantly improves performance on sports coaching tasks. Abstract: Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.

[90] Interpretable Prostate Cancer Detection using a Small Cohort of MRI Images

Vahid Monfared,Mohammad Hadi Gharib,Ali Sabri,Maryam Shahali,Farid Rashidi,Amit Mehta,Reza Rawassizadeh

Main category: cs.CV

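TL;DR: This paper proposes an interpretable framework for automatic prostate cancer detection from a small set of T2-weighted MRI images, using transfer learning and data augmentation to offset data scarcity; comparing ViT, Swin, ResNet18, and classical methods (HOG+SVM and others) on 162 images, it finds that the lightweight ResNet18 performs best (90.9% accuracy, 95.2% sensitivity, AUC 0.905) while HOG+SVM is also strong (AUC 0.917); using T2 images alone, the approach is competitive with biparametric MRI methods and, in a reader study, shows markedly higher sensitivity than radiologists (95.2% vs. 67.5%).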

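Details Motivation: Prostate cancer is a leading cause of death in men, but lesions on T2-weighted MRI are subtle and heterogeneous, making manual interpretation difficult; existing AI methods mostly depend on large cohorts and biparametric MRI (T2+DWI), which makes clinical deployment complex and costly, so there is a need for an accurate, interpretable, low-compute detection method that uses only single-modality T2 images and works with small datasets. Method: Using transfer learning and data augmentation, the authors train and systematically evaluate several models on a small dataset of only 162 T2-weighted images (102 cancer, 60 normal): Vision Transformers (ViT, Swin), a CNN (ResNet18), and classical machine learning methods (Logistic Regression, SVM, HOG+SVM); all models are optimized for binary classification (cancer vs. normal) and assessed with cross-validation and independent testing; a 22-case reader study with five radiologists provides a clinical comparison. Result: Transfer-learned ResNet18 performs best, with 90.9% accuracy, 95.2% sensitivity, and AUC 0.905 using only 11M parameters; ViT and Swin have far more parameters but lower performance; HOG+SVM reaches AUC 0.917, showing that handcrafted features remain competitive on small datasets; the AI model's sensitivity (95.2%) is well above the radiologists' mean (67.5%, Fleiss Kappa = 0.524). Conclusion: On small T2-weighted MRI datasets, a lightweight CNN such as ResNet18 with transfer learning outperforms more complex ViT architectures, and handcrafted-feature methods (HOG+SVM) are also of practical value; the work shows that high-sensitivity cancer detection is achievable with single-modality T2 images alone, which could lower scanning and computational burden, improve screening consistency, and reduce missed cancers. Abstract: Prostate cancer is a leading cause of mortality in men, yet interpretation of T2-weighted prostate MRI remains challenging due to subtle and heterogeneous lesions. We developed an interpretable framework for automatic cancer detection using a small dataset of 162 T2-weighted images (102 cancer, 60 normal), addressing data scarcity through transfer learning and augmentation. We performed a comprehensive comparison of Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Transfer-learned ResNet18 achieved the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters, while Vision Transformers showed lower performance despite substantially higher complexity. Notably, HOG+SVM achieved comparable accuracy (AUC 0.917), highlighting the effectiveness of handcrafted features in small datasets. Unlike state-of-the-art approaches relying on biparametric MRI (T2+DWI) and large cohorts, our method achieves competitive performance using only T2-weighted images, reducing acquisition complexity and computational cost. 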
In a reader study of 22 cases, five radiologists achieved a mean sensitivity of 67.5% (Fleiss Kappa = 0.524), compared to 95.2% for the AI model, suggesting potential for AI-assisted screening to reduce missed cancers and improve consistency. Code and data are publicly available.

[91] Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

Kazuya Nishimura,Ryoma Bise,Shinnosuke Matsuo,Haruka Hirose,Yasuhiro Kojima

Main category: cs.CV

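TL;DR: This paper proposes CPNN, a cell-type prototype-informed neural network that estimates cell-type prototypes from single-cell RNA sequencing data and learns cell-composition weights from pathology images, yielding more accurate and more interpretable gene expression predictions.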

Details Motivation: Existing methods treat gene expression as a purely slide-level or spot-level signal and ignore the fact that it arises from the aggregation of cell-level expression, so they lack cell-resolved biological guidance. Method: The Cell-type Prototype-informed Neural Network (CPNN) first estimates stable, robust cell-type prototypes (mean expression profiles) from public single-cell RNA-seq data, then learns cell-type compositional weights directly from pathology images and models the relationship between the prototypes and the observed bulk or spatial expression. Result: On three slide-level datasets and three patch-level spatial transcriptomics datasets, CPNN achieves the highest Spearman correlation in every setting; visualizing the inferred compositional weights provides biological interpretability. Conclusion: By introducing cell-type prototypes as a biological prior, CPNN delivers more accurate, structured, and interpretable gene expression prediction, bridging single-cell biology and histology-based expression estimation. Abstract: Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes, i.e., mean expression profiles that reflect stable gene-gene co-variation patterns. CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression. Code is publicly available at https://github.com/naivete5656/CPNN.
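The abstract says CPNN relates cell-type prototypes to the observed expression through image-derived compositional weights, without giving the exact functional form. The simplest reading, sketched below, is a convex combination: softmax-normalized weights multiplied by the prototype matrix. The linear mixing rule, the function names, and the toy numbers are all assumptions for illustration and may differ from the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prototypes(logits, prototypes):
    """Predict expression as a weighted average of cell-type prototypes.

    logits:     one score per cell type (imagined output of an image encoder).
    prototypes: one mean expression profile (list of genes) per cell type.
    Returns (weights, predicted_expression).
    """
    weights = softmax(logits)
    n_genes = len(prototypes[0])
    expression = [
        sum(w * proto[g] for w, proto in zip(weights, prototypes))
        for g in range(n_genes)
    ]
    return weights, expression


# Two hypothetical cell types, three genes; equal logits give equal weights.
weights, predicted = mix_prototypes(
    [0.0, 0.0],
    [[2.0, 0.0, 4.0], [0.0, 6.0, 2.0]],
)
```

Under this assumed form, the weights are directly readable as estimated cell-type proportions, which is consistent with the interpretability the abstract attributes to the compositional weights.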

[92] MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling

Jiyao Liu,Junzhi Ning,Wanying Qu,Lihao Liu,Chenglong Ma,Junjun He,Ningsheng Xu

Main category: cs.CV

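TL;DR: This paper proposes MedQ-UNI, a model that follows an "assess-then-restore" paradigm, using medical image quality assessment (Med-IQA) to guide medical image restoration (Med-IR) across modalities and degradation types, with clear gains in generalization and interpretability.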

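Details Motivation: Existing medical image restoration methods are usually limited to a specific modality or degradation type and cannot cope with the diverse, heterogeneous degradations found in clinical practice; the authors attribute this to the separation of restoration from quality assessment, which leaves models without an explicit understanding of image quality. Method: MedQ-UNI is a unified vision-language model with a multimodal autoregressive dual-expert architecture (shared attention): a quality assessment expert produces structured natural-language descriptions that identify degradation issues, and a restoration expert conditions on these descriptions to perform targeted reconstruction; the authors also build a large-scale multimodal, multi-task dataset of about 50K paired samples and a 2K-sample evaluation benchmark. Result: A single MedQ-UNI model, without task-specific adaptation, reaches SOTA performance across all three modalities and five restoration tasks while generating better quality descriptions, confirming that explicit quality understanding improves restoration fidelity and interpretability. Conclusion: Integrating medical image quality assessment with restoration in an assess-then-restore paradigm is an effective route to better generalization, robustness, and interpretability, and MedQ-UNI offers a new paradigm for general-purpose medical image restoration. Abstract: Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to diverse degradation types across modalities. To address these challenges, we propose MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, explicitly leveraging Med-IQA to guide Med-IR across arbitrary modalities and degradation types. MedQ-UNI adopts a multimodal autoregressive dual-expert architecture with shared attention: a quality assessment expert first identifies degradation issues through structured natural language descriptions, and a restoration expert then conditions on these descriptions to perform targeted image restoration. To support this paradigm, we construct a large-scale dataset of approximately 50K paired samples spanning three imaging modalities and five restoration tasks, each annotated with structured quality descriptions for joint Med-IQA and Med-IR training, along with a 2K-sample benchmark for evaluation. 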
Extensive experiments demonstrate that a single MedQ-UNI model, without any task-specific adaptation, achieves state-of-the-art restoration performance across all tasks while generating superior descriptions, confirming that explicit quality understanding meaningfully improves restoration fidelity and interpretability.

[93] Recolour What Matters: Region-Aware Colour Editing via Token-Level Diffusion

Yuqi Yang,Dongliang Chang,Yijia Ling,Ruoyi Du,Zhanyu Ma

Main category: cs.CV

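TL;DR: ColourCrafter is a diffusion-based framework for fine-grained, region-aware colour editing that fuses RGB colour tokens with image tokens in latent space and adds a perceptual Lab-space loss, substantially improving the accuracy and controllability of colour edits.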

Details Motivation: Existing text-based colour editing methods cannot accurately express continuous chromatic variation, so edits drift from the target hue, particularly for fine local edits. Method: The ColourCrafter framework 1) performs token-level fusion of RGB colour tokens and image tokens in latent space for region-aware colour propagation; 2) introduces a perceptual Lab-space loss that decouples luminance from chrominance and constrains edits to masked regions; 3) builds ColourfulSet, a large-scale dataset with continuous colour variations. Result: SOTA performance on fine-grained colour editing, with clear improvements in colour accuracy, controllability, and perceptual fidelity. Conclusion: ColourCrafter turns global tone transfer into a structured, region-aware generation process, addressing the limits of earlier methods in continuous chromatic control and local editing precision. Abstract: Colour is one of the most perceptually salient yet least controllable attributes in image generation. Although recent diffusion models can modify object colours from user instructions, their results often deviate from the intended hue, especially for fine-grained and local edits. Early text-driven methods rely on discrete language descriptions that cannot accurately represent continuous chromatic variations. To overcome this limitation, we propose ColourCrafter, a unified diffusion framework that transforms colour editing from global tone transfer into a structured, region-aware generation process. Unlike traditional colour-driven methods, ColourCrafter performs token-level fusion of RGB colour tokens and image tokens in latent space, selectively propagating colour information to semantically relevant regions while preserving structural fidelity. A perceptual Lab-space loss further enhances pixel-level precision by decoupling luminance and chrominance and constraining edits within masked areas. Additionally, we build ColourfulSet, a large-scale dataset of high-quality image pairs with continuous and diverse colour variations. Extensive experiments demonstrate that ColourCrafter achieves state-of-the-art colour accuracy, controllability and perceptual fidelity in fine-grained colour editing. Our project is available at https://yangyuqi317.github.io/ColourCrafter.github.io/.

[94] Do Vision Language Models Understand Human Engagement in Games?

Ziyi Wang,Qizan Guo,Rishitosh Singh,Xiyang Hu

Main category: cs.CV

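TL;DR: This paper evaluates how well vision-language models (VLMs) infer player engagement from gameplay video, finding that zero-shot prediction is weak, theory-guided prompting helps little, memory- or retrieval-augmented prompting gives some improvement on pointwise prediction, and pairwise change prediction remains difficult, which points to a "perception-understanding" gap in current VLMs.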

Details Motivation: To examine whether vision-language models can reliably infer a latent human psychological state in games (engagement) from visual cues alone, in support of game design and player-experience research. Method: On the GameVibe Few-Shot dataset, which covers nine first-person shooter games, three VLMs are evaluated under six prompting strategies (zero-shot, theory-guided prompts based on Flow, GameFlow, Self-Determination Theory, and the MDA framework, and retrieval-augmented prompting) on two tasks: pointwise engagement prediction and pairwise prediction of engagement change between consecutive time windows. Result: Zero-shot VLM predictions are generally weaker than each game's majority-class baseline; memory- or retrieval-augmented prompting improves pointwise prediction in some settings; pairwise prediction remains difficult under every strategy; theory-guided prompting does not improve performance reliably and sometimes reinforces surface-level shortcuts. Conclusion: Current VLMs can recognize visible gameplay cues but remain clearly limited in robustly inferring human engagement across games, reflecting a fundamental gap between perception and understanding. Abstract: Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision-language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception-understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.

[95] T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

Aditi Naiknaware,Salimeh Sekeh

Main category: cs.CV

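TL;DR: This paper proposes T-QPM, a new multimodal OOD detection framework for dynamic environments that models cross-modal consistency, learns lightweight temporal fusion weights, and adds ATC regularization, significantly improving robustness under temporal drift and covariate shift.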

Details Motivation: Existing OOD detection methods built on VLMs such as CLIP rely on fixed fusion rules and assume static environments, so they cope poorly with temporal drift and covariate shift. Method: A two-step Temporal Quadruple-Pattern Matching (T-QPM) framework: 1) use image-text pairing to build cross-modal consistency patterns between ID and OOD signals; 2) learn lightweight temporal fusion weights that combine semantic matching with visual typicality, with explicit regularization based on Average Thresholded Confidence (ATC) to ensure stability. Result: Significantly better than static baselines on temporally partitioned benchmarks, with stronger temporal consistency and robustness to distribution drift. Conclusion: T-QPM provides a robust, adaptable solution for multimodal OOD detection in the non-stationary open world and mitigates the challenges posed by temporal drift and covariate shift. Abstract: Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally-consistent framework for multimodal OOD detection in non-stationary environments.

[96] TexEditor: Structure-Preserving Text-Driven Texture Editing

Bo Zhao,Yihang Liu,Chenfeng Zhang,Huan Yang,Kun Gai,Wei Ji

Main category: cs.CV

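TL;DR: This paper proposes TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509; by building the high-quality synthetic dataset TexBlender and introducing StructureNFT, a structure-preserving reinforcement learning method, it markedly improves geometric structure consistency in text-guided texture editing, and it releases a new benchmark, TexBench, to better evaluate real-world performance.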

Details Motivation: Existing SOTA text-guided texture editing models often fail to keep geometric structure consistent, even though the intended change concerns appearance only. Method: 1) Build TexBlender, a high-quality Blender-based supervised fine-tuning (SFT) dataset; 2) propose StructureNFT, a reinforcement-learning-based method that transfers the structural priors learned during SFT to real-world scenes; 3) release TexBench, a new benchmark oriented toward real-world images. Result: TexEditor clearly outperforms strong baselines such as Nano Banana Pro on Blender-based benchmarks and on TexBench, and shows good generalization on the general-purpose image editing benchmark ImgEdit. Conclusion: Strengthening structure preservation jointly from the data and training perspectives is key to better texture editing quality, and TexEditor offers a more robust and practical solution for the task. Abstract: Text-guided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct TexBlender, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Secondly, we introduce StructureNFT, an RL-based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general-purpose benchmark ImgEdit to validate its generalization. Our code and data are available at https://github.com/KlingAIResearch/TexEditor.

[97] FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction

Seonghyun Jin,Jong Chul Ye

Main category: cs.CV

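TL;DR: This paper proposes FILT3R, a training-free latent filtering layer that models the state update in streaming 3D reconstruction as stochastic state estimation in token space; it estimates process noise online and computes a Kalman gain adaptively, improving long-horizon stability.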

Details Motivation: In streaming 3D reconstruction, existing state update strategies (aggressive overwriting or conservative updating) become unstable beyond the training horizon and struggle to balance retained history against new observations. Method: FILT3R maintains a per-token variance and computes a Kalman-style gain; process noise is estimated online from EMA-normalized temporal drift, giving an adaptive trade-off between memory retention and new evidence. Result: FILT3R clearly improves long-horizon stability on depth, pose, and 3D reconstruction tasks; its gain is interpretable, shrinking in stable regimes and rising when the scene genuinely changes, and it reduces to common overwrite and gating policies as special cases. Conclusion: FILT3R is a general, plug-in, training-free filtering layer that offers a more robust, interpretable, and generalizable state update rule for streaming 3D reconstruction. Abstract: Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant-memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per-token variance and computes a Kalman-style gain that adaptively balances memory retention against new observations. Process noise, which governs how much the latent state is expected to change between frames, is estimated online from EMA-normalized temporal drift of candidate tokens. Using extensive experiments, we demonstrate that FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction, compared to the existing methods. Code will be released at https://github.com/jinotter3/FILT3R.
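The per-token update described above follows the shape of a standard scalar Kalman filter. The sketch below shows one such step for a single token, assuming a scalar variance shared across the token's dimensions and externally supplied noise values; the function name and signature are hypothetical, and FILT3R's online estimation of process noise from EMA-normalized drift is not reproduced here.

```python
def kalman_token_update(mean, var, obs, process_noise, obs_noise):
    """One Kalman-style step for a single token.

    mean, obs:      lists of floats (stored token state and candidate token).
    var:            scalar uncertainty of the stored token.
    process_noise:  expected state change between frames (Q).
    obs_noise:      uncertainty of the new candidate (R).
    Returns (new_mean, new_var, gain).
    """
    pred_var = var + process_noise  # predict: uncertainty grows over time
    denom = pred_var + obs_noise
    gain = pred_var / denom if denom > 0 else 0.0
    # Blend: gain = 1 overwrites the state, gain = 0 keeps it untouched.
    new_mean = [m + gain * (z - m) for m, z in zip(mean, obs)]
    new_var = (1.0 - gain) * pred_var  # update: evidence shrinks uncertainty
    return new_mean, new_var, gain


# With no process noise the gain decays as evidence accumulates.
state, var, gain_1 = kalman_token_update([0.0, 0.0], 1.0, [10.0, -10.0], 0.0, 1.0)
state, var, gain_2 = kalman_token_update(state, var, [10.0, -10.0], 0.0, 1.0)
```

The two limiting cases correspond to the policies the abstract mentions: zero observation noise gives a gain of one (a pure overwrite), while zero state uncertainty and zero process noise give a gain of zero (the state is frozen). Raising the process noise pushes the gain back up, which mirrors the reported behaviour under genuine scene change.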

[98] NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data

Daniel DeTone,Federica Bogo,Eric-Tuan Le,Duncan Frost,Julian Straub,Yawar Siddiqui,Yuting Ye,Jakob Engel,Richard Newcombe,Lingni Ma

Main category: cs.CV

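TL;DR: This paper introduces NymeriaPlus, an upgraded version of the Nymeria dataset that improves human motion modeling, adds dense 3D/2D annotations, provides instance-level 3D object reconstructions, and adds new modalities (such as audio and wristband videos), forming a stronger in-the-wild egocentric benchmark for multimodal learning research in embodied AI.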

Details Motivation: Existing egocentric datasets fall short in modality richness, annotation granularity, and human motion accuracy, which limits what embodied AI can learn about complex real-world scenes. Method: Starting from the original Nymeria dataset, the authors upgrade the human motion representation (MHR and SMPL formats), add dense 3D/2D bounding box annotations for indoor objects and structural elements, generate instance-level 3D object reconstructions, and incorporate new modalities such as basemap recordings, audio, and wristband videos, producing the unified NymeriaPlus benchmark. Result: NymeriaPlus is a stronger, more comprehensive in-the-wild egocentric dataset covering high-fidelity human motion, fine-grained scene understanding annotations, and diverse synchronized modalities. Conclusion: NymeriaPlus fills a key gap in current egocentric resources and provides a solid data foundation and new directions for multimodal embodied AI research. Abstract: The Nymeria Dataset, released in 2024, is a large-scale collection of in-the-wild human activities captured with multiple egocentric wearable devices that are spatially localized and temporally synchronized. It provides body-motion ground truth recorded with a motion-capture suit, device trajectories, semi-dense 3D point clouds, and in-context narrations. In this paper, we upgrade Nymeria and introduce NymeriaPlus. NymeriaPlus features: (1) improved human motion in Momentum Human Rig (MHR) and SMPL formats; (2) dense 3D and 2D bounding box annotations for indoor objects and structural elements; (3) instance-level 3D object reconstructions; and (4) additional modalities, e.g., basemap recordings, audio, and wristband videos. By consolidating these complementary modalities and annotations into a single, coherent benchmark, NymeriaPlus strengthens Nymeria into a more powerful in-the-wild egocentric dataset. We expect NymeriaPlus to bridge a key gap in existing egocentric resources and to support a broader range of research, including unique explorations of multimodal learning for embodied AI.

[99] Efficient Video Diffusion with Sparse Information Transmission for Video Compression

Mingde Zhou,Zheng Chen,Yulun Zhang

Main category: cs.CV

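TL;DR: This paper proposes Diff-SIT, an efficient video compression method that combines sparse temporal encoding with a diffusion model guided by adaptive frame-type embeddings, markedly improving perceptual quality and temporal consistency at ultra-low bitrates.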

Details Motivation: Traditional end-to-end video compression produces blurry reconstructions with poor perceptual quality at ultra-low bitrates; existing generative methods often ignore inter-frame temporal correlation, leading to temporal incoherence and low efficiency. Method: The Diff-SIT framework comprises a Sparse Temporal Encoding Module (STEM) and a one-step video diffusion model with a Frame Type Embedder (ODFTE); STEM sparsely encodes the original frames into an information-rich intermediate sequence to save bitrate; ODFTE processes that sequence as a whole, with the Frame Type Embedder (FTE) guiding the diffusion model to reconstruct adaptively according to frame type. Result: Experiments on multiple datasets show that Diff-SIT sets a new state of the art in perceptual quality and temporal consistency at ultra-low bitrates. Conclusion: Through sparse representation and frame-type-aware diffusion reconstruction, Diff-SIT combines high compression, high perceptual quality, and strong temporal consistency, offering a new paradigm for generative video compression. Abstract: Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.

[100] HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection

Teerapong Panboonyuen

Main category: cs.CV

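TL;DR: HOMEY is a novel risk detection framework that combines YOLO with a domain-specific masking mechanism and a custom loss function to automatically identify 17 classes of risk factors in property images, outperforming baseline models in accuracy and reliability.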

Details Motivation: Automated property risk detection is a high-impact yet underexplored frontier in computer vision, with direct implications for real estate, underwriting, and insurance operations. Method: The HOMEY framework combines YOLO with heuristic object masking and risk-aware loss calibration to amplify weak signals in cluttered backgrounds and to balance class skew against risk-severity weighting. Result: Experiments on real-world property imagery show that HOMEY outperforms baseline YOLO models in detection accuracy and reliability while retaining fast inference, and supports interpretable, cost-efficient risk analysis. Conclusion: HOMEY lays the foundation for scalable AI-driven property insurance workflows and advances the practical adoption of automated risk detection. Abstract: Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.

[101] From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions

Jingzhi Chen,Lijian Xu

Main category: cs.CV

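TL;DR: This paper reviews the paradigm shift of AI in protein science across five dimensions: multimodal representations, improved static structure prediction, generative modeling, heterogeneous interaction prediction, and functional inference, and identifies current bottlenecks and future directions.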

Details Motivation: AI has fundamentally transformed the protein folding problem, creating a need to systematically trace the path from static structure prediction to modeling dynamic conformations and complex biomolecular interactions. Method: A systematic review organized around five interconnected dimensions: unified multimodal representations, refinement of MSA-free static prediction, generative frameworks based on diffusion and flow matching, interaction prediction for heterogeneous complexes, and inference of function and fitness landscapes. Result: The review identifies key advances in AI-driven protein science, including all-atom complex modeling, generation of thermodynamically consistent conformational distributions, prediction of interactions among multiple biomolecule types, and text-guided prediction of functional properties. Conclusion: AI is moving from a structural analysis tool toward a universal simulator that can understand and rewrite the dynamic language of life; future work needs physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. Abstract: The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA free architectures and all atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein ligand, protein nucleic acid, and protein protein complexes; and functional inference of fitness landscapes, mutational effects, and text guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed loop systems. This methodological transformation marks artificial intelligence's transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.

[102] Foundations and Architectures of Artificial Intelligence for Motor Insurance

Teerapong Panboonyuen

Main category: cs.CV

TL;DR: This handbook introduces a vertically integrated AI paradigm for motor insurance, featuring domain-adapted transformer architectures for vehicle damage analysis, claims evaluation, and underwriting, all deployed in real-world Thai insurance systems with emphasis on MLOps and production reliability.

Details Motivation: To bridge the gap between cutting-edge AI research and reliable, large-scale industrial deployment in high-stakes motor insurance contexts, particularly addressing practical constraints in real-world systems. Method: Develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence; integrates them into a scalable, production-aware pipeline; co-designs learning algorithms with MLOps practices. Result: An end-to-end automated system for vehicle damage analysis, claims evaluation, and underwriting, successfully deployed in nationwide motor insurance systems in Thailand. Conclusion: A principled, vertically integrated AI stack—combining tailored models, multimodal reasoning, and robust MLOps—is essential for translating modern AI into trustworthy, production-grade solutions in regulated, high-impact domains like motor insurance. Abstract: This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. 
Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.

[103] OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Hongjia Zhai,Qi Zhang,Xiaokun Pan,Xiyu Zhang,Yitong Dong,Huaqi Zhang,Dan Xu,Guofeng Zhang

Main category: cs.CV

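TL;DR: This paper proposes OnlinePG, a system that combines 3D Gaussian Splatting with an online local-to-global mapping strategy to achieve open-vocabulary scene understanding and online panoptic mapping, balancing real-time performance with instance-level semantic consistency.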

Details Motivation: Existing methods are mostly offline or lack instance-level understanding, so they cannot meet the needs of real robotic tasks for online, open-vocabulary, instance-aware perception. Method: A local-to-global paradigm with a sliding window; a 3D segment clustering graph that fuses geometric and semantic cues for local consistency; global map updates via explicit grids with spatial attributes and robust bidirectional bipartite 3D Gaussian instance matching; and open-vocabulary understanding from the VLM features inside the grids. Result: On several widely used datasets, OnlinePG achieves the best performance among online methods while maintaining real-time efficiency. Conclusion: OnlinePG addresses the joint problem of online panoptic mapping and open-vocabulary scene understanding, providing a practical and efficient integrated perception-and-mapping framework for embodied intelligence. Abstract: Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.

[104] CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution

Elad Yoshai,Ariel D. Yoshai,Natan T. Shaked

Main category: cs.CV

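TL;DR: This paper proposes CAFlow, an adaptive-depth single-step flow-matching super-resolution framework that dynamically routes each image tile to the shallowest effective network exit, sharply reducing compute while preserving reconstruction quality; on whole-slide image super-resolution in digital pathology it delivers efficient, high-quality, clinically usable inference.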

Details Motivation: Whole-slide images in digital pathology often reach gigapixel scale, making generative super-resolution too computationally expensive for practical deployment. Method: The CAFlow framework uses single-step flow matching in pixel-unshuffled space; a FlowResNet backbone with four early exits (mixing convolution and window self-attention); a lightweight exit classifier for adaptive-depth routing; and half of the training samples set to exact t=0 to secure single-step quality. Result: On multi-organ histopathology x4 SR, adaptive routing reaches 31.72 dB PSNR (only 0.12 dB below full depth), and the shallowest exit beats bicubic interpolation by 1.9 dB with 2.8x less compute than SwinIR-light; at x8 it outperforms comparable-compute baselines and approaches the larger SwinIR-Medium; generalization to colon tissue costs only 0.02 dB; downstream nuclei segmentation confirms structural fidelity; training takes under 5 hours on a single GPU, and whole-slide inference drops from minutes to seconds. Conclusion: Through adaptive compute depth and an efficient flow-matching design, CAFlow makes super-resolution markedly more practical and deployable in digital pathology while maintaining clinical-grade reconstruction quality. Abstract: In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) impractical for routine deployment. We introduce CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. CAFlow performs flow matching in pixel-unshuffled rearranged space, reducing spatial computation by 16x while enabling direct inference. We show that dedicating half of training to exact t=0 samples is essential for single-step quality (-1.5 dB without it). The backbone, FlowResNet (1.90M parameters), mixes convolution and window self-attention blocks across four early exits spanning 3.1 to 13.3 GFLOPs. A lightweight exit classifier (~6K parameters) achieves 33% compute savings at only 0.12 dB cost. On multi-organ histopathology x4 SR, adaptive routing achieves 31.72 dB PSNR versus 31.84 dB at full depth, while the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. The method generalizes to held-out colon tissue with minimal quality loss (-0.02 dB), and at x8 upscaling it outperforms all comparable-compute baselines while remaining competitive with the much larger SwinIR-Medium model. Downstream nuclei segmentation confirms preservation of clinically relevant structure. The model trains in under 5 hours on a single GPU, and adaptive routing can reduce whole-slide inference from minutes to seconds.
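
The "16x" figure for pixel-unshuffled space can be illustrated with a small sketch. It assumes a downscale factor of 4 (so that 4 x 4 = 16 pixels fold into the channel dimension); the abstract does not state the factor, and the helper below is a hypothetical single-channel illustration, not CAFlow's implementation.

```python
def pixel_unshuffle(img, r):
    """Rearrange an H x W grid into an (H/r) x (W/r) grid of r*r-channel vectors."""
    h, w = len(img), len(img[0])
    assert h % r == 0 and w % r == 0, "height and width must be divisible by r"
    return [
        [
            # Gather the r x r block at (i, j) into one channel vector.
            [img[i * r + di][j * r + dj] for di in range(r) for dj in range(r)]
            for j in range(w // r)
        ]
        for i in range(h // r)
    ]


# An 8 x 8 image whose pixel value encodes its position (row * 8 + col).
img = [[row * 8 + col for col in range(8)] for row in range(8)]
out = pixel_unshuffle(img, 4)

positions_before = len(img) * len(img[0])   # 64 spatial positions
positions_after = len(out) * len(out[0])    # 4 spatial positions
channels = len(out[0][0])                   # 16 values per position
```

No information is discarded: the same 64 values are present, but any layer whose cost scales with the number of spatial positions now visits 16 times fewer of them.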

[105] Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Liwei Che,Zhiyu Xue,Yihao Quan,Benlin Liu,Zeru Shi,Michelle Hurst,Jacob Feldman,Ruixiang Tang,Ranjay Krishna,Vladimir Pavlovic

Main category: cs.CV

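TL;DR: This paper studies how large vision-language models (LVLMs) perform counting, finds human-like counting behavior, and uncovers a "counting circuit" shared across tasks; building on this, the authors propose a lightweight intervention that fine-tunes counting on synthetic images only, improving in-distribution and out-of-distribution counting as well as general visual reasoning.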

Details Motivation: Counting is a simple yet powerful test of LVLM reasoning because it forces the model to identify each individual object and sum them; however, it remains unclear how LVLMs implement counting and whether counting ability affects broader visual reasoning. Method: Controlled synthetic and real-world benchmarks combined with mechanistic analysis; two new interpretability methods, Visual Activation Patching and HeadLens, to locate counting-related circuits; and a lightweight counting-specific fine-tuning strategy that uses only synthetic images. Result: LVLMs show human-like counting behavior (precise on small numerosities, noisy estimation on large ones); a structured "counting circuit" is shared across visual reasoning tasks; after fine-tuning on counting only, Qwen2.5-VL gains +8.36% on average on OOD counting benchmarks and +1.54% on average on complex general visual reasoning tasks. Conclusion: Counting plays a central, influential role in visual reasoning; targeted enhancement of counting mechanisms is a promising route to improving the overall visual reasoning of LVLMs. Abstract: Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.

[106] 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Hyun-kyu Ko,Jihyeon Park,Younghyun Kim,Dongheok Park,Eunbyung Park

Main category: cs.CV

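TL;DR: This paper proposes the 3DreamBooth and 3Dapter framework for customized video generation of 3D subjects, decoupling spatial geometry from temporal motion via single-frame optimization and adding a visual conditioning module with multi-view joint optimization to improve texture detail and convergence speed.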

Details Motivation: Existing 2D-based subject-driven video generation methods cannot model true 3D geometry and struggle to preserve the subject's real 3D identity in novel-view synthesis; multi-view video data is also scarce, and direct fine-tuning tends to cause temporal overfitting. Method: 3DreamBooth (1-frame spatial optimization that embeds a robust 3D prior) and 3Dapter (a visual conditioning module jointly optimized over multiple views with an asymmetrical conditioning strategy), the latter acting as a dynamic selective router that queries view-specific geometric hints from a minimal reference set. Result: 3D-aware customized video generation that better preserves the subject's geometric consistency and fine-grained texture in novel-view synthesis, avoids temporal overfitting, and does not require large amounts of multi-view video for training. Conclusion: The framework addresses the geometric inconsistency of 2D-centric methods in 3D customized video generation and offers a new paradigm for few-shot, multi-view-conditioned 3D-aware video generation. Abstract: Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module.
Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/

[107] Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

Yongwei Jiang,Yixiong Zou,Yuhua Li,Ruixuan Li

Main category: cs.CV

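TL;DR: This paper proposes a center-periphery attention refinement framework, inspired by the human retinal fovea, to address target-domain astigmatism in cross-domain few-shot object detection (dispersed attention and imprecise localization), substantially improving adaptation and detection accuracy in the target domain.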

Details Motivation: Cross-domain few-shot object detection suffers from severe domain shift and scarce annotations; the authors find that model attention in the target domain is dispersed and unfocused, analogous to a human eye that cannot focus, and call this the "target-domain astigmatism problem". Method: A center-periphery attention refinement framework with three modules: (1) a Positive Pattern Refinement module (simulating the visual center) that reshapes attention using class-specific prototypes; (2) a Negative Context Modulation module (simulating the visual periphery) that models background to sharpen boundary discrimination; and (3) a Textual Semantic Alignment module that strengthens the center-periphery distinction through cross-modal cues. Result: Consistent detection accuracy gains on six challenging CD-FSOD benchmarks, reaching new state-of-the-art results. Conclusion: The biologically inspired attention refinement mechanism corrects target-domain astigmatism, turning dispersed attention into focused patterns and substantially improving generalization in cross-domain few-shot detection. Abstract: Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, just like a human cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis on attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

[108] CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

Xiang Chen,Fangfang Yang,Chunlei Meng,Chengyin Hu,Ang Li,Yiwei Wei,Jiahuan Long,Jiujiang Guo

Main category: cs.CV

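TL;DR: This paper proposes the CoDA framework, which evaluates the robustness of medical vision-language models (MVLMs) and multimodal large language models (MLLMs) in realistic clinical settings by simulating multi-stage distribution shifts in the clinical imaging pipeline (acquisition, reconstruction, display, and delivery); chained shifts damage performance more than any single stage, and a teacher-guided token-space adaptation repair strategy improves robustness.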

Details Motivation: Existing robustness evaluations of MVLMs mostly use clean images or single corruptions and overlook routine pipeline operations in clinical practice that preserve readability while changing image statistics (acquisition shading, reconstruction mapping, display transforms, export compression, and so on), leaving real deployment reliability unknown. Method: The CoDA chain-of-distribution framework jointly optimizes, under masked structural-similarity (SSIM) constraints, the composition and parameters of several clinically plausible degradation stages (acquisition-like shading, reconstruction and display remapping, delivery and export degradation) to generate visually plausible but distribution-shifted images; a post-hoc repair method based on teacher-guided token-space adaptation with patch-level alignment is also proposed. Result: CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs on brain MRI, chest X-ray, and abdominal CT, with chained degradations more damaging than any single stage; both proprietary and medical-specific MLLMs prove unreliable at auditing imaging realism and quality; the proposed post-hoc repair improves accuracy on CoDA-shifted data. Conclusion: CoDA exposes a systematic robustness threat to MVLMs in real clinical pipelines; lightweight token-space alignment alone improves deployment robustness, offering a new approach to evaluating and strengthening the reliability of clinical AI models. Abstract: Medical vision-language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing.
Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.

[109] HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Dan Ben-Ami,Gabriele Serussi,Kobi Cohen,Chaim Baskin

Main category: cs.CV

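TL;DR: HiMu is a training-free framework for long-video question answering that uses a single text-only LLM call to decompose the question into a hierarchical logic tree and combines lightweight multimodal expert modules through fuzzy logic for efficient, accurate frame selection.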

Details Motivation: Existing frame selection methods for long-video QA face a sharp efficiency-accuracy trade-off: similarity-based methods are fast but lose temporal structure and cross-modal bindings; agent-based methods recover the structure at prohibitive computational cost. Method: HiMu is training-free: a text-only LLM first decomposes the question into a hierarchical logic tree whose leaves are atomic predicates; each predicate is routed to a lightweight multimodal expert (vision: CLIP, open-vocabulary detection, OCR; audio: ASR, CLAP); the signals are normalized, temporally smoothed for alignment, and composed bottom-up with fuzzy-logic operators into a continuous satisfaction curve. Result: On Video-MME, LongVideoBench, and HERBench-Lite, HiMu outperforms all compared selectors with only 16 input frames; with GPT-4o it surpasses agentic systems that use 32-512 frames while requiring roughly 10x fewer FLOPs. Conclusion: HiMu bridges the gap between efficiency and accuracy, improving the quality and robustness of frame selection for long-video QA at very low computational cost. Abstract: Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
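
The bottom-up fuzzy composition in the HiMu abstract can be sketched as follows. The min/max operators, the `then` sequencing rule, and the per-frame scores are all assumptions chosen for illustration; the abstract does not specify which fuzzy operators HiMu uses.

```python
def fuzzy_and(a, b):
    """Per-frame conjunction (min operator, an assumed choice)."""
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_or(a, b):
    """Per-frame disjunction (max operator, an assumed choice)."""
    return [max(x, y) for x, y in zip(a, b)]


def then(a, b):
    """'a happens before b': b's score at frame t, capped by the best a score seen strictly earlier."""
    out, best_a = [], 0.0
    for x, y in zip(a, b):
        out.append(min(best_a, y))
        best_a = max(best_a, x)
    return out


def top_k(curve, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(curve)), key=lambda t: curve[t], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-frame expert scores for "a dog barks, then a door opens".
dog = [0.9, 0.8, 0.1, 0.1, 0.2]    # e.g. from a visual expert
bark = [0.1, 0.7, 0.2, 0.0, 0.1]   # e.g. from an audio expert
door = [0.0, 0.0, 0.1, 0.9, 0.8]   # e.g. from a visual expert

curve = then(fuzzy_and(dog, bark), door)   # satisfaction curve over frames
selected = top_k(curve, 2)
```

Here the two selected frames are the ones where the door is open after a barking dog has already been observed, which a single similarity score over the whole query would not distinguish from frames that merely contain a dog.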

[110] CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Jiacheng Tang,Zhiyuan Zhou,Zhuolin He,Jia Zhang,Kai Zhang,Jian Pu

Main category: cs.CV

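TL;DR: This paper proposes the CausalVAD framework, which uses a sparse causal intervention scheme (SCIS) module for de-confounded training of end-to-end driving models, removing spurious associations caused by confounders and improving planning accuracy, safety, and robustness.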

Details Motivation: Existing planning-oriented end-to-end driving models learn only statistical correlations and are prone to causal confusion driven by dataset bias, which harms their reliability and safety in complex scenarios. Method: The CausalVAD de-confounding training framework centers on the sparse causal intervention scheme (SCIS): it builds a dictionary of prototypes representing latent driving contexts and uses it to intervene on the model's sparse vectorized queries, instantiating backdoor adjustment and removing spurious associations induced by confounders. Result: State-of-the-art planning accuracy and safety on benchmarks such as nuScenes, with stronger robustness to data bias and to noisy scenarios designed to induce causal confusion. Conclusion: CausalVAD mitigates causal confusion and improves the causal soundness, task performance, and deployment safety of end-to-end driving models. Abstract: Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby eliminating spurious factors from the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.
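
For reference, the backdoor adjustment that the CausalVAD abstract refers to has the standard form below, where $X$ is the input, $Y$ the output, and $Z$ a set of confounders satisfying the backdoor criterion. Reading the prototype dictionary as a finite set of strata $z$ is an interpretation, not something the abstract states.

```latex
% Backdoor adjustment: intervene on X by averaging over confounder strata,
% weighting each stratum by its marginal probability rather than P(z | x).
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z)
```

The difference from ordinary conditioning is the weight $P(Z = z)$ in place of $P(Z = z \mid X = x)$, which is what removes the spurious path through the confounder.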

[111] HAViT: Historical Attention Vision Transformer

Swarnendu Banik,Manish Das,Shiv Ram Dubey,Satish Kumar Singh

Main category: cs.CV

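TL;DR: This paper proposes a cross-layer attention propagation method that stores and blends historical attention matrices in ViT to improve inter-layer information flow, feature learning, and model performance; it requires minimal architectural changes and is validated on multiple datasets and models.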

Details Motivation: The attention mechanisms in Vision Transformers operate independently at each layer, limiting information flow and feature learning, so a mechanism that strengthens cross-layer information integration is needed. Method: A cross-layer attention propagation method that preserves and blends the historical attention matrices of the encoder layers, adding attention-matrix storage and a weighted blending operation (with hyperparameter alpha) to support progressive refinement of attention patterns. Result: ViT accuracy rises from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%); CaiT improves by 1.01%; the best blending coefficient is alpha = 0.45; random initialization outperforms zero initialization. Conclusion: The method improves ViT and its variants at very small overhead and shows that making effective use of historical attention information can improve training dynamics and final accuracy, with good generality and practicality. Abstract: Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.
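
A minimal sketch of what blending historical attention might look like is given below. It assumes that alpha weights the historical matrix and that the history is seeded with the first layer's attention; the abstract states neither (it reports that random initialization of the history works best), so treat both choices as placeholders rather than HAViT's actual rule.

```python
def blend_attention(current, history, alpha=0.45):
    """Convex blend of two attention matrices given as lists of rows.

    Assumption: alpha weights the historical matrix, (1 - alpha) the current one.
    """
    return [
        [alpha * h + (1.0 - alpha) * c for c, h in zip(cur_row, hist_row)]
        for cur_row, hist_row in zip(current, history)
    ]


def propagate(per_layer_attention, alpha=0.45):
    """Carry a running history through the layers, blending at each one."""
    history = per_layer_attention[0]          # assumed seed for the history
    blended_layers = [history]
    for current in per_layer_attention[1:]:
        history = blend_attention(current, history, alpha)
        blended_layers.append(history)
    return blended_layers


# Two layers of 2 x 2 attention; every row sums to 1.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.1, 0.9]],
]
out = propagate(layers)
row_sums = [sum(row) for row in out[1]]
```

Because the blend is a convex combination, rows that sum to 1 before blending still sum to 1 afterwards, so the result can be used as attention weights without renormalization.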

[112] Color image restoration based on nonlocal saturation-value similarity

Wei Wang,Yakun Li

Main category: cs.CV

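TL;DR: This paper proposes a novel nonlocal variational method for color image restoration based on saturation-value similarity, measuring patch similarity in the HSV color space to model color information more accurately, and designs an efficient solver based on Bregmanized operator splitting.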

Details Motivation: Traditional nonlocal methods extract patches directly from the RGB channels and compute grayscale similarity, which describes the color information of a color image poorly; this paper uses the saturation and value channels of HSV space to characterize color similarity between patches more accurately. Method: Construct a nonlocal total variation regularizer based on saturation-value similarity, embedded in the definition of the nonlocal gradient; build the corresponding nonlocal variational models; solve them numerically with Bregmanized operator splitting and analyze the convergence of the algorithm. Result: Experiments show the proposed method outperforms the compared methods in visual quality and in quantitative metrics including PSNR, SSIM, QSSIM, and S-CIELAB color error. Conclusion: The nonlocal variational method based on saturation-value similarity preserves and restores the color structure of color images more effectively and is a strong new strategy for color image restoration. Abstract: In this paper, we propose and develop a novel nonlocal variational technique based on saturation-value similarity for color image restoration. In traditional nonlocal methods, image patches are extracted from red, green and blue channels of a color image directly, and the color information can not be described finely because the patch similarity is mainly based on the grayscale value of independent channel. The main aim of this paper is to propose and develop a novel nonlocal regularization method by considering the similarity of image patches in saturation-value channel of a color image. In particular, we first establish saturation-value similarity based nonlocal total variation by incorporating saturation-value similarity of color image patches into the proposed nonlocal gradients, which can describe the saturation and value similarity of two adjacent color image patches. The proposed nonlocal variational models are then formulated based on saturation-value similarity based nonlocal total variation. Moreover, we design an effective and efficient algorithm to solve the proposed optimization problem numerically by employing bregmanized operator splitting method, and we also study the convergence of the proposed algorithms. Numerical examples are presented to demonstrate that the performance of the proposed models is better than that of other testing methods in terms of visual quality and some quantitative metrics including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), quaternion structural similarity index (QSSIM) and S-CIELAB color error.
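
To make the idea of saturation-value patch similarity concrete, here is a small sketch that uses the standard HSV conversion from Python's `colorsys` module and a Gaussian-type weight. Both are assumptions: the paper may define saturation and value differently, and the weight function and bandwidth `h` are illustrative, not taken from the abstract.

```python
import colorsys
import math


def sv_patch(patch):
    """Map a patch of (r, g, b) pixels in [0, 1] to (saturation, value) pairs."""
    return [colorsys.rgb_to_hsv(r, g, b)[1:] for (r, g, b) in patch]


def sv_weight(patch_a, patch_b, h=0.5):
    """Nonlocal weight from squared saturation-value distance (assumed Gaussian form)."""
    dist = sum(
        (sa - sb) ** 2 + (va - vb) ** 2
        for (sa, va), (sb, vb) in zip(sv_patch(patch_a), sv_patch(patch_b))
    )
    return math.exp(-dist / (h * h))


# Pure red and mid gray have the same channel mean (1/3), so a grayscale
# comparison cannot separate them, but their saturation and value differ.
red = [(1.0, 0.0, 0.0)] * 4
gray = [(1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)] * 4

w_same = sv_weight(red, red)
w_diff = sv_weight(red, gray)
```

Identical patches receive weight 1, while the red and gray patches receive a much smaller weight even though their averaged intensities coincide.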

[113] AU Codes, Language, and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis

Jiahe Wang,Cong Liang,Xuandong Huang,Yuxin Wang,Xin Yun,Yi Wu,Yanan Chang,Shangfei Wang

Main category: cs.CV

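TL;DR: This paper proposes a new approach to facial behavior synthesis based on natural language descriptions of Action Units (AUs), addressing the anatomical implausibility and conflicting-AU modeling problems caused by linear combination in conventional AU encoding; it builds BP4D-AUText, the first large-scale AU text-image paired dataset, and designs VQ-AUFace, a generative model that uses facial structural priors and outperforms existing methods in anatomical plausibility, behavioral richness, and perceptual realism.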

Details Motivation: Existing text-to-face models rely on coarse emotion categories and lack the nuance of nonverbal communication; AUs are more precise, but the prevailing one-hot encoding and linear combination cannot model muscle-level AU conflicts (such as opposing actions of the same muscle), leading to anatomical implausibility and distorted motion. Method: Describe AUs in natural language instead of one-hot encoding, explicitly modeling complex and conflicting AUs; build BP4D-AUText, the first large-scale AU text-image paired dataset (generated from BP4D/BP4D+ by a rule-based Dynamic AU Text Processor); design the VQ-AUFace generative model, which incorporates facial structural priors for high-fidelity text-driven face synthesis. Result: The approach outperforms existing methods in quantitative experiments and user studies, with clear gains in anatomical plausibility, behavioral diversity, and perceptual realism, especially under conflicting AUs. Conclusion: Language-based AU representation is an effective paradigm for modeling complex facial behavior; combining a generative model with structural priors and a dedicated text-image dataset can move facial behavior synthesis toward finer, more natural, and more controllable results. Abstract: Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs--defined as those which activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text.
Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.

[114] myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition

Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi

Main category: cs.CV

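TL;DR: This paper presents the first systematic benchmark on myMNIST (formerly BHDD), a Burmese handwritten digit dataset, evaluating 11 models. The CNN performs best, with PETNN (GELU) close behind; the energy-based JEM is also competitive, and KAN-based models trail slightly but remain useful. The study establishes reproducible baselines, highlights PETNN's strengths, and releases the benchmark to advance research on regional script recognition.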

Details Motivation: To establish a reproducible, cross-paradigm systematic benchmark on the myMNIST dataset, filling the gap in systematic evaluation of emerging and classical models for Burmese handwritten digit recognition, and to advance AI research on regional scripts. Method: Systematically evaluates 11 models on myMNIST: MLP, CNN, LSTM, GRU, Transformer, FastKAN, EfficientKAN, JEM, and three PETNN variants (Sigmoid/GELU/SiLU), using Precision, Recall, F1-Score, and Accuracy as metrics. Result: The CNN achieves the best performance (F1 = 0.9959, Accuracy = 0.9970); PETNN (GELU) follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and the KAN variants; JEM performs solidly (F1 = 0.9944, Accuracy = 0.9958); KAN-based models reach an Accuracy of about 0.992. Conclusion: The CNN remains a strong baseline; PETNN adapts well to regional script recognition; the energy-based JEM confirms the potential of energy-based modeling; KAN-based models offer a new direction but need improvement; the benchmark provides a standardized evaluation basis and open resources for future research on Burmese and similar regional scripts. Abstract: We present the first systematic benchmark on myMNIST (formerly BHDD), a publicly available Burmese handwritten digit dataset important for Myanmar NLP/AI research. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for myMNIST across diverse modeling paradigms, (ii) highlight PETNN's strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM).
We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.
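The four metrics named in the abstract can be computed from predicted and true digit labels as in the minimal sketch below. It assumes macro averaging over classes; the abstract does not say which averaging the authors use, and the function name `classification_metrics` is hypothetical.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Assumption: macro averaging (unweighted mean over classes); the
    benchmark's actual averaging scheme is not stated in the abstract.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Macro averaging weights every digit class equally, which is a common choice when classes are roughly balanced.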

[115] Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Lu Yu,Haiyang Zhang,Changsheng Xu

Main category: cs.CV

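TL;DR: This paper proposes two text-guided-attention methods for improving zero-shot robustness (TGA-ZSR and Comp-TGA). Through local attention refinement and global attention constraints, they improve the robustness of vision-language models such as CLIP to adversarial examples, raising zero-shot robust accuracy across 16 datasets by 9.58% and 11.95%, respectively.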

Details Motivation: Pre-trained vision-language models such as CLIP have strong zero-shot capabilities but are vulnerable to adversarial examples; experiments show that adversarial perturbations shift text-guided attention, so a mechanism is needed to stabilize this attention while balancing robustness and generalization. Method: Proposes the TGA-ZSR framework, comprising a Local Attention Refinement Module and a Global Attention Constraint Module; further proposes Comp-TGA, which combines attention guided by the class prompt with reversed attention driven by the non-class prompt for complementary foreground modeling. Result: TGA-ZSR and Comp-TGA improve zero-shot robust accuracy over the state of the art by 9.58% and 11.95%, respectively, across 16 datasets. Conclusion: Text-guided attention effectively improves CLIP's zero-shot robustness; the complementary attention mechanism (Comp-TGA) is more comprehensive and accurate than single attention (TGA-ZSR), markedly strengthening robustness without sacrificing clean-sample performance. Abstract: Due to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground.
The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.

[116] SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

Jialiang Kang,Han Shu,Wenshuo Li,Yingjie Zhai,Xinghao Chen

Main category: cs.CV

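TL;DR: SJD-PAC is an improved Speculative Jacobi Decoding framework that uses a proactive drafting strategy and an adaptive continuation mechanism to substantially raise token acceptance rates in high-entropy visual generation, achieving a 3.8× speedup without loss of image quality.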

Details Motivation: The original SJD has low draft-token acceptance rates in high-entropy visual generation, which limits inference throughput. Method: Proposes SJD-PAC: (1) a proactive drafting strategy to raise local acceptance rates in complex regions; (2) an adaptive continuation mechanism that keeps validating the sequence after the first rejection instead of resampling entirely. Result: Achieves a 3.8× inference speedup on standard text-to-image benchmarks with lossless image quality. Conclusion: SJD-PAC effectively relieves the acceptance-rate bottleneck in high-entropy generation, substantially improving inference efficiency while strictly preserving the target distribution. Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a $3.8\times$ speedup with lossless image quality.
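For readers unfamiliar with why a rejection-based scheme can "strictly preserve the target distribution", the sketch below shows the generic speculative-sampling acceptance rule for a single token. It is not SJD-PAC's proactive drafting or adaptive continuation, whose details the abstract does not give; the function name `accept_or_resample` and the toy distributions are assumptions for illustration.

```python
import random


def accept_or_resample(draft_token, p_target, q_draft, rng):
    """Generic speculative-sampling rule for one token (illustrative only).

    Accept the drafted token with probability min(1, p/q); on rejection,
    resample from the normalized residual max(0, p - q). The returned
    token is then distributed exactly according to p_target.
    """
    p, q = p_target[draft_token], q_draft[draft_token]
    ratio = min(1.0, p / q) if q > 0 else 1.0
    if rng.random() < ratio:
        return draft_token, True
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:
        # Guard for degenerate inputs: fall back to the target distribution.
        residual, total = list(p_target), sum(p_target)
    threshold = rng.random() * total
    cumulative, last_positive = 0.0, 0
    for token, weight in enumerate(residual):
        if weight > 0.0:
            last_positive = token
        cumulative += weight
        if threshold < cumulative:
            return token, False
    return last_positive, False  # only reachable through float rounding
```

The acceptance probability falls as the draft distribution drifts from the target, which is why broad, high-entropy distributions tend to produce the low acceptance rates the abstract describes.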

[117] Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Bingqi Ma,Linlong Lang,Ming Zhang,Dailan He,Xingtong Ge,Yi Zhang,Guanglu Song,Yu Liu

Main category: cs.CV

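TL;DR: This paper proposes Cross-Modal Context Learning (CCL), a new method for improving joint audio-video generation under the dual-stream Transformer architecture. By introducing the TARP, LCT, DCR, and UCG modules, it mitigates model manifold variations, background biases, CFG inconsistencies, and multi-condition conflicts in cross-modal interaction, achieving SOTA performance with lower resource consumption.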

Details Motivation: Existing dual-stream Transformer audio-video generation methods suffer from model manifold variations caused by the cross-modal gating mechanism, biases in multi-modal background regions caused by cross-modal attention, inconsistent multi-modal classifier-free guidance (CFG) between training and inference, and conflicts among multiple conditions. Method: Proposes Cross-Modal Context Learning (CCL), comprising: (1) Temporally Aligned RoPE and Partitioning (TARP), which improves temporal alignment between audio and video latent representations; (2) Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) within Cross-Modal Context Attention (CCA), which provide stable unconditional anchors and adapt to different training tasks; (3) Unconditional Context Guidance (UCG) at inference, which uses LCT to improve CFG consistency and ease condition conflicts. Result: CCL outperforms recent academic methods across comprehensive evaluations, reaching SOTA performance while substantially reducing compute requirements. Conclusion: CCL addresses key cross-modal modeling deficiencies in dual-stream audio-video generation, improving training stability, inference consistency, and generation quality, and offers a new paradigm for efficient, high-quality multi-modal generation. Abstract: The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model's convergence speed and generation quality.
During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.

[118] Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Lukas Bayer,Sheethal Bhat,Andreas Maier

Main category: cs.CV

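TL;DR: This study systematically compares three hybrid Transformer models (UNETR, SwinUNETR, UNETR++) with the CNN baseline SegResNet on abdominal multi-organ segmentation using the RATIC dataset. SegResNet performs best overall; among the Transformer models, UNETR++ comes closest to it, and UNETR converges fastest.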

Details Motivation: Accurate multi-organ segmentation of abdominal CT is essential for computer-aided diagnosis and treatment; although Transformers have advantages in modeling long-range dependencies, their practical performance on small- to medium-sized heterogeneous medical datasets remains unclear and calls for systematic evaluation. Method: On the RATIC abdominal CT dataset of 206 scans from 23 institutions, compares UNETR, SwinUNETR, and UNETR++ against SegResNet under identical preprocessing and training settings, with the Dice Similarity Coefficient (DSC) as the primary metric. Result: SegResNet achieves the highest overall performance, outperforming all Transformer models; UNETR++ is the best among the Transformers; UNETR converges fastest. Conclusion: For small- to medium-sized heterogeneous medical imaging datasets, well-optimized CNN architectures such as SegResNet remain highly competitive and can outperform current hybrid Transformer models. Abstract: Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.
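The Dice Similarity Coefficient used as the primary metric is DSC = 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it per organ label on flattened label volumes; the function name `dice_coefficient` and the convention of returning 1.0 when the organ is absent from both masks are assumptions, not details from the paper.

```python
def dice_coefficient(pred, target, label):
    """Dice Similarity Coefficient for one organ label.

    `pred` and `target` are flattened label volumes (one integer label
    per voxel). Returning 1.0 when the label is absent from both is an
    assumed convention.
    """
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_count + target_count == 0:
        return 1.0
    return 2.0 * overlap / (pred_count + target_count)
```

A multi-organ score is typically obtained by averaging this value over the organ labels, though the exact aggregation used in the benchmark is not stated in the abstract.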

[119] OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Bin Cao,Sipeng Zheng,Hao Luo,Boyuan Li,Jing Liu,Zongqing Lu

Main category: cs.CV

TL;DR: 本文提出了OpenT2M大规模高质量开源运动数据集和基于其预训练的MonoFrill运动模型,通过新提出的2D-PRQ运动分词器提升文本到动作生成的泛化与零样本性能。

Details Motivation: 现有文本到动作(T2M)模型在未见过的文本描述上表现差,主要受限于运动数据集规模小、多样性不足。 Method: 构建百万级、2800小时以上的高质量开源运动数据集OpenT2M,含物理可行性验证与细粒度文本标注;设计自动化长时序合成流程;提出MonoFrill模型及其核心组件——基于人体生物学部位划分的2D-PRQ运动分词器。 Result: OpenT2M显著提升现有T2M模型的泛化能力;2D-PRQ在运动重建和零样本生成任务中表现优异。 Conclusion: OpenT2M和MonoFrill共同解决了T2M领域长期存在的数据质量与基准评测难题,有望推动该方向发展。 Abstract: Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as "frills". Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biology parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.

[120] GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Yueying Zou,Pei Pei Li,Zekun Li,Xinyu Guo,Xing Cui,Huaibo Huang,Ran He

Main category: cs.CV

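TL;DR: This paper proposes GenVideoLens, a fine-grained benchmark for evaluating the multi-dimensional capabilities of Large Vision-Language Models (LVLMs) in AI-generated video detection, revealing significant weaknesses in optical consistency, physical interactions, and temporal-causal reasoning.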

Details Motivation: Existing evaluations treat AI-generated video detection as a binary classification task and rely on coarse accuracy metrics, which makes it hard to tell why LVLMs succeed or fail on specific authenticity cues. Method: Builds the GenVideoLens benchmark of 500 videos (400 AI-generated and 100 real), annotated by experts across 15 authenticity dimensions (perceptual, optical, physical, temporal); systematically evaluates 11 representative LVLMs and runs temporal perturbation experiments. Result: LVLMs show a pronounced imbalance across dimensions: they do relatively well on perceptual cues but fall well short on optical consistency, physical interactions, and temporal-causal reasoning; smaller open-source models sometimes outperform larger proprietary models on certain dimensions; temporal perturbation experiments show that current LVLMs make limited use of temporal information. Conclusion: GenVideoLens provides diagnostic insight into LVLM behavior in AI-generated video detection, identifies key capability gaps, and points the way for future system improvements. Abstract: In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.

[121] GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments

Zelin Liu,Bocheng Li,Yuling Zhou,Xuanting Li,Yixuan Yang,Jing Wang,Weishu Zhao,Xiaofeng Gao

Main category: cs.CV

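TL;DR: This paper proposes the GEAR framework, a three-stage pipeline (skeleton-guided screening, physics-aware filtering, graph-based fine recognition) that efficiently retrieves terrestrial analogs of the Mariana Trench across 2.5 million square kilometers of the Qinghai-Tibet Plateau. It also designs the MSG-Net model to improve topographic similarity recognition and shows that its features correlate significantly with biological data.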

Details Motivation: Deep-sea sampling is prohibitively expensive, so terrestrial analogs on the Qinghai-Tibet Plateau that resemble the Mariana Trench in geological origin and microbial function are needed; existing models cannot combine geographical knowledge with computational efficiency. Method: Proposes the three-stage GEAR framework: (1) skeleton-guided screening and clipping; (2) physics-aware filtering based on the Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM); (3) graph-based fine recognition with the Morphology-integrated Siamese Graph Network (MSG-Net), built on geomorphological metrics. Also releases an expert-annotated topographic similarity dataset targeting tectonic collision zones. Result: Every stage is shown to be effective; MSG-Net improves F1-Score by 1.38 percentage points over the SOTA baseline; features extracted by MSG-Net correlate significantly with biological data. Conclusion: GEAR efficiently and accurately identifies topographic analogs of the Mariana Trench on the Qinghai-Tibet Plateau, offering a low-cost terrestrial alternative for deep-sea research and supporting subsequent biological analysis. Abstract: The Mariana Trench and the Qinghai-Tibet Plateau exhibit significant similarities in geological origins and microbial metabolic functions. Given that deep-sea biological sampling faces prohibitive costs, recognizing structurally homologous terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is of great significance. Yet, no existing model adequately addresses cross-domain topographic similarity retrieval, either neglecting geographical knowledge or sacrificing computational efficiency. To address these challenges, we present the Geography-knowledge Enhanced Analog Recognition (GEAR) Framework, a three-stage pipeline designed to efficiently retrieve analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau: (1) Skeleton guided Screening and Clipping: Recognition of candidate valleys and initial screening based on size and linear morphological criteria. (2) Physics aware Filtering: The Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM) evaluate the waveform and texture and filter out inconsistent candidate valleys. (3) Graph based Fine Recognition: We design a Morphology-integrated Siamese Graph Network (MSG-Net) based on geomorphological metrics. Correspondingly, we release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Experiments demonstrate the effectiveness of every stage.
Besides, MSG-Net achieved an F1-Score 1.38 percentage points higher than the SOTA baseline. Using features extracted by MSG-Net, we discovered a significant correlation with biological data, providing evidence for future biological analysis.

[122] SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery

Rong Fu,Jiekai Wu,Haiyun Wei,Xiaowen Ma,Shiyin Lin,Kangan Qian,Chuang Liu,Jianyuan Ni,Simon James Fong

Main category: cs.CV

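TL;DR: This paper proposes SwiftGS, a meta-learned system that predicts geometry-radiation-decoupled Gaussian primitives and a lightweight signed distance function (SDF) in a single forward pass, enabling fast, large-scale 3D reconstruction from multi-date satellite imagery at significantly lower computational cost while maintaining accuracy.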

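Details Motivation: Existing methods struggle with illumination changes, sensor heterogeneity, and the high computational cost of per-scene optimization in 3D reconstruction from multi-date satellite imagery. Method: Proposes SwiftGS: a meta-learning framework with episodic training to acquire transferable priors; a differentiable physics graph that models projection, illumination, and sensor response; a spatial gating mechanism that blends sparse Gaussian detail with global SDF structure; plus semantic-geometric fusion, conditional lightweight task heads, multi-view supervision from a frozen geometric teacher, and an uncertainty-aware multi-task loss. Result: Supports zero-shot reconstruction at inference with optional compact calibration; achieves accurate digital surface model (DSM) reconstruction and view-consistent rendering; significantly reduces computational cost; ablations confirm the value of the hybrid representation, physics-aware rendering, and the meta-training strategy. Conclusion: SwiftGS offers an efficient, robust, and generalizable paradigm for large-scale 3D reconstruction from multi-date satellite imagery, balancing speed, accuracy, and practicality. Abstract: Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.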

[123] Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

Jiayi Luo,Jiayu Chen,Jiankun Wang,Cong Wang,Hanxin Zhu,Qingyun Sun,Chen Gao,Zhibo Chen,Jianxin Li

Main category: cs.CV

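TL;DR: This paper proposes SVOO, a training-free sparse attention framework for video generation that combines offline layer-wise sensitivity profiling with online bidirectional co-clustering, significantly speeding up inference while maintaining high quality.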

Details Motivation: Existing training-free sparse attention methods for video generation ignore layer heterogeneity and query-key coupling, resulting in a poor quality-speedup trade-off. Method: SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to determine each layer's intrinsic pruning level; (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Result: Validated on seven widely used video generation models, SVOO delivers up to 1.93× speedup over SOTA methods while maintaining a PSNR of up to 29 dB on Wan2.1. Conclusion: By exploiting the intrinsic per-layer nature of attention sparsity and query-key coupling, SVOO achieves a better quality-speed trade-off and offers a new direction for efficient video generation. Abstract: Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
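The PSNR figure quoted above follows the standard definition PSNR = 10 · log10(MAX² / MSE). The sketch below computes it for flattened pixel values, assuming 8-bit data (MAX = 255); the function name `psnr` is hypothetical, and which reference video the authors compare against is not stated in the abstract.

```python
import math


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened frames.

    Assumes both inputs have the same length and share the pixel range
    [0, max_value]. Identical inputs give infinity.
    """
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, a per-pixel error of 10 gray levels on 8-bit data gives about 28.13 dB.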

[124] PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance

Cong Wang,Hanxin Zhu,Xiao Tang,Jiayi Luo,Xin Jin,Long Chen,Fei-Yue Wang,Zhibo Chen

Main category: cs.CV

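TL;DR: This paper proposes PhysVideo, a two-stage framework for generating physically consistent video: the first stage uses Phys4View to generate physics-aware orthogonal foreground videos, and the second uses VideoSyn to synthesize full videos with background. It also builds PhysMV, a multi-view dataset of 160K video sequences.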

Details Motivation: Existing video generation methods have made great progress in visual fidelity but struggle to guarantee physically consistent motion, because real object motion occurs in three-dimensional space while video provides only partial, view-dependent 2D projections. Method: Proposes the two-stage PhysVideo framework. In the first stage, Phys4View uses physics-aware attention to model how physical attributes influence motion, and combines geometry-enhanced cross-view attention with temporal attention to improve spatio-temporal consistency. In the second stage, VideoSyn takes the generated orthogonal foreground videos as guidance and learns the interaction between foreground dynamics and background context for controllable video synthesis. The PhysMV multi-view dataset (40K scenes, four orthogonal views per scene, 160K video sequences in total) supports training. Result: Experiments show that PhysVideo significantly outperforms existing video generation methods in physical realism and spatio-temporal coherence. Conclusion: By introducing 3D physical priors and multi-view modeling, PhysVideo improves the physical consistency and spatio-temporal coherence of motion in generated video and offers a new paradigm for video generation. Abstract: Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Home page: https://anonymous.4open.science/w/Phys4D/.

[125] MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration

Teer Song,Yue Zhang,Yu Tian,Ziyang Wang,Xianlin Zhang,Guixuan Zhang,Xuan Liu,Xueming Li,Yasen Zhang

Main category: cs.CV

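TL;DR: This paper proposes MeInTime, a diffusion-based method for cross-age reference-based face restoration. It decouples identity and age modeling and introduces a new attention mechanism, Gated Residual Fusion modules, and a training-free Age-Aware Gradient Guidance strategy, achieving age consistency while preserving identity fidelity.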

Details Motivation: Existing reference-based face restoration methods implicitly assume that the reference and the degraded input are the same age, which limits them in practical scenarios such as historical photo restoration where only cross-age references are available. Method: Proposes MeInTime: (1) during training, decouples identity and age conditioning, injects identity features through a newly introduced attention mechanism, and uses Gated Residual Fusion modules to fuse degraded features with identity representations; (2) at inference, applies the training-free Age-Aware Gradient Guidance strategy, which uses an age-driven direction to steer the denoising latent toward the semantic manifold of the target age. Result: Experiments on multiple datasets show that MeInTime outperforms existing methods in both identity fidelity and age consistency. Conclusion: MeInTime extends reference-based face restoration to cross-age settings and provides an effective solution for practical applications such as historical image restoration. Abstract: To better preserve an individual's identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or a few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime

[126] Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA

Ruizhi Yu,Keyang Zhong,Peng Liu,Qi Wu,Haoran Zhang,Yanhao Zhang,Chen Chen,Haonan Lu

Main category: cs.CV

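TL;DR: This paper presents Click-to-Ask, an AI assistant for live streaming commerce with an offline module (which processes multimodal product information and generates structured data and compliant copywriting) and an online module (which responds to viewer questions in real time), significantly improving preparation efficiency, content engagement, and response speed.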

Details Motivation: To make product promotion more efficient and convenient for streamers in live streaming commerce, and to address the time cost of real-time interaction and content preparation. Method: Designs an AI assistant with offline and online modules: the offline module processes multimodal product information and generates structured data and compliant copywriting; the online module combines the offline output with an event-level historical memory maintained in a streaming architecture to answer clicked questions in real time. Result: On a self-collected dataset of TikTok live stream frames, the system reaches a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876. Conclusion: Click-to-Ask markedly improves preparation efficiency, interactivity, and response timeliness in live streaming commerce and shows good practical potential. Abstract: Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: https://www.youtube.com/shorts/mWIXK-SWhiE.

[127] Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

Pius Horn,Janis Keuper

Main category: cs.CV

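TL;DR: This paper proposes a new semantic table evaluation framework based on synthetic PDFs and LLM-as-a-judge, which correlates with human judgment far better than conventional rule-based metrics such as TEDS and GriTS. Backed by large-scale human validation and an evaluation of 21 PDF parsers, it provides a reproducible, scalable evaluation method for table extraction.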

Details Motivation: Existing evaluations of PDF table extraction rely on rule-based metrics that cannot capture semantic equivalence and therefore cannot measure whether content is truly consistent. Method: Builds a benchmark of synthetic PDFs generated from LaTeX (with tables sourced from arXiv), designs a semantic matching evaluation pipeline that incorporates LLM-as-a-judge, and validates it against more than 1,500 human judgments. Result: LLM-based evaluation reaches a Pearson correlation of r=0.93 with human judgment, well above TEDS (r=0.68) and GriTS (r=0.70); evaluating 21 parsers on 100 synthetic documents (451 tables) reveals significant performance differences. Conclusion: LLM-as-a-judge is a more reliable, semantics-aware paradigm for table evaluation; the framework offers practical parser selection guidance and a reproducible evaluation standard for table extraction in scientific data mining. Abstract: Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
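The agreement figures above are Pearson correlation coefficients between a metric's scores and human judgments. As a minimal sketch of how such a coefficient is computed, assuming paired score lists of equal length (the function name `pearson_r` and the sample values are illustrative, not data from the study):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: hypothetical metric scores vs. human ratings.
metric_scores = [0.2, 0.5, 0.7, 0.9]
human_ratings = [1.0, 2.0, 4.0, 5.0]
print(round(pearson_r(metric_scores, human_ratings), 3))
```

A value near 1 means the metric ranks and spaces table extractions much as human raters do.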

[128] SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Carlos Hinojosa,Clemens Grange,Bernard Ghanem

Main category: cs.CV

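TL;DR: This paper studies how strongly vision-language models (VLMs) depend on semantic cues when making safety decisions. It proposes a semantic steering framework and the SAVeS benchmark, and finds that safety judgments are easily swayed by textual, visual, and other semantic interventions, indicating reliance on statistical associations rather than genuine visual understanding and exposing a potential vulnerability in multimodal safety systems.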

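Details Motivation: In real-world and embodied settings, VLM safety decisions depend on visual context, but which visual evidence drives these judgments is unclear; the question is whether they rest on superficial semantic cues rather than genuine visual understanding. Method: Proposes a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing scene content; builds the SAVeS benchmark and an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Result: Experiments show that the safety decisions of multiple VLMs are highly sensitive to semantic cues and rely on learned visual-linguistic associations rather than grounded visual understanding; automated steering pipelines can exploit this mechanism, exposing system fragility. Conclusion: The multimodal safety behavior of VLMs can be manipulated by simple semantic cues, which indicates that current approaches lack true visual grounding and that more robust, interpretable safety mechanisms are needed. Abstract: Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.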

[129] Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation

Jingguo Qu,Xinyang Han,Yao Pu,Man-Lik Chui,Simon Takadiyi Gunda,Ziman Chen,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying

Main category: cs.CV

TL;DR: This paper proposes Switch, a new semi-supervised learning framework that uses Multiscale Switch (MSS) and Frequency Domain Switch (FDS) strategies to substantially improve ultrasound image segmentation, exceeding fully supervised baselines even at very low labeling ratios.

Details Motivation: Medical ultrasound image segmentation is hampered by scarce labeled data, speckle noise, and low-contrast boundaries; existing semi-supervised methods underuse unlabeled data and lack robust feature representation mechanisms. Method: Proposes the Switch framework, comprising (1) a Multiscale Switch (MSS) strategy that achieves uniform spatial coverage through hierarchical patch mixing, and (2) a Frequency Domain Switch (FDS) combined with contrastive learning that performs amplitude switching in the Fourier domain to make features more robust; the whole uses a teacher-student architecture to exploit labeled and unlabeled data jointly. Result: Validated on six ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, prostate); at a 5% labeling ratio, Dice reaches 80.04% (LN-INT), 85.52% (DDTI), and 83.48% (Prostate), exceeding fully supervised baselines; the model has only 1.8M parameters. Conclusion: Switch combines high performance with high efficiency at very low annotation cost, offering a practical semi-supervised solution for resource-constrained medical image analysis. Abstract: Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at https://github.com/jinggqu/Switch
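The abstract describes FDS only as "amplitude switching in Fourier space". One generic way to read that phrase is to keep one image's phase spectrum while taking another image's amplitude spectrum. The sketch below shows that operation with NumPy; it is an illustration of the general idea under that assumption, and `swap_amplitude` is a hypothetical name, not a function from the Switch repository.

```python
import numpy as np

def swap_amplitude(img_a, img_b):
    """Combine the Fourier phase of img_a with the Fourier amplitude of img_b."""
    fa = np.fft.fft2(img_a)
    fb = np.fft.fft2(img_b)
    # Amplitude from b, phase from a.
    mixed = np.abs(fb) * np.exp(1j * np.angle(fa))
    # For real inputs the mixed spectrum stays conjugate-symmetric, so the
    # inverse transform is real up to floating-point error.
    return np.real(np.fft.ifft2(mixed))

# Random stand-ins for two grayscale images of the same size.
rng = np.random.default_rng(0)
a = rng.random((8, 8))
b = rng.random((8, 8))
out = swap_amplitude(a, b)
```

Because phase carries most of an image's spatial layout, `out` keeps the structure of `a` while inheriting the intensity statistics encoded in the amplitude of `b`. How Switch selects pairs, which frequency bands it switches, and how the result feeds the contrastive loss are not specified in the abstract.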

[130] Multimodal Model for Computational Pathology: Representation Learning and Image Compression

Peihang Wu,Zehong Chen,Lijian Xu

Main category: cs.CV

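TL;DR: This review surveys recent advances in multimodal computational pathology, focusing on the challenges of extreme whole slide image (WSI) resolution, scarce annotations, difficult multimodal fusion, and poor model interpretability. It analyzes four research directions: self-supervised representation learning with structure-aware token compression, multimodal data generation and augmentation, parameter-efficient adaptation with reasoning-enhanced few-shot learning, and multi-agent collaborative reasoning. It argues that unified multimodal frameworks integrating high-resolution imaging with clinical knowledge are needed for interpretable, safe AI-assisted diagnosis.

Details Motivation: The extreme resolution of WSIs makes computation difficult; scarce expert annotations limit supervised learning; multimodal fusion struggles to preserve biological interpretability; and modeling ultra-long visual sequences lacks clinical transparency. Method: A systematic review organized around four research directions: (1) self-supervised representation learning and structure-aware token compression; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; (4) multi-agent collaborative reasoning for trustworthy diagnosis. It pays particular attention to how token compression supports cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" to achieve uncertainty-aware evidence fusion. Result: Maps the key technical routes and development of multimodal computational pathology, and identifies the potential of token compression and multi-agent collaboration for improving interpretability and diagnostic trustworthiness. Conclusion: Future progress depends on unified multimodal frameworks that integrate high-resolution visual data, clinical reports, and biomedical knowledge to support interpretable, safe AI-assisted pathology diagnosis. Abstract: Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist's "Chain of Thought" across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.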

[131] Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels

Juan Miguel Valverde,Dim P. Papadopoulos,Rasmus Larsen,Anders Bjorholm Dahl

Main category: cs.CV

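TL;DR: This paper proposes SCNP, which improves topology accuracy in image segmentation by penalizing each pixel's logits with those of its poorest-classified neighbor. It applies to varied structure morphologies and modalities, is validated on 13 datasets, and can be integrated into multiple segmentation frameworks and loss functions.

Details Motivation: Standard deep learning segmentation models cannot guarantee topology accuracy (for example, the number of connected components), which degrades segmentation quality and the reliability of downstream quantification; existing remedies are hard to integrate, computationally expensive, or restricted to particular morphologies. Method: Proposes SCNP (Same-Class Neighbor Penalization): during training, each pixel's logits are penalized using its poorest-classified neighbor of the same class, forcing the model to improve predictions at neighboring pixels first and thereby improving topological consistency. Result: Improves topology accuracy on 13 datasets covering different structure morphologies (not only tubular) and image modalities; integrates into three semantic and instance segmentation frameworks and several loss functions; code is publicly available. Conclusion: SCNP is a lightweight, general, easy-to-integrate topology-enhancing strategy that improves topological robustness across segmentation tasks without architectural changes. Abstract: Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://jmlipman.github.io/SCNP-SameClassNeighborPenalization.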
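The abstract does not spell out the neighborhood or how the penalty enters the loss, so the following is only one plausible reading of "penalizing the logits with their poorest-classified neighbor". It assumes a square neighborhood and considers just each pixel's logit for its own ground-truth class: that logit is replaced by the lowest such logit among same-label neighbors, so a loss computed on the result cannot improve at a pixel until its worst neighbor improves. The function name and the example values are hypothetical.

```python
def penalize_with_worst_neighbor(true_class_logits, labels, radius=1):
    """Replace each pixel's true-class logit by the lowest true-class logit
    among same-label pixels in its (2*radius+1) x (2*radius+1) neighborhood.

    true_class_logits[i][j] is the logit the model assigns to the
    ground-truth class of pixel (i, j); labels[i][j] is that class.
    """
    h, w = len(labels), len(labels[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            worst = true_class_logits[i][j]
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    inside = 0 <= ni < h and 0 <= nj < w
                    if inside and labels[ni][nj] == labels[i][j]:
                        worst = min(worst, true_class_logits[ni][nj])
            out[i][j] = worst
    return out

# Tiny 2x3 example: a foreground region (label 1) next to background (label 0).
labels = [[1, 1, 0],
          [1, 1, 0]]
logits = [[3.0, 0.5, 2.0],
          [4.0, 3.5, 1.0]]
penalized = penalize_with_worst_neighbor(logits, labels)
```

In this example every foreground pixel inherits the weak logit 0.5 of its poorly classified neighbor, and both background pixels inherit 1.0. The authors' implementation at the linked project page should be treated as the reference for the actual method.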

[132] Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

Mohamed Youssef,Mayar Elfares,Anna-Maria Meer,Matteo Bortoletto,Andreas Bulling

Main category: cs.CV

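TL;DR: This paper proposes Ontology-Guided Diffusion (OGD), a neuro-symbolic, ontology-guided framework for zero-shot simulation-to-reality (sim2real) image translation. It models realism as structured knowledge (interpretable traits such as lighting and materials, plus a graph of their relationships) and drives a diffusion model through graph neural network embeddings together with symbolic planning, improving both zero-shot sim2real performance and interpretability.

Details Motivation: Sim2real transfer is limited by the scarcity of labelled real-world data; existing diffusion-based methods rely on unstructured prompts or statistical alignment and struggle to capture the structured factors that make an image look real. Method: The OGD framework (1) builds a realism ontology (interpretable traits such as lighting and materials) and its knowledge graph; (2) infers trait activations from a synthetic image and uses a graph neural network to produce a global graph embedding; (3) has a symbolic planner compute a consistent sequence of visual edits from the ontology; and (4) conditions a pretrained diffusion model on the graph embedding via cross-attention, while the edit sequence is converted into a structured instruction prompt. Result: Across benchmarks, OGD's graph embeddings distinguish real from synthetic images better than baselines, and OGD outperforms state-of-the-art diffusion methods on sim2real image translation. Conclusion: Explicitly modeling the structure of realism enables interpretable, data-efficient, and generalizable zero-shot sim2real transfer. Abstract: Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits -- such as lighting and material properties -- and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translations. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.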

[133] EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Longfei Liu,Yongjie Hou,Yang Li,Qirui Wang,Youyang Sha,Yongjun Yu,Yinzhi Wang,Peizhe Ru,Xuanlong Yu,Xi Shen

Main category: cs.CV

TL;DR: This paper proposes EdgeCrafter, a unified framework of lightweight Vision Transformers for dense prediction on edge devices. Through task-specialized distillation and an edge-friendly encoder-decoder design, it markedly improves the accuracy-efficiency tradeoff of small ViTs on object detection, instance segmentation, and pose estimation.

Details Motivation: Lightweight dense prediction systems for the edge are still dominated by CNNs (such as YOLO), while small ViTs struggle to reach a comparable accuracy-efficiency balance even with large-scale pretraining; the authors attribute this mainly to insufficient task-specific representation learning in small ViTs rather than to ViTs being inherently unsuited to edge dense prediction. Method: Proposes the EdgeCrafter framework, centered on the ECDet detection model, which pairs a compact backbone obtained by knowledge distillation with an encoder-decoder structure optimized for edge devices; the design is extended to instance segmentation (ECInsSeg) and pose estimation (ECPose-X). Result: ECDet-S reaches 51.7 AP on COCO (fewer than 10M parameters, using only COCO annotations); ECInsSeg is comparable to RF-DETR with fewer parameters; ECPose-X reaches 74.8 AP, surpassing YOLO26Pose-X (71.6 AP), which relies on Objects365 pretraining. Conclusion: Compact ViTs, combined with task-specialized distillation and edge-aware architecture design, can be a practical and competitive alternative to CNNs for edge dense prediction. Abstract: Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency tradeoff, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter's reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/

[134] 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

Rundong Su,Jintao Zhang,Zhihang Yuan,Haojie Duanmu,Jianfei Chen,Jun Zhu

Main category: cs.CV

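TL;DR: This paper proposes an inference-time mixed-precision quantization framework (NVFP4/INT8) and a Temporal Delta Cache (TDC) for video diffusion transformers (Video DiTs). It dynamically assigns lower precision (NVFP4) to stable layers and higher precision (INT8) to volatile layers, and skips computation for residual blocks that are invariant over time, achieving 1.92× faster inference and 3.32× memory reduction while preserving generation quality.

Details Motivation: Existing post-training quantization methods use a static bit-width allocation that ignores how the quantization difficulty of activations varies across diffusion timesteps, giving a poor efficiency-quality tradeoff; meanwhile, the high memory and compute cost of Video DiT inference limits practical deployment. Method: A two-part optimization: (1) based on a strong linear correlation between a Transformer block's input-output difference and the quantization sensitivity of its linear layers, a lightweight predictor dynamically assigns NVFP4 (high memory compression) or INT8 (high robustness); (2) exploiting the high temporal consistency of residuals, a Temporal Delta Cache (TDC) skips repeated computation for invariant blocks. Result: Achieves 1.92× end-to-end acceleration and 3.32× memory reduction on Video DiTs without compromising generation quality, setting a new baseline for efficient video diffusion inference. Conclusion: Dynamic mixed-precision quantization and exploitation of temporal redundancy work together to relieve the Video DiT inference bottleneck, showing the value of sensing quantization sensitivity and using temporal structure priors for efficient generative modeling. Abstract: Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block's input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. Besides this, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92$\times$ end-to-end acceleration and 3.32$\times$ memory reduction, setting a new baseline for efficient inference in Video DiTs.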
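The caching idea can be sketched in a few lines: if a block's residual (output minus input) is stable across timesteps, a cached residual can be added to the new input instead of re-running the block. The wrapper below is a toy illustration of that principle on plain Python lists. The class name, the skip criterion (maximum absolute input change below a threshold), and the threshold value are assumptions made for this example; the abstract does not say how the paper decides which blocks to skip.

```python
class DeltaCachedBlock:
    """Wraps a residual block and reuses its cached residual when the input
    has barely changed since the last full computation."""

    def __init__(self, block_fn, threshold):
        self.block_fn = block_fn    # maps an input list to an output list
        self.threshold = threshold  # hypothetical skip threshold
        self.last_input = None
        self.cached_delta = None
        self.skipped = 0            # number of calls served from the cache

    def __call__(self, x):
        if self.last_input is not None:
            change = max(abs(a - b) for a, b in zip(x, self.last_input))
            if change < self.threshold:
                # Input is close to the last fully computed one:
                # reuse the cached residual instead of running the block.
                self.skipped += 1
                return [a + d for a, d in zip(x, self.cached_delta)]
        y = self.block_fn(x)
        self.cached_delta = [b - a for a, b in zip(x, y)]
        self.last_input = list(x)
        return y

# Stand-in "block": an elementwise affine map.
block = DeltaCachedBlock(lambda x: [2 * v + 1 for v in x], threshold=0.1)
first = block([1.0, 2.0])    # full computation, residual cached
second = block([1.01, 2.0])  # small change: cached residual reused
third = block([5.0, 5.0])    # large change: recomputed
```

The second call returns an approximation (the exact output would be 3.02 for the first element), which is the accuracy-for-speed trade such a cache makes.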

[135] WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification

Isabel Rio-Torto,Jaime S. Cardoso,Luís F. Teixeira

Main category: cs.CV

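TL;DR: This paper proposes WeNLEX, a weakly supervised model that generates natural language explanations for multilabel chest X-ray classification. It enforces faithfulness through image reconstruction and plausibility through distribution alignment, and achieves high explanation quality and improved classification performance with only a few annotated explanations.

Details Motivation: Existing methods explicitly supervise explanation generation with large numbers of human-annotated explanations, producing explanations that are plausible but not faithful to the model's actual reasoning; faithfulness and plausibility need to be achieved together with few annotated samples. Method: WeNLEX uses a weakly supervised framework: faithfulness is enforced by requiring images reconstructed from the generated text to match the original images in the black-box model's feature space; plausibility is maintained through distribution alignment with a small database of clinician-annotated explanations; it supports both post-hoc and in-model settings, and the explanation database can be swapped to suit different audiences (such as non-medical users). Result: Produces faithful and plausible explanations as assessed on multiple metrics (faithfulness, simulatability, diversity, plausibility); needs as few as 5 ground-truth explanations per diagnosis; in-model training raises classification AUC by 2.21%; a simplified layman version for general users is also trained. Conclusion: WeNLEX generates faithful and plausible natural language explanations with very little human annotation, showing that building in interpretability can not only improve transparency but also benefit the downstream task, and that explanations can be adapted across audiences. Abstract: Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model's reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model's feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as little as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance. Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.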

[136] DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

Haochen Li,Rui Zhang,Hantao Yao,Xin Zhang,Yifan Hao,Shaohui Peng,Yongwei Zhao,Ling Li

Main category: cs.CV

TL;DR: This paper proposes DA-Mamba, a hybrid architecture combining CNNs with state space models (SSMs) for domain adaptive object detection (DAOD). Its IA-SSM and OA-SSM modules strengthen global-local domain-invariant feature alignment at the image level and instance level respectively, balancing efficiency with long-range modeling capability.

Details Motivation: Existing DAOD methods are limited by the local connectivity of CNNs and struggle to extract global domain-invariant features; Transformer-based methods model global dependencies but have quadratic computational cost, which hinders practical deployment. Method: Proposes the DA-Mamba architecture, combining the efficiency of CNNs with the linear-time long-range modeling of SSMs; an Image-Aware SSM (IA-SSM) is integrated into the backbone for image-level global-local alignment, and an Object-Aware SSM (OA-SSM) is inserted into the detection head to model spatial and semantic dependencies among objects and improve instance-level alignment. Result: Experiments on cross-domain detection show that DA-Mamba is effective and efficient, improving the detector's cross-domain performance while avoiding the high computational cost of Transformers. Conclusion: DA-Mamba offers an efficient new paradigm for DAOD with strong global modeling capability, and demonstrates the potential of SSMs for domain adaptive vision tasks. Abstract: Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Model (SSM) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of SSMs to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.

[137] ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation

Ying Zheng,Yiyi Zhang,Yi Wang,Lap-Pui Chau

Main category: cs.CV

TL;DR: This paper proposes ProCal, which dynamically calibrates neighborhood prediction probabilities through a dual-model collaborative prediction mechanism, addressing the forgetting of source knowledge and overfitting to local noise in source-free domain adaptation.

Details Motivation: Existing source-free domain adaptation methods based on neighborhood structure over-rely on prediction similarity among neighbors, which leads to rapid forgetting of source knowledge and susceptibility to local noise. Method: Proposes the ProCal probability calibration method, which combines the source model's initial predictions with the current model's online outputs to dynamically calibrate neighborhood predictions, and designs a joint optimization objective combining a soft supervision loss with a diversity loss. Result: Effectiveness is validated on 31 cross-domain tasks across four public datasets; theoretical analysis shows that ProCal converges to an equilibrium in which source knowledge and target information are effectively fused. Conclusion: ProCal mitigates both knowledge forgetting and overfitting, improving source-free domain adaptation performance. Abstract: Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model's initial predictions with the current model's online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: https://github.com/zhengyinghit/ProCal.

[138] SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Vsevolod Skorokhodov,Chenghao Xu,Shuo Sun,Olga Fink,Malcolm Mielle

Main category: cs.CV

TL;DR: This paper proposes SEAR, a simple and efficient fine-tuning strategy that adapts a pretrained visual geometry transformer to multimodal RGB-thermal input. It significantly outperforms existing methods on 3D reconstruction and camera pose estimation and remains robust in challenging scenes such as low light and dense smoke.

Details Motivation: Visual geometry models pretrained on RGB data align modalities poorly when processing mixed RGB-T input, which degrades performance. Method: Proposes SEAR, a lightweight fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs, and builds a new RGB-T dataset covering different capture times, viewpoints, and illumination conditions. Result: Outperforms the state of the art on 3D reconstruction and camera pose estimation, with AUC@30 improving by over 29%; achieves higher detail and cross-modal consistency with negligible inference overhead; remains robust under low light and dense smoke. Conclusion: SEAR shows that targeted fine-tuning can efficiently transfer single-modality visual geometry priors to multimodal settings, offering a practical, scalable approach to RGB-T 3D understanding. Abstract: Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving significant improvements across all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.

[139] Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

Jiatong Xia,Zicheng Duan,Anton van den Hengel,Lingqiao Liu

Main category: cs.CV

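TL;DR: This paper proposes Points-to-3D, a framework that uses point cloud priors (for example from LiDAR or generated by VGGT) to improve the geometric controllability and structural completeness of diffusion models, significantly improving the quality and geometric fidelity of generated 3D assets and scenes.

Details Motivation: Most existing 3D generation methods are conditioned on images or text, while readily available 3D point cloud priors (such as LiDAR or VGGT outputs) are underused; explicit geometric constraints need to be better integrated to improve accuracy and structural controllability. Method: Built on the latent 3D diffusion model TRELLIS, it initializes the sparse structure latent from point cloud priors and introduces a structure inpainting network with a staged sampling strategy (global structural inpainting first, then boundary refinement), preserving the visible regions of the input while completing the overall geometry. Result: On object and scene generation, it outperforms state-of-the-art methods in both rendering quality and geometric fidelity; it accepts either accurate point clouds (such as real LiDAR scans) or point clouds estimated by VGGT from a single image. Conclusion: Explicitly embedding point cloud priors improves the accuracy and structural controllability of 3D generation, offering a new paradigm for controllable 3D content generation from geometric priors. Abstract: Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on a latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.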

[140] Rethinking Uncertainty Quantification and Entanglement in Image Segmentation

Jakob Lønborg Christensen,Vedrana Andersen Dahl,Morten Rieger Hannemose,Anders Bjorholm Dahl,Christian F. Baumgartner

Main category: cs.CV

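TL;DR: This paper presents a comprehensive empirical study of uncertainty quantification (UQ) in medical image segmentation, focusing on how methods for aleatoric (data) uncertainty (AU) and epistemic (model) uncertainty (EU) combine and how the two become entangled, and proposes a new metric for the degree of entanglement. Ensembles perform best and are least entangled for out-of-distribution detection; the best AU-EU combination for other tasks is dataset-dependent; and a softmax ensemble performs strongly on all tasks.

Details Motivation: Existing methods can model AU and EU separately, but how they interact when combined is unclear, and recent work has found substantial entanglement between AU and EU that undermines the interpretability and usefulness of the decomposition. Method: Conducts a large-scale empirical study covering a broad range of AU-EU model combinations, proposes a new metric to quantify AU-EU entanglement, and systematically evaluates the combinations on downstream UQ tasks: OOD detection, ambiguity modeling, and calibration. Result: Ensembles show the lowest entanglement and best performance for OOD detection; for ambiguity modeling and calibration the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled; a softmax ensemble performs well on all tasks. Conclusion: AU-EU entanglement is a key obstacle to practical UQ; ensembles (especially softmax ensembles) are currently a strong general-purpose choice; the sources of entanglement need further study, along with strategies for disentangling the two. Abstract: Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration, the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.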
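For readers unfamiliar with the AU/EU split, a widely used entropy-based decomposition for an ensemble treats the entropy of the averaged prediction as total uncertainty, the average per-member entropy as the aleatoric part, and their difference (a mutual information) as the epistemic part. The sketch below applies it to a single pixel's class probabilities. This particular decomposition is assumed here for illustration; the abstract does not state which formulation the study uses, and the entanglement metric it proposes is not reproduced.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose_uncertainty(member_probs):
    """Split ensemble uncertainty for one pixel into (total, aleatoric, epistemic).

    member_probs: list of per-member class-probability lists.
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(p[c] for p in member_probs) / n for c in range(k)]
    total = entropy(mean_p)                               # entropy of the mean
    aleatoric = sum(entropy(p) for p in member_probs) / n  # mean of entropies
    epistemic = total - aleatoric                          # mutual information
    return total, aleatoric, epistemic

# Members agree that the pixel is ambiguous: uncertainty is purely aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Members are individually confident but disagree: purely epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Both cases have the same total uncertainty (ln 2), which is why the split matters: it is the only thing that tells ambiguity in the data apart from disagreement between models.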

[141] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li,Amanmeet Garg,Shalini Chaudhuri,Rui Zhao,Garin Kessler

Main category: cs.CV

TL;DR: This paper proposes Perceptio, a perception-enhanced large vision-language model with 2D and 3D spatial reasoning abilities. By explicitly introducing semantic segmentation tokens and depth tokens into the autoregressive sequence, it significantly improves the spatial grounding of LVLMs.

Details Motivation: Large vision-language models (LVLMs) excel at semantic understanding but perform poorly at fine-grained spatial grounding, because they lack an explicit spatial representation and must infer complex geometry implicitly. Method: (1) distills a VQ-VAE depth codebook from a monocular teacher model to encode dense depth maps as compact token sequences; (2) embeds SAM2 semantic segmentation tokens and VQ-VAE depth tokens in the LLM's autoregressive process so the model emits spatial tokens before answering; (3) designs composite depth-token objectives (marker, token, and count losses) and a differentiable soft-merging reconstruction technique to stabilize training; (4) builds on InternVL with co-training across multi-task datasets. Result: Perceptio reaches state-of-the-art results on several benchmarks: cIoU improves by +0.8/+1.4/+1.1 on RefCOCO/+/g; HardBLINK spatial understanding accuracy improves by 10.3%; MMBench accuracy improves by 1.0%. Conclusion: Explicitly introducing spatial tokens and building a spatial chain of thought substantially strengthens the spatial grounding of LVLMs, underlining the value of perception enhancement for multimodal large models. Abstract: Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

[142] VesselTok: Tokenizing Vessel-like 3D Biomedical Graph Representations for Reconstruction and Generation

Chinmay Prabhakar,Bastian Wittmann,Tamaz Amiranashvili,Paul Büschl,Ezequiel de la Rosa,Julian McGinnis,Benedikt Wiestler,Bjoern Menze,Suprosanna Shit

Main category: cs.CV

TL;DR: This paper proposes VesselTok, a framework that learns latent representations (tokens) of spatially dense graphs from a parametric shape perspective, encoding tubular anatomical structures with centerline points and pseudo radii to address the computational challenges posed by high-resolution anatomical graphs.

Details Motivation: The high spatial resolution of large anatomical networks (such as blood vessels, airways, and neuronal networks) sharply increases computational complexity, so efficient modeling methods are needed. Method: VesselTok encodes tubular geometry with centerline points and their pseudo radii, and learns a novel latent representation conditioned on the centerline to model neural implicit representations of tubular structures. Result: VesselTok is validated on several anatomies, including lung airways, lung vessels, and brain vessels; its learned latent representations generalize to unseen anatomies, support generation of plausible anatomical graphs, and transfer effectively to inverse problems such as link prediction. Conclusion: VesselTok offers a robust, generalizable, and transferable lightweight approach to modeling complex anatomical spatial graphs. Abstract: Spatial graphs provide a lightweight and elegant representation of curvilinear anatomical structures such as blood vessels, lung airways, and neuronal networks. Accurately modeling these graphs is crucial in clinical and (bio-)medical research. However, the high spatial resolution of large networks drastically increases their complexity, resulting in significant computational challenges. In this work, we aim to tackle these challenges by proposing VesselTok, a framework that approaches spatially dense graphs from a parametric shape perspective to learn latent representations (tokens). VesselTok leverages centerline points with a pseudo radius to effectively encode tubular geometry. Specifically, we learn a novel latent representation conditioned on centerline points to encode neural implicit representations of vessel-like, tubular structures. We demonstrate VesselTok's performance across diverse anatomies, including lung airways, lung vessels, and brain vessels, highlighting its ability to robustly encode complex topologies. To prove the effectiveness of VesselTok's learnt latent representations, we show that they (i) generalize to unseen anatomies, (ii) support generative modeling of plausible anatomical graphs, and (iii) transfer effectively to downstream inverse problems, such as link prediction.

[143] Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging

Hesong Li,Ziqi Wu,Ruiwen Shao,Ying Fu

Main category: cs.CV

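TL;DR: This paper proposes a statistical characteristic-guided denoising network (SCGN) for HRTEM images. It denoises jointly in the spatial and frequency domains through spatial deviation-guided weighting and frequency band-guided weighting, and combines this with HRTEM-specific noise calibration and a synthetic dataset, improving atom localization accuracy and downstream task performance.

Details Motivation: When HRTEM is used to observe nucleation dynamics in materials, millisecond-scale changes force short exposures, which produce severe noise that obscures atomic positions. Method: Proposes the statistical characteristic-guided denoising network (SCGN): (1) in the spatial domain, spatial deviation-guided weighting selects convolution operations adaptively for each position according to local deviation characteristics; (2) in the frequency domain, frequency band-guided weighting enhances signal and suppresses noise according to band characteristics; (3) an HRTEM-specific noise calibration method and a synthetic dataset with disordered structures and realistic noise are developed. Result: Outperforms state-of-the-art methods on both synthetic and real HRTEM images, and is shown to be effective on the downstream localization task. Conclusion: By combining statistical priors from the spatial and frequency domains, SCGN delivers robust, high-performance HRTEM denoising for practical nucleation observation, providing a reliable image basis for studying atomic-scale dynamics. Abstract: High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which boosts the studies of advanced solid materials. Nonetheless, due to the millisecond-scale rapid change of nucleation, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristics. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noise, which helps ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task. Code will be available at https://github.com/HeasonLee/SCGN.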

[144] Towards Interpretable Foundation Models for Retinal Fundus Images

Samuel Ofosu Mensah,Maria Camila Roa Carvajal,Kerol Djoumessi,Philipp Berens

Main category: cs.CV

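TL;DR: This paper proposes Dual-IFM, a vision foundation model that is interpretable by design: class evidence maps provide local interpretability and a 2D projection layer provides global interpretability. On retinal imaging tasks it performs on par with state-of-the-art models that have far more parameters.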

Details Motivation: Existing foundation models are limited in high-stakes domains such as medical imaging because their architectures lack interpretability, so new models that combine high performance with interpretability are needed. Method: Proposes Dual-IFM, which combines local interpretability (class evidence maps) with global interpretability (a 2D projection layer) and is pretrained with self-supervision on more than 800,000 color fundus photographs. Result: On downstream tasks, Dual-IFM matches state-of-the-art foundation models with up to 16 times as many parameters and provides interpretable predictions on out-of-distribution data. Conclusion: Large-scale self-supervised pretraining and inherently interpretable design can work together to build robust, trustworthy representation models for retinal imaging. Abstract: Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: First, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model's representation space. We trained our model on over 800,000 color fundus photographs from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to $16\times$ the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.

[145] HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Xiangyu Bai,Bishoy Galoaa,Sarah Ostadabbas

Main category: cs.CV

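TL;DR: HORNet is a lightweight frame selection policy trained with GRPO. It substantially reduces the number of input frames and the compute cost in video VQA while improving answer quality, with especially strong results on temporal reasoning tasks.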

Details Motivation: Existing video VQA systems mostly use uniform or heuristic frame sampling, which cannot optimize frame selection for downstream answer quality and therefore limits both efficiency and performance. Method: Proposes HORNet, which trains a lightweight (<1M parameters) frame selection policy with Group Relative Policy Optimization (GRPO), formalizes frame selection as the Select Any Frames (SAF) task, decouples visual input curation from VLM reasoning, and supports transfer across VLMs. Result: Across multiple benchmarks, it reduces the number of frames by up to 99% and VLM processing time by up to 93%; F1 on MSVD-QA improves by 1.7%, and temporal reasoning on NExT-QA improves by 7.3 points; cross-VLM transfer brings an additional 8.5% relative gain. Conclusion: Optimizing what a VLM sees (that is, frame selection) is an effective and complementary way to improve the efficiency and performance of video VQA, and HORNet demonstrates the practicality and generality of this idea. Abstract: Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
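GRPO scores each sampled output relative to the other samples drawn for the same input. A minimal sketch of that group-relative normalization follows, assuming each candidate frame selection for a question has already received a scalar reward. The function name, the reward values, and the use of the population standard deviation are illustrative assumptions rather than details taken from the HORNet code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same question.

    Each advantage is (reward - group mean) / (group std + eps), so a
    selection is rewarded only for beating its siblings in the group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four frame selections sampled for one question
# (for example, 1.0 if the frozen VLM answered correctly, else 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Selections that led to a correct answer receive a positive advantage and the others a negative one, and the advantages in a group sum to roughly zero.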

[146] Motion-o: Trajectory-Grounded Video Reasoning

Bishoy Galoaa,Shayda Moezzi,Xiangyu Bai,Sarah Ostadabbas

Main category: cs.CV

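TL;DR: This paper formalizes Spatial-Temporal-Trajectory (STT) reasoning as a new capability. Using the Motion-o model and Motion Chain of Thought (MCoT), it explicitly models and verifies object motion trajectories in video, improving spatial-temporal grounding and trajectory prediction.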

Details Motivation: Existing video reasoning research neglects modeling how objects move, so it lacks explicit, verifiable trajectory understanding. Method: Proposes Motion-o, a motion-centric extension to visual language models; builds a dataset augmented with trajectory annotations; designs Motion Chain of Thought (MCoT), which uses tags to give a structured representation of changes in direction, speed, and scale; and designs a reward function grounded in visual evidence for training, with no changes to the model architecture. Result: Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks; MCoT makes trajectory reasoning interpretable and verifiable. Conclusion: Explicit motion trajectory reasoning is a key new dimension of evidence-based video understanding, and STT reasoning should become one of its foundational capabilities. Abstract: Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about how objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories through discrete \texttt{} tag summarizing per-object direction, speed, and scale (of velocity) change to explicitly connect grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. 
Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.

[147] PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Tianci Luo,Jinpeng Wang,Shiyu Qin,Niu Lian,Yan Feng,Bin Chen,Chun Yuan,Shu-Tao Xia

Main category: cs.CV

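TL;DR: This paper proposes PromptHub, a framework that improves multi-prompt fusion in visual in-context learning through locality-aware fusion, concentration, and alignment, significantly improving performance and robustness across several vision tasks and out-of-distribution settings.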

Details Motivation: Existing visual in-context learning methods based on patch-wise fusion and model-agnostic supervision struggle to exploit informative cues, which limits performance gains. Method: Proposes PromptHub, which combines locality-aware fusion (using spatial priors), jointly trained complementary concentration, alignment, and prediction objectives, and data augmentation to reinforce supervision. Result: Significantly outperforms existing methods on three fundamental vision tasks, and its universality, transferability, and robustness are validated in out-of-distribution settings and various retrieval scenarios. Conclusion: PromptHub establishes a reliable locality-aware paradigm for prompt fusion that moves beyond earlier patch-wise fusion and offers a new direction for visual in-context learning. Abstract: Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.

[148] MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Youngwan Lee,Soojin Jang,Yoorhim Cho,Seunghwan Lee,Yong-Ju Lee,Sung Ju Hwang

Main category: cs.CV

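TL;DR: This paper presents MultihopSpatial, a benchmark for evaluating and improving multi-hop spatial reasoning and precise visual grounding in vision-language models. It also introduces the Acc@50IoU metric and the MultihopSpatial-Train corpus, and shows that reinforcement learning fine-tuning improves spatial reasoning and embodied manipulation performance.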

Details Motivation: Existing spatial reasoning benchmarks are limited to simple single-hop relations and overlook the multi-hop compositional reasoning and precise visual grounding that real-world scenarios require. Method: Builds the MultihopSpatial benchmark for multi-hop spatial reasoning (with complex 1- to 3-hop queries), proposes the Acc@50IoU metric to evaluate reasoning and grounding jointly, releases the large-scale training corpus MultihopSpatial-Train, and applies reinforcement learning post-training. Result: An evaluation of 37 state-of-the-art VLMs yields eight key findings and shows that compositional spatial reasoning remains a major challenge; reinforcement learning fine-tuning significantly improves both intrinsic spatial reasoning and downstream embodied manipulation performance. Conclusion: Multi-hop spatial reasoning is a key bottleneck for deploying VLA agents and needs to be addressed with dedicated benchmarks, new evaluation metrics, and targeted training strategies together. Abstract: Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
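The abstract describes Acc@50IoU as requiring both a correct answer and a precise bounding box. A minimal sketch of such a joint metric follows, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form and a sample counted as correct only when the answer matches and the box IoU is at least 0.5. The field names and the inclusive threshold are assumptions for illustration, not the benchmark's official implementation.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def acc_at_50_iou(samples, threshold=0.5):
    """Fraction of samples with a correct answer AND box IoU >= threshold."""
    hits = sum(
        1
        for s in samples
        if s["pred_answer"] == s["gt_answer"]
        and box_iou(s["pred_box"], s["gt_box"]) >= threshold
    )
    return hits / len(samples)

# Hypothetical predictions: only the first sample satisfies both conditions.
samples = [
    {"pred_answer": "A", "gt_answer": "A", "pred_box": (0, 0, 10, 10), "gt_box": (0, 0, 10, 10)},
    {"pred_answer": "B", "gt_answer": "B", "pred_box": (0, 0, 10, 10), "gt_box": (5, 5, 15, 15)},
    {"pred_answer": "C", "gt_answer": "D", "pred_box": (0, 0, 10, 10), "gt_box": (0, 0, 10, 10)},
]
score = acc_at_50_iou(samples)
```

The second sample shows why the metric is stricter than plain accuracy: the answer is right, but the predicted box overlaps the ground truth too little to count.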

[149] Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness

Yitong Li,Igor Yakushev,Dennis M. Hedderich,Christian Wachinger

Main category: cs.CV

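TL;DR: This paper proposes PASTA, a framework built on conditional diffusion models that generates pathology-aware synthetic PET images, significantly improving the quality and diagnostic value of MRI-to-PET cross-modality translation.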

Details Motivation: PET is sensitive for diagnosing neurodegenerative diseases but is limited by high cost and radiation exposure; MRI is safe but less sensitive. Existing MRI-to-PET generation methods emphasize structural fidelity and neglect the modeling of pathological features. Method: Proposes PASTA, an interactive dual-arm architecture built on conditional diffusion models that integrates multi-modal conditions and introduces cycle exchange consistency and a volumetric generation strategy to synthesize 3D PET images. Result: The synthesized PET images perform well in both qualitative and quantitative evaluations; for Alzheimer's diagnosis they improve on the original MRI by 4% and come close to real PET. Conclusion: By strengthening pathology awareness, PASTA narrows the diagnostic performance gap between MRI and PET for neurodegenerative diseases and offers a new approach to low-cost, radiation-free diagnostic support. Abstract: Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA's ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer's diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at https://github.com/ai-med/PASTA.

[150] GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

Ahmed Tawfik Aboukhadra,Marcel Rogge,Nadia Robertini,Abdalla Arafa,Jameel Malik,Ahmed Elhayek,Didier Stricker

Main category: cs.CV

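TL;DR: This paper proposes GHOST, a framework that uses 2D Gaussian Splatting for fast, category-agnostic 3D reconstruction of hand-object interactions from monocular RGB video, producing physically consistent and animatable results.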

Details Motivation: Existing methods rely on category-specific templates or heavy computation and still struggle to produce physically consistent hand-object alignment in 3D. Method: Proposes GHOST, a framework based on 2D Gaussian Splatting with three innovations: a geometric-prior retrieval and consistency loss, grasp-aware alignment, and a hand-aware background loss. Result: Achieves state-of-the-art 3D reconstruction and 2D rendering accuracy on ARCTIC, HO3D, and in-the-wild datasets, and runs an order of magnitude faster than prior category-agnostic methods. Conclusion: GHOST is an efficient and robust method for modeling hand-object interactions that supports complete, physically consistent, and animatable reconstructions. Abstract: Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at https://github.com/ATAboukhadra/GHOST.

[151] Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching

Feifan Luo,Hongyang Chen

Main category: cs.CV

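TL;DR: This paper proposes a new 3D shape matching method based on unsupervised contrastive learning. By improving feature representations in the embedding space and simplifying the functional map learning architecture, it significantly improves both matching accuracy and computational efficiency.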

Details Motivation: Existing deep functional map methods focus on optimizing pointwise or functional maps and neglect improving feature representations in the embedding space; they also depend on computationally expensive traditional functional map solvers. The result is poor feature quality, suboptimal matching performance, and high computational cost. Method: Proposes an unsupervised contrastive learning framework that strengthens feature consistency and discriminability; designs a simplified functional map learning architecture that removes expensive functional map solvers and multiple auxiliary losses; and combines the two in a unified two-branch pipeline. Result: Achieves state-of-the-art accuracy and efficiency on challenging benchmarks covering near-isometric, non-isometric, and topologically inconsistent cases, even surpassing supervised methods. Conclusion: Without any labeled data, the method delivers efficient, robust, and accurate non-rigid 3D shape matching and offers a new paradigm for the field. Abstract: Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. 
Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.

[152] VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Jiayi Yuan,Haobo Jiang,De Wen Soh,Na Zhao

Main category: cs.CV

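TL;DR: VGGT-360 is a training-free framework for zero-shot panoramic depth estimation. It reformulates the task as panoramic reprojection over 3D models reconstructed from multiple views by VGGT-like foundation models, which yields geometry-consistent estimates. It comprises three plug-and-play modules: uncertainty-guided adaptive projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction. Experiments show that it outperforms existing methods across multiple resolutions and on indoor and outdoor datasets.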

Details Motivation: Existing training-free panoramic depth estimation methods lack geometric consistency and cross-view coherence; the work exploits the intrinsic 3D consistency of VGGT-like foundation models to reach a unified panoramic understanding. Method: Proposes VGGT-360 with three modules: (i) uncertainty-guided adaptive projection, which slices the panorama into perspective views and allocates more views according to gradient-based uncertainty; (ii) structure-saliency enhanced attention, which injects structure-aware confidence into VGGT's attention layers; (iii) correlation-weighted 3D model correction, which reweights overlapping points with attention-inferred correlation scores to refine the 3D model. Result: Across multiple resolutions and diverse indoor and outdoor datasets, VGGT-360 surpasses current state-of-the-art trained and training-free methods in accuracy and robustness. Conclusion: VGGT-360 shows that high-quality, geometry-consistent panoramic depth estimation is feasible without training and offers a new paradigm for using foundation model priors in 3D vision tasks. Abstract: This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

[153] CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

Zening Sun,Zhengpeng Xie,Lichen Bai,Shitong Shao,Shuo Yang,Zeke Xie

Main category: cs.CV

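TL;DR: This paper proposes CRAFT, which builds a high-quality dataset through Composite Reward Filtering (CRF) and improves SFT. With very few samples (100), it outperforms existing preference optimization methods and converges 11 to 220 times faster.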

Details Motivation: Existing alignment methods for diffusion models (such as SFT and DPO) depend on high-quality images or on large-scale, inconsistent preference data, and are computationally inefficient. Method: Proposes Composite Reward Assisted Fine-Tuning (CRAFT) in two steps: 1) Composite Reward Filtering (CRF) selects high-quality, consistent training data; 2) an enhanced variant of SFT is run on that data. The paper also proves theoretically that CRAFT optimizes a lower bound of group-based reinforcement learning. Result: With only 100 samples, CRAFT outperforms state-of-the-art methods that need thousands of preference pairs and converges 11 to 220 times faster. Conclusion: CRAFT is a lightweight, efficient alignment paradigm for diffusion models with low data requirements and theoretical grounding. Abstract: Aligning diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220$\times$ faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
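The abstract does not spell out how Composite Reward Filtering combines or thresholds its rewards, so the following is only a rough sketch of the general idea of filtering candidates by a composite score. It assumes each generated sample carries scores from several reward models and that the composite is a weighted mean; `composite_reward_filter`, the weights, and the sample data are all hypothetical.

```python
def composite_reward_filter(candidates, k, weights=None):
    """Keep the k candidates with the highest weighted-mean reward.

    Each candidate is a dict with an "id" and a list of "rewards", one per
    reward model. The weighted mean is an assumption of this sketch.
    """
    def composite(scores):
        w = weights or [1.0] * len(scores)
        return sum(wi * si for wi, si in zip(w, scores)) / sum(w)

    ranked = sorted(candidates, key=lambda c: composite(c["rewards"]), reverse=True)
    return ranked[:k]

# Hypothetical generations scored by two reward models.
candidates = [
    {"id": "img_a", "rewards": [0.9, 0.8]},
    {"id": "img_b", "rewards": [0.4, 0.9]},
    {"id": "img_c", "rewards": [0.7, 0.7]},
]
selected = [c["id"] for c in composite_reward_filter(candidates, k=2)]
```

The retained samples would then serve as the small supervised fine-tuning set; `img_b` is dropped here because one strong reward does not compensate for a weak one under an equal-weight mean.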

[154] Unleashing the Power of Simplicity: A Minimalist Strategy for State-of-the-Art Fingerprint Enhancement

Raffaele Cappelli

Main category: cs.CV

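TL;DR: This paper proposes a minimalist, efficient approach to fingerprint enhancement with two new methods, one based on contextual filtering and one learning-based. They outperform existing complex methods on low-quality fingerprints, and the implementation is open source to support reproducibility and follow-up research.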

Details Motivation: Existing fingerprint enhancement methods perform poorly on low-quality fingerprints and are computationally expensive, so simpler and more effective methods are needed. Method: Proposes two new methods, a contextual filtering method and a learning-based method, with an emphasis on simplicity and practicality. Result: Validated on a challenging latent fingerprint database, the new methods produce clearer, more accurate, and less noisy enhanced images and outperform current state-of-the-art methods. Conclusion: A minimalist design can achieve high-quality fingerprint enhancement, and future research should weigh algorithmic complexity against practical benefit. Abstract: Fingerprint recognition systems, which rely on the unique characteristics of human fingerprints, are essential in modern security and verification applications. Accurate minutiae extraction, a critical step in these systems, depends on the quality of fingerprint images. Despite recent improvements in fingerprint enhancement techniques, state-of-the-art methods often struggle with low-quality fingerprints and can be computationally demanding. This paper presents a minimalist approach to fingerprint enhancement, prioritizing simplicity and effectiveness. Two novel methods are introduced: a contextual filtering method and a learning-based method. These techniques consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. The effectiveness of these methods is validated using a challenging latent fingerprint database. The open-source implementation of these techniques not only fosters reproducibility but also encourages further advancements in the field. The findings underscore the importance of simplicity in achieving high-quality fingerprint enhancement and suggest that future research should balance complexity and practical benefits.

[155] Generalized Hand-Object Pose Estimation with Occlusion Awareness

Hui Yang,Wei Sun,Jian Liu,Jian Xiao,Tao Xie,Hossein Rahmani,Ajmal Saeed Mian,Nicu Sebe,Gim Hee Lee

Main category: cs.CV

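TL;DR: This paper proposes GenHOI, a framework that combines hierarchical semantic prompts with hand priors to improve generalization to unseen objects and novel interactions under heavy occlusion, enabling generalized 3D hand-object pose estimation from a single RGB image.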

Details Motivation: Generalized 3D hand-object pose estimation from a single RGB image generalizes poorly when object appearance and interaction patterns vary widely and occlusion is heavy. Method: Proposes GenHOI: 1) a hierarchical semantic prompt that describes object states, hand configurations, and interaction patterns in text; 2) a multi-modal masked modeling strategy over RGB images, predicted point clouds, and text to strengthen occlusion reasoning; 3) hand priors used as stable spatial references to extract implicit interaction constraints. Result: Achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks. Conclusion: By combining semantic knowledge with geometric priors, GenHOI significantly improves robustness to occlusion and generalization across objects and interactions, moving generalized hand-object pose estimation closer to practical use. Abstract: Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.

[156] Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Anqi Zhang,Xiaokang Ji,Guangyu Gao,Jianbo Jiao,Chi Harold Liu,Yunchao Wei

Main category: cs.CV

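TL;DR: This paper proposes SELF1E, which keeps image features at their original resolution, adds residual feature compensation and pixel-unshuffle operations, and designs an attention mask with dual perception pathways. This lets a multi-modal large language model (MLLM) perform segmentation by itself without a specialist mask decoder, reaching performance comparable to decoder-based methods with a single segmentation embedding.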

Details Motivation: Existing MLLM-based segmentation methods rely heavily on external specialist mask decoders or on multiple additional tokens and leave the MLLM's own segmentation ability unexplored; this work asks whether the MLLM alone, with a simple embedding, can segment well and thus remove the dependence on external decoders. Method: 1) Retain the original high-resolution image features and refill them with residuals extracted from the MLLM's compressed features to improve precision; 2) apply pixel-unshuffle operations to image features with and without LLM processing, respectively, to release the details of compressed features and amplify the residuals; 3) design a dual-pathway attention mask (image-to-image and image-to-segmentation) to strengthen feature interaction between pixels and the segmentation token. Result: Across multiple segmentation tasks, SELF1E performs on par with mainstream methods that use specialist mask decoders, which supports the feasibility and effectiveness of decoder-free segmentation inside the MLLM. Conclusion: An MLLM has enough capacity to perform high-quality segmentation by itself; the key is to preserve and enhance high-resolution visual representations and to build effective cross-modal attention. Competitive performance is possible with a single segmentation embedding (SELF1E), without external decoders or many additional tokens. Abstract: Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. 
Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.
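The abstract attributes the resolution loss to pixel-shuffled features and recovers detail with pixel-unshuffle operations. Both are lossless rearrangements between spatial positions and channels. The sketch below shows that rearrangement on nested lists, assuming a block size `r`. Because libraries and MLLM codebases name the two directions differently, it uses the neutral, illustrative names `space_to_depth` and `depth_to_space`; neither name nor the channel ordering is taken from the paper.

```python
def space_to_depth(x, r):
    """Pack each r x r spatial block into channels: C x H x W -> (C*r*r) x H/r x W/r."""
    height, width = len(x[0]), len(x[0][0])
    assert height % r == 0 and width % r == 0
    out = []
    for channel in x:
        for i in range(r):
            for j in range(r):
                out.append([[channel[h * r + i][w * r + j] for w in range(width // r)]
                            for h in range(height // r)])
    return out

def depth_to_space(x, r):
    """Inverse rearrangement: (C*r*r) x H x W -> C x (H*r) x (W*r)."""
    channels = len(x) // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [[[None] * (width * r) for _ in range(height * r)] for _ in range(channels)]
    for c in range(channels):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out

# A single-channel 4 x 4 feature map with distinct values.
feature = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
packed = space_to_depth(feature, 2)    # 4 channels of size 2 x 2
restored = depth_to_space(packed, 2)   # back to 1 channel of size 4 x 4
```

Packing reduces the number of spatial tokens by a factor of r squared without discarding values, which is why the inverse step can restore the original grid exactly.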

[157] SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Quentin Guimard,Federico Bartsch,Simone Caldarella,Rahaf Aljundi,Elisa Ricci,Massimiliano Mancini

Main category: cs.CV

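TL;DR: This paper proposes Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in the latent space of a sparse autoencoder (SAE). It targets the social and spurious biases that vision-language models such as CLIP inherit from large-scale, uncurated training data. By decomposing text embeddings into sparse features and modulating the bias-relevant neurons, SEM significantly improves fairness in retrieval and zero-shot classification.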

Details Motivation: Existing post-hoc debiasing methods operate directly in CLIP's dense embedding space, where bias and task-relevant information are highly entangled, so debiasing tends to damage semantic fidelity. Method: Proposes SEM, which builds a disentangled representation of CLIP text embeddings in the latent space of a sparse autoencoder (SAE), identifies and modulates bias-relevant neurons, and preserves query-relevant neurons, enabling more precise, non-linear interventions. Result: Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Conclusion: Sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models. Abstract: Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
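The encode, modulate, decode pattern described above can be sketched in a few lines. This is a toy illustration under strong assumptions, not SEM itself: the sparse autoencoder is a single ReLU layer with hand-picked weights, the set of bias-relevant latents is given rather than identified, and modulation is a plain rescaling. All names and values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def modulate_embedding(x, w_enc, w_dec, bias_latents, scale=0.0):
    """Encode x into latents, rescale the bias-relevant ones, and decode.

    w_enc and w_dec stand in for a trained sparse autoencoder; bias_latents
    is a set of latent indices assumed to carry bias information.
    """
    latents = [max(0.0, z) for z in matvec(w_enc, x)]  # ReLU encoder
    latents = [scale * z if k in bias_latents else z for k, z in enumerate(latents)]
    return matvec(w_dec, latents)

# Toy 2-D embedding with identity encoder/decoder; latent 1 is treated as bias-relevant.
identity = [[1.0, 0.0], [0.0, 1.0]]
debiased = modulate_embedding([0.6, 0.8], identity, identity, bias_latents={1})
```

With `scale=0.0` the bias-relevant latent is removed entirely while the other latent passes through unchanged; intermediate scales would attenuate it instead.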

[158] FUMO: Prior-Modulated Diffusion for Single Image Reflection Removal

Telang Xu,Chaoyang Zhang,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

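TL;DR: This paper proposes FUMO, a diffusion model framework that introduces an intensity prior and a high-frequency prior to improve spatial controllability and structural faithfulness in single image reflection removal. Trained coarse-to-fine, it achieves strong results on standard datasets and on real-world images.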

Details Motivation: In real scenes, single image reflection removal is hard because reflection strength varies widely across the image and reflections are tightly entangled with transmission structures. Method: Proposes FUMO, which extracts an intensity prior (estimating reflection severity) and a high-frequency prior (capturing detail responses via multi-scale residual aggregation) from the mixed image, and trains coarse-to-fine: the first stage gates conditional residual injection with the priors, and the second stage uses a refinement network to correct local misalignment and sharpen details. Result: Obtains competitive quantitative results and consistently improved perceptual quality on standard benchmarks and on challenging in-the-wild images. Conclusion: Through explicit prior modulation, FUMO significantly improves the spatial controllability and structural faithfulness of reflection removal and supports the effectiveness of prior-guided diffusion models. Abstract: Single image reflection removal (SIRR) is challenging in real scenes, where reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures. This paper presents a diffusion model with prior modulation framework (FUMO) that introduces explicit guidance signals to improve spatial controllability and structural faithfulness. Two priors are extracted directly from the mixed image: an intensity prior that estimates spatial reflection severity and a high-frequency prior that captures detail-sensitive responses via multi-scale residual aggregation. We propose a coarse-to-fine training paradigm. In the first stage, these cues are combined to gate the conditional residual injections, focusing the conditioning on regions that are both reflection-dominant and structure-sensitive. In the second stage, a fine-grained refinement network corrects local misalignment and sharpens fine details in the image space. Experiments conducted on both standard benchmarks and challenging images in the wild demonstrate competitive quantitative results and consistently improved perceptual quality. The code is released at https://github.com/Lucious-Desmon/FUMO.

[159] TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Yan Shu,Bin Ren,Zhitong Xiong,Xiao Xiang Zhu,Begüm Demir,Nicu Sebe,Paolo Rota

Main category: cs.CV

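TL;DR: This paper proposes TerraScope, a unified vision-language model designed for earth observation with modality-flexible and multi-temporal reasoning. It also builds the large-scale dataset Terra-CoT and TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning.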

Details Motivation: Existing vision-language models perform poorly on tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. Method: Proposes TerraScope, which supports single-modality input (optical or SAR) and adaptive multi-modal fusion and integrates multi-temporal sequences for change analysis; builds Terra-CoT, a dataset of 1 million samples with pixel-level masks embedded in reasoning chains; and designs TerraScope-Bench, the first pixel-grounded geospatial reasoning benchmark, with six sub-tasks that jointly evaluate answer accuracy and mask quality. Result: Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning and provides interpretable visual evidence. Conclusion: TerraScope improves pixel-level spatial reasoning in earth observation and advances trustworthy, interpretable use of vision-language models in remote sensing. Abstract: Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

[160] Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos

Weijia Dou,Wenzhao Zheng,Weiliang Chen,Yu Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

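TL;DR: This paper proposes SGC, a metric for evaluating 3D spatial geometric consistency in generated videos, which quantifies consistency by estimating camera poses from different local regions and computing their divergence.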

Details Motivation: Existing evaluation methods cannot accurately characterize 3D spatial geometric inconsistencies in generated videos: fidelity-centric metrics (such as FVD) are insensitive to geometric distortion, while consistency-focused benchmarks often wrongly penalize valid foreground motion. Method: SGC first separates static from dynamic regions and partitions the static background into spatially coherent sub-regions; it then predicts per-pixel depth, estimates a local camera pose for each sub-region, and computes the divergence among these poses to quantify geometric consistency. Result: Experiments on real and generated videos show that SGC robustly quantifies geometric inconsistency and identifies critical failures missed by other metrics. Conclusion: SGC is a novel, effective metric for 3D spatial geometric consistency that fills a gap in current video-generation evaluation. Abstract: Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each sub-region, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generated videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
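
The last step of the pipeline above (divergence among per-region camera poses) can be illustrated with a minimal sketch. The abstract does not say how divergence is defined, so the mean pairwise geodesic angle between rotation matrices used here is an assumption, and the names `rotation_angle`, `pose_divergence`, and `rot_z` are hypothetical. Translation is ignored.

```python
import math
from itertools import combinations

def rotation_angle(Ra, Rb):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    # trace(Ra^T Rb) equals the sum of element-wise products of Ra and Rb.
    trace = sum(Ra[i][j] * Rb[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.acos(cos_theta)

def pose_divergence(rotations):
    """Mean pairwise angle among per-region rotations (assumed divergence)."""
    pairs = list(combinations(rotations, 2))
    if not pairs:
        return 0.0
    return sum(rotation_angle(a, b) for a, b in pairs) / len(pairs)

def rot_z(theta):
    """Rotation by theta radians about the z axis (helper for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

IDENTITY = rot_z(0.0)
consistent = pose_divergence([IDENTITY, IDENTITY, IDENTITY])
inconsistent = pose_divergence([IDENTITY, rot_z(math.pi / 2)])
```

In a geometrically consistent static background every sub-region should imply the same camera motion, so the score stays near zero; sub-regions that disagree push it up.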

[161] SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

Phuc Pham,Uy Dieu Tran,Binh-Son Hua,Phong Nguyen

Main category: cs.CV

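TL;DR: This paper proposes SwiftTailor, a two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation, significantly improving the efficiency and quality of 3D garment generation.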

Details Motivation: Existing methods rely on large vision-language models to generate 2D sewing patterns that are then converted into 3D meshes; quality is high, but inference is slow (30 seconds to 1 minute), which makes real-time or large-scale use difficult. Method: SwiftTailor comprises two lightweight modules: PatternMaker (an efficient multimodal vision-language model that predicts sewing patterns) and GarmentSewer (a dense prediction transformer that generates a garment geometry image in a unified UV space); the 3D mesh is then reconstructed directly through inverse mapping, remeshing, and dynamic stitching, avoiding the cost of physical simulation. Result: Experiments on Multimodal GarmentCodeData show that SwiftTailor reaches state-of-the-art accuracy and visual fidelity while greatly reducing inference time. Conclusion: SwiftTailor offers a scalable, interpretable, high-performance approach to next-generation 3D garment generation. Abstract: Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

[162] Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng,Xin Ding,Yifan Yang,Shiqi Jiang,Hao Wu,Qianxi Zhang,Weijun Wang,Ting Cao,Yunxin Liu

Main category: cs.CV

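TL;DR: Em-Garde proposes a new framework that decouples semantic understanding from streaming perception, using an instruction-guided proposal parser and a lightweight proposal matching module to improve the accuracy and efficiency of proactive responses in streaming video understanding.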

Details Motivation: Existing proactive VideoLLMs that make per-frame triggering decisions face a trade-off between efficiency and accuracy. Method: Proposes the Em-Garde framework: (1) an instruction-guided proposal parser turns user queries into structured, perceptually grounded visual proposals; (2) a lightweight proposal matching module performs efficient embedding matching during streaming to trigger responses. Result: Experiments on StreamingBench and OVO-Bench show that Em-Garde consistently outperforms prior models in proactive response accuracy and efficiency. Conclusion: Em-Garde provides an effective solution for proactive video understanding under strict computational constraints. Abstract: Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
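
The propose-then-match split described above can be sketched as follows. This is only an illustration of embedding-based triggering in general: the cosine-similarity score, the fixed threshold of 0.8, and the names `cosine`, `should_trigger`, and `first_trigger` are assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def should_trigger(frame_emb, proposal_embs, threshold=0.8):
    """True if the frame matches any precomputed proposal embedding."""
    return any(cosine(frame_emb, p) >= threshold for p in proposal_embs)

def first_trigger(stream, proposal_embs, threshold=0.8):
    """Index of the first frame that triggers a response, or None."""
    for t, frame_emb in enumerate(stream):
        if should_trigger(frame_emb, proposal_embs, threshold):
            return t
    return None

# Toy 2-D embeddings: proposals are computed once at query time,
# frames arrive one at a time during streaming.
proposals = [[1.0, 0.0]]
frames = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.1]]
hit = first_trigger(frames, proposals)
miss = first_trigger([[0.0, 1.0]], proposals)
```

The point of the design is that the expensive language-side work happens once per query, while the per-frame cost is a handful of dot products.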

[163] SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

Oliver Cory,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

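TL;DR: This paper proposes SignAgent, a new framework that uses large language models (LLMs) for scalable, linguistically grounded sign language (SL) annotation and dataset curation. An Orchestrator coordinates a tool chain and SignGraph provides lexical and linguistic grounding; the framework's effectiveness for large-scale, linguistically aware annotation is validated on pseudo-gloss annotation and ID glossing.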

Details Motivation: Traditional computational approaches to sign language are confined to the gloss level and overlook key linguistic detail, while manual linguistic annotation is slow and expensive and cannot support the construction of large-scale, phonologically aware datasets. Method: Proposes the SignAgent framework with two core components: SignAgent Orchestrator (a reasoning LLM that coordinates a linguistic tool chain) and SignGraph (a knowledge-grounded LLM that provides lexical and linguistic grounding); evaluates it on two downstream tasks, pseudo-gloss annotation (constrained extraction and ordering of labels driven by multi-modal evidence) and ID glossing (detecting and refining clusters based on visual similarity and phonological overlap). Result: SignAgent shows strong performance on large-scale, linguistically aware sign language data annotation and curation, significantly improving annotation efficiency and linguistic accuracy. Conclusion: SignAgent provides the first linguistically grounded, scalable LLM-agent solution for building sign language resources, bridging the gap between computational modelling and fine-grained linguistic annotation. Abstract: This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. The first is Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. The second is ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

[164] DROID-SLAM in the Wild

Moyang Li,Zihan Zhu,Marc Pollefeys,Daniel Barath

Main category: cs.CV

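TL;DR: This paper proposes a real-time RGB SLAM system based on differentiable uncertainty-aware bundle adjustment that handles dynamic environments, estimating per-pixel uncertainty from multi-view feature inconsistency to achieve robust real-time tracking and reconstruction.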

Details Motivation: Traditional SLAM assumes static scenes and tends to fail in dynamic environments; existing dynamic SLAM methods rely on predefined dynamic priors or uncertainty modeling and struggle with unknown dynamic objects and highly cluttered scenes. Method: Proposes a differentiable uncertainty-aware bundle adjustment that estimates per-pixel uncertainty from multi-view visual feature inconsistency, enabling robust tracking and reconstruction. Result: Achieves state-of-the-art camera pose and scene geometry accuracy in cluttered dynamic scenes while running in real time at around 10 FPS. Conclusion: The method significantly improves the robustness and practicality of SLAM in dynamic environments and suits complex real-world scenes. Abstract: We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.

[165] Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline

Ye Wang,Wei Lu,Zhihui You,Keyan Chen,Tongfei Liu,Kaiyu Li,Hongruixuan Chen,Qingling Shu,Sibao Chen

Main category: cs.CV

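TL;DR: This paper proposes a multi-modal method for detecting small changes in optical remote sensing imagery. It builds LSMD, the first high-resolution, accurately registered bi-temporal RGB-NIR dataset, and designs the Multi-modal Spectral Complementarity Network (MSCNet), whose three modules fuse cross-modal features and deliver superior performance in fine-grained building change detection.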

Details Motivation: Existing change detection methods are sensitive to illumination, seasonal, and surface-material variation, and RGB-only input tends to produce pseudo-changes and semantic ambiguity; near-infrared (NIR) imagery provides complementary physical cues, but current multi-modal datasets lack high resolution and accurate registration, and existing methods do not fully exploit modality heterogeneity. Method: Builds LSMD, a large-scale small-change multi-modal dataset, and proposes the Multi-modal Spectral Complementarity Network (MSCNet), comprising the Neighborhood Context Enhancement Module (NCEM), the Cross-modal Alignment and Interaction Module (CAIM), and the Saliency-aware Multisource Refinement Module (SMRM), to fuse RGB and NIR features effectively. Result: Extensive experiments on LSMD show that MSCNet significantly outperforms existing methods under multiple input configurations, validating its effectiveness for fine-grained building change detection. Conclusion: Adding the NIR modality, building the high-quality multi-modal dataset LSMD, and using MSCNet, which is designed for modality heterogeneity, improves the accuracy and robustness of small-scale change detection in remote sensing imagery. Abstract: Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD

[166] TAU-R1: Visual Language Model for Traffic Anomaly Understanding

Yuqiang Lin,Kehua Chen,Sam Lockyer,Arjun Yadav,Mingxuan Sui,Shucheng Zhang,Yan Shi,Bingzhang Wang,Yuang Zhang,Markus Zarbock,Florain Stanek,Adrian Evans,Wenbin Li,Yinhai Wang,Nic Zhang

Main category: cs.CV

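TL;DR: This paper proposes the Roundabout-TAU dataset and the TAU-R1 model for traffic anomaly understanding (TAU), achieving strong performance on classification and reasoning tasks with a two-layer vision-language framework and a two-stage training strategy.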

Details Motivation: Research on traffic anomaly understanding (TAU) is held back by the lack of real-world benchmark datasets and task-specific methods. Method: Builds Roundabout-TAU, a dataset of real roundabout videos (342 clips, more than 2,000 question-answer pairs); proposes TAU-R1, a two-layer VLM framework (a lightweight anomaly classifier plus a larger anomaly reasoner); and designs a two-stage training strategy: decomposed-QA-enhanced fine-tuning followed by GRPO post-training with TAU-specific reward functions. Result: TAU-R1 performs strongly on both anomaly classification and reasoning while maintaining deployment efficiency; the dataset and code are open-sourced. Conclusion: Roundabout-TAU fills the gap in real-world benchmarks for traffic anomaly understanding, and TAU-R1 with its training strategy offers an effective template for deploying VLMs in specialized vertical domains. Abstract: Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1

[167] CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Weilin Chen,Jiahao Rao,Wenhao Wang,Xinyang Li,Xuan Cheng,Liujuan Cao

Main category: cs.CV

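TL;DR: This paper proposes CustomTex, a framework that generates instance-level, high-fidelity textures for 3D indoor scenes from reference images. It combines a dual-distillation mechanism (semantic-level and pixel-level) with Variational Score Distillation optimization, significantly improving texture quality and reducing artifacts and baked-in shading.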

Details Motivation: Existing text-driven methods lack fine-grained, instance-level control and produce low-quality textures with artifacts and baked-in shading. Method: Proposes the CustomTex framework for instance-level texture generation from reference images, using a dual-distillation strategy: semantic-level distillation (with instance cross-attention) ensures semantic plausibility and reference-instance alignment, and pixel-level distillation improves visual fidelity; both are unified within a Variational Score Distillation (VSD) optimization framework. Result: CustomTex outperforms state-of-the-art methods in instance-level consistency, texture sharpness, artifact suppression, and reduction of baked-in shading. Conclusion: CustomTex offers a more direct and user-friendly route to high-quality, customizable 3D scene appearance editing. Abstract: The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

[168] Revisiting Autoregressive Models for Generative Image Classification

Ilia Sudakov,Artem Babenko,Dmitry Baranchuk

Main category: cs.CV

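TL;DR: This paper proposes a class-conditional generative classifier based on any-order autoregressive (AR) models that marginalizes over multiple token orders to improve image classification, outperforming diffusion models in both accuracy and efficiency.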

Details Motivation: Existing visual autoregressive generative classifiers depend on a fixed token order, which imposes a restrictive inductive bias for image understanding and yields an incomplete discriminative signal. Method: Uses recent any-order AR models to make predictions under multiple token orders and marginalizes over them, obtaining a more comprehensive discriminative signal. Result: Consistently outperforms diffusion-based classifiers across multiple image classification benchmarks while being up to 25x more efficient at inference; also competitive with the best self-supervised discriminative models. Conclusion: With order marginalization, AR generative models can realize their potential as generative classifiers, combining high accuracy with high efficiency. Abstract: Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance - a notable achievement for generative classifiers.
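
Order marginalization as described above can be illustrated with a small sketch. It assumes that per-class log-likelihoods under several token orders are already available, that likelihoods (not log-likelihoods) are averaged over orders, and that the class prior is uniform; none of these choices is confirmed by the abstract, and the function names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def order_marginalized_scores(log_liks):
    """log_liks[c][k] = log p(x | class c, token order k).

    Returns, per class, the log of the likelihood averaged over orders.
    """
    return [logsumexp(row) - math.log(len(row)) for row in log_liks]

def classify(log_liks):
    """Pick the class with the highest order-marginalized score."""
    scores = order_marginalized_scores(log_liks)
    return max(range(len(scores)), key=scores.__getitem__)

# Two classes, two token orders. Under order 0 alone, class 1 looks better;
# after averaging over both orders, class 0 wins.
single_order = classify([[-10.0], [-4.0]])
marginalized = classify([[-10.0, -2.0], [-4.0, -4.0]])
```

The toy numbers show how a decision based on one order can flip once evidence from a second order is folded in, which is the effect the abstract attributes to partial discriminative cues.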

[169] GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

Yiren Lu,Yi Du,Disheng Liu,Yunlai Zhou,Chen Wang,Yu Yin

Main category: cs.CV

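TL;DR: This paper proposes GSMem, a framework that uses 3D Gaussian Splatting (3DGS) to build a persistent spatial memory with "spatial recollection" for zero-shot embodied exploration and reasoning. It localizes targets by combining scene graphs with semantic language fields, and explores with a hybrid strategy that combines VLM semantic scoring and 3DGS geometric coverage.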

Details Motivation: Existing scene representations (such as discrete scene graphs or static view snapshots) lack post-hoc re-observability, so a target missed initially cannot be recovered; a representation is needed that accumulates spatial knowledge over time and can be flexibly revisited. Method: Proposes GSMem, built on 3D Gaussian Splatting (3DGS), which forms a persistent spatial memory parameterized by continuous geometry and dense appearance; designs a retrieval mechanism that combines object-level scene graphs and semantic-level language fields to localize targets; and introduces a hybrid exploration strategy that couples VLM-driven semantic scoring with a 3DGS coverage objective. Result: Experiments on embodied question answering and lifelong navigation show that GSMem significantly improves exploration robustness and reasoning accuracy, with zero-shot generalization and efficient use of spatial memory. Conclusion: As a renderable, queryable spatial memory, 3DGS effectively supports long-term spatial cognition and task-driven reasoning in embodied agents, offering a new paradigm for embodied AI. Abstract: Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.

[170] ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Kwanyoung Lee,Hyunwoo Oh,SeungJu Cha,Sungho Koh,Dong-Jin Kim

Main category: cs.CV

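TL;DR: This paper proposes ADAPT, a training-free, deterministic framework that optimizes prompt scheduling using attention scores and orthogonal components to improve diffusion models' generation of rare compositional concepts.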

Details Motivation: Diffusion models struggle in text-to-image synthesis to generate compositional concepts that are rare in the training data; existing methods such as R2F are limited by the randomness of language models and by the poor guidance that iterative text-embedding switching provides. Method: The ADAPT framework uses attention scores and orthogonal components to plan prompt schedules deterministically and align them semantically, with no additional training or fine-tuning. Result: Significantly improves rare-concept compositional generation on the RareBench benchmark, accurately reflects the semantics of rare attributes, and provides deterministic, precise control while preserving visual integrity. Conclusion: ADAPT is an efficient, stable, training-free framework that addresses the variance and weak guidance that hamper rare compositional concept generation. Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing an LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.

[171] Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Kwanyoung Lee,SeungJu Cha,Yebin Ahn,Hyunwoo Oh,Sungho Koh,Dong-Jin Kim

Main category: cs.CV

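TL;DR: This paper proposes Adaptive Auxiliary Prompt Blending (AAPB), a framework that improves the semantic alignment and structural consistency of diffusion models when generating rare concepts or editing images. Grounded in Tweedie's identity, the method is training-free and adaptively balances the influence of the target prompt and an auxiliary anchor prompt, clearly outperforming existing training-free baselines.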

Details Motivation: When handling low-density regions of the training distribution (such as rare concepts or editing instructions), diffusion models often produce semantically misaligned and structurally inconsistent results, a consequence of the long-tailed nature of text-image datasets. Method: Proposes the Adaptive Auxiliary Prompt Blending (AAPB) framework, which uses auxiliary anchor prompts for semantic or structural support and, based on Tweedie's identity, derives a closed-form adaptive coefficient at each diffusion step to weight the target and anchor prompts optimally. Result: Validated on the RareBench and FlowEdit datasets, where AAPB yields consistent gains in semantic accuracy and structural fidelity over fixed interpolation and other training-free baselines. Conclusion: AAPB is a principled, training-free, general prompt-blending framework for rare concept generation and image editing that mitigates the generation errors of diffusion models under long-tailed distributions. Abstract: Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

[172] ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Zhan Jin,Yu Luo,Yizhou Zhang,Ziyang Cui,Yuqing Wei,Xianchao Liu,Xueying Zeng,Qing Zhang

Main category: cs.CV

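TL;DR: This paper proposes ARIADNE, a framework that combines preference-aligned perception with reinforcement-learning-based diagnostic reasoning to address topologically incoherent coronary artery segmentation. It fine-tunes a vision-language model with DPO under Betti number constraints to improve vessel structural completeness, and designs a reasoning module with a rejection mechanism to make stenosis localization more reliable; it reaches state-of-the-art performance on clinical data and shows cross-center generalization.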

Details Motivation: Conventional pixel-wise loss functions cannot guarantee topological consistency in coronary artery segmentation, which fragments the vascular tree and undermines the reliability of clinical diagnosis. Method: Proposes the two-stage framework ARIADNE: the perception module fine-tunes the Sa2VA model with DPO, using Betti numbers as preference signals to align the output toward geometrically complete structures; the reasoning module models stenosis localization as a Markov decision process with an explicit rejection mechanism that autonomously defers ambiguous anatomy (such as bifurcations and crossings). Result: On 1,400 clinical angiograms, centerline Dice reaches 0.838 and false positives fall by 41%; generalization is validated on the multi-center ARCADE and XCAD datasets. Conclusion: This is the first application of DPO to topological alignment in medical imaging, showing that preference learning over structural constraints can markedly reduce topological errors while maintaining diagnostic sensitivity, and that it fits interventional cardiology workflows. Abstract: Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
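
To make the idea of Betti numbers as preference signals concrete, here is a minimal sketch that counts connected components (the zeroth Betti number) of a binary mask and orders two candidate segmentations by it. How the paper actually builds its preference pairs is not stated in the abstract; preferring the mask with fewer components, ignoring the first Betti number (loops), using 8-connectivity, and the names `betti0` and `preference_pair` are all assumptions.

```python
from collections import deque

def betti0(mask):
    """Count 8-connected foreground components in a binary 2D mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                components += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
    return components

def preference_pair(mask_a, mask_b):
    """Return (chosen, rejected), preferring fewer components; None on a tie."""
    a, b = betti0(mask_a), betti0(mask_b)
    if a == b:
        return None
    return (mask_a, mask_b) if a < b else (mask_b, mask_a)

connected = [[1, 1, 1], [0, 0, 1]]   # one vessel-like component
fragmented = [[1, 0, 1], [0, 0, 0]]  # same vessel broken into two pieces
pair = preference_pair(fragmented, connected)
```

A real pipeline would also need to guard against degenerate cases, for example an empty mask, which has zero components and would otherwise be preferred under this rule.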

[173] Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Yiren Lu,Xin Ye,Burhaneddin Yaman,Jingru Luo,Zhexiao Xiong,Liu Ren,Yu Yin

Main category: cs.CV

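TL;DR: This paper proposes Splat2BEV, a framework that introduces explicit Gaussian Splatting 3D scene reconstruction to improve the geometric precision and semantic richness of bird's-eye-view (BEV) perception, significantly improving downstream task performance.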

Details Motivation: Existing end-to-end BEV perception methods treat the whole process as a black box and lack explicit 3D geometric understanding and interpretability, which limits performance. Method: Proposes Splat2BEV: first pre-train a Gaussian generator that explicitly reconstructs the 3D scene from multi-view images and produces geometry-aligned features, then project these features into BEV space for downstream tasks. Result: Achieves state-of-the-art performance on the nuScenes and Argoverse datasets, validating the effectiveness of explicit 3D reconstruction for BEV perception. Conclusion: An explicit 3D representation is essential for accurate BEV perception; by combining geometric reconstruction with semantic learning, Splat2BEV improves the quality and generalization of BEV features. Abstract: Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

[174] Tinted Frames: Question Framing Blinds Vision-Language Models

Wan-Cyuan Fan,Jiayun Luo,Declan Kutscher,Leonid Sigal,Ritwik Gupta

Main category: cs.CV

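TL;DR: This paper reveals that vision-language models (VLMs) exhibit "selective blindness": their attention to the image changes markedly with the form of the text prompt (for example, multiple choice or yes/no versus open-ended questions), which degrades visual reasoning. Based on this, the authors propose a lightweight learnable prompt-tuning method that makes visual attention more robust and improves performance across prompt forms.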

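Details Motivation: Although VLMs are multimodal, they often neglect visual input on tasks that require visual reasoning; the authors investigate whether this blindness is systematic and influenced by linguistic prompts, and seek its root cause. Method: Using visual attention as a probe, quantitatively analyzes how the distribution of attention over the image changes under different linguistic framings (multiple choice, yes/no, open-ended); then designs a lightweight prompt-tuning method with learnable tokens that steers the model toward the robust visual attention patterns seen under open-ended questions. Result: Constrained framings significantly lower attention to image context, weaken focus on task-relevant regions, and shift attention toward uninformative tokens; this attention misallocation is the main cause of the accuracy drop and cross-framing inconsistency; the proposed method improves visual grounding and overall performance across framings. Conclusion: The "blindness" of VLMs is essentially an attention bias induced by linguistic framing rather than an inherent capability deficit; targeted prompt optimization can mitigate it and make visual reasoning more robust. Abstract: Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended framings, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and performance across framings.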

[175] RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Yue Gong,Hongyu Li,Shanyuan Liu,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Manyuan Zhang,Dawei Leng,Yuhui Yin,Lijun Zhang

Main category: cs.CV

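TL;DR: This paper proposes the Representation-Pivoted AutoEncoder (RPiAE), a fine-tunable tokenizer built on pretrained visual representations. Through Representation-Pivot Regularization and a variational bridge, it improves reconstruction fidelity and compresses the latent space while preserving semantic structure, improving diffusion-model generation and editing.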

Details Motivation: Existing tokenizers that reuse frozen encoders from pretrained visual representation models suffer from low reconstruction fidelity, poor editing quality, and overly high-dimensional latents that make diffusion modeling difficult. Method: Proposes the Representation-Pivoted AutoEncoder (RPiAE): (1) Representation-Pivot Regularization, which constrains an encoder initialized from a representation model to preserve its original semantic structure during fine-tuning; (2) a variational bridge that further compresses the latent space; (3) an objective-decoupled, stage-wise training strategy that optimizes generative tractability and reconstruction fidelity separately. Result: RPiAE outperforms other visual tokenizers on text-to-image generation and image editing and achieves the best reconstruction fidelity among representation-based tokenizers. Conclusion: RPiAE balances semantic preservation, reconstruction accuracy, and diffusion-modeling efficiency, giving diffusion models a better latent representation. Abstract: Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

[176] Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Shang-Jui Ray Kuo,Paola Cascante-Bonilla

Main category: cs.CV

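TL;DR: This paper examines the potential of state space models (SSMs) as vision backbones for large vision-language models (VLMs), finding that they perform strongly on VQA and localization tasks and remain competitive at smaller scale. It also shows that higher ImageNet accuracy or larger models do not always yield better VLM performance, and proposes stabilization strategies that improve localization robustness.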

Details Motivation: To explore whether state space models (SSMs) can serve as an effective alternative to conventional Transformer vision backbones and improve the balance between VLM performance and efficiency. Method: Systematically evaluates SSM vision backbones in VLMs under controlled conditions, including comparison under matched ImageNet-1K initialization and fine-tuning on detection or segmentation tasks, and analyzes how different backbones affect VQA and localization as well as stability. Result: The SSM backbone achieves the strongest overall performance on VQA and localization; after dense-task tuning it remains competitive with substantially fewer parameters; higher ImageNet accuracy or larger model size does not guarantee better VLM performance, and some backbones are unstable in localization. Conclusion: SSM vision backbones are a strong alternative to Transformer-based encoders in VLMs, and the proposed stabilization strategies improve robustness. Abstract: Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

[177] DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Tianjiao Yu,Xinzhuo Li,Muntasir Wahed,Jerry Xiong,Yifan Shen,Ying Shen,Ismini Lourentzou

Main category: cs.CV

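TL;DR: DreamPartGen is a semantically grounded, part-aware text-to-3D generation framework. It uses Duplex Part Latents (DPLs) and Relational Semantic Latents (RSLs) to jointly model part geometry and appearance together with inter-part semantic relations, enforces geometric and semantic consistency through synchronized co-denoising, and reaches state-of-the-art geometric fidelity and text-shape alignment.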

Details Motivation: Most existing text-to-3D methods overlook the semantic and functional part structure of 3D objects; part-aware methods exist but focus on geometry, lack semantic grounding, and do not model how parts align with textual descriptions or relate to one another. Method: Propose the DreamPartGen framework, which introduces Duplex Part Latents (DPLs) to jointly represent part geometry and appearance and Relational Semantic Latents (RSLs) to extract inter-part semantic dependencies from text, together with a synchronized co-denoising process that enforces geometric and semantic consistency. Result: State-of-the-art geometric fidelity and text-shape alignment on multiple benchmarks. Conclusion: DreamPartGen is presented as the first framework to achieve deeply semantically grounded part-level text-to-3D generation, markedly improving the interpretability, consistency, and text alignment of generated results. Abstract: Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

[178] LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao,Yuhua Zheng,Jia Xu,Wenjie Du,Kele Shao,Hesong Wang,Xueyi Chen,Xin Jin,Junhan Zhu,Bohan Yu,Weiqiang Wang,Jian Liu,Can Qin,Yulun Zhang,Ming-Hsuan Yang,Huan Wang

Main category: cs.CV

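TL;DR: This paper introduces LVOmniBench, the first benchmark designed specifically for cross-modal understanding of long-form audio and video. It contains 275 high-quality videos of 10 to 90 minutes and 1,014 QA pairs for evaluating OmniLLMs on long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Existing models perform poorly (open-source models below 35%, Gemini 3 Pro about 65%); the benchmark aims to advance research on long-form audio-video understanding.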

Details Motivation: Existing OmniLLM evaluations focus on short clips (10 seconds to 5 minutes) and do not reflect the real-world need to understand videos tens of minutes long, leaving a critical evaluation gap. Method: Build the LVOmniBench benchmark by selecting audio-visual content with rich dynamics from open platforms, then manually filtering and annotating it into 275 videos of 10 to 90 minutes and 1,014 QA pairs, with an evaluation protocol covering long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Result: Current OmniLLMs perform poorly on long-form audio-video understanding: mainstream open-source models generally score below 35% accuracy, and the strongest commercial model, Gemini 3 Pro, reaches only about 65%. Conclusion: LVOmniBench fills the gap in evaluating long-form audio-video cross-modal understanding, empirically exposes key capability bottlenecks in existing models, and is expected to drive research on next-generation models for complex long-form multimodal scenarios. Abstract: Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

[179] Rethinking Vector Field Learning for Generative Segmentation

Chaoyang Wang,Yaobo Liang,Boci Peng,Fan Duan,Jingdong Wang,Yunhai Tong

Main category: cs.CV

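TL;DR: This paper proposes a vector field reshaping strategy for generative semantic segmentation. A distance-aware correction term mitigates gradient vanishing and trajectory traversing, and an efficient category encoding scheme based on Kronecker sequences is introduced, substantially improving diffusion models on segmentation tasks.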

Details Motivation: When diffusion models are used for segmentation, there is an intrinsic mismatch between continuous flow matching objectives and discrete perception tasks, and the gradient vanishing and trajectory traversing problems are poorly understood. Method: Propose a vector field reshaping strategy that adds a detached distance-aware correction term to strengthen gradients while preserving the original training framework; design a quasi-random category encoding based on Kronecker sequences, embedded in an end-to-end pixel neural field for pixel-level semantic alignment. Result: Significantly outperforms vanilla flow matching on multiple datasets and substantially narrows the gap between generative segmentation and strong discriminative models. Conclusion: The vector field perspective offers a new understanding of diffusion segmentation; the method improves segmentation accuracy and convergence efficiency while remaining compatible with existing training. Abstract: Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
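
The abstract says only that the category encoding is "inspired by Kronecker sequences" and does not give the construction. As background, a textbook Kronecker sequence takes the fractional part of integer multiples of an irrational vector, which spreads points evenly over the unit cube. The sketch below assigns one such point to each class index; the choice of square roots of primes as the irrational vector and the name `kronecker_code` are assumptions for illustration, not the authors' scheme.

```python
import math
from typing import List

# Square roots of distinct primes are irrational, which is what a Kronecker
# sequence requires. This particular choice is an illustrative assumption.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]


def kronecker_code(class_index: int, dim: int) -> List[float]:
    """Return a quasi-random code in [0, 1)^dim for a class index.

    Coordinate j is frac((class_index + 1) * sqrt(p_j)); the +1 keeps
    class 0 away from the all-zero corner.
    """
    if not 1 <= dim <= len(PRIMES):
        raise ValueError(f"dim must be between 1 and {len(PRIMES)}")
    n = class_index + 1
    return [(n * math.sqrt(p)) % 1.0 for p in PRIMES[:dim]]


for cls in range(4):
    print(cls, [round(v, 3) for v in kronecker_code(cls, dim=3)])
```

Because each code is a closed-form function of the class index, no lookup table or learned embedding is needed, which fits the abstract's description of the scheme as computationally efficient.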

[180] DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Dong Zhuo,Wenzhao Zheng,Sicheng Zuo,Siming Yan,Lu Hou,Jie Zhou,Jiwen Lu

Main category: cs.CV

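TL;DR: This paper proposes DriveTok, an efficient 3D visual tokenizer for multi-view autonomous driving scenes. It maps visual features into unified scene tokens via 3D deformable cross-attention and supports multi-task decoding (RGB, depth, and semantic reconstruction plus 3D occupancy prediction); its effectiveness is validated on nuScenes.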

Details Motivation: Existing image tokenizers mainly target monocular 2D scenes and suffer from inefficiency and cross-view inconsistency on high-resolution multi-view driving scenes, so a scalable tokenizer adapted to 3D driving scenes is needed. Method: DriveTok extracts semantically rich features with vision foundation models and encodes them into unified scene tokens through 3D deformable cross-attention; on the decoding side, a multi-view transformer reconstructs multi-view features, and multiple heads produce RGB, depth, and semantic reconstructions, along with 3D semantic occupancy prediction directly from the scene tokens. Result: On nuScenes, the scene tokens produced by DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction. Conclusion: DriveTok achieves unified, efficient, multi-task-compatible 3D scene tokenization for autonomous driving, providing a more robust visual interface for vision-language-action models and world models. Abstract: With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

[181] Spectrally-Guided Diffusion Noise Schedules

Carlos Esteves,Ameesh Makadia

Main category: cs.CV

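TL;DR: This paper proposes per-instance noise schedules derived from an image's spectral properties to improve the generation quality of pixel diffusion models at low sampling step counts.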

Details Motivation: Noise schedules in existing diffusion models are usually handcrafted and must be manually tuned across resolutions, with no adaptation to different image content. Method: Derive theoretical bounds on the efficacy of minimum and maximum noise levels from the image's spectral properties and construct "tight" per-instance noise schedules; at inference time, use a conditional sampling strategy to select the schedule dynamically. Result: Experiments show that the proposed noise schedules clearly improve the generation quality of single-stage pixel diffusion models, especially in the low-step regime. Conclusion: Image spectral information can serve as a key basis for designing efficient, adaptive noise schedules, improving diffusion model efficiency and quality without additional training. Abstract: Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.

[182] EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Yang Fu,Yike Zheng,Ziyun Dai,Henghui Ding

Main category: cs.CV

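TL;DR: This paper introduces the VOR dataset and the EffectErase method for high-quality removal of target objects and their visual effects (such as deformation, shadows, and reflections) from video. VOR is the first large-scale paired video dataset covering multiple effect types and complex scenes; EffectErase uses an effect-aware, insertion-removal reciprocal learning framework and achieves markedly better performance on effect erasing.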

Details Motivation: Existing diffusion models for video object removal struggle to fully eliminate the target object's visual effects (such as shadows, reflections, and deformation), and there is no high-quality paired training and evaluation dataset that systematically covers these effects. Method: Build VOR, a large-scale paired video dataset (60K pairs) covering five effect types and dynamic multi-object scenes; propose EffectErase, which introduces an inverse auxiliary task (video object insertion), a task-aware region guidance mechanism, and an insertion-removal consistency loss to localize effect regions and model structural consistency. Result: Trained on VOR, EffectErase clearly outperforms existing methods across effect-erasing tasks, producing more coherent backgrounds and more thorough effect removal. Conclusion: VOR fills the benchmark gap for video object effect removal; EffectErase's reciprocal learning and effect-aware design improve the joint removal of video objects and their visual effects. Abstract: Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. An insertion-removal consistency objective then encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

[183] Under One Sun: Multi-Object Generative Perception of Materials and Illumination

Nobuo Yoshii,Xinran Nicole Han,Ryo Kawahara,Todd Zickler,Ko Nishino

Main category: cs.CV

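TL;DR: MultiGP is a generative inverse rendering method that jointly models and samples the reflectance, texture, and illumination of multiple objects from a single image. It exploits the prior that objects in a scene share the same illumination and combines a cascaded network, Coordinated Guidance for diffusion, Axial Attention, and a ControlNet to achieve disentangled estimation.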

Details Motivation: To resolve the inherent ambiguity of single-image radiometric disentanglement (reflectance, texture, illumination) by exploiting the physical prior that multiple objects in the same scene share one illumination. Method: Propose Multi-Object Generative Perception (MultiGP), with four key techniques: (1) a cascaded end-to-end architecture combining image-space and angular-space disentanglement; (2) Coordinated Guidance so that diffusion converges to a single consistent illumination estimate; (3) Axial Attention to enable cross-object information exchange between objects of different reflectance; (4) a Texture Extraction ControlNet that preserves high-frequency texture detail while decoupling it from lighting. Result: Experiments show that MultiGP effectively exploits the complementary spatial and frequency characteristics of multiple object appearances to accurately recover each object's texture and reflectance as well as the shared scene illumination. Conclusion: MultiGP provides a sampleable generative framework for multi-object single-image inverse rendering, improving the disentanglement quality and physical consistency of radiometric components. Abstract: We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents (reflectance, texture, and illumination) underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.

[184] Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Chenyang Gu,Mingyuan Zhang,Haozhe Xie,Zhongang Cai,Lei Yang,Ziwei Liu

Main category: cs.CV

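TL;DR: This paper proposes a three-stage motion generation framework (Perception, Planning, Control) built around MoTok, a diffusion-based discrete motion tokenizer that balances semantic conditioning with motion fidelity and substantially improves controllability and accuracy on HumanML3D.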

Details Motivation: Existing motion generation methods split into continuous diffusion models (strong at kinematic control) and discrete token generators (strong at semantic conditioning); neither combines both strengths, so a unified approach is needed. Method: A three-stage framework: (1) Perception extracts condition features; (2) Planning generates compact single-layer discrete motion tokens with MoTok; (3) Control recovers fine-grained motion with a diffusion decoder and enforces fine-grained kinematic constraints. MoTok decouples semantic abstraction from reconstruction, delegating motion recovery to the diffusion decoder. Result: On HumanML3D, compared with MaskControl, trajectory error drops from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029; under stronger kinematic constraints FID improves from 0.033 to 0.014, whereas other methods degrade; token usage is only one-sixth of MaskControl's. Conclusion: The framework combines the semantic controllability of discrete tokens with the kinematic precision of diffusion models; MoTok provides a high-fidelity, low-overhead motion representation and a new paradigm for high-quality controllable motion generation. Abstract: Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.

[185] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang,Wenkai Dong,Yuxin Song,Bo Fang,Qi Zhang,Jing Wang,Fan Chen,Hui Zhang,Haocheng Feng,Yu Lu,Hang Zhou,Chun Yuan,Jingdong Wang

Main category: cs.CV

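TL;DR: This paper proposes SAMA, a framework that factorizes instruction-guided video editing into semantic anchoring and motion alignment to better balance precise semantic modification with motion fidelity; it does not rely on external priors and shows strong zero-shot generalization.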

Details Motivation: Existing instruction-guided video editing models struggle to combine precise semantic modification with faithful motion preservation, and they rely heavily on explicit external priors (such as VLM features or structural conditions), which limits robustness and generalization. Method: Propose the SAMA framework: (1) Semantic Anchoring jointly predicts semantic tokens and video latents at sparse anchor frames, enabling purely instruction-driven structural planning; (2) Motion Alignment pre-trains the backbone on motion-centric pretext tasks (cube inpainting, speed perturbation, tube shuffle) so that it internalizes temporal dynamics from raw videos. Training uses two stages: factorized pre-training without paired data, followed by supervised fine-tuning on paired editing data. Result: SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni); factorized pre-training alone already yields strong zero-shot video editing ability. Conclusion: Explicitly factorizing semantics and motion improves the generalization and robustness of video editing models, reduces reliance on external priors, and offers a new paradigm for instruction-driven video editing. Abstract: Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

[186] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Haitian Li,Haozhe Xie,Junxiang Xu,Beichen Wen,Fangzhou Hong,Ziwei Liu

Main category: cs.CV

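TL;DR: MonoArt is a unified framework for reconstructing articulated 3D objects from a single image. Progressive structural reasoning disentangles motion cues from object structure, enabling stable, interpretable articulation inference without multi-view supervision or external templates, and reaches state-of-the-art performance on PartNet-Mobility.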

Details Motivation: Single-image reconstruction of articulated 3D objects suffers from tight entanglement between motion cues and object structure, which makes direct articulation regression unstable; existing methods rely on multi-view supervision, retrieval-based assembly, or video generation, sacrificing scalability or efficiency. Method: Propose MonoArt, based on progressive structural reasoning: image features are progressively transformed into canonical geometry, structured part representations, and motion-aware embeddings within a single network architecture, without external motion templates or multi-stage pipelines. Result: On PartNet-Mobility, MonoArt achieves state-of-the-art reconstruction accuracy and inference speed, and it generalizes to robotic manipulation and articulated scene reconstruction. Conclusion: Progressive structured modeling effectively disentangles structure from motion, improving the stability, interpretability, and practicality of single-image articulated reconstruction and offering a new paradigm for general articulated 3D understanding. Abstract: Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

[187] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Yuqing Wang,Chuofan Ma,Zhijie Lin,Yao Teng,Lijun Yu,Shuai Wang,Jiaming Han,Jiashi Feng,Yi Jiang,Xihui Liu

Main category: cs.CV

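TL;DR: This paper proposes Cubic Discrete Diffusion (CubiD), the first visual generation model that supports high-dimensional discrete representations (768 to 1024 dimensions). Fine-grained per-dimension masking and prediction let it model correlations across space and dimensions in a fixed number of steps; it achieves state-of-the-art discrete generation on ImageNet-256 and shows that discrete tokens can serve both understanding and generation.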

Details Motivation: Existing discrete visual generation methods are limited to low-dimensional latent tokens (8 to 32 dimensions) with insufficient semantic expressiveness; high-dimensional pretrained representations (768 to 1024 dimensions) are semantically rich, but generating them discretely poses fundamental challenges, calling for a new paradigm that unifies understanding and generation. Method: Propose Cubic Discrete Diffusion (CubiD): fine-grained masking and prediction over the high-dimensional discrete representation at any dimension and any position; generation in a fixed T steps (T far smaller than h×w×d) while modeling strong correlations within and across spatial positions; scalable from 900M to 3.7B parameters. Result: State-of-the-art discrete generation on ImageNet-256; the discrete tokens are shown to preserve the original representation capabilities and to support both understanding and generation; good scalability and generalization. Conclusion: CubiD is the first to achieve efficient discrete generation of high-dimensional representations, bridging understanding and generation and opening a path toward unified multimodal architectures. Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
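
To make the $T \ll hwd$ point concrete, the sketch below partitions every cell of an h × w × d token cube into T groups, so that each generation step predicts one whole group in parallel. The uniform random round-robin split and the name `unmask_schedule` are illustrative assumptions; the abstract does not specify how CubiD orders or sizes its unmasking steps.

```python
import random
from typing import List, Tuple

Cell = Tuple[int, int, int]  # (row, column, feature dimension)


def unmask_schedule(h: int, w: int, d: int, steps: int, seed: int = 0) -> List[List[Cell]]:
    """Partition every cell of an h x w x d token cube into `steps` reveal groups."""
    if not 1 <= steps <= h * w * d:
        raise ValueError("steps must be between 1 and h*w*d")
    rng = random.Random(seed)
    cells = [(i, j, k) for i in range(h) for j in range(w) for k in range(d)]
    rng.shuffle(cells)  # any dimension at any position may be masked
    # Round-robin split: group t holds the cells predicted at generation step t.
    return [cells[t::steps] for t in range(steps)]


schedule = unmask_schedule(h=4, w=4, d=8, steps=4)
print(len(schedule), [len(group) for group in schedule])
```

Here 128 cells are covered in 4 steps of 32 cells each. The step count is chosen independently of d, which is the property the abstract emphasizes.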

[188] Matryoshka Gaussian Splatting

Zhilin Guo,Boqiao Zhang,Hakan Aktas,Kyle Fogarty,Jeffrey Hu,Nursena Koprucu Aslan,Wenzhao Li,Canberk Baykal,Albert Miao,Josef Bengtson,Chenliang Zhou,Weihao Xia,Cristina Nader Vasconcelos,Cengiz Oztireli

Main category: cs.CV

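TL;DR: This paper proposes Matryoshka Gaussian Splatting (MGS), a training framework for 3D Gaussian Splatting that supports continuous level of detail (LoD) without sacrificing full-capacity rendering quality, using stochastic budget training to obtain a smooth quality-speed trade-off.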

Details Motivation: Existing discrete LoD methods offer a limited set of operating points, while continuous LoD methods often show noticeable quality degradation at full capacity, making LoD a costly design decision. Method: MGS learns a single ordered set of Gaussians such that rendering any prefix (the first k Gaussians) yields a coherent result whose quality improves smoothly with budget; the core is stochastic budget training: each iteration samples a random budget and optimizes both the corresponding prefix and the full set, requiring only two forward passes and no architectural changes. Result: Experiments across four benchmarks and six baselines show that MGS matches the backbone's full-capacity performance while supporting a continuous speed-quality trade-off from a single model; ablations validate the ordering strategy, training objective, and model capacity design. Conclusion: MGS gives 3D Gaussian Splatting efficient, flexible, high-quality continuous LoD, resolving the trade-off between flexibility and quality in existing methods. Abstract: The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
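
The stochastic budget training idea can be sketched in a few lines. The version below is a toy under stated assumptions: splats are plain floats, `render` and `loss` are hypothetical callables standing in for a real 3DGS rasterizer and photometric loss, the budget is sampled uniformly, and the two losses are summed with equal weight. None of these choices is confirmed by the abstract.

```python
import random
from typing import Callable, List, Tuple


def stochastic_budget_step(
    splats: List[float],
    render: Callable[[List[float]], float],
    loss: Callable[[float], float],
    rng: random.Random,
) -> Tuple[int, float]:
    """One training iteration: sample a budget k, then score the prefix and the full set."""
    k = rng.randint(1, len(splats))         # random splat budget for this iteration
    prefix_loss = loss(render(splats[:k]))  # forward pass 1: the first k splats only
    full_loss = loss(render(splats))        # forward pass 2: the full ordered set
    return k, prefix_loss + full_loss


# Toy stand-ins: "rendering" sums splat contributions; the loss is squared error.
splats = [3.0, 2.0, 1.0]
target = 6.0
k, total = stochastic_budget_step(
    splats, render=sum, loss=lambda y: (y - target) ** 2, rng=random.Random(0)
)
print(k, total)
```

Because every prefix length is sampled over the course of training, the early splats are pushed to carry most of the reconstruction, which is what lets a single ordered model be truncated at any k at render time.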

[189] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu,Dingkang Liang,Tianrui Feng,Kui Xia,Yumeng Zhang,Xiaofan Li,Xiao Tan,Xiang Bai

Main category: cs.CV

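TL;DR: This paper proposes VEGA-3D, a framework that mines the 3D structure and physical-law priors implicit in pretrained video diffusion models to strengthen the spatial and geometric reasoning of multimodal large language models without explicit 3D supervision, reaching state-of-the-art results on several spatial understanding and embodied manipulation tasks.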

Details Motivation: Existing MLLMs suffer from spatial blindness and struggle with fine-grained geometric reasoning and physical dynamics; methods relying on explicit 3D modalities or complex geometric scaffolding are limited by data scarcity and poor generalization. Method: Propose VEGA-3D, which treats a pretrained video diffusion model as an implicit "Latent World Simulator", extracts spatiotemporal features from intermediate noise levels of its denoising process, and fuses them with semantic representations through a token-level adaptive gated fusion mechanism, injecting dense geometric cues into the MLLM. Result: Outperforms existing state-of-the-art methods on benchmarks for 3D scene understanding, spatial reasoning, and embodied manipulation, validating generative priors as a scalable foundation for physical-world understanding. Conclusion: Video generation models contain robust implicit spatial and physical priors that improve MLLM spatial perception and reasoning without additional 3D annotation, offering a new paradigm for embodied intelligence. Abstract: While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
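
A token-level gated fusion of two feature streams can be sketched as below. This is one common form of gating, assumed for illustration: each token gets a scalar gate from a linear score over both feature vectors, and the output is a convex combination of the two. The abstract does not give VEGA-3D's actual gate, and `gated_fusion`, `w_sem`, `w_geo`, and `bias` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(
    sem: List[Vector], geo: List[Vector], w_sem: Vector, w_geo: Vector, bias: float = 0.0
) -> List[Vector]:
    """Fuse per-token semantic and geometric features with a per-token scalar gate."""
    fused = []
    for s, g in zip(sem, geo):
        # The gate depends on both streams, so it adapts to each token's content.
        score = sum(ws * si for ws, si in zip(w_sem, s))
        score += sum(wg * gi for wg, gi in zip(w_geo, g)) + bias
        gate = sigmoid(score)
        # Convex combination: gate near 1 favours geometry, near 0 favours semantics.
        fused.append([gate * gi + (1.0 - gate) * si for si, gi in zip(s, g)])
    return fused


# With zero weights every gate is 0.5, so the result is the plain average.
print(gated_fusion([[0.0, 2.0]], [[2.0, 0.0]], w_sem=[0.0, 0.0], w_geo=[0.0, 0.0]))
```

In a real model the gate weights would be learned, letting the network decide per token how much geometric signal to mix into the semantic representation.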